Smart Movie Recommender using Linear SVC

Project Overview

This project builds an end-to-end machine learning pipeline that analyzes user sentiment in IMDB movie reviews. Unlike a standard binary classifier, it adds decision logic on top of the model's score to identify neutral feedback, so it can return one of three recommendations: "Recommend," "Do Not Recommend," or "Maybe Watch."

How I Built It (Technical Workflow)

  1. Advanced Text Preprocessing:
    Raw text is often noisy. I developed a cleaning pipeline using NLTK and Regex:

    • Noise Removal: Stripped HTML tags and non-alphabetic characters.
    • Normalization: Converted all text to lowercase for consistency.
    • Stopword Removal & Lemmatization: Removed common fillers (e.g., "the", "a") and reduced words to their dictionary root (e.g., "watched" becomes "watch") to reduce dimensionality.
  2. Feature Engineering (TF-IDF):
    To turn text into numeric features, I used TF-IDF (Term Frequency-Inverse Document Frequency), which weights a word by how often it appears in a review and down-weights words that appear in most reviews. I capped the vocabulary at 5,000 features, which keeps the most frequent terms and drops rare words that mostly add noise.

  3. Model Implementation:
    I chose scikit-learn's LinearSVC (linear Support Vector Classification). It trains efficiently on high-dimensional, sparse text features and learns a linear decision boundary between positive and negative sentiment.
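
The three steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the project's actual code: clean_review and the toy reviews are hypothetical, scikit-learn's built-in English stopword list stands in for NLTK's, and lemmatization is left out so the snippet runs without downloading NLTK corpora.

```python
import html
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean_review(text):
    """Strip HTML, keep letters only, lowercase, and drop stopwords."""
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]+", " ", text)   # replace non-alphabetic runs
    tokens = text.lower().split()
    # Assumption: sklearn's stopword list is used here instead of NLTK's.
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)


# Toy data (hypothetical); 1 = positive, 0 = negative.
reviews = [
    "A <b>wonderful</b> film, I loved it!",
    "Terrible plot and awful acting.<br />",
    "Loved the cast, wonderful story.",
    "Awful, boring, terrible.",
]
labels = [1, 0, 1, 0]

model = make_pipeline(
    # The custom preprocessor replaces the vectorizer's default lowercasing.
    TfidfVectorizer(preprocessor=clean_review, max_features=5000),
    LinearSVC(),
)
model.fit(reviews, labels)
print(model.predict(["A wonderful, stunning film"]))
```

Passing the cleaner as the vectorizer's preprocessor keeps cleaning and vectorization in one object, so the same steps are applied at training and inference time.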

Challenges & Solutions

1. The Neutral Data Gap
  • The Challenge: The IMDB dataset is strictly binary (Positive/Negative). This meant the model had never "seen" a neutral review, causing it to force-classify "average" movies as either good or bad.
  • The Solution: Instead of model.predict(), I used model.decision_function(), which returns a signed score proportional to a review's distance from the decision boundary. Reviews whose score falls within a small threshold of zero lack strong positive or negative conviction, so I label them Neutral (0).
2. Balancing Overfitting
  • The Challenge: Initially, the model showed high training accuracy but lower performance on new data.
  • The Solution: I relied on L2 regularization, the default penalty in LinearSVC, and monitored the gap between training and testing scores to check that the model generalized. Limiting the TF-IDF features also helped keep the model from "memorizing" noise specific to individual reviews.
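
The train/test gap check from the second challenge could look roughly like the sketch below. The accuracy_gap helper, the toy reviews, and the C values are hypothetical; in LinearSVC, C is the inverse regularization strength, so a smaller C means a stronger L2 penalty.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def accuracy_gap(reviews, labels, C=1.0, max_features=5000, seed=0):
    """Return (train_accuracy, test_accuracy, gap) for one LinearSVC setting."""
    X_train, X_test, y_train, y_test = train_test_split(
        reviews, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = make_pipeline(
        TfidfVectorizer(max_features=max_features),
        LinearSVC(C=C, penalty="l2"),  # smaller C = stronger regularization
    )
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


# Toy data (hypothetical); a real check would use the IMDB reviews.
toy_reviews = [
    "loved this wonderful movie", "great acting and a great story",
    "a wonderful and moving film", "great fun, loved every minute",
    "terrible and boring movie", "awful acting and a dull story",
    "a boring and awful film", "dull, terrible, hated every minute",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

for C in (0.01, 0.1, 1.0):
    train_acc, test_acc, gap = accuracy_gap(toy_reviews, toy_labels, C=C)
    print(f"C={C}: train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

A large positive gap indicates overfitting; lowering C or max_features are two ways to narrow it.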

How To Use This Project

  1. Setup: Ensure scikit-learn, pandas, and nltk are installed in your Python environment.
  2. Input: Pass a raw string (movie review) into the get_recommendation function.
  3. Inference Example:

    review = "The visuals were stunning, but the pacing was a bit slow and mediocre."
    print(get_recommendation(review)) 
        
    # Result -> Score: 0.04 | Sentiment: 0 | Action: Maybe watch (Neutral)
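
The post does not show get_recommendation itself, so the following is only a sketch of how such a function might apply the decision_function thresholding described earlier. The threshold of 0.3, the use of -1 for negative sentiment, the action labels, and the helper names classify_score and make_get_recommendation are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def classify_score(score, threshold=0.3):
    """Map a signed decision score to (sentiment, action)."""
    if score > threshold:
        return 1, "Recommend (Positive)"
    if score < -threshold:
        return -1, "Do not recommend (Negative)"
    return 0, "Maybe watch (Neutral)"  # too close to the boundary to call


def make_get_recommendation(model, threshold=0.3):
    """Build a get_recommendation(review) function around a fitted model."""
    def get_recommendation(review):
        # For a binary model, decision_function returns one signed score per
        # sample; positive values favor model.classes_[1] (here, label 1).
        score = float(model.decision_function([review])[0])
        sentiment, action = classify_score(score, threshold)
        return f"Score: {score:.2f} | Sentiment: {sentiment} | Action: {action}"
    return get_recommendation


# Toy model (hypothetical data); 1 = positive, 0 = negative.
toy_model = make_pipeline(TfidfVectorizer(), LinearSVC())
toy_model.fit(
    ["stunning visuals, loved it", "a wonderful story",
     "slow and boring", "mediocre and dull pacing"],
    [1, 1, 0, 0],
)
get_recommendation = make_get_recommendation(toy_model)
print(get_recommendation("The visuals were stunning, but the pacing was slow."))
```

The threshold sets the width of the neutral band; since the dataset has no neutral labels, it has to be chosen by inspecting scores on sample reviews rather than learned.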
    

Final Performance

  • Test Accuracy: 88%
  • Generalization: The model maintains a balanced F1-score across both classes, which suggests it is not biased toward either sentiment on held-out test data.
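
One way to check that the F1-score is balanced across classes is to compute it per class rather than as a single average. The sketch below uses made-up labels purely for illustration; the numbers are not the project's results, and per_class_report is a hypothetical helper.

```python
from sklearn.metrics import accuracy_score, f1_score


def per_class_report(y_true, y_pred):
    """Return accuracy plus a separate F1-score for each class (0 and 1)."""
    # average=None yields one F1 value per label, in the order given.
    f1_neg, f1_pos = f1_score(y_true, y_pred, average=None, labels=[0, 1])
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "f1_negative": float(f1_neg),
        "f1_positive": float(f1_pos),
    }


# Illustrative labels only: one positive and one negative review misclassified.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

report = per_class_report(y_true, y_pred)
print(report)
```

If one class's F1 is much lower than the other's, the model is favoring one sentiment even when overall accuracy looks fine.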
