Project Overview
This project focuses on building an end-to-end Machine Learning pipeline to analyze user sentiment from IMDB movie reviews. Unlike standard binary classifiers, this system incorporates Decision Logic to identify neutral feedback, providing a more nuanced recommendation: "Recommend," "Do Not Recommend," or "Maybe Watch."
How I Built It (Technical Workflow)
Advanced Text Preprocessing:
Raw text is often noisy. I developed a cleaning pipeline using NLTK and Regex:
- Noise Removal: Stripped HTML tags and non-alphabetic characters.
- Normalization: Converted all text to lowercase for consistency.
- Stopword Removal & Lemmatization: Removed common fillers (e.g., "the", "a") and reduced words to their dictionary root (e.g., "watched" becomes "watch") to reduce dimensionality.
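The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_review`, the small fallback stopword list, and the choice to lemmatize every token as a verb (so that "watched" becomes "watch") are all assumptions made for the example.

```python
import re

# Tiny fallback stopword list (an assumption; used only if NLTK data is unavailable).
FALLBACK_STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "of", "to"}

try:
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    STOP_WORDS = set(stopwords.words("english"))  # needs nltk.download("stopwords")
    _wnl = WordNetLemmatizer()
    _wnl.lemmatize("watched", pos="v")  # needs nltk.download("wordnet"); fails early if missing

    def lemmatize(word):
        # Lemmatize as a verb; the default noun mode would leave "watched" unchanged.
        return _wnl.lemmatize(word, pos="v")

except (ImportError, LookupError):
    STOP_WORDS = FALLBACK_STOPWORDS

    def lemmatize(word):
        # No lemmatizer available: return the token unchanged.
        return word


def clean_review(text):
    """Strip HTML and non-letters, lowercase, drop stopwords, lemmatize."""
    text = re.sub(r"<[^>]+>", " ", text)    # noise removal: HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)  # noise removal: non-alphabetic characters
    tokens = text.lower().split()           # normalization
    return " ".join(lemmatize(t) for t in tokens if t not in STOP_WORDS)


print(clean_review("<br />The movie was GREAT!!!"))
```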
Feature Engineering (TF-IDF):
To turn text into numeric features a model can work with, I used TF-IDF (Term Frequency-Inverse Document Frequency). I capped the vocabulary at the top 5,000 features so the model focuses on frequently used, informative words and ignores rare noise.
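A small sketch of this step, assuming scikit-learn's `TfidfVectorizer` is the implementation used (the write-up lists scikit-learn as a dependency but does not name the class). The toy corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned IMDB reviews (illustrative only).
corpus = [
    "great movie love act",
    "terrible plot bore act",
    "love story great visuals",
]

# max_features=5000 keeps at most the 5,000 terms with the highest
# frequency across the corpus; rarer terms are dropped.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per review

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

On a corpus this small the cap has no effect; it only matters once the vocabulary exceeds 5,000 distinct terms.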
Model Implementation:
I chose Linear SVC (Support Vector Classification). It is highly efficient for high-dimensional text data and provides a clear decision boundary between sentiments.
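As a rough sketch, the vectorizer and classifier can be chained in a single scikit-learn pipeline. The training texts and labels below are made up, and wrapping both steps with `make_pipeline` is an assumption about how the project is organized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 1 = positive, 0 = negative.
train_texts = [
    "great movie wonderful act",
    "terrible plot awful pace",
    "love story brilliant cast",
    "bore script bad end",
]
train_labels = [1, 0, 1, 0]

# TF-IDF features feed a linear SVM; fit() trains both steps together.
model = make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC())
model.fit(train_texts, train_labels)

# Signed distance to the decision boundary: positive values lean toward class 1.
scores = model.decision_function(["wonderful cast"])
print(scores)
```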
Challenges & Solutions
1. The Neutral Data Gap
- The Challenge: The IMDB dataset is strictly binary (Positive/Negative). This meant the model had never "seen" a neutral review, causing it to force-classify "average" movies as either good or bad.
- The Solution: Instead of using model.predict(), I utilized model.decision_function(). By calculating how close a review lies to the decision boundary (Thresholding), I was able to manually trigger a Neutral (0) classification for reviews that lacked strong positive or negative conviction.
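The thresholding idea can be sketched as below. The band width of 0.3, the helper name `score_to_sentiment`, the use of 1 and -1 for the positive and negative labels, and passing the model in as an argument are all assumptions; the write-up only states that near-boundary reviews map to Neutral (0).

```python
NEUTRAL_BAND = 0.3  # hypothetical threshold; tune on held-out data


def score_to_sentiment(score, band=NEUTRAL_BAND):
    """Map a decision_function score to (sentiment, action)."""
    if abs(score) < band:
        return 0, "Maybe Watch (Neutral)"  # too close to the boundary to call
    if score > 0:
        return 1, "Recommend"
    return -1, "Do Not Recommend"


def get_recommendation(review, model, band=NEUTRAL_BAND):
    """Score one raw review with any fitted model exposing decision_function."""
    score = float(model.decision_function([review])[0])
    sentiment, action = score_to_sentiment(score, band)
    return f"Score: {score:.2f} | Sentiment: {sentiment} | Action: {action}"


print(score_to_sentiment(0.04))
print(score_to_sentiment(1.2))
print(score_to_sentiment(-0.9))
```

A wider band sends more reviews to "Maybe Watch"; a band of 0 reproduces the plain binary prediction.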
2. Balancing Overfitting
- The Challenge: Initially, the model showed high training accuracy but lower performance on new data.
- The Solution: I leveraged L2 Regularization (inherent in Linear SVC). By monitoring the gap between training and testing scores, I ensured the model generalized well. Limiting the TF-IDF features also helped prevent the model from "memorizing" specific review noise.
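One way to monitor the train/test gap is sketched here. The helper name `train_test_gap`, the toy data, and the candidate values of `C` are assumptions for illustration. In scikit-learn's `LinearSVC` the penalty defaults to L2, and a smaller `C` means stronger regularization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_test_gap(texts, labels, C=1.0, max_features=5000):
    """Fit on a train split and return (train_acc, test_acc, gap)."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=0
    )
    model = make_pipeline(
        TfidfVectorizer(max_features=max_features),
        LinearSVC(C=C),  # penalty="l2" is the default
    )
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


# Invented toy data; real runs would use the cleaned IMDB reviews.
texts = ["great film", "awful film", "loved it", "hated it"] * 5
labels = [1, 0, 1, 0] * 5

for C in (0.01, 0.1, 1.0):
    print(C, train_test_gap(texts, labels, C=C))
```

A large positive gap suggests overfitting; lowering `C` or `max_features` are the two levers described above.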
How To Use This Project
- Setup: Ensure scikit-learn, pandas, and nltk are installed in your Python environment.
- Input: Pass a raw string (movie review) into the get_recommendation function.
Inference Example:
review = "The visuals were stunning, but the pacing was a bit slow and mediocre."
print(get_recommendation(review))
# Result -> Score: 0.04 | Sentiment: 0 | Action: Maybe watch (Neutral)
Results
- Test Accuracy: 88%
- Generalization: The model maintains a balanced F1-score across both classes, which suggests it is not biased toward either class and should generalize reasonably to unseen reviews.
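For reference, accuracy and per-class F1 can be computed with scikit-learn's metrics as sketched below. The label arrays are invented and are not the project's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)

# average=None returns one F1 value per class, ordered by sorted label: [0, 1].
per_class_f1 = f1_score(y_true, y_pred, average=None)

print(f"accuracy={acc:.2f}")
print(f"F1 negative={per_class_f1[0]:.2f}, F1 positive={per_class_f1[1]:.2f}")
```

Per-class scores that are close to each other indicate the classifier is not favoring one sentiment over the other.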