CommonLounge Archive

Hands-on Assignment: Sentiment Classification with Naive Bayes

April 17, 2018

In this hands-on assignment, we’ll apply Naive Bayes to predict the sentiment of movie reviews. The tutorial will guide you through the process of implementing Naive Bayes in Python from scratch. Let’s get started!

Dataset

Below is the code for loading and splitting the dataset. The dataset is a subset of the data from Stanford’s Sentiment Treebank. It includes sentiment labels (positive or negative) for phrases in the parse trees of sentences from movie reviews. You can download the data at this link: sentiment_data.pkl.

The data can be loaded with the following code.

import pickle
import numpy as np

# load the train/test splits from the pickled dataset
with open('sentiment_data.pkl', 'rb') as f:
    train_positive, train_negative, test_positive, test_negative = pickle.load(f)
print('Data description ... ')
print(len(train_positive), len(train_negative), len(test_positive), len(test_negative))
print('='*120)
print(train_positive[:10])
print('='*120)
print(train_negative[:10])
print('='*120)

which produces the following output:

Data description ...
2881 2617 721 655
========================================================================================================================
[ 'With Dirty Deeds , David Caesar has stepped into the mainstream of filmmaking with an assurance worthy of international acclaim and with every cinematic tool well under his control -- driven by a natural sense for what works on screen .'
 "Still , the updated Dickensian sensibility of writer Craig Bartlett 's story is appealing ."
 'Forget about one Oscar nomination for Julianne Moore this year - she should get all five .'
 'and your reward will be a thoughtful , emotional movie experience .'
 'In the end there is one word that best describes this film : honest .'
 'Deserves a place of honor next to Nanook as a landmark in film history .'
 'This movie is to be cherished .'
 '... Wallace is smart to vary the pitch of his movie , balancing deafening battle scenes with quieter domestic scenes of women back home receiving War Department telegrams .'
 'This is a fascinating film because there is no clear-cut hero and no all-out villain .'
 'Features one of the most affecting depictions of a love affair ever committed to film .']
========================================================================================================================
["It 's a strange film , one that was hard for me to warm up to ."
 'Terrible .'
 "Build some robots , haul 'em to the theatre with you for the late show , and put on your own Mystery Science Theatre 3000 tribute to what is almost certainly going to go down as the worst -- and only -- killer website movie of this or any other year ."
 'Like an Afterschool Special with costumes by Gianni Versace , Mad Love looks better than it feels .'
 "The abiding impression , despite the mild hallucinogenic buzz , is of overwhelming waste -- the acres of haute couture ca n't quite conceal that there 's nothing resembling a spine here ."
 'A crass and insulting homage to great films like Some Like It Hot and the John Wayne classics .'
 'Instead of making his own style , director Marcus Adams just copies from various sources -- good sources , bad mixture'
 "The feature-length stretch ... strains the show 's concept ."
 'The end result is like cold porridge with only the odd enjoyably chewy lump .'
 "Maybe you 'll be lucky , and there 'll be a power outage during your screening so you can get your money back ."]
========================================================================================================================

The train dataset has about 5500 reviews (2881 positive + 2617 negative), and the test dataset has about 1400 (721 positive + 655 negative). There are slightly more positive reviews than negative reviews (about 52% of the reviews are positive).

The output above also shows samples of 10 positive reviews and 10 negative reviews from the training data.

Task 1: Implementing Naive Bayes

To implement Naive Bayes, you essentially need to implement the Algorithm steps section of the Naive Bayes tutorial.

  • Step 1: Split each review into tokens. For example, we could define any consecutive sequence of alphabetic characters as a token, and convert all tokens to lowercase.
  • Step 2: Calculate P(positive) and P(negative).
  • Step 3: Count the number of times each token appears in positive reviews and negative reviews in the train data. From the counts, infer P(positive|token) and P(negative|token).
  • Make sure to add some smoothing, i.e. P(positive|token) or P(negative|token) should never be 0 (see the sketch after this list for why).
  • Step 4: For each review in the test data, split it into tokens, and use the probabilities above to calculate a prediction for the review.
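
For step 3’s smoothing, one standard choice is add-alpha (Laplace) smoothing. Without it, a token that happens to appear only in positive training reviews gets P(negative|token) = 0, and a single such token zeroes out the entire product for a review, no matter what the other tokens say. Here’s a minimal sketch, where alpha is a hypothetical smoothing constant:

def smoothed_probs(pos_count, neg_count, alpha=1.0):
    # add-alpha smoothing; alpha = 1.0 is Laplace smoothing
    # both returned probabilities are always strictly between 0 and 1
    total = pos_count + neg_count + 2 * alpha
    return (pos_count + alpha) / total, (neg_count + alpha) / total

# e.g. a token seen 5 times in positive reviews and never in negative ones:
# smoothed_probs(5, 0) returns (0.857..., 0.142...) instead of (1.0, 0.0)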

Here’s a template for the code.

import re
from sklearn.metrics import classification_report, confusion_matrix
###############################################################################
## helper functions
def review_tokens(review):
    return [token.lower() for token in re.findall('[A-Za-z]+', review)]
###############################################################################
## naive bayes
nneg, npos = len(train_negative), len(train_positive)
## train
# P(positive) and P(negative)
pos_prob = ... 
neg_prob = ... 
# P(positive|token) and P(negative|token)
...
# predict
all_test = np.concatenate((test_positive, test_negative))
labels = [True]*len(test_positive) + [False]*len(test_negative)
predictions = list()
for review in all_test:
    pos, neg = pos_prob, neg_prob
    for token in review_tokens(review):
        ... # update pos, neg 
    predictions.append(pos > neg)
# evaluate
print(classification_report(labels, predictions))
print('='*120)
print(confusion_matrix(labels, predictions))
print('='*120)
###############################################################################
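
If you get stuck, here is one possible way to fill in the blanks. This is a sketch under my own assumptions (per-token counts stored in Counters, add-alpha smoothing with a hypothetical constant alpha), not necessarily the reference solution in the notebook linked below:

from collections import Counter

# count how often each token appears in positive and negative training reviews
pos_counts, neg_counts = Counter(), Counter()
for review in train_positive:
    pos_counts.update(review_tokens(review))
for review in train_negative:
    neg_counts.update(review_tokens(review))

# P(positive) and P(negative)
pos_prob = npos / float(npos + nneg)
neg_prob = nneg / float(npos + nneg)

# P(positive|token) and P(negative|token), smoothed so neither is ever 0
alpha = 1.0  # hypothetical smoothing constant
def token_probs(token):
    p, n = pos_counts[token], neg_counts[token]
    total = p + n + 2 * alpha
    return (p + alpha) / total, (n + alpha) / total

# the per-token update inside the prediction loop then becomes:
#     p, n = token_probs(token)
#     pos, neg = pos * p, neg * n

Since these reviews are single sentences, multiplying a few dozen probabilities is numerically fine; for longer documents you would sum logarithms instead, to avoid underflow.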

Here’s the output I got. Your numbers might not match exactly if you change the pre-processing (how reviews are split into tokens), but make sure you get about 80% accuracy or more.

========================================================================================================================
             precision    recall  f1-score   support
      False       0.87      0.66      0.75       655
       True       0.75      0.91      0.82       721
avg / total       0.81      0.79      0.79      1376
========================================================================================================================
[[430 225]
 [ 63 658]]
========================================================================================================================
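
(From the confusion matrix, overall accuracy is (430 + 658) / 1376 ≈ 79%, right around the 80% target.)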

Task 2: Inspect the model

Look at the list of most positive and most negative tokens. Here’s the code template:

###############################################################################
## most positive and most negative words
vocab = set([token for review in np.concatenate((train_positive, train_negative)) for token in review_tokens(review)])
positivity = dict()
for token in vocab:
    ...
    positivity[token] = ...
print('Most positive tokens', sorted(positivity.keys(), key=positivity.get, reverse=True)[:10])
print('Most negative tokens', sorted(positivity.keys(), key=positivity.get, reverse=False)[:10])
###############################################################################
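
One simple way to fill this in (my assumption; other scores would work too) is to use the smoothed P(positive|token) from Task 1 as the positivity score, skipping tokens too rare for the estimate to mean much. A sketch, where min_count is a hypothetical threshold:

min_count = 20  # hypothetical threshold: ignore very rare tokens
positivity = dict()
for token in vocab:
    if pos_counts[token] + neg_counts[token] < min_count:
        continue  # too rare; the probability estimate would be mostly noise
    positivity[token] = token_probs(token)[0]  # smoothed P(positive|token)

Even with a frequency cutoff, some oddballs (like ‘car’ and ‘queen’ in the output below) can sneak in: in a dataset this small, they simply happened to appear mostly in reviews of one polarity.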

Here’s the output I got:

Most positive tokens ['powerful', 'solid', 'wonderful', 'touching',
    'eyes', 'provides', 'inventive', 'portrait', 'refreshing', 'means']
Most negative tokens ['suffers', 'stupid', 'poorly', 'car',
    'generic', 'mess', 'queen', 'joke', 'ill', 'disguise']

Solution on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments in your browser. If you’re unfamiliar with Google Colaboratory, read this 1-minute tutorial first. Note that for this project, you’ll have to save your own copy of the notebook and then upload the dataset to Google Colab.

Next steps: Additional processing and hyper-parameter tuning

Here are some additional ideas to improve accuracy.

  1. Filter out stop words from the review tokens, i.e. common words like ‘the’, ‘an’, ‘if’, etc. Read more about text pre-processing here: Text Classification (Topic Categorization, Spam filtering, etc)
  2. Play around with how much smoothing to add (this effectively acts as regularization in Naive Bayes).

These should get your accuracy to about 85-90% or more.
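
Here is a rough sketch of both ideas; the stop-word list and the candidate alpha values below are hypothetical choices, not tuned ones:

# 1. drop stop words inside the tokenizer
#    (a tiny hand-picked list; nltk.corpus.stopwords has a fuller one)
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'if', 'of', 'to', 'in', 'is', 'it'}

def review_tokens(review):
    tokens = [t.lower() for t in re.findall('[A-Za-z]+', review)]
    return [t for t in tokens if t not in STOP_WORDS]

# 2. treat the smoothing constant as a hyper-parameter: retrain and
#    evaluate (the Task 1 code) for each candidate value, keeping the best
for alpha in (0.1, 0.5, 1.0, 2.0, 5.0):
    accuracy = train_and_evaluate(alpha)  # hypothetical wrapper around Task 1
    print(alpha, accuracy)

Note that this redefines the review_tokens helper, so everything downstream (counts, probabilities, predictions) needs to be recomputed.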


Hope you had fun.

Happy (machine) learning! :)


© 2016-2022. All rights reserved.