In this hands-on assignment, we'll apply Naive Bayes to predict the sentiment of movie reviews. The tutorial will guide you through implementing Naive Bayes in Python from scratch. Let's get started!
Dataset
The dataset is a subset of the Stanford Sentiment Treebank. It includes sentiment labels (positive or negative) for phrases in the parse trees of sentences from movie reviews, and the file already contains separate training and test splits for each label. You can download the data at this link: sentiment_data.pkl.
The data can be loaded with the following code.
import pickle
import numpy as np

# Load the four pre-split collections of reviews from the pickle file.
with open('sentiment_data.pkl', 'rb') as f:
    train_positive, train_negative, test_positive, test_negative = pickle.load(f)

# Print the size of each split, then the first 10 reviews of each training class.
print('Data description ... ')
print(len(train_positive), len(train_negative), len(test_positive), len(test_negative))
print('=' * 120)
print(train_positive[:10])
print('=' * 120)
print(train_negative[:10])
print('=' * 120)
which produces the following output:
Data description ...
(2881, 2617, 721, 655)
========================================================================================================================
['With Dirty Deeds , David Caesar has stepped into the mainstream of filmmaking with an assurance worthy of international acclaim and with every cinematic tool well under his control -- driven by a natural sense for what works on screen .'
 "Still , the updated Dickensian sensibility of writer Craig Bartlett 's story is appealing ."
 'Forget about one Oscar nomination for Julianne Moore this year - she should get all five .'
 'and your reward will be a thoughtful , emotional movie experience .'
 'In the end there is one word that best describes this film : honest .'
 'Deserves a place of honor next to Nanook as a landmark in film history .'
 'This movie is to be cherished .'
 '... Wallace is smart to vary the pitch of his movie , balancing deafening battle scenes with quieter domestic scenes of women back home receiving War Department telegrams .'
 'This is a fascinating film because there is no clear-cut hero and no all-out villain .'
 'Features one of the most affecting depictions of a love affair ever committed to film .']
========================================================================================================================
["It 's a strange film , one that was hard for me to warm up to ."
 'Terrible .'
 "Build some robots , haul 'em to the theatre with you for the late show , and put on your own Mystery Science Theatre 3000 tribute to what is almost certainly going to go down as the worst -- and only -- killer website movie of this or any other year ."
 'Like an Afterschool Special with costumes by Gianni Versace , Mad Love looks better than it feels .'
 "The abiding impression , despite the mild hallucinogenic buzz , is of overwhelming waste -- the acres of haute couture ca n't quite conceal that there 's nothing resembling a spine here ."
 'A crass and insulting homage to great films like Some Like It Hot and the John Wayne classics .'
 'Instead of making his own style , director Marcus Adams just copies from various sources -- good sources , bad mixture'
 "The feature-length stretch ... strains the show 's concept ."
 'The end result is like cold porridge with only the odd enjoyably chewy lump .'
 "Maybe you 'll be lucky , and there 'll be a power outage during your screening so you can get your money back ."]
========================================================================================================================
The training set has about 5,500 reviews (2881 + 2617 = 5498), and the test set has about 1,400 (721 + 655 = 1376). There are slightly more positive reviews than negative ones: about 52% of the reviews in each set are positive.
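The class balance matters for Naive Bayes because it determines the class priors. As a quick check of these figures, the sketch below recomputes the set sizes and the positive share from the four counts printed by the loading code. The helper name positive_fraction is illustrative and not part of the original tutorial.

```python
# Counts copied from the printed output of the loading code.
n_train_pos, n_train_neg = 2881, 2617
n_test_pos, n_test_neg = 721, 655


def positive_fraction(n_pos, n_neg):
    """Return the share of positive examples (hypothetical helper)."""
    # float() keeps the division exact under both Python 2 and Python 3.
    return float(n_pos) / (n_pos + n_neg)


n_train = n_train_pos + n_train_neg
n_test = n_test_pos + n_test_neg
train_prior = positive_fraction(n_train_pos, n_train_neg)
test_prior = positive_fraction(n_test_pos, n_test_neg)

print('train size: {}, test size: {}'.format(n_train, n_test))
print('positive share: train {:.3f}, test {:.3f}'.format(train_prior, test_prior))
```

The two shares come out nearly identical, so a prior estimated on the training set should carry over to the test set.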
The output of the loading code also shows the first 10 positive and the first 10 negative reviews from the training data.
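The sample reviews are already tokenized: punctuation and contractions such as "ca n't" are separated by spaces. A minimal sketch, assuming that plain whitespace splitting and lowercasing are acceptable preprocessing (an assumption, not necessarily the preprocessing this tutorial uses), shows how per-class word counts could be gathered. The two short lists reuse reviews from the printed samples, and count_words is a hypothetical helper.

```python
from collections import Counter

# A few reviews copied from the printed samples, standing in for the full splits.
positive_samples = [
    'This movie is to be cherished .',
    'In the end there is one word that best describes this film : honest .',
]
negative_samples = [
    'Terrible .',
    "It 's a strange film , one that was hard for me to warm up to .",
]


def count_words(reviews):
    """Count lowercased, whitespace-separated tokens (hypothetical helper)."""
    counts = Counter()
    for review in reviews:
        counts.update(review.lower().split())
    return counts


positive_counts = count_words(positive_samples)
negative_counts = count_words(negative_samples)

print(positive_counts.most_common(3))
print(negative_counts['film'])
```

Counts like these are the raw material for the per-class word likelihoods that a Naive Bayes classifier estimates.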