CommonLounge Archive

Naive Bayes algorithm and text classification

May 27, 2017

Naive Bayes is a widely used classification algorithm. It is a supervised learning algorithm based on Bayes’ Theorem. The word naive comes from the assumption of independence among features. That is, if our input vector is (x1, x2,…,xn), then xi’s are conditionally independent given y.

Conditional Probability

Before we move on through, let’s do a short review of Conditional Probability.

Deriving the algorithm

Let’s start with Bayes’ theorem (for naive bayes, x is the input and y is the output):

$$ P( y | x ) = \frac{P(y)P(x | y)}{P(x)} $$

When we have more than one feature, we can rewrite Bayes’ theorem as:

$$ P( y | x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n | y)}{P(x_1,x_2,...,x_n)} $$

Since we are making the assumption that xi’s are conditionally independent given y, we can rewrite the above as

$$ P(y|x_1,...,x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(x_1,x_2,...,x_n)} $$

but we also know that P(x1, x2, .., xn) is a constant given the input, i.e.

$$ P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y) \quad \quad (1) $$

Notice that

  • the left hand side is the term we are interested in, probability distribution of the output y given input x.
  • P(y) can be estimated by counting the number of times each class y appears in our training data (this is called Maximum a Posteriori estimation).
  • P(xi|y) can be estimated by counting the number of times each value of xi appears for each class y in our training data.

Algorithm steps


  1. Estimate P(y): P(y=t) = number of times class t appears in the dataset / size of dataset
  2. Estimate P(xi|y): P(xi=k|y=t) = number of times xi has value k and y has value t / number of data points of class t


  1. Estimate P(y|x1,…,xn): Use above estimated values of P(y) and P(xi|y) and equation (1). Thereafter, normalize the values.


There are several variants of naive bayes which use different distributions for P(xi|y) such as gaussian distribution (gaussian naive bayes), multinomial distribution (multinomial naive bayes) and bernoulli distribution (bernoulli naive bayes).

Scikit-learn implementation

# We will use the iris dataset:
# The iris flower data set consists of 50 samples from each of three 
# species of Iris (Iris setosa, Iris virginica and Iris versicolor). 
# Four features were measured from each sample: the length and the width 
# of the sepals and petals, in centimeters.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
import numpy as np
# load the dataset
data = load_iris()
model = GaussianNB(),
# evalaute
# output = 0.96
# predict
model.predict([[4.2, 3, 0.9, 2.1]])
# 0 = setosa, 1 = versicolor, and 2 = virginica

How to run this code on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.


Naive bayes is one of the simplest yet effective algorithms for

  • Text classification: For example, we have a number of news articles, and we want to learn to classify if the article is about politics, health, technology, sports or lifestyle.
  • Spam filtering: We have a number of emails, and we want to learn to classify if the email is spam or not.
  • Gender classification: Given features such as height, weight, etc, predict whether the person is male or female.

© 2016-2022. All rights reserved.