# Naive Bayes algorithm and text classification

May 27, 2017

Naive Bayes is a widely used classification algorithm. It is a supervised learning algorithm based on Bayes’ Theorem. The word naive comes from the assumption of independence among features. That is, if our input vector is (x1, x2,…,xn), then xi’s are conditionally independent given y.

# Conditional Probability

Before we move on through, let’s do a short review of Conditional Probability.

# Deriving the algorithm

Let’s start with Bayes’ theorem (for naive bayes, x is the input and y is the output):

$$P( y | x ) = \frac{P(y)P(x | y)}{P(x)}$$

When we have more than one feature, we can rewrite Bayes’ theorem as:

$$P( y | x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n | y)}{P(x_1,x_2,...,x_n)}$$

Since we are making the assumption that xi’s are conditionally independent given y, we can rewrite the above as

$$P(y|x_1,...,x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(x_1,x_2,...,x_n)}$$

but we also know that P(x1, x2, .., xn) is a constant given the input, i.e.

$$P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y) \quad \quad (1)$$

Notice that

• the left hand side is the term we are interested in, probability distribution of the output y given input x.
• P(y) can be estimated by counting the number of times each class y appears in our training data (this is called Maximum a Posteriori estimation).
• P(xi|y) can be estimated by counting the number of times each value of xi appears for each class y in our training data.

# Algorithm steps

Training:

1. Estimate P(y): P(y=t) = number of times class t appears in the dataset / size of dataset
2. Estimate P(xi|y): P(xi=k|y=t) = number of times xi has value k and y has value t / number of data points of class t

Predicting:

1. Estimate P(y|x1,…,xn): Use above estimated values of P(y) and P(xi|y) and equation (1). Thereafter, normalize the values.

# Variants

There are several variants of naive bayes which use different distributions for P(xi|y) such as gaussian distribution (gaussian naive bayes), multinomial distribution (multinomial naive bayes) and bernoulli distribution (bernoulli naive bayes).

# Scikit-learn implementation

# We will use the iris dataset:
# The iris flower data set consists of 50 samples from each of three
# species of Iris (Iris setosa, Iris virginica and Iris versicolor).
# Four features were measured from each sample: the length and the width
# of the sepals and petals, in centimeters.
from sklearn.naive_bayes import GaussianNB
import numpy as np
model = GaussianNB()
model.fit(data.data, data.target)
# evalaute
print(model.score(data.data, data.target))
# output = 0.96
# predict
model.predict([[4.2, 3, 0.9, 2.1]])
# 0 = setosa, 1 = versicolor, and 2 = virginica