# Naive Bayes algorithm and text classification

May 27, 2017

**Naive Bayes** is a widely used classification algorithm. It is a supervised learning algorithm based on Bayes’ Theorem. The word **naive** comes from the assumption of independence among features. That is, if our input vector is (x_{1}, x_{2},…,x_{n}), then x_{i}’s are conditionally independent given *y*.

# Conditional Probability

Before we move on, let’s do a short review of conditional probability.
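Recall the definition: for two events A and B with P(B) > 0, the conditional probability of A given B is

$$ P(A | B) = \frac{P(A \cap B)}{P(B)} $$

Applying the same definition to P(B|A) and rearranging gives Bayes’ theorem, P(A|B) = P(B|A)P(A)/P(B), which is exactly the form we start from in the next section.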

# Deriving the algorithm

Let’s start with Bayes’ theorem (for naive Bayes, x is the input and y is the output):

$$ P( y | x ) = \frac{P(y)P(x | y)}{P(x)} $$

When we have more than one feature, we can rewrite Bayes’ theorem as:

$$ P( y | x_1,...,x_n) = \frac{P(y)P(x_1,...,x_n | y)}{P(x_1,x_2,...,x_n)} $$

Since we are making the assumption that the x_{i}’s are conditionally independent given *y*, we can rewrite the above as

$$ P( y | x_1,...,x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i | y)}{P(x_1,x_2,...,x_n)} $$

but we also know that *P*(*x_{1}, x_{2}, ..., x_{n}*) is a constant given the input, i.e.

$$ P( y | x_1,...,x_n) \propto P(y)\prod_{i=1}^{n}P(x_i | y) \qquad (1) $$

Notice that

- The left-hand side is the term we are interested in: the probability distribution of the output y given input x.
- P(y) can be estimated by counting how often each class y appears in our training data. (Predicting the class that maximizes this posterior is called **Maximum a Posteriori** estimation.)
- P(x_{i}|y) can be estimated by counting the number of times each value of x_{i} appears for each class y in our training data.

# Algorithm steps

Training:

- **Estimate P(y):** P(y=t) = (number of times class t appears in the dataset) / (size of the dataset)
- **Estimate P(x_{i}|y):** P(x_{i}=k | y=t) = (number of times x_{i} has value k and y has value t) / (number of data points of class t)

Predicting:

**Estimate P(y|x_{1},…,x_{n}):** Use the estimated values of P(y) and P(x_{i}|y) in equation (1). Thereafter, normalize the values so they sum to 1.
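The training and prediction steps above can be sketched for categorical features as follows. The toy weather-style data and all names are made up for illustration, and no smoothing is applied, so an unseen feature value zeroes out a class:

```python
from collections import Counter, defaultdict

# Toy dataset (illustrative): each row is (x1, x2), with label y
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "yes", "yes", "no"]

# Training: estimate P(y) and P(x_i|y) by counting
class_counts = Counter(y)
priors = {t: c / len(y) for t, c in class_counts.items()}

# feature_counts[(i, v, t)] = number of rows where x_i == v and y == t
feature_counts = defaultdict(int)
for xs, t in zip(X, y):
    for i, v in enumerate(xs):
        feature_counts[(i, v, t)] += 1

def likelihood(i, v, t):
    # P(x_i = v | y = t), estimated by counting (no smoothing)
    return feature_counts[(i, v, t)] / class_counts[t]

def predict_proba(xs):
    # P(y|x) is proportional to P(y) * prod_i P(x_i|y); normalize at the end
    scores = {}
    for t in class_counts:
        score = priors[t]
        for i, v in enumerate(xs):
            score *= likelihood(i, v, t)
        scores[t] = score
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

probs = predict_proba(("sunny", "hot"))
print(probs)
```

In practice one would add Laplace smoothing to the counts and work in log space to avoid zero probabilities and underflow.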

# Variants

There are several variants of naive Bayes that use different distributions for P(x_{i}|y), such as the Gaussian distribution (Gaussian naive Bayes), the multinomial distribution (multinomial naive Bayes) and the Bernoulli distribution (Bernoulli naive Bayes).
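For instance, multinomial naive Bayes is the usual choice for word-count features in text classification. A minimal sketch with scikit-learn follows; the tiny corpus and its labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus for illustration
docs = ["the match was a great win for the team",
        "the election results and new policy debate",
        "the team lost the final match",
        "parliament passed the new policy"]
labels = ["sports", "politics", "sports", "politics"]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

model = MultinomialNB()  # applies Laplace smoothing by default (alpha=1)
model.fit(X, labels)

pred = model.predict(vectorizer.transform(["a great win in the final match"]))
print(pred)
```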

# Scikit-learn implementation

```python
# We will use the iris dataset:
# The iris flower data set consists of 50 samples from each of three
# species of Iris (Iris setosa, Iris virginica and Iris versicolor).
# Four features were measured from each sample: the length and the width
# of the sepals and petals, in centimeters.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# load the dataset
data = load_iris()
model = GaussianNB()
model.fit(data.data, data.target)
# evaluate (accuracy on the training data)
print(model.score(data.data, data.target))
# output = 0.96
# predict
print(model.predict([[4.2, 3, 0.9, 2.1]]))
# 0 = setosa, 1 = versicolor, and 2 = virginica
```

# How to run this code on Google Colaboratory

You can also run this code directly in your browser via Google Colaboratory. Google Colab is a free tool that lets you run small machine learning experiments through your browser.

# Applications

Naive Bayes is one of the simplest yet most effective algorithms for:

- **Text classification:** for example, given a number of news articles, learn to classify whether an article is about politics, health, technology, sports or lifestyle.
- **Spam filtering:** given a number of emails, learn to classify whether an email is spam or not.
- **Gender classification:** given features such as height, weight, etc., predict whether a person is male or female.
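As a concrete sketch of the spam-filtering use case, Bernoulli naive Bayes models the presence or absence of each word in an email. The mini dataset below is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up mini spam dataset for illustration
emails = ["win free money now", "claim your free prize",
          "meeting agenda for monday", "project status update"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# binary=True: Bernoulli NB models word presence/absence, not counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)

model = BernoulliNB()
model.fit(X, labels)

pred = model.predict(vectorizer.transform(["free money prize"]))
print(pred)
```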