**Naive Bayes** is a widely used classification algorithm. It is a supervised learning algorithm based on Bayes' Theorem. The word **naive** comes from the assumption of independence among features: if the input vector is (x_{1}, x_{2}, ..., x_{n}), then the x_{i} are assumed to be conditionally independent given *y*.

# Deriving the algorithm

Let's start with Bayes' theorem (for Naive Bayes, x is the input and y is the output):

P(y | x) = \frac{P(y) P(x | y)}{P(x)}

When we have more than one feature, we can rewrite Bayes' theorem as:

P(y | x_1, ..., x_n) = \frac{P(y) P(x_1, ..., x_n | y)}{P(x_1, ..., x_n)}

Since we assume that the x_{i} are conditionally independent given *y*, we can rewrite the above as

P(y | x_1, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, ..., x_n)}

But P(x_{1}, x_{2}, ..., x_{n}) does not depend on y, so for a given input it is a constant, i.e.

P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y) \quad \quad (1)

Notice that

- the left-hand side is the term we are interested in: the probability distribution of the output y given the input x.
- P(y) can be estimated by counting the number of times each class y appears in our training data (this is called __Maximum a Posteriori__ estimation).
- P(x_{i}|y) can be estimated by counting the number of times each value of x_{i} appears for each class y in our training data.
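
As a concrete (made-up) example of these counting estimates: suppose the training data contains 10 emails, of which 4 are spam, and the binary feature x_{1} indicates whether the word "free" appears. If 3 of the 4 spam emails contain "free", then

P(y = \text{spam}) = 4/10 = 0.4 \qquad P(x_1 = 1 | y = \text{spam}) = 3/4 = 0.75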

# Pseudocode

Training:

- **Estimate P(y)**: P(y=t) = (number of times class t appears in the dataset) / (size of the dataset)
- **Estimate P(x_{i}|y)**: P(x_{i}=k|y=t) = (number of times x_{i} has value k and y has value t) / (number of data points of class t)

Predicting:

- **Estimate P(y|x_{1},...,x_{n})**: plug the estimated values of P(y) and P(x_{i}|y) into equation (1), then normalize the resulting values so they sum to 1.
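
Below is a minimal NumPy sketch of this pseudocode for categorical features. The function names (`fit_nb`, `predict_nb`) and the toy data are illustrative only, not part of any library:

```python
import numpy as np

def fit_nb(X, y):
    """Training: estimate P(y) and P(x_i | y) by counting."""
    classes = np.unique(y)
    prior = {t: np.mean(y == t) for t in classes}            # P(y = t)
    likelihood = {}                                          # P(x_i = k | y = t)
    for t in classes:
        Xt = X[y == t]
        for i in range(X.shape[1]):
            for k in np.unique(X[:, i]):
                likelihood[(i, k, t)] = np.mean(Xt[:, i] == k)
    return classes, prior, likelihood

def predict_nb(x, classes, prior, likelihood):
    """Predicting: apply equation (1), then normalize the scores."""
    scores = np.array([
        prior[t] * np.prod([likelihood[(i, xi, t)] for i, xi in enumerate(x)])
        for t in classes
    ])
    return scores / scores.sum()  # normalized P(y | x_1, ..., x_n)

# toy categorical data: 4 points, 2 features, 2 classes
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(predict_nb([0, 1], *fit_nb(X, y)))  # -> [1. 0.]
```

Note that a single unseen feature value zeroes out the whole product in equation (1); in practice this is handled by Laplace (add-one) smoothing of the counts.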

# Variants

There are several variants of Naive Bayes that use different distributions for P(x_{i}|y), such as the Gaussian distribution (Gaussian Naive Bayes), the multinomial distribution (Multinomial Naive Bayes) and the Bernoulli distribution (Bernoulli Naive Bayes).
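
As a rough sketch of which variant fits which kind of feature (the toy arrays below are made up purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous features, P(x_i | y) modeled as a normal distribution
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 0.4], [2.8, 0.6]])
print(GaussianNB().fit(X_cont, y).predict([[3.0, 0.5]]))

# Multinomial NB: count features, e.g. word counts in a document
X_counts = np.array([[3, 0], [4, 1], [0, 5], [1, 6]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 4]]))

# Bernoulli NB: binary features, e.g. word present / absent
X_bin = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(BernoulliNB().fit(X_bin, y).predict([[0, 1]]))
```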

# Scikit-learn implementation

```python
# We will use the iris dataset: the iris flower data set consists of
# 50 samples from each of three species of Iris (Iris setosa,
# Iris virginica and Iris versicolor). Four features were measured from
# each sample: the length and the width of the sepals and petals,
# in centimeters.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# load the dataset
data = load_iris()

# train the model
model = GaussianNB()
model.fit(data.data, data.target)

# evaluate on the training data
print(model.score(data.data, data.target))
# output: 0.96

# predict the class of a new sample
print(model.predict([[4.2, 3, 0.9, 2.1]]))
# 0 = setosa, 1 = versicolor, and 2 = virginica
```
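
Beyond the hard class label, scikit-learn also exposes the normalized posterior probabilities from equation (1) via `predict_proba`:

```python
# normalized class probabilities P(y | x) for the same sample
print(model.predict_proba([[4.2, 3, 0.9, 2.1]]))
```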

# Applications

Naive Bayes is one of the simplest yet most effective algorithms for:

- **Text classification:** for example, we have a number of news articles and we want to learn to classify whether an article is about politics, health, technology, sports or lifestyle.
- **Spam filtering:** we have a number of emails and we want to learn to classify whether an email is spam or not.
- **Gender classification:** given features such as height, weight, etc., predict whether a person is male or female.
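
As an illustration of the text-classification use case, here is a minimal sketch using scikit-learn's `CountVectorizer` with Multinomial Naive Bayes; the two-document "corpus" and its labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the election results are in", "the team won the final match"]
labels = ["politics", "sports"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # word-count features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["who won the election"])))
```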