Bayesian Machine Learning

November 29, 2017

Introduction

The idea behind Bayesian approach is to incorporate into machine learning algorithms some prior beliefs about the model θ by applying the Bayes’ rule. It is highly useful when data is scarce or difficult to obtain, which is often the case in practice. In Bayesian analysis, data D is not assumed to be right, but is allowed to become “less wrong with size”. The process consists on recursively updating our initial belief or knowledge (prior) as more evidence is obtained (data). Goals can be to either find the most probable model θ* (Bayesian inference) or to directly compute optimal predictions y* (Bayesian prediction).

Bayes’ Rule

Lets say we have an empirical dataset D = {(x₁, y₁), …, (x_n, y_n)} and a model θ. Then, by Bayes’ theorem, we have

$$ P(\theta | D) = \dfrac{P(D|\theta) \times P(\theta)}{P(D)} $$

P(θ|D): posterior. What we look for.
P(D): normalizing constant independent of θ. So, it disappears when applying the Bayes’ rule: P(θ|D) ⧜ P(D|θ) x P(θ).
P(D|θ): likelihood. Model-specific, it is established by assigning higher likelihood to data with better model results.
P(θ): prior. Initial belief on how model parameters might be, specified in terms of parametrized distributions (uniform, normal, etc). Inference should converge to probable θ.

Example: Coin Tosses

Let us exemplify the outcomes of the Bayesian versus the non-Bayesian approach. For the binary random variable of tossing a coin, let θ be the probability of tossing a head. If only 10 observations are available with 8 “heads”, the estimated non-Bayesian model θ = 0.8 would not be very intuitive. However, our prior belief about fair coins can be encoded: the two outputs “head” and “tail” are equally probable (0.5). The likelihood P(D|θ) follows a Bernoulli distribution. Let us choose the prior P(θ) as the Beta distribution B(α, β), where parameters α and β stand for the number of previously seen heads and tails. Since it is the conjugate prior of the Bernoulli distribution, P(D) can be analytically solved. For the next 10 flips, let h be the number of heads, and t that of tails. The posterior P(θ|D) can then be calculated as the PDF of B(α+h, β+t). For h = 8, maximal P(θ|D) is found for θ = 0.75. Although it is still close to 0.8, the computed value is now biased towards 0.5. If in the next 10 tosses we observe 8 more heads, θ will move even closer to 0.8. That is, as more observations are gathered, the posterior will be updated. Check Bayesian Coin Flips for specific formulas and code in Python.

Bayesian Inference

The goal is to learn the optimal model θ* using P(θ|D) ⧜ P(D|θ) x P(θ). The initial belief (prior) is updated with the evidence (data) to outcome the most probable hypothesis (posterior). The process can be recursively repeated as new observations become available to update model parameters, by using the old posterior as prior. There are two approaches:

Maximum-a-posteriori estimation (MAP):

$$ θ^* = \text{arg}\max_θ P(θ|D) = \text{arg}\max_θ P(D|θ)\times P(θ) $$

Maximum-likelihood estimation (MLE), when all hypotheses are assumed equally probable (i.e. follow uniform distribution):

$$ θ^* = \text{arg}\max_θ P(D|θ) $$

Some examples of encoding prior beliefs for realistic machine learning methods are:

1: Probabilistic linear regression:

Let be the linear regression problem y(x) = w^Tx + e with weight vector w. If we assume Gaussian noise e ∼ N (0, σ²), then response variable y is Gaussian as well: y ∼ N (w^Tx, σ²). Let θ ∼ N (0, γ²), the model with single parameter w.

Recall that

$$ \mathcal{N} (y; \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi \sigma ^2}} e^{- \frac{(y-\mu)^2}{2\sigma ^2}} $$

Furthermore, we apply the naive assumption on conditional independence: P(D|θ) = P(y₁|x₁, θ) x … x P(y_n|x_n, θ).

Now, the trick is applying logarithms to the probabilities. Rewritting the MAP formula:

$$ \begin{aligned} θ^* &= \text{arg}\max_θ P(D|θ)\times P(θ) \\ &= \text{arg}\max_θ \prod_{i=1}^{n} P(y_i|x_i, θ) \times P(θ)\\ &= \text{arg}\max_θ \sum_{i=1}^{n} log\ P(y_i|x_i, θ) + log\ P(θ)\\ &= \text{arg}\max_θ -\dfrac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i-w^Tx_i)^2 - \dfrac{1}{\gamma^2} ||w||^2\\ &= \text{arg}\min_θ \dfrac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i-w^Tx_i)^2 + \dfrac{1}{\gamma^2} ||w||^2 \end{aligned} $$

Therefore, the MAP solution for the probabilistic linear regression corresponds to that of ridge regression (regularized least squares). Note that the MLE solution lacks the regularization term issued from P(θ) and corresponds to the ordinary least squares solution.

2: Probabilistic logistic regression:

Let be the logistic regression problem with labels y ∈ {-1, +1}. The decision boundary for the two classes corresponds to w^Tx = 0. Now the logistic function is used for the likelihood:

$$ P(y|x, \theta) = \dfrac{1}{1+e^yw^Tx} $$

Following the same assumptions as for linear regression, the MAP solution can be written:

$$ \begin{aligned} θ^* &= \text{arg}\max_θ \sum_{i=1}^{n} log\ P(y_i|x_i, θ) + log\ P(θ)\\ &= \text{arg}\max_θ - \sum_{i=1}^{n} log(1+e^{-y_iw^Tx_i}) - \dfrac{1}{\gamma^2} ||w||^2\\ &= \text{arg}\min_θ \sum_{i=1}^{n} log(1+e^{-y_iw^Tx_i}) + \dfrac{1}{\gamma^2} ||w||^2 \end{aligned} $$

Although there is no close-form solution, it can be approached by gradient descent.

3: Bayesian network:

Networks are typically used for medical diagnosis. Let us say we want to compute the probability of a disorder or disease. Real-world knowledge can be incorporated into the model, for example by using the probability issued from a published journal study to encode the prior belief.

4: Bayesian hierarchical modeling:

When approaching a specific problem, information available from related problems or issued from different observation stages can be incorporated into the model.

Bayesian Prediction =============================================

So far, we saw how to calculate the optimal model θ. However, given that our final goal is to predict new instances, we could directly compute the optimal predictions. The formula for the Bayesian prediction is:

$$ \begin{aligned} y^* &= \text{arg}\max_y P(y|x,D)\\ &= \text{arg}\max_y \int P(y|x,\theta) P(\theta|D) d\theta \end{aligned} $$

In general, solving the integral is difficult. For the previous example of linear regression, the properties of the normal distribution allow finding a close-form solution. Kernelization can also be applied by means of the dual formulation. However, for the logistic regression (classification), there is no close-form solution for the integral. The best option is to find the optimal model (Bayesian inference) and to use it to predict new instances.

List of other Bayesian models and methods

Latent variable models (which can be trained with the EM algorithm)
Variational inference: for latent Dirichlet allocation (LDA, for text-mining)
Markov chain Monte Carlo (sampling-based method): for LDA and Bayesian neural networks
Variational autoencoders, based on variational inference
Non-parametric Bayesian methods: HLA (Hierarchical version of LDA) and Gaussian processes

Famous examples

1: Turing cracking Enigma in WWII:

Turing and colleagues estimated the probabilities of the letters in an Enigma message based on Bayesian methods. They encoded their belief and incorporated more guesses as new information arrived. The process was know as Banburismus. Check this explanation on the scoring system they used.

2: 2008 US presidential elections

During the 2008 US presidential election, Nate Silver was able to predict the outcomes of the elections correctly for all the 50 US states. Nate Silver’s approach for predicting the presidential elections was to first incorporate all components relative to the population (state, race, wealth) to a baseline of estimated percentage vote for Obama. Then, to use a temporal component which encoded changes in people vote intention based on changes that occur at both state and national level (unemployment rate drop, tax increase) for updating the probabilities using the Bayes’ rule. More information can be found here.