Hidden Markov Models

October 03, 2017

Let us imagine a dynamic system whose future states only depend on the current one. Such a stochastic process is called a Markov process, and it is said to satisfy the Markov property (“memoryless”):

$$ p(h_{t+1} |h_1,..., h_t) = p(h_{t+1}|h_t) $$

Now, let us assume that each real state is not directly available. Instead, an observation of it that works as an indicator is available to us. In fact, this is a realistic scenario. For example, a system with noise-corrupted measurements or a process that cannot be completely measured. There is an uncertainty about the real state of the world, which is referred to as hidden. A Hidden Markov Model (HMM) serves as a probabilistic model of such a system.

Let H be the latent, hidden variable that evolves over time. Let O be the random variable over its observations, also known as the output sequence. Graphically, the system at time steps {1, …, T} can be seen as follows:

Parameters of the model are θ = (π, A, B), with:

the initial state vector π of elements p(h₁)
the state transition matrix A of probabilities p(h_t | h_t-1)
the emission or observation matrix B of elements p(o_t | h_t)

Weather example

Let us see a simple example. David and his twin brother Eric talk daily over phone about what they did during the day. Eric is considering moving back to their hometown (he currently lives abroad) and would like to know whether it was a sunny or a rainy day. Let H be the random variable that describes the weather on a given day, whose states are hidden to Eric. There are two possibles states: rainy and sunny. Let O={walk, shop, clean} be the possible activities David performed and later told his brother. Following figure describes the system with all transition and emission probabilities. For example, given that it was sunny yesterday, there is a 40% probability that today it will rain (transition probability).

Problems

There are four major types of questions one can ask when modeling a situation using HMMs

Inference problems (known θ)

a) Likelihood of an observation sequence: p(O) = p(o₁, o₂, …, o_T)
b) Hidden state inference (most probable state given a set of observations): h_t* = argmax_{h_t} p(h_t | O)
c) Most likely explanation (most probable path of hidden states that results in a sequence of observations): H* = argmax_H p(H, O)

Learning (unknown θ)

d) Train the model parameters, i.e., given an output sequence, find the best set of transition and emission probabilities: θ* = argmax_θ p(O)

The HMM model has been designed precisely such that there are efficient ways of computing the answers to all the above problems.

Forward-backward algorithm: Problems (a) and (b)

This is a procedure based on dynamic programming to solve some of the problems above. Let’s introduce the following auxiliary variables:

$$ \alpha_t (h_t) = p(h_t, o_1, ..., o_t),\ \ \beta_t (h_t)= p(o_{t+1}, ..., o_T | h_t) $$

The idea is to compute in two passes their values for all t. The equations below can be derived using Markov property. First pass goes forward in time:

$$ \begin{aligned} &\alpha_1 (h_1) = p(h_1)\ p(o_1|h_1) \\ &\alpha_{t+1} (h_{t+1}) = \sum_{h_t} \alpha_t(h_t) \ p(h_{t+1} | h_t) \ p(o_{t+1} | h_{t+1}) \end{aligned} $$

And second pass goes backward:

$$ \begin{aligned} &\beta_1 (h_T) = 1; \\ &\beta_{t-1} (h_{t-1}) = \sum_{h_t} p(h_t|h_{t-1}) \ p(o_t | h_t) \ \beta_t(h_t) \end{aligned} $$

Solution to problem (a): (only uses the forward-pass)

$$ p(O) = p(o_1, ..., o_T) = \sum_{h_T} p(o_1, ...,o_T|h_T)= \sum_{h_T} \alpha_T(h_T) $$

Solution to problem (b):

$$ \begin{aligned} &p(h_t|O)p(O) = p(h_t, O) = \alpha_t(h_t)\beta_t(h_t) \\ &\Rightarrow p(h_t|O) = \dfrac{\alpha_t(h_t)\ \beta_t(h_t)}{\sum_{h_T} \alpha_T} \end{aligned} $$

Viterbi algorithm: Problem (c)

Let us define the probability of the most likely path through state h_t = k

$$ \begin{aligned} &V_1(k) = p(o_1|h_1=k)\ p(h_1=k) &(1) \\ &V_t(k) = p(o_t|h_t=k)\ \max_x p(h_t=k|h_{t-1}=x) \ V_{t-1}(x) &(2) \end{aligned} $$

Solution to c):

This technique works by recursively computing V_t(k) until T to find the most likely path based on a sequence of observations:

# Initialization
calculate V1(k) using (1)
# Iteration
for t=2, ..., T
    calculate Vt(k) by applying (2)
# Termination
calculate p(H, O) as the maximum over all k at step T: max VT(k)

Baum-Welch algorithm: Problem (d)

This procedure instantiates the expectation-maximization algorithm by using the forward-backward inference in the Expectation-step. It is used to solve the learning problem (d), which is approached by deriving the maximum likelihood estimate of the parameters θ.

Given: (Long) observed sequence O
Start with θ = transition and emission probabilities = uniform distribution
1. Compute the expected values of the hidden states given θ [E-step]
2. Re-estimate θ = maximum likelihood estimate given hidden states [M-step]
Repeat till convergence

In step 2, the maximum likelihood estimate for each probability distribution is simply obtained by counting and normalizing.

Applications

HMMs are one of the best known algorithms for

automatic speech recognition systems, where the goal is to reconstruct spoken words (hidden states) from an acoustic signal (observations)
part-of-speech tagging in natural language texts, which aims to assign grammatical categories such as noun, verb, adjective (hidden states) to words in a text (observations)