# Introduction

Conditional Random Fields are **probabilistic graphical models for sequential or structured data**. They allow us to perform classification while taking into account the context delivered by the sequence. We talk about *structured prediction*, where segments are assumed to be related to each other.

By doing so, valuable contextual information, which would be lost in individual classifications, can be given to the model. For example, words in a sentence are grammatically connected: after an adjective it is more likely to find a noun than a verb. And this hint can be used to label the noun *books* in the sentence “The woman carefully carried the two red __books__”.

CRFs are mostly used in NLP tasks such as part-of-speech tagging or sequence labeling (for extracting specific words). They also have applications in computer vision, from image segmentation to activity recognition (labeling each frame of a video with an activity) and gesture or object recognition in videos. In these scenarios, *sequences* are sentences or videos, and *segments* are words or frames.

# How the algorithm works

Let us explain the construction of a CRF for a **part-of-speech (POS) tagging** task. Let *x* be the input sentence of length *N*: *x = (x_{1}, ..., x_{N})*. Let *x_{i}* be the word at position *i*. Let *y* be the label vector of the entire sentence and *y_{i}* the label for word *x_{i}*.

The first part of the problem deals with the data representation, or **feature extraction**. For this matter, a set of *m* feature functions *f_{j}* is defined. Each function *f_{j}* in the set is applied to every word *x_{i}* in the sentence. We write *f_{j}*(*y, x, i*) to denote the dependency of the function on the vectors *x* and *y*, and its application at position *i*. Secondly, **the model is trained** in order to learn a weight *w_{j}* for each feature function. Next, the weighted features are added up across all words and functions. The output is a **score** *s* for each possible label vector *y*:

*s*(*y, x*) = Σ_{i=1..N} Σ_{j=1..m} *w_{j}* *f_{j}*(*y, x, i*)

Finally, scores are transformed into **probabilities** by normalizing over all possible label vectors *y′*:

*p*(*y|x*) = exp(*s*(*y, x*)) / Σ_{y′} exp(*s*(*y′, x*))

Since the feature values *f* can depend not just on the input *x* but also on the output *y*, we can make coherent predictions. If each label were predicted independently, the predictions might often not make sense as a whole. However, for the same reason, this makes training and prediction much more computationally intensive. The value of *f* depends on *y*, and hence in the most general setting, every possible value of the sequence *y* needs to be tried for the score to be calculated. In order to circumvent this computational issue, we often make *f* depend only on small parts of *y*, so that we only need to try all possible values of that part of *y*. (We'll see a concrete example below.)

## Aside: Relation to Logistic Regression

Note that CRFs are the **sequential version of logistic regression**, i.e. if we change *x* and *y* to not be sequences, and allow *f*(*x*) to only look at individual dimensions of *x*, then we get standard logistic regression.
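To make the reduction concrete, here is a minimal numeric sketch (the feature vector, class names, and weights are made up for illustration): with a single position, the CRF probability *p*(*y|x*) is just a softmax over per-class weighted feature sums, i.e. multinomial logistic regression.

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities, as in the CRF equation."""
    m = max(scores.values())                     # subtract max for stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    Z = sum(exps.values())
    return {y: e / Z for y, e in exps.items()}

# A "sequence" of length one: just a feature vector for a single item.
x = [2.0, -1.0]
# One weight vector per class (illustrative values).
weights = {"pos": [1.0, 0.5], "neg": [-1.0, 0.2]}

# s(y, x) = sum_j w_{j} * f_{j}(x), with no neighboring labels to depend on.
scores = {y: sum(w_d * x_d for w_d, x_d in zip(w, x)) for y, w in weights.items()}
probs = softmax(scores)
print(probs["pos"] > probs["neg"])  # True: class "pos" scores higher here
```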

## Linear-chain CRFs: Feature functions

Feature functions evaluate the likelihood of individual labels for each word. They usually take values between 0 and 1. For the simple case of **linear-chain CRFs**, the general form can be written as *f_{j}*(*y_{i-1}, y_{i}, x, i*). Besides the entire sentence *x* and the current word position *i*, the function only depends on the labels of the previous and current words.

Note that not all functions make use of every input. That is, a function might only look at the previous label *y_{i-1}*, whereas another may focus on the sentence structure *x* and the current word *x_{i}*, and yet others might look at both in combination. Let us see some examples.

The first function *f*_{1} encodes the idea that questions should start with verbs (*x_{N}* is the last word in the sentence). The second function *f*_{2} gives a maximal value to words ending in "ly" if they are labeled as adverbs. Note that neither *f*_{1} nor *f*_{2} makes use of the previous word's label *y_{i-1}*. But *f*_{3} does: adjectives are likely to be succeeded by nouns.

The following picture exemplifies the graphical structure of a linear-chain conditional random field, seen as a factor graph. Factor graphs represent the factorization of a function. Typically, variable nodes such as *x* and *y* are depicted as circles, while factor nodes are represented by shaded squares, which in this case correspond to the feature functions *f*. In our case, we see that *y*_{i} may depend on *x*, or on *y*_{i-1}. By exploiting the structure of a factor graph, computations can be performed efficiently.
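The example feature functions above can be sketched in Python as simple indicator functions, together with the score computation. The tag names, the 0-based indexing, and the check that a question ends in "?" are our own illustrative choices:

```python
def f1(y_prev, y_curr, x, i):
    """Questions should start with a verb: fires when the first word is
    tagged VERB and the last word of the sentence is a question mark."""
    return 1.0 if i == 0 and y_curr == "VERB" and x[-1] == "?" else 0.0

def f2(y_prev, y_curr, x, i):
    """Words ending in 'ly' are likely adverbs."""
    return 1.0 if x[i].endswith("ly") and y_curr == "ADV" else 0.0

def f3(y_prev, y_curr, x, i):
    """Adjectives are likely to be followed by nouns: unlike f1 and f2,
    this uses the previous word's label."""
    return 1.0 if y_prev == "ADJ" and y_curr == "NOUN" else 0.0

def score(y, x, features, weights):
    """Unnormalized score s(y, x): weighted sum over positions and features."""
    return sum(
        w * f(y[i - 1] if i > 0 else None, y[i], x, i)
        for i in range(len(x))
        for f, w in zip(features, weights)
    )

x = ["The", "woman", "carefully", "carried", "the", "two", "red", "books"]
y = ["DET", "NOUN", "ADV", "VERB", "DET", "NUM", "ADJ", "NOUN"]
s = score(y, x, [f1, f2, f3], [1.0, 1.0, 1.0])
print(s)  # 2.0: f2 fires at "carefully", f3 at "red" -> "books"
```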

# Training, aka Parameter Estimation

During training, we seek to obtain the optimal weights that minimize the negative log-likelihood of the training dataset {(*x^{k}, y^{k}*)}:

*L*(*w*) = −Σ_{k} log *p*(*y^{k}|x^{k}*)

Superscripts denote samples in the dataset. Gradient-based methods are typically used in this task with random weight initialization.

When learning the weights for the features in the linear example above, a large positive value for *w_{2}* would corroborate that words with the suffix "ly" are adverbs. The third function might not get such a high weight, since adjectives can also be followed by other adjectives, and not only by nouns. Note that weights can take negative values.
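Training can be sketched on a toy example by computing the negative log-likelihood with brute force: we enumerate every label sequence to obtain the normalizer, which is only feasible for tiny inputs. The single feature, the two-label tag set, and the finite-difference gradient (written for this one-weight case) are all illustrative simplifications:

```python
import math
from itertools import product

LABELS = ["ADV", "OTHER"]

def f1(y_prev, y_curr, x, i):
    """Words ending in 'ly' tagged as adverbs."""
    return 1.0 if x[i].endswith("ly") and y_curr == "ADV" else 0.0

FEATURES = [f1]

def score(y, x, w):
    return sum(w[j] * f(y[i - 1] if i else None, y[i], x, i)
               for i in range(len(x)) for j, f in enumerate(FEATURES))

def neg_log_likelihood(data, w):
    """-sum_k [ s(y^k, x^k) - log Z(x^k) ], with Z computed by brute force."""
    nll = 0.0
    for x, y in data:
        log_z = math.log(sum(math.exp(score(yp, x, w))
                             for yp in product(LABELS, repeat=len(x))))
        nll -= score(y, x, w) - log_z
    return nll

data = [(["run", "quickly"], ["OTHER", "ADV"])]
w = [0.0]
eps, lr = 1e-4, 0.5
for _ in range(50):
    # Finite-difference gradient; valid here because w has one component.
    grad = [(neg_log_likelihood(data, [wj + eps]) -
             neg_log_likelihood(data, [wj - eps])) / (2 * eps) for wj in w]
    w = [wj - lr * g for wj, g in zip(w, grad)]
print(w[0] > 0)  # True: the "ly" -> ADV weight grows positive
```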

## Inference

Once the model is trained, labels for words in unseen sentences can be predicted. For a sentence *x* of length *N* and a set of *k* labels, computing all *k^{N}* possible probabilities *p*(*y|x*) to find the most probable assignment *y^{*}* = argmax_{y} *p*(*y|x*) is very time-consuming.

Instead, this can be done with *dynamic programming*, in a similar way as the Viterbi algorithm is applied to Hidden Markov Models. For example, in the linear-chain CRF above, since *y*_{i} only depends on *y*_{i-1} and not on the rest of the *y*'s, we can make the predictions one word at a time, and we only need to keep track of the best prediction score so far for each value of *y*_{i-1}. If *y*_{i} depended on all the *y*'s, then we would have to keep track of the scores for the entire vector (*y*_{1}, *y*_{2}, ..., *y*_{i-1}). Once we compute the scores recursively, finding the maximizing assignment *y^{*}* can be done via a traceback (backward recursion). Further details can be found in this paper.
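This dynamic program can be sketched as a small Viterbi decoder. The `local_score` function below stands in for the weighted feature total at one position; its rules and the tag names are toy choices of our own:

```python
def viterbi(x, labels, local_score):
    """Return the highest-scoring label sequence for sentence x, keeping
    only the best running score per current label (ties fall to the
    first label in `labels`)."""
    N = len(x)
    best = [{} for _ in range(N)]      # best[i][y]: best score ending in y at i
    backptr = [{} for _ in range(N)]   # backptr[i][y]: best previous label
    for y in labels:
        best[0][y] = local_score(None, y, x, 0)
    for i in range(1, N):
        for y in labels:
            prev = max(labels,
                       key=lambda yp: best[i - 1][yp] + local_score(yp, y, x, i))
            best[i][y] = best[i - 1][prev] + local_score(prev, y, x, i)
            backptr[i][y] = prev
    # Traceback (backward recursion) from the best final label.
    y_last = max(labels, key=lambda y: best[N - 1][y])
    path = [y_last]
    for i in range(N - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))

def local_score(y_prev, y_curr, x, i):
    """Toy weighted features: 'ly' words prefer ADV; ADJ then NOUN."""
    s = 0.0
    if x[i].endswith("ly") and y_curr == "ADV":
        s += 2.0
    if y_prev == "ADJ" and y_curr == "NOUN":
        s += 1.0
    return s

print(viterbi(["walk", "slowly"], ["NOUN", "ADV"], local_score))
# ['NOUN', 'ADV']: "slowly" scores highest as ADV; "walk" falls to NOUN by tie-break
```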

## General CRFs

Now that we know how to train and predict using a linear CRF, let's look at the structure of a general CRF.

Graphical structures of general CRFs increase in complexity. We define a set of factors *Ψ_{t}*. Each factor *Ψ*_{1}, *Ψ*_{2}, *Ψ*_{3}, etc. defines a group of variables that is available for a feature function to use at the same time. For example, if *Ψ*_{1} = {*y*_{1}, *y*_{3}, *y*_{7}}, then we can define feature functions which make use of all three values *y*_{1}, *y*_{3}, and *y*_{7} at the same time.

Below is an example of a graph *G* with three factors *Ψ*_{1}, *Ψ*_{2}, and *Ψ*_{3}. Each represents an individual factor graph *G_{t}*, separated by dashed lines.

Here, weights and feature functions can vary for each factor *Ψ_{t}*. They are now written with two subscripts: *w_{jt}* and *f_{jt}*. In linear-chain CRFs, weights and features were shared across the entire graph *G*: *w_{j}* and *f_{j}*. As noted above, these features can take more general forms by including extra dependencies.

_{j}## Python libraries

Two libraries are available in Python for modeling CRFs: *PyStruct* and *python-crfsuite*. An example on **named-entity recognition** makes use of the latter in this tutorial. These libraries take care of implementing the CRF model itself; to use them, we only need to define the feature functions.
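With python-crfsuite, each word is typically represented as a dict of features, and the sequence of such dicts is passed to the trainer together with the label sequence. A sketch of such a feature extractor follows; the individual feature names are our own choices, not a library requirement:

```python
def word2features(sent, i):
    """Feature dict for the word at position i, in the shape commonly
    passed to python-crfsuite (illustrative feature names)."""
    word = sent[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization hints at entities
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],             # e.g. "ly" -> adverb-like suffixes
    }
    if i > 0:
        features["-1:word.lower"] = sent[i - 1].lower()
    else:
        features["BOS"] = True            # beginning of sentence
    if i < len(sent) - 1:
        features["+1:word.lower"] = sent[i + 1].lower()
    else:
        features["EOS"] = True            # end of sentence
    return features

sent = ["Angela", "visited", "Berlin"]
X = [word2features(sent, i) for i in range(len(sent))]
print(X[0]["word.istitle"], X[2]["suffix3"])  # True lin
```

Such per-word dicts, paired with a label list like `["PER", "O", "LOC"]`, can then be appended to a `pycrfsuite.Trainer` for training.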

# Comparison to Hidden Markov Models

*Generative classifiers* apply **Bayes' rule** to the joint probability *p*(*x, y*) to obtain the posterior *p*(*y|x*). This is the case of Hidden Markov Models. Recall their equation:

*p*(*x, y*) = Π_{i=1..N} *p*(*y_{i}|y_{i-1}*) *p*(*x_{i}|y_{i}*)

On the other hand, *discriminative techniques* such as **CRFs** model the posterior directly. In fact, by allowing any kind of feature function, and an arbitrary number of them, CRFs are more powerful. Furthermore, they can examine parts of the input sequence which are effectively hidden to the Markov model. HMMs can be seen as a special case of CRFs, where constant probabilities are used to model the state transitions *p*(*y_{i}|y_{i-1}*) and emissions *p*(*x_{i}|y_{i}*).
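The special-case claim can be sketched numerically: with one indicator feature per transition and per emission, and weights fixed to the log-probabilities, the CRF score of a sequence equals the log of the HMM joint probability. All probability tables below are made up for illustration:

```python
import math

trans = {("A", "A"): 0.7, ("A", "B"): 0.3,   # p(y_i | y_{i-1})
         ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.9, ("A", "z"): 0.1,    # p(x_i | y_i)
        ("B", "x"): 0.2, ("B", "z"): 0.8}
start = {"A": 0.5, "B": 0.5}                 # p(y_1)

def hmm_log_joint(x, y):
    """log p(x, y) under the HMM factorization."""
    logp = math.log(start[y[0]]) + math.log(emit[(y[0], x[0])])
    for i in range(1, len(x)):
        logp += math.log(trans[(y[i - 1], y[i])]) + math.log(emit[(y[i], x[i])])
    return logp

def crf_score(x, y):
    """CRF score with indicator features weighted by log-probabilities."""
    s = math.log(start[y[0]])                        # start feature
    for i in range(len(x)):
        s += math.log(emit[(y[i], x[i])])            # emission features
        if i > 0:
            s += math.log(trans[(y[i - 1], y[i])])   # transition features
    return s

x, y = ["x", "z", "x"], ["A", "B", "A"]
print(abs(hmm_log_joint(x, y) - crf_score(x, y)) < 1e-12)  # True
```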