Table of contents

- Introduction
- How the algorithm works
- Aside: Relation to Logistic Regression
- Linear-chain CRFs: Feature functions
- Training, aka Parameter Estimation
- Inference
- General CRFs
- Python libraries
- Comparison to Hidden Markov Models

Conditional Random Fields

Conditional Random Fields are **probabilistic graphical models for sequential or structured data**. They allow us to perform classification while taking into account the context delivered by the sequence. This is a form of *structured prediction*, in which the segments of a sequence are assumed to be related to each other.

By doing so, valuable contextual information, which would be lost in individual classifications, can be given to the model. For example, words in a sentence are grammatically connected: after an adjective, a noun is more likely to follow than a verb. This hint can be used to label the noun *books* in the sentence “The woman carefully carried the two red __books__”.

CRFs are mostly used in NLP tasks such as part-of-speech tagging or sequence labeling (for extracting specific words). They also have applications in computer vision, from image segmentation to activity recognition (labeling each frame of a video with an activity) and gesture recognition. In these scenarios, *sequences* are sentences or videos, and *segments* are words or frames.

Let us explain the construction of a CRF for a **part-of-speech (POS) tagging** task. Let *x* be the input sentence of length *N*: *x = (x _{1}, ..., x_{N})*. Let *y = (y _{1}, ..., y_{N})* be the corresponding sequence of output labels, one POS tag *y _{i}* per word *x _{i}*.

The first part of the problem deals with the data representation or **feature extraction**. For this matter, a set of *m* feature functions *f _{j}* is defined. Each function *f _{j}*(*y, x, i*) takes the label sequence *y*, the sentence *x* and a position *i*, and returns a real value. A **score** is then computed as a weighted sum of all feature functions over all positions:

s(y|x) = \sum_{j=1}^m \sum_{i=1}^N w_j f_j(y,x,i)

Finally, scores are transformed into **probabilities**:

p(y|x) = \dfrac{e^{s(y|x)}}{\sum_{y^\prime} e^{s(y^\prime|x)}}

Since the feature values *f* can depend not just on the input *x* but also on the output *y*, we can make coherent predictions. If each label were predicted independently, the predictions might often not make sense as a whole. For the same reason, however, training and prediction become much more computationally intensive: because the value of *f* depends on *y*, in the most general setting every possible value of the sequence *y* needs to be tried for the score to be calculated. To circumvent this computational issue, we often make *f* depend only on small parts of *y*, so that we only need to try all possible values of those parts. (We'll see a concrete example below.)
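As a concrete (if naive) sketch, the score and probability above can be computed by brute force, enumerating every possible label sequence, which is exactly the exponential cost just mentioned. The labels, sentence, feature function and weight here are all made up for illustration:

```python
import itertools
import math

# Toy setup (hypothetical): two labels, a three-word sentence,
# a single feature function and its weight.
LABELS = ["noun", "verb"]
x = ["dogs", "chase", "cats"]

def f1(y, x, i):
    # fires when position i is labeled "noun"
    return 1.0 if y[i] == "noun" else 0.0

features = [f1]
weights = [0.5]

def score(y, x):
    # s(y|x) = sum_j sum_i w_j * f_j(y, x, i)
    return sum(w * f(y, x, i)
               for w, f in zip(weights, features)
               for i in range(len(x)))

def prob(y, x):
    # p(y|x): normalize over every possible label sequence y'
    # (exponential in len(x), hence the need for structure in f)
    Z = sum(math.exp(score(yp, x))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / Z

p = prob(("noun", "verb", "noun"), x)
```

Because the normalizer sums over all label sequences, the probabilities of every possible labeling add up to 1 by construction.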

Note that CRFs are the **sequential version of logistic regression**: if we change *x* and *y* to not be sequences, and allow *f* to look only at individual dimensions of *x*, we recover standard logistic regression.

Feature functions evaluate the likelihood of individual labels for each word. They usually take values between 0 and 1. For the simple case of **linear-chain CRFs**, the general form can be written as *f _{j}*(*y _{i-1}, y_{i}, x, i*): each feature may look at the previous label, the current label, the whole sentence and the current position.

Note that not all functions make use of every input. That is, a function might only look at the previous label *y _{i-1}*, whereas another may focus on the sentence *x* itself. Consider the following examples:

\begin{aligned}
f_1(y_{i-1}, y_i, x, i) &= \begin{cases}
1 &\text{if } y_i=\text{verb},\ x_N=\text{"?"}, i=1 \\
0 &\text{otherwise}
\end{cases}\\
f_2(y_{i-1}, y_i, x, i) &= \begin{cases}
1 &\text{if } y_i=\text{adverb},\ x_i=\text{ends with "ly"}
\\
0 &\text{otherwise}
\end{cases}\\
f_3(y_{i-1}, y_i, x, i) &= \begin{cases}
1 &\text{if } y_{i-1}=\text{adjective},\ y_i=\text{noun}
\\
0 &\text{otherwise}
\end{cases}
\end{aligned}
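The same three feature functions can be written as plain Python (a hypothetical sketch; note that positions here are 0-indexed, so `i == 0` is the first word):

```python
# x is the list of tokens, i the position, y_prev and y_i the labels.
def f1(y_prev, y_i, x, i):
    # questions should start with a verb (x[-1] is the last token)
    return 1 if y_i == "verb" and x[-1] == "?" and i == 0 else 0

def f2(y_prev, y_i, x, i):
    # words ending in "ly" tend to be adverbs
    return 1 if y_i == "adverb" and x[i].endswith("ly") else 0

def f3(y_prev, y_i, x, i):
    # adjectives are often followed by nouns
    return 1 if y_prev == "adjective" and y_i == "noun" else 0

x = ["can", "you", "run", "quickly", "?"]
f2("verb", "adverb", x, 3)  # fires: "quickly" ends in "ly"
```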

The first function *f*_{1} encodes the idea that questions should start with verbs (*x _{N}* is the last word in the sentence, so *x _{N}* = "?" detects a question). The second function *f*_{2} captures that words ending in "ly" tend to be adverbs. The third function *f*_{3} expresses that adjectives are often followed by nouns.

The following picture exemplifies the graphical structure of a linear-chain conditional random field, seen as a *factor graph*. Factor graphs represent the factorization of a function: variable nodes such as *x* and *y* are depicted as circles, while factor nodes, represented by shaded squares, correspond here to the feature functions *f*. In our case, we see that *y*_{i} may depend on *x* and on *y*_{i-1}. By exploiting the structure of a factor graph, computations can be performed efficiently.

During training, we seek to obtain the optimal weights that minimize the negative log-likelihood of the training dataset {( *x ^{k}, y^{k }*)}:

w^* = \arg\min_w \Big\{-\log \Big( \prod_k p(y^k | x^k, w) \Big) \Big\}

Superscripts denote samples in the dataset. Gradient-based methods with random weight initialization are typically used for this optimization.
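A minimal sketch of this training objective, using two toy features, a brute-force partition sum and plain gradient descent. The gradient of the negative log-likelihood with respect to each weight is the expected feature count under the model minus the observed feature count (a standard identity for log-linear models; the data and features here are invented for illustration):

```python
import itertools
import math

LABELS = ["adj", "noun"]

def feats(y, x):
    # toy feature vector for a full labeling:
    # [number of adj->noun transitions, number of "noun" labels]
    t = sum(1 for i in range(1, len(y)) if y[i - 1] == "adj" and y[i] == "noun")
    n = sum(1 for lab in y if lab == "noun")
    return [t, n]

def nll_and_grad(w, data):
    # negative log-likelihood -sum_k log p(y^k | x^k, w) and its gradient,
    # with the partition sum computed by brute force over all labelings
    nll, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        seqs = list(itertools.product(LABELS, repeat=len(x)))
        scores = [sum(wj * fj for wj, fj in zip(w, feats(yp, x))) for yp in seqs]
        logZ = math.log(sum(math.exp(s) for s in scores))
        fy = feats(y, x)
        nll += logZ - sum(wj * fj for wj, fj in zip(w, fy))
        probs = [math.exp(s - logZ) for s in scores]
        for j in range(len(w)):
            # gradient: E_p[f_j] - f_j(observed)
            exp_fj = sum(p * feats(yp, x)[j] for p, yp in zip(probs, seqs))
            grad[j] += exp_fj - fy[j]
    return nll, grad

data = [(["red", "books"], ("adj", "noun"))]
w = [0.0, 0.0]
for _ in range(100):
    nll, g = nll_and_grad(w, data)
    w = [wj - 0.1 * gj for wj, gj in zip(w, g)]
```

After training, the weight on the adj-to-noun transition feature grows positive, matching the training labeling.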

When learning the weights for the features in the linear example above, a large positive value for *w _{2}* would corroborate that words with the suffix “ly” are adverbs. The third function might not get such a high weight, since adjectives can also be followed by other adjectives, not only by nouns. Note that weights can take negative values.

Once the model is trained, labels for words in unseen sentences can be predicted. For a sentence *x* of length *N* and a set of *k* labels, computing the probabilities of all *k ^{N}* possible label sequences and picking the best one is intractable.

Instead, this can be done with *dynamic programming*, in a similar way to how the Viterbi algorithm is applied to Hidden Markov Models. For example, in the linear CRF above, since y_{i} only depends on y_{i-1} and not on the rest of the y's, we can make the predictions one word at a time, and we only need to keep track of the best prediction score so far for each value of y_{i-1}. If y_{i} depended on all the y's, we would have to keep track of the scores for the entire vector (y_{1}, y_{2}, ..., y_{i-1}). Once the scores are computed recursively, the maximizing assignment *y ^{*}* can be found via a traceback (backward recursion). Further details can be found in this paper.
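A minimal sketch of this dynamic program for a linear-chain CRF, with a hypothetical score function standing in for the weighted feature sum:

```python
def viterbi(x, labels, score_fn):
    # best[i][y] = max score of any labeling of x[0..i] ending in label y;
    # back[i][y] remembers the previous label achieving that maximum
    best = [{y: score_fn(None, y, x, 0) for y in labels}]
    back = [{}]
    for i in range(1, len(x)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda yp: best[i - 1][yp] + score_fn(yp, y, x, i))
            best[i][y] = best[i - 1][prev] + score_fn(prev, y, x, i)
            back[i][y] = prev
    # traceback from the best final label
    y_last = max(labels, key=lambda y: best[-1][y])
    path = [y_last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# toy scores (made up): prefer "red" to be an adjective,
# and reward adjective -> noun transitions
def score_fn(y_prev, y, x, i):
    s = 0.0
    if x[i] == "red" and y == "adj":
        s += 2.0
    if y_prev == "adj" and y == "noun":
        s += 1.0
    return s

viterbi(["red", "books"], ["adj", "noun"], score_fn)  # -> ["adj", "noun"]
```

Each step only keeps one score per label value, so the cost is linear in sentence length rather than exponential.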

Now that we know how to train and predict using a linear CRF, let's look at the structure of a general CRF.

The graphical structures of general CRFs can be more complex. We define a set of factors *Ψ _{t}*. Each factor depends on a subset of the variables in *x* and *y*, and the conditional probability *p*(*y|x*) is proportional to the product of all factors.

Below is an example of a graph *G* with three factors *Ψ _{1}*, *Ψ _{2}* and *Ψ _{3}*.

Here, weights and feature functions can vary for each factor *Ψ _{t}*. They are now written with two subscripts, *w _{tj}* and *f _{tj}*, denoting the *j*-th weight and feature function of factor *Ψ _{t}*.

Two libraries are available in Python for modeling CRFs: *PyStruct* and *python-crfsuite*. An example on **named-entity recognition** makes use of the latter in this tutorial. These libraries take care of implementing the CRF model; to use them, we mainly need to define the feature functions.

*Generative classifiers* apply **Bayes' rule** to the joint probability *p*(*x, y*) to obtain the posterior *p*(*y|x*)*.* This is the case for Hidden Markov Models. Recall their equation:

p(x,y)=p(y_1) \prod_i p(y_i|y_{i-1})p(x_i|y_i)
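Plugging made-up numbers into this factorization (all probabilities below are invented for illustration):

```python
# Toy HMM parameters: start, transition and emission probabilities.
start = {"adj": 0.4, "noun": 0.6}
trans = {("adj", "noun"): 0.7, ("adj", "adj"): 0.3,
         ("noun", "adj"): 0.2, ("noun", "noun"): 0.8}
emit = {("adj", "red"): 0.1, ("noun", "books"): 0.05}

def joint(x, y):
    # p(x, y) = p(y_1) p(x_1|y_1) * prod_{i>1} p(y_i|y_{i-1}) p(x_i|y_i)
    p = start[y[0]] * emit[(y[0], x[0])]
    for i in range(1, len(x)):
        p *= trans[(y[i - 1], y[i])] * emit[(y[i], x[i])]
    return p

joint(["red", "books"], ["adj", "noun"])  # 0.4 * 0.1 * 0.7 * 0.05
```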

On the other hand, *discriminative techniques* such as **CRFs** model the posterior directly. In fact, by allowing any kind of feature function, and an arbitrary number of them, CRFs are more powerful. Furthermore, they can examine parts of the input sequence that are actually hidden to the Markov model. HMMs can be seen as a special case of CRFs, in which constant (input-independent) probabilities are used to model the state transitions *p*(*y _{i}|y_{i-1}*) and emissions *p*(*x _{i}|y_{i}*).


About the contributors:

Marta Enesco, Data Scientist, Graduate Research Assistant at University of Potsdam

77%

Keshav Dhandhania, MSc in Deep Learning @ MIT (2014)

23%
