Table of contents

- Introduction
- How the algorithm works
- Aside: Relation to Logistic Regression
- Linear-chain CRFs: Feature functions
- Training, aka Parameter Estimation
- Inference
- General CRFs
- Python libraries
- Comparison to Hidden Markov Models

NaN.

TutorialMachine LearningLast updated

Conditional Random Fields[ Edit ]

Conditional Random Fields are **probabilistic graphical models for sequential or structured data**. They allow us to perform classification taking into account the context delivered by the sequence. We talk about a *structured prediction*, where segments are assumed to be related with each other.

By doing so, valuable contextual information, which would be lost in individual classifications, can be given to the model. For example, words in a sentence are grammatically connected: after an adjective it is more likely to find a noun than a verb. And this hint can be used to label the noun *books* in the sentence “The woman carefully carried the two red __books__”.

CRFs are mostly used in NLP tasks such as part-of-speech tagging or sequence labeling (for extracting specific words). It also has applications in computer vision: from image segmentation (label each frame with an activity) to object recognition (gesture recognition or classifying objects in a video). In these scenarios, *sequences* are sentences or videos, and *segments* are words or frames.

Let us explain the construction of a CRF for a **part-of-speech (POS) tagging** task. Let *x* be the input sentence of length *N*: *x = (x _{1}, ..., x_{N})*. Let

The first part of the problem deals with the data representation or **feature extraction**. For this matter, a set of *m* feature functions *f _{j}* is defined. Each function

s(y|x) = \sum_{j=1}^m \sum_{i=1}^N w_j f_j(y,x,i)

Finally, scores are transformed into **probabilities**:

p(y|x) = \dfrac{e^{s(y|x)}}{\sum_{y^\prime} e^{s(y^\prime|x)}}

Since the feature values *f* can depend on not just the input *x* but also on output *y*, we can make coherent predictions. If each label was predicted independently, predictions might often not make sense as a whole. However, because of the same reason, this makes the training and prediction process much more computationally intensive. The value of *f* depends on *y*, and hence in the most general setting, every possible value of the sequence *y* needs to be tried for the score to be calculated. In-order to circumvent this computational issue, we often make *f* depend on small parts of *y*, so that we only need to try all possible values of that part of *y.* (We'll see a concrete example below).

Note that CRFs are the **sequential version of logistic regression**, i.e. if we change *x* and *y* to not be sequences, and allow *f*(*x*) to only look at individual dimensions of *x*, then we get standard logistic regression.

Feature functions evaluate the likelihood of individual labels for each word. They usually take values between 0 and 1. For the simple case of **linear-chain CRFs**, the general form can be written as: *f _{j}*(

Note that not all functions make use of every input. That is, a function might only look at the previous label *y _{i-1}*, whereas another may focus on the sentence structure

\begin{aligned}
f_1(y_{i-1}, y_i, x, i) &= \begin{cases}
1 &\text{if } y_i=\text{verb},\ x_N=\text{"?"}, i=1 \\
0 &\text{otherwise}
\end{cases}\\
f_2(y_{i-1}, y_i, x, i) &= \begin{cases}
1 &\text{if } y_i=\text{adverb},\ x_i=\text{ends with "ly"}
\\
0 &\text{otherwise}
\end{cases}\\
f_3(y_{i-1}, y_i, x, i) &= \begin{cases}
1 &\text{if } y_{i-1}=\text{adjective},\ y_i=\text{noun}
\\
0 &\text{otherwise}
\end{cases}
\end{aligned}

First function *f*_{1} encodes the idea that questions should start with verbs (*x _{N}* is the last word in the sentence). Second function

Following picture exemplifies the graphical structure of a linear conditional random field, seen as a factor graph. Factor graphs represent the factorization of a function. Typically, variable nodes such as *x* and *y* are depicted as circles. In addition, factor nodes are represented by shaded squares, which in this case correspond to feature functions *f*. In our case, we see that *y*_{i} may depend on *x*, or on *y*_{i-1}. By exploiting the structure of a factor graph, computations can be performed efficiently.

During training, we seek to obtain the optimal weights that minimize the negative log-likelihood of the training dataset {( *x ^{k}, y^{k }*)}:

w^* = \min_w \Big\{-\log \Big( \prod_k p(y^k | x^k, w) \Big) \Big\}

Superscripts denote samples in the dataset. Gradient-based methods are typically used in this task with random weight initialization.

When learning the weights for the features in the linear example above, a large positive value for *w _{2}* would corroborate that words with suffix “ly” are adjectives. The third function might not get such a high weight, since adjectives can also be followed by other adjectives, and not only by nouns. Note that weights can take negative values.

Once the model is trained, labels on words in unseen sentences can be predicted. For a sentence *x* of length *N* and a set of *k* labels, computing all *k ^{N}* possible probabilities

Instead, this can be done with *dynamic programming* in a similar way as the Viterbi algorithm is applied to Hidden Markov Models. For example, in the linear CRF example above, since y_{i} only depends on y_{i-1} and not the rest of the y's, we can make the predictions one word at a time, and we only need to keep track of the best prediction score so far for each value of y_{i-1}. If y_{i} depended on all the y's, then we'd have to keep track of all the scores for the entire vector y, i.e. (y_{1}, y_{2}, y_{3}, ... y_{i-1}). Once we compute the scores recursively, then finding the maximizing assignment *y ^{*}* can be done via a traceback (backward recursion). Further details can be found in this paper.

Now that we know how to train and predict using a linear CRF, let's look at the structure of a general CRF.

Graphical structures of general CRFs increase in complexity. We define a factors *Ψ _{t}*. Each factor,

Below is an example of a graph *G* with three factors *Ψ _{1}*,

Here, weights and feature functions can vary for each factor *Ψ _{t}. *They are now written with two subscripts:

Two libraries are available in Python for modeling CRFs: *PyStruct* and *python-crfsuite*. An example on **named-entity recognition** makes use of the latter in this tutorial. These libraries take care of implementing the CRF model, and to use them we only need to define the feature functions.

*Generative classifiers* apply **Bayes' rule** on the joint probability *p*(*x, y*) to solve the posterior *p*(*y|x*)*.* This is the case of Hidden Markov Models. Recall their equation:

p(x,y)=p(y_1) \prod_i p(y_i|y_{i-1})p(x_i|y_i)

On the other hand, *discriminative technique*s such as **CRFs** model the posterior directly. In fact, by allowing any kind of feature function and an arbitrary number of them, CRFs are more powerful. Furthermore, they can examine parts of the input sequence which are actually hidden to the Markov model. HMMs can be seen as a special case of CRFs, where constant probabilities are applied to model binary state transitions *p*(*y _{i}|y_{i-1}*) and emissions

Read more…(1505 words)

About the contributors:

Marta EnescoData Scientist, Graduate Research Assistant at University of Potsdam

77%

Keshav DhandhaniaMSc in Deep Learning @ MIT (2014)

23%

Loading…

Join the discussion. Add a reply…

Post

Table of contents

- Introduction
- How the algorithm works
- Aside: Relation to Logistic Regression
- Linear-chain CRFs: Feature functions
- Training, aka Parameter Estimation
- Inference
- General CRFs
- Python libraries
- Comparison to Hidden Markov Models

Contributors

Marta EnescoData Scientist, Graduate Research Assistant at University of Potsdam

77%

Keshav DhandhaniaMSc in Deep Learning @ MIT (2014)

23%

Ready to join our community?

Sign up below to automatically get notified of new courses, get **reminders** to finish ones you subscribe to, and **bookmark** lessons to read later.

Continue with Facebook

— OR —

Your Full Name

Email address

I have an account. Log in instead

By signing up, you agree to our Terms and our Privacy Policy.

Popular Courses

New Courses

About Us

Get in touch

Copyright 2016-18, Compose Labs Inc. All rights reserved.