Introduction to Named Entity Recognition with Examples and Python Code for Training a Machine Learning Model

January 18, 2018

Introduction

Named Entity Recognition (NER) is a widely used information extraction technique that identifies and classifies named entities in text. These entities fall into pre-defined categories such as person names, organizations, locations, time expressions, financial elements, etc.

Apart from these generic entities, other domain-specific terms can be defined for a particular problem. These terms represent elements which have a unique context compared to the rest of the text. For example, they could be operating systems, programming languages, or football league team names. Machine learning models can be trained to categorize such custom entities, which are usually denoted by proper names and are therefore mostly noun phrases in text documents.

NER Categories

Broadly, NER has three top-level categories: entity names, temporal expressions, and number expressions.

Entity names represent the identity of an element, for example the name of a person, a title, an organization, or any living or nonliving thing.
A temporal expression is a sequence of words with time-related elements, for example calendar dates, times of day, or durations.
A numerical expression is a mathematical expression involving only numbers and/or operation symbols. It could depict financial figures, quantities of tangible entities, mathematical expressions, etc.

Named entities are often not single words but chunks of text, e.g. “University of British Columbia” or “Bank of America”. Therefore, some chunking and parsing model is required to predict whether a group of tokens belongs to the same entity.

Illustration

Let’s review the default implementation of NER available in NLTK on a text snippet from Wikipedia. Later in the article we will train a model to build our custom NER chunker.

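import nltk
# Requires NLTK's tokenizer, POS tagger and NE chunker data (fetched with nltk.download);
# the exact resource names depend on the NLTK version.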
doc = '''Andrew Yan-Tak Ng is a Chinese American computer scientist.
He is the former chief scientist at Baidu, where he led the company's
Artificial Intelligence Group. He is an adjunct professor (formerly 
associate professor) at Stanford University. Ng is also the co-founder
and chairman at Coursera, an online education platform. Andrew was born
in the UK in 1976. His parents were both from Hong Kong.'''
# tokenize doc
tokenized_doc = nltk.word_tokenize(doc)
# POS-tag the tokens, then run nltk's Named Entity Chunker on the tagged tokens
tagged_tokens = nltk.pos_tag(tokenized_doc)
ne_chunked_tree = nltk.ne_chunk(tagged_tokens)
# extract all named entities: entity chunks are subtrees with a label,
# while ordinary tokens are plain (word, POS tag) tuples
named_entities = []
for subtree in ne_chunked_tree:
    if hasattr(subtree, 'label'):
        entity_name = ' '.join(c[0] for c in subtree.leaves())  # join the chunk's words
        entity_type = subtree.label()  # get NE category
        named_entities.append((entity_name, entity_type))
print(named_entities)
[('Andrew', 'PERSON'), ('Chinese', 'GPE'), ('American', 'GPE'), 
('Baidu', 'ORGANIZATION'), ('Artificial Intelligence Group', 'ORGANIZATION'),
('Stanford University', 'ORGANIZATION'), ('Coursera', 'ORGANIZATION'),
('Andrew', 'PERSON'), ('Hong Kong', 'GPE')]

In the above example, the NER chunker uses part-of-speech annotations to find named entities in the text. While the other entity types in the output are self-explanatory, GPE stands for “geo-political entity”, that is, a location.

How to run this code on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory.

Approaches to NER

There are two different approaches to implementing an NER chunker that tags specific elements in text. The classical approach is knowledge- or rule-based; the other is supervised machine learning. Sometimes a combination of both gives better results than either one alone.

Classical approaches

A rule-based NER system uses predefined, language-dependent linguistic rules to identify named entities in a document. Rule-based systems can perform well, but they are limited to a particular language and are not flexible to change. Most of the entities they target are either proper nouns or proper nouns in combination with numbers.

An example of rule based NER to extract date and time can be found in NLTK’s time expression tagger. A code snippet from the top of the file is included below. Notice the regular expressions and complex rules to extract different variants of date and time.

# Predefined strings.
numbers = "(^a(?=\s)|one|two|three|four|five|six|seven|eight|nine|ten| \
          eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen| \
          eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty| \
          ninety|hundred|thousand)"
day = "(monday|tuesday|wednesday|thursday|friday|saturday|sunday)"
week_day = "(monday|tuesday|wednesday|thursday|friday|saturday|sunday)"
...
...
regxp1 = "((\d+|(" + numbers + "[-\s]?)+) " + dmy + "s? " + exp1 + ")"
regxp2 = "(" + exp2 + " (" + dmy + "|" + week_day + "|" + month + "))"
...
...
        # If timex matches ISO format, remove 'time' and reorder 'date'
        if re.match(r'\d+[/-]\d+[/-]\d+ \d+:\d+:\d+\.\d+', timex):
            dmy = re.split(r'\s', timex)[0]
            dmy = re.split(r'/|-', dmy)
            timex_val = str(dmy[2]) + '-' + str(dmy[1]) + '-' + str(dmy[0])
...
...
        # Weekday in the previous week.
        elif re.match(r'last ' + week_day, timex, re.IGNORECASE):
            day = hashweekdays[timex.split()[1]]
            timex_val = str(base_date + RelativeDateTime(weeks=-1, \
                            weekday=(day,0)))
...
...
        # Month in the following year.
        elif re.match(r'next ' + month, timex, re.IGNORECASE):
            month = hashmonths[timex.split()[1]]
            timex_val = str(base_date.year + 1) + '-' + str(month)
...
...

Code snippets from NLTK’s time expression tagger: [timex.py · nltk](https://github.com/nltk/nltkcontrib/blob/master/nltkcontrib/timex.py)
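
To make the rule-based idea concrete without the full complexity of the file above, here is a much smaller sketch. The three patterns and the names RULES and tag_time_expressions are made up for illustration and are not taken from NLTK's tagger; a real rule set would need many more variants.

```python
import re

# Hypothetical, deliberately small rule set (not taken from NLTK's timex.py)
MONTH = (r"(?:january|february|march|april|may|june|july|august|"
         r"september|october|november|december)")
WEEK_DAY = r"(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)"

RULES = [
    # ISO-style dates such as 2018-01-18
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    # Dates such as January 18, 2018
    ("DATE", re.compile(r"\b" + MONTH + r" \d{1,2}, \d{4}\b", re.IGNORECASE)),
    # Relative expressions such as last Friday
    ("RELATIVE", re.compile(r"\b(?:last|next|this) " + WEEK_DAY + r"\b", re.IGNORECASE)),
]

def tag_time_expressions(text):
    """Return (start, matched_text, label) for every rule match, ordered by position."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    return sorted(found)

sample = "The post went live on January 18, 2018 and was revised last Friday."
time_expressions = tag_time_expressions(sample)
print(time_expressions)
```

Every new date format needs another hand-written pattern, which is the inflexibility described above.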

Machine Learning approach

Rule-based NER can sometimes be very complex and less accurate; in such cases a machine learning approach is helpful. In this approach, a model trained with supervised learning on labelled data can predict custom entities in a given text.

For example, let’s consider a problem where the system has to identify all the operating systems mentioned in a text. To train such a supervised learning model, we would need a corpus in which operating system names are labelled in many sentences. Sequence tagging is a common way to achieve this. It is a type of pattern recognition task where a categorical label is assigned to each member of a sequence of words (the sentence).

For our Operating System NER, the words that form an operating system name need the label OS, while the rest of the words can be marked irrelevant (IR), as shown in the example below:

Sent: ['Linux', 'is', 'the', 'best', 'OS']
Labels: ['OS','IR','IR','IR','IR']
Sent: ['Ubuntu', 'is', 'my', 'favourite', 'OS']
Labels: ['OS','IR','IR','IR','IR']

Although these are toy examples, the algorithm and features described below are quite realistic.

Next let’s train a supervised learning model to build an Operating System entity recognizer.

Conditional Random Fields (CRF)

To take advantage of the context surrounding each labelled token in a sequence, a commonly used method is the conditional random field (CRF). It is a type of probabilistic undirected graphical model that can be used to model sequential data. A CRF computes the conditional probability of the values on designated output nodes (the labels) given the values on designated input nodes (the observed tokens and their features).
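
For the linear-chain variant that is commonly used for sequence tagging, this conditional probability is usually written as below. The notation is the standard textbook one and is introduced here as an assumption, not taken from this article: x is the token sequence, y the label sequence, f_k are feature functions, lambda_k their learned weights, and Z(x) the normalizer that sums over all possible label sequences.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Because each feature function sees both the previous label and the whole input sequence, the label of one token can depend on its neighbours.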

Conditionally trained CRFs can easily include a large number of non-independent features, for example POS tags and lower/title/uppercase flags, but the power of the model increases further with features that are sequential in nature, such as the neighbouring tokens.
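
As a rough sketch of what such per-token features could look like, the function below builds a feature dictionary from case flags and an optional POS tag. The function name and the feature keys are hypothetical and are not used elsewhere in this article.

```python
def token_shape_features(token, pos_tag=None):
    """Build case/shape features for one token (feature names are illustrative)."""
    features = {
        'word.lower': token.lower(),      # case-normalized form
        'word.istitle': token.istitle(),  # e.g. 'Ubuntu'
        'word.isupper': token.isupper(),  # e.g. 'OS'
        'word.isdigit': token.isdigit(),  # e.g. '2018'
        'word.suffix3': token[-3:],       # last three characters
    }
    if pos_tag is not None:
        features['word.postag'] = pos_tag
    return features

print(token_shape_features('Ubuntu', 'NNP'))
```

Features like these overlap heavily (a title-case word is never all upper case, for instance), which is fine for a CRF because it does not assume the features are independent.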

Let’s prepare our stub corpus for Operating System tagging and extract sequential features. In order to prepare the data set for training, we need to label every word (or token) in the sentences to be either irrelevant or part of a named entity.

Note that you can find the Google Colab notebook for Named Entity Recognition using CRF here: Notebook Link

data = [(['Linux', 'is', 'the', 'best', 'OS'], ['OS','IR','IR','IR','IR']),
(['Ubuntu', 'is', 'my', 'favourite', 'OS'], ['OS','IR','IR','IR','IR'])]
corpus = []
for (doc, tags) in data:
    doc_tag = []
    for word, tag in zip(doc,tags):
        doc_tag.append((word, tag))
    corpus.append(doc_tag)
print(corpus)
>>> [[('Linux', 'OS'), ('is', 'IR'), ('the', 'IR'), ('best', 'IR'), 
('OS', 'IR')], [('Ubuntu', 'OS'), ('is', 'IR'), ('my', 'IR'), 
('favourite', 'IR'), ('OS', 'IR')]]

Training CRF Model

To train a CRF model, we need to extract features from all the sentences in our stub corpus. Here the features for each token are the token itself and its immediate neighbours in the sequence.

def doc2features(doc, i):
    word = doc[i][0]
    # Features from current word
    features={
        'word.word': word,
    }
    # Features from previous word
    if i > 0:
        prevword = doc[i-1][0]
        features['word.prevword'] = prevword
    else:
        features['BOS'] = True # Special "Beginning of Sequence" tag
    # Features from next word
    if i < len(doc)-1:
        nextword = doc[i+1][0]
        features['word.nextword'] = nextword
    else:
        features['EOS'] = True # Special "End of Sequence" tag
    return features
def extract_features(doc):
    return [doc2features(doc, i) for i in range(len(doc))]
X = [extract_features(doc) for doc in corpus]
print(X)
[[{'BOS': True, 'word.word': 'Linux', 'word.nextword': 'is'},
{'word.nextword': 'the', 'word.word': 'is', 'word.prevword': 'Linux'},
{'word.nextword': 'best', 'word.word': 'the', 'word.prevword': 'is'},
{'word.nextword': 'OS', 'word.word': 'best', 'word.prevword': 'the'},
{'EOS': True, 'word.word': 'OS', 'word.prevword': 'best'}],
[{'BOS': True, 'word.word': 'Ubuntu', 'word.nextword': 'is'},
{'word.nextword': 'my', 'word.word': 'is', 'word.prevword': 'Ubuntu'},
{'word.nextword': 'favourite', 'word.word': 'my', 'word.prevword': 'is'},
{'word.nextword': 'OS', 'word.word': 'favourite', 'word.prevword': 'my'},
{'EOS': True, 'word.word': 'OS', 'word.prevword': 'favourite'}]]

In this example only one previous and one next word are considered in the features; however, the features can include the 2 previous and 2 next words, extending the sequence window to 5. Next, prepare the labels corresponding to the training set.

def get_labels(doc):
    return [tag for (token,tag) in doc]
y = [get_labels(doc) for doc in corpus]
print(y)
[['OS', 'IR', 'IR', 'IR', 'IR'], ['OS', 'IR', 'IR', 'IR', 'IR']]
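
Returning to the window size mentioned above, a wider window could be sketched as follows. This is an illustrative variant, not the feature function used for training in this article: window_features and its feature keys are hypothetical, and it takes a plain list of tokens rather than (word, tag) pairs.

```python
def window_features(tokens, i, size=2):
    """Features for tokens[i] using up to `size` neighbours on each side."""
    features = {'word.word': tokens[i]}
    for offset in range(1, size + 1):
        if i - offset >= 0:
            features['word.prev%d' % offset] = tokens[i - offset]
        if i + offset < len(tokens):
            features['word.next%d' % offset] = tokens[i + offset]
    if i == 0:
        features['BOS'] = True  # beginning of sequence
    if i == len(tokens) - 1:
        features['EOS'] = True  # end of sequence
    return features

sentence = ['Ubuntu', 'is', 'my', 'favourite', 'OS']
window_X = [window_features(sentence, i) for i in range(len(sentence))]
print(window_X[2])
```

With size=2 the middle token of a five-word sentence sees every other word; near the sentence boundaries the missing neighbours are simply omitted.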

Given the training features and corresponding labels, let’s train our Operating System NER model using sklearn-crfsuite, which is a scikit-learn compatible wrapper around the CRFsuite (python-crfsuite) library. During model training, the CRF determines the weights of the different feature functions that maximize the likelihood of the labels in the training data.

import sklearn_crfsuite
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X, y)

Next, let us test our model on a new sentence and see if it is able to tag a new Operating System.

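test = ['CentOS', 'is', 'my', 'favourite', 'OS']
# doc2features reads the word from position 0 of each item,
# so wrap every token in a one-element tuple before extracting features
X_test = extract_features([(word,) for word in test])
print(crf.predict_single(X_test))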
['OS', 'IR', 'IR', 'IR', 'IR']

In the output, the first token has been labelled as OS. This example could be extended with a large corpus and a wider window of sequence features to build a real-world Operating System entity recognizer.

Final notes

The example we saw does not handle multi-word entity names (or chunks). To incorporate that, our dataset needs to use IOB tags as labels to train the model. In IOB tags, each word is tagged with one of three special chunk tags, I (Inside), O (Outside), or B (Begin). A word is tagged as B if it marks the beginning of a chunk, subsequent words within the chunk are tagged I and all other words are tagged O.

If there are multiple entity types, then we’ll have B and I tags for each type. For example, if we want to label people (P) and organizations (O), our tag set will be: begin person (BP), inside person (IP), begin organization (BO), inside organization (IO) and outside (O). Below is an example sentence with these tags.

Andrew      BP
Yan-Tak     IP
Ng          IP
was         O
a           O
professor   O
at          O
Stanford    BO
University  IO

For more on chunking, read this article: Chunking (Shallow Parsing): Understanding Text Syntax and Structures

Applications and use cases of NER

NER is extensively used in question answering systems, document clustering, textual entailment, and text analytics applications. Knowing which named entities appear in a document enables richer analysis and cross-referencing.

For example, news and publishing houses generate large amounts of online content on a daily basis, and managing it correctly is challenging. To get the most out of each article, entity extraction can be used to automatically scan entire articles and reveal which major people, organizations, and places are being talked about. Knowing the relevant tags for each article would also help in automatically categorizing and discovering the articles.

NER also plays an important role in automation of customer support. Automatically tagged locations and product names can help smoothly route customer queries to the right locations and people in a company.

Bonus: Exercise

As a bonus, we have a hands-on exercise. Here is a large Entity Recognition dataset: Annotated Corpus for Named Entity Recognition. See if you can train your own model, analyze the predictions made by the model, observe patterns in its mistakes, and measure overall accuracy.
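
As a starting point for the measurement step, here is a small sketch of token-level accuracy and per-label precision and recall in plain Python. The function names and the gold and predicted labels are made up purely to exercise the code; they are not results from any model.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    pairs = [(gold, pred)
             for gold_seq, pred_seq in zip(y_true, y_pred)
             for gold, pred in zip(gold_seq, pred_seq)]
    return sum(gold == pred for gold, pred in pairs) / len(pairs)

def precision_recall(y_true, y_pred, label):
    """Token-level precision and recall for a single label."""
    true_pos = pred_count = gold_count = 0
    for gold_seq, pred_seq in zip(y_true, y_pred):
        for gold, pred in zip(gold_seq, pred_seq):
            pred_count += pred == label
            gold_count += gold == label
            true_pos += gold == pred == label
    precision = true_pos / pred_count if pred_count else 0.0
    recall = true_pos / gold_count if gold_count else 0.0
    return precision, recall

# Made-up gold and predicted labels, only to exercise the functions
gold_labels = [['OS', 'IR', 'IR', 'IR', 'IR'], ['IR', 'IR', 'OS']]
pred_labels = [['OS', 'IR', 'IR', 'IR', 'IR'], ['IR', 'OS', 'IR']]

print(token_accuracy(gold_labels, pred_labels))          # 0.75
print(precision_recall(gold_labels, pred_labels, 'OS'))  # (0.5, 0.5)
```

Because most tokens in an NER corpus carry the outside or irrelevant label, overall accuracy can look high even when many entities are missed, so per-label precision and recall are usually more informative.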