Chunking (Shallow Parsing): Understanding Text Syntax and Structures, Part 2

December 27, 2017

We got introduced to text syntax and structures and took a detailed look at part of speech tagging in part 1 of this tutorial series. In this tutorial, we will learn about phrasal structure and shallow parsing.

Phrasal Structures

A phrase can be a single word or a combination of words based on the syntax and position of the phrase in a clause or sentence. For example, in the following sentence

My dog likes his food.

there are three phrases. “My dog” is a noun phrase, “likes” is a verb phrase, and “his food” is also a noun phrase.

There are five major categories of phrases:

Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or object to a verb or an adjective. In some cases a noun phrase can be replaced by a pronoun without changing the syntax of the sentence. Some examples of Noun phrases are “little boy”, “hard rock”, etc.
Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually there are two forms of verb phrases. One form has the verb components as well as other entities such as nouns, adjectives, or adverbs as parts of the object. The verb here is known as a finite verb. For example in the sentence “The boy is playing football”, “playing football” is the finite verb phrase. The second form of this includes verb phrases which consist strictly of verb components only. For example, “is playing” in the same sentence is such a verb phrase.
Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun. The sentence, “The cat is too cute” has an adjective phrase, “too cute”, qualifying “cat”.
Adverb phrase (ADVP): These are phrases where adverb acts as the head word in the phrase. Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that describe or qualify them. In the sentence “The train should be at the station pretty soon”, the adverb phrase “pretty soon” describes when the train would be arriving.
Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical components like nouns, pronouns, and so on. It acts like an adjective or adverb describing other words or phrases. The phrase “going up the stairs” contains a prepositional phrase “up”, describing the direction of the stairs.

These five major syntactic categories of phrases can be generated from words using several rules, utilizing syntax and grammars of different types.

Shallow Parsing with illustration

Shallow parsing, also known as light parsing or chunking, is a technique for analyzing the structure of a sentence in-order to identify these phrases or chunks. We start by first breaking the sentence down into its smallest constituents (which are tokens such as words) and then grouping them together into higher-level phrases.

In python the pattern package uses shallow parsing to extract meaningful chunks out of sentences. The following code snippet shows how to perform shallow parsing on our sample sentence:

sentence='Pattern library can extract good chunks from a sentence'
from pattern.en import parsetree
tree=parsetree(sentence)
# print the chunks from shallow parsed sentence tree
for node in tree:
    for chunk in node.chunks:
        print chunk.type, [(word.string, word.type) for word in chunk.words]
>>>
NP [(u'Pattern', u'NN'), (u'library', u'NN')]
VP [(u'can', u'MD'), (u'extract', u'VB')]
NP [(u'good', u'JJ'), (u'chunks', u'NNS')]
PP [(u'from', u'IN')]
NP [(u'a', u'DT'), (u'sentence', u'NN')]

The preceding output is the chunks extracted from the sentence using shallow parsing. Each line begins with the phrase type, and is followed by the list of words in the phrase along with their part-of-speech tags.

How Shallow Parsing works

Let’s take an example of building a basic noun phrase chunker. As explained in the previous section, in a noun phrase, noun acts as a subject or object to a verb or an adjective. In order to create a noun phrase chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. For simplicity let’s assume a single rule grammar which says that a noun phrase chunk should be formed whenever the chunker finds an optional determiner (DET) followed by any number of adjectives (ADJ) and then a noun (NOUN):

grammar = "NP: {<DET>?<ADJ>*<NOUN>}"

Using this grammar, we create a chunk parser, and test it on a sentence. The result is a tree of phrases around noun chunks

import nltk
sentence = "The famous algorithm produced accurate results"
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens, tagset='universal')
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged_sent)
print(result)

Output:

(S
  (NP The/DET famous/ADJ algorithm/NOUN)
  produced/VERB
  (NP accurate/DET results/NOUN))

Likewise you can define multiple grammar rules based on the grammatical phrases you want to extract, for example:

grammar = '''
    NP: {<DET>? <ADJ>* <NOUN>*}
    P: {<PREP>}
    V: {<VERB.*>}
    PP: {<PREP> <NOUN>}
    VP: {<VERB> <NOUN|PREP>*}
    '''

Machine Learning approach for chunking

Another way to build a Parser is to train a classifier using any commonly used supervised classification algorithms such as SVM, Logistic Regression, etc.

Dataset description

You can use IOB tags as features to train a model that can extract chunks. In IOB tags, each word is tagged with one of three special chunk tags, I (Inside), O (Outside), or B (Begin). A word is tagged as B if it marks the beginning of a chunk, subsequent words within the chunk are tagged I and all other words are tagged O.

Fortunately, NLTK provides a labelled training corpus to train such a classifier chunker. CoNLL-2000 data consist of three columns. The first column contains the word, the second its part-of-speech tag and the third its IOB chunk tag. For more details on this corpus, you can read the following paper: Introduction to the CoNLL-2000 Shared Task: Chunking

Let’s try an example where we would use CoNLL-2000 corpus to randomly pick 80% of data set for training and remaining 20% to test our classifier.

from nltk.corpus import conll2000
import random
conll_data = list(conll2000.chunked_sents())
random.shuffle(conll_data)
train_sents = conll_data[:int(len(conll_data) * 0.8)]
test_sents = conll_data[int(len(conll_data) * 0.8 + 1):]

Defining features

Next, lets define a custom feature extractor we would use to train our model. The features would consist of a sequence of tags based on the co-occurrence of the words with the token.

from nltk.stem.porter import PorterStemmer
def features(tokens, index, history):
    # tokens are tagged words of a sentence
    # Index is the index of token for which the features to be extracted
    # history is the previous predicted tags
    stemmer = PorterStemmer()
    # Build the sequence of words for training
    tokens = [('__PREVSEQ2__', '__PREVSEQ2__'), 
        ('__PREVSEQ1__', '__PRESEQ1__')] + list(tokens) + [('__END1__', '__END1__'), 
        ('__END2__', '__END2__')]
    history = ['__PREVSEQ2__', '__PREVSEQ2__'] + list(history)
    # shift the index with 2 to point to current token
    index += 2
    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prev2word, prev2pos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    next2word, next2pos = tokens[index + 2]
    return {
        'word': word,
        'lemma': stemmer.stem(word),
        'pos': pos,
        'next-word': nextword,
        'next-pos': nextpos,
        'next-next-word': next2word,
        'next-next-pos': next2pos,
        'prev-word': prevword,
        'prev-pos': prevpos,
        'prev-prev-word': prev2word,
        'prev-prev-pos': prev2pos,
    }

Training and evaluation

NLTK also provides in its package a sequential tagger that uses a classifier to choose the tag for each token in a sentence. NLTK’s ClassifierBasedTagger can be trained on custom features extracted from CoNLL-2000 or a similar data set.

Let’s train NLTK’s ClassifierBasedTagger using our custom features on the training data set and evaluate the model on the test sample data.

from nltk import ChunkParserI, ClassifierBasedTagger
from nltk.chunk import conlltags2tree, tree2conlltags
class FooChunkParser(ChunkParserI):
    def __init__(self, chunked_sents, **kwargs):
        # Transform the trees in IOB annotated sentences [(word, pos, chunk)]
        chunked_sents = [tree2conlltags(sent) for sent in chunked_sents]
        # Make tags compatible with the tagger interface [((word, pos), chunk)]
        def get_tagged_pairs(chunked_sent):
            return [((word, pos), chunk) for word, pos, chunk in chunked_sent]
        chunked_sents = [get_tagged_pairs(sent) for sent in chunked_sents]
        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=chunked_sents,
            feature_detector=features,
            **kwargs)
    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)
        iob_triplets = [(word, token, chunk) for ((word, token), chunk) in chunks]
        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)
chunker = FooChunkParser(train_sents)
print(chunker.evaluate(test_sents))
>>>
ChunkParse score:
    IOB Accuracy:  93.1%%
    Precision:     87.8%%
    Recall:        90.8%%
    F-Measure:     89.2%%

Comparison of machine learning vs. rule-based chunker

Different experiments have shown that the performance of the classifier based chunker is very similar to the results obtained by the rule based chunker. At times it could be considerably hard to define regular expressions to extract chunks which may have very complex structures. In such a cases, the machine learning approach is helpful.

Conclusion

The next tutorial in the series explains what deep parsing is, and how we can use context free grammar to automatically predict the parse tree of a sentence.