We took a detailed look at part of speech tagging in part 1 and chunking in part 2 of this tutorial series. In this tutorial, we will learn about what parsing is, its different types, and techniques to automatically infer the parse tree of sentences.
Natural language parsing (also known as deep parsing) is a process of analyzing the complete syntactic structure of a sentence. This includes how different words in a sentence are related to each other, for example, which words are the subject or object of a verb. Probabilistic parsing uses language understanding such as grammatical rules. Alternatively, it may also use supervised training set of hand-parsed sentences to try to infer the most likely syntax and structure of new sentences.
Parsing is used to solve various complex NLP problems such as conversational dialogues and text summarization. It is different from ‘shallow parsing’ in that it yields more expressive structural representations which directly capture long-distance dependencies and underlying predicate-argument structures.
There are two main types of parse tree structures - constituency parsing and dependency parsing.
Constituent-based grammars are used to analyze and determine the components which a sentence is composed of. There are usually several rules for different types of phrases based on the type of components they can contain, and this can be used to build a parse tree. The non-terminals nodes in the tree are types of phrases and the terminal nodes are the words in the sentence which are constituents of the phrase. For example, consider the following sentence and its constituency parse tree
Harry met Sejal
Sentence | +-------------+------------+ | | Noun Phrase Verb Phrase | | Harry +-------+--------+ | | Verb Noun Phrase | | met Sejal
The dependency-based grammar is based on the notion that linguistic units, e.g. words, are connected to each other by directed links (one-to-one mappings) between words which signify their dependencies.
The resulting parse tree representation is a labelled directed graph where the nodes are the lexical tokens and the labelled edges show dependency relationships between the heads and their dependents. The labels on the edges indicate the grammatical role of the dependent. For example, consider the same sentence and its dependency parse tree
Harry met Sejal
met | +--------------+ subject | | subject | | Harry Sejal
In this section, we’ll explain how parsers based on probabilistic context free grammars work. Let’s start by considering the following example sentence:
I went to the market in my shorts
If we analyze the sentence carefully, there is an ambiguity in phrases - “I went to market” + “in my shorts” and “I went to” + “the market in my shorts”. While it may to be obvious to us that the first phrasal interpretation is sensible, both phrasal structures are syntactically correct. The latter would mean there is some market which is in my shorts (analogous to the sentence, “I went to the market in the city”).
A Context Free Grammar (CFG) is a set of rules which can be repeatedly applied to generate a sentence. Given a sentence and a particular CFG, we can also infer the possible parse tree structures. Let’s illustrate this with a toy CFG for the above example sentence, and check the possible parse trees:
grammar = nltk.CFG.fromstring("""S -> NP VP PP -> P NP NP -> Det N | Det N PP | 'I' VP -> V NP | VP PP Det -> 'to' | 'my' N -> 'market' | 'shorts' V -> 'went' P -> 'in' """)
Here, S is the start symbol, NP, VP, etc represent noun phrase, verb phrase, etc, and N, V, etc represent noun, verb, etc.
This grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase “in my shorts” describes the subject “market” or going to market.
import nltk sent = 'I went to market in my shorts'.split() parser = nltk.ChartParser(groucho_grammar) for tree in parser.parse(sent): print(tree) (S (NP I) (VP (VP (V went) (NP (Det to) (Det the) (N market))) (PP (P in) (NP (Det my) (N shorts))))) (S (NP I) (VP (V went) (NP (Det to) (Det the) (N market) (PP (P in) (NP (Det my) (N shorts))))))
In-order to disambiguate between the above possible trees, we can use a Probabilistic Context Free Grammar (PCFG). In this setting, each of the rules are probabilistic. For example, if we take the rule
N -> 'market' | 'shorts', our grammar would specify with what probabilities corresponding to ‘market’ and ‘shorts’. In general, the probabilities and rules are inferred from annotated datasets.
Now, we will build a custom constituency parsers by creating our own PCFG rules and then using NLTK’s
ViterbiParser to train a parser. We will use treebank corpus that provides annotated parse trees for sentences.
import nltk from nltk.grammar import Nonterminal from nltk.corpus import treebank # get training data training_set = treebank.parsed_sents() # example training sentence print(training_set) (S (NP-SBJ (NNP Mr.) (NNP Vinken)) (VP (VBZ is) (NP-PRD (NP (NN chairman)) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,) (NP (DT the) (NNP Dutch) (VBG publishing) (NN group)))))) (. .))
Next, we will build the rules for our grammar by extracting them from the annotated training sentences.
# Extract the rules for all annotated training sentences rules = list(set(rule for sent in training_set for rule in sent.productions())) print(rules[0:5]) [VBZ -> 'cites', VBD -> 'spurned', PRN -> , ADVP-TMP ,, NNP -> 'ACCOUNT', JJ -> '36-day']
Some of the words in our sample sentence are not covered by the treebank-based grammar, therefore we will add the token and POS tags for our sample sentence in our grammar and build the parser
sentence = "The famous algorithm produced accurate results" # get tokens and their POS tags from pattern.en import tag as pos_tagger tagged_sent = pos_tagger(sentence) for word, tag in tagged_sent: t = nltk.Tree.fromstring("("+ tag + " " + word +")") for rule in t.productions(): rules.append(rule)
Now that we have our necessary grammar with rules, we will build the parser using our grammar and then evaluate it on the sample sentence.
# build the parser viterbi_parser = nltk.ViterbiParser(treebank_grammar) # get sample sentence tokens tokens = nltk.word_tokenize(sentence) # get parse tree result = list(viterbi_parser.parse(tokens)) print(result) (S (NP-SBJ (DT The) (JJ famous) (NNP algorithm)) (VP (VBP produced) (NP-TMP-CLR (JJ accurate) (NNS results)))) (p=3.61545e-25)
The output is the parse tree for our sample sentence. Note that this is a probabilistic parser, and you can see the overall probability of this tree mentioned in the output.
NLTK provides some excellent information on parsing in its official book, available at Analyzing Sentence Structure.
Stanford University’s Natural Language Processing group maintains the most popular library for parsing. Since the models are created in Java, it requires downloading of
jar files to run.
StanfordParser performs constituency parsing. The following code builds the parse tree
from nltk.parse.stanford import StanfordParser # create parser object scp = StanfordParser(path_to_jar='/path/to/stanford-parser.jar', path_to_models_jar='path/to/stanford-parser-models.jar') # get parse tree result = list(scp.raw_parse(sentence))
StanfordDependencyParser can be used to perform dependency parsing. The following code builds the parse tree
from nltk.parse.stanford import StanfordDependencyParser # create parser object scp = StanfordDependencyParser(path_to_jar='/path/to/stanford-parser.jar', path_to_models_jar='path/to/stanford-parser-models.jar') # get parse tree result = list(scp.raw_parse(sentence))
You can decide which technique to use based on the use case, for example if you are interested in sub-phrases within the sentence, you probably want to use constituency parsing. On the other hand, if the goal is to find dependency relationships between words, then dependency parsing is the right choice.
This was the final part of the understanding text syntax and structure tutorial. You should now have a good grasp on different parsing techniques to understand the syntax and structure of a sentence.