CommonLounge Archive

Part of Speech tagging: Understanding Text Syntax and Structures, Part 1

December 18, 2017

Language Syntax and Structure

The syntax and structure of a natural language such as English are governed by a set of rules, conventions, and principles which dictate how words are combined into phrases, phrases into clauses, and clauses into sentences. All these constituents exist together in any sentence and are related to each other in a hierarchical structure.

Let’s look at a very basic example of language structure, viewed in terms of the relationship between subject and predicate. Consider a simple sentence:

Harry is playing football

This sentence involves two entities: Harry and football. To find the subject of a sentence, it is easiest to first find the verb and then ask “who” or “what” performs it. In the sentence above, “playing” is the main verb of the predicate. Asking “Who is playing?” gives “Harry”, the subject, and asking “What is he playing?” gives “football”, the object. A large set of similar rules lets us identify the entities (subjects and objects), the intent (predicates), the relationship between intent and entities, and so on.

Such an analysis is useful in almost any NLP application, since it captures part of the meaning of the text. Given only a collection of words with no relation or structure, it is very difficult to work out what the text is trying to convey.

We’ll approach the language syntax and structure problem in 3 parts:

  1. Part of Speech tagging (this tutorial): analyzing syntax of single words
  2. Chunking / shallow parsing (part 2): analyzing multi-word phrases (or chunks) of text
  3. Parsing (part 3): analyzing sentence structure as a whole, and the relation of words to one another

Part of Speech tags

Part of speech (POS) tags are specific lexical categories to which words are assigned based on their syntactic context and role. English has, broadly, 8 parts of speech: nouns, adjectives, pronouns, interjections, conjunctions, prepositions, adverbs, and verbs.

For instance, in the sentence:

I am learning NLP

the POS tags are:

('I'/'PRONOUN' 'am'/'VERB' 'learning'/'VERB' 'NLP'/'NOUN')

However, there can be more detailed tags in addition to these generic ones. The Penn Treebank, a commonly used dataset for language syntax and structure, defines a finer-grained tagset (36 word-level POS tags plus about a dozen tags for punctuation and symbols) that is widely used in text analytics and NLP applications. You can find more information on specific POS tags and their notations at: Penn Treebank Tagset.pdf

Part of Speech tagging and NLTK illustration

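The process of classifying words and labelling them with POS tags is called POS tagging. Let’s try a Python example using one of the most commonly used POS taggers, NLTK’s pos_tag() function, which is based on the Penn Treebank dataset. Passing tagset='universal' maps the detailed Penn Treebank tags to coarser, generic tags: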

import nltk

# The tokenizer, tagger, and tagset data must first be fetched with nltk.download().
sentence = 'The brown fox is quick and he is jumping over the lazy dog'
tokens = nltk.word_tokenize(sentence)
# tagset='universal' converts Penn Treebank tags to coarse universal tags.
tagged_sent = nltk.pos_tag(tokens, tagset='universal')
print(tagged_sent)

Output:

[('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'),
('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'),
('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'),
('dog', 'NOUN')]

The preceding output shows us the POS tag for each word in the sentence.
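
To make the relationship between the detailed and the generic tags concrete, here is a minimal sketch of how fine-grained Penn Treebank tags collapse into universal tags. The mapping below is only a hand-picked subset, and the fine-grained tags for the sentence are written by hand for illustration (they are what one would expect for this sentence, not captured tagger output). The helper name to_universal is hypothetical.

```python
# Hand-picked subset of the Penn Treebank -> universal mapping (illustrative only).
PTB_TO_UNIVERSAL = {
    "DT": "DET",    # determiner
    "JJ": "ADJ",    # adjective
    "NN": "NOUN",   # singular noun
    "VBZ": "VERB",  # verb, 3rd person singular present
    "VBG": "VERB",  # verb, gerund / present participle
    "CC": "CONJ",   # coordinating conjunction
    "PRP": "PRON",  # personal pronoun
    "IN": "ADP",    # preposition
}

# Fine-grained tags written by hand for the example sentence (an assumption).
ptb_tagged = [
    ("The", "DT"), ("brown", "JJ"), ("fox", "NN"), ("is", "VBZ"),
    ("quick", "JJ"), ("and", "CC"), ("he", "PRP"), ("is", "VBZ"),
    ("jumping", "VBG"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"),
]


def to_universal(tagged, mapping=PTB_TO_UNIVERSAL):
    """Replace each fine-grained tag with its coarse tag; unknown tags become 'X'."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged]


universal_tagged = to_universal(ptb_tagged)
print(universal_tagged)
```

Note that both 'is' (VBZ) and 'jumping' (VBG) end up as VERB: the generic tagset trades detail for simplicity.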

Implementation of POS taggers

Some prominent approaches used to build a POS tagger are described below:

Rule based

Rule-based tagging is the oldest approach to POS tagging. It uses predefined rules to determine the possible tags for each word. The tagger uses information from context (the surrounding words) and morphology (the form of the word itself), and may also include rules based on factors such as capitalization and punctuation. A couple of examples are:

  1. If a word X is preceded by a determiner and followed by a noun, tag it as an adjective (contextual rule). E.g., “brown” in “The brown fox”.
  2. If a word ends with -ous, tag it as an adjective (morphological rule). E.g., “adventurous”.
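
The two example rules can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real tagger: the determiner list and the mini noun lexicon (DETERMINERS, KNOWN_NOUNS) are made up for the illustration, and rule_based_tag is a hypothetical helper name.

```python
DETERMINERS = {"the", "a", "an"}          # assumed closed class for this sketch
KNOWN_NOUNS = {"fox", "dog", "journey"}   # hypothetical mini lexicon


def rule_based_tag(tokens):
    """Tag tokens using a tiny lexicon, one morphological rule and one contextual rule."""
    tags = []
    # Pass 1: lexicon lookup, then the morphological rule (-ous => adjective).
    for tok in tokens:
        low = tok.lower()
        if low in DETERMINERS:
            tags.append("DET")
        elif low in KNOWN_NOUNS:
            tags.append("NOUN")
        elif low.endswith("ous"):
            tags.append("ADJ")
        else:
            tags.append(None)  # still unknown after pass 1
    # Pass 2: contextual rule (unknown word between a determiner and a noun => adjective).
    for i in range(1, len(tokens) - 1):
        if tags[i] is None and tags[i - 1] == "DET" and tags[i + 1] == "NOUN":
            tags[i] = "ADJ"
    return [(tok, tag or "UNK") for tok, tag in zip(tokens, tags)]


print(rule_based_tag(["The", "brown", "fox"]))
print(rule_based_tag(["An", "adventurous", "journey"]))
```

In the first call “brown” is tagged by the contextual rule; in the second, “adventurous” is tagged by the morphological rule before context is even consulted.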

An old but useful paper on this approach was published by Eric Brill in 1992: A Simple Rule-Based Part of Speech Tagger. It is the basis of Brill’s Tagger. See section 2 (about one page, an easy read) for a detailed description of a rule-based POS tagger.

Statistics based

Statistics-based taggers achieve high accuracy without requiring manually crafted linguistic rules. Several statistical models can be used; the most notable for POS tagging are Hidden Markov Models (HMMs) and the maximum entropy approach. In an HMM, the word-tag probabilities are estimated from a manually annotated corpus (the training set). The model is stochastic: tagging is assumed to be a Markov process with unobservable (hidden) states and observable outputs. Here, the POS tags are the states and the words are the outputs. Hence, the POS tagger consists of:

  1. Ps(Ti): Probability of the sequence starting in tag Ti
  2. Pt(Tj|Ti): Probability of the sequence transitioning from tag Ti to tag Tj
  3. Pe(Wj|Ti): Probability of tag Ti emitting word Wj
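
As a rough sketch of how these three sets of probabilities could be estimated, the snippet below counts starts, transitions, and emissions in a tiny hand-annotated corpus and normalizes the counts. The corpus and the function name estimate_hmm are made up for illustration, and no smoothing is applied, so unseen events get no probability at all (a real tagger would need to handle that).

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated training corpus: one list of (word, tag) pairs per sentence.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("fox", "NOUN"), ("jumps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]


def estimate_hmm(tagged_sentences):
    """Estimate start, transition and emission probabilities by relative frequency."""
    start_counts = Counter()
    trans_counts = defaultdict(Counter)  # trans_counts[previous_tag][tag]
    emit_counts = defaultdict(Counter)   # emit_counts[tag][word]
    for sent in tagged_sentences:
        prev = None
        for word, tag in sent:
            if prev is None:
                start_counts[tag] += 1
            else:
                trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    start = normalize(start_counts)
    trans = {tag: normalize(c) for tag, c in trans_counts.items()}
    emit = {tag: normalize(c) for tag, c in emit_counts.items()}
    return start, trans, emit


start_p, trans_p, emit_p = estimate_hmm(corpus)
print(start_p)         # probability of a sentence starting in each tag
print(trans_p["DET"])  # probabilities of the tag that follows DET
print(emit_p["NOUN"])  # probabilities of the words emitted by NOUN
```

In this toy corpus two of the three sentences start with a determiner, and every determiner is followed by a noun, which is exactly what the estimated tables reflect.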

The tagger makes two simplifying assumptions:

  1. The probability of a word depends only on its tag, i.e. given its tag, it is independent of other words and other tags.
  2. The probability of a tag depends only on its previous tag, i.e. given the previous tag, it is independent of next tags and tags before the previous tag.
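
Taken together, these two assumptions let the joint probability of a word sequence and a tag sequence factor into the start, transition, and emission probabilities listed earlier. Writing P_s, P_t, and P_e for those three quantities, and assuming a sentence of n words W_1, ..., W_n with tags T_1, ..., T_n, the factorization can be written as:

```latex
P(W_1,\dots,W_n,\;T_1,\dots,T_n)
  \;=\; P_s(T_1)\,P_e(W_1 \mid T_1)\,
        \prod_{i=2}^{n} P_t(T_i \mid T_{i-1})\,P_e(W_i \mid T_i)
```

Each word contributes one emission term (assumption 1) and each tag after the first contributes one transition term (assumption 2).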

Given a sequence of words, the POS tagger is interested in finding the most likely sequence of tags that generates that sequence of words.
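
A standard way to find that most likely tag sequence in an HMM is the Viterbi algorithm, a dynamic-programming method (the text above does not name a specific algorithm, so treat this as one common choice). Below is a minimal sketch. All probabilities are invented toy values, and a practical implementation would work with log-probabilities to avoid numerical underflow on long sentences.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under the given HMM."""
    # best[i][t]: probability of the best tag path for words[:i+1] that ends in tag t.
    best = [{t: start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p].get(t, 0.0)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score * emit_p[t].get(words[i], 0.0)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy model with made-up numbers; note that "barks" can be emitted by NOUN or VERB.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "fox": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.7, "jumps": 0.25, "dog": 0.05},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# ['DET', 'NOUN', 'VERB']
```

Here the ambiguous word “barks” is resolved as a verb because the path DET, NOUN, VERB has a higher overall probability than any path that tags it as a noun.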

Supervised learning based

The supervised learning approach yields the most accurate POS taggers currently available. Such taggers are often based on a neural network, which has been reported to be faster and more accurate than rule-based or classical statistical models. This approach treats POS tagging as a “supervised learning problem”: manually annotated training data is given to a machine learning model, which learns to predict the missing tags by finding correlations in the training data.

For example, given predictors (features) such as “POS of word i-1” or “last three letters of the word at i+1”, can a neural network be trained to predict the “POS of word i”? In some sense, this can be seen as a generalization of the rule-based approach, where the supervised learning algorithm learns the importance of each rule.
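
The idea of “learning the importance of each rule” can be sketched with a much simpler model than a neural network. The snippet below deliberately swaps in a plain multiclass perceptron (a linear model) so that it stays short and dependency-free; the feature names, the toy training sentences, and the helper functions are all assumptions made for this illustration.

```python
def extract_features(words, i, prev_tag):
    """Rule-like features describing position i, given the tag of the previous word."""
    word = words[i].lower()
    next_suffix = words[i + 1][-3:].lower() if i + 1 < len(words) else "<END>"
    return [
        "word=" + word,
        "suffix3=" + word[-3:],
        "prev_tag=" + prev_tag,          # "POS of word i-1"
        "next_suffix3=" + next_suffix,   # "last three letters of the word at i+1"
    ]


def predict(weights, tags, features):
    """Pick the tag whose feature weights sum to the highest score."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    return max(tags, key=lambda t: scores[t])


def train(tagged_sentences, tags, epochs=5):
    """Perceptron training: on each mistake, reward the gold tag and penalize the guess."""
    weights = {}
    for _ in range(epochs):
        for sent in tagged_sentences:
            words = [w for w, _ in sent]
            prev_tag = "<START>"
            for i, (_, gold) in enumerate(sent):
                feats = extract_features(words, i, prev_tag)
                guess = predict(weights, tags, feats)
                if guess != gold:
                    for f in feats:
                        weights[(f, gold)] = weights.get((f, gold), 0.0) + 1.0
                        weights[(f, guess)] = weights.get((f, guess), 0.0) - 1.0
                prev_tag = gold  # use the gold previous tag while training
    return weights


def tag(weights, tags, words):
    """Tag left to right, feeding each predicted tag into the next position's features."""
    prev_tag, result = "<START>", []
    for i in range(len(words)):
        prev_tag = predict(weights, tags, extract_features(words, i, prev_tag))
        result.append(prev_tag)
    return result


TAGS = ["DET", "NOUN", "VERB"]
training_data = [  # hypothetical annotated sentences
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("fox", "NOUN"), ("jumps", "VERB")],
]
weights = train(training_data, TAGS)
print(tag(weights, TAGS, ["the", "cat", "jumps"]))
```

The learned weights play the role of rule importances: for instance, the feature prev_tag=DET ends up with a positive weight for NOUN, which is what lets the unseen word “cat” be tagged. With so little data, ties between tags are broken by the order of TAGS, so this should be read as a sketch of the mechanism rather than a usable tagger.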


Some applications of POS tagging include narrowing down the nouns to focus on the most prominent ones, performing qualifier-subject analysis, word sense disambiguation, grammar analysis, etc. The most important use case is extracting phrases from a sentence. In fact, POS tagging serves as an input to various more complex analyses such as chunking and parsing, discussed in part 2 and part 3 of this tutorial set.
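
As a small illustration of the first application, the sketch below keeps only the nouns from a tagged sentence and counts them. The tagged list is copied from the pos_tag() output shown earlier in this tutorial; the variable names are chosen just for this example.

```python
from collections import Counter

# Tagged sentence copied from the pos_tag() output earlier in this tutorial.
tagged_sent = [
    ('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'),
    ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'),
    ('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'),
    ('dog', 'NOUN'),
]

# Keep only the words tagged as nouns, lowercased so that counts are case-insensitive.
nouns = [word.lower() for word, tag in tagged_sent if tag == 'NOUN']
noun_counts = Counter(nouns)

print(nouns)  # ['fox', 'dog']
print(noun_counts.most_common(1))
```

On a longer text, noun_counts.most_common(k) would give a rough list of the k most prominent nouns.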
