CommonLounge Archive

Text Classification (Topic Categorization, Spam filtering, etc)

December 05, 2017

Text classification (or text categorization) is in high demand, and it has become more important as the volume of text generated online keeps growing. At the same time, the contextual differences between domains make it harder for traditional text classification methods to stay accurate and efficient.


Some example applications of text classification include:

  1. Assigning multiple topics to documents
  2. Grouping of documents into a fixed number of predefined classes
  3. Segregating the contextual details from a multi-domain corpus
  4. Spam filtering of emails
  5. Sentiment Analysis to determine the viewpoint/polarity of a writer with respect to some topic
  6. Language identification, i.e. automatically determining the language in which a text is written

Problem setup

All of these problems can be solved with supervised learning. The objective is: given a training set of pre-classified text, build a classifier model that predicts the class of a new piece of text. During training, the machine learning algorithm is fed labeled data and learns to map each text to one of the pre-defined categories. During the testing phase, the trained model is fed previously unseen data and classifies it into those categories based on what it learned.
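
The separation between labeled training data and unseen test data can be sketched in plain Python. This is only a minimal illustration: the toy emails, their labels, and the `train_test_split` helper defined here are hypothetical (scikit-learn ships a function of the same name in `sklearn.model_selection` that would normally be used instead).

```python
import random

# Hypothetical toy dataset of (text, label) pairs
labeled = [
    ("win a free prize now", "spam"),
    ("meeting moved to friday", "ham"),
    ("cheap loans approved instantly", "spam"),
    ("lunch tomorrow?", "ham"),
    ("claim your reward today", "spam"),
    ("project report attached", "ham"),
    ("limited offer, act fast", "spam"),
    ("see you at the gym", "ham"),
]


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(labeled)
print(len(train), "training examples,", len(test), "test examples")
```

The classifier only ever sees `train` while it is being built; `test` is kept aside so that the evaluation reflects performance on data the model has not observed.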

Workflow for text classification

Based on the number of classes, we could have a binary classifier or a multi-class classifier. In both cases, the steps to build the classifier are more or less the same:

  1. Text preprocessing, including text cleanup and text normalization
  2. Vector Representation / Feature Extraction - Bag of words
  3. Building a model for classification
  4. Evaluating the classifier with precision and recall
  5. Applying the model to a query to get the predicted label

In the rest of this tutorial, we will walk through each step in detail and understand its importance in building an efficient text classifier model.

Text preprocessing, including text cleanup and text normalization

Preprocessing is the process of cleaning and preparing the text for classification. Raw text usually contains noise and words that carry little meaning. These have to be identified and removed to build stronger, faster and more accurate classifiers. If such insignificant words are not removed, they add extra dimensions during feature extraction and increase the complexity of the model. Text cleanup and normalization are task specific, i.e. they depend on the problem and on the quality of the training corpus. The following are possible steps for text preprocessing:

  1. White space and markup removal (e.g. HTML, markdown)
  2. Handling punctuation like commas, apostrophes, quotes, question marks, etc.
  3. Handling the case sensitivity of text
  4. Handling text encoding (Unicode, ISO etc)
  5. Stop words removal
  6. Stemming and Lemmatization
  7. Identifying and tagging non-standard words such as numbers, dates, abbreviations, etc.
  8. Applying normalization techniques such as Tokenizing, Parsing and Chunking
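
Steps 1 and 7 of the list above can be sketched with the Python standard library alone. This is a rough illustration under simplifying assumptions: the helper names `clean_markup` and `tag_numbers` and the `<NUM>` placeholder are hypothetical, and a regular expression is not a robust HTML parser.

```python
import html
import re


def clean_markup(raw):
    """Strip HTML-like tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)   # replace tags with a space
    unescaped = html.unescape(no_tags)       # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", unescaped).strip()


def tag_numbers(text):
    """Replace numbers (including 1,250.50-style ones) with a placeholder."""
    return re.sub(r"\d+(?:[.,]\d+)*", "<NUM>", text)


raw = "<p>Hello &amp; welcome</p>\n\n  to <b>NLP</b>"
print(clean_markup(raw))                        # Hello & welcome to NLP
print(tag_numbers("Paid 1,250.50 on 3 items"))  # Paid <NUM> on <NUM> items
```

Whether numbers should be tagged, kept, or dropped depends on the task: for spam filtering a price may be a useful signal, while for topic categorization it is often noise.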

Code illustration: the following code snippet demonstrates text preprocessing for topic categorization using Python and the NLTK library.

# NLTK's tokenizer and stop word data must be downloaded once via nltk.download()
# Example input (any raw text string can be used here)
text = "The quick brown foxes are jumping over the lazy dogs!"
# Split text into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# Convert words to lower case
tokens = [w.lower() for w in tokens]
# Remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# Remove tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# Filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
# Stemming of the cleaned words (not the raw tokens)
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]

Vector Representation / Feature Extraction: Bag of words

The next step is to transform the cleaned and normalized data into vector form. Machine learning algorithms cannot work with raw text directly, so the text must be converted into meaningful numbers. One of the simplest ways of representing text data (feature extraction) is the bag-of-words model. It is simple to understand, easy to implement, and has seen great success in document classification problems.

The bag-of-words model describes the occurrence of words within a document. It is built using the vocabulary of the words and a measure of the presence of these words in the documents. The vocabulary is the list of all the unique words in the cleaned-up corpus, and each word is given a score in each document. The simplest scoring method marks the presence of a word as a boolean value: 0 for absent, 1 for present. More informative scores are the term frequency (how often the word occurs in the document) or the TF-IDF score (term frequency weighted down for words that occur in many documents).
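
The three scoring schemes can be compared in a few lines of plain Python. This is a sketch under stated assumptions: the `score` helper is hypothetical, and the TF-IDF variant uses the textbook form tf × log(N / df). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and length normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = ["this is test doc", "this is another test doc"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def score(tokens, scheme):
    """Score every vocabulary word for one document."""
    counts = Counter(tokens)
    if scheme == "binary":
        return [1 if counts[w] else 0 for w in vocabulary]
    if scheme == "tf":
        return [counts[w] for w in vocabulary]
    if scheme == "tfidf":
        n_docs = len(tokenized)
        return [counts[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary]
    raise ValueError("unknown scheme: " + scheme)


print(vocabulary)
for scheme in ("binary", "tf", "tfidf"):
    print(scheme, score(tokenized[1], scheme))
```

In this tiny corpus only "another" gets a non-zero TF-IDF score, because every other word occurs in both documents and therefore does not help tell them apart.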

Putting the vectors of all the documents together, we get a matrix representation of the entire corpus. Element (i, j) of the matrix describes the presence of the j-th item in the vocabulary in the i-th document. Note that most words in the vocabulary would be missing in each individual document, hence resulting in a lot of zeros in the matrix. Therefore, the resulting model would be a sparse matrix where each word in the vocabulary would be represented by its score in each document.

This representation is called a “bag” of words because any information about the order of words in the document is lost. The model is only concerned with which words occur in the document, not with their position or their co-occurrence with other words.
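
To make the matrix view and the loss of word order concrete, here is a small from-scratch sketch using only the standard library. The `to_vector` helper is a hypothetical name, and whitespace splitting stands in for the real preprocessing step.

```python
from collections import Counter

docs = ["this is test doc", "this is another test doc"]
tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the meaning of each matrix column
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Row i is document i, column j is the j-th vocabulary word
matrix = [to_vector(tokens) for tokens in tokenized]
print(vocabulary)  # ['another', 'doc', 'is', 'test', 'this']
print(matrix)      # [[0, 1, 1, 1, 1], [1, 1, 1, 1, 1]]

# Reordering the words of the first document gives the same vector
shuffled = to_vector("doc test is this".split())
print(shuffled == matrix[0])  # True
```

With a realistic vocabulary of tens of thousands of words, almost every entry of each row would be 0, which is why libraries store this matrix in a sparse format.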

Code illustration using Python and scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["this is test doc", "this is another test doc"]
# create the transform
vector = CountVectorizer()
# tokenize and build vocab
vector.fit(text)
# Print the summary (mapping from each word to its column index)
print(vector.vocabulary_)
# Transform documents into a sparse document-term matrix
X_train = vector.transform(text)
# Print summary of transformed vector
print(X_train.shape)
print(X_train.toarray())

Building a model for classification

Many classification algorithms can be trained on the Bag of Words (BOW) matrix built from the training set. Some of the algorithms that could be used for supervised classification are:

  1. Decision Tree
  2. Random Forest
  3. K-Nearest Neighbors
  4. Support Vector Machine
  5. Naive Bayes

Since Naive Bayes is one of the most basic text classification techniques used in various applications, let’s discuss it in detail. The Naive Bayes classifier is a simple probabilistic classifier that uses Bayes’ theorem to predict the class of unseen data. Some of its characteristics are:

  • Extremely fast relative to other classification algorithms
  • Works on Bayes’ theorem of probability
  • Assumes independence among predictors (features), so that each feature contributes independently to the probability

The mathematical equation for Naive Bayes is:

$$ P(C|X) = \dfrac{P(X|C) \times P(C)}{P(X)} $$

Since the predictors are assumed to be independent given the class, and P(X) is the same for every class, the equation could be further reduced to

$$ P(C|X) \propto P(X_1|C) \times P(X_2|C) \times \dots \times P(X_n|C) \times P(C) $$

  • P(C|X) is the posterior probability of the class (C, target) given the predictors (X, attributes).
  • P(C) - prior probability of the class = number of training instances of the class / total number of training instances
  • P(X_i|C) - probability of a predictor (word) given the class = (frequency of the word in the class + 1) / (total number of words in the class + number of unique words in the vocabulary). Adding 1 (Laplace smoothing) keeps a word never seen in a class from forcing the whole product to zero.
  • P(X) - probability of the predictors. It does not depend on the class, so it can be ignored when comparing classes.
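
The formulas above can be worked through by hand in plain Python. The sketch below is an illustration, not scikit-learn's implementation: the helper names are hypothetical, tokenization is a simple lowercase split, and the two toy reviews are the same ones used in the scikit-learn example in this section. Sums of logarithms replace the product of probabilities to avoid numerical underflow on long documents.

```python
import math
from collections import Counter, defaultdict

train_docs = ["Good product of high quality", "Costlier than the market"]
train_labels = ["P", "N"]


def tokenize(text):
    return text.lower().split()


# Count documents per class and words per class
class_doc_counts = Counter(train_labels)
word_counts = defaultdict(Counter)
for doc, label in zip(train_docs, train_labels):
    word_counts[label].update(tokenize(doc))
vocabulary = {w for counts in word_counts.values() for w in counts}


def log_prior(label):
    # P(C) = instances of the class / total instances
    return math.log(class_doc_counts[label] / len(train_labels))


def log_likelihood(word, label):
    # P(X_i|C) with Laplace (+1) smoothing
    numerator = word_counts[label][word] + 1
    denominator = sum(word_counts[label].values()) + len(vocabulary)
    return math.log(numerator / denominator)


def log_score(text, label):
    # Words outside the training vocabulary are ignored
    words = [w for w in tokenize(text) if w in vocabulary]
    return log_prior(label) + sum(log_likelihood(w, label) for w in words)


def predict(text):
    return max(class_doc_counts, key=lambda label: log_score(text, label))


print(predict("High quality product"))  # P
```

For the query "High quality product", each word has smoothed probability 2/14 under class P (5 words in the class, 9 words in the vocabulary) and 1/13 under class N, so P receives the higher score.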

The following Python example could be used to build a multinomial Naive Bayes model:

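from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Instantiate a Multinomial Naive Bayes model
nb = MultinomialNB()
# train data set (P = positive review, N = negative review)
train_docs = ['Good product of high quality', 'Costlier than the market']
y_train = ['P', 'N']
# learn the vocabulary from the training documents and build the BOW matrix
vector = CountVectorizer()
X_train = vector.fit_transform(train_docs)
# fit the classifier on the train data set
nb.fit(X_train, y_train)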

Applying the model to a query to get the predicted label

Once the model is built, it is used to make predictions. This involves calculating the probability of a data instance belonging to each class and then selecting the class with the highest probability as the final prediction.

For the multinomial model, the probability of each feature (word) given a class is estimated from the smoothed word counts observed in the training data. (A Gaussian Naive Bayes model, which is used for continuous features, would instead estimate it from the mean and standard deviation of each feature in the training data.)

Once the probability of each feature in the query instance is calculated, the final class prediction is made by multiplying the probabilities of all the features of the query together with the prior probability of the class.
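
The last step, turning per-class scores into a prediction and into probabilities that sum to 1, can be sketched as follows. The scores in `joint` are illustrative values for a hypothetical two-class query (products of smoothed word probabilities and a prior of 0.5), and `posteriors` is a hypothetical helper name. Working with logarithms and subtracting the largest score before exponentiating is a common way to avoid underflow.

```python
import math

# Illustrative unnormalized scores P(X|C) * P(C), one per class
joint = {"P": 0.5 * (2 / 14) ** 3, "N": 0.5 * (1 / 13) ** 3}
log_joint = {label: math.log(p) for label, p in joint.items()}


def posteriors(log_scores):
    """Normalize log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}


probs = posteriors(log_joint)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

Dividing by the total plays the role of P(X) in Bayes' theorem: it rescales the scores without changing which class has the highest one.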

All the calculations are abstracted in the predict function of scikit-learn’s MultinomialNB classifier:

# test on new data
X = vector.transform(['High quality product'])
predict_class = nb.predict(X)
print(predict_class)
# Return probability estimates for the test vector X
prob_class = nb.predict_proba(X)
print(prob_class)
Evaluating the classifier with precision and recall

A confusion matrix is a technique for summarizing the performance of a classification algorithm. Element (i, j) of the confusion matrix is the number of times the actual class was i and the predicted class was j. If there are k classes, then the confusion matrix is of size k x k. Each row represents an actual class while each column represents a predicted class. A confusion matrix contains information about the accuracy of the model as well as what types of errors it is making.

An example confusion matrix with 2 classes:

           | Predicted No   | Predicted Yes  |
Actual No  | True Negative  | False Positive |
Actual Yes | False Negative | True Positive  |
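
Counting the cells of this table takes only a few lines of plain Python. The sketch below uses a hypothetical helper, `build_confusion_matrix`, and the same label lists as the scikit-learn example in this section (1 = yes, 0 = no).

```python
expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]


def build_confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


matrix = build_confusion_matrix(expected, predicted, labels=[0, 1])
print(matrix)  # [[4, 2], [1, 3]]
```

Read against the table: 4 true negatives, 2 false positives, 1 false negative and 3 true positives.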

The scikit-learn metrics module provides a function to build the confusion matrix:

from sklearn.metrics import confusion_matrix
expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
# rows are actual classes, columns are predicted classes
results = confusion_matrix(expected, predicted)
print(results)

From the confusion matrix, the following evaluation metrics can also be calculated. Which of them are relevant depends on the classification problem:

  • False Positive Rate: When it’s actually no, how often does the classifier predict yes? FPR = FP / (TN + FP)
  • Specificity: When it’s actually no, how often does the classifier predict no? Specificity = TN / (TN + FP)
  • Recall or Sensitivity: When it’s actually yes, how often does the classifier predict yes? Recall = TP / (TP + FN)
  • Precision: When the classifier predicts yes, how often is it correct? Precision = TP / (TP + FP)
  • F1 Score: Harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0. F1 Score = 2 x ((P x R) / (P + R))
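
As a worked example, the metrics above can be computed for the example label lists in this section, whose confusion matrix has TN = 4, FP = 2, FN = 1 and TP = 3. The variable names below are illustrative.

```python
# Cell counts of the example confusion matrix
tn, fp, fn, tp = 4, 2, 1, 3

false_positive_rate = fp / (tn + fp)
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"FPR={false_positive_rate:.3f} specificity={specificity:.3f}")
print(f"recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```

Note that the false positive rate and the specificity always add up to 1, since both are computed over the same set of actual "no" instances.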

Scikit-learn’s classification_report function can be used to find the precision, recall and F1 Score:

from sklearn.metrics import classification_report
report = classification_report(expected, predicted)
print(report)


Summary

  • Text usually contains noise and less meaningful data, which has to be identified and cleaned up to improve accuracy and reduce the complexity of the model.
  • The bag-of-words model describes the occurrence of words within a document. Over a whole corpus it yields a sparse matrix in which each word in the vocabulary is represented by its score in each document.
  • Naive Bayes is one of the most basic text classification techniques: a simple probabilistic classifier that uses Bayes’ theorem to predict the class of unseen data.
  • The performance of the classifier is evaluated using a confusion matrix and the metrics derived from it, such as precision and recall.
