Text classification (or categorization) has long been in high demand, and it has only become more important as the volume of text generated online keeps growing. Moreover, the different contextual information found in different domains makes it challenging to improve the accuracy and performance of traditional text classification approaches.
Some example applications of text classification include:
- Assigning multiple topics to documents
- Grouping of documents into a fixed number of predefined classes
- Segregating the contextual details from a multi-domain corpus
- Spam filtering of emails
- Sentiment Analysis to determine the viewpoint/polarity of a writer with respect to some topic
- Language identification by automatically determining the language of the text
All of these problems can be solved with supervised learning for text classification. The objective is: given a training set of pre-classified text, how can we build a classifier model to predict the class of a given text? We feed labeled data to the machine learning algorithm; the algorithm is trained on the labeled dataset and produces the desired output (the pre-defined categories). During the testing phase, the algorithm is fed unseen data and classifies it into categories based on its training.
Based on the number of classes, the model can be a binary classifier or a multi-class classifier. In both cases, the steps to build the classifier are more or less the same:
- Text preprocessing, including text cleanup and text normalization
- Vector Representation / Feature Extraction - Bag of words
- Building a model for classification
- Evaluating the classifier with precision and recall
- Applying the model to a query to get the predicted label
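These steps can be sketched end-to-end with scikit-learn's Pipeline before we look at each one in detail. This is a minimal sketch only; the toy corpus, labels and query below are made up for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus with two classes, "sports" and "tech" (made up for illustration)
texts = ["the team won the match", "a great goal in the game",
         "new laptop with fast cpu", "software update for the phone"]
labels = ["sports", "sports", "tech", "tech"]

# Chain feature extraction and classification into a single model
model = Pipeline([
    ("vectorizer", CountVectorizer()),   # bag-of-words features
    ("classifier", MultinomialNB()),     # Naive Bayes classifier
])
model.fit(texts, labels)

# Predict the label of an unseen query
print(model.predict(["the game was great"]))
```

Chaining the steps in a Pipeline ensures that the exact same vectorizer vocabulary is applied at both training and prediction time.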
In this article, we will walk through each step in detail and understand its importance in building an efficient text classifier model.
Preprocessing is the process of cleaning and preparing the text for classification. Text usually contains noise and data of little meaning, which has to be identified and cleaned up to build stronger, faster and more accurate classifiers. If such insignificant words are not removed, they increase complexity by adding more dimensions during feature extraction. Text cleanup and normalization is task specific, i.e. it depends on the problem and the quality of the training corpus. The following are possible steps for text preprocessing:
- White space and markup removal (e.g. HTML, markdown)
- Handling punctuation like commas, apostrophes, quotes, question marks, etc.
- Handling the case sensitivity of text
- Handling text encoding (Unicode, ISO etc)
- Stop words removal
- Stemming and Lemmatization
- Identifying and tagging non-standard words such as numbers, dates and abbreviations
- Applying normalization techniques such as Tokenizing, Parsing and Chunking
Code illustration: the following code snippet demonstrates text preprocessing for topic categorization using Python and the NLTK library:
```python
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Split text into words
tokens = word_tokenize(text)
# Convert words to lower case
tokens = [w.lower() for w in tokens]
# Remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# Remove tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# Filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
# Stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
```
The next step is to transform the cleaned and normalized data into some vector form. As machine learning algorithms cannot work with raw text directly, the text must be converted into meaningful numbers. One of the simpler ways of representing text data (feature extraction) when modeling text is the bag-of-words model. The bag-of-words model is simple to understand, easy to implement, and has seen great success in document classification problems.
The bag-of-words model describes the occurrence of words within a document. It is built using the vocabulary of the words and a measure of the presence of these words in the documents. The vocabulary is the list of all the unique words in the cleaned up corpus and each word is given a score in each document. The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present. Better techniques to score could be term-frequency of the words or could be the TF-IDF score.
Putting the vectors of all the documents together, we get a matrix representation of the entire corpus. Element (i, j) of the matrix describes the presence of the j-th item in the vocabulary in the i-th document. Note that most words in the vocabulary would be missing in each individual document, hence resulting in a lot of zeros in the matrix. Therefore, the resulting model would be a sparse matrix where each word in the vocabulary would be represented by its score in each document.
This representation is called a "bag" of words because any information about the order of words in the document is lost. The model is only concerned with which words occur in the document and does not care about their position or co-occurrence with other words.
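This order-invariance is easy to verify: two sentences containing the same words in a different order produce identical bag-of-words vectors. A small sketch using scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning -- the bag-of-words model cannot tell them apart
docs = ["the dog bit the man", "the man bit the dog"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Both rows of the matrix are identical
print((X[0].toarray() == X[1].toarray()).all())  # True
```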
Code illustration using Python and scikit-learn:
```python
from sklearn.feature_extraction.text import CountVectorizer

# List of text documents
text = ["this is test doc", "this is another test doc"]
# Create the transform
vector = CountVectorizer()
# Tokenize and build the vocabulary
vector.fit(text)
# Print the summary
print(vector.vocabulary_)
# Transform the documents
X_train = vector.transform(text)
# Print a summary of the transformed vectors
print(X_train.shape)
print(type(X_train))
```
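As noted earlier, TF-IDF scoring often works better than raw counts because it downweights words that appear in many documents. scikit-learn's TfidfVectorizer has the same interface; a sketch on the same toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is test doc", "this is another test doc"]

# TF-IDF = term frequency weighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(tfidf.vocabulary_)
# "another" occurs in only one document, so it carries more weight there,
# while words shared by all documents ("this", "is", ...) are downweighted
print(X_tfidf.toarray())
```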
There are many classifier algorithms that could be used to train a model on a training set drawn from the BOW matrix. Some algorithms suitable for supervised classification are:
- Decision Tree
- Random Forest
- K-Nearest Neighbors
- Support Vector Machine
- Naive Bayes
Since Naive Bayes is one of the most basic text classification techniques, used in a variety of applications, let's discuss it in detail. The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem that predicts the class of unseen data. Some of its characteristics are:
- Extremely fast relative to other classification algorithms
- Based on Bayes' theorem of probability
- Assumes independence among predictors (features) so that all the properties independently contribute to the probability
The mathematical equation for Naive Bayes (Bayes' theorem) is:

P(C|X) = P(X|C) × P(C) / P(X)
Since the predictors are assumed to be independent, the equation can be further reduced to:

P(C|X) ∝ P(C) × P(x1|C) × P(x2|C) × ... × P(xn|C)

where:
- P(C|X) is the posterior probability of the class C (target) given the predictor X (attributes)
- P(C) is the prior probability of the class = frequency of class instances / total instances
- P(X|C) is the likelihood, the probability of the predictor given the class; with Laplace smoothing, for a word x: P(x|C) = (frequency of the word in the class + 1) / (total number of words in the class + number of unique words in the vocabulary)
- P(X) is the prior probability of the predictor; since it is the same for every class, it can be ignored when comparing classes
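The quantities above can be combined in a few lines of plain Python. A minimal sketch with a made-up two-class toy corpus, applying the Laplace-smoothed P(X|C) from the formula above:

```python
from collections import Counter

# Tiny training corpus (made up for illustration): word lists per class
train = {
    "spam": ["win", "money", "now", "win"],
    "ham":  ["meeting", "at", "noon"],
}
vocab = set(w for words in train.values() for w in words)

def posterior_score(words, cls):
    # P(C): class frequency / total instances (here: one document per class)
    p = 1.0 / len(train)
    counts = Counter(train[cls])
    total = len(train[cls])
    for w in words:
        # P(x|C) with Laplace smoothing:
        # (frequency of word in class + 1) / (total words in class + vocabulary size)
        p *= (counts[w] + 1) / (total + len(vocab))
    return p

query = ["win", "money"]
scores = {c: posterior_score(query, c) for c in train}
print(max(scores, key=scores.get))  # "spam"
```

P(X) is omitted, as it is constant across classes and does not change which class scores highest.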
The following Python example can be used to build a multinomial Naive Bayes model:
```python
from sklearn.naive_bayes import MultinomialNB

# Instantiate a multinomial Naive Bayes model
nb = MultinomialNB()
# Train the model (X_train: bag-of-words features, y_train: labels)
nb.fit(X_train, y_train)
```
Once the model is built, it is used to make predictions that involve calculating the probability of a data instance belonging to each class, and then based on the highest probability, selecting the class as the final prediction.
For continuous features, the probability of a given predictor can be estimated using the mean and standard deviation of each feature computed from the training data (this is the Gaussian Naive Bayes variant); for word counts, the multinomial model uses the smoothed word frequencies described above. Either way, the result is the conditional probability of a class given a feature.
Once the probability of each feature in the query instance is calculated, the final class prediction is then done by combining (multiplying) the probabilities of all of the features of the query.
All the above calculations are abstracted away in the predict function of scikit-learn's MultinomialNB classifier:
```python
# Perform classification on an array of test vectors X
predict_class = nb.predict(X)
# Return probability estimates for the test vectors X
prob_class = nb.predict_proba(X)
```
A confusion matrix is a technique for summarizing the performance of a classification algorithm. Element (i, j) of the confusion matrix is the number of times the actual answer is i and the prediction was j. If there are k classes, then the confusion matrix is of size k x k. Each row represents an actual class while each column represents a predicted class. A confusion matrix contains information about the accuracy of the model as well as what types of errors it is making.
An example confusion matrix with 2 classes:
|            | Predicted No   | Predicted Yes  |
|------------|----------------|----------------|
| Actual No  | True Negative  | False Positive |
| Actual Yes | False Negative | True Positive  |
The scikit-learn metrics module provides a function to build the confusion matrix
```python
from sklearn.metrics import confusion_matrix

expected  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
results = confusion_matrix(expected, predicted)
print(results)
```
From the confusion matrix, the following evaluation parameters can also be calculated; which of them matter most depends on the classification problem:
- False Positive Rate: when the actual class is no, how often does the classifier predict yes? FPR = FP / (TN + FP)
- Specificity: when the actual class is no, how often does the classifier predict no? Specificity = TN / (TN + FP)
- Recall or Sensitivity: of all actual positives, how many were predicted positive? Recall = TP / (TP + FN)
- Precision: of all predicted positives, how many are actually positive? Precision = TP / (TP + FP)
- F1 Score: harmonic mean of precision and recall, reaching its best value at 1 and worst at 0. F1 Score = 2 × ((P × R) / (P + R))
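As a sketch, these quantities can be computed directly from the four cells of the confusion matrix, using the same toy label vectors as the earlier confusion-matrix snippet:

```python
from sklearn.metrics import confusion_matrix

expected  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# Unpack the 2x2 confusion matrix into its four cells
tn, fp, fn, tp = confusion_matrix(expected, predicted).ravel()

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
fpr         = fp / (fp + tn)
f1 = 2 * (precision * recall) / (precision + recall)
print(precision, recall, specificity, fpr, f1)
```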
Scikit-learn's classification_report module can be used to find the precision, recall and F1 Score.
```python
from sklearn.metrics import classification_report

report = classification_report(expected, predicted)
print(report)
```