Topic modeling is an efficient way to organize, understand and summarize large volumes of text. With huge amounts of text data being generated every day, it becomes challenging to find the most relevant information. Topic modeling helps with efficient text browsing by:
- Discovering hidden topical patterns present across the corpus
- Annotating each document according to these topics
- Using these annotations to organize, search and summarize texts
Topic Modeling is a method for finding a group of words from a collection of documents that best represents the information in the collection. It can also be thought of as a form of text mining - a way to obtain recurring patterns of words in text. While there are many different algorithms for topic modeling, the most common is Latent Dirichlet Allocation, or LDA.
LDA is a generative probabilistic model for collections of discrete data, which makes it well suited to text. It is an unsupervised approach for discovering groups of words (called “topics”) in large collections of texts.
The LDA model assumes that each document is a mixture of topics and that each word is attributed to one of those topics with a certain probability. It also assumes that documents cover only a small set of topics and that topics use only a small set of words frequently. In other words, no single document is expected to talk about everything (politics, science, technology, business, entertainment, etc.) the way an RSS feed of news would.
To make this concrete, let's take an example of documents with only two topics: Automobile and Finance. Each topic has a list of words associated with it. For example, documents related to automobiles would contain terms such as car, steering, tires, mileage, etc., while financial documents would contain terms such as cash, budget, cost, debt, etc. However, some documents may talk about “car loans” and so contain a mix of automobile and finance terms. The LDA model discovers the different topics that the documents represent and how much of each topic is present in each document. Each topic assigns a probability to each word, so a word can be used to predict the most appropriate topic.
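The two-topic example can be sketched in a few lines of plain Python. All the numbers below are made up for illustration: `topic_word` is a hypothetical P( word | topic ) table and `doc_topics` is a hypothetical mixture for a “car loans” document. The sketch shows how a document's word probabilities follow from its topic mixture, and how a document could be generated from it.

```python
import random

# Hypothetical word distributions P(word | topic) for the two example topics.
topic_word = {
    "automobile": {"car": 0.4, "steering": 0.2, "tires": 0.2, "mileage": 0.2},
    "finance": {"cash": 0.3, "budget": 0.2, "cost": 0.2, "debt": 0.3},
}

# Hypothetical topic mixture P(topic | document) for a "car loans" document.
doc_topics = {"automobile": 0.5, "finance": 0.5}


def word_probability(word, doc_topics, topic_word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topics.items()
    )


def generate_document(doc_topics, topic_word, length, rng):
    """Draw a topic for each position, then draw a word from that topic."""
    topics = list(doc_topics)
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=[doc_topics[t] for t in topics])[0]
        vocab = list(topic_word[topic])
        word = rng.choices(vocab, weights=[topic_word[topic][w] for w in vocab])[0]
        words.append(word)
    return words


rng = random.Random(0)
print(generate_document(doc_topics, topic_word, 8, rng))
print(word_probability("car", doc_topics, topic_word))  # 0.5 * 0.4 + 0.5 * 0.0
```

Training an LDA model amounts to running this process in reverse: given only the documents, it estimates tables like `topic_word` and `doc_topics`.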
LDA finds probability distributions over words, i.e. clusters of words that occur together with a certain probability. Each such cluster is treated as a “topic”. You can then feed the model a new document, and it infers the topics for it based on what it has learned.
Let's assume we have a text corpus with N documents and K topics. Each document is assumed to cover a small number of topics out of these K topics. The process for training the LDA model (using collapsed Gibbs sampling) is as follows:
- Go through each document and randomly assign each word in the document to one of K topics
- This random assignment gives random topic representations of all documents and word distributions of all the topics
- For training, we repeat the following set of steps until convergence:
  - For each document d, compute P( topic t | document d ) := proportion of words in document d that are assigned to topic t
  - For each topic t, compute P( word w | topic t ) := proportion of assignments to topic t that come from word w (across all documents)
  - For each word w in document d, reassign a topic t’, where we choose topic t’ with probability proportional to P( topic t’ | document d ) * P( word w | topic t’ )
This generative model predicts the probability that topic t’ generated word w. Due to the random initialization, the initial assignment will have small biases toward related words being assigned to one topic rather than another. As we repeatedly re-estimate the probabilities and reassign words to topics, these biases are reinforced until the process converges, with each topic clearly distinct from the others and related words that appear together (in the same documents) belonging to the same topic. Hence, the converged solution produces relevant topic assignments. Once the model has converged, we also have the final values of P( word w | topic t ) and P( topic t | document d ).
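The training loop described above can be sketched in plain Python. This is a toy version under a few assumptions that are not in the description: the function name `train_lda` is made up, the proportions are smoothed with small `alpha` and `beta` constants so that no probability is ever exactly zero, the word being resampled is removed from the counts first, and a fixed number of iterations stands in for a real convergence check.

```python
import random


def train_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: topics per document, words per topic, total words per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Randomly assign every word occurrence to one of the K topics.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    # Repeatedly resample each assignment given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, old = word_id[w], assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][wid] -= 1
                topic_total[old] -= 1
                # Weight of topic t is proportional to the smoothed
                # P(topic t | document d) * P(word w | topic t).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][wid] += 1
                topic_total[new] += 1

    # Final smoothed estimates of P(topic | document) and P(word | topic).
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    phi = [
        {w: (topic_word[t][word_id[w]] + beta) / (topic_total[t] + V * beta)
         for w in vocab}
        for t in range(num_topics)
    ]
    return theta, phi


docs = [
    ["car", "steering", "tires", "mileage", "car"],
    ["cash", "budget", "cost", "debt", "cash"],
    ["car", "loan", "debt", "cost", "tires"],
]
theta, phi = train_lda(docs, num_topics=2)
for d, mixture in enumerate(theta):
    print(d, [round(p, 2) for p in mixture])
```

Here `theta[d]` plays the role of P( topic t | document d ) and `phi[t]` the role of P( word w | topic t ). On a corpus this small the result depends heavily on the random seed.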
LDA is built into some Python libraries such as gensim.
Once the documents are cleaned by removing stop words, punctuation marks, HTML tags, etc., construct the word-id mapping using the gensim library:
from gensim import corpora, models
wordIdMap = corpora.Dictionary(aCleanDocumentList)
To train the model, all documents need to be converted into bags of words:
corpus = [wordIdMap.doc2bow(doc) for doc in aCleanDocumentList]
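To see what the cleaning and bag-of-words steps produce, here is a rough standard-library equivalent. It is only an illustration: the tiny stop-word list and the helper names (`clean_document`, `build_word_id_map`, `to_bag_of_words`) are made up, and gensim's dictionary does not necessarily assign ids in alphabetical order as this sketch does.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "on"}


def clean_document(text):
    """Strip HTML tags and punctuation, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]


def build_word_id_map(documents):
    """Assign an integer id to every distinct token."""
    vocab = sorted({token for doc in documents for token in doc})
    return {token: i for i, token in enumerate(vocab)}


def to_bag_of_words(doc, word_id_map):
    """Return sorted (word id, count) pairs, the shape doc2bow produces."""
    counts = Counter(word_id_map[t] for t in doc if t in word_id_map)
    return sorted(counts.items())


raw = ["<p>The car loan is a debt.</p>", "Budget for the car, and the car tires!"]
clean_docs = [clean_document(text) for text in raw]
word_id_map = build_word_id_map(clean_docs)
bows = [to_bag_of_words(doc, word_id_map) for doc in clean_docs]
print(clean_docs)  # [['car', 'loan', 'debt'], ['budget', 'car', 'car', 'tires']]
print(bows)        # [[(1, 1), (2, 1), (3, 1)], [(0, 1), (1, 2), (4, 1)]]
```

Each bag of words keeps only which words occur and how often, so word order is discarded before the model ever sees the document.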
Use gensim’s LdaModel to build the model:
ldamodel = models.LdaModel(corpus=corpus, id2word=wordIdMap, num_topics=num_topics)
num_topics is the number of topics to be modeled on the documents. Usually it is somewhere between 10 and 100, depending on the corpus and the nature of the problem. To convert any document in bag-of-words form to its topic representation, a sparse list of (topic id, probability) pairs:
x = ldamodel[corpus[doc_index]]
Following are a few hyperparameters which have to be tuned to achieve the best results:
- Number of Topics – Number of topics to be obtained from the corpus.
- Alpha and Beta hyperparameters – Alpha represents document-topic density and Beta represents topic-word density (gensim names the latter eta). The higher the value of alpha, the more topics each document is composed of; the higher the value of beta, the larger the set of words each topic is composed of.
- Number of Iterations – How long do we wait for the algorithm to converge (or how do we detect that the algorithm has converged)?
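The effect of alpha can be seen by sampling document-topic mixtures directly from a symmetric Dirichlet distribution, which is the prior LDA places on them. The sketch below uses only the standard library; the helper names and the choice of 5 topics are illustrative assumptions. It reports the average share taken by the largest topic in a sampled mixture.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one point from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]


def average_dominant_share(alpha, k=5, samples=2000, seed=0):
    """Average weight of the largest topic in sampled document mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(samples)) / samples


for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(average_dominant_share(alpha), 3))
```

With a small alpha a single topic tends to dominate each document, while a large alpha spreads the weight across many topics. Beta acts the same way on the word distribution of each topic.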
Although LDA is widely used for topic modeling, there are certain limitations in the algorithm which are listed below:
- The number of hidden topics in the corpus (K) has to be defined before starting the training of the model and this remains constant throughout.
- Algorithm convergence requires a lot of iterations, as the perplexity (a measure of how well the model generalizes to unseen text) decreases very slowly. Moreover, perplexity is not always a good indicator of the overall quality of the model, since it does not always agree with how interpretable the topics are to humans.
- The algorithm takes a long time to train, especially when the training data set is large.
- Since LDA is a probabilistic model, it needs many observations to make statistical inferences. Hence, LDA tends to perform poorly on very short documents such as tweets or single sentences.
- LDA does not consider correlation of topics or words which could give more meaningful information to improve quality of topics.
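For reference, perplexity is the exponential of the negative average log-likelihood per word, so lower is better. A minimal sketch follows, assuming the model's per-word probability is available through a hypothetical callable `word_probability(d, word)`. The `mixture_probability` helper builds such a callable from made-up `theta` and `phi` tables.

```python
import math


def perplexity(documents, word_probability):
    """exp of the negative average log-likelihood per token."""
    log_likelihood = 0.0
    token_count = 0
    for d, doc in enumerate(documents):
        for word in doc:
            log_likelihood += math.log(word_probability(d, word))
            token_count += 1
    return math.exp(-log_likelihood / token_count)


def mixture_probability(theta, phi):
    """P(word | document d) = sum over topics of theta[d][t] * phi[t][word]."""
    def prob(d, word):
        return sum(theta[d][t] * phi[t][word] for t in range(len(phi)))
    return prob


def uniform(d, word):
    """A model that spreads probability evenly over a 4-word vocabulary."""
    return 0.25


# Made-up topic-word tables: topic 0 is automobile-like, topic 1 finance-like.
phi = [
    {"car": 0.45, "tires": 0.45, "cash": 0.05, "debt": 0.05},
    {"car": 0.05, "tires": 0.05, "cash": 0.45, "debt": 0.45},
]
doc = [["car", "tires"]]

print(perplexity([["car", "cash"], ["debt", "tires"]], uniform))  # approximately 4.0
print(perplexity(doc, mixture_probability([[0.9, 0.1]], phi)))    # good topic mixture
print(perplexity(doc, mixture_probability([[0.1, 0.9]], phi)))    # poor topic mixture
```

A uniform model over V words has perplexity V, which gives a useful baseline: a trained model should score well below the vocabulary size on held-out text.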
Labeled LDA (LLDA) is an extension of LDA that can be trained with supervised learning on a multi-labeled document corpus. It constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA’s latent topics and human-assigned labels. This allows LLDA to directly learn word-tag correspondences.
LLDA also assumes that documents can have more than one topic, hence documents in the training corpus may be assigned more than one label. Using this corpus, LLDA automatically learns the posterior distribution of each word in the document conditioned on the document’s label set.
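One way to picture the constraint is as a Gibbs sampler in which each word may only be assigned to one of its own document's labels. The sketch below is a toy illustration of that idea under stated assumptions, not a reference implementation of Labeled LDA: the function name, the smoothing constants, and the example labels are all made up.

```python
import random


def train_labeled_lda(docs, doc_labels, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy Gibbs sampler in which a document may only use its own labels as topics."""
    rng = random.Random(seed)
    labels = sorted({label for ls in doc_labels for label in ls})
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)

    # Count tables keyed by label instead of by anonymous topic index.
    doc_topic = [dict.fromkeys(ls, 0) for ls in doc_labels]
    topic_word = {label: dict.fromkeys(vocab, 0) for label in labels}
    topic_total = dict.fromkeys(labels, 0)

    def add(d, w, label, delta):
        doc_topic[d][label] += delta
        topic_word[label][w] += delta
        topic_total[label] += delta

    # Random initialization, restricted to each document's label set.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = [rng.choice(doc_labels[d]) for _ in doc]
        for w, label in zip(doc, z_doc):
            add(d, w, label, +1)
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            candidates = doc_labels[d]  # the only topics this document may use
            for i, w in enumerate(doc):
                add(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in candidates
                ]
                assignments[d][i] = rng.choices(candidates, weights=weights)[0]
                add(d, w, assignments[d][i], +1)

    # Smoothed word distribution for every label.
    phi = {
        t: {w: (topic_word[t][w] + beta) / (topic_total[t] + V * beta) for w in vocab}
        for t in labels
    }
    return assignments, phi


docs = [
    ["car", "tires", "mileage"],
    ["cash", "debt", "budget"],
    ["car", "loan", "debt", "cost"],
]
doc_labels = [["automobile"], ["finance"], ["automobile", "finance"]]
assignments, phi = train_labeled_lda(docs, doc_labels)
print(assignments[2])
```

Documents with a single label pin their words to that label, and those words then help the sampler decide how to split the words of multi-label documents.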
As a bonus, we have a hands-on exercise for this tutorial. Train an LDA model using gensim on the following dataset: A Million News Headlines. You might find it useful to use a subset of the headlines instead of the entire dataset. If you train it with 10 clusters, you should be able to identify common themes in the news articles, similar to what we usually see, like sports etc. If you increase the number of clusters to a much larger number, you might find clusters specific to a single event which unfolded over several weeks, or to specific sports.
Note: Some clusters will be messy and difficult to interpret. But some of them should be easy.
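As a starting point for the exercise, a random subset of headlines could be drawn along these lines. The column name `headline_text` and the file name in the closing comment are assumptions about how the downloaded CSV is laid out, so check them against your copy. The rows in `demo_csv` are made-up stand-ins so the sketch runs on its own.

```python
import csv
import io
import random


def sample_headlines(csv_file, k, text_column="headline_text", seed=0):
    """Return up to k randomly chosen, whitespace-tokenized headlines."""
    reader = csv.DictReader(csv_file)
    headlines = [row[text_column] for row in reader if row[text_column]]
    rng = random.Random(seed)
    chosen = rng.sample(headlines, min(k, len(headlines)))
    return [headline.lower().split() for headline in chosen]


# Made-up rows standing in for the real file (assumed column layout).
demo_csv = io.StringIO(
    "publish_date,headline_text\n"
    "20200101,car sales rise as loan costs fall\n"
    "20200102,council debates new budget\n"
    "20200103,team wins final after extra time\n"
)
subset = sample_headlines(demo_csv, k=2)
print(subset)

# With the downloaded dataset, something like the following (file name assumed):
# with open("abcnews-date-text.csv", newline="", encoding="utf-8") as f:
#     subset = sample_headlines(f, k=50000)
```

The resulting token lists still need stop-word removal before they are passed to gensim as the cleaned document list.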