TF-IDF: Vector representation of Text

November 29, 2017

TF-IDF stands for Term Frequency-Inverse Document Frequency, a very common algorithm for transforming text into a meaningful numeric representation. The technique is widely used to extract features across various NLP applications. This article will help you understand why TF-IDF matters, how to compute it, and how to apply it in your applications.

Vector representation of Text

To use a machine learning algorithm or a statistical technique on any form of text, the text must first be transformed into a numeric or vector representation. This representation should capture the significant characteristics of the text. There are many such techniques, for example occurrence counts, term frequency, TF-IDF, word co-occurrence matrices, word2vec and GloVe.

Occurrence based vector representation

Since TF-IDF is an occurrence-based numeric representation of text, let us first look at the more primitive occurrence-based techniques and how TF-IDF evolved from them. One of the simplest ways to represent text as numbers is to count how many times each word occurs in the entire corpus.
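As a minimal sketch of this counting approach, the snippet below counts words over a small corpus using only the standard library. The two-document corpus, the whitespace tokenization and the names corpus_counts and count_vectors are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
corpus = [
    "good car gives good mileage",
    "this car is very expensive",
]

# Count how often each word occurs across the whole corpus.
corpus_counts = Counter(word for doc in corpus for word in doc.split())

# Fix a word order, then count occurrences per document
# to get one count vector per document.
vocabulary = sorted(corpus_counts)
count_vectors = [
    [doc.split().count(word) for word in vocabulary] for doc in corpus
]

print(vocabulary)
for doc, vector in zip(corpus, count_vectors):
    print(vector, doc)
```

Each position in a count vector corresponds to one vocabulary word, so documents of different lengths still map to vectors of the same size.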

Term Frequency

We assume that a higher count for a word simply means greater importance in the given text. This is reasonable, but what if the documents in our corpus are of different sizes? Larger documents would naturally contain more occurrences of a word than smaller ones. A better representation therefore normalizes the count of the word by the size of the document; this is called term frequency.

Numerically, term frequency of a word is defined as follows:

tf(w) = doc.count(w)/total words in doc
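The formula above can be sketched in a few lines of Python. The function name term_frequency and the whitespace tokenization are assumptions for illustration, not part of any library.

```python
from collections import Counter


def term_frequency(doc):
    """Return {word: count / total words} for a whitespace-tokenized document."""
    words = doc.split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


tf = term_frequency("good car gives good mileage")
for word, value in tf.items():
    print(f"{word}: {value:.2f}")
```

Because every count is divided by the document length, the term frequencies of a document always sum to 1, which makes short and long documents comparable.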

Inverse Document Frequency

While computing term frequency, each term is considered equally important and given a chance to participate in the vector representation. However, certain words are so common across documents that they contribute very little to the meaning of any one document. The term frequencies of such words, for example ‘the’, ‘a’, ‘in’ and ‘of’, might drown out the weights of more meaningful words. To reduce this effect, the term frequency is discounted by a factor called inverse document frequency.

idf(w) = log(total number of documents/number of documents containing word w)
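A minimal sketch of this formula follows. The article does not fix the base of the logarithm, so the natural logarithm is assumed here; the three-document corpus and the function name inverse_document_frequency are also made up for illustration.

```python
import math


def inverse_document_frequency(corpus):
    """Return {word: log(N / number of documents containing the word)}."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        # set() ensures a word is counted once per document.
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical corpus: 'the' and 'is' appear in every document.
corpus = [
    "the car is fast",
    "the company is growing",
    "the award is deserved",
]
idf = inverse_document_frequency(corpus)
for word in ("the", "is", "car", "company"):
    print(f"{word}: {idf[word]:.3f}")
```

Words that appear in every document get an idf of exactly 0, while words confined to a single document get the largest value, log(N).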

Term Frequency-Inverse Document Frequency

Combining the two gives a vector representation that assigns a high value to a term if it occurs often in a particular document and very rarely anywhere else. If the term occurs in every document, its idf is log(1) = 0. TF-IDF is the product of term frequency and inverse document frequency:

Tf-idf(w) = tf(w)*idf(w)

The more important a word is in the document, the higher its tf-idf score, and vice versa.
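Putting the two factors together, a from-scratch version of the computation might look like the sketch below. It follows the formulas in this article (natural logarithm assumed, no smoothing or normalization); the function name tf_idf and the three-document corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: tf * idf} dictionary per document."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Number of documents that contain each word.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical corpus for illustration.
corpus = [
    "the car is fast",
    "the company is growing",
    "the car won the award",
]
for doc, doc_scores in zip(corpus, tf_idf(corpus)):
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    print(doc, "->", [(word, round(score, 3)) for word, score in ranked])
```

In this sketch the word that occurs in every document scores 0, and a word unique to one document scores higher than a word shared by two documents.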

Illustration

Toy corpus and desired behavior

Let’s take an example corpus consisting of the following 5 documents:

This car got the excellence award
Good car gives good mileage
This car is very expensive
This company is financially good
The company is growing with very high production

The first three documents talk about cars and the remaining two about a company. In the first category, although the word car defines the context, the significant meaning of each document comes from the attributes related to the car. Similarly, in the other category, the financial and production status of the company would be the more useful features. The remaining common words are less meaningful and should therefore get a low score.

Code illustration

The following Python code builds a TF-IDF model for the given corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this car got the excellence award",
    "good car gives good mileage",
    "this car is very expensive",
    "this company is financially good",
    "the company is growing with very high production",
]

# Build the vocabulary from every whitespace-separated word in the corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# Fit the TF-IDF model on the corpus using that fixed vocabulary
tfidf = TfidfVectorizer(vocabulary=vocabulary)
tfidf.fit(corpus)

for doc in corpus:
    print(doc)
    # Transform a single document into TF-IDF coordinates
    X = tfidf.transform([doc])
    # Look up the score of each word that occurs in the document
    score = {word: X[0, tfidf.vocabulary_[word]] for word in doc.split()}
    # Sort the words from highest to lowest score
    sorted_score = sorted(score.items(), key=lambda item: item[1], reverse=True)
    print("\t", sorted_score)

We can see that the model learns to give less importance to words like is and this. Unfortunately, it also gives low importance to important words like car and fairly high importance to unwanted words like gives. With a larger corpus, these issues should diminish, as many more documents would contain words like gives but not car.
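To see why car fares no better than is or this on this corpus, it helps to compare the idf values directly. The sketch below uses only the standard library. Alongside the plain idf from this article (natural logarithm assumed), it computes the smoothed variant that scikit-learn documents as the TfidfVectorizer default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. It does not reproduce the L2 normalization that scikit-learn applies afterwards by default, so the numbers are idf weights, not the final scores printed above.

```python
import math
from collections import Counter

corpus = [
    "this car got the excellence award",
    "good car gives good mileage",
    "this car is very expensive",
    "this company is financially good",
    "the company is growing with very high production",
]

n_docs = len(corpus)
# Number of documents that contain each word.
doc_freq = Counter(word for doc in corpus for word in set(doc.split()))

# Plain idf as defined in this article (natural log is an assumption).
plain_idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
# Smoothed idf, following the formula in scikit-learn's documentation.
smooth_idf = {
    w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()
}

for word in ("is", "this", "car", "gives"):
    print(
        f"{word:6s} df={doc_freq[word]} "
        f"plain={plain_idf[word]:.3f} smooth={smooth_idf[word]:.3f}"
    )
```

Since car appears in exactly as many documents as is and this, all three receive the same idf under either formula, while gives, which appears in a single document, receives a larger one.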

Applications

There are many use cases where TF-IDF-based representations strengthen NLP algorithms:

Document classification: TF-IDF vectors are a fundamental feature representation for training classifiers such as SVM, and for techniques such as LSI.
Topic Modeling: For auto-tagging documents, one way is to use TF-IDF directly: compute the vector for each document and set a threshold. Terms with scores above this threshold can participate in predicting the topics for new documents. Alternatively, TF-IDF features can serve as input to algorithms such as LDA and LLDA, which can help improve accuracy and performance. We will learn all about these in the upcoming tutorials.
Information retrieval systems: TF-IDF complements text mining and search algorithms by assigning each word a score representing how important it is in defining the meaning of the document. The search results are better when these important words closely relate to the search query.
Stop word filtering: TF-IDF is also a highly useful tool to filter out less important common words and can remove the requirement to manually maintain an extensive list of stop words.
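As one illustration of the information retrieval use case, the sketch below ranks the toy corpus against a query by cosine similarity between TF-IDF vectors. It uses the plain formulas from this article with a natural logarithm; the query string, the helper names tf_idf_vector and cosine, and the choice of cosine similarity as the ranking function are assumptions for illustration, not the method of any particular search system.

```python
import math
from collections import Counter

corpus = [
    "this car got the excellence award",
    "good car gives good mileage",
    "this car is very expensive",
    "this company is financially good",
    "the company is growing with very high production",
]

tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)
doc_freq = Counter(word for words in tokenized for word in set(words))
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tf_idf_vector(words):
    """Map each word to tf * idf; words unseen in the corpus get weight 0."""
    counts = Counter(words)
    return {w: (c / len(words)) * idf.get(w, 0.0) for w, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_vectors = [tf_idf_vector(words) for words in tokenized]
# Hypothetical search query.
query_vector = tf_idf_vector("expensive car".split())

# Rank documents from most to least similar to the query.
ranked = sorted(
    ((cosine(query_vector, vec), doc) for vec, doc in zip(doc_vectors, corpus)),
    reverse=True,
)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

The rare word expensive dominates the query vector, so the document containing it ranks first, while the two company documents share no query words and score 0.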

Summary

The TF-IDF algorithm transforms text into a meaningful numeric representation, which is widely used to extract features across various NLP applications.
The term frequency of a word is the ratio of the count of the word in the document to the total number of words in the document. It measures how frequently a term (word) occurs in a document.
tf(w) = doc.count(w)/total words in doc
Inverse document frequency measures how rare, and therefore how informative, a term (word) is across the corpus.
idf(w) = log(total number of documents/number of documents containing word w)
TF-IDF is the product of term-frequency and inverse document frequency.
Tf-idf(w) = tf(w)*idf(w)
TF-IDF-based representations strengthen NLP algorithms used in document classification, topic modeling, information retrieval and stop word filtering.