Word cloud for terms in “Big Data” field
A word cloud (also called a tag cloud) is a data visualization technique that highlights the important textual data points in a large text corpus. It produces a visualization in which words that appear more frequently are given greater prominence. This type of visualization can assist in exploratory text analysis by identifying important textual data points (which may be potential features) and contextual themes appearing in a set of documents.
In a word cloud visual, the more common words in the documents appear larger and bolder. Word cloud generators break the text down into word tokens and count how frequently each one appears in the entire corpus. Each word is assigned a font size based on how often it appears in the text, so the more frequently a word appears, the larger it is shown in the cloud. The frequency can also be replaced by the TF-IDF score of each word, which down-weights words that are common across the documents and gives a relatively more meaningful representation. Finally, all the words are arranged in a cluster or cloud, which may take any form, such as horizontal lines, columns or a shape.
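To see why TF-IDF pushes down words that occur everywhere, here is a minimal sketch using only the standard library. It assumes one common variant of the formula (term frequency times the log of the inverse document frequency); the helper names `tokenize` and `tf_idf` and the three sample sentences are made up for illustration, and real libraries often add smoothing on top of this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / number of tokens in it
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Illustrative sample documents (assumption, not from the article).
docs = [
    "the cloud shows the frequent words",
    "the font size follows the word frequency",
    "the canvas holds the words",
]
weights = tf_idf(docs)
print(weights[0])
```

In this sample, "the" appears in every document, so its inverse document frequency is log(1) = 0 and it drops out, while "cloud", which appears in only one document, gets the highest score in the first document.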
Word clouds can also be used to display words that have metadata assigned to them. For example, in a word cloud of countries, each country’s population could determine the size of its name. Colors in a word cloud are usually purely aesthetic, but they can also be used to denote categories.
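As a rough sketch of this idea with the wordcloud library, `generate_from_frequencies` accepts a dictionary of weights directly, and the `color_func` parameter lets you pick a color per word. The labels, numbers, categories and colors below are hypothetical placeholders, not real statistics, and `cloud_from_weights` is just an illustrative helper name.

```python
# Hypothetical placeholder values (not real statistics): label -> weight.
WEIGHTS = {
    "Country A": 120.0,
    "Country B": 60.0,
    "Country C": 15.0,
}

# Hypothetical categories, each drawn in its own color.
CATEGORY = {"Country A": "north", "Country B": "south", "Country C": "south"}
CATEGORY_COLORS = {"north": "steelblue", "south": "tomato"}


def color_by_category(word, **kwargs):
    """Color function for WordCloud: ignores size and position, uses the category."""
    return CATEGORY_COLORS[CATEGORY[word]]


def cloud_from_weights(weights):
    """Build a word cloud whose word sizes follow the given weights."""
    # Requires the wordcloud package; imported here so the data
    # preparation above can be tried without it installed.
    from wordcloud import WordCloud

    cloud = WordCloud(background_color="white", color_func=color_by_category)
    return cloud.generate_from_frequencies(weights)
```

Calling `cloud_from_weights(WEIGHTS)` returns a WordCloud object that can be shown with `plt.imshow(...)` in the same way as a cloud generated from raw text.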
There are a few word cloud generators freely available on the internet. Let’s use Python’s wordcloud library and build a word cloud for the paragraphs of this article so far. Make sure you have the wordcloud package installed:
sudo pip install wordcloud
The following code creates a word cloud from some text. You can try it by assigning the first three paragraphs of this tutorial, or any other text of your choice, to the text variable.
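import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = '''Copy & Paste the text'''

# relative_scaling=1.0 makes a word that is twice as frequent twice as large;
# the built-in STOPWORDS set filters out very common English words.
wordcloud = WordCloud(relative_scaling=1.0, stopwords=set(STOPWORDS)).generate(text)

# Display the generated image without axes.
plt.imshow(wordcloud)
plt.axis("off")
plt.show()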
This is what the output word cloud looks like:
Word Cloud output of the first three paragraphs of this tutorial
Wow! As you can see, the image not only tells you at a glance that the text is about word clouds, but it also highlights contextual elements such as “visualization”, “appear”, “frequently”, etc.
The Python wordcloud library also provides the following configurable parameters to customize your word cloud:
- Font: choose any of the available fonts. The default font is DroidSansMono on a Linux machine.
- Size of the canvas by defining the width and height. The default canvas size is 400x200.
- Minimum and maximum font sizes.
- Maximum number of words on the canvas.
- Background color. For a transparent background, set the mode to RGBA and the background color to None.
- Matplotlib colormap to randomly choose colors from for each word.
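The options above map onto keyword arguments of the `WordCloud` constructor. The sketch below collects one possible combination; the specific values are arbitrary choices for illustration, and `CLOUD_OPTIONS` and `build_cloud` are hypothetical names, not part of the library.

```python
# One possible set of options; the values are arbitrary examples.
CLOUD_OPTIONS = {
    "font_path": None,         # None keeps the library's default font
    "width": 800,              # canvas width in pixels (default 400)
    "height": 400,             # canvas height in pixels (default 200)
    "min_font_size": 8,
    "max_font_size": 120,
    "max_words": 100,          # maximum number of words on the canvas
    "background_color": None,  # None together with mode="RGBA" gives transparency
    "mode": "RGBA",
    "colormap": "plasma",      # any Matplotlib colormap name
}


def build_cloud(text, options=CLOUD_OPTIONS):
    """Generate a word cloud for text using the given constructor options."""
    # Requires the wordcloud package; imported here so the options
    # can be inspected without it installed.
    from wordcloud import WordCloud

    return WordCloud(**options).generate(text)
```

For example, `build_cloud(text).to_file("cloud.png")` would write the result to an image file, where `text` is whatever string you want to visualize.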
Let’s look at the algorithm behind word clouds, which will help us understand how some of the commonly available word cloud libraries are implemented.
We will create the list of words we want to plot, along with the associated weights which measure the importance of each word.
- The first step to create a word cloud image is to tokenize the text in the form of words and filter out any stopwords (very commonly used words like the, if, of, etc).
- Next, we have to give each word a weight, which could be based on either a plain frequency count or a TF-IDF score.
- To normalize these weights, divide each weight by the maximum weight.
- Sort the words based on their weights in descending order, so that the most important one would be placed first.
- Choose the first N important words and discard the remaining.
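The steps above can be sketched in a few lines of standard-library Python. This version uses plain frequency counts as weights; the tiny stopword list, the function name `weighted_words` and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "if", "of", "and", "in", "is", "to", "it"}


def weighted_words(text, max_words=5, stopwords=STOPWORDS):
    """Return up to max_words (word, weight) pairs, heaviest first."""
    # 1. Tokenize and drop stopwords.
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in stopwords]
    if not tokens:
        return []
    # 2. Weight each word by its plain frequency count.
    counts = Counter(tokens)
    # 3. Normalize by the maximum weight so the top word gets 1.0.
    top = max(counts.values())
    # 4. Sort by weight, descending (ties broken alphabetically), and
    # 5. keep only the first max_words entries.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / top) for word, count in ranked[:max_words]]


sample = ("The cloud of words: the more frequent the word, "
          "the larger the word in the cloud.")
words = weighted_words(sample, max_words=3)
print(words)  # [('cloud', 1.0), ('word', 1.0), ('frequent', 0.5)]
```

Breaking ties alphabetically is only there to make the output deterministic; any stable tie-breaking rule would do.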
A spiral. Start with the center of the word at the center of the spiral. If there is an overlap, keep moving outwards in discrete steps and retry.
Now the challenge is to place the words on the canvas.
- Divide the canvas into small rectangles.
- Consider each word one by one.
- The font size of the word is proportional to its weight.
- Start at the center of a spiral and try to place the current word. If there is an overlap, continue following the spiral.
- Dividing the canvas into small rectangles makes it easier to detect overlaps (you need to be able to calculate which rectangles each word touches).
- If there is not enough room to draw all the words on the canvas, try all over again by using a smaller font size.
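Here is a simplified sketch of this placement loop. It is not how any particular library does it: each word is approximated by a bounding box whose width is guessed from the character count (the `CHAR_ASPECT` factor is an assumption), the spiral is an Archimedean one, and all function names are made up for the example.

```python
import math

CELL = 4           # side of one grid rectangle, in pixels
CHAR_ASPECT = 0.6  # assumed width of one character relative to the font size


def cells_for(x, y, w, h):
    """Return the grid cells touched by the box with top-left corner (x, y)."""
    cols = range(x // CELL, (x + w - 1) // CELL + 1)
    rows = range(y // CELL, (y + h - 1) // CELL + 1)
    return {(c, r) for c in cols for r in rows}


def spiral_offsets(max_radius, growth=0.5, step=0.2):
    """Yield (dx, dy) points along a spiral that starts at the center."""
    t = 0.0
    while growth * t <= max_radius:
        r = growth * t
        yield r * math.cos(t), r * math.sin(t)
        t += step


def try_layout(words, width, height, max_font, min_font):
    """Place every word once; return None as soon as one does not fit."""
    occupied, placed = set(), []
    max_radius = math.hypot(width, height) / 2
    for word, weight in words:
        # Font size proportional to the weight, but never below min_font.
        size = max(min_font, round(max_font * weight))
        w, h = round(len(word) * size * CHAR_ASPECT), size
        for dx, dy in spiral_offsets(max_radius):
            x = int(width / 2 + dx - w / 2)
            y = int(height / 2 + dy - h / 2)
            if x < 0 or y < 0 or x + w > width or y + h > height:
                continue  # box would leave the canvas
            cells = cells_for(x, y, w, h)
            if cells.isdisjoint(occupied):  # no overlap with earlier words
                occupied |= cells
                placed.append((word, size, x, y, w, h))
                break
        else:
            return None  # spiral exhausted without finding room
    return placed


def layout(words, width, height, max_font=48, min_font=6):
    """Retry with a smaller font until all words fit on the canvas."""
    font = max_font
    while font >= min_font:
        placed = try_layout(words, width, height, font, min_font)
        if placed is not None:
            return placed
        font = font * 4 // 5  # shrink by 20% and start over
    raise ValueError("canvas is too small for these words")


placed = layout([("cloud", 1.0), ("word", 1.0), ("frequent", 0.5)], 400, 200)
for word, size, x, y, w, h in placed:
    print(f"{word!r}: font {size}px at ({x}, {y})")
```

Checking overlaps at the level of grid cells is slightly conservative (two boxes that merely share a cell count as overlapping), but it turns the collision test into a cheap set operation, which is the point of dividing the canvas into small rectangles.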
In the right setting, a word cloud is a powerful tool that can help you analyze textual information (feedback, tweets, posts, etc.) at a single glance. Another powerful use case is to build a word cloud of a website and identify potential keywords to target for SEO. And finally, it can be used to understand the context of blogs, articles and other longer texts and to discover critical textual features.