A word cloud (also called a tag cloud) is a data visualization technique that highlights the most prominent terms in a large text corpus. It gives a quick visual sense of which words appear most frequently. This kind of visualization can assist exploratory text analysis by surfacing important terms (which may be potential features) and contextual themes in a set of documents.
In a word cloud, the more common words in the documents appear larger and bolder. Word cloud generators break the text into word tokens and count how often each token appears in the entire corpus. Each word is assigned a font size based on its frequency, so the more often a word appears, the larger it is drawn. The raw frequency can also be replaced by each word's tf-idf score, which down-weights words that are common across documents and gives a more meaningful representation. Finally, the words are arranged in a cluster, or cloud; they can also be laid out in other forms, such as horizontal lines, columns, or within a shape.
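To make the tf-idf idea concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the plain formulation tf × log(N / df); libraries differ in smoothing and normalization, and the function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document (plain tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Illustrative documents, not taken from any real corpus.
docs = [
    "the cloud shows the words",
    "the font size grows with frequency",
    "the spiral places words",
]
first = tf_idf(docs)[0]
```

In the first document, "the" is the most frequent token, yet its score is zero because it occurs in every document, while "cloud", which occurs only there, scores highest.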
Word clouds can also display words that have metadata assigned to them. For example, in a word cloud of countries, each country's population could determine its size. Colour in a word cloud is usually aesthetic, but it can also be used to categorize words.
Several word cloud generators are freely available online. Here we use Python's wordcloud library to build a word cloud from the preceding paragraphs of this article. First, make sure the wordcloud package is installed:
    pip install wordcloud
The following code creates a word cloud from some text. You can try it by assigning the first three paragraphs of this article, or any other text of your choice, to the text variable.
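    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS

    # Paste the text to visualize here.
    text = '''Copy Paste the above text'''

    # relative_scaling=1.0 sizes words in proportion to their frequency;
    # the built-in STOPWORDS set filters out very common words.
    wordcloud = WordCloud(relative_scaling=1.0, stopwords=set(STOPWORDS)).generate(text)

    # Show the cloud without axes.
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()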
This is what the output word cloud looks like:
As you can see, a single glance at the output image not only tells you that the text is about word clouds but also highlights contextual elements such as “visualization”, “appear”, and “frequently”.
The Python wordcloud library also provides the following configurable parameters to customize your word cloud:
- The font to use. The default is DroidSansMono on a Linux machine.
- The size of the canvas, defined by its width and height. The default canvas size is 400x200.
- The minimum and maximum font sizes the word cloud should use.
- The maximum number of words on the canvas.
- The background color. Transparent backgrounds are also supported through an alpha channel.
- A Matplotlib colormap from which a color is chosen at random for each word.
Let us look at the algorithm behind word clouds, which helps in understanding how the commonly available libraries implement them.
The first task is to create the list of words to plot, along with weights that measure the importance of each word:
- Tokenize the text into words and filter out any stopwords (very commonly used words such as the, if, and of).
- Give each word a weight, based on either a plain frequency count or a tf-idf score.
- Normalize the weights by dividing each weight by the maximum weight.
- Sort the words based on their weights in descending order, so that the most important one would be placed first.
- Choose the first N important words and discard the remaining.
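The weighting steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming plain frequency counts, a simple regular-expression tokenizer, and a tiny hand-picked stopword list; the names `word_weights` and `SAMPLE_STOPWORDS` are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
SAMPLE_STOPWORDS = {"the", "if", "of", "a", "an", "and", "in", "is", "to"}

def word_weights(text, max_words=5, stopwords=SAMPLE_STOPWORDS):
    """Return [(word, weight)] sorted by descending weight, weights in (0, 1]."""
    # Tokenize into lowercase words and drop stopwords.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    if not counts:
        return []
    # most_common sorts by descending count and keeps the first N words.
    top = counts.most_common(max_words)
    # Normalize by the largest count so the top word has weight 1.0.
    max_count = top[0][1]
    return [(word, count / max_count) for word, count in top]

text = ("The cloud of words: the bigger the word, the more frequent the word. "
        "Word clouds count words.")
weights = word_weights(text, max_words=3)
```

Note that this sketch treats "word" and "words" as different tokens; a real implementation may merge plurals or stem the tokens first.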
Now the real challenge is to place the words on the canvas.
- Divide the canvas into small rectangles.
- Consider each word one by one.
- The font-size of the word is proportional to its weight.
- Start at the center of a spiral and try to place the current word. If it overlaps an already placed word, continue along the spiral.
- Dividing the canvas into small rectangles makes it easier to detect overlaps (you need to be able to calculate which rectangles each word touches).
- If there is not enough room to draw all the words on the canvas, start over with a smaller font size.
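The placement steps can be sketched as follows. This is a simplified illustration, not the code of any particular library: it assumes each word's bounding box can be estimated as 0.6 × font size × number of characters wide and one font size tall, uses an Archimedean spiral with arbitrary step constants, and all function names are made up for the example.

```python
import math

def covered_cells(x, y, w, h, cell):
    """Grid cells touched by the box with top-left corner (x, y) and size w x h."""
    cols = range(int(x // cell), int((x + w) // cell) + 1)
    rows = range(int(y // cell), int((y + h) // cell) + 1)
    return {(c, r) for c in cols for r in rows}

def try_layout(weights, width, height, max_font, cell=4, max_steps=2000):
    """One placement pass; returns {word: (x, y, w, h, font_size)}."""
    occupied, placed = set(), {}
    for word, weight in weights:            # weights sorted in descending order
        font = max(1.0, max_font * weight)  # font size proportional to weight
        w, h = 0.6 * font * len(word), font  # crude box estimate (assumption)
        for step in range(max_steps):
            angle = 0.3 * step
            radius = angle                  # Archimedean spiral from the center
            x = width / 2 + radius * math.cos(angle) - w / 2
            y = height / 2 + radius * math.sin(angle) - h / 2
            if x < 0 or y < 0 or x + w > width or y + h > height:
                continue                    # box would leave the canvas
            cells = covered_cells(x, y, w, h, cell)
            if occupied.isdisjoint(cells):  # no overlap with placed words
                occupied |= cells
                placed[word] = (x, y, w, h, font)
                break
    return placed

def layout(weights, width=400, height=200, max_font=48.0, min_font=8.0):
    """Retry with smaller fonts until every word fits or min_font is reached."""
    placed, font = {}, max_font
    while font >= min_font:
        placed = try_layout(weights, width, height, font)
        if len(placed) == len(weights):
            break
        font *= 0.8
    return placed  # may be partial if even the smallest font did not fit

weights = [("visualization", 1.0), ("word", 0.8), ("cloud", 0.6), ("text", 0.4)]
boxes = layout(weights)
```

Tracking occupied grid cells in a set keeps the overlap test to a single set operation per candidate position, at the cost of treating two words as overlapping whenever they touch the same cell.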
In the right setting, a word cloud is a powerful tool for analyzing textual information (feedback, tweets, posts, etc.) at a glance. Another use case is to build a word cloud of a website to identify potential keywords to target for SEO. Finally, word clouds can help in understanding the context of blogs, articles, and other longer texts, and in discovering important textual features.