CommonLounge Archive

Introduction to Data Visualization with Matplotlib

December 28, 2017

Matplotlib is one of the most widely used Python packages for data visualization. It provides a quick way to visualize data from Python and create publication-quality figures in a variety of formats. Matplotlib is a multi-platform data visualization library built on NumPy arrays, which allows it to work with the broader SciPy stack.

In this article, we are going to explore matplotlib in interactive mode covering 7 basic cases. You are encouraged to follow along with the tutorial and play around with Matplotlib, trying various things and making sure you’re getting the hang of it. Let’s get started!

Matplotlib and Pyplot

Importing Matplotlib

Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:

import matplotlib.pyplot as plt

Pyplot, shortened above as plt, is a module within the matplotlib package that provides a convenient interface to matplotlib’s plotting classes and methods.
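To make the idea of a convenient interface concrete, here is a minimal sketch contrasting the two common ways of driving Matplotlib: calling plt functions, which act on an implicit "current" figure, and calling methods on explicit Figure and Axes objects. The data values and titles are made up for illustration.

```python
import matplotlib.pyplot as plt

# pyplot style: plt.* functions act on the current figure and axes
plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])
plt.title("pyplot style")
current_ax = plt.gca()  # the axes that the plt.* calls above drew on

# Object-oriented style: hold on to the Figure and Axes and call their methods
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Object-oriented style")
```

The rest of this article uses the pyplot style. Calling plt.show() afterwards displays both figures.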

First Look: Line Chart

Creating plots with Matplotlib can be easily accomplished with just a few lines of code. As our first choice, we will create sine and cosine waves with a line chart. Line charts in general are also a good choice for showing trends.

Creating the data points

First things first, let’s import the NumPy library and create an array of data points.

import numpy as np
x = np.linspace(-np.pi, np.pi, 256, endpoint=True)
S, C = np.sin(x), np.cos(x)
# Display x's shape and first 10 elements
print('x') 
print(x.shape)
print(x[:10])
print()
# Display S's shape and first 10 elements
print('S') 
print(S.shape)
print(S[:10])
print()
# Display C's shape and first 10 elements 
print('C') 
print(C.shape)
print(C[:10])

Each of the above variables is a vector of 256 elements. S is sin(x), and C is cos(x). Now that we have the data, let’s plot it.

Our first plot

# Start your figure
plt.figure()
# Plot sine curve with a solid line ('-')
plt.plot(x, S, '-')
# Plot cosine curve with a dashed line ('--')
plt.plot(x, C, '--')
# Display plot and show result on screen.
plt.show()

All our plots will begin by first initiating a figure (plt.figure()), and end with displaying the plot (plt.show()). In between, we’ll call the functions which decide what gets plotted. In this case, we used the plt.plot function to plot lines.
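The third argument to plt.plot is a format string that can pack a color and a line style into a few characters. The following is a small sketch of that shorthand; the extra sin(2x) curve is only there for illustration and is not part of the example above.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 256, endpoint=True)

fig = plt.figure()
# Each call returns a list of Line2D objects; unpack the single line
solid, = plt.plot(x, np.sin(x), 'g-')       # green, solid
dashed, = plt.plot(x, np.cos(x), 'b--')     # blue, dashed
dotted, = plt.plot(x, np.sin(2 * x), 'r:')  # red, dotted
```

Finish with plt.show() to display the figure.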

A more detailed look at plotting

Let’s move on to setting the built-in options explicitly so that we can customize the appearance of our plot to suit our needs. Any setting we do not specify falls back to its default value.

## Create a new figure of size 10x6 inches, using 80 dots per inch
fig = plt.figure(figsize=(10,6), dpi=80)
## Plot cosine using blue color with a dashed line of width 2.5 (points)
plt.plot(x, C, color="blue", linewidth=2.5, linestyle="--", label="cosine")
## Plot sine using green color with a continuous line of width 2.5 (points)
plt.plot(x, S, color="green", linewidth=2.5, linestyle="-", label="sine")
## Set axis limits and ticks (markers on axis)
# x goes from -4.0 to 4.0
plt.xlim(-4.0, 4.0)
# 9 ticks, equally spaced
plt.xticks(np.linspace(-4, 4, 9, endpoint=True)) 
# Set y limits from -1.0 to 1.0
plt.ylim(-1.0, 1.0)
# 5 ticks, equally spaced
plt.yticks(np.linspace(-1, 1, 5, endpoint=True))
## Add legends, title and axis names
plt.legend(loc='upper left', frameon=False)
plt.title("Graph of wave movement with Sine and Cosine functions")
plt.xlabel("Time, t")
plt.ylabel("Position, x")
## Turn on grid
plt.grid(color='b', linestyle='-', linewidth=0.1)
## Move the spines to the center of the plot
ax = plt.gca()
# Move left y-axis and bottom x-axis to the centre, passing through (0,0)
ax.spines['left'].set_position('center')
ax.spines['bottom'].set_position('center')
# Eliminate upper and right axes
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# Show ticks in the left and lower axes only
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
plt.show()

Well, there you have it!
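Since the data is a pair of trigonometric functions, one possible refinement is to place the x-axis ticks at multiples of π and label them accordingly. The sketch below assumes the same sine and cosine data as above; the tick positions and label strings are choices made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 256, endpoint=True)

fig = plt.figure(figsize=(10, 6), dpi=80)
plt.plot(x, np.cos(x), color="blue", linestyle="--", label="cosine")
plt.plot(x, np.sin(x), color="green", linestyle="-", label="sine")

# plt.xticks accepts tick positions and, optionally, one label per position
tick_positions = [-np.pi, -np.pi / 2, 0, np.pi / 2, np.pi]
tick_labels = [r"$-\pi$", r"$-\pi/2$", r"$0$", r"$\pi/2$", r"$\pi$"]
plt.xticks(tick_positions, tick_labels)

plt.legend(loc="upper left", frameon=False)
ax = plt.gca()
```

The labels use Matplotlib's built-in math text, so they are rendered as symbols when the figure is drawn with plt.show().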

Saving a plot

If you would like to save the figure to a file instead of displaying it on screen, you can use the savefig() method.

fig.savefig('my_figure.png')

There are multiple formats we can save this image in; the format is inferred from the file extension. To list the supported formats:

import pprint # pretty printer
pprint.pprint(fig.canvas.get_supported_filetypes())
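As a small sketch of saving the same figure in more than one format, the example below writes a PNG and a PDF into a temporary directory. The file names, the temporary directory and the dpi value are assumptions chosen for illustration.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 3))
plt.plot([0, 1, 2], [0, 1, 4])

# Write into a throwaway directory so the example leaves no files behind
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "my_figure.png")
pdf_path = os.path.join(out_dir, "my_figure.pdf")

# The format follows the extension; dpi and bbox_inches are optional
fig.savefig(png_path, dpi=150, bbox_inches="tight")
fig.savefig(pdf_path)
plt.close(fig)
```

Raster formats such as PNG respect dpi, while vector formats such as PDF stay sharp at any zoom level.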

Types of plots

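For your reference, here are the most common kinds of plots, named as in the kind argument of pandas’ DataFrame.plot (more on several of these below). In pyplot, the corresponding functions are plt.bar, plt.barh, plt.hist, plt.boxplot, plt.scatter, plt.hexbin and plt.pie, with plt.stackplot for area plots.

  • ‘bar’ or ‘barh’ for bar charts
  • ‘hist’ for histograms
  • ‘box’ for boxplots
  • ‘kde’ or ‘density’ for density plots
  • ‘area’ for area plots
  • ‘scatter’ for scatter plots
  • ‘hexbin’ for hexagonal bin plots
  • ‘pie’ for pie charts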
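To give a feel for how several of these plot kinds are called, here is a rough sketch that draws four of them side by side with plt.subplots. The data is random or made up, and the 2x2 layout is just one convenient arrangement.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(size=200)

# A 2x2 grid of axes, one plot kind per cell
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].bar(['A', 'B', 'C'], [3, 5, 2])
axes[0, 0].set_title("bar")

axes[0, 1].hist(values, bins=20)
axes[0, 1].set_title("hist")

axes[1, 0].scatter(values[:100], values[100:])
axes[1, 0].set_title("scatter")

axes[1, 1].pie([35, 30, 20, 15], labels=['W', 'X', 'Y', 'Z'])
axes[1, 1].set_title("pie")

fig.tight_layout()
```

Each kind is covered in more detail in the sections that follow.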

Bar Chart

A bar chart is a good choice when you want to show how some quantity varies among some discrete set of items. Let’s create a bar chart from a small data set.

# Setting figure size to 7x5
fig = plt.figure(figsize=(7,5))
# Setting data set
men_means = [20, 35, 30, 35, 27]
men_stds = [2, 3, 4, 1, 2]
# Setting index 
ind = np.arange(5)
# Setting argument for width
width = 0.35
# Plotting a horizontal bar graph for men_means against index
# with errorbars equal to standard deviation
error_args = {'ecolor': (0, 0, 0), 'linewidth': 2.0}
plt.barh(ind, men_means, width, xerr=men_stds, error_kw=error_args)
# Y-axis ticks and labels
ax = plt.gca()
ax.set_ylim(-0.5, 4.5)
ax.set_yticks(ind)
ax.set_yticklabels(['A', 'B', 'C', 'D', 'E', ])
plt.show()

In the stacked plot below, we need to pass bottom explicitly: it is the y-axis position where each bar starts (the position of the bottom of each bar). Setting bottom=men_means places the women’s bars on top of the men’s bars. error_args specifies that the error bars are black, with a line width of 2 points.


# Setting figure size to 7x5
fig = plt.figure(figsize=(7,5))
# Setting data set values
women_means = [25, 32, 34, 20, 25]
women_stds = [3, 5, 2, 3, 3]
# Plotting a stacked vertical bar graph with men's data at the bottom and women's data on top.
p1 = plt.bar(ind, men_means, width, yerr=men_stds, color='b', error_kw=error_args)
p2 = plt.bar(ind, women_means, width, bottom=men_means, yerr=women_stds, color='g', error_kw=error_args)
# Modifying x-axis 
ax = plt.gca()
ax.set_xlim(-0.5, 4.5)
ax.set_xticks(ind)
ax.set_xticklabels(['A', 'B', 'C', 'D', 'E', ])
plt.show()
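With more than two series, the bottom of each layer is the running total of all the layers beneath it. The sketch below illustrates this with a third series; the "other" values are hypothetical and exist only to show the pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

ind = np.arange(5)
width = 0.35
series = {
    "men": np.array([20, 35, 30, 35, 27]),
    "women": np.array([25, 32, 34, 20, 25]),
    "other": np.array([5, 8, 6, 9, 7]),  # hypothetical third series
}

fig, ax = plt.subplots(figsize=(7, 5))
bottom = np.zeros(len(ind))  # the first layer starts at y = 0
for name, values in series.items():
    ax.bar(ind, values, width, bottom=bottom, label=name)
    bottom = bottom + values  # the next layer starts where this one ends

ax.set_xticks(ind)
ax.set_xticklabels(['A', 'B', 'C', 'D', 'E'])
ax.legend()
```

After the loop, bottom holds the total height of each stacked bar.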

Histogram

A histogram shows how frequently values occur across a continuous or discrete variable. Let’s have a look.

# Generate 3 different arrays
x = np.random.normal(0, 0.8, 1000)
y = np.random.normal(-2, 1, 1000)
z = np.random.normal(3, 2, 1000)
# Set figure size to 9x6
fig = plt.figure(figsize=(9, 6))
# Configure keyword arguments to customize the histogram.
# alpha adjusts translucency, bins sets the number of bins,
# and density=True normalizes each histogram to a total area of 1.
# More options are available in the documentation.
kwargs = {
    'histtype': 'stepfilled',
    'alpha': 0.9,
    'density': True,
    'bins': 40,
}
# Plot all 3 arrays on one graph
plt.hist([x, y, z], **kwargs)
plt.show()
# Generate a 2-D NumPy array with 1000 rows and 3 columns
X = 200 + 25*np.random.randn(1000, 3)
# Set figure size to 9x6
fig = plt.figure(figsize=(9, 6))
# Plot a stacked, normalized histogram with one layer per column
n, bins, patches = plt.hist(X, 30, alpha=0.9, stacked=True, density=True, linewidth=0.0, rwidth=1.0)
plt.show()
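To see what normalizing a histogram means in practice, the following sketch checks that the bar heights returned by plt.hist with density=True, multiplied by the bin widths, add up to 1. The random seed and sample parameters are arbitrary choices for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = rng.normal(0, 0.8, 1000)

fig = plt.figure(figsize=(9, 6))
# plt.hist returns the bar heights, the bin edges and the drawn patches
heights, edges, patches = plt.hist(samples, bins=40, density=True)

# Raw counts over the same bins, for comparison
counts, _ = np.histogram(samples, bins=edges)
widths = np.diff(edges)

# With density=True, height = count / (total count * bin width)
area = float(np.sum(heights * widths))
```

This is why the y-axis of a normalized histogram shows a density rather than a number of samples.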

Scatter Plot

A scatter plot is the right choice for visualizing every point in a dataset and visually looking for clusters or correlations.

N = 100
# Generate 2 different arrays
x = np.random.rand(N)
y = np.random.rand(N)
fig = plt.figure(figsize=(9, 6))
# Plotting a scatter graph at the given x-y coordinates
plt.scatter(x, y)
plt.show()

N = 100
# Generate 2 different arrays
x = np.random.rand(N)
y = np.random.rand(N)
fig = plt.figure(figsize=(9, 6))
# Assign random colors and variable sizes to the bubbles
colors = np.random.rand(N)
area = np.pi * (20 * np.random.rand(N))**2 # 0 to 20 point radii
# Scatter plot on x-y coordinate with the assigned size and color
plt.scatter(x, y, s=area, c=colors, alpha=0.7)
plt.show()
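The uniformly random points above show no structure. For contrast, here is a sketch with deliberately correlated data, where the visual trend can be confirmed numerically with np.corrcoef. The slope, noise level and seed are assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 100
x = rng.random(N)
# y follows x linearly, plus some Gaussian noise
y = 2 * x + rng.normal(0, 0.2, N)

fig = plt.figure(figsize=(9, 6))
points = plt.scatter(x, y, alpha=0.7)

# Pearson correlation coefficient between x and y
r = float(np.corrcoef(x, y)[0, 1])
plt.title("Correlated data, r = %.2f" % r)
```

A value of r close to 1 matches the upward-sloping band of points in the plot.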

Box and Whisker Plot

A box plot is an easy and effective way to read descriptive statistics. It summarizes the distribution of the data by displaying the minimum, first quartile, median, third quartile, and maximum in a single graph.

np.random.seed(10)
# Generate 4 different arrays and combine them in a list
u = np.random.normal(100, 10, 200)
v = np.random.normal(80, 30, 200)
w = np.random.normal(90, 20, 200)
x = np.random.normal(70, 25, 200)
data_to_plot = [u, v, w, x]
fig = plt.figure(figsize=(9, 6))
## Plot a box plot that shows the median, quartiles and range of each array.
# Pass patch_artist=True to plt.boxplot() to get fill color
bp = plt.boxplot(data_to_plot, patch_artist=True, labels=['A', 'B', 'C', 'D', ])
# change outline color, fill color and linewidth of the boxes
for box in bp['boxes']:
    # change outline color
    box.set(color='#7570b3', linewidth=2)
    # change fill color
    box.set(facecolor = '#1b9e77')
# change color and linewidth of the whiskers
for whisker in bp['whiskers']:
    whisker.set(color='#7570b3', linewidth=2)
# change color and linewidth of the caps
for cap in bp['caps']:
    cap.set(color='#7570b3', linewidth=2)
# change color and linewidth of the medians
for median in bp['medians']:
    median.set(color='#b2df8a', linewidth=2)
# change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)
plt.show()

If you haven’t seen a box plot before, here’s how to read the above plot. The bottom and top of the box mark the first-quartile and third-quartile values (i.e. the 25th and 75th percentiles). The line inside the box marks the median value. The ends of the whiskers mark the minimum and the maximum values (excluding the outliers). Any dots above or below the whiskers are the outlier data points.
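The quantities described above can also be computed by hand. The sketch below assumes Matplotlib's default rule, in which a point counts as an outlier when it lies more than 1.5 times the interquartile range beyond the box, and compares the result with matplotlib.cbook.boxplot_stats. The sample data and seed are arbitrary.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(10)
data = rng.normal(100, 10, 200)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are treated as outliers (default factor 1.5)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

# The statistics Matplotlib computes for the same data
stats = cbook.boxplot_stats(data)[0]
```

Here stats['whislo'] and stats['whishi'] should agree with whisker_low and whisker_high, and stats['fliers'] with the outliers.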

Area Plot

Area charts are used to represent cumulative totals, as numbers or percentages, over time. Since these plots are stacked by default, each series must contain either all positive or all negative values.

x = range(1,6)
# Set values for each line (4 lines in this example)
y = [ 
    [1, 4, 6, 8, 9],  
    [2, 2, 7, 10, 12],  
    [2, 8, 5, 10, 6],  
    [1, 5, 2, 5, 2],
]
# Setting figure size to 9x6 with dpi of 80
fig = plt.figure(figsize=(9,6), dpi=80)
# Stacked area plot
plt.stackplot(x, y, labels=['A','B','C','D'], alpha=0.8)
# Set location of legend
plt.legend(loc='upper left')
plt.show()
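Because the layers are stacked, the upper edge of each layer is the running sum of all the series up to and including it. A short sketch using the same numbers as above makes this explicit with np.cumsum; the black "total" line is an addition for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 6)
y = np.array([
    [1, 4, 6, 8, 9],
    [2, 2, 7, 10, 12],
    [2, 8, 5, 10, 6],
    [1, 5, 2, 5, 2],
])

# Row i of tops is the upper boundary of layer i in the stacked plot
tops = np.cumsum(y, axis=0)

fig = plt.figure(figsize=(9, 6), dpi=80)
layers = plt.stackplot(x, y, labels=['A', 'B', 'C', 'D'], alpha=0.8)
# The last row of tops is the overall total at each x position
plt.plot(x, tops[-1], color='black', linewidth=1, label='total')
plt.legend(loc='upper left')
```

The total line traces the top of the uppermost layer, which is the quantity an area chart is meant to emphasize.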

Pie Chart

Pie charts show the percentage or proportion of data. The percentage represented by each category is displayed on its corresponding slice of the pie. In Matplotlib, the slices are plotted counter-clockwise in the order given, as shown:

# Set keyword arguments
labels = 'Kenya', 'Tanzania', 'Uganda', 'Rwanda', 'Burundi'
sizes = [35, 30, 20, 10, 5]
explode = (0, 0.1, 0, 0, 0) # only "explode" the 2nd slice (i.e. 'Tanzania')
# Plot pie chart with the above set arguments
fig = plt.figure(figsize=(9, 6))
plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Above, autopct='%1.1f%%' says to display each percentage with one digit after the decimal point, and startangle=90 says that the first slice (Kenya) should start at an angle of 90 degrees (measured counter-clockwise from the positive x-axis).
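To see where the displayed percentages come from, the sketch below computes each slice's share of the total by hand and applies the same format string that is passed as autopct. It also reads the starting angle back from the first wedge. The sizes are the ones used above.

```python
import matplotlib.pyplot as plt

sizes = [35, 30, 20, 10, 5]
total = sum(sizes)

# Each slice's share of the whole, formatted the way autopct does it
expected = ['%1.1f%%' % (100 * s / total) for s in sizes]

fig = plt.figure(figsize=(9, 6))
# With autopct set, plt.pie returns the wedges, the label texts and the percentage texts
wedges, texts, autotexts = plt.pie(sizes, autopct='%1.1f%%', startangle=90)
shown = [t.get_text() for t in autotexts]
first_start = wedges[0].theta1  # angle, in degrees, where the first slice begins
```

The sizes do not need to add up to 100, since each slice is drawn in proportion to the total.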

Further Exploration

Here are a few resources if you want to explore further:

  • For more on Matplotlib: pyplot — Matplotlib documentation
  • Seaborn is built on top of matplotlib and allows you to easily produce prettier (and more complex) visualizations.
  • D3.js is a JavaScript library for producing sophisticated interactive visualizations for the web. Although it is not in Python, it is both trendy and widely used.
  • Bokeh is a newer library that brings D3-style visualizations into Python.
  • ggplot is a Python port of the popular R library ggplot2, which is widely used for creating “publication quality” charts and graphics. It’s probably most interesting if you’re already an avid ggplot2 user, and possibly a little opaque if you’re not.

Before wrapping up, I’ll leave you to ponder this quote from Antoine de Saint-Exupéry: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

