Hands-on Assignment: Data Visualization with Matplotlib

June 27, 2018

In this hands-on assignment, we’ll use the Matplotlib Python library to visualize a dataset. The dataset is a medical one that records patient measurements such as glucose and insulin levels, along with other metrics related to diabetes. The assignment serves two primary objectives: (a) practice Matplotlib on a realistic task, and (b) learn how to visualize and present a dataset.

Project Template on Google Colaboratory

Notebook Link

Work on this project directly in-browser via Google Colaboratory. The link above is a starter template that you can save to your own Google Drive and work on. Google Colab is a free tool that lets you run small Machine Learning projects through your web browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset linked below to Google Colab after saving the notebook to your own system.

Getting started

To get started, first download the dataset from this link: diabetes.csv. Open the file in your favorite text editor and have a look.

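First, we’ll import numpy, pandas and matplotlib. Then, we’ll load the dataset, clean it, and create a normalized copy. Cleaning here means treating a value of 0 in the measurement columns as missing, and normalizing means subtracting each column’s mean and dividing by its standard deviation.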

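import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('diabetes.csv')
# In these columns a value of 0 is treated as a missing measurement.
for column in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]:
    dataset[column] = dataset[column].replace(0, np.nan)
# Summary statistics and column types (print so they show up outside a notebook too).
print(dataset.describe())
dataset.info()
# Z-score every column, then turn Outcome back into a boolean label.
normalized = (dataset - dataset.mean()) / dataset.std()
normalized["Outcome"] = (normalized["Outcome"] > 0.0)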
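To see what the normalization step does, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only and are not taken from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for diabetes.csv (values are made up).
toy = pd.DataFrame({
    "Glucose": [90.0, 110.0, np.nan, 150.0],
    "BMI": [22.0, 30.0, 26.0, 34.0],
    "Outcome": [0, 0, 1, 1],
})

# Z-score each column; pandas skips NaN when computing mean and std,
# so a missing value simply stays missing after normalization.
toy_normalized = (toy - toy.mean()) / toy.std()

# Outcome is 0/1, so (with both classes present) its z-score is
# positive exactly for the rows where Outcome was 1.
toy_normalized["Outcome"] = toy_normalized["Outcome"] > 0.0

print(toy_normalized)
```

After this step every feature column has mean 0 and standard deviation 1, which is what lets us draw all features on one shared axis later.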

Task 1: Using a bar plot to compare feature values by Outcome

First, we’ll create a bar plot to compare the values each feature takes depending on whether or not the person has diabetes.

Creating the bar plot

The result should look like the following. (Throughout this tutorial, green = non-diabetic (safe) and red = diabetic (danger).)

Bar-plot of each feature (normalized) for non-diabetic (green) and diabetic (red) people

Some notes:

We’re using normalized values here, so that we can plot all variables in the same plot.
The bar heights are mean + 2.0, and the error bars are the standard deviation. The +2.0 offset is used because, on normalized data, a value about 2 standard deviations from the mean is commonly treated as an outlier or otherwise interesting data point.
The following snippet can be used to rotate the x-axis labels and make sure they don’t get cut-off.

plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.tight_layout()
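One possible way to put these notes together is sketched below. It is not the reference solution: the function name `barplot_sketch` is hypothetical, and the `demo` DataFrame is random data standing in for the `normalized` DataFrame built earlier.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for `normalized`; the numbers are random.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["Glucose", "BMI", "Age"])
demo["Outcome"] = rng.random(200) > 0.65

def barplot_sketch(normalized_df):
    features = [c for c in normalized_df.columns if c != "Outcome"]
    is_diabetic = normalized_df["Outcome"].astype(bool)
    x = np.arange(len(features))
    width = 0.4
    fig, ax = plt.subplots(figsize=(8, 5))
    groups = [
        (~is_diabetic, -width / 2, "green", "non-diabetic"),
        (is_diabetic, width / 2, "red", "diabetic"),
    ]
    for mask, offset, color, label in groups:
        values = normalized_df.loc[mask, features]
        # Bar height is mean + 2.0; the error bar is one standard deviation.
        ax.bar(x + offset, values.mean().to_numpy() + 2.0, width,
               yerr=values.std().to_numpy(), color=color, label=label)
    ax.set_xticks(x)
    ax.set_xticklabels(features)
    ax.legend()
    # Rotate the x-axis labels so they do not get cut off.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    fig.tight_layout()
    return fig, ax

fig, ax = barplot_sketch(demo)
```

In the assignment you would call the function on `normalized` instead of `demo`. Drawing the two groups side by side, rather than stacked, makes it easier to compare the error bars of the two classes.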

Reading the plot

The main conclusion we can draw from the plot is that although many of the variables are indicators of diabetic vs non-diabetic, none of them is a clear indicator on its own. This is because the two classes overlap significantly within 1 standard deviation.

Task 2: Histogram for each feature

Although the bar plots gave us a rough idea of what each feature looks like, we’d like to take a deeper look at each feature’s distribution among diabetic and non-diabetic people.

Our goal is to create histograms that look similar to the following.

Histogram of Glucose levels for non-diabetic (green) and diabetic (red) people

Some notes:

You’ll need to explicitly exclude data points where the feature value is missing.
A small code template is included below.

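features = dataset.columns[:-1]  # every column except Outcome
def histogram(feature):
    xx = ...  # TODO: values of `feature` for one class, missing values excluded
    yy = ...  # TODO: values of `feature` for the other class, missing values excluded
    fig, ax = plt.subplots()
    # do the plotting
    # plt.show()
    fig.savefig('histogram_%s.png' % feature)
for feature in features:
    histogram(feature)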
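As a hedged illustration of how the template could be filled in, the sketch below draws two overlaid histograms on made-up data. The name `histogram_sketch`, the `demo` DataFrame, the bin count, and the choice of `xx` for non-diabetic and `yy` for diabetic people are all assumptions, not part of the original template.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with a few missing readings (values are random).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=300),
    "Outcome": rng.integers(0, 2, size=300),
})
demo.loc[::25, "Glucose"] = np.nan

def histogram_sketch(df, feature, bins=20):
    # Exclude rows where the feature value is missing.
    valid = df.dropna(subset=[feature])
    xx = valid.loc[valid["Outcome"] == 0, feature]  # non-diabetic
    yy = valid.loc[valid["Outcome"] == 1, feature]  # diabetic
    # Shared bin edges keep the two histograms directly comparable.
    edges = np.histogram_bin_edges(valid[feature], bins=bins)
    fig, ax = plt.subplots()
    ax.hist(xx, bins=edges, color="green", alpha=0.5, label="non-diabetic")
    ax.hist(yy, bins=edges, color="red", alpha=0.5, label="diabetic")
    ax.set_xlabel(feature)
    ax.set_ylabel("count")
    ax.legend()
    return fig, ax, xx, yy

fig, ax, xx, yy = histogram_sketch(demo, "Glucose")
```

The semi-transparent bars (`alpha=0.5`) let both distributions remain visible where they overlap.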

Task 3: Scatterplot of pairs of correlated features

You may recall from the Pandas assignment that some pairs of features are highly correlated. Let’s draw scatterplots for these pairs of variables and see what the plots look like.

Scatterplot of BMI vs skin thickness. Green data points are non-diabetic people and red data points are diabetic people.

Scatterplot of insulin vs glucose. Green data points are non-diabetic people and red data points are diabetic people.

Scatterplot of pregnancies vs age. Green data points are non-diabetic people and red data points are diabetic people.

Some notes:

You’ll need to ignore data points where the value of either of the two variables is missing.
A small code template is included below.

pairs = [
  ('Pregnancies', 'Age'),
  ('Insulin', 'Glucose'),
  ('BMI', 'SkinThickness'),
]
def scatterplot(v1, v2):
    # TODO: drop rows where v1 or v2 is missing, then plot the two variables
    pass
for v1, v2 in pairs:
    scatterplot(v1, v2)
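A minimal sketch of one way to implement the scatterplot follows. The function name `scatterplot_sketch` and the random `demo` data are assumptions made for illustration; the real assignment would use `dataset` and the `pairs` list above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with missing values in both columns (values are random).
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "BMI": rng.normal(32, 6, size=100),
    "SkinThickness": rng.normal(29, 10, size=100),
    "Outcome": rng.integers(0, 2, size=100),
})
demo.loc[::10, "BMI"] = np.nan
demo.loc[5::10, "SkinThickness"] = np.nan

def scatterplot_sketch(df, v1, v2):
    # Keep only rows where both variables are present.
    valid = df.dropna(subset=[v1, v2])
    # One color per point: red for diabetic, green for non-diabetic.
    colors = np.where(valid["Outcome"] == 1, "red", "green").tolist()
    fig, ax = plt.subplots()
    ax.scatter(valid[v1], valid[v2], c=colors, s=12, alpha=0.6)
    ax.set_xlabel(v1)
    ax.set_ylabel(v2)
    return fig, ax, valid

fig, ax, valid = scatterplot_sketch(demo, "BMI", "SkinThickness")
```

Dropping a row when either value is missing matters here because a scatter point needs both coordinates.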

Reading the plots

BMI vs SkinThickness is the cleanest plot of the three. The points form a single, roughly elliptical cloud, which is the shape you would expect from a Gaussian-like distribution.

Glucose vs Insulin is also fairly clean, but notice that the points are much more densely packed in the lower region and more spread out at higher values.

Pregnancies vs Age is the ‘dirtiest’ plot of the three, mostly because Pregnancies takes a small number of discrete values.

Finding pairs of variables that are indicators of diabetic vs non-diabetic

Create scatter plots of glucose vs each feature, similar to glucose vs insulin, and see which plot gives the best separation between the green and red data points. The best feature will be the one that carries the most information about the outcome but isn’t strongly correlated with glucose.
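One way to run this comparison is to draw all the glucose plots side by side and show the correlation with glucose in each title. The sketch below is only an illustration: `glucose_grid` is a hypothetical helper and `demo` is random data, not the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: BMI is made to correlate with Glucose, Age is not.
rng = np.random.default_rng(3)
glucose = rng.normal(120, 30, size=200)
demo = pd.DataFrame({
    "Glucose": glucose,
    "BMI": 0.1 * glucose + rng.normal(20, 4, size=200),
    "Age": rng.normal(35, 10, size=200),
    "Outcome": (glucose + rng.normal(0, 20, size=200) > 130).astype(int),
})

def glucose_grid(df, target="Glucose"):
    others = [c for c in df.columns if c not in (target, "Outcome")]
    fig, axes = plt.subplots(1, len(others), figsize=(4 * len(others), 4),
                             squeeze=False)
    for ax, feature in zip(axes[0], others):
        valid = df.dropna(subset=[target, feature])
        colors = np.where(valid["Outcome"] == 1, "red", "green").tolist()
        ax.scatter(valid[feature], valid[target], c=colors, s=10, alpha=0.6)
        # Pearson correlation with the target, shown in the panel title.
        r = valid[feature].corr(valid[target])
        ax.set_title("%s (r=%.2f)" % (feature, r))
        ax.set_xlabel(feature)
        ax.set_ylabel(target)
    fig.tight_layout()
    return fig, axes

fig, axes = glucose_grid(demo)
```

A panel with a low correlation in its title but a visible split between green and red points is a good candidate for the "best" second feature.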

Task 4: Correlation Heatmap

In this final section, we’ll create a heatmap visualization of the pairwise correlations.

The following code can be used to create a heatmap. Go through the code and make sure you understand what’s going on in each line.

def heatmap(data, row_labels, col_labels):
    # Adapted from https://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html
    """
    Create a heatmap from a numpy array and two lists of labels,
    and save it to heatmap.png.
    Arguments:
        data       : A 2D numpy array of shape (N,M)
        row_labels : A list or array of length N with the labels for the rows
        col_labels : A list or array of length M with the labels for the columns
    """
    fig = plt.figure(figsize=(9, 9))
    ax = plt.gca()
    # Plot the heatmap
    im = ax.imshow(data, cmap="Wistia", interpolation="nearest")
    # Create colorbar
    ax.figure.colorbar(im, ax=ax, fraction=0.043, pad=0.04)
    # We want to show all ticks...
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_yticks(np.arange(data.shape[0]))
    ax.yaxis.tick_left()
    ax.xaxis.tick_bottom()
    # ... and label them with the respective list entries.
    ax.set_xticklabels(col_labels)
    ax.set_yticklabels(row_labels)
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    plt.tight_layout()
    # Turn spines off and create white grid.
    for edge, spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xticks(np.arange(data.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(data.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=2)
    ax.tick_params(which="minor", bottom=False, left=False, right=False, top=False)
    # plt.show()
    fig.savefig('heatmap.png')

Use the above function to create a pairwise correlation visualization. The output should look like the following:
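The input to the function is a correlation matrix, which pandas can compute with `DataFrame.corr()`. The sketch below shows the shape of that input on made-up data; the `demo` DataFrame is an assumption, and the final call is left commented out because it relies on the `heatmap` function defined above.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for `dataset` (values are random).
rng = np.random.default_rng(4)
glucose = rng.normal(120, 30, size=150)
demo = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 1.5 * glucose + rng.normal(0, 40, size=150),
    "Outcome": (glucose > 130).astype(int),
})

# Pairwise Pearson correlations; missing values are excluded pair by pair.
corr = demo.corr()
labels = list(corr.columns)
print(corr.round(2))

# With the heatmap() function from this section in scope:
# heatmap(corr.to_numpy(), labels, labels)
```

The matrix is square and symmetric with ones on the diagonal, so the same list of labels serves for both rows and columns.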

Visualization of pairwise correlations among features and outcome

Conclusion

If you came this far, good job! Data visualization is a very important skill for understanding large datasets and communicating results to others. Seeing the data is often the easiest way to understand it. Kudos!

Solution on Google Colaboratory

Notebook Link

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you’re unfamiliar with Google Colaboratory. Note that, for this project, you’ll have to upload the dataset to Google Colab after saving the notebook to your own system.