In this hands-on assignment, we’ll use the Matplotlib Python library to visualize a dataset. The dataset is a medical one containing patient measurements such as glucose and insulin levels, along with other metrics related to diabetes. The assignment has two primary objectives: (a) practice Matplotlib on a realistic task, and (b) learn how to visualize and present a dataset.
Work on this project directly in-browser via Google Colaboratory. The link above is a starter template that you can save to your own Google Drive and work on. Google Colab is a free tool that lets you run small Machine Learning projects through your web browser. If you’re unfamiliar with Google Colaboratory, read this 1-minute tutorial first. Note that, for this project, you’ll have to upload the dataset linked below to Google Colab after saving the notebook to your own system.
To get started, first download the dataset from this link: diabetes.csv. Open the file in your favorite text editor and have a look.
First, we’ll import numpy, pandas and matplotlib. Then, we’ll load the dataset, clean it, and also create a normalized dataset.
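import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('diabetes.csv')

# Zeros in these columns are treated as missing values.
for column in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]:
    bad = (dataset[column] == 0)
    dataset.loc[bad, column] = None

dataset.describe()
dataset.info()

# Z-score normalization: subtract each column's mean and divide by its standard deviation.
normalized = (dataset - dataset.mean()) / dataset.std()
# Outcome is binary, so its normalized value is positive exactly for diabetic people.
normalized["Outcome"] = (normalized["Outcome"] > 0.0)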
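A quick way to convince yourself that the cleaning and normalization behave as intended is to run the same steps on a tiny table. The sketch below uses made-up numbers in a hypothetical `toy` DataFrame (not the real diabetes.csv) and checks that missing values are counted correctly and that each normalized column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for diabetes.csv (values are illustrative only).
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 89.0, 137.0],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Same cleaning idea as in the tutorial: a zero is treated as "missing".
for column in ["Glucose", "BMI"]:
    toy[column] = toy[column].replace(0.0, np.nan)

# Number of missing entries per column after cleaning.
missing_per_column = toy.isna().sum()

# Z-score normalization; pandas' mean() and std() skip NaN by default.
toy_normalized = (toy - toy.mean()) / toy.std()

print(missing_per_column)
print(toy_normalized[["Glucose", "BMI"]].mean().round(6))
print(toy_normalized[["Glucose", "BMI"]].std().round(6))
```

Because pandas skips missing values when computing the mean and standard deviation, the missing entries simply stay missing in the normalized table.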
First, we’ll create a bar plot to compare the values each feature takes depending on whether or not the person has diabetes.
The result should look like the following. (Throughout this tutorial, green = non-diabetic (safe) and red = diabetic (danger).)
Bar-plot of each feature (normalized) for non-diabetic (green) and diabetic (red) people
- We’re using normalized values here, so that we can plot all variables in the same plot.
- The bar heights are mean + 2.0, and the error bars show the standard deviation. The +2.0 offset is used because, on normalized data, a value 2 standard deviations from the mean is commonly treated as an outlier or otherwise interesting data point.
- The following snippet can be used to rotate the x-axis labels and make sure they don’t get cut off.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
plt.tight_layout()
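The hints above can be combined into one possible grouped bar plot. This is a minimal sketch rather than the reference solution: the function name `barplot`, its parameters, and the random `demo` table are assumptions made for illustration, and in the assignment you would pass the `normalized` DataFrame instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def barplot(normalized, outcome_column="Outcome", offset=2.0):
    """Grouped bars of mean + offset per feature, with std as error bars."""
    features = [c for c in normalized.columns if c != outcome_column]
    is_diabetic = normalized[outcome_column].astype(bool)
    x = np.arange(len(features))
    width = 0.4

    fig, ax = plt.subplots(figsize=(8, 5))
    groups = [
        (-width / 2, ~is_diabetic, "green", "non-diabetic"),
        (+width / 2, is_diabetic, "red", "diabetic"),
    ]
    for shift, mask, color, label in groups:
        group = normalized.loc[mask, features]
        heights = group.mean().to_numpy() + offset  # NaN is skipped by mean()
        errors = group.std().to_numpy()
        ax.bar(x + shift, heights, width, yerr=errors,
               color=color, label=label, capsize=3)

    ax.set_xticks(x)
    ax.set_xticklabels(features)
    ax.set_ylabel("normalized mean + %.1f" % offset)
    ax.legend()
    # Rotate the labels and keep them inside the figure.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    plt.tight_layout()
    return fig, ax


# Made-up demo data standing in for the normalized dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Glucose", "BMI", "Age"])
demo["Outcome"] = rng.random(50) > 0.5
fig, ax = barplot(demo)
```

Placing the two bars for each feature side by side makes it easy to compare the classes feature by feature.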
The main conclusion we can draw from the plot is that although many variables are indicators of diabetic vs non-diabetic status, none of them is a clear indicator on its own. This is because the two classes overlap significantly within 1 standard deviation.
Although the bar plots gave us a rough idea of what each feature looks like, we’d like to take a deeper look at each feature’s distribution among diabetic and non-diabetic people.
Our goal is to create histograms that look similar to the following.
Histogram of Glucose levels for non-diabetic (green) and diabetic (red) people
- You’ll need to explicitly exclude data points where the feature value is missing.
- A small code template is included below.
features = dataset.columns[:-1]

def histogram(feature):
    xx =
    yy =
    # do the plotting
    # plt.show()
    fig.savefig('histogram_%s.png' % feature)

for feature in features:
    histogram(feature)
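One way the template could be filled in is sketched below. It assumes that `xx` and `yy` hold the feature values for non-diabetic and diabetic people respectively (the template does not say so), takes the DataFrame as an explicit argument instead of a global, and uses a made-up `demo` table so that it runs on its own.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def histogram(dataset, feature, outcome_column="Outcome", bins=20):
    """Overlaid histograms of one feature for the two outcome classes."""
    # Drop rows where this feature is missing.
    values = dataset[[feature, outcome_column]].dropna()
    is_diabetic = values[outcome_column].astype(bool)
    xx = values.loc[~is_diabetic, feature]  # assumed meaning: non-diabetic
    yy = values.loc[is_diabetic, feature]   # assumed meaning: diabetic

    # Shared bin edges so the two histograms are directly comparable.
    edges = np.histogram_bin_edges(values[feature].to_numpy(), bins=bins)

    fig, ax = plt.subplots()
    ax.hist(xx, bins=edges, color="green", alpha=0.5, label="non-diabetic")
    ax.hist(yy, bins=edges, color="red", alpha=0.5, label="diabetic")
    ax.set_xlabel(feature)
    ax.set_ylabel("count")
    ax.legend()
    return fig, ax


# Made-up demo data with 10 missing glucose values.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=100),
    "Outcome": rng.integers(0, 2, size=100),
})
demo.loc[:9, "Glucose"] = np.nan
fig, ax = histogram(demo, "Glucose", bins=10)
```

To save the figure as in the template, call `fig.savefig('histogram_%s.png' % feature)` on the returned figure.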
You may recall from the Pandas assignment that some pairs of features are highly correlated. Let’s draw scatterplots for these pairs of variables and see what the plots look like.
Scatterplot of BMI vs skin thickness. Green data points are non-diabetic people and red data points are diabetic people.
Scatterplot of insulin vs glucose. Green data points are non-diabetic people and red data points are diabetic people.
Scatterplot of pregnancies vs age. Green data points are non-diabetic people and red data points are diabetic people.
- You’ll need to ignore data points where the value of either of the two variables is missing.
- A small code template is included below.
pairs = [
    ('Pregnancies', 'Age'),
    ('Insulin', 'Glucose'),
    ('BMI', 'SkinThickness'),
]

def scatterplot(v1, v2):
    # do stuff

for v1, v2 in pairs:
    scatterplot(v1, v2)
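The scatterplot template could be completed along the following lines. As before, this is only a sketch under assumptions: the function takes the DataFrame explicitly, puts `v1` on the x-axis and `v2` on the y-axis (the template does not fix the orientation), and is demonstrated on made-up data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def scatterplot(dataset, v1, v2, outcome_column="Outcome"):
    """Scatter v1 against v2, colored by outcome (green / red)."""
    # Keep only rows where both variables are present.
    values = dataset[[v1, v2, outcome_column]].dropna()
    colors = np.where(values[outcome_column].astype(bool), "red", "green")

    fig, ax = plt.subplots()
    ax.scatter(values[v1], values[v2], c=colors, s=12, alpha=0.6)
    ax.set_xlabel(v1)
    ax.set_ylabel(v2)
    return fig, ax


# Made-up demo data: 5 rows miss BMI and 5 other rows miss SkinThickness.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "BMI": rng.normal(32, 6, size=60),
    "SkinThickness": rng.normal(29, 10, size=60),
    "Outcome": rng.integers(0, 2, size=60),
})
demo.loc[:4, "BMI"] = np.nan
demo.loc[5:9, "SkinThickness"] = np.nan
fig, ax = scatterplot(demo, "BMI", "SkinThickness")
```

Using a small marker size and some transparency helps when many points overlap, as they do in the denser regions of the real dataset.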
BMI vs SkinThickness is the cleanest plot of the three. The points form a roughly Gaussian cloud.
Glucose vs Insulin is also fairly clean, but notice that the values are much more densely packed in the lower region and more spread out at higher values.
Pregnancies vs Age is the ‘dirtiest’ plot of the three, mostly because pregnancies takes a small number of discrete values.
Create scatter plots for glucose vs each of the other features, similar to glucose vs insulin, and see which plot gives the best separation between the green and red data points. The best feature will be one that carries the most information about the outcome but isn’t strongly correlated with glucose.
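As a rough numeric companion to the visual comparison, one could tabulate how strongly each feature correlates with the outcome and with glucose. The sketch below is an assumption-laden shortcut, not part of the assignment: `rank_against_glucose` is a hypothetical helper, absolute Pearson correlation is only a crude proxy for "information about the outcome", and the `demo` data is synthetic.

```python
import numpy as np
import pandas as pd


def rank_against_glucose(dataset, outcome_column="Outcome", reference="Glucose"):
    """Absolute correlation of each feature with the outcome and with glucose."""
    corr = dataset.corr()
    others = [c for c in dataset.columns if c not in (outcome_column, reference)]
    table = pd.DataFrame({
        "corr_with_outcome": corr.loc[others, outcome_column].abs(),
        "corr_with_glucose": corr.loc[others, reference].abs(),
    })
    # Good candidates score high in the first column and low in the second.
    return table.sort_values("corr_with_outcome", ascending=False)


# Synthetic demo: Insulin closely tracks Glucose, Age is independent of it.
rng = np.random.default_rng(3)
n = 500
glucose = rng.normal(size=n)
age = rng.normal(size=n)
demo = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": glucose + 0.2 * rng.normal(size=n),
    "Age": age,
    "Outcome": ((glucose + age + 0.5 * rng.normal(size=n)) > 0).astype(int),
})
table = rank_against_glucose(demo)
print(table)
```

A table like this does not replace looking at the scatter plots, but it can help shortlist which plots deserve a closer look.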
In this final section, we’ll create a heatmap visualization of the pairwise correlations.
The following code can be used to create a heatmap. Go through the code and make sure you understand what’s going on in each line.
def heatmap(data, row_labels, col_labels):
    """
    Create a heatmap from a numpy array and two lists of labels,
    and save it to heatmap.png.

    Arguments:
        data       : A 2D numpy array of shape (N,M)
        row_labels : A list or array of length N with the labels for the rows
        col_labels : A list or array of length M with the labels for the columns
    """
    # Adapted from https://matplotlib.org/examples/images_contours_and_fields/interpolation_methods.html
    fig = plt.figure(figsize=(9, 9))
    ax = plt.gca()

    # Plot the heatmap
    im = ax.imshow(data, cmap="Wistia", interpolation="nearest")

    # Create colorbar
    ax.figure.colorbar(im, ax=ax, fraction=0.043, pad=0.04)

    # We want to show all ticks...
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_yticks(np.arange(data.shape[0]))
    ax.yaxis.tick_left()
    ax.xaxis.tick_bottom()
    # ... and label them with the respective list entries.
    ax.set_xticklabels(col_labels)
    ax.set_yticklabels(row_labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    plt.tight_layout()

    # Turn spines off and create white grid.
    for edge, spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xticks(np.arange(data.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(data.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=2)
    ax.tick_params(which="minor", bottom=False, left=False, right=False, top=False)

    # plt.show()
    fig.savefig('heatmap.png')
Use the above function to create a pairwise correlation visualization. The output should look like the following:
Visualization of pairwise correlations among features and outcome
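To feed the heatmap function, you need the correlation matrix as a 2D array plus matching row and column labels. A minimal sketch of that preparation step follows; `correlation_inputs` is a hypothetical helper name, and the small `demo` table is made up so the snippet runs without diabetes.csv.

```python
import numpy as np
import pandas as pd


def correlation_inputs(dataset):
    """Return (matrix, row_labels, col_labels) for a pairwise correlation heatmap."""
    # DataFrame.corr() computes pairwise Pearson correlations and
    # leaves out missing values pair by pair.
    corr = dataset.corr()
    labels = list(corr.columns)
    return corr.to_numpy(), labels, labels


# Made-up demo data with one missing value.
demo = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})
data, row_labels, col_labels = correlation_inputs(demo)
print(row_labels)
print(data.round(2))

# In the assignment, the result would then be passed on, for example:
# heatmap(*correlation_inputs(dataset))
```

The matrix is symmetric with ones on the diagonal, so the heatmap is mirrored across its diagonal.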
If you came this far, good job! Data visualization is a very important skill for understanding large datasets and communicating results to others. Seeing the data is often the easiest way to understand it. Kudos!