May 23, 2019

Introduction

Correlation analysis can help us understand whether, and how strongly, a pair of variables are related.

In data science and machine learning, this can help us understand relationships between features/predictor variables and outcomes. It can also help us understand dependencies between different feature variables.

For example:

1. How strong is the correlation between mental stress and cardiac issues?
2. Is there a correlation between literacy rate and frequency of criminal activities?

This tutorial will help you learn the different techniques and approaches used to understand correlations that exist between features in any dataset.

Correlation coefficient

In correlation analysis, we also estimate a sample correlation coefficient between a pair of variables. Correlation coefficient of a pair of variables helps us understand how strongly one variable changes with respect to another.

Correlation coefficient ranges between a value of -1 to 1.

The graphs below show different pairs of variables with different correlation coefficient $\rho$:

Graphs of different variables with different Correlation Coefficients. [Source: Wikipedia]

The magnitude signifies the strength of relationship between the variables.

• So, a value of of -1 or +1 indicates that the two variables are perfectly related.
• A value of 0 indicates that the variables have no relation.

The sign (positive / negative) signifies the direction of relationship:

• A positive value indicates that higher values of one variable are accompanied with higher values of the other variable
• A negative value indicates that higher values of one variable are accompanied with lower values of the other variable

Types of Correlation Coefficients

The two most common types of correlation coefficient are Pearson’s and Spearman’s.

Pearson’s correlation coefficient is a measure of linear correlation between two variables. The graphs you saw above show Pearson’s correlation coefficient.

Spearman’s rank correlation coefficient assesses monotonic relationships, irrespective of whether it is linear or not. That is to say, it is only interested in knowing that an increase in one variable causes an increase in another variable.

Let us take an example to understand the difference:

Example of a non-linear monotonically related pair of variables. [Source: Wikipedia]

Here we can see that when there is an increase in X, there is always an increase in Y. However, the increase is not linear.

Thus Pearson’s correlation is 0.88 while Spearman’s correlation is a perfect 1.

Also note that correlation is a symmetric relationship. That is to say, if A correlates with B, then B correlates with A.

Correlation vs Causation

Before we proceed further, we should clarify that although correlation shows relationships between variables and it helps in predictive analysis, it does not imply causation.

For example: The sale of Sunglasses and the sale of ice-cream are highly correlated. But an increase in ice-cream sales does not cause an increase in the sales of sunglasses, or vice-versa. In this case, they are correlated because they both depend on a third independent variable, which is how hot / sunny it is.

Limitations of Correlation Coefficient

Although correlation coefficients give us an idea about the strength of the relationship between two variables, it is not possible for a single number to give us the full picture.

Importance of visualisation

Since correlation coefficients can only give us limited information, visualisation can be of great aid in understanding the relationship between variables.

For example, the following graphs show the famous example Anscombe’s Quartet. It consists of four data sets, where the correlation coefficient is exactly the same, even though the data sets are very different from one another.

Anscombe’s Quartet [Source: Wikipedia]

Set-up

For the rest of this tutorial (and the next tutorial as well), we will be see how to use visualization to understand the relation between two variables. To do this, we will we using the seaborn Python library and a couple of datasets.

Importing

We will be using the standard data science libraries — NumPy, Pandas, Matplotlib and Seaborn. So let’s start by importing them.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# change pandas display options
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)

We’ve also asked pandas to increase the display width to 200 characters, and the maximum number of columns it should display to 20. This will make sure we can view our data properly.

Datasets

We will be using the wine dataset and the tips dataset available on CommonLounge. We’ll load these datasets using the load_commonlounge_dataset() function, which returns a DataFrame.

Let’s first load the wine dataset (we’ll load a modified version for this tutorial):

wine_data = load_commonlounge_dataset('wine_v2')

Let’s also load the tips dataset:

tips_data = load_commonlounge_dataset('tips')

Below, we’ve included a brief description of these two datasets.

Wine Dataset

This dataset is the result of a chemical analysis of wines grown (in the same region) in Italy.

The following are the variables in the dataset:

• Wine: This is the target variable to be predicted. It is a categorical variable divided into a set of three classes denoting three different types of wines. The classes are labelled as 1, 2 and 3.

All other attributes are continuous numerical variables:

• Alcohol: alcohol content
• Malic.acid: one of the principal organic acids
• Ash: inorganic matter left
• Acl: the alkalinity of ash
• Mg: magnesium
• Phenols: Phenols
• Flavanoids: particular type of phenol
• Nonflavanoid.phenols: particular type of phenol
• Proanth: particular type of phenol
• Color.int: color intensity
• Hue: hue of a wine
• OD: protein content measurements
• Proline: an amino acid

Let’s take a glimpse of the first few instances in the dataset using the head() function:

print(wine_data.head())

The data is taken from UCI Machine Learning Repository.

Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Tips Dataset

In the tips dataset contains information about the tips collected at a restaurant.

Here’s a description of the attributes:

• tip: (continuous) Tips paid. The target variable to be predicted.
• total_bill : (continuous) Total bill
• sex: (categorical) Male / Female
• smoker : (categorical) smoker
• day : (categorical) day of the week
• time: (categorical) Dinner / Lunch
• size: (integer) size of the party

This data is taken from Seaborn’s data repository.

Again, let’s take a glimpse of the first few instances in the dataset:

print(tips_data.head())

Questions dataset

We will also use a third dataset, which we will only use for the exercises in this article. For now, we will load it and keep it aside.

This is a synthetic dataset composed of 5 feature variables called feature1, feature2, feature3, feature4 and feature5. It has two target variables called target1 and target2.

qns = load_commonlounge_dataset('corr_qns')

Correlation analysis between a pair of Numerical variables

To explore correlation between two numerical variables, we will use two different kinds of plots: Scatter plots and Hexbin plots.

The tips dataset has a output variable which is continuous and some input variables which are also continuous. Let us plot them against each other and see if there is a trend.

Scatter Plots

We’ll start by plotting the tips variable against the totalbill_ variable.

To do so, we will use the jointplot() function from the seaborn library.

Here’s the syntax for the function:

sns.jointplot(xdata, ydata)

Let’s give it a try:

sns.jointplot(tips_data["total_bill"], tips_data["tip"])
plt.show()

As you can see, apart from plotting a scatter plot, the jointplot() function also plots the univariate distribution of x and y on the x and y axis.

Analysis:

• We can infer from this graph that there is indeed a positive correlation between tip and total_bill. However, it looks like the correlation is stronger for lower values of total_bill than the higher values.
• For lower values (between about 0 to 20), the values of tip rises steadily with an increase in value of total_bill.
• But as the value of total_bill increases (between 30 to 50), we see the variance in tip increasing. Some values of tip are high for high total_bill while some values of tip are very low. So the correlation at this stage is not very strong.
• We can also see that total_bill is more skewed than tip, which is relatively more symmetric, centered roughly around 3 dollars.

These insights would have been hard to extract without visualisation.

Hexbin plots

As we saw in the graph above, the data-points in a scatter plot may overlap each other. This makes it harder to gauge the density of points in certain regions of the plot.

The hexbin plot helps overcome this problem.

Let us use the same variables to plot a hexbin plot. We will use the same jointplot() function, but this time we will pass the kind parameter the value "hex". Here’s the syntax:

sns.jointplot(xdata, ydata, kind="hex")

Note : The default argument for kind is "scatter". That is why it plots a scatter plot when no argument is passed to this parameter.

sns.jointplot(tips_data["total_bill"], tips_data["tip"], kind="hex")
plt.show()

Hexbin plots are equivalent of bar graphs, but for a pair of variables. It creates hexagonal bins, dividing the data in both x and y axis into intervals. The darker the colour of the hexagon, the higher is the frequency in that interval.

Analysis

• As we can again see, there is a strong cluster of points at the lower values of tips and total_bill, while the data is much more spread out at the higher values.
• This means that the tip is very likely to be $2-$4 if the total_bill is between $10 and$20. But if the total_bill is between $40 and$50, the tip could range anywhere between $3 and$10, all values being more or less equally likely.

Seaborn Syntax

So far for plotting, we have been using the following syntax:

sns.jointplot(xdata, ydata)

Since we have a dataframe with all the data, this usually translates to:

sns.jointplot(df["x column name"], df["y column name"])

When the x and y data are stored in the same DataFrame, seaborn also supports another syntax, which is:

sns.jointplot(x="x column name", y="y column name", data=df)

These two syntax are equivalent, and all the seaborn functions discussed in this tutorial support both.

Correlation analysis between a Numerical and a Categorical variable

To explore correlation between a categorical and a numerical variable, we will use Strip plots, Swarm plots, Boxplots and Violin Plots. All the these plots will be drawn using functions from the seaborn library.

For this section, we will use both the datasets.

In the wine dataset, the output variable Wine is categorical, while all the input features are numerical.

In the tips dataset, the output variable tips is numerical, while the input variables are categorical.

Let us explore these cases visually and see if we can draw some inferences.

Strip Plots

Strip plots are similar to scatter plots, but for the situation when one of the variables is categorical.

We will be using the stripplot() function. The syntax is:

sns.stripplot(x="column name", y="column name", data=DataFrame)

Let us plot the output variable tip against the input variable time from tips_data.

sns.stripplot(x="time", y="tip", data=tips_data)
plt.show()

Although we can see the data points, they are not very clear because so many of them overlap. This makes it difficult to gauge the density of the data points.

Solution: We can fix this by using the parameter called jitter.

jitter can take in float value for the amount of jitter, or take a boolean value where True or 1 has a default value of jitter associated with it, and False or 0 means no jitter.

The default value of jitter is True.

Let’s re-draw the plot with the added parameter:

sns.stripplot(x="time", y="tip", data=tips_data, jitter=0.25)
plt.show()

The plot looks much better now.

Analysis

• Although the difference isn’t drastic, it looks like dinner tips on average are higher than lunch tips.
• We can also see that the number of data points is more for dinner.

Strip Plot for wine dataset

Let us now plot the output variable Wine against the input variable Alcohol from wine_data.

sns.stripplot(x="Alcohol", y="Wine", data=wine_data, jitter=0.25)
plt.show()

We can see a big problem in the graph.

Since we plotted the categorical variable on the y-axis and our categorical labels are integers, the stripplot() function is not able to infer which variable is actually categorical, and what is the orientation.

Solution: To fix this, we can use the orient parameter to explicitly tell the function to plot the graph horizontally.

Valid values for orient are "v" (orient vertically) and "h" (orient horizontally).

Let us re-draw the plot with the added parameter:

sns.stripplot(x="Alcohol", y="Wine", data=wine_data, orient="h", jitter=0.25)
plt.show()

That’s much better!

Analysis:

We can very clearly see that there is a relationship between Alcohol levels and category of Wine.

1. Wines of category 2 have much lower Alcohol levels (average around 12%).
2. Wines of category 1 have the highest Alcohol levels (average around 13.5% - 14%).
3. And wines of category 3 are somewhere in the middle. (average Alcohol level around 13% - 13.5%).

Swarm plots

One of the major problems with strip plots was the overlap of data points, which made it difficult to understand the density of points in some places.

Swarm plots are similar to strip plots, but the points are adjusted so that they don’t overlap.

We will be using the swarmplot() function. The syntax is:

sns.swarmplot(x="column name", y="column name", data=DataFrame)

Let us plot the variables tip against day from tips_data:

sns.swarmplot(x="day", y="tip", data=tips_data)
plt.show()

We can see, the plots are much cleaner than the strip plots. The breadth of the plots gives us a good sense of where most of the data points lie.

Analysis

We can draw the following inferences:

1. Majority of the data seems to be from the weekend.
2. On average, tips on weekends seem to be higher than tips on weekdays.
3. We can see a lot of points on the same horizontal line. That’s probably because people like paying tips in round values such as $2,$3 , $4 or$2.50, \$3.50 and so forth.

Boxplots

You should have already come across boxplots before in a separate tutorial. But in case you skipped it, we’ve provided a short recap.

Here’s a sample boxplot for a toy data set having just 10 observations – 144, 147, 153, 154, 156, 157, 161, 164, 170, 181.

Box plot for the toy dataset

A box plot has the following components:

• 1st Quartile (25th percentile)
• Median (2nd Quartile or 50th percentile)
• 3rd Quartile (75th percentile)
• Interquartile Range (IQR – difference between 3rd and 1st Quartile)
• Whiskers — marks the lowest data point which lies within 1.5 IQR of the 1st quartile, and the highest data point which lies within 1.5 IQR of the 3rd quartile
• Outliers — any data points beyond 1.5 IQR of the 1st or 3rd quartile, i.e. values which are greater than 3rd Quartile + 1.5 * IQR or less than 1st Quartile – 1.5 * IQR.

In other words, whiskers extend from the quartiles to the rest of the distribution, except for points that are determined to be “outliers”.

The syntax for plotting a boxplot is:

sns.boxplot(x="column name", y="column name", data=DataFrame, orient="orientation")

The orient parameter is used exactly for the same reason as we saw in the Swarm plot section — that is, to explicitly provide information about which variable is categorical.

Let us plot a graph between Proline and Wine from wine_data:

sns.boxplot(x="Proline", y="Wine", data=wine_data, orient="h")
plt.show()

Analysis:

As we can see, the plot gives us a good view of the relation between proline and wine category, without actually plotting the individual data points.

1. For category 1 Wine, the median Proline level is higher than any value of Proline from category 2 and 3 Wine. Thus high values of Proline directly imply that the wine is from category 1.
2. For low levels of Proline means, the overlap between category 2 and 3 is quite high, so we can’t say much conclusively about them from this plot.

Violinplots

Violinplots are similar to boxplots, but they have the added advantage that it also shows the actual underlying distribution of the data.

Here’s what a sample violin plot looks like:

Note: Violin plot displays an estimate of the distribution of data using a method called kernel density estimation. This estimation procedure is influenced by the sample size, and violins for relatively small samples might look misleadingly smooth.

The syntax is as follows:

sns.violinplot(x="column name", y="column name", data=DataFrame, orient="orientation")

Let us plot again plot the Proline variable with Wine, but this time it’s a violinplot:

sns.violinplot(x="Proline", y="Wine", data=wine_data, orient="h")
plt.show()

Analysis

1. From this plot, we could make all the conclusions we made earlier from the boxplot.
2. In addition, we can see the Proline levels for category 2 vs category 3 Wines much more clearly, and although there is an overlap, the modes are slightly separated.

Ending note

The orient parameter can be used for all the plots mentioned in this section - Strip plots, Swarm plots, Box plots and Violin plots.

Summary

• Correlation coefficient (ranges from -1 to +1) tells us about the relationship between two variables.
• Its magnitude signifies how strongly related the variables are.
• Its sign signifies the direction of relationship (positive vs negative).
• Scatterplots and hexbin plots visualize two numerical variables.
• Stripplots and swarmplots visualize a numerical variable with a categorical variable.
• Boxplots and violinplots also visualize a numerical variable with a categorical variable, but they don’t plot every data point, and instead display quartiles, density, etc.
• orient parameter is used to explicitly specify which variable is categorical.

Reference

Two numerical variables:

Scatterplot:

sns.jointplot(xdata, ydata)

Hexbin plots:

sns.jointplot(xdata, ydata, kind="hex")

One numerical variable and one categorical variable:

Stripplot:

sns.stripplot(xdata, ydata, jitter=float, orient="h" or "v")

Swarmplot:

sns.swarmplot(xdata, ydata, orient="h")

Boxplot:

sns.boxplot(xdata, ydata, orient="h")

Violinplot:

sns.violinplot(xdata, ydata, orient="h")