# Correlation Analysis: Multivariable [Under Construction]

May 23, 2019

Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.

# Introduction

Correlation analysis is a statistical evaluation method used to study the strength of the relationship between two numerical variables. This type of analysis is useful when we want to check whether there is any positive or negative association between the variables.

# Setup

We will start by loading the `wine_v2`, `tips` and `questions_data` datasets.

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
wine_data = load_commonlounge_dataset('wine_v2')
tips_data = load_commonlounge_dataset('tips')
qns = load_commonlounge_dataset('content-creators/swarnabha/questions_data')
```

Let’s set the pandas display width to 200 characters, and the maximum number of columns to display to 20.

```
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)
```

# Correlation Matrix and Heatmap

Let’s dive into the `Wine` dataset.

`wine_data.info()`

There are 13 numerical variables. If we were to study the correlation between every pair, there would be 78 correlation coefficients, one for each pair!

${{13}\choose{2}} = 78$ (combinations)
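As a quick sanity check on this count, the number of unordered pairs can be computed directly. This is a small sketch using only the Python standard library; the variable names are illustrative and not part of the tutorial's code.

```python
from itertools import combinations
from math import comb

n_variables = 13

# Closed form: "13 choose 2"
n_pairs = comb(n_variables, 2)

# The same count, obtained by enumerating every unordered pair of column indices
pairs = list(combinations(range(n_variables), 2))

print(n_pairs, len(pairs))
```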

Instead of listing out all the correlation coefficients (between every pair of variables) separately, we can represent them by a **correlation matrix**.

## Correlation Matrix

An $n \times n$ correlation matrix can be formed for a set of $n$ numerical variables named $X_1, X_2, \dots, X_n$, such that the $(i,j)$ element of the matrix is the correlation coefficient between $X_i$ and $X_j$.

Hence, we can form a $13 \times 13$ correlation matrix.
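To make the definition concrete, here is a minimal sketch on a hypothetical toy array with three variables. It uses NumPy's `np.corrcoef` rather than the wine data, and checks that element $(i,j)$ matches the pairwise coefficient, that the diagonal is 1, and that the matrix is symmetric.

```python
import numpy as np

# Hypothetical toy data: five observations (rows) of three variables (columns)
X = np.array([
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 3.0],
    [3.0, 4.0, 4.0],
    [4.0, 3.0, 1.0],
    [5.0, 5.0, 2.0],
])

# 3 x 3 correlation matrix; rowvar=False treats each column as a variable
R = np.corrcoef(X, rowvar=False)

# Element (0, 1) equals the coefficient computed from columns 0 and 1 alone
r_01 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(R[0, 1], r_01)

# Each variable is perfectly correlated with itself, and R equals its transpose
print(np.allclose(np.diag(R), 1.0), np.allclose(R, R.T))
```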

To form the matrix, we will call the `corr()` method of the pandas DataFrame object.

This is the syntax:

`correlation_matrix = DataFrame.corr(method="") `

For the `method` parameter, we pass one of the following arguments to decide how the correlation coefficient is calculated:

- `"pearson"`: Pearson’s product-moment correlation coefficient
- `"spearman"`: Spearman’s rank correlation coefficient
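The difference between the two methods shows up when a relationship is monotonic but not linear. The sketch below uses a hypothetical toy DataFrame (not one of the tutorial's datasets) in which `y` is the cube of `x`.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
demo["y"] = demo["x"] ** 3

pearson_r = demo.corr(method="pearson").loc["x", "y"]
spearman_r = demo.corr(method="spearman").loc["x", "y"]

# Spearman works on ranks, so a perfectly monotonic relationship scores 1;
# Pearson measures linear association, so it comes out below 1 here.
print(pearson_r, spearman_r)
```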

Before we create the correlation matrix, we make a copy of the wine dataset with only the numerical input variables, by dropping the categorical variable `Wine` with the pandas method `DataFrame.drop(labels="", axis=1)`.

```
# creating a subset of Wine DataFrame with only numerical variables.
wine_data_num = wine_data.drop(labels=["Wine"], axis=1)
# creating correlation matrix
corr_matrix = wine_data_num.corr(method="pearson")
# display correlation matrix
print(corr_matrix)
```

As you can see, although the information here is useful, it is difficult to infer anything from this matrix and analyse the information.

It would be very cumbersome to even recognise which variables are negatively correlated and which are positively correlated!
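One way to ease this by hand is to flatten the matrix into a sorted list of pairs. The following is a sketch under assumptions: `toy` is a hypothetical synthetic DataFrame standing in for `wine_data_num`, and the approach (keeping the upper triangle, then sorting) is one of several possible ones.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data standing in for the numerical wine columns
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.3, size=200),   # strongly positive with a
    "c": -base + rng.normal(scale=0.3, size=200),  # strongly negative with a
    "d": rng.normal(size=200),                     # roughly unrelated
})
toy_corr = toy.corr(method="pearson")

# Keep only the cells strictly above the diagonal, so each pair appears once
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).stack().dropna()

# Sort from most negative to most positive correlation
ranked = pairs.sort_values()
print(ranked)
```

Reading the top and bottom of `ranked` gives the strongest negative and positive pairs without scanning the full matrix.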

## Heatmaps

To take care of this we will use a visual tool called a heatmap from the **seaborn** library.

The `heatmap()` function will create a two-dimensional graphical representation of data where the individual values contained in a matrix are mapped to colours.

A colourmap is used for this purpose, where a continuous spectrum of colours represents the numerical values of the correlation coefficient.

We will use this function to create a heatmap, by passing the correlation matrix as the argument. Let’s use the parameters `linewidths`, `vmin` and `vmax`.

`sns.heatmap(corr_matrix, linewidths=0, vmin=None, vmax=None)`

**Parameters:**

- `linewidths`: draws lines separating every cell. The argument defines the width of the lines. Accepts float values; the default is 0.
- `vmin`, `vmax`: values of the correlation coefficient that anchor the colourmap. If nothing is passed, the limits are inferred from the matrix. Accept float values. (For our purpose here, we want the colourmap to extend from -1 to +1.)

Since our correlation matrix is a DataFrame object, the heatmap function uses the Index/Column names to label the columns and rows.

Let us plot and see how it looks.

```
sns.heatmap(corr_matrix, linewidths=0.2, vmin=-1, vmax=1)
plt.show()
```

Observe the colour coding on the right side of the plot according to a colourmap.

The X-axis and Y-axis labels have been arranged according to the column names from our dataset.

Notice the extremely light and extremely dark colours and recognise which variables have very high positive and negative correlations!

**Analysis**:

We can see that all the diagonal elements have extremely light colours. That’s because they contain the correlation coefficient of each variable with itself, which is 1.

- We can also see light colours, which imply high positive correlation, between `OD` and `Flavanoids`, and between `Phenols` and `Flavanoids`.
- We can also see dark colours, which imply high negative correlation, between `Nonflavanoid.phenols` and `Flavanoids`, and between `Hue` and `Malic.acid`.

Note: You can also see that the cells above and below the diagonal mirror each other. This is due to the symmetry of the correlation matrix: the correlation between $X_i$ and $X_j$ is the same as that between $X_j$ and $X_i$, so every pair of labels appears on both sides of the diagonal.
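Because one half of the heatmap repeats the other, it can be hidden. The following is a minimal sketch using the `mask` parameter of `sns.heatmap()`; the toy DataFrame and its column names are hypothetical stand-ins, not the wine data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for wine_data_num
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
toy_corr = toy.corr()

# True marks a cell to hide: everything strictly above the diagonal
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

ax = sns.heatmap(toy_corr, mask=mask, linewidths=0.2, vmin=-1, vmax=1)
plt.show()
```

Only the diagonal and the lower triangle are drawn, so each pair of variables appears once.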

**Other parameters:**

You can also use the parameter `cmap` to modify the plot.

This parameter controls the mapping from data values to colour space. For example, you can pass `"RdYlGn"` as an argument to get a `Red-Yellow-Green` colourmap.

```
sns.heatmap(corr_matrix, linewidths=0.2, vmin=-1, vmax=1, cmap="RdYlGn")
plt.show()
```

# PairGrid plots

Although correlation coefficient plays a central role in correlation analysis, the information it provides is often not enough.

As explained earlier, just from the correlation coefficient we won’t get to know quite a lot about the relationship between the variables.

There is a way to **visualise** the relationship between all pairs of variables using instances of the `PairGrid` class.

`PairGrid` allows us to draw multiple plots within a single figure. Thus we can draw scatterplots between each pair of variables within a single figure. An instance of the `PairGrid` class will be our multi-plot figure.

## `pairplot()` function

We will draw these figures using the function `pairplot()`, which returns a `PairGrid` instance.

The figure consists of rows and columns of subplots, one for each pair of variables.

This is the syntax:

`PairGrid_object = sns.pairplot(data, vars=column_features)`

- Here `data` refers to the DataFrame containing all the variables.
- `vars` refers to the list of columns we want to show in our plot.

We will use only 5 selected numerical features from the `wine_data` DataFrame, because using all 13 features would make the figure visually crowded and hinder our analysis.

Note: The `pairplot()` function and the `PairGrid` class can only handle `tidy` data, i.e., a DataFrame where each column is a variable and each row is an observation.
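If the data arrives in a different shape, it needs reshaping first. As an illustration only, the sketch below takes a hypothetical long-form table (one row per sample and measurement, with made-up values) and uses `DataFrame.pivot` to give each variable its own column.

```python
import pandas as pd

# Hypothetical long-form table with made-up values:
# one row per (sample, measured variable) combination
long_form = pd.DataFrame({
    "sample": [1, 1, 2, 2, 3, 3],
    "variable": ["Alcohol", "Hue", "Alcohol", "Hue", "Alcohol", "Hue"],
    "value": [14.2, 1.04, 13.2, 1.05, 13.1, 1.03],
})

# Reshape so that each variable becomes a column and each sample a row
tidy = long_form.pivot(index="sample", columns="variable", values="value")
tidy = tidy.reset_index(drop=True)
tidy.columns.name = None

print(tidy)
```

The resulting `tidy` DataFrame has the one-column-per-variable layout described in the note.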

Let us plot our figure!

```
# list of column names to use
c=["Alcohol", "Phenols", "Flavanoids", "Proanth", "OD"]
# plot the figures
ax = sns.pairplot(wine_data, vars=c)
plt.show()
```

## Explanation

The variable name at the left of each row gives the variable on the Y-axis for that row.

The variable name at the bottom of each column gives the variable on the X-axis for that column.

As we can see, this figure gives a lot of insight into the data!

**Diagonals**

Notice that the diagonal and off-diagonal subplots look very different. Every off-diagonal subplot has a different variable on the X-axis and the Y-axis, so the scatterplots give meaningful visuals.

The diagonal subplots, on the other hand, have the same variable on the X-axis and the Y-axis. If we were to plot a variable against itself, all the points would lie on a straight line with a $45^\circ$ slope (a correlation coefficient of 1). So a histogram of each variable is drawn along the diagonal instead.

You can also see that the subplots above and below the diagonal mirror each other: each pair of variables appears twice, with the X and Y axes swapped. This is due to the symmetry in the way the whole figure is laid out.

**Analysis**

Pair plots help us draw meaningful conclusions:

- We can see from the first few plots that `Alcohol` seems to have a more or less symmetric distribution, and that it has very little correlation with any other variable in our plot.
- From the second row of plots we can see that `Phenols` has a very strong positive correlation with `Flavanoids`, and weaker positive correlations with `Proanth` and `OD`.

## Other parameters

We can use the parameter `hue` to further improve the subplots.

The `hue` parameter takes a categorical variable and plots the data points in different colours according to the labels in that variable.

For example, let us pass the target variable `"Wine"` as an argument to the `hue` parameter and see how the figure improves.

```
# list of column names to use
c=["Alcohol", "Phenols", "Flavanoids", "Proanth", "OD"]
ax = sns.pairplot(wine_data, vars=c, hue="Wine")
plt.show()
```

# FacetGrid Plots

Let us now look at another interesting type of plot that helps us in correlation analysis.

We have seen how the `hue` parameter helps us analyse the relationship between a pair of variables while distinguishing the points according to a third (categorical) variable.

But what if we wanted to further distinguish the points with respect to other categorical variables in the data?

For example, in the `tips` dataset, we might want to distinguish the points according to `sex` (Male, Female) and `time` (Lunch, Dinner) in the same figure!

Let’s construct plots that allow us to visualise the relationship between multiple variables separately within subsets of our dataset.

We will do so using plots created by `FacetGrid` objects.

## `relplot()` function

We will use the function `relplot()`, which returns an instance of the `FacetGrid` class and plots the desired figure.

Unlike `PairGrid` plots, in `FacetGrid` plots the **horizontal and vertical axes denote the same variable in every subplot**. In each subplot we can differentiate between the subsets of the data by passing arguments to the relevant parameters.

This is the syntax:

`FacetGrid_Object = sns.relplot(data=data, x="", y="", hue="")`

**Parameters:**

- In the `data` parameter, we pass the DataFrame as the argument.
- In the `x` and `y` parameters, we pass the column labels from the DataFrame to be plotted on the x-axis and y-axis respectively.
- In the `hue` parameter, we pass the name of the column whose values determine the colour of each point.

Let us use the `relplot()` function to explore the `tips` dataset. We will analyse the relationship between the `total_bill` and `tip` variables.

Let’s plot a single subplot where the colour (hue) is determined by the column `day`:

```
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", hue="day")
plt.show()
```

As we can see, although the `hue` parameter gives more information, it is hard to understand much from this graph.

Let us see if we can separate the points according to the values of the variable `time`.

## `col` parameter

Let us draw separate subplots, side by side in a row, each containing the points for one label of `time`.

For this we will use the parameter `col` and pass the argument `"time"`.

`ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", hue="day")`

I will explain the parameters after the figure.

```
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", hue="day")
plt.show()
```

Note: Since we are analysing the relationship between the variables `total_bill` and `tip`, the x-axis will denote `total_bill` and the y-axis will denote `tip` in all subplots drawn by the `relplot()` function.

Think of this as a table of subplots, with **one row** and **two columns**.

- The subplot in the first column only has data points from “time=Dinner” and the second column has data points from “time=Lunch”.
- Thus each column shows a different *facet*, according to the **argument (variable) passed to the** `col` parameter.

## `row` parameter

Let us see if we can improve it further.

Let us add more rows of subplots, where every row will contain the data points for one value of the `sex` variable.

We will use a parameter called `row` and pass the name of the variable, `"sex"`, as an argument.

The rest of the syntax remains the same.

```
ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", row="sex", hue="day")
plt.show()
```

So in the `row` and `col` parameters we pass the names of the variables according to which the grid is split into facets.
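The faceting can also be checked on the returned object. In this sketch the DataFrame and its column names (`group`, `period`) are hypothetical; it only illustrates that `relplot()` returns a `FacetGrid` whose grid of axes has one row per `row` value and one column per `col` value.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two categorical columns to facet on
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2, 3, 5, 4, 6, 8, 7, 9],
    "group": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "period": ["early", "late", "early", "late", "early", "late", "early", "late"],
})

grid = sns.relplot(data=toy, x="x", y="y", row="group", col="period")

# Two values of "group" and two of "period" give a 2 x 2 grid of subplots
print(type(grid).__name__, grid.axes.shape)
plt.show()
```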

**Analysis:**

We can see some interesting patterns here.

- The correlation between `tip` and `total_bill` is quite strong if the `sex` is female and `time` is lunch.
- The lunch data almost entirely comes from Thursdays and Fridays.
- The correlation between `tip` and `total_bill` seems to be stronger for lunch data than for dinner data, although we should do further analysis to verify this.
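One possible form of such further analysis is to compute the correlation coefficient within each facet. The sketch below uses a small hypothetical DataFrame with made-up values and the same column names as the tips data, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as the tips data
toy_tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 12.0, 18.0, 25.0, 33.0],
    "tip":        [1.5, 3.0, 4.5, 6.0, 3.0, 2.0, 4.0, 3.5],
    "time":       ["Lunch"] * 4 + ["Dinner"] * 4,
})

# Pearson correlation between total_bill and tip within each value of time
by_time = {
    label: subset["total_bill"].corr(subset["tip"])
    for label, subset in toy_tips.groupby("time")
}
print(by_time)
```

Applied to `tips_data`, grouping by `["sex", "time"]` instead would give one coefficient per subplot of the faceted figure.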

# Summary

Let’s summarize the syntax for creating a correlation matrix and the various plots.

**Correlation Matrix and Heatmap**

- Correlation Matrix :

`correlation_matrix = DataFrame.corr(method="")`

- Heatmap :

`sns.heatmap(corr_matrix, linewidths=float, vmin=float, vmax=float)`

**PairGrid Plots**

Using the `pairplot()` function:

`ax = sns.pairplot(data, vars=list, hue="column")`

**FacetGrid Plots**

Using the `relplot()` function:

`ax = sns.relplot(data=tips_data, x="total_bill", y="tip", col="time", row="sex", hue="day")`