Box plots

March 22, 2019

What is a box plot?

A box plot (also called a ‘box and whisker plot’) visualizes the distribution of numerical data and helps identify outliers in the data.

Below, you can see the box plot of a toy data set having just 10 observations – 144, 147, 153, 154, 156, 157, 161, 164, 170, 181.

Box plot for the toy dataset

Notice the following components in the above box plot:

1st Quartile (25th percentile)
Median (2nd Quartile or 50th percentile)
3rd Quartile (75th percentile)
Interquartile Range (IQR, the difference between the 3rd and 1st Quartiles)
Whiskers: these mark the lowest data point that lies within 1.5 * IQR below the 1st Quartile, and the highest data point that lies within 1.5 * IQR above the 3rd Quartile
Outliers: any data points more than 1.5 * IQR beyond the 1st or 3rd Quartile, i.e. values greater than 3rd Quartile + 1.5 * IQR or less than 1st Quartile - 1.5 * IQR

In other words, whiskers extend from the quartiles to the rest of the distribution, except for points that are determined to be “outliers”.
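As a rough check, the components listed above can be computed by hand for the toy data set. The sketch below assumes NumPy's default linear-interpolation percentiles (other quartile conventions give slightly different values), and the variable names are illustrative only.

```python
import numpy as np

data = np.array([144, 147, 153, 154, 156, 157, 161, 164, 170, 181])

# quartiles using NumPy's default (linear interpolation) percentile method
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# fences: points beyond these are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers end at the most extreme data points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)         # 153.25 156.5 163.25 10.0
print(lower_fence, upper_fence)    # 138.25 178.25
print(whisker_low, whisker_high)   # 144 170
print(outliers)                    # [181]
```

Under these assumptions, 181 is the only outlier, and the whiskers end at 144 and 170.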

Outliers

Whenever we do any data analysis, outliers should be investigated carefully. They may contain valuable information about the problem under investigation, or inform us about errors in recording the data. Moreover, even a few outliers can distort many commonly used machine learning algorithms.

Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear.

Creating a boxplot in Python

Let’s generate 1000 random values for ‘height’ and create the corresponding box plot. The ‘height’ values are drawn from a normal distribution with mean 165 and standard deviation 10, then rounded to the nearest integer. To create the box plot, we’ll use the seaborn Python library.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate data - 1000 values with mean 165 and std dev 10 
np.random.seed(0) 
height = np.round((10 * np.random.randn(1000) + 165), 0)
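# create box plot; pass the values as x= explicitly, since newer seaborn
# versions treat the first positional argument as `data`
sns.boxplot(x=height)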
plt.show()

The above boxplot is for normally distributed data with mean 165 and standard deviation 10. For data that follows a symmetric distribution (such as the normally distributed data above), the box plot will be symmetrical as well. In other words:

The median will be roughly midway between the 1st and 3rd quartiles.
The whisker lengths will be roughly equal.
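These two properties can also be checked numerically. The following is a minimal sketch that regenerates the same ‘height’ values and compares the two halves of the box and the two whisker lengths. It assumes NumPy's default percentile method and the 1.5 * IQR whisker rule described earlier, and the variable names are made up for illustration.

```python
import numpy as np

np.random.seed(0)
height = np.round(10 * np.random.randn(1000) + 165, 0)

q1, median, q3 = np.percentile(height, [25, 50, 75])
iqr = q3 - q1

# whiskers end at the most extreme points within 1.5 * IQR of the box
inside = height[(height >= q1 - 1.5 * iqr) & (height <= q3 + 1.5 * iqr)]
lower_whisker_length = q1 - inside.min()
upper_whisker_length = inside.max() - q3

print("median - Q1:", median - q1, "| Q3 - median:", q3 - median)
print("lower whisker:", lower_whisker_length, "| upper whisker:", upper_whisker_length)
```

For symmetric data, the two numbers on each printed line should be close to each other, though not necessarily identical.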

Box plot for a skewed distribution

The box plot for a chi-square distribution (right-skewed) is shown below. We can observe that the median is closer to the 1st quartile, and the box plot has a long whisker to the right.

var = np.random.chisquare(1, 100) # generate 100 values 
sns.boxplot(x=var)                # plot (keyword x= works across seaborn versions)
plt.show()

Also notice that the long right tail produces many more points flagged as outliers, all of them beyond the right whisker, as we can see in the box plot above.
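To see the asymmetry in numbers rather than in the picture, the outliers on each side can be counted separately. This is only a sketch: the seed is an arbitrary choice made here for reproducibility (the snippet above does not set one), and the 1.5 * IQR rule with NumPy's default percentiles is assumed.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only to make the sketch reproducible
var = np.random.chisquare(1, 100)

q1, median, q3 = np.percentile(var, [25, 50, 75])
iqr = q3 - q1

n_low = int(np.sum(var < q1 - 1.5 * iqr))   # outliers below the lower fence
n_high = int(np.sum(var > q3 + 1.5 * iqr))  # outliers above the upper fence

print("median - Q1:", median - q1, "| Q3 - median:", q3 - median)
print("outliers below:", n_low, "| outliers above:", n_high)
```

Because chi-square values cannot be negative and the lower fence typically falls below zero, outliers are expected only on the high side.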

Multiple box plots

As we did with a histogram, we can use box plots to compare two sets of data.

Let’s first generate some data:

import pandas as pd
# generate data 
np.random.seed(2)
# Group 1 has mean 165 and stddev 10
df1 = pd.DataFrame()
df1['Height'] = np.round((10 * np.random.randn(500) + 165), 0)
df1['Group'] = 'Group1'
# Group 2 has mean 170 and stddev 12
df2 = pd.DataFrame()
df2['Height'] = np.round((12 * np.random.randn(500) + 170), 0)
df2['Group'] = 'Group2'
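# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
df = pd.concat([df1, df2], ignore_index=True)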

Now, let’s create the box plot:

# create plot 
sns.boxplot(x='Group', y='Height', data=df)
plt.show()

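Here the heights of the two groups are compared by drawing their box plots side by side. The x-axis shows 'Group' and the y-axis shows 'Height'.

From the box plot, we can infer that people from 'Group2' tend to be taller than people from 'Group1': the median of 'Group2' sits higher. The box of 'Group2' is also slightly taller, reflecting its larger standard deviation.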
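The same comparison can be made with numbers. The sketch below rebuilds the two groups with the same seed and prints the quartiles per group using pandas. It assumes pandas' default (linear) quantile interpolation, and the `quartiles` name is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df1 = pd.DataFrame({'Height': np.round(10 * np.random.randn(500) + 165, 0), 'Group': 'Group1'})
df2 = pd.DataFrame({'Height': np.round(12 * np.random.randn(500) + 170, 0), 'Group': 'Group2'})
df = pd.concat([df1, df2], ignore_index=True)

# quartiles per group: one row per group, one column per quantile
quartiles = df.groupby('Group')['Height'].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

The 0.5 column holds the medians, which correspond to the line inside each box.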

Handling multiple categories simultaneously

Let’s add one more variable — ‘Basketball_Player’ — to our data and then recreate our box plot.

# randomly assign Basketball_Player status
df['Basketball_Player'] = np.random.randint(0, 2, df.shape[0])
# use the 'hue' parameter to add colors based on Basketball_Player status
sns.boxplot(x='Group', y='Height', hue='Basketball_Player', data=df)
plt.show()

This produces the box plot for the numerical variable 'Height' across multiple levels of categorical variables — 'Group' and 'Basketball_Player'.

Since we assigned the 'Basketball_Player' status randomly, 'Height' and 'Basketball_Player' are unrelated in this data set: within each group, the boxes for players and non-players look much the same.
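A quick numerical check of that observation is to compute one median per box. This is a sketch under the same seed and random draws as above, with the data frame built in a single step for brevity. The `medians` name is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({
    'Height': np.concatenate([
        np.round(10 * np.random.randn(500) + 165, 0),  # Group1
        np.round(12 * np.random.randn(500) + 170, 0),  # Group2
    ]),
    'Group': ['Group1'] * 500 + ['Group2'] * 500,
})
df['Basketball_Player'] = np.random.randint(0, 2, df.shape[0])

# one median per (Group, Basketball_Player) combination, i.e. per box in the plot
medians = df.groupby(['Group', 'Basketball_Player'])['Height'].median().unstack()
print(medians)

# difference between players (1) and non-players (0) within each group
print(medians[1] - medians[0])
```

With a randomly assigned status, the within-group differences should be small compared with the difference between the two groups.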

Other parameters in seaborn.boxplot

So far, we’ve seen the x, y, data and hue parameters supported by the seaborn.boxplot() function. In this section, we’ll see a couple more important parameters supported by the function.

whis - By default, outliers are the data points that lie more than 1.5 x IQR below the 1st Quartile or above the 3rd Quartile. The whis parameter changes this multiplier.
orient - takes the values ‘v’ or ‘h’, which set the orientation of the box plot to vertical or horizontal, respectively.

Let’s use these parameters to create a vertical box plot whose whiskers extend at most 0.5 x IQR beyond the quartiles:

np.random.seed(0)
height = np.round((10 * np.random.randn(1000) + 165), 0)
sns.boxplot(y=height, orient='v', whis=0.5)  # y= places the values on the vertical axis
plt.show()

As you can see, the number of outliers increased significantly because of the reduced whisker length.
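The effect of the multiplier can be quantified by counting the points beyond the fences for both settings. `count_outliers` below is a hypothetical helper written for this illustration, not part of seaborn, and it assumes NumPy's default percentile method.

```python
import numpy as np

def count_outliers(values, whis=1.5):
    """Count points farther than whis * IQR from the box (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return int(np.sum((values < q1 - whis * iqr) | (values > q3 + whis * iqr)))

np.random.seed(0)
height = np.round(10 * np.random.randn(1000) + 165, 0)

print("whis=1.5:", count_outliers(height, whis=1.5))
print("whis=0.5:", count_outliers(height, whis=0.5))
```

For roughly normal data, the 1.5 x IQR rule flags well under 1% of the points, while 0.5 x IQR flags a far larger share, which matches what the plot shows.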

Summary

A box plot can be used to summarize a large data set graphically. In particular, it highlights the quartiles and the outliers.
We can create boxplots using the boxplot() function from the seaborn Python library.
Looking at the boxplot allows us to easily infer whether the distribution is symmetric or skewed.
Box plots allow us to compare a variable’s values across groups defined by one or more categorical variables.

References

We covered the most commonly used parameters of the seaborn.boxplot() function. If you would like to read about the other parameters it supports, you can check the official seaborn documentation for seaborn.boxplot().