A box plot (also called a ‘box-and-whisker plot’) visualizes the distribution of numerical data and helps identify outliers in the data.
Below, you can see the box plot of a toy data set having just 10 observations – 144, 147, 153, 154, 156, 157, 161, 164, 170, 181.
Box plot for the toy dataset
Notice the following components in the above box plot:
- 1st Quartile (25th percentile)
- Median (2nd Quartile or 50th percentile)
- 3rd Quartile (75th percentile)
- Interquartile Range (IQR – difference between 3rd and 1st Quartile)
- Whiskers: mark the lowest data point that lies within 1.5 * IQR of the 1st quartile and the highest data point that lies within 1.5 * IQR of the 3rd quartile
- Outliers: any data points more than 1.5 * IQR beyond the 1st or 3rd quartile, i.e. values greater than 3rd Quartile + 1.5 * IQR or less than 1st Quartile - 1.5 * IQR
In other words, whiskers extend from the quartiles to the rest of the distribution, except for points that are determined to be “outliers”.
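To make these definitions concrete, here is a minimal sketch that computes the same quantities for the toy data set with NumPy. It assumes linear-interpolation percentiles (the np.percentile default); other quartile conventions can give slightly different values. The helper name box_plot_stats is hypothetical and introduced only for illustration.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Return the quantities a box plot displays (hypothetical helper)."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # limits beyond which a point counts as an outlier
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # whiskers end at the most extreme points still inside the limits
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

toy = [144, 147, 153, 154, 156, 157, 161, 164, 170, 181]
stats = box_plot_stats(toy)
print(stats)
```

Under this convention the quartiles are 153.25, 156.5 and 163.25, so the IQR is 10, the whiskers end at 144 and 170, and 181 is flagged as an outlier because it exceeds 163.25 + 1.5 * 10 = 178.25.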
Whenever we do any data analysis, outliers should be investigated carefully. They may contain valuable information about the problem under investigation, or inform us about errors in recording the data. Moreover, even a few outliers can distort many commonly used machine learning algorithms.
Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear.
Let’s generate 1000 random values for ‘height’ and create the corresponding box plot. The ‘height’ values are randomly generated from a normal distribution with mean 165 and standard deviation 10. To create the box plot, we’ll use the seaborn Python library.
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # generate data - 1000 values with mean 165 and std dev 10
    np.random.seed(0)
    height = np.round((10 * np.random.randn(1000) + 165), 0)

    # create a horizontal box plot (pass the data explicitly as x)
    sns.boxplot(x=height)
    plt.show()
The above boxplot is for normally distributed data with mean 165 and standard deviation 10. For data that follows a symmetric distribution (such as the normally distributed data above), the box plot will be symmetrical as well. In other words:
- The median will be roughly in the middle of 1st and 3rd quartiles.
- The whisker lengths will be roughly equal.
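One way to check the symmetry numerically, rather than by eye, is to compare the distance from the median to each quartile. The sketch below does this for the same synthetic heights; quartile_gaps is a hypothetical helper name, and the percentiles again use NumPy's default linear interpolation.

```python
import numpy as np

def quartile_gaps(values):
    """Return (median - Q1, Q3 - median); hypothetical helper."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return median - q1, q3 - median

# same synthetic heights as in the text: mean 165, std dev 10
np.random.seed(0)
height = np.round(10 * np.random.randn(1000) + 165, 0)

lower_gap, upper_gap = quartile_gaps(height)
print(f"median - Q1 = {lower_gap:.2f}, Q3 - median = {upper_gap:.2f}")
```

For a roughly symmetric sample the two gaps should be close to each other; a large difference between them is a quick hint of skew.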
The box plot for a chi-square distribution (right-skewed) is shown below. We can observe that the median is closer to the 1st quartile, and the box plot has a long whisker to the right.
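    # generate 100 values from a chi-square distribution with 1 degree of freedom
    var = np.random.chisquare(1, 100)

    # plot horizontally so that the long tail points to the right
    sns.boxplot(x=var)
    plt.show()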
Also notice that the long right tail produces many more outliers, as we can see in the box plot above.
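To see on which side the outliers fall, we can count the points beyond each whisker limit separately. This is only a sketch: count_outliers_by_side is a hypothetical helper, the seed is chosen here purely for reproducibility, and the limits use the 1.5 * IQR rule described earlier.

```python
import numpy as np

def count_outliers_by_side(values, whis=1.5):
    """Return (n_below, n_above) the whisker limits; hypothetical helper."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    n_below = int(np.sum(values < q1 - whis * iqr))
    n_above = int(np.sum(values > q3 + whis * iqr))
    return n_below, n_above

np.random.seed(0)  # assumption: seed picked only to make the run repeatable
skewed = np.random.chisquare(1, 100)
print(count_outliers_by_side(skewed))
```

Because chi-square values cannot be negative, a right-skewed sample like this one would be expected to show its outliers on the high side only.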
As we did with a histogram, we can use box plots to compare two sets of data.
Let’s first generate some data:
    import pandas as pd

    # generate data
    np.random.seed(2)

    # Group 1 has mean 165 and stddev 10
    df1 = pd.DataFrame()
    df1['Height'] = np.round((10 * np.random.randn(500) + 165), 0)
    df1['Group'] = 'Group1'

    # Group 2 has mean 170 and stddev 12
    df2 = pd.DataFrame()
    df2['Height'] = np.round((12 * np.random.randn(500) + 170), 0)
    df2['Group'] = 'Group2'

    # combine both groups (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df1, df2], ignore_index=True)
Now, let’s draw the box plot:
    # create plot
    sns.boxplot(x='Group', y='Height', data=df)
    plt.show()
Here, the heights of the two groups are compared by drawing their box plots side by side. The X-axis shows 'Group' and the Y-axis shows 'Height'.
From the box plot, we can infer that, on average, people from 'Group2' are taller than people from 'Group1'.
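The visual comparison can be backed up with the numbers behind each box. The following sketch rebuilds the two groups with the same seed and parameters as above and prints the quartiles per group using pandas; it assumes pandas' default quantile interpolation, which may differ slightly from what a plotting library draws.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.concat(
    [
        # Group 1: mean 165, std dev 10
        pd.DataFrame({"Height": np.round(10 * np.random.randn(500) + 165, 0),
                      "Group": "Group1"}),
        # Group 2: mean 170, std dev 12
        pd.DataFrame({"Height": np.round(12 * np.random.randn(500) + 170, 0),
                      "Group": "Group2"}),
    ],
    ignore_index=True,
)

# quartiles per group: the values that define each box
summary = df.groupby("Group")["Height"].describe()[["25%", "50%", "75%"]]
print(summary)
```

The "50%" column holds the medians, i.e. the lines drawn inside the boxes, so comparing that column across rows mirrors comparing the boxes by eye.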
Let’s add one more variable, ‘Basketball_Player’, to our data and then recreate our box plot.
    # randomly assign Basketball_Player status (0 or 1), one value per row
    df['Basketball_Player'] = np.random.randint(0, 2, len(df))

    # use the 'hue' parameter to add colors based on Basketball_Player status
    sns.boxplot(x='Group', y='Height', hue='Basketball_Player', data=df)
    plt.show()
This produces the box plot for the numerical variable 'Height' across multiple levels of two categorical variables: 'Group' and 'Basketball_Player'.
Since we assigned the 'Basketball_Player' status randomly, we can see that it has little relationship with 'Height' in this data set.
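A table of medians for every combination of the two categorical variables gives a numeric view of the same comparison. This is a sketch on stand-in data: median_table is a hypothetical helper, and the data is regenerated with NumPy's default_rng rather than the article's seed, so the exact values will differ from the plot.

```python
import numpy as np
import pandas as pd

def median_table(frame, value, row_key, col_key):
    """Median of `value` for each (row_key, col_key) pair; hypothetical helper."""
    return frame.groupby([row_key, col_key])[value].median().unstack(col_key)

# stand-in data set with a randomly assigned binary label
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Group": np.repeat(["Group1", "Group2"], 500),
    "Height": np.round(np.concatenate([
        rng.normal(165, 10, 500),   # Group1: mean 165, std dev 10
        rng.normal(170, 12, 500),   # Group2: mean 170, std dev 12
    ]), 0),
})
demo["Basketball_Player"] = rng.integers(0, 2, len(demo))

print(median_table(demo, "Height", "Group", "Basketball_Player"))
```

Each row is a group and each column a label value. With a randomly assigned label, the two medians within a row should be close to each other, while the rows themselves differ.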
So far, we’ve seen the x, y, data, and hue parameters supported by the seaborn.boxplot() function. In this section, we’ll see a couple more important parameters supported by the function.
- whis: By default, we consider outliers to be the data points that lie more than 1.5 x IQR beyond the 1st or 3rd quartile. We can change this multiplier using the whis parameter.
- orient: takes the values ‘v’ or ‘h’, which set the orientation of the box plot to vertical or horizontal, respectively.
Let’s use these parameters to create a vertical boxplot with whisker length 0.5 x IQR:
    np.random.seed(0)
    height = np.round((10 * np.random.randn(1000) + 165), 0)

    # vertical box plot whose whiskers reach at most 0.5 x IQR beyond the quartiles
    sns.boxplot(y=height, orient='v', whis=0.5)
    plt.show()
As you can see, the number of outliers increased significantly because of the reduced whisker length.
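The effect of the whisker length can also be checked by counting the flagged points directly. The sketch below applies the whis * IQR rule to the same heights for both settings; count_outliers is a hypothetical helper and uses NumPy's default percentile interpolation, so counts may differ marginally from what a plotting library marks.

```python
import numpy as np

def count_outliers(values, whis):
    """Count points farther than whis * IQR beyond the quartiles (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    is_outlier = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return int(np.sum(is_outlier))

np.random.seed(0)
height = np.round(10 * np.random.randn(1000) + 165, 0)

for whis in (1.5, 0.5):
    print(f"whis={whis}: {count_outliers(height, whis)} outliers")
```

A smaller whis value narrows the limits on both sides, so every point flagged at 1.5 is still flagged at 0.5, along with many points that were previously inside the whiskers.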
- A box plot can be used to summarize large data sets graphically. In particular, it highlights the quartiles and outliers.
- We can create box plots using the boxplot() function from the seaborn library.
- Looking at the box plot allows us to easily infer whether the distribution is symmetric or skewed.
- Box plots allow us to compare a variable’s values grouped by one or more categorical variables.
We covered the most commonly used parameters of the seaborn.boxplot() function. If you would like to read more about the other parameters it supports, you can check the seaborn documentation.