CommonLounge Archive

End-to-End Project: Data Cleaning and Analysis [Under Construction]

May 23, 2019

Note: This project is currently under construction. The final version is expected to be ready on or before June 15th 2019.

In this project, we will put together all the pieces we have learnt so far for data exploration, data cleaning and data analysis. This will include loading datasets, exploring numerical and categorical variables, data visualization, handling missing values, checking for outliers, performing data transformations, etc.

Let’s get started!

The student dataset

For this tutorial, we will use a dataset which has information about students admitted to an engineering college in India. Information about each student includes the performance in school exams, family income, rank in screening examination, etc.

Let’s import the standard libraries, change pandas’ display options, and load the data into a pandas DataFrame:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)
# load the dataset 
df = load_commonlounge_dataset('student_v1')

The first thing we’ll do after loading the data is to get a feel for the dataset. Let’s start by seeing how large the dataset is using shape:

print('Shape of dataset:', df.shape)

Our data set has 535 observations and 13 columns.


Let’s check the first few rows in our dataset using the head() function:

print(df.head())

Below, we’ve included a description of each of the columns in the dataset:

  • Admn Yr - The year of admission in the college
  • Seat Type - Students are admitted to nine different types of seats, broadly classified on the basis of merit and fee structure
  • Entrance Rank - Rank secured in a common examination to measure academic merit of the student
  • Reference - Whether the student was recommended by the management of the college
  • Board - Board under which the student studied in school. Three different boards exist in the data set.
  • Physics - Marks secured in Physics
  • Chemistry - Marks secured in Chemistry
  • Maths - Marks secured in Maths
  • Matriculation - Total marks secured in the final school exam.
  • Sex - Male / Female
  • Family Income - Annual Income of the family in Indian Rupee (INR)
  • Proposed Residence - Whether the student resides on-campus or off-campus. Day scholar / Hostler
  • No of Subjects Failed - The number of subjects the student failed in the college. This is the target variable.

Type of Variables

We have 13 variables of which one variable ( No of Subjects Failed ) is our target variable, and the remaining 12 variables are independent variables.

Variables can be numerical or categorical.

  • Numerical variables represent a measurable quantity such as height, weight, examination marks, etc. These are quantitative variables which have a mathematical meaning.
  • Categorical variables represent a category such as color of car, gender, etc. These variables may sometimes be represented by integer values, but the number assigned to them doesn’t have any mathematical meaning and we should not do mathematical operations on them.

Let’s identify the type of our variables and transform them into the correct datatype in the dataframe. Of our independent variables, there are 6 numeric variables and 6 categorical variables.

The numeric variables are:

  • Entrance Rank
  • Matriculation
  • Family Income
  • Physics
  • Chemistry
  • Maths

The categorical variables are:

  • Admn Yr
  • Seat Type
  • Reference
  • Sex
  • Proposed Residence
  • Board

Note: Some variables such as year can be treated as a numerical or categorical variable depending on the dataset or domain. In our case, we know that the dataset focuses on examinations, which differ from year to year. For example, students with admission year 2010 would all take the same exam, and students with admission year 2011 would all take a different exam. Hence, we treat Admn Yr as a categorical variable.


Now that we have identified the type of each variable, let’s make sure the datatype of the variables in the DataFrame are correct.

df.info()

The categorical variables are currently of datatype object. Let’s set them to the category datatype instead. (We can also see that many variables have missing values (null values), but we’ll get back to that in a bit.)

categorical_labels = ['Admn Yr', 'Seat Type', 'Reference', 'Sex', 'Proposed Residence', 'Board']
df[categorical_labels] = df[categorical_labels].astype('category')
# Check that everything looks good
df.info()

Great! Now all the variable datatypes look good!

Duplicate Observations

If the data contains duplicate observations, we need to decide whether to drop duplicates or not.

In our data set, there is no chance of two observations (students) being identical on all variables such as Entrance Rank, Physics, Family Income, etc. If our data has duplicates, they were probably caused by some data entry error. This type of duplicate must be dropped.

To remove the duplicates in our data set, we will use df.drop_duplicates(). After dropping the duplicates, we will also reset the index of our DataFrame using reset_index().

print('No of duplicates present:', df.duplicated().sum())
df = df.drop_duplicates()
# reset index.
df = df.reset_index(drop=True)
print('Shape:', df.shape)

54 duplicates were present in the dataset. After dropping them, we are now left with 481 observations.

Distribution of Variables

When analyzing the distribution of variables, we analyze numerical and categorical variables separately. This is because measures like mean, standard deviation, etc do not mean anything for categorical variables.

Categorical Variables

Let’s start by taking a look at the categorical variables. First, we’ll use the describe() DataFrame method to get a summary of the categorical variables by passing it the parameter include='category':

print(df.describe(include='category'))

We can see that most categorical variables have 2-4 unique values, and only Seat Type takes 9 unique values.


Since the number of unique values for each variable is quite small (less than 10), the easiest way to look at the data is to plot barplots.

We’ll use the countplot() function in the seaborn library to create barplots.

# bar graph of Admn Yr and Board
plt.subplots_adjust(wspace=0.5)
plt.subplot(1, 2, 1)
sns.countplot(df['Admn Yr'])
plt.subplot(1, 2, 2)
sns.countplot(df['Board'])
plt.show()
# bar graph of Reference, Sex and Proposed Residence
plt.subplots_adjust(wspace=0.5)
plt.subplot(1, 3, 1)
sns.countplot(df['Reference'])
plt.subplot(1, 3, 2)
sns.countplot(df['Sex'])
plt.subplot(1, 3, 3)
sns.countplot(df['Proposed Residence'])
# rotate x-axis labels because they don't fit otherwise
plt.xticks(rotation=10, wrap=True) 
plt.show()
# bar graph of Seat Type
plt.figure()
sns.countplot(df['Seat Type'])
# rotate x-axis labels because they don't fit otherwise
plt.xticks(rotation=20, wrap=True)
plt.show()

From the above plots, we can make the following observations:

  1. The number of students is almost equal for each Admn Yr, as one might expect.
  2. More than 95% of the students come from HSC or CBSE schools.
  3. For the variables Reference and Sex, all but a few observations belong to only one class. That is, very few students provided a reference, and the college has a very skewed gender ratio.

Numerical Variables

Next, we will take a look at the numerical variables. Again, we’ll start off with the .describe() method. By default, the describe() method only summarizes numeric data:

print(df.describe())

If we pay attention to the minimum and maximum values for all the variables, the only thing that seems to stand out is that the maximum values for Physics, Chemistry and Maths are 200 — as opposed to 100, which is what we might expect in general.

The other values look fine — for example, the worst Entrance Rank is about 75000, which seems reasonable, and the maximum Family Income is 50,00,000 INR. But the table is quite difficult to look at, so let’s just draw some histograms.


For plotting histograms, we’ll use the plt.hist() function from matplotlib:

plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(3, 2, 2)
plt.hist(df['Matriculation'], ec='black', bins=20)
plt.title('Matriculation')
plt.subplot(3, 2, 3)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.subplot(3, 2, 4)
plt.hist(df['Physics'], ec='black', bins=20)
plt.title('Physics')
plt.subplot(3, 2, 5)
plt.hist(df['Chemistry'], ec='black', bins=20)
plt.title('Chemistry')
plt.subplot(3, 2, 6)
plt.hist(df['Maths'], ec='black', bins=20)
plt.title('Maths')
plt.show()

Wow! A lot of those plots have multiple modes. Moreover, sometimes these modes are separated by a pretty significant gap. Anytime we see multiple modes, it’s important to reason about why this might be happening.

  1. We can see that Entrance Rank has a trimodal distribution, while Matriculation, Physics, Chemistry and Maths have bimodal distributions. In particular, for the subjects (Chemistry, Physics and Maths), the first mode seems to end at 100, and the next mode ends at 200. This is a strong hint that different students took different exams, one of which had a maximum score of 100 per subject, and the other had a maximum score of 200 per subject. We should analyze this more and confirm if this is indeed the case.
  2. Family Income is extremely right skewed, which is consistent with typical income distributions.

Why bimodal / trimodal distributions?

We have seen earlier that Entrance Rank has a trimodal distribution, while Matriculation, Physics, Chemistry and Maths have a bimodal distribution.

Let’s look at the bimodal distribution first.

Bimodal Distributions (Matriculation, Physics, Chemistry, Maths)

We noted a couple of other things earlier:

  1. For the subjects (Chemistry, Physics and Maths), the first mode seems to end at 100, and the next mode ends at 200. Hence, it is quite probable that the students gave different exams, one of which had a maximum score of 100 per subject, and the other had a maximum score of 200 per subject.
  2. We saw in the plots for the categorical variable Board that more than 95% of the students come from HSC and CBSE schools. So, it is likely that in the bimodal distributions, one of the modes corresponds to CBSE and the other mode corresponds to HSC. That is, one of CBSE and HSC has maximum marks 100 per subject (Physics, Chemistry and Maths), and the other has maximum marks 200 per subject.
  3. Similarly, the two boards probably have different maximum total marks for Matriculation.

Let’s see if we can confirm this idea.


Let’s plot the boxplots for Matriculation, Physics, Chemistry and Maths grouped by Board. We’ll use the boxplot() method from seaborn library.

plt.subplot(2, 2, 1)
sns.boxplot(x='Board', y='Matriculation', data=df)
plt.subplot(2, 2, 2)
sns.boxplot(x='Board', y='Physics', data=df)
plt.subplot(2, 2, 3)
sns.boxplot(x='Board', y='Chemistry', data=df)
plt.subplot(2, 2, 4)
sns.boxplot(x='Board', y='Maths', data=df)
plt.show()

We can see that Matriculation, Physics, Chemistry and Maths from HSC and CBSE differ significantly.

  • We can clearly see that the maximum marks is 200 per subject for HSC and 100 per subject for CBSE.
  • ISC board has fewer data points, but we can still see that ISC’s behavior is similar to CBSE‘s.
  • For Matriculation, the maximum mark is 1200 for HSC and 500 for CBSE and ISC.

Let’s also plot histograms. We’ll split the data by Board first.

# extract data for each board into a separate data frame
hsc_data  = df[df['Board'] == 'HSC']
isc_data  = df[df['Board'] == 'ISC']
cbse_data = df[df['Board'] == 'CBSE']
labels = ['HSC', 'ISC', 'CBSE']
plt.subplots_adjust(hspace=0.2)
# histogram for Matriculation
plt.subplot(2, 2, 1)
data = [hsc_data['Matriculation'], isc_data['Matriculation'], cbse_data['Matriculation']]
plt.hist(data, label=labels, alpha=0.6, ec='black', histtype='stepfilled', bins=50, range=(0,1200))
plt.legend(loc='upper right')
plt.title('Matriculation')
# histogram for Physics
plt.subplot(2, 2, 2)
data = [hsc_data['Physics'], isc_data['Physics'], cbse_data['Physics']]
plt.hist(data, label=labels, alpha=0.6, ec='black', histtype='stepfilled', bins=50, range=(0,200))
plt.legend(loc='upper right')
plt.title('Physics')
# histogram for Chemistry
plt.subplot(2, 2, 3)
data = [hsc_data['Chemistry'], isc_data['Chemistry'], cbse_data['Chemistry']]
plt.hist(data, label=labels, alpha=0.6, ec='black', histtype='stepfilled', bins=50, range=(0,200))
plt.legend(loc='upper right')
plt.title('Chemistry')
# histogram for Maths
plt.subplot(2, 2, 4)
data = [hsc_data['Maths'], isc_data['Maths'], cbse_data['Maths']]
plt.hist(data, label=labels, alpha=0.6, ec='black', histtype='stepfilled', bins=50, range=(0,200))
plt.legend(loc='upper right')
plt.title('Maths')
plt.show()

The histograms once again clearly confirm our idea.

Trimodal distribution (Entrance Rank)

Now, let’s see what can explain the trimodal distribution for Entrance Rank. In contrast to Matriculation, Physics, etc., which relate to the students’ school examination results, Entrance Rank relates to the student’s admission into the college. Among our other variables, the ones related to the student’s admission are Seat Type and Admn Yr.

So, our best guess would be that admission to the college is easier or tougher depending on how much fees the student is paying. Seat Type does have information about that, since there are GENERAL seats, FEE WAIVER seats, NRI (non-resident Indian) seats, etc.

Let’s do the same thing we did before — plot boxplots and histograms — but this time for Entrance Rank with Seat Type.

Here’s the boxplot of Entrance Rank grouped by Seat Type:

plt.figure(figsize=(12, 8))
sns.boxplot(x='Seat Type', y='Entrance Rank', data=df)
plt.xticks(rotation=10, wrap=True) 
plt.show()

Let’s also plot the histogram:

# list of possible values for seat type
seat_types = df['Seat Type'].unique()
# extract data for each seat type
data = []
for seat_type in seat_types:
    data.append(df[df['Seat Type'] == seat_type]['Entrance Rank'])
# histogram for entrance rank
plt.hist(data, label=seat_types, histtype='barstacked', alpha=0.6, ec='black', bins=30)
# add legend and title
plt.legend(loc='upper right')
plt.title('Entrance Rank')
plt.show()

It is evident from the boxplot and histogram that multiple distributions exist in Entrance Rank. From the above plots, we can make the following observations:

  1. FEE WAIVER, FEE WAIVER (INCOME) and GENERAL together form one mode.
  2. MANAGEMENT SEAT forms the second mode. Although OBC, OEC and ReferenceES have very few data points, they seem to behave similarly to MANAGEMENT SEAT.
  3. NRI forms the third mode. Although SC has very few data points, it seems to behave similarly to NRI.

Awesome! We noticed something very interesting in our dataset, and we put our data exploration and visualization skills to good use!

Numerical Variable Transformations

As part of data cleaning, we need to transform variables into a more usable form for data science and machine learning.

Let’s perform the transformations for numerical data first.

Log transformation

When doing data exploration for numerical variables, we saw that Entrance Rank and Family Income have very skewed distributions.

Let’s apply a log transformation to these variables to reduce the skew.

df['Family Income'] = np.log(df['Family Income'])
df['Entrance Rank'] = np.log(df['Entrance Rank'])
print(df.head())

Notice the changed values for both variables. Let’s visualize the new distribution by plotting their histograms.

plt.subplots_adjust(hspace=1)
plt.subplot(1, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(1, 2, 2)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.show()

As we can see, the data now resembles a normal distribution quite closely. Awesome!

Percentage Transformation

We saw during our bimodal analysis that students from the CBSE and ISC boards have a maximum mark of 500 for Matriculation, while HSC board students have a maximum of 1200.

Similarly, for Physics, Chemistry and Maths, we noticed that the maximum marks is 100 for CBSE and ISC and 200 for HSC.

Since the maximum marks differ by Board, we should rescale the values so that all the grades are percentage values (that is, scale it as if all examinations had a maximum possible score of 100).

# Convert Matriculation to percentage based on Board
df['Matriculation'] = np.where(df['Board'] == 'HSC', df['Matriculation']/12, df['Matriculation'])
df['Matriculation'] = np.where(df['Board'] == 'CBSE', df['Matriculation']/5, df['Matriculation'])
df['Matriculation'] = np.where(df['Board'] == 'ISC',  df['Matriculation']/5, df['Matriculation'])
# Convert Physics, Chemistry and Maths of HSC to percentage
# Marks of other boards are already in percentage
df['Physics']   = np.where(df['Board'] == 'HSC', df['Physics']/2, df['Physics'])
df['Chemistry'] = np.where(df['Board'] == 'HSC', df['Chemistry']/2, df['Chemistry'])
df['Maths']     = np.where(df['Board'] == 'HSC', df['Maths']/2, df['Maths'])
print(df.head())

To see a clearer picture, we can redo the histograms we had plotted initially for all the numerical variables.

plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(3, 2, 2)
plt.hist(df['Matriculation'], ec='black', bins=20)
plt.title('Matriculation')
plt.subplot(3, 2, 3)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.subplot(3, 2, 4)
plt.hist(df['Physics'], ec='black', bins=20)
plt.title('Physics')
plt.subplot(3, 2, 5)
plt.hist(df['Chemistry'], ec='black', bins=20)
plt.title('Chemistry')
plt.subplot(3, 2, 6)
plt.hist(df['Maths'], ec='black', bins=20)
plt.title('Maths') 
plt.show()

Data Re-scaling: Standardization

Many machine learning algorithms require numerical data to be re-scaled before fitting the model (especially important for distance based algorithms). Re-scaling also ensures variables with high magnitude don’t dominate the data analysis.

For rescaling, we will perform standardization — that is, re-scale each variable so that its new mean becomes 0 (μ = 0) and standard deviation becomes 1 (σ = 1).
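
To make the transformation concrete: standardization replaces each value x of a variable with z = (x − μ) / σ, where μ is the variable’s mean and σ its standard deviation. Below is a minimal sketch of the equivalent manual computation for a single column; the sklearn call further down does this for all numerical columns at once (note that preprocessing.scale uses the population standard deviation, i.e. ddof=0):

# manual standardization of one column (equivalent to what preprocessing.scale does)
col = df['Entrance Rank']
z = (col - col.mean()) / col.std(ddof=0)   # ddof=0: population standard deviation
print(z.mean(), z.std(ddof=0))             # approximately 0 and 1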

We’ll use sklearn library’s preprocessing.scale() function to perform standardization:

# numeric variable names stored as a list
numerical_labels = ['Entrance Rank', 'Matriculation', 'Family Income', 'Physics', 'Chemistry', 'Maths']
# standardize data using preprocessing.scale()
from sklearn import preprocessing
df.loc[:,numerical_labels] = preprocessing.scale(df.loc[:,numerical_labels])
# check first five observations
print(df.head())
# check that the mean is 0 and std is 1
print(df.describe())

Since floating-point operations are only precise to a certain number of digits, the new means won’t be exactly 0, but they’ll be really small (about 10⁻¹⁵).

We can also plot the histograms again, to see the effect of standardization.

plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
plt.hist(df['Entrance Rank'], ec='black', bins=20)
plt.title("Entrance Rank")
plt.subplot(3, 2, 2)
plt.hist(df['Matriculation'], ec='black', bins=20)
plt.title('Matriculation')
plt.subplot(3, 2, 3)
plt.hist(df['Family Income'], ec='black', bins=20)
plt.title('Family Income')
plt.subplot(3, 2, 4)
plt.hist(df['Physics'], ec='black', bins=20)
plt.title('Physics')
plt.subplot(3, 2, 5)
plt.hist(df['Chemistry'], ec='black', bins=20)
plt.title('Chemistry')
plt.subplot(3, 2, 6)
plt.hist(df['Maths'], ec='black', bins=20)
plt.title('Maths') 
plt.show()

We have implemented standardization for all our numerical variables.

Missing Values

Next, we’ll check for missing values. We’ll also need to decide how we would like to deal with missing values. Common methods for dealing with missing values are:

  • Deleting the observations with missing values.
  • Imputing the missing values by median/mode, kNN etc.

Checking for missing values

Let’s check whether our data has missing values:

print('Data set shape: ', df.shape)
# missing values
print(df.isnull().sum())

Yes, seven variables have missing values. The number of missing values is listed to the right of the corresponding variable in the output above. We have missing values for 6 numerical variables and 1 categorical variable.

Dealing with missing values

The simplest approach is to delete the observations which have missing values. However, in doing so, we don’t want to end up deleting a significant fraction of our data, since that would be harmful for any further analysis or machine learning.

Let’s see how much our dataset would reduce by if we deleted observations which have missing values:

print('Data set shape before dropping missing values:', df.shape)
print('Data set shape after dropping missing values:', df.dropna().shape)

This would result in dropping 40 observations, which is about 8% of our datasize. That isn’t too bad, but it isn’t ideal either.

Let’s impute the missing values instead.

Imputation by Median / Mode

We can impute the missing values of numerical variables with the median, and those of categorical variables with the mode. See the code below.

# fill missing values with the median for numerical variables
df_temp = df.fillna(df[numerical_labels].median())
# fill missing values with the mode for categorical variables
df_temp = df_temp.fillna(df[categorical_labels].mode().iloc[0])

Imputation by Median / Mode of Subgroup

We can also impute with the median/mode of a subgroup defined by a few selected variables. If the values of a variable are known to depend on some other variables in the dataset, this method of imputation is ideal. The code below imputes numerical variables with the median and categorical variables with the mode, grouped by Seat Type:

df_temp = df.copy()
for var in numerical_labels:
    missing = df_temp[var].isnull()
    df_temp.loc[missing, [var]] = df_temp.groupby('Seat Type')[var].transform('median')
for var in categorical_labels:
    missing = df_temp[var].isnull()
    df_temp.loc[missing, [var]] = df_temp.groupby('Seat Type')[var].transform(lambda S: S.mode()[0])
print(df_temp.isnull().sum())

kNN Imputation

The missing values of a variable may be imputed using kNN. For each missing value, kNN finds the k nearest neighbors according to some distance function and imputes the value from them. We will discuss kNN imputation in more detail after one hot encoding the categorical variables, which is a recommended step before kNN.

Recommended treatment for missing values

In our dataset, missing values exist in 6 numerical variables, namely Entrance Rank, Family Income, Matriculation, Physics, Chemistry and Maths. These variables are known to be highly correlated with the variable Seat Type. Plotting boxplots grouped by Seat Type helps verify this assumption.

plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
sns.boxplot(x='Seat Type', y='Entrance Rank', data=df)
plt.subplot(3, 2, 2)
sns.boxplot(x='Seat Type', y='Matriculation', data=df)
plt.subplot(3, 2, 3)
sns.boxplot(x='Seat Type', y='Family Income', data=df)
plt.subplot(3, 2, 4)
sns.boxplot(x='Seat Type', y='Physics', data=df)
plt.subplot(3, 2, 5)
sns.boxplot(x='Seat Type', y='Chemistry', data=df)
plt.subplot(3, 2, 6)
sns.boxplot(x='Seat Type', y='Maths', data=df)
plt.show()

Therefore, for all numerical variables, we impute missing values with the median grouped by Seat Type.

for var in numerical_labels:
    missing = df[var].isnull()
    df.loc[missing, [var]] = df.groupby('Seat Type')[var].transform('median')

The categorical variable Proposed Residence also contains missing values, which we impute with the mode grouped by Seat Type.

# categorical variable Proposed Residence imputed with the mode of its Seat Type subgroup
missing = df['Proposed Residence'].isnull()
df.loc[missing, ['Proposed Residence']] = df.groupby('Seat Type')['Proposed Residence'].transform(lambda S: S.mode()[0])

Final check

Now we have completed the missing value treatment. Let’s check that we haven’t left anything out:

print(df.isnull().sum())

We can see that no missing values exist in our data set. All missing values have been addressed.

Outliers

An outlier is an observation that lies far away from, and diverges from, the overall pattern of the data set. Outliers need close attention, otherwise they can result in wildly wrong estimates.

Let’s see this with a simple example. Suppose we have a sample data set of 10 numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

The mean for the above data is 5.5.

Now let’s replace the number 10 with 1000. So the new data set will be: 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000.

The mean is 104.5.

As you can see, a single outlier had a significant effect on the mean, leading to a wrong estimate of the distribution. Outliers can drastically influence the results of data analysis and statistical modeling, which has a negative impact on predictions.
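
Here’s the same calculation in code (a small illustrative example, separate from our dataset):

# a single extreme value drastically changes the mean
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(values.mean())                 # 5.5
values_with_outlier = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
print(values_with_outlier.mean())    # 104.5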

Outliers are of two types.

  • Artificial Outliers - Outliers that arise due to data entry, experimental or other errors.
  • Natural Outliers - Outliers which are not due to any error, but occur naturally, shed light on the underlying process, and hence require close attention. These observations may have to be segmented and analyzed separately.

Box plots are a handy tool for detecting outliers.

Let’s visualize the outliers by drawing boxplots:

plt.subplots_adjust(hspace=1)
plt.subplot(3, 2, 1)
sns.boxplot(df['Entrance Rank'])
plt.subplot(3, 2, 2)
sns.boxplot(df['Matriculation'])
plt.subplot(3, 2, 3)
sns.boxplot(df['Family Income'])
plt.subplot(3, 2, 4)
sns.boxplot(df['Physics'])
plt.subplot(3, 2, 5)
sns.boxplot(df['Chemistry'])
plt.subplot(3, 2, 6)
sns.boxplot(df['Maths'])
plt.show()

We see that outliers exist for Matriculation, Physics, Chemistry and Maths. Using the code below, we find the number of outliers in each of these variables.

for var in numerical_labels:
    print(var)
    q1, q3 = np.percentile(df.loc[:,var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df[var] > upper_bound
    below_lower = df[var] < lower_bound
    outliers = above_upper | below_lower
    print(outliers.sum()) # number of outliers

Therefore, Matriculation has 21 outliers, Physics has 6, Chemistry has 1, and Maths has 14.

The common techniques to deal with outliers are:

Deleting Observations

We can delete observations with outlier values if they are due to data entry or other errors (artificial outliers). Whether to do so also depends on the size of the data set; for small data sets, this method is not recommended. The code for this is:

print('Shape before dropping outliers:', df.shape)
# start with no rows marked as outliers
outliers = pd.Series(False, index=df.index)
for var in numerical_labels:
    q1, q3 = np.percentile(df.loc[:, var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df[var] > upper_bound
    below_lower = df[var] < lower_bound
    outliers = outliers | above_upper | below_lower
print('No of rows with outliers:', outliers.sum())
# drop outliers (we don't assign the result back to df, since we don't actually want to drop them)
print('Shape after dropping outliers:', df.drop(df[outliers].index).shape)

There are 33 rows with outliers. When they are dropped, the number of observations is reduced to 448. Alternatively, we can impute the outlier values.

Imputing

As in the case of missing values, outliers can also be imputed by median or kNN. Imputation is normally done for

  • artificial outliers, or
  • outliers which arise due to some random cause that has a negligible chance of repeating and doesn’t affect our analysis.

We can impute the numerical variables with the median. The following code shows imputing outliers with the median grouped by Seat Type.

df_temp = df.copy()
for var in numerical_labels:
    print(var)
    q1, q3 = np.percentile(df_temp.loc[:,var], [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    above_upper = df_temp[var] > upper_bound
    below_lower = df_temp[var] < lower_bound
    outliers = above_upper | below_lower
    print(outliers.sum()) # number of outliers
    print(df_temp.loc[outliers, [var]]) # before replacing
    df_temp.loc[outliers, [var]] = df_temp.groupby('Seat Type')[var].transform('median')
    print(df_temp.loc[outliers, [var]]) # after replacing

Segment and Analyze Separately

If the outliers present are natural outliers and are significant in number, one method is to treat them separately while building models. For example, we can split the data into two different groups and build individual models for each group.
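
As a minimal sketch (assuming the combined boolean outliers mask from the "Deleting Observations" code above; the group names below are just for illustration), the split might look like this:

# segment observations into two groups based on the outlier mask
df_outlier_group = df[outliers]
df_main_group = df[~outliers]
print('Outlier group:', df_outlier_group.shape)
print('Main group:', df_main_group.shape)
# each group can then be analyzed and modeled separately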

Transforming and binning values

Another method to reduce the effect of outliers is to transform the variables. Taking the natural log of a variable reduces the variation caused by extreme values.

Binning is also a form of variable transformation. Here we assign the values to different bins determined by class intervals. These variables can then be treated as categorical variables.
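
For instance, here is a small sketch of binning with pandas’ cut() function (the number of bins and the labels are illustrative assumptions, not something we actually apply to our dataset):

# bin Family Income into 4 equal-width intervals and treat the result as a categorical variable
income_binned = pd.cut(df['Family Income'], bins=4, labels=['low', 'medium', 'high', 'very high'])
print(income_binned.value_counts())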

No processing

Natural outliers which do not deviate much from the mainstream data can be left without any processing. This is often the approach we use.

Recommended method for Outliers

For our dataset, the outliers in Matriculation, Physics, Chemistry and Maths are natural outliers, and for our analysis it’s recommended not to do any processing.

Categorical Variable Transformation

One hot encoding

Many machine learning algorithms cannot work with categorical data directly. They need to be converted to numerical representation before processing.

Each possible label of the categorical variable is converted into a dummy/indicator variable. For each observation, all the dummy variables are assigned the value 0, except the one corresponding to the category the observation belongs to, which is assigned the value 1.

Take the example of encoding the categorical variable Seat Type, which has 9 different categories (labels). Nine dummy variables are created, each named after one category label. Every row has exactly one 1, while all the others are 0.

Then the original categorical variable in the data set is replaced with new encoded dummy/indicator variables. While replacing, one dummy variable is omitted intentionally to avoid collinearity.


# one hot encoding - all categorical variables at once
# creating dummy variables to convert categorical into numeric values
dummies = pd.get_dummies(df[categorical_labels], prefix=categorical_labels, drop_first=True)
df.drop(categorical_labels, axis=1, inplace=True)
df = pd.concat([df, dummies], axis=1)

Observe the one hot encoded categorical variables. For each category except one, a new variable is created.

KNN Imputation of Missing Values

For each missing value, kNN finds the k nearest neighbors according to some distance function and imputes the value from them. For kNN, all variables should be numeric. As we have one hot encoded the categorical variables and converted them into numeric, we can apply kNN imputation on missing data. The code for imputing all missing values at once with kNN is:

from fancyimpute import KNN
# df_missing is assumed to be a copy of the DataFrame taken before the median / mode
# imputation above, i.e. it still contains the missing values
print('Missing values:\n', df_missing.isnull().sum())
# one hot encoding
dummies = pd.get_dummies(df_missing[categorical_labels], prefix=categorical_labels, drop_first=True)
df_missing.drop(categorical_labels, axis=1, inplace=True)
df_missing = pd.concat([df_missing, dummies], axis=1)
# impute all missing values at once using the 5 nearest neighbors
df_missing.loc[:, :] = KNN(k=5).fit_transform(df_missing)
print('Missing values:\n', df_missing.isnull().sum())

Summary

  • Data exploration and cleaning is a critical step where we prepare the data for further analysis and for use by machine learning algorithms.
  • Shape of the data set and variable types are explored.
  • Numerical and Categorical variables are analysed to see the distribution of the data points.
  • Numerical variables are re-scaled as required by the algorithms to be employed.
  • Missing values are removed or imputed.
  • Outliers are examined and transformed if required.
  • Categorical variables are one hot encoded.

© 2016-2022. All rights reserved.