Note: This tutorial is currently under construction. The final version is expected to be ready on or before June 15th 2019.
Duplicate Observations
The data may contain duplicate observations. We have to analyze the data set to decide whether to drop the duplicates or not.
In our data set, there is no chance of two observations (students) being identical on all variables such as Entrance Rank, Admn Yr, etc. Any duplicates here are the result of data entry errors. This type of duplicate must be dropped.
On the other hand, some data sets contain duplicate values that represent a genuine pattern in the process being measured. For example, the iris dataset in sklearn has 4 variables that measure the length and width of the petals and sepals of 150 iris flowers. This data set contains duplicates, which need not be removed.
To remove the duplicates in our data set, we use df.drop_duplicates(). The index of the dataframe has to be reset after dropping duplicates, which is done with reset_index().
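Before deciding either way, it helps to look at the repeated rows themselves. The following is a minimal sketch, assuming a small made-up table of flower measurements (the `flowers` frame and its values are hypothetical, not taken from our data set or from iris). It counts the repeats and then lists every copy side by side.

```python
import pandas as pd

# Hypothetical measurements, used only for illustration.
flowers = pd.DataFrame({
    "petal_length": [1.4, 1.4, 5.1, 4.7, 5.1],
    "petal_width": [0.2, 0.2, 1.9, 1.4, 1.9],
})

# By default duplicated() marks only the later copies of a repeated row.
n_repeats = int(flowers.duplicated().sum())

# keep=False marks every member of a repeated group, so all copies
# can be inspected together before deciding whether to drop them.
all_copies = flowers[flowers.duplicated(keep=False)].sort_values(
    list(flowers.columns)
)

print("Repeated rows:", n_repeats)
print(all_copies)
```

If the repeated rows look like plausible independent measurements, as here, they can be left in place. If they can only be explained by a data entry error, they should be dropped.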
print('No of duplicates present:', df.duplicated().sum())
df = df.drop_duplicates()
# reset index.
df = df.reset_index(drop=True)
print('Shape:', df.shape)
54 duplicates were present in the dataset and were removed. We are now left with 481 observations.
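To see why the index needs resetting, here is a small sketch on a hypothetical `students` frame. The column names follow our data set, but the values are invented for illustration. After drop_duplicates() the surviving rows keep their old labels, which leaves a gap that reset_index(drop=True) closes.

```python
import pandas as pd

# Hypothetical rows; row 2 is an exact copy of row 0.
students = pd.DataFrame({
    "Entrance Rank": [101, 205, 101, 330, 205],
    "Admn Yr": [2015, 2015, 2015, 2016, 2016],
})

n_before = len(students)

# drop_duplicates() keeps the first copy of each fully identical row.
deduped = students.drop_duplicates()

# The surviving rows still carry their original labels, so label 2 is missing.
labels_before_reset = list(deduped.index)

# drop=True discards the old labels instead of adding them as a column.
deduped = deduped.reset_index(drop=True)

n_removed = n_before - len(deduped)
print("Labels before reset:", labels_before_reset)
print("Labels after reset:", list(deduped.index))
print("Rows removed:", n_removed)
```

Note that rows 1 and 4 share an Entrance Rank but differ in Admn Yr, so they are not treated as duplicates. Only rows that match on every column are dropped by default.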