Pandas is an open-source Python library that provides a 2-dimensional data structure called a DataFrame.
A DataFrame is the Python equivalent of a spreadsheet or SQL table, and is the most commonly used Pandas object. It provides a lot of useful functionality to work with tabular data, which is something we do regularly in data science and machine learning.
In this tutorial, we will learn about the pandas DataFrame and explore some of its useful methods.
We will need the Pandas and NumPy libraries for this tutorial, so let us load them at the beginning.
import pandas as pd
import numpy as np
import pandas as pd is the standard convention for importing pandas. It is followed universally within the machine learning / data science communities.
The next step is to load our dataset.
Before we start loading the dataset with Pandas, we need to download the data onto CommonLounge's system. The download function saves the file to the location /tmp/adult_income_data_small.csv. Here are the first few lines from the file:
education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
The above format is called comma-separated values, or CSV. It’s a very common way of storing tabular data in plain-text files. Often, the first line is a header line containing the column names. The rest of the lines have one observation per line.
Now that the dataset is downloaded, let’s use the Pandas
read_csv() function to read the dataset.
adult_income = pd.read_csv("/tmp/adult_income_data_small.csv")
The read_csv() function returns a DataFrame. A DataFrame is a 2-dimensional data structure, analogous to a spreadsheet or SQL table, and is the most commonly used Pandas object.
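If you want to experiment without the downloaded file, read_csv() can also read from any file-like object. Here is a minimal sketch that loads the five sample rows shown above from an in-memory string, as a stand-in for the full file:

```python
import io

import pandas as pd

# The five sample rows shown earlier, standing in for the full downloaded file
sample_csv = """education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
"""

# read_csv accepts any file-like object, so StringIO works just like a file path
adult_income = pd.read_csv(io.StringIO(sample_csv))
print(adult_income.shape)  # (5, 6): five observations, six variables
```

Everything below works the same whether the DataFrame came from a file on disk or from a string.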
Once the data is loaded into a pandas DataFrame, the first step is to see how large the dataset is. For this, we can check the shape attribute of the DataFrame, which tells us the number of rows (observations) and columns (variables).
print('Shape of dataset:', adult_income.shape)
Our dataset has 326 rows and 6 columns.
To get the column names, use the columns attribute of the DataFrame.
To get a glimpse of the data itself, we can see the first five rows using the head() method.
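To make this concrete, here is a runnable sketch using the five sample rows from earlier as a stand-in for the full dataset; columns holds the column labels and head() returns the first five rows:

```python
import io

import pandas as pd

# The five sample rows shown earlier, standing in for the full dataset
sample_csv = """education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
"""
adult_income = pd.read_csv(io.StringIO(sample_csv))

# the column labels
print(list(adult_income.columns))
# the first five rows (here, the whole sample)
print(adult_income.head())
```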
Pandas takes care of printing the table nicely for us! Sweet!
Note: On the very left, you can see the numbers 0, 1, 2, and so on. These are the row indexes. The same way every column has a name, every row has an index. By default, the row indexes go from 0 to n-1, where n is the total number of rows in the DataFrame.
We can also pass an argument to head() specifying how many rows we want to see. Let's try printing the first 8 observations:
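A sketch of head() with an explicit row count, again using the five sample rows as a stand-in. On the full 326-row file this would print eight rows; here there are only five to show:

```python
import io

import pandas as pd

# The five sample rows shown earlier, standing in for the full dataset
sample_csv = """education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
"""
adult_income = pd.read_csv(io.StringIO(sample_csv))

# head(n) returns up to n rows from the top of the DataFrame
first_8 = adult_income.head(8)
print(first_8)
```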
Before we move further, it might be helpful to know a little more about our dataset. This data has been taken from the UCI Machine Learning Repository. (The dataset has been modified for the purpose of this tutorial).
Citation: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
The features/variables in the dataset are described below:
education: highest educational qualification (categorical)
edu_yrs: total number of years spent in educational institutes (numerical)
sex: sex of the individual (categorical)
capital_gain: profits made through capital markets (numerical)
hours_per_week: working hours per week (numerical)
income: the target variable, either <=50K or >50K (categorical)
Let’s see another couple of helpful DataFrame methods — info() and describe().
We can use the .info() method to look at the datatype of every variable, along with the number of observations present, like so:
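For example, on the five sample rows (a stand-in for the full file), .info() reports each column's dtype and non-null count:

```python
import io

import pandas as pd

# The five sample rows shown earlier, standing in for the full dataset
sample_csv = """education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
"""
adult_income = pd.read_csv(io.StringIO(sample_csv))

# prints one line per column: name, non-null count, and dtype
adult_income.info()
```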
As you can see, the numerical variables such as edu_yrs and capital_gain have type int64 (64-bit integers), and the categorical variables (strings) such as education and sex have type object.
Often, datasets have missing values, usually denoted by
null. Hence, pandas is being helpful and letting us know that for all variables in our dataset, we have 326
non-null values, i.e. no missing values. 🙂
Finally, there’s the
.describe() method, which by default summarizes the numerical variables. For each numerical variable, it tells us the minimum value, maximum value, mean, standard deviation, etc. Even the 25th and 75th percentile values are included.
The .describe() method also accepts an argument which allows us to specify which datatype we want a summary of. For categorical variables, the datatype is object. Hence, to get a summary of the categorical variables, we pass the parameter include with the argument 'object'. When summarizing categorical variables, the .describe() method includes information about the number of unique values for the variable, the most common value for the variable, etc.
Let’s see it in action:
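A runnable sketch on the five sample rows (a stand-in for the full dataset):

```python
import io

import pandas as pd

# The five sample rows shown earlier, standing in for the full dataset
sample_csv = """education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
"""
adult_income = pd.read_csv(io.StringIO(sample_csv))

# count, number of unique values, most common value (top), and its frequency
categorical_summary = adult_income.describe(include="object")
print(categorical_summary)
```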
Awesome! Let’s see some even more interesting things we could do with our dataset.
Sometimes, in order to do data exploration or feature engineering, we might need the DataFrame to be sorted in a particular order. For example, suppose we want the adult-income dataset to be sorted by the number of years of education, i.e. the edu_yrs column.
We can use the
sort_values() DataFrame method to sort our dataset. The syntax is as follows:
Sorted_DataFrame = DataFrame.sort_values(by, axis=0, ascending=True)
This returns a sorted copy of the DataFrame.
by: the column name or list of column names to sort the DataFrame by
axis: “0” (the default) sorts the rows by the values in a column. (“1” sorts the columns by the values in a row)
ascending: if this is “True”, sort in ascending order, otherwise descending order
Let us sort the adult_income data and see for ourselves! We will sort according to the edu_yrs column, in descending order:
df_sorted = adult_income.sort_values(by="edu_yrs", axis=0, ascending=False)
Now, let’s look at the sorted dataframe:
# display the top 5 values (maximum "edu_yrs")
print(df_sorted.head())
Note: On the very left, you can see the row indexes 75, 4, 211, and so on. When creating a DataFrame from another DataFrame, Pandas operations maintain the index values. Recall that the adult_income DataFrame has row indexes going from 0 to 325, in that order. So the sorted DataFrame still has the row indexes 0 to 325, but the order of the indexes has changed because of the sorting.
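You can see this behaviour on the five sample rows (a stand-in for the full dataset). As an aside, if you ever want fresh 0 to n-1 indexes after sorting, the reset_index() method with drop=True rebuilds them:

```python
import io

import pandas as pd

# The five sample rows shown earlier, standing in for the full dataset
sample_csv = """education,edu_yrs,sex,capital_gain,hours_per_week,income
Some-college,10,Male,0,40,<=50K
Some-college,10,Male,0,45,<=50K
HS-grad,9,Male,0,50,>50K
HS-grad,9,Female,0,50,<=50K
Doctorate,16,Male,0,40,>50K
"""
adult_income = pd.read_csv(io.StringIO(sample_csv))

df_sorted = adult_income.sort_values(by="edu_yrs", axis=0, ascending=False)
# the original row indexes travel with their rows after sorting
print(list(df_sorted.index))
# rebuild a fresh 0..n-1 index, discarding the old one
df_reset = df_sorted.reset_index(drop=True)
print(list(df_reset.index))
```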
# display the bottom 5 values (minimum "edu_yrs")
print(df_sorted.tail())
As you can see, our data is now sorted.
However, you may notice that even though the data is sorted, multiple rows with the same value for
edu_yrs are not arranged in any particular order.
What if we wanted all the rows which have the same value of edu_yrs to be sorted with respect to the hours_per_week column? We just need to pass the name of the second column we want to sort by, along with the first column, as a list to the by parameter:
df_sorted = adult_income.sort_values(by=["edu_yrs","hours_per_week"], axis=0, ascending=False)
# display the top 12 values after sorting
print(df_sorted.head(12))
We can even modify this further. We may not always want the DataFrame to be sorted in descending order by both columns.
We might want to get it sorted in descending order of
edu_yrs values, while sorting in ascending order of
hours_per_week values within that.
To do that, we pass a list to the ascending parameter as well, with the order corresponding to that of the list passed to the by parameter. For example, in our current problem, we will pass the following argument:
ascending = [False, True]
df_sorted = adult_income.sort_values(by=["edu_yrs","hours_per_week"], axis=0, ascending=[False, True])
# display the top 12 values after sorting
print(df_sorted.head(12))
Sometimes we may not be interested in sorting the DataFrame, but just need to look at the top 10 or top 20 instances according to some column/feature.
- Who are the top 10 earners of capital_gain?
- According to hours_per_week, which 5 people have the lowest working hours?
Instead of sorting and then viewing the top 5 or 10 results, we can use a function to display the largest or smallest n values according to any column. We will be using the nlargest() and nsmallest() functions for this.
1. Who are the top 10 earners of "capital_gain"?
Let us answer the first question with nlargest(). The syntax is as follows:

df_largest = DataFrame.nlargest(n, column_label)

n — number of rows to return
column_label — name of the column for which we want the largest values

This function returns a DataFrame with n rows.
df_largest = adult_income.nlargest(10, "capital_gain")
# display results
print(df_largest)
2. According to "hours_per_week", which 5 people have the lowest working hours?
We will answer the second question with
nsmallest(n, column_label). The parameters are exactly the same:
# According to "hours_per_week", which 5 people have the lowest working hours?
df_smallest = adult_income.nsmallest(5, "hours_per_week")
# display results
print(df_smallest)
- The pandas DataFrame is a 2D table. It is the Python equivalent of a spreadsheet or SQL table.
- read_csv() function — reads a CSV file and returns a DataFrame
- head() and tail() methods — used to see the first and last few rows
- info() method — prints the datatype of every variable
- describe() method — summarizes each variable in the DataFrame. For numerical variables, displays the mean, median, etc. For categorical variables, displays the mode, number of unique values, etc.
- sort_values() method — used to sort a DataFrame
- by parameter — for specifying the column(s) to sort by
- ascending parameter — for ascending or descending order
- nlargest(n, column) and nsmallest(n, column) methods — return the n rows with the largest / smallest values for the specified column