You have already learnt about basics of the Pandas DataFrame — a 2-dimensional data-structure supported by Pandas which looks and behaves like a table.
In this tutorial and the next one, we will learn how to select various subsets of the DataFrame. Pandas library provides us with a number a flexible options to do this.
As in the first Pandas tutorial, let’s start by importing the libraries and loading the dataset.
import pandas as pd import numpy as np # load the dataset load_commonlounge_dataset('adult_income_data_small') adult_income = pd.read_csv("/tmp/adult_income_data_small.csv") # just to check data loaded print(adult_income.head())
Data in pandas is stored along rows and columns.
Each of the columns can be identified by its column label and each row can be identified by the row index.
These column labels and row indexes allow us to access subsets of the DataFrame.
We can select a column based on the column labels with the
For example, say we want the column
income from the
adult_income DataFrame, we will select it with the syntax:
selected_col = adult_income["income"]
This will return the selected column.
Note: When we select a column of a Pandas DataFrame this way, the result is the Pandas data-structure called Series. A pandas Series is a 1-dimensional data-structure.
Let us try:
selected_col = adult_income["income"] # display first few elements print(selected_col.head())
In the output above, there’s no column name. That’s because we’re calling the
head() function on a Series. If the result was a dataframe, we would see a column name in the output.
We can also select multiple columns using
 by passing a list of columns.
Let us select the columns
First we will create a list with the column labels, and then pass this list within
cols = ["education", "sex", "hours_per_week", "income"] adult_income[cols]
We could also create the list within
 directly without declaring it beforehand.
adult_income[["education", "sex", "hours_per_week", "income"]]
The result is a DataFrame with the selected columns.
Let us try it out:
# form a list of column labels cols = ["education", "sex", "hours_per_week", "income"] # form the subset subset = adult_income[cols] # display first few rows print(subset.head())
Note that if the list has a single column name, the result is a DataFrame with one-column, which is different from a Series. Let’s try it:
cols = ["education"] subset = adult_income[cols] print(subset.head())
Lastly, we can also extract a column directly as an attribute using the
For example, to extract the
education column from the
adult_income DataFrame we would write:
This is a more convenient syntax for accessing a single column than the square brackets, but there are some limitations of using this syntax:
- The column label must be a valid Python identifier. If the column name has a space, starts with a digit, uses special symbols, etc, then you can’t use this syntax to access the column.
- The column label shouldn’t have the same name as any existing method or attribute.
Okay, let’s use this and see for ourselves once:
selected_col = adult_income.education # display first few elements print(selected_col.head())
We can use
 for selecting a subset of rows as well. The syntax is:
Here, we are selecting the rows from position
This works in a way very similar to Python list slicing. So:
- The row at position start is included in the subset, but the row at position stop is not.
- If you don’t provide a
startvalue, the subset starts with the first row (0th row).
- If you don’t provide a
stopvalue, the subset ends with the last row.
- If we just do
DataFrame[:](i.e. omit both
stop), then we get the entire DataFrame.
- We can even use negative indexes as
stop. -1 is considered to be the last row, -2 is the second last row, and so on.
Let us look at a few examples using the
adult_income DataFrame to clarify the idea:
# subset of DataFrame from row 5 to row 10 print(adult_income[5:11])
# subset of DataFrame from row 1 to row 5 print(adult_income[:6])
# the last row print(adult_income[-1:])
adult_incomedoes not work, because Pandas will look for a column with label 3. We must use slices to select rows with
To recap, we can use
 in three ways.
"column_name"— result is a single column as a Series
["col1", "col2"]— result is a DataFrame with the list of columns
start:stop— result is a DataFrame with rows from position start to stop-1
To specify a subset of rows and columns from a DataFrame based on position, we can use
iloc. Its syntax is as follows:
columns parameters take the integer positions of the rows and columns to select. For example, if we want the
education column we will refer to it as the 0th column for
columns parameters in
iloc accept three kinds of arguments:
- integer — position of the row / column we want
- list of integers — positions of rows / columns we want
- slice object (
start:stop) — range of rows / columns positions we want
Let us take some examples of each.
Select rows at positions 0, 1, 2 and 5, and all columns:
subset = adult_income.iloc[[0,1,2,5], :] print(subset)
Select all rows, and first 3 columns:
subset = adult_income.iloc[:, 0:3] print(subset.head())
When we pass in a single value as the argument for
columns, Pandas selects that row / column, but also reduces the dimensionality of the data-structure.
For example, selecting a single column returns a Series. Let’s select the column at position 2 and all the rows:
column = adult_income.iloc[:, 2] print(column.head())
Similarly, selecting a single row also returns a Series. Let’s select the row at position 4:
row = adult_income.iloc[4, :] print(row)
Finally, selecting a single row and a single column just gives the actual data:
value = adult_income.iloc[4, 2] print(value)
Awesome! Play around with these more to select different kinds of subsets!
Another way to select a subset of rows and columns from a DataFrame is using
loc. Its syntax is similar to
columns parameters accept the three argument types that we have discussed so far — a single value, a list of values, and a slice.
The main differences are that:
columnsparameter take the index labels, and not the positions.
- For rows, so far in the tutorial our row index labels have been the same as position, starting at 0 and going up to n-1. But this will not be true always.
- For columns, the index labels are the column names.
- For slices, both
stopare included in the subset.
Let’s see all of this in action with some examples.
Let’s start with the case where we select a slice of columns.
If we wanted all the columns between
hours_per_week, we will simply pass the slice object -
'edu_yrs':'hours_per_week' to the
Let’s select rows with index labels
15 (both inclusive), and columns from
hours_per_week (again, both inclusive).
subset = adult_income.loc[10:15, 'edu_yrs':'hours_per_week'] print(subset)
Select rows with index labels 2, 4, 6 and 8 and all columns.
subset = adult_income.loc[[2, 4, 6, 8], :] print(subset)
Now, let’s see an illustration of the difference between row index labels and row positions.
In our current DataFrame, the row index labels go from 0 to n-1, so there’s no difference. Let’s first create a new DataFrame which starts from row 100 of the current DataFrame:
df_100 = adult_income.iloc[100:, :] # display first few rows print(df_100.head())
Now the row index labels start from 100. Let’s see the difference between row index labels and row positions:
print('.loc[105:110, :] -- gives rows matching that *index*') print(df_100.loc[105:110, :]) print('') print('[105:110] -- (without loc) gives rows from *position* 105 to 110') print(df_100[105:110]) print('') print('iloc[105:110, :] -- also gives rows from *position* 105 to 110') print(df_100.iloc[105:110, :])
Hope that example made things crystal clear!
Passing single values for selecting rows / columns with
loc has the same effect as in
iloc — Pandas selects that row / column and reduces the dimensionality of the data-structure.
For example, selecting a single column returns a Series. Let’s select the
sex column from our DataFrame:
column = adult_income.loc[:, "sex"] print(column.head())
Similarly, selecting a single row also returns a Series. Let’s select the row with row index label
row = adult_income.loc[2, :] print(row)
Finally, selecting a single row and a single column just gives the actual data.
print('sex of person at row index 2') print(adult_income.loc[2, 'sex']) print('education of person at row index 4') print(adult_income.loc[4, 'education'])
When we perform two or more indexing operations in a single expression, it is known as chained indexing.
Earlier in the tutorial, we selected all the rows from 100th row onward with
adult_income.iloc[100:, :], and then from within that subset we selected the 5 rows from the 105th row onward using
This took place in two steps:
df_100 = adult_income.iloc[100:, :] print(df_100.iloc[105:110, :])
Let us combine these two operations into one line and see if we can get the same result.
subset = adult_income.iloc[100:, :].iloc[105:110, :] print(subset)
The results are same!
Let’s see another example of selecting a subset. This time we will select the rows
30 (by position) and columns
education. But, we will not use
iloc, only the
 which we learnt about at the beginning.
 only takes a single argument, we will use chain indexing to achieve this:
subset = adult_income[25:30][['education', 'income']] print(subset)
Note: Chained indexing is actually a two-step process. Writing things in one line is just convenience, but it’s not more computationally efficient for the program. So in the above example, first
[25:30]operation selects the rows and creates a temporary DataFrame. Then
[['education', 'income']]operation takes place, which selects the columns on the result returned from the first operation.
If we are not careful about using the indexing operations properly, it’s easy to make mistakes with chained indexing. Let us see an example:
print(adult_income.iloc[10, :].loc[:, 'hours_per_week'])
The above code gives an error because the first operation —
iloc[10, :] — selects a row as a Series, but the second operation still tries to select both rows and columns, while Series only has one set of labels.
With indexing, it is important to keep in mind the whether the result is a DataFrame, a Series or a single value.
As a final example, let’s use chain indexing to select a single element from the DataFrame:
print('selecting the hours_per_week value from row at position 10') print(adult_income.iloc[10, :].loc['hours_per_week'])
- Pandas provides many ways of getting a subset of data from a DataFrame. In this tutorial, we learnt about
- When using the
- We can provide a column name or list of column names —
DataFrame[list of columns labels]
- Or we can provide a slice which selects rows by position —
DataFrame.iloc[rows,columns]— selects a subset of rows and columns by position
DataFrame.loc[rows,columns]— selects a subset of rows and columns by label
ilocaccept three types of parameters — single value, list of values, and slices.
- When we perform two or more indexing operations in a single expression, it is known as chained indexing.
- When selecting columns, it’s more natural to select by name. When selecting rows, it’s more natural to select by position.