May 07, 2019

# Introduction

You have already learnt about basics of the Pandas DataFrame — a 2-dimensional data-structure supported by Pandas which looks and behaves like a table.

In this tutorial and the next one, we will learn how to select various subsets of the DataFrame. Pandas library provides us with a number a flexible options to do this.

# Set-up

As in the first Pandas tutorial, let’s start by importing the libraries and loading the dataset.

import pandas as pd
import numpy as np
# just to check data loaded
print(adult_income.head())

# Indexing and slicing

Data in pandas is stored along rows and columns.

Each of the columns can be identified by its column label and each row can be identified by the row index.

These column labels and row indexes allow us to access subsets of the DataFrame.

## Selecting single column using []

We can select a column based on the column labels with the [] operator.

For example, say we want the column income from the adult_income DataFrame, we will select it with the syntax:

selected_col = adult_income["income"]

This will return the selected column.

Note: When we select a column of a Pandas DataFrame this way, the result is the Pandas data-structure called Series. A pandas Series is a 1-dimensional data-structure.

Let us try:

selected_col = adult_income["income"]
# display first few elements
print(selected_col.head())

In the output above, there’s no column name. That’s because we’re calling the head() function on a Series. If the result was a dataframe, we would see a column name in the output.

## Selecting multiple columns using []

We can also select multiple columns using [] by passing a list of columns.

Let us select the columns education, sex, hours_per_week and income.

First we will create a list with the column labels, and then pass this list within []:

cols = ["education", "sex", "hours_per_week", "income"]
adult_income[cols]

We could also create the list within [] directly without declaring it beforehand.

adult_income[["education", "sex", "hours_per_week", "income"]]

The result is a DataFrame with the selected columns.

Let us try it out:

# form a list of column labels
cols = ["education", "sex", "hours_per_week", "income"]
# form the subset
# display first few rows
print(subset.head())

Note that if the list has a single column name, the result is a DataFrame with one-column, which is different from a Series. Let’s try it:

cols = ["education"]
print(subset.head())

## Column name as attribute of the DataFrame

Lastly, we can also extract a column directly as an attribute using the . operator.

For example, to extract the education column from the adult_income DataFrame we would write:

adult_income.education

This is a more convenient syntax for accessing a single column than the square brackets, but there are some limitations of using this syntax:

1. The column label must be a valid Python identifier. If the column name has a space, starts with a digit, uses special symbols, etc, then you can’t use this syntax to access the column.
2. The column label shouldn’t have the same name as any existing method or attribute.

Okay, let’s use this and see for ourselves once:

selected_col = adult_income.education
# display first few elements
print(selected_col.head())

## Selecting subset of rows using []

We can use [] for selecting a subset of rows as well. The syntax is:

DataFrame[start:stop]

Here, we are selecting the rows from position start to stop.

This works in a way very similar to Python list slicing. So:

1. The row at position start is included in the subset, but the row at position stop is not.
2. If you don’t provide a start value, the subset starts with the first row (0th row).
3. If you don’t provide a stop value, the subset ends with the last row.
4. If we just do DataFrame[:] (i.e. omit both start and stop), then we get the entire DataFrame.
5. We can even use negative indexes as start or stop. -1 is considered to be the last row, -2 is the second last row, and so on.

Let us look at a few examples using the adult_income DataFrame to clarify the idea:

# subset of DataFrame from row 5 to row 10
print(adult_income[5:11])
# subset of DataFrame from row 1 to row 5
print(adult_income[:6])
# the last row
print(adult_income[-1:])

Note: adult_income[3] does not work, because Pandas will look for a column with label 3. We must use slices to select rows with [].

To recap, we can use [] in three ways.

• "column_name" — result is a single column as a Series
• ["col1", "col2"] — result is a DataFrame with the list of columns
• start:stop — result is a DataFrame with rows from position start to stop-1

# Indexing and slicing with iloc

To specify a subset of rows and columns from a DataFrame based on position, we can use iloc. Its syntax is as follows:

DataFrame.iloc[rows, columns]

The rows and columns parameters take the integer positions of the rows and columns to select. For example, if we want the education column we will refer to it as the 0th column for iloc.

The rows and columns parameters in iloc accept three kinds of arguments:

• integer — position of the row / column we want
• list of integers — positions of rows / columns we want
• slice object (start:stop) — range of rows / columns positions we want

Let us take some examples of each.

## Selecting by lists / slices with iloc

Select rows at positions 0, 1, 2 and 5, and all columns:

subset = adult_income.iloc[[0,1,2,5], :]
print(subset)

Select all rows, and first 3 columns:

subset = adult_income.iloc[:, 0:3]
print(subset.head())

## Selecting by single value with iloc

When we pass in a single value as the argument for rows or columns, Pandas selects that row / column, but also reduces the dimensionality of the data-structure.

For example, selecting a single column returns a Series. Let’s select the column at position 2 and all the rows:

column = adult_income.iloc[:, 2]
print(column.head())

Similarly, selecting a single row also returns a Series. Let’s select the row at position 4:

row = adult_income.iloc[4, :]
print(row)

Finally, selecting a single row and a single column just gives the actual data:

value = adult_income.iloc[4, 2]
print(value)

Awesome! Play around with these more to select different kinds of subsets!

# Indexing and slicing with loc

Another way to select a subset of rows and columns from a DataFrame is using loc. Its syntax is similar to iloc:

DataFrame.loc[rows, columns]

Both rows and columns parameters accept the three argument types that we have discussed so far — a single value, a list of values, and a slice.

The main differences are that:

• The rows and columns parameter take the index labels, and not the positions.
• For rows, so far in the tutorial our row index labels have been the same as position, starting at 0 and going up to n-1. But this will not be true always.
• For columns, the index labels are the column names.
• For slices, both start and stop are included in the subset.

Let’s see all of this in action with some examples.

Let’s start with the case where we select a slice of columns.

If we wanted all the columns between edu_yrs and hours_per_week, we will simply pass the slice object - 'edu_yrs':'hours_per_week' to the columns parameter.

Let’s select rows with index labels 10 to 15 (both inclusive), and columns from edu_yrs to hours_per_week (again, both inclusive).

subset = adult_income.loc[10:15, 'edu_yrs':'hours_per_week']
print(subset)

Select rows with index labels 2, 4, 6 and 8 and all columns.

subset = adult_income.loc[[2, 4, 6, 8], :]
print(subset)

Now, let’s see an illustration of the difference between row index labels and row positions.

In our current DataFrame, the row index labels go from 0 to n-1, so there’s no difference. Let’s first create a new DataFrame which starts from row 100 of the current DataFrame:

df_100 = adult_income.iloc[100:, :]
# display first few rows
print(df_100.head())

Now the row index labels start from 100. Let’s see the difference between row index labels and row positions:

print('.loc[105:110, :] -- gives rows matching that *index*')
print(df_100.loc[105:110, :])
print('')
print('[105:110] -- (without loc) gives rows from *position* 105 to 110')
print(df_100[105:110])
print('')
print('iloc[105:110, :] -- also gives rows from *position* 105 to 110')
print(df_100.iloc[105:110, :])

Hope that example made things crystal clear!

## Single value with loc

Passing single values for selecting rows / columns with loc has the same effect as in iloc — Pandas selects that row / column and reduces the dimensionality of the data-structure.

For example, selecting a single column returns a Series. Let’s select the sex column from our DataFrame:

column = adult_income.loc[:, "sex"]
print(column.head())

Similarly, selecting a single row also returns a Series. Let’s select the row with row index label 2:

row = adult_income.loc[2, :]
print(row)

Finally, selecting a single row and a single column just gives the actual data.

print('sex of person at row index 2')
print('education of person at row index 4')
print(adult_income.loc[4, 'education'])

# Chained Indexing

When we perform two or more indexing operations in a single expression, it is known as chained indexing.

Earlier in the tutorial, we selected all the rows from 100th row onward with adult_income.iloc[100:, :], and then from within that subset we selected the 5 rows from the 105th row onward using iloc[105:110,:].

This took place in two steps:

df_100 = adult_income.iloc[100:, :]
print(df_100.iloc[105:110, :])

Let us combine these two operations into one line and see if we can get the same result.

subset = adult_income.iloc[100:, :].iloc[105:110, :]
print(subset)

The results are same!

Let’s see another example of selecting a subset. This time we will select the rows 25 to 30 (by position) and columns income and education. But, we will not use loc or iloc, only the [] which we learnt about at the beginning.

Since [] only takes a single argument, we will use chain indexing to achieve this:

subset = adult_income[25:30][['education', 'income']]
print(subset)

Note: Chained indexing is actually a two-step process. Writing things in one line is just convenience, but it’s not more computationally efficient for the program. So in the above example, first [25:30] operation selects the rows and creates a temporary DataFrame. Then [['education', 'income']] operation takes place, which selects the columns on the result returned from the first operation.

If we are not careful about using the indexing operations properly, it’s easy to make mistakes with chained indexing. Let us see an example:

print(adult_income.iloc[10, :].loc[:, 'hours_per_week'])

The above code gives an error because the first operation — iloc[10, :] — selects a row as a Series, but the second operation still tries to select both rows and columns, while Series only has one set of labels.

With indexing, it is important to keep in mind the whether the result is a DataFrame, a Series or a single value.

As a final example, let’s use chain indexing to select a single element from the DataFrame:

print('selecting the hours_per_week value from row at position 10')
print(adult_income.iloc[10, :].loc['hours_per_week'])

# Summary

• Pandas provides many ways of getting a subset of data from a DataFrame. In this tutorial, we learnt about [], .iloc[,] and .loc[,].
• When using the [] operation
• We can provide a column name or list of column names — DataFrame[column_name] and DataFrame[list of columns labels]
• Or we can provide a slice which selects rows by position — DataFrame[start:stop]
• DataFrame.iloc[rows,columns] — selects a subset of rows and columns by position
• DataFrame.loc[rows,columns] — selects a subset of rows and columns by label
• Both loc and iloc accept three types of parameters — single value, list of values, and slices.
• When we perform two or more indexing operations in a single expression, it is known as chained indexing.
• When selecting columns, it’s more natural to select by name. When selecting rows, it’s more natural to select by position.