You have already learnt about basics of the Pandas DataFrame, and how to select various subsets of the DataFrame using indexing and slicing.
In this tutorial we will learn how to select a subset of DataFrame which matches some criteria. For example, we might want to select the rows of a DataFrame based on the values in each row (say age >= 30). Or, we might even want to select a subset of the DataFrame based on the datatypes.
Let’s get started!
As in the previous tutorial, let’s start by importing the libraries and loading the dataset.
import pandas as pd import numpy as np # load the dataset load_commonlounge_dataset('adult_income_data_small') adult_income = pd.read_csv("/tmp/adult_income_data_small.csv") # just to check data loaded print(adult_income.head())
Sometimes we want to select a subset of DataFrame based on the values it contains.
For example, from the
adult_income DataFrame, we might want to select all the adults with more than 12 years of education (
edu_yrs >= 12).
Pandas allows us to create a subset based on a Boolean Array. This Array needs to be of the same length as number of rows in the DataFrame. The rows corresponding to
True are kept in the subset and the rows corresponding to
False are excluded. This is known as Boolean Indexing.
Let’s write the code for select all the adults with more than 12 years of education (
edu_yrs >= 12) from our
We will first create the Boolean array:
b_arr = adult_income["edu_yrs"] >= 12
Then, we will use this array to select rows from
adult_income DataFrame using
temp_df = adult_income[b_arr]
Let us create the Boolean Array and take glimpse at the first 10 elements. The array created is actually a Pandas Series with elements of datatype
bool — so all the values are
# create boolean array b_arr = adult_income["edu_yrs"] >= 12 # print the first 10 values print(b_arr.head(10))
We can see that among the top 10 rows, only the row with index 4 is
True and everything else is
Hence, when perform the boolean indexing, that row is the only one which will be selectively kept from the first 10 rows. Let us look at the data subset selected.
# selection of subset temp_df = adult_income[b_arr] # display first few elements print(temp_df.head())
We can do this directly within the
 operator too:
temp_df = adult_income[adult_income["edu_yrs"] >= 12]
We can also use
loc to perform Boolean Indexing, which works exactly the same way. Here’s an example:
temp_df = adult_income.loc[adult_income["edu_yrs"]>=12, :] print(temp_df.head())
Although we will be using
 operator throughout this tutorial, you should try and re-write the code with
loc and see that they give the same results!
Note: In Pandas,
ilocdoes not support Indexing using a Boolean Series.
In Python, we can combine boolean values using logical operators
Pandas supports these operations for Series, so we can write more complicated expressions and then use them for Boolean Indexing. In Pandas,
& is used for the and operation,
| is used for the or operation, and
~ is used for the not operation.
Let’s see a few examples.
Suppose we wanted the rows for which
edu_yrs is greater than or equal to 12 and
We can create a Boolean array using the
& operator as follows:
b_arr = (adult_income["edu_yrs"] >= 12) & (adult_income["sex"] == 'Female')
Let us use this to create a subset and see the first few elements!
# create boolean array b_arr = (adult_income["edu_yrs"] >= 12) & (adult_income["sex"] == 'Female') # selection of subset temp_df = adult_income[b_arr] # display first few elements print(temp_df.head())
Now, let’s select rows for which either capital_gain is more than 1000 or income is >50K.
This time, we will use the
| operator to create the Boolean array:
# create boolean array b_arr = (adult_income["capital_gain"] > 1000) | (adult_income["income"] == '>50K') # selection of subset temp_df = adult_income[b_arr] # display first 10 elements print(temp_df.head(10))
We can use the
~ operator select all the rows which do not match some criteria!
For example, suppose we need to select all the rows where the
education is not “Doctorate”. The Boolean expression corresponding to this criteria is
~(adult_income["education"] == 'Doctorate')
Let’s see the full code using this criteria:
# create boolean array b_arr = ~(adult_income["education"] == 'Doctorate') # selection of subset temp_df = adult_income[b_arr] # display first 10 elements print(temp_df.head(10))
You can do some boolean indexing yourself to find interesting types of people in the above data. Are there high school graduates (
HS-grad) with income
>50K? Are there people who work less than 30 hours a week with income
Sometimes we might want to keep only particular types of data in our DataFrame subset. The data type of a column in a DataFrame is referred to as dtype.
We will use the method
select_dtypes() to carry out data-type based selection. The syntax is as follows:
Here, the parameter
include is used to decide which dtypes to include, and
exclude is used to decide which dtypes to exclude.
Each parameter can be either a dtype, or a list of dtypes.
Usually, we specify the each dtype as a string. Also, most of the time, one parameter is enough to work with.
Some common valid strings for dtypes are:
'int': for integers
'float': for floating point numbers
'bool': Boolean data type
'number': all numeric types
'categorical': for pandas dtype categorical
'object': all object data types, generally strings are the only object datatypes
'timedelta': these are datatypes related to time and date. We will learn about these in a later tutorial.
Let us use this to select all the numeric datatypes from the
# select numeric dtypes df_temp = adult_income.select_dtypes(include='number') # display first few elements print(df_temp.head())
- With Boolean Indexing, we can select DataFrame rows satisfying some criteria
- We can define criterias / Boolean arrays using comparison operations (using
>, etc) and logical operations (using
- We can also selectively keep columns with particular datatypes using
subset_dataframe = DataFrame[boolean_array] subset_dataframe = DataFrame[(boolean_array_1) & (boolean_array_2)] # and subset_dataframe = DataFrame[(boolean_array_1) | (boolean_array_2)] # or subset_dataframe = DataFrame[~(boolean_array)] # not
Selecting columns with specific datatypes:
DataFrame.select_dtypes(include="", exclude="") # common arguments: 'int', 'float', 'bool', 'number', 'categorical', 'object, 'datetime'