So far in the course, we have learnt quite a bit about DataFrames. In particular, we learnt about using various boolean and arithmetic operations on DataFrame columns, and also about indexing to select and modify various subsets of a DataFrame.
In this tutorial, we will learn another way of operating on and modifying a DataFrame, using DataFrame methods like apply() and applymap(). These methods allow us to apply a function over an entire DataFrame.
Let’s get started!
As in the previous tutorials, let us load the Pandas and NumPy libraries at the beginning.
import pandas as pd
import numpy as np
Let us load the dataset for this tutorial. We will use the read_csv() function for this. The dataset has 8 columns, but we will only keep 5 of them for this tutorial.
load_commonlounge_dataset('student_v3')
student = pd.read_csv("/tmp/student_v3.csv")
student = student[['Admn Yr', 'Board', 'Physics', 'Chemistry', 'Maths']]
Let us look at the first five rows of the data using the head() method.
This dataset contains information about students from an Engineering college. Here’s a brief description of the columns in the dataset:
- Admn Yr: The year in which the student was admitted into the college (numerical)
- Board: Board under which the student studied in High School (categorical)
- Physics: Marks secured in Physics in the final High School exam (numerical)
- Chemistry: Marks secured in Chemistry in the final High School exam (numerical)
- Maths: Marks secured in Maths in the final High School exam (numerical)
For some parts of this tutorial, we will also need a DataFrame which only has numerical values. So, let’s also create a modified DataFrame with the numerical features from student. We will call this student_num.
# extract the numerical features
student_num = student.select_dtypes(include='number')
# display
print(student_num.head())
Note: In this tutorial, you can assume that we reload the dataset at the beginning of each section in the tutorial. That is, changes we make to the dataset will not carry on to the next section of the tutorial.
The apply() method is used to apply a function to every row or column of the DataFrame.
The syntax for the apply() method is as follows:
DataFrame.apply(func, axis=0)
- func: the function which we want to apply. It must accept one argument, which will be a Series (either a column or a row of the DataFrame, depending on the value of axis).
- axis: allows us to decide whether to apply the function over rows or columns. Here, 0 means column-by-column, and 1 means row-by-row.
When we use the apply() method, it calls func once for each row / column, and passes the corresponding Series object to func as an argument.
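To make the role of axis concrete, here is a minimal sketch on a tiny made-up DataFrame (the values and the helper name value_range are illustrative assumptions, not part of the tutorial's dataset). The same function receives a column when axis=0 and a row when axis=1.

```python
import pandas as pd

# Made-up marks for three students (illustrative values only)
toy = pd.DataFrame({
    "Physics": [70, 80, 90],
    "Chemistry": [60, 75, 95],
    "Maths": [50, 85, 100],
})

def value_range(series):
    # series is one column of toy when axis=0, and one row when axis=1
    return series.max() - series.min()

per_column = toy.apply(value_range, axis=0)  # one result per column
per_row = toy.apply(value_range, axis=1)     # one result per row

print(per_column)
print(per_row)
```

With axis=0 the result is indexed by the column labels, and with axis=1 it is indexed by the row labels.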
Let’s see some examples.
In this first example, we will use apply() to calculate the difference between the mean and the median of every column. For this example, we will be using the student_num DataFrame.
Let’s first define the function which will be called for each column.
def diff_func(arg):
    diff = np.mean(arg) - np.median(arg)
    return diff
Here, arg will be the Pandas Series object for a column in the DataFrame. Using the NumPy functions mean() and median(), we calculate the mean and median of the column and then return the difference.
Now, let’s use apply() to apply this function over all columns in the student_num DataFrame.
result = student_num.apply(diff_func, axis=0)
print(result)
As you can see, the result is a Series with the appropriate values.
Note: We do not put parentheses after the function name diff_func, since we do not want the function to execute immediately. We want to pass the function itself as a parameter, to be used by the apply() method.
Before we move on to the next topic, let’s learn a little about a concept in Python called lambda or anonymous functions.
A lambda allows us to define and use a function directly in one expression, instead of defining the function separately using def.
In general, the syntax for lambda functions is as follows:
lambda arguments: expression using arguments
This returns a function object. In particular, note that there’s no function name, and that we omit the return keyword. We’re only allowed to have one expression inside a lambda function.
For example, here’s a lambda function to find the cube of a number:
f = lambda x: x**3
print(f(5))
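For comparison, the short sketch below places a lambda next to the equivalent def statement, and shows that a lambda may take several arguments as long as its body stays a single expression. The names cube_lambda, cube_def and average are made up for illustration.

```python
# A lambda expression and the equivalent def statement
cube_lambda = lambda x: x**3

def cube_def(x):
    return x**3

# A lambda can take several arguments, but still only one expression
average = lambda a, b, c: (a + b + c) / 3

print(cube_lambda(5), cube_def(5))
print(average(70, 80, 90))
```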
So, we can rewrite the previous code as follows:
result = student_num.apply(lambda arg: np.mean(arg) - np.median(arg), axis=0)
print(result)
This syntax is convenient when the function we are passing to apply() is really short.
We can also directly use existing functions with apply(). For example, to calculate the mean of the values in each column, we can simply pass np.mean in the call to apply().
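As a minimal sketch of passing an existing function, the example below hands np.mean to apply() on a small made-up DataFrame (the values are illustrative only) and compares the result with the built-in mean() method. Recent Pandas versions may print a warning when a NumPy function is passed this way; the result is the same.

```python
import numpy as np
import pandas as pd

# Made-up marks (illustrative values only)
toy = pd.DataFrame({"Physics": [70, 80, 90], "Maths": [50, 85, 100]})

# Pass the function object itself: no parentheses after np.mean
col_means = toy.apply(np.mean, axis=0)

print(col_means)
print(toy.mean())  # the built-in method gives the same column means
```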
Now, let us apply a function over rows using apply(). We will calculate the average marks from the Physics, Chemistry and Maths columns for every row.
Let’s define our function:
def avg_func(arg):
    x = (arg['Physics'] + arg['Chemistry'] + arg['Maths']) / 3
    return x
Here, the Series that will be passed to avg_func as arg will be a row of the DataFrame. The column labels will be the index of this Series.
Now, we can apply() the function over all the rows.
avg = student.apply(avg_func, axis=1)
print(avg.head())
We can also store the results back in our DataFrame. Let’s try it:
student['Average'] = student.apply(avg_func, axis=1)
print(student.head())
Now, you may be wondering why we would use the apply() method when we could instead do these things using vectorized operations. For example, we could have done the last example without apply() as well:
student['Average'] = (student['Physics'] + student['Chemistry'] + student['Maths']) / 3
The advantage of apply() is that it is much more flexible, since we can write arbitrarily complicated code inside func. We will see some examples of more complicated functions being passed to apply() in the next couple of sections.
The main disadvantage of apply() is that it is not as fast as vectorized operations, which take advantage of the fact that Pandas DataFrames and Series are built on arrays. So the vectorized code above for calculating the average would be faster than doing the same thing using apply().
Hence, when Pandas or NumPy already provides vectorized operations to do what we want, we should use those operations. But if such operations are not available, or the code is less complicated using apply(), then we should use apply().
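As a rough check of this trade-off, the sketch below times both approaches on a made-up DataFrame of random marks (the data, its size, and the function names are illustrative assumptions, not part of the tutorial's dataset). Exact timings depend on the machine, so no numbers are claimed here. Both approaches should produce the same averages.

```python
import timeit

import numpy as np
import pandas as pd

# Random marks between 0 and 100 for 2000 hypothetical students
rng = np.random.default_rng(0)
marks = pd.DataFrame(
    rng.integers(0, 101, size=(2000, 3)),
    columns=["Physics", "Chemistry", "Maths"],
)

def with_apply():
    # Row-by-row: the lambda is called once per row
    return marks.apply(
        lambda row: (row["Physics"] + row["Chemistry"] + row["Maths"]) / 3,
        axis=1,
    )

def vectorized():
    # Whole-column arithmetic, no Python-level loop over rows
    return (marks["Physics"] + marks["Chemistry"] + marks["Maths"]) / 3

print("apply:     ", timeit.timeit(with_apply, number=3), "seconds")
print("vectorized:", timeit.timeit(vectorized, number=3), "seconds")
```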
Let’s do some slightly more complicated things using apply().
The student DataFrame contains the Physics, Chemistry and Maths grades for some students. However, they took different examinations: for students whose school Board is 'HSC', the subject exams had a maximum possible score of 200, whereas for all other students, the maximum possible score is 100.
So, it would be nice to divide the Maths (and Physics and Chemistry) marks by 2, but only if Board is 'HSC'.
Let’s define our function for dividing the Maths marks by 2 if Board is 'HSC', and then apply() the function:
# print first few rows before applying function
print(student.head(10))
print('')

# define func
def normalize_math(x):
    if x['Board'] == 'HSC':
        return x['Maths'] / 2
    else:
        return x['Maths']

# do apply and store results in student
student['Maths_normalized'] = student.apply(normalize_math, axis=1)

# print first few rows after applying function
print(student.head(10))
print('')
The Maths marks are all out of 100 now! Similarly, we can normalize the marks for Physics and Chemistry.
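One possible way to handle all three subjects without writing three separate functions is sketched below. It uses a small made-up DataFrame; the helper name normalize, the board label 'CBSE', and all marks are assumptions for illustration. The extra subject argument is forwarded through the args parameter of apply().

```python
import pandas as pd

# Made-up data: two 'HSC' students (marks out of 200) and one other student
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC"],
    "Physics": [150, 80, 120],
    "Chemistry": [160, 70, 100],
    "Maths": [180, 90, 140],
})

def normalize(row, subject):
    # Halve the marks only for rows whose Board is 'HSC'
    if row["Board"] == "HSC":
        return row[subject] / 2
    return row[subject]

for subject in ["Physics", "Chemistry", "Maths"]:
    # args passes extra positional arguments to normalize after the row
    toy[subject + "_normalized"] = toy.apply(normalize, axis=1, args=(subject,))

print(toy)
```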
The applymap() method is used to apply a function to every single element of the DataFrame.
The syntax is very simple and similar to that of the apply() method:
DataFrame.applymap(func)
Here, func is the function we want to apply.
This returns a DataFrame with transformed elements.
Let us take an example where we divide all the elements by 100 and then square them.
First, we will define a function foo(). Here, the arguments to the function are the individual values in the DataFrame.
def foo(arg):
    return np.square(arg/100)
Next, we use applymap() to execute the function:
# apply the new function
temp = student_num.applymap(foo)
# display top 5 results
print(temp.head())
Now, let us do the same thing with a lambda function. Instead of passing a named function as the argument, we will write the lambda function directly inside the call:
# apply function with lambda
temp = student_num.applymap(lambda x: np.square(x/100))
# display top 5 results
print(temp.head())
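A version note: Pandas 2.1 deprecated DataFrame.applymap() in favour of DataFrame.map(), which does the same element-wise job. The sketch below, on made-up values, picks whichever of the two the installed Pandas provides; the variable name elementwise is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Made-up marks (illustrative values only)
toy = pd.DataFrame({"Physics": [50, 100], "Maths": [80, 200]})

# Use DataFrame.map() where it exists (Pandas 2.1+), else fall back to applymap()
elementwise = toy.map if hasattr(toy, "map") else toy.applymap

# Divide every element by 100 and square it
squared = elementwise(lambda x: np.square(x / 100))
print(squared)
```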
Earlier, we used the apply() method to find the mean of each column in our DataFrame.
But columns like Board and Admn Yr suggest that students may belong to different groups or clusters, according to the values in these columns. Therefore, it is possible that aggregate statistics like the mean or median are different for each such group of students.
We will use the groupby() method to break up the dataset into different groups, and then calculate aggregate statistics using the Pandas DataFrame method mean().
We will use the following syntax:
DataFrame.groupby(column_name).mean()
The groupby() method will group the data according to the values in the column that is passed as the argument. Then, mean() will calculate the mean for each group separately.
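Note: In older versions of Pandas, DataFrame methods like mean() operated only on the columns with numeric data types by default. Since Pandas 2.0, non-numeric columns are no longer skipped automatically, and we need to pass numeric_only=True to get that behaviour.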
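A minimal sketch on made-up data: passing numeric_only=True makes the restriction to numeric columns explicit, which should behave the same way whichever Pandas version is installed (the default for this parameter changed in Pandas 2.0). The board label 'CBSE' and the marks are illustrative assumptions.

```python
import pandas as pd

# Made-up data with one string column and two numeric columns
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC"],
    "Physics": [150, 80, 120],
    "Maths": [180, 90, 140],
})

# Ask explicitly for numeric columns only, so the string column Board is skipped
means = toy.mean(numeric_only=True)
print(means)
```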
A good way to understand groupby() is to think of it as a three-step process of Split-Apply-Combine:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
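The three steps can be spelled out by hand, as in the sketch below on a small made-up DataFrame (the board label 'CBSE', the marks, and the variable names are illustrative assumptions). This is only meant to show what groupby() followed by mean() does conceptually, not how Pandas implements it internally.

```python
import pandas as pd

# Made-up data: two students per board
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC", "CBSE"],
    "Maths": [180, 90, 140, 70],
})

pieces = {}
# Split: iterating over a groupby yields (group label, sub-DataFrame) pairs
for board, group in toy.groupby("Board"):
    # Apply: compute the statistic on each group independently
    pieces[board] = group["Maths"].mean()

# Combine: gather the per-group results into one Series
manual = pd.Series(pieces, name="Maths")
print(manual)

# The one-line equivalent
print(toy.groupby("Board")["Maths"].mean())
```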
Let’s see an example of group by.
First we will look at the mean of all the columns as a whole. This will allow us to understand the difference in the aggregate statistics better.
Now, let us use groupby() on the student DataFrame and find the average values for the groups of students from different Boards.
# calculate the group mean
group_mean = student.groupby(["Board"]).mean()
# display result
print(group_mean)
We can see that the different groups of students based on Board have noticeably different means.
groupby() can be a powerful tool when applied appropriately! Similarly, we can also calculate other statistics like the median, standard deviation, etc. with groupby().
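As a sketch of those other statistics, the example below computes the median and standard deviation per group on made-up data, and uses agg() to collect several statistics for one column in a single table. The board label 'CBSE' and all marks are illustrative assumptions.

```python
import pandas as pd

# Made-up data: two students per board
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC", "CBSE"],
    "Physics": [150, 80, 120, 60],
    "Maths": [180, 90, 140, 70],
})

grouped = toy.groupby("Board")

print(grouped.median())  # median of each numeric column, per board
print(grouped.std())     # sample standard deviation, per board

# Several statistics for one column at once
summary = grouped["Maths"].agg(["mean", "median", "std"])
print(summary)
```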
- We can apply a function to every row / column of a DataFrame using apply().
- We can apply either pre-defined Pandas DataFrame statistical/mathematical functions or our own functions.
- Anonymous (lambda) functions can be used to apply a function inline, without defining it separately.
- Although apply() is slower than performing vectorized operations, it is more flexible.
- The applymap() method is used to apply a function to every single element of the DataFrame.
- The groupby() method is used to break up the dataset into different groups, after which we can apply functions to each group separately.