# Pandas: Apply functions and GroupBy

April 19, 2019

# Introduction

So far in the course, we have learnt quite a bit about DataFrames. In particular, we learnt about using various boolean and arithmetic operations on DataFrame columns, and also about indexing to select and modify various subsets of a DataFrame.

In this tutorial we will learn another method for doing operations on and also modifying a DataFrame, using DataFrame methods like `apply()` and `applymap()`. These methods allow us to *apply* a function over an entire DataFrame.

Let’s get started!

# Set up

As in the previous tutorials, let us load the Pandas and Numpy libraries at the beginning.

```
import pandas as pd
import numpy as np
```

# The `student` dataset

Let us load the dataset for this tutorial. We will use the `read_csv()` function for this. The dataset has 8 columns, but we will only keep 5 of them for this tutorial.

```
load_commonlounge_dataset('student_v3')
student = pd.read_csv("/tmp/student_v3.csv")
student = student[['Admn Yr', 'Board', 'Physics', 'Chemistry', 'Maths']]
```

Let us look at the first five rows of the data using the `head()` method.

`print(student.head())`

# Brief description of the dataset

This dataset contains information about students from an Engineering college. Here’s a brief description of the columns in the dataset:

- `Admn Yr`: The year in which the student was admitted into the college (numerical)
- `Board`: Board under which the student studied in High School (categorical)
- `Physics`: Marks secured in Physics in the final High School exam (numerical)
- `Chemistry`: Marks secured in Chemistry in the final High School exam (numerical)
- `Maths`: Marks secured in Maths in the final High School exam (numerical)

# Numbers only dataset

For some parts of this tutorial, we will also need a DataFrame which only has numerical values. So, let’s also create a modified DataFrame with the numerical features from `student`. We will call this `student_num`:

```
# extract the numerical features
student_num = student.select_dtypes(include='number')
# display
print(student_num.head())
```

Note: In this tutorial, you can assume that we reload the dataset at the beginning of each section. That is, changes we make to the dataset will not carry over to the next section of the tutorial.

# The `apply()` method

The `apply()` method is used to **apply** a *function* to every row or column of the DataFrame.

The syntax for the `apply()` method is as follows:

`DataFrame.apply(func, axis=0)`

where,

- `axis`: allows us to decide whether to apply the function over rows or columns. Here, `0` means column-by-column, and `1` means row-by-row.
- `func`: is the function which we want to apply. It must accept one argument, which will be a Series (either a column or a row of the DataFrame, depending on the value of `axis`).

When we use the `apply()` method, it calls the `func` function once for each row / column, and passes the Series object to `func` as an argument.
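To make this concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` data and the `describe_input` helper are illustrative assumptions, not part of the `student` dataset. The helper simply reports the name and index labels of whatever Series it receives.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def describe_input(series):
    # Report which Series apply() passed in, and how it is indexed
    return f"name={series.name}, index={list(series.index)}"

# axis=0: func is called once per column; each Series is indexed by row labels
per_column = toy.apply(describe_input, axis=0)

# axis=1: func is called once per row; each Series is indexed by column labels
per_row = toy.apply(describe_input, axis=1)

print(per_column)
print(per_row)
```

With `axis=0` the Series carries the column name and the row labels, while with `axis=1` it carries the row label and the column names.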

Let’s see some examples.

## `apply()` over columns

In this first example, we will use `apply()` to calculate the difference between the mean and the median of every column. For this example, we will be using the `student_num` DataFrame.

Let’s first define the function which will be called for each column.

```
def diff_func(arg):
    # arg is one column of the DataFrame, passed in as a Series
    diff = np.mean(arg) - np.median(arg)
    return diff
```

Here, `arg` will be the Pandas Series object for a column in the DataFrame. Using the NumPy functions `mean()` and `median()`, we will calculate the mean and median of the column and then return the difference.

Now, let’s use `apply()` to apply this function over all columns in the `student_num` DataFrame:

```
result = student_num.apply(diff_func, axis=0)
print(result)
```

As you can see, the result is a Series with the appropriate values.

Note: We do not put parentheses after the function name `diff_func`, since we do not want the function to execute immediately. We want to pass the function as a parameter, to be used by the `apply()` method.

## Anonymous functions — `lambda`

Before we move on to the next topic, let’s learn a little about a concept in Python called `lambda` or anonymous functions.

It allows us to define and use a function directly in one expression instead of defining the function separately using `def` first.

In general, the syntax for `lambda` functions is as follows:

`lambda arguments: expression using arguments`

This returns a function object. In particular, note that there’s no function name, and that we can omit the `return` keyword. We’re only allowed to have one expression inside a `lambda` function.

For example, here’s a `lambda` function to find the cube of a number:

```
f = lambda x: x**3
print(f(5))
```

So, we can rewrite the previous code as follows:

```
result = student_num.apply(lambda arg: np.mean(arg) - np.median(arg), axis=0)
print(result)
```

This syntax is convenient when the function we are passing to `apply()` is really short.
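As a small sanity check, the sketch below shows that a `def` function and the corresponding `lambda` give the same result on a made-up Series (the `scores` values are hypothetical). It also includes a two-argument `lambda`, named `weighted` here purely for illustration, to show that several arguments are allowed as long as the body is a single expression.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
scores = pd.Series([50, 60, 100])

# Named function, as defined with def
def diff_func(arg):
    return np.mean(arg) - np.median(arg)

# The same logic written as a lambda expression
diff_lambda = lambda arg: np.mean(arg) - np.median(arg)

# A lambda may take several arguments, but still only one expression
weighted = lambda a, b: 0.25 * a + 0.75 * b

print(diff_func(scores), diff_lambda(scores))
print(weighted(10, 20))
```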

## Pre-defined functions

We can also directly use existing functions with `apply()`.

For example, to calculate the mean of the values in each column, we can simply pass `np.mean` to the `apply()` method:

`print(student_num.apply(np.mean, axis=0))`

## `apply()` over rows

Now, let us apply a function over rows using `axis=1`.

We will be calculating the average marks from the `Physics`, `Chemistry` and `Maths` columns for every row.

Let’s define our function:

```
def avg_func(arg):
    # arg is one row of the DataFrame, passed in as a Series
    x = (arg['Physics'] + arg['Chemistry'] + arg['Maths']) / 3
    return x
```

Here, the Series passed to `arg` will be a row of the DataFrame. The column labels will be the index of this Series.

Now, we can `apply()` the function over all the rows.

```
avg = student.apply(avg_func, axis=1)
print(avg.head())
```

We can also store the results back in our DataFrame. Let’s try it:

```
student['Average'] = student.apply(avg_func, axis=1)
print(student.head())
```

Awesome!

## apply() vs vectorized operations

Now, you may be wondering why we would use the `apply()` method when we could instead do these things using vectorized operations. For example, we could have done the last example without `apply()` as well:

`student['Average'] = (student['Physics'] + student['Chemistry'] + student['Maths']) / 3`

The advantage of `apply()` is that it is much more flexible, since we can write arbitrarily complicated code inside `func`. We will see some examples of more complicated functions being passed to `apply()` in the next couple of sections.

The main disadvantage of `apply()` is that it is not as fast as vectorized operations, which take advantage of the fact that Pandas DataFrames and Series are built on arrays. So the vectorized code above for calculating the average would be faster than doing the same thing using `apply()`.

Hence, when Pandas or NumPy already provides vectorized operations to do what we want, we should use those operations. But if such operations are not available, or the code is simpler using `apply()`, then we should use `apply()`.
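To get a rough feel for the speed difference, here is a small timing sketch. It uses randomly generated marks (an assumption standing in for the real dataset), and the helper names `avg_apply` and `avg_vectorized` are hypothetical. Exact timings will vary by machine and Pandas version, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random marks, standing in for the student dataset
rng = np.random.default_rng(0)
marks = pd.DataFrame(
    rng.integers(0, 101, size=(2000, 3)),
    columns=["Physics", "Chemistry", "Maths"],
)

def avg_apply():
    # Row-by-row average using apply()
    return marks.apply(
        lambda row: (row["Physics"] + row["Chemistry"] + row["Maths"]) / 3,
        axis=1,
    )

def avg_vectorized():
    # Same average using whole-column arithmetic
    return (marks["Physics"] + marks["Chemistry"] + marks["Maths"]) / 3

# Time three runs of each approach
t_apply = timeit.timeit(avg_apply, number=3)
t_vectorized = timeit.timeit(avg_vectorized, number=3)

print(f"apply():    {t_apply:.4f} s")
print(f"vectorized: {t_vectorized:.4f} s")
```

Both functions return the same values; only the time taken differs.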

Let’s do some slightly more complicated things using `apply()`.

## `apply()` function if-else example

Our `student` DataFrame contains the `Maths`, `Physics` and `Chemistry` grades for some students. However, the students took different examinations: for students whose school `Board` was `'HSC'`, the subject exams had a maximum possible score of 200, whereas for all other students the maximum possible score was 100.

So, it would be nice to divide the `Maths` (and `Physics` and `Chemistry`) marks by 2, but **only if** `Board` is `'HSC'`.

Let’s define our function for dividing the `Maths` marks by 2 if `Board` is `'HSC'`, and then use `apply()` to apply the function:

```
# print first few rows before applying function
print(student.head(10))
print('')
# define func
def normalize_math(x):
    if x['Board'] == 'HSC':
        return x['Maths'] / 2
    else:
        return x['Maths']
# do apply and store results in student
student['Maths_normalized'] = student.apply(normalize_math, axis=1)
# print first few rows after applying function
print(student.head(10))
print('')
```

All the marks in the `Maths_normalized` column are out of 100 now! Similarly, we can normalize the marks for `Physics` and `Chemistry`.
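One possible way to normalize all three subjects in a single pass is sketched below. It runs on a small made-up DataFrame (the `toy` values are hypothetical), and the `normalize_row` helper is an illustrative name, not something defined earlier in the tutorial. The function returns a Series per row, so `apply()` assembles the results into a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the student dataset
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC"],
    "Physics": [180, 90, 150],
    "Chemistry": [160, 85, 140],
    "Maths": [190, 95, 120],
})

subjects = ["Physics", "Chemistry", "Maths"]

def normalize_row(row):
    # Halve the marks only for rows whose Board is 'HSC'
    factor = 2 if row["Board"] == "HSC" else 1
    return row[subjects] / factor

# Each call returns a Series, so the combined result is a DataFrame
normalized = toy.apply(normalize_row, axis=1).astype(float)
print(normalized)
```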

# The `applymap()` method

The `applymap()` method is used to **apply** a function on every single element of the DataFrame.

The syntax is very simple and similar to the `apply()` method:

`DataFrame.applymap(func)`

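where `func` is the function we want to apply.

This returns a DataFrame with transformed elements.

Note: In pandas 2.1 and later, `applymap()` is deprecated in favor of the equivalent `DataFrame.map()` method.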

Let us take an example where we divide all the elements by 100 and then square them.

First, we will define a function `foo()`. Here, the argument to the function is an individual value from the DataFrame.

```
def foo(arg):
    return np.square(arg/100)
```

Next, we use `applymap()` to execute the function:

```
# apply the new function
temp = student_num.applymap(foo)
# display top 5 results
print(temp.head())
```

Now, let us try and do the same thing with a `lambda` function. Instead of passing a named function as the argument, we will directly write the `lambda` expression.

```
# apply function with lambda
temp = student_num.applymap(lambda x: np.square(x/100))
# display top 5 results
print(temp.head())
```
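The element-wise idea can be checked on a tiny made-up DataFrame (the `toy` values are hypothetical). As a hedge against version differences, this sketch uses `DataFrame.map()` when it exists (pandas 2.1 and later, where `applymap()` is deprecated) and falls back to `applymap()` otherwise. It also compares the result with the equivalent vectorized expression.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Physics": [50, 100], "Maths": [80, 20]})

def foo(arg):
    # arg is a single element of the DataFrame
    return np.square(arg / 100)

# Pick the element-wise method available in the installed Pandas version
elementwise = toy.map if hasattr(pd.DataFrame, "map") else toy.applymap
by_element = elementwise(foo)

# The same transformation written as a vectorized expression
vectorized = (toy / 100) ** 2

print(by_element)
print(vectorized)
```

For simple arithmetic like this, the vectorized form is usually preferable; the element-wise method is more useful when `func` has no vectorized equivalent.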

# The `groupby()` method

Earlier, we used the `apply()` method to find the mean of each column in the `student_num` DataFrame.

But column labels like `Board` or `Admn Yr` suggest that students may belong to different groups or clusters, according to the values in these columns. Therefore, it is possible that aggregate statistics like the mean or median are different for each such group of students.

We will use the `groupby()` method to break up the dataset into different groups, and then calculate aggregate statistics using the Pandas DataFrame method `mean()`.

We will use the following syntax:

`DataFrame.groupby(["column"]).mean()`

Here, the `groupby()` method will group the data according to the values in the column that is passed as the argument. Then `mean()` will calculate the mean for each group separately.

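Note: In older versions of Pandas, DataFrame methods like `mean()` operated only on the columns with a numeric `dtype` by default. From pandas 2.0 onwards, pass `numeric_only=True` to skip non-numeric columns explicitly.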

A good way to understand `groupby()` is to think of it as a **three-step** process of **Split-Apply-Combine**:

- **Splitting** the data into groups based on some criteria.
- **Applying** a function to each group independently.
- **Combining** the results into a data structure.
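The three steps can be spelled out by hand, as in the sketch below. It uses a small made-up DataFrame (the `toy` values are hypothetical) and is meant only to illustrate the idea, not to describe how Pandas implements `groupby()` internally.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC", "CBSE"],
    "Maths": [180, 90, 160, 70],
})

# Split: iterating over a groupby yields (group label, sub-DataFrame) pairs
# Apply: compute the mean of Maths within each sub-DataFrame
pieces = {board: group["Maths"].mean() for board, group in toy.groupby("Board")}

# Combine: collect the per-group results into one Series
manual = pd.Series(pieces, name="Maths")

# The built-in one-liner performs the same three steps for us
built_in = toy.groupby("Board")["Maths"].mean()

print(manual)
print(built_in)
```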

Let’s see an example of group by.

First we will look at the mean of all the columns as a whole. This will allow us to understand the difference in the aggregate statistics better.

`print(student.mean(numeric_only=True))`

Now, let us use `groupby()` on the `student` DataFrame and find the average values for groups of students from different `Board`s.

```
# calculate the group mean
group_mean = student.groupby(["Board"]).mean()
# display result
print(group_mean)
```

We can see that the different groups of students based on `Board` have significantly different means for the `Physics`, `Chemistry` and `Maths` columns.

`groupby()` can be a powerful tool when applied appropriately! Similarly, we can also calculate other statistics like median, standard deviation, etc. with `groupby()`.
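As a brief sketch of those other statistics, the example below computes the median and standard deviation per group, and then several statistics at once with `agg()`. The `toy` DataFrame is made up for illustration and only mimics the column names of the `student` dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Board": ["HSC", "CBSE", "HSC", "CBSE"],
    "Physics": [150, 80, 170, 60],
    "Maths": [180, 90, 160, 70],
})

grouped = toy.groupby("Board")

# One statistic at a time, for every remaining column
medians = grouped.median()
stds = grouped.std()

# Several statistics at once for a single column
summary = grouped["Maths"].agg(["mean", "median", "std"])

print(medians)
print(stds)
print(summary)
```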

# Summary

- We can apply a function to every row / column of a DataFrame using the `apply()` method.
- We can apply either pre-defined statistical/mathematical functions or functions we define ourselves.
- Anonymous `lambda` functions can be used to pass a function inline, without defining it separately.
- Although `apply()` is slower than performing vectorized operations, it is more flexible.
- The `applymap()` method is used to apply a function on every single element of the DataFrame.
- The `groupby()` method is used to break up the dataset into different groups, after which we can apply functions on each group separately.

# Reference

- `apply()`: `DataFrame.apply(func, axis=0)`
- `lambda` functions: `lambda arguments: expression using arguments`
- `applymap()`: `DataFrame.applymap(func)`
- `groupby()`: `DataFrame.groupby(["Column label"]).function()`