# Pandas: Modifying DataFrames

April 19, 2019

# Introduction

In the last couple of tutorials, we learned how to select various subsets of a DataFrame. In this tutorial, we will use these techniques to select a subset of the DataFrame and modify the selected data.

# Set-up

As in the previous tutorials, let’s start by importing the libraries and loading the dataset.

We will use the same dataset that we used in Indexing and Slicing and Criteria Based Selection tutorials.

```
import pandas as pd
import numpy as np
# load the dataset
load_commonlounge_dataset('adult_income_data_small')
adult_income = pd.read_csv("/tmp/adult_income_data_small.csv")
# just to check data loaded
print(adult_income.head())
```

Note: In this tutorial, you can assume that wereload the datasetat the beginning of each section in the tutorial. That is, changes we make to the dataset will not carry on to the next section of the tutorial.

# Modifying specific values in a DataFrame

In this section, we will see how to modify a single value in a DataFrame — that is, the value associated with a specific row and column.

As we saw in the earlier tutorials, we can access a DataFrame element by specifying its row and column using:

`loc[row, column]`

- If we know the row index and column label of the element`iloc[row, column]`

- If we know the integer position of the row and column.

Once we access the element we can assign a new value to it.

Let’s see an example.

We will update the element in the row with index `2`

and column label `education`

.

First, let us take a look at the row with index `2`

.

`print(adult_income.loc[2, :])`

As you can see, the initial value for the element in the row with index `2`

and column label `education`

is `'HS-grad'`

.

Now we will assign it the value `'College'`

:

```
adult_income.loc[2, 'education'] = 'College'
# print row index 2 to verify modification
print(adult_income.loc[2, :])
```

# Modifying values based on criteria (Boolean indexing)

We learned about Boolean indexing in the previous tutorial, which allows us to select rows or columns from the DataFrame satisfying some criteria*.*

In this section of the tutorial, we will learn how to modify data in all the rows which match our criteria.

- First, we select all the rows which match our criteria through Boolean indexing.
- We then modify the data in those rows in the manner we want.

Let’s look at an example.

Suppose we want to change the value for `education`

column to `'Ph.D.'`

for every row where the `edu_yrs`

column has a value greater than or equal to 16.

Let us first take a glimpse of all the rows where `edu_yrs`

is greater than or equal to 16.

`print(adult_income.loc[adult_income["edu_yrs"] >= 16, :])`

We also want to ensure that only the *selected rows* get modified. So, let’s also take a look at the first few rows of the data.

`print(adult_income.head())`

Now, we’ll select these rows again and modify the `education`

column by assigning it the value `'Ph.D.'`

.

```
adult_income.loc[adult_income["edu_yrs"] >= 16, 'education'] = 'Ph.D.'
# Print the rows to verify the modification
print(adult_income.loc[adult_income["edu_yrs"] >= 16, :])
print('')
# Also make sure only the *selected rows* got modified
print(adult_income.head())
```

All the rows which match the criteria got modified, but the rest of the rows remain unchanged. For example, in the `head()`

output, we can see that only the 5th row got modified (which has `edu_yrs >= 16`

), the rest of the rows remain as is.

Great!

# Modifying columns

In the previous sections, we learned about modifying individual elements in a DataFrame, or selecting and modifying elements from a subset of rows.

In this section, we will learn how to modify all the elements in a single column and combining multiple columns to create a new column.

# Modifying column datatype

Before we modify the data itself, let us see how to change the datatype (`dtype`

) of columns in our DataFrame.

Let’s start by taking a look at the dtypes of different columns in our DataFrame.

`adult_income.info()`

As you can see, there are three columns with `object`

`dtype`

and three columns with integer (`int64`

) `dtype`

.

Although the `read_csv()`

function assigns columns like `sex`

and `income`

as `object`

`dtype`

, for categorical variables a more suitable `dtype`

called `category`

exists in Pandas. Changing to `category`

`dtype`

for these columns saves space, and also allows us to use other helpful functions to find number of unique values, etc.

We will use the function `astype()`

, to change the `dtype`

of the column. The syntax is as follows:

`DataFrame["column_name"].astype(dtype)`

Here the `dtype`

argument specifies the datatype to which we want to convert. This function returns the column with the changed `dtype`

.

Let’s convert the `dype`

of the `income`

column from `object`

to `category`

.

```
adult_income["income"] = adult_income["income"].astype('category')
# check the datatypes
print("Check the dtypes after converting")
adult_income.info()
```

The `dtype`

has changed!

You should also try updating the `dtype`

for the other two categorical columns, `education`

and `sex`

. Or you can try changing the `dtype`

of `capital_gain`

to `float64`

.

# Creating new columns / Modifying column values

In this section, we will see how to create new columns in a DataFrame (or update *all* the values in a column).

Here’s the syntax:

`DataFrame["new_column_name"] = col_values`

The `col_values`

can be:

- an integer - the whole column is filled with the same integer,
- an array or a Series - the length should be same as number of rows in the Dataframe

To see examples of this, we need to also learn about **element-wise** **operations** in Pandas.

## Operations on columns

In Pandas, we can perform binary operations such as addition (`+`

), subtraction (`-`

), multiplication (`*`

) and division (`/`

) between a column and a scalar.

For example, let’s say that for our data, we wanted to calculate the number of hours worked per *day* (from hours worked per week). Assuming people only work on weekdays, all we need to do is divide all the values by 5.

So, we can do: `adult_income["hours_per_week"] / 5`

. **Every element** of the column will get divided by 5.

Let’s perform this operation and put the results in a new column in the `adult_income`

DataFrame.

```
adult_income["hours_per_day"] = adult_income["hours_per_week"] / 5
print(adult_income.head())
```

Awesome! We added a new column to our DataFrame, in this case derived from values in a different column.

Note that after performing the operations, we don’t need to necessarily put the results in the same DataFrame.

For example, in the code below we are just storing the Series as a separate variable.

```
hours_per_day = adult_income["hours_per_week"] / 5
print(hours_per_day.head())
```

NumPy functions such as `np.exp()`

or `np.log()`

which work on `ndarrays`

, can also be used with Pandas Series (and DataFrames).

For example, let us carry out the NumPy `np.exp()`

function on the `edu_yrs`

column and see the result:

```
data = np.exp(adult_income["edu_yrs"])
print(data.head())
```

## Operations involving multiple columns

Now, let us see operations involving multiple columns.

The binary operations such as addition (`+`

), subtraction (`-`

), multiplication (`*`

), division (`/`

), etc can also be performed on two columns (and even two DataFrames). The operation takes place element-wise along the columns.

Let us create a new column combining two columns - `hours_per_week`

and `edu_yrs`

. Every row of this column will contain the product of `hours_per_week`

times `edu_yrs`

from that row.

```
adult_income["product"] = adult_income["hours_per_week"] * adult_income["edu_yrs"]
print(adult_income.head())
```

Let’s do another example. This time, let’s combine two categorical features. Let’s say we care about the combination feature (`education`

, `income`

).

```
adult_income["edu_with_income"] = adult_income["education"] + "-" + adult_income["income"]
print(adult_income.head())
```

# Summary

- We can modify subsets of the DataFrame by accessing the element with
`[]`

,`loc`

or`iloc`

, and then assigning the new value. This includes modifying a single value or a subset of rows and columns. - We can selectively modify values from DataFrame using Boolean indexing. First we select the rows which match a criterion, and then assign new values to elements.
- We can modify the
`dtype`

of a column with`astype()`

- We can carry out
**element-wise operations.**This includes operations between a column and a scalar, operations between two columns or applying a function element wise to the entire column. - We can create new columns using the same
`[]`

syntax used to select a column.