Pandas: Modifying DataFrames

April 19, 2019

Introduction

In the last couple of tutorials, we learned how to select various subsets of a DataFrame. In this tutorial, we will use these techniques to select a subset of the DataFrame and modify the selected data.

Set-up

As in the previous tutorials, let’s start by importing the libraries and loading the dataset.

We will use the same dataset that we used in Indexing and Slicing and Criteria Based Selection tutorials.

import pandas as pd
import numpy as np
# load the dataset
load_commonlounge_dataset('adult_income_data_small')
adult_income = pd.read_csv("/tmp/adult_income_data_small.csv")
# just to check data loaded
print(adult_income.head())

Note: In this tutorial, you can assume that we reload the dataset at the beginning of each section in the tutorial. That is, changes we make to the dataset will not carry on to the next section of the tutorial.

Modifying specific values in a DataFrame

In this section, we will see how to modify a single value in a DataFrame — that is, the value associated with a specific row and column.

As we saw in the earlier tutorials, we can access a DataFrame element by specifying its row and column using:

loc[row, column] - If we know the row index and column label of the element
iloc[row, column] - If we know the integer position of the row and column.

Once we access the element we can assign a new value to it.

Let’s see an example.

We will update the element in the row with index 2 and column label education.

First, let us take a look at the row with index 2.

print(adult_income.loc[2, :])

As you can see, the initial value for the element in the row with index 2 and column label education is 'HS-grad'.

Now we will assign it the value 'College':

adult_income.loc[2, 'education'] = 'College'
# print row index 2 to verify modification
print(adult_income.loc[2, :])

Modifying values based on criteria (Boolean indexing)

We learned about Boolean indexing in the previous tutorial, which allows us to select rows or columns from the DataFrame satisfying some criteria.

In this section of the tutorial, we will learn how to modify data in all the rows which match our criteria.

First, we select all the rows which match our criteria through Boolean indexing.
We then modify the data in those rows in the manner we want.

Let’s look at an example.

Suppose we want to change the value for education column to 'Ph.D.' for every row where the edu_yrs column has a value greater than or equal to 16.

Let us first take a glimpse of all the rows where edu_yrs is greater than or equal to 16.

print(adult_income.loc[adult_income["edu_yrs"] >= 16, :])

We also want to ensure that only the selected rows get modified. So, let’s also take a look at the first few rows of the data.

print(adult_income.head())

Now, we’ll select these rows again and modify the education column by assigning it the value 'Ph.D.'.

adult_income.loc[adult_income["edu_yrs"] >= 16, 'education'] = 'Ph.D.'
# Print the rows to verify the modification
print(adult_income.loc[adult_income["edu_yrs"] >= 16, :])
print('')
# Also make sure only the *selected rows* got modified
print(adult_income.head())

All the rows which match the criteria got modified, but the rest of the rows remain unchanged. For example, in the head() output, we can see that only the 5th row got modified (which has edu_yrs >= 16), the rest of the rows remain as is.

Great!

Modifying columns

In the previous sections, we learned about modifying individual elements in a DataFrame, or selecting and modifying elements from a subset of rows.

In this section, we will learn how to modify all the elements in a single column and combining multiple columns to create a new column.

Modifying column datatype

Before we modify the data itself, let us see how to change the datatype (dtype) of columns in our DataFrame.

Let’s start by taking a look at the dtypes of different columns in our DataFrame.

adult_income.info()

As you can see, there are three columns with object dtype and three columns with integer (int64) dtype.

Although the read_csv() function assigns columns like sex and income as object dtype, for categorical variables a more suitable dtype called category exists in Pandas. Changing to category dtype for these columns saves space, and also allows us to use other helpful functions to find number of unique values, etc.

We will use the function astype(), to change the dtype of the column. The syntax is as follows:

DataFrame["column_name"].astype(dtype)

Here the dtype argument specifies the datatype to which we want to convert. This function returns the column with the changed dtype.

Let’s convert the dype of the income column from object to category.

adult_income["income"] = adult_income["income"].astype('category')
# check the datatypes
print("Check the dtypes after converting")
adult_income.info()

The dtype has changed!

You should also try updating the dtype for the other two categorical columns, education and sex. Or you can try changing the dtype of capital_gain to float64.

Creating new columns / Modifying column values

In this section, we will see how to create new columns in a DataFrame (or update all the values in a column).

Here’s the syntax:

DataFrame["new_column_name"] = col_values

The col_values can be:

an integer - the whole column is filled with the same integer,
an array or a Series - the length should be same as number of rows in the Dataframe

To see examples of this, we need to also learn about element-wise operations in Pandas.

Operations on columns

In Pandas, we can perform binary operations such as addition (+), subtraction (-), multiplication (*) and division (/) between a column and a scalar.

For example, let’s say that for our data, we wanted to calculate the number of hours worked per day (from hours worked per week). Assuming people only work on weekdays, all we need to do is divide all the values by 5.

So, we can do: adult_income["hours_per_week"] / 5. Every element of the column will get divided by 5.

Let’s perform this operation and put the results in a new column in the adult_income DataFrame.

adult_income["hours_per_day"] = adult_income["hours_per_week"] / 5
print(adult_income.head())

Awesome! We added a new column to our DataFrame, in this case derived from values in a different column.

Note that after performing the operations, we don’t need to necessarily put the results in the same DataFrame.

For example, in the code below we are just storing the Series as a separate variable.

hours_per_day = adult_income["hours_per_week"] / 5
print(hours_per_day.head())

NumPy functions such as np.exp() or np.log() which work on ndarrays, can also be used with Pandas Series (and DataFrames).

For example, let us carry out the NumPy np.exp() function on the edu_yrs column and see the result:

data = np.exp(adult_income["edu_yrs"])
print(data.head())

Operations involving multiple columns

Now, let us see operations involving multiple columns.

The binary operations such as addition (+), subtraction (-), multiplication (*), division (/), etc can also be performed on two columns (and even two DataFrames). The operation takes place element-wise along the columns.

Let us create a new column combining two columns - hours_per_week and edu_yrs. Every row of this column will contain the product of hours_per_week times edu_yrs from that row.

adult_income["product"] = adult_income["hours_per_week"] * adult_income["edu_yrs"]
print(adult_income.head())

Let’s do another example. This time, let’s combine two categorical features. Let’s say we care about the combination feature (education, income).

adult_income["edu_with_income"] = adult_income["education"]  + "-" + adult_income["income"]
print(adult_income.head())

Summary

We can modify subsets of the DataFrame by accessing the element with [], loc or iloc, and then assigning the new value. This includes modifying a single value or a subset of rows and columns.
We can selectively modify values from DataFrame using Boolean indexing. First we select the rows which match a criterion, and then assign new values to elements.
We can modify the dtype of a column with astype()
We can carry out element-wise operations. This includes operations between a column and a scalar, operations between two columns or applying a function element wise to the entire column.
We can create new columns using the same [] syntax used to select a column.