In the last couple of tutorials, we learned how to select various subsets of a DataFrame. In this tutorial, we will use these techniques to select a subset of the DataFrame and modify the selected data.
As in the previous tutorials, let’s start by importing the libraries and loading the dataset.
import pandas as pd import numpy as np # load the dataset load_commonlounge_dataset('adult_income_data_small') adult_income = pd.read_csv("/tmp/adult_income_data_small.csv") # just to check data loaded print(adult_income.head())
Note: In this tutorial, you can assume that we reload the dataset at the beginning of each section in the tutorial. That is, changes we make to the dataset will not carry on to the next section of the tutorial.
In this section, we will see how to modify a single value in a DataFrame — that is, the value associated with a specific row and column.
As we saw in the earlier tutorials, we can access a DataFrame element by specifying its row and column using:
loc[row, column]- If we know the row index and column label of the element
iloc[row, column]- If we know the integer position of the row and column.
Once we access the element we can assign a new value to it.
Let’s see an example.
We will update the element in the row with index
2 and column label
First, let us take a look at the row with index
As you can see, the initial value for the element in the row with index
2 and column label
Now we will assign it the value
adult_income.loc[2, 'education'] = 'College' # print row index 2 to verify modification print(adult_income.loc[2, :])
We learned about Boolean indexing in the previous tutorial, which allows us to select rows or columns from the DataFrame satisfying some criteria.
In this section of the tutorial, we will learn how to modify data in all the rows which match our criteria.
- First, we select all the rows which match our criteria through Boolean indexing.
- We then modify the data in those rows in the manner we want.
Let’s look at an example.
Suppose we want to change the value for
education column to
'Ph.D.' for every row where the
edu_yrs column has a value greater than or equal to 16.
Let us first take a glimpse of all the rows where
edu_yrs is greater than or equal to 16.
print(adult_income.loc[adult_income["edu_yrs"] >= 16, :])
We also want to ensure that only the selected rows get modified. So, let’s also take a look at the first few rows of the data.
Now, we’ll select these rows again and modify the
education column by assigning it the value
adult_income.loc[adult_income["edu_yrs"] >= 16, 'education'] = 'Ph.D.' # Print the rows to verify the modification print(adult_income.loc[adult_income["edu_yrs"] >= 16, :]) print('') # Also make sure only the *selected rows* got modified print(adult_income.head())
All the rows which match the criteria got modified, but the rest of the rows remain unchanged. For example, in the
head() output, we can see that only the 5th row got modified (which has
edu_yrs >= 16), the rest of the rows remain as is.
In the previous sections, we learned about modifying individual elements in a DataFrame, or selecting and modifying elements from a subset of rows.
In this section, we will learn how to modify all the elements in a single column and combining multiple columns to create a new column.
Before we modify the data itself, let us see how to change the datatype (
dtype) of columns in our DataFrame.
Let’s start by taking a look at the dtypes of different columns in our DataFrame.
As you can see, there are three columns with
dtype and three columns with integer (
read_csv() function assigns columns like
dtype, for categorical variables a more suitable
category exists in Pandas. Changing to
dtype for these columns saves space, and also allows us to use other helpful functions to find number of unique values, etc.
We will use the function
astype(), to change the
dtype of the column. The syntax is as follows:
dtype argument specifies the datatype to which we want to convert. This function returns the column with the changed
Let’s convert the
dype of the
income column from
adult_income["income"] = adult_income["income"].astype('category') # check the datatypes print("Check the dtypes after converting") adult_income.info()
dtype has changed!
You should also try updating the
dtype for the other two categorical columns,
sex. Or you can try changing the
In this section, we will see how to create new columns in a DataFrame (or update all the values in a column).
Here’s the syntax:
DataFrame["new_column_name"] = col_values
col_values can be:
- an integer - the whole column is filled with the same integer,
- an array or a Series - the length should be same as number of rows in the Dataframe
To see examples of this, we need to also learn about element-wise operations in Pandas.
In Pandas, we can perform binary operations such as addition (
+), subtraction (
-), multiplication (
*) and division (
/) between a column and a scalar.
For example, let’s say that for our data, we wanted to calculate the number of hours worked per day (from hours worked per week). Assuming people only work on weekdays, all we need to do is divide all the values by 5.
So, we can do:
adult_income["hours_per_week"] / 5. Every element of the column will get divided by 5.
Let’s perform this operation and put the results in a new column in the
adult_income["hours_per_day"] = adult_income["hours_per_week"] / 5 print(adult_income.head())
Awesome! We added a new column to our DataFrame, in this case derived from values in a different column.
Note that after performing the operations, we don’t need to necessarily put the results in the same DataFrame.
For example, in the code below we are just storing the Series as a separate variable.
hours_per_day = adult_income["hours_per_week"] / 5 print(hours_per_day.head())
NumPy functions such as
np.log() which work on
ndarrays, can also be used with Pandas Series (and DataFrames).
For example, let us carry out the NumPy
np.exp() function on the
edu_yrs column and see the result:
data = np.exp(adult_income["edu_yrs"]) print(data.head())
Now, let us see operations involving multiple columns.
The binary operations such as addition (
+), subtraction (
-), multiplication (
*), division (
/), etc can also be performed on two columns (and even two DataFrames). The operation takes place element-wise along the columns.
Let us create a new column combining two columns -
edu_yrs. Every row of this column will contain the product of
edu_yrs from that row.
adult_income["product"] = adult_income["hours_per_week"] * adult_income["edu_yrs"] print(adult_income.head())
Let’s do another example. This time, let’s combine two categorical features. Let’s say we care about the combination feature (
adult_income["edu_with_income"] = adult_income["education"] + "-" + adult_income["income"] print(adult_income.head())
- We can modify subsets of the DataFrame by accessing the element with
iloc, and then assigning the new value. This includes modifying a single value or a subset of rows and columns.
- We can selectively modify values from DataFrame using Boolean indexing. First we select the rows which match a criterion, and then assign new values to elements.
- We can modify the
dtypeof a column with
- We can carry out element-wise operations. This includes operations between a column and a scalar, operations between two columns or applying a function element wise to the entire column.
- We can create new columns using the same
syntax used to select a column.