Pandas: DataFrame Merge and Concatenation

April 18, 2019

Introduction

Once we start working with actual datasets, it might be very rare to find all the data in one place in the same file organised neatly. From time to time we will need to extract data from different sources, combine them in suitable form and only then can we proceed to do some work with the data.

In this tutorial we will learn about two operations — merge and concatenate — that can help us with combining DataFrames in different ways.

Merge

There are various ways in which DataFrames can be merged, depending on our needs. Let us take an example.

Suppose we have two Datasets: Left-Dataset and Right-Dataset as shown below.

The Left-Dataset contains the subject score for “Physics” and “Chemistry”, while the Right-Dataset contains the subject score for “Maths” for a group of students.

Each student is known by their unique “Student_ID” number.

We can see that students with ID 2,4 and 5 only have score for Physics and Chemistry, while students with ID 10, 11 and 13 have only Maths scores. Students with ID 6, 7 and 8 have all three subject scores.

In order to analyse data of all the students across the subjects, we would merge these two datasets together to form a single dataset with “Physics”, “Chemistry” and “Maths” marks of all the students.

Now, when we decide to merge the DataFrames, we need to figure out what to do about discrepancy that arises due to all the scores not being available for all the students.

If we only need the data of the students who have all the three scores, then we could merge it in a way that only keeps the data of students who are common in both the DataFrames. It would look like below:

But maybe “Maths” was an optional paper for all students, and we are interested in the marks of all the students who have “Physics” and “Chemistry” scores, with the “Maths” scores if available. The resulting DataFrame would look like below:

Note that if the “Maths” scores are not available, it says NaN (stands for Not a Number). We could substitute it with ‘Not Available’, ’Not Opted’, or any other appropriate special value.

Types of Merge

What we saw above were examples of two possible ways in which merging can be done, based on how we want to combine the datasets.

The first merge we performed, where we took values common to both DataFrames, is known as Inner Join.
The second merge that we performed, where we took all values from the Left-DataFrame and only matching values from the Right-DataFrame, is known as Left Join.

In this section we will learn about four ways in which we can decide to join DataFrames.

Note: The decision to include or exclude rows from the DataFrames are based on the selected columns (one or multiple). In our examples, we based it on the “Student ID” column.

Let us now look at the kind of merges that we can perform:

INNER JOIN: In this kind of merging, we keep the rows which are common to both the DataFrames. This is also known as taking the intersection of the two DataFrames.

Inner Join

OUTER JOIN: We keep all the rows in this type of merging, even if it only appears in one DataFrame. This is also known as taking the union of the two DataFrames.

Whenever some column values are not available for a row, it is substituted with 'NaN' or 'Null' values.

Outer Join

LEFT JOIN: In this, we keep all the rows from the left DataFrame, and only data from rows with matching values from the right DataFrame are used.

Left Join

RIGHT JOIN: In this, we keep all the all the rows from the right DataFrame, and only data from rows with matching values from the left DataFrame are used.

Right Join

Set up

As in the previous tutorial, let’s start by importing the Pandas library.

import pandas as pd
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)

DataSets for Merge

We will be using two small data sets for exploring the different types of merge and joins. The datasets contain information about students from an engineering college. Some rows and columns, will be common across both the datasets.

Explanation of the features that appear across both the DataFrames:

Admn Yr - The year of admission in the college (numerical)
Seat Type - Students are admitted to nine different type of seats broadly classified on the basis of merit and fee structure. (categorical)
Entrance Rank - Rank secured in a common examination to measure academic merit of the student (numerical)
Board - Board under which the student studied in school. (categorical)
Marks Obtained - Total marks secured in the final school exam. (numerical)
Physics - Marks secured in Physics (numerical)
Chemistry - Marks secured in Chemistry (numerical)
Maths - Marks secured in Maths (numerical)

Let us load the Datasets one by one, using the read_csv() method from Pandas and display the contents.

# load first dataset 
load_commonlounge_dataset("students_merge1")
df_merge1 = pd.read_csv("/tmp/students_merge1.csv")
# load second dataset
load_commonlounge_dataset("students_merge2")
df_merge2 = pd.read_csv("/tmp/students_merge2.csv")
# display first dataset
print("df_merge1 DataFrame")
print(df_merge1)
print('')
# display second dataset
print("df_merge2 DataFrame")
print(df_merge2)

The `merge()` function

Now that we have loaded these datasets, let’s see how we can perform the above types of merge using Pandas.

The pandas library provides a function called merge() to perform merging of DataFrames. The syntax is as follows:

pandas.merge(left, right, how='inner', on=None)

This returns a merged DataFrame. The original DataFrames remain unchanged.

left and right parameters denote the left DataFrame and the right DataFrame to be merged. The order of these parameters is important, specially while doing Left Join or Right Join.
how parameter controls which type of join to perform. We can pass one of four different values:
"inner": For inner join. It is also the default value for the parameter, if nothing is passed.
"outer": For outer join
"left": For left join
"right": For right join
on parameter accepts column label or list of column labels. These are the columns based on which merging is performed, i.e. based on which we match rows from one DataFrame to rows in the other DataFrame. These column(s) should be present in both the DataFrames.
If nothing is passed then all the column labels that are common between the two DataFrames are used to perform the merge operation.

It is also possible that the columns on the left and right DataFrame have different names. In such a scenario we separately pass the names of the columns for each of the two DataFrames based on which merging should be performed. The syntax is as follows:

pandas.merge(left, right, how='inner', left_on=None, right_on=None)

Here, left_on is for the column label(s) for the left DataFrame, and right_on is for the column label(s) for the right DataFrame.

Let us take a couple of examples see for ourselves how it works.

Inner Join Example

We will now perform the inner join on the df_merge1 and df_merge2 DataFrames. So the how parameter will have "inner" as an argument. (This is also the default value.)

We will perform the merge on Students ID. We can see that Student ID is common to both the DataFrames.

merged_df = pandas.merge(df_merge1, df_merge2, how='inner', on="Student ID")

This will return us the merged DataFrame, where only the common values from "Student ID" will be kept.

Let us carry this out and see for ourselves:

# merge the DataFrames
merged_df = pd.merge(df_merge1, df_merge2, how='inner', on="Student ID")
# display
print(merged_df)

We can see only the "Student ID" values which are common to both the DataFrames (11 to 21) have been selected and kept in the merged DataFrame, just as we expected.

Left Join Example

Let’s also see an example of performing left join. We will take df_merge1 as the left DataFrame and df_merge2 as the right DataFrame:

merged_df = pandas.merge(df_merge1, df_merge2, how='left', on="Student ID")

This will return the merged DataFrame, where all the instances with values for "Student ID" from the left DataFrame will be kept. From the right DataFrame, only the rows which have matching "Student ID" values will be merged.

Let us carry it out and see the resulting DataFrame:

# merge the DataFrames
merged_df = pd.merge(df_merge1, df_merge2, how='left', on="Student ID")
# display
print(merged_df)

Instances from left DataFrame which do not have column values for the columns from the right DataFrame, have been filled in with NaN values.

Concatenate

Unlike merging, concatenation doesn’t look at values along a column to check for common elements. It simply places or glues DataFrames side by side (along rows index) or below one another (along columns).

Let us look at examples to clear this up. (We will be using subsets of the students DataSet which we have encountered earlier).

Set-up

As in the previous section for Merge, let’s start by loading the dataset.

# load first dataset 
load_commonlounge_dataset("students_concat1")
df_concat1 = pd.read_csv("/tmp/students_concat1.csv")
# load second dataset
load_commonlounge_dataset("students_concat2")
df_concat2 = pd.read_csv("/tmp/students_concat2.csv")
# load second dataset
load_commonlounge_dataset("students_concat3")
df_concat3 = pd.read_csv("/tmp/students_concat3.csv")
# load second dataset
load_commonlounge_dataset("students_concat4")
df_concat4 = pd.read_csv("/tmp/students_concat4.csv")
# display first dataset
print("df_concat1 DataFrame")
print(df_concat1)
print('')
# display second dataset
print("df_concat2 DataFrame")
print(df_concat2)
print('')
# display third dataset
print("df_concat3 DataFrame")
print(df_concat3)
print('')
# display fourth dataset
print("df_concat4 DataFrame")
print(df_concat4)

These datasets contain data of different students from the same college. The Student ID is unique for every individual student. The features have already been explained in the previous section on Merge.

The Datasets either contain data about the same set of students about different features, or data of different students about the same features. I will explain them as we move through the tutorial.

`concat()` function

The concat() function is used for concatenating DataFrames.

The syntax is as follows:

pandas.concat(objs, axis=0, join="outer", ignore_index=False)

objs parameter contains the DataFrames that need to be concatenated. We need to pass the DataFrames as a list of DataFrames.
axis parameter controls whether the concatenation happens along columns or rows.
“0” means along the row index
“1” means along the columns.
join parameter controls whether the joining is “inner” or “outer”. We will discuss this in the next few sections. By default it is “outer”.
If ignore_index is set to “False”, the concatenated DataFrame retains the index from the original DataFrames. If it is set to “True”, then it labels the concatenating axis with 0 to n-1. We will see this in the next sections also.

Concatenating along the columns

Let us look at the DataFrames df_concat1 and df_concat2.

df_concat1 contains information about different aspects such as rank, marks and university.
df_concat2 contains the subject specific marks for the same set of students.

Now since the marks are of the same students, we would like to simply attach the df_concat2 DataFrame side-by-side or column-wise with the df_concat1 DataFrame.

We will do so by passing the argument “1” to axis parameter. The other parameters can have the default value for now.

Let us concatenate the two DataFrames and see:

combined = pd.concat([df_concat1, df_concat2], axis=1)
# display concatenated DataFrame
print(combined)

So we see an example of concatenation along the columns. The two DataFrames are simply glued side by side and all the columns labels are concatenated.

Concatenating along the rows (index)

Now, let’s consider the case where two DataFrames have the same features for two different sets of students.

In our case, both df_concat1 and df_concat3 contain information about Entrance Rank, Marks obtained and Board. Moreover, if you notice the student IDs in df_concat1 and df_concat3, there is no overlap.

In such a scenario, we would like to attach the second DataFrame below the first DataFrame. This would give us a combined DataFrame with all the students.

We will do so by passing “0” to the axis parameter. Let the other parameters remain with their default value for now:

combined = pd.concat([df_concat1, df_concat3], axis=0)
# display concatenated DataFrame
print(combined)

It is exactly as we wanted.

There is only one slight problem. If you notice the row index, they have been retained from the original DataFrames. We do not want that. We want the new DataFrame to have index from 0 to n-1.

We can do that by setting the ignore_index parameter to “True”. Let us do that and see.

# concatenated with ignore_index=True
combined = pd.concat([df_concat1, df_concat3], axis=0, ignore_index=True)
# display concatenated DataFrame
print(combined)

Now our problem has been solved!

The `join` parameter

Now let us deal with a slightly more complicated case. The df_concat4 dataset is for a different set of students, and it also doesn’t have all the same features as df_concat1. It has some common features.

In such a scenario, we can concatenate in two different ways:

Combine the DataFrames one below the other by keeping only the columns which are common to both the DataFrames. This is achieved by passing the argument “inner” to the join parameter.
Combine DataFrames while keeping all the columns than appear in either of the DataFrames. Columns outside the intersection will be filled with NaN values. This is achieved by passing the argument “outer” to the join parameter

Let us see both the examples.

Inner Join Concat

If we pass “inner” argument to the join parameter, and concatenate with axis=0, only common columns will be kept.

Let us see how it looks.

combined = pd.concat([df_concat1, df_concat4], axis=0, join="inner", ignore_index=True)
# display concatenated DataFrames
print(combined)

In this case, Board and Physics columns will not be included from first DataFrame, and Maths column will not be included from the second DataFrame, as these columns only appear in either one DataFrame.

Outer Join Concat

Outer join is also the default behaviour of the concat() function. Let us see how it works:

combined = pd.concat([df_concat1, df_concat4], axis=0, join="outer", ignore_index=True)
# display concatenated DataFrames
print(combined)

We can see that all the columns has been retained and NaN has been used to fill up columns which are not common.

Summary

Merging matches rows between two DataFrames based on specified columns. In this tutorial, we performed merging based on Student ID.
We studied four different kinds of merge:
Inner Join: Only rows common to both the DataFrames are preserved. Also known as taking the intersection of the two DataFrames.
Outer Join: Rows appearing in either DataFrame are preserved. Also known as taking the union of the two DataFrames.
Left Join: All rows from the left DataFrame, and data from matching rows from the right DataFrame are preserved.
Right Join: All rows from the right DataFrame, and data from matching rows from the left DataFrame are preserved.
Concatenation glues DataFrames side by side (along rows) or below one another (along columns).
We studied two different kinds of concatenation:
Inner Join: Only rows / columns common to both the DataFrames are preserved.
Outer Join: Rows / columns appearing in either DataFrame are preserved.

Reference

Merge:

merged_df = pd.merge(df_merge1, df_merge2, how=join_type, on=column_names)
merged_df = pd.merge(df_merge1, df_merge2, how=join_type, left_on=column_names, right_on=column_names)
# join_type can be 'inner', 'outer', 'left' or 'right'

Concat:

combined_df = pandas.concat(list_of_dataframes, axis=0, join=join_type, ignore_index=False)
# join_type can be 'inner' or 'outer'