Once we start working with actual datasets, it might be very rare to find all the data in one place in the same file organised neatly. From time to time we will need to extract data from different sources, combine them in suitable form and only then can we proceed to do some work with the data.
In this tutorial we will learn about two operations — merge and concatenate — that can help us with combining DataFrames in different ways.
There are various ways in which DataFrames can be merged, depending on our needs. Let us take an example.
Suppose we have two Datasets: Left-Dataset and Right-Dataset as shown below.
The Left-Dataset contains the subject score for “Physics” and “Chemistry”, while the Right-Dataset contains the subject score for “Maths” for a group of students.
Each student is known by their unique “Student_ID” number.
We can see that students with ID 2,4 and 5 only have score for Physics and Chemistry, while students with ID 10, 11 and 13 have only Maths scores. Students with ID 6, 7 and 8 have all three subject scores.
In order to analyse data of all the students across the subjects, we would merge these two datasets together to form a single dataset with “Physics”, “Chemistry” and “Maths” marks of all the students.
Now, when we decide to merge the DataFrames, we need to figure out what to do about discrepancy that arises due to all the scores not being available for all the students.
If we only need the data of the students who have all the three scores, then we could merge it in a way that only keeps the data of students who are common in both the DataFrames. It would look like below:
But maybe “Maths” was an optional paper for all students, and we are interested in the marks of all the students who have “Physics” and “Chemistry” scores, with the “Maths” scores if available. The resulting DataFrame would look like below:
Note that if the “Maths” scores are not available, it says NaN (stands for Not a Number). We could substitute it with ‘Not Available’, ’Not Opted’, or any other appropriate special value.
What we saw above were examples of two possible ways in which merging can be done, based on how we want to combine the datasets.
- The first merge we performed, where we took values common to both DataFrames, is known as Inner Join.
- The second merge that we performed, where we took all values from the Left-DataFrame and only matching values from the Right-DataFrame, is known as Left Join.
In this section we will learn about four ways in which we can decide to join DataFrames.
Note: The decision to include or exclude rows from the DataFrames are based on the selected columns (one or multiple). In our examples, we based it on the “Student ID” column.
Let us now look at the kind of merges that we can perform:
INNER JOIN: In this kind of merging, we keep the rows which are common to both the DataFrames. This is also known as taking the intersection of the two DataFrames.
OUTER JOIN: We keep all the rows in this type of merging, even if it only appears in one DataFrame. This is also known as taking the union of the two DataFrames.
Whenever some column values are not available for a row, it is substituted with
LEFT JOIN: In this, we keep all the rows from the left DataFrame, and only data from rows with matching values from the right DataFrame are used.
RIGHT JOIN: In this, we keep all the all the rows from the right DataFrame, and only data from rows with matching values from the left DataFrame are used.
As in the previous tutorial, let’s start by importing the Pandas library.
import pandas as pd pd.set_option('display.width', 200) pd.set_option('display.max_columns', 20)
We will be using two small data sets for exploring the different types of merge and joins. The datasets contain information about students from an engineering college. Some rows and columns, will be common across both the datasets.
Explanation of the features that appear across both the DataFrames:
Admn Yr- The year of admission in the college (numerical)
Seat Type- Students are admitted to nine different type of seats broadly classified on the basis of merit and fee structure. (categorical)
Entrance Rank- Rank secured in a common examination to measure academic merit of the student (numerical)
Board- Board under which the student studied in school. (categorical)
Marks Obtained- Total marks secured in the final school exam. (numerical)
Physics- Marks secured in Physics (numerical)
Chemistry- Marks secured in Chemistry (numerical)
Maths- Marks secured in Maths (numerical)
Let us load the Datasets one by one, using the
read_csv() method from Pandas and display the contents.
# load first dataset load_commonlounge_dataset("students_merge1") df_merge1 = pd.read_csv("/tmp/students_merge1.csv") # load second dataset load_commonlounge_dataset("students_merge2") df_merge2 = pd.read_csv("/tmp/students_merge2.csv") # display first dataset print("df_merge1 DataFrame") print(df_merge1) print('') # display second dataset print("df_merge2 DataFrame") print(df_merge2)
Now that we have loaded these datasets, let’s see how we can perform the above types of merge using Pandas.
pandas library provides a function called
merge() to perform merging of DataFrames. The syntax is as follows:
pandas.merge(left, right, how='inner', on=None)
This returns a merged DataFrame. The original DataFrames remain unchanged.
rightparameters denote the left DataFrame and the right DataFrame to be merged. The order of these parameters is important, specially while doing Left Join or Right Join.
howparameter controls which type of join to perform. We can pass one of four different values:
"inner": For inner join. It is also the default value for the parameter, if nothing is passed.
"outer": For outer join
"left": For left join
"right": For right join
onparameter accepts column label or list of column labels. These are the columns based on which merging is performed, i.e. based on which we match rows from one DataFrame to rows in the other DataFrame. These column(s) should be present in both the DataFrames.
- If nothing is passed then all the column labels that are common between the two DataFrames are used to perform the merge operation.
It is also possible that the columns on the left and right DataFrame have different names. In such a scenario we separately pass the names of the columns for each of the two DataFrames based on which merging should be performed. The syntax is as follows:
pandas.merge(left, right, how='inner', left_on=None, right_on=None)
left_on is for the column label(s) for the left DataFrame, and
right_on is for the column label(s) for the right DataFrame.
Let us take a couple of examples see for ourselves how it works.
We will now perform the inner join on the
df_merge2 DataFrames. So the
how parameter will have
"inner" as an argument. (This is also the default value.)
We will perform the merge on
Students ID. We can see that
Student ID is common to both the DataFrames.
merged_df = pandas.merge(df_merge1, df_merge2, how='inner', on="Student ID")
This will return us the merged DataFrame, where only the common values from
"Student ID" will be kept.
Let us carry this out and see for ourselves:
# merge the DataFrames merged_df = pd.merge(df_merge1, df_merge2, how='inner', on="Student ID") # display print(merged_df)
We can see only the
"Student ID" values which are common to both the DataFrames (11 to 21) have been selected and kept in the merged DataFrame, just as we expected.
Let’s also see an example of performing left join. We will take
df_merge1 as the left DataFrame and
df_merge2 as the right DataFrame:
merged_df = pandas.merge(df_merge1, df_merge2, how='left', on="Student ID")
This will return the merged DataFrame, where all the instances with values for
"Student ID" from the left DataFrame will be kept. From the right DataFrame, only the rows which have matching
"Student ID" values will be merged.
Let us carry it out and see the resulting DataFrame:
# merge the DataFrames merged_df = pd.merge(df_merge1, df_merge2, how='left', on="Student ID") # display print(merged_df)
Instances from left DataFrame which do not have column values for the columns from the right DataFrame, have been filled in with
Unlike merging, concatenation doesn’t look at values along a column to check for common elements. It simply places or glues DataFrames side by side (along rows index) or below one another (along columns).
Let us look at examples to clear this up. (We will be using subsets of the students DataSet which we have encountered earlier).
As in the previous section for Merge, let’s start by loading the dataset.
# load first dataset load_commonlounge_dataset("students_concat1") df_concat1 = pd.read_csv("/tmp/students_concat1.csv") # load second dataset load_commonlounge_dataset("students_concat2") df_concat2 = pd.read_csv("/tmp/students_concat2.csv") # load second dataset load_commonlounge_dataset("students_concat3") df_concat3 = pd.read_csv("/tmp/students_concat3.csv") # load second dataset load_commonlounge_dataset("students_concat4") df_concat4 = pd.read_csv("/tmp/students_concat4.csv") # display first dataset print("df_concat1 DataFrame") print(df_concat1) print('') # display second dataset print("df_concat2 DataFrame") print(df_concat2) print('') # display third dataset print("df_concat3 DataFrame") print(df_concat3) print('') # display fourth dataset print("df_concat4 DataFrame") print(df_concat4)
These datasets contain data of different students from the same college. The
Student ID is unique for every individual student. The features have already been explained in the previous section on Merge.
The Datasets either contain data about the same set of students about different features, or data of different students about the same features. I will explain them as we move through the tutorial.
concat() function is used for concatenating DataFrames.
The syntax is as follows:
pandas.concat(objs, axis=0, join="outer", ignore_index=False)
objsparameter contains the DataFrames that need to be concatenated. We need to pass the DataFrames as a list of DataFrames.
axisparameter controls whether the concatenation happens along columns or rows.
- “0” means along the row index
- “1” means along the columns.
joinparameter controls whether the joining is “inner” or “outer”. We will discuss this in the next few sections. By default it is “outer”.
ignore_indexis set to “False”, the concatenated DataFrame retains the index from the original DataFrames. If it is set to “True”, then it labels the concatenating axis with 0 to n-1. We will see this in the next sections also.
Let us look at the DataFrames
df_concat1contains information about different aspects such as rank, marks and university.
df_concat2contains the subject specific marks for the same set of students.
Now since the marks are of the same students, we would like to simply attach the
df_concat2 DataFrame side-by-side or column-wise with the
We will do so by passing the argument “1” to
axis parameter. The other parameters can have the default value for now.
Let us concatenate the two DataFrames and see:
combined = pd.concat([df_concat1, df_concat2], axis=1) # display concatenated DataFrame print(combined)
So we see an example of concatenation along the columns. The two DataFrames are simply glued side by side and all the columns labels are concatenated.
Now, let’s consider the case where two DataFrames have the same features for two different sets of students.
In our case, both
df_concat3 contain information about
Marks obtained and
Board. Moreover, if you notice the
student IDs in
df_concat3, there is no overlap.
In such a scenario, we would like to attach the second DataFrame below the first DataFrame. This would give us a combined DataFrame with all the students.
We will do so by passing “0” to the
axis parameter. Let the other parameters remain with their default value for now:
combined = pd.concat([df_concat1, df_concat3], axis=0) # display concatenated DataFrame print(combined)
It is exactly as we wanted.
There is only one slight problem. If you notice the row index, they have been retained from the original DataFrames. We do not want that. We want the new DataFrame to have index from 0 to n-1.
We can do that by setting the
ignore_index parameter to “True”. Let us do that and see.
# concatenated with ignore_index=True combined = pd.concat([df_concat1, df_concat3], axis=0, ignore_index=True) # display concatenated DataFrame print(combined)
Now our problem has been solved!
Now let us deal with a slightly more complicated case. The
df_concat4 dataset is for a different set of students, and it also doesn’t have all the same features as
df_concat1. It has some common features.
In such a scenario, we can concatenate in two different ways:
- Combine the DataFrames one below the other by keeping only the columns which are common to both the DataFrames. This is achieved by passing the argument “inner” to the
- Combine DataFrames while keeping all the columns than appear in either of the DataFrames. Columns outside the intersection will be filled with
NaNvalues. This is achieved by passing the argument “outer” to the
Let us see both the examples.
If we pass “inner” argument to the
join parameter, and concatenate with
axis=0, only common columns will be kept.
Let us see how it looks.
combined = pd.concat([df_concat1, df_concat4], axis=0, join="inner", ignore_index=True) # display concatenated DataFrames print(combined)
In this case,
Physics columns will not be included from first DataFrame, and
Maths column will not be included from the second DataFrame, as these columns only appear in either one DataFrame.
Outer join is also the default behaviour of the
concat() function. Let us see how it works:
combined = pd.concat([df_concat1, df_concat4], axis=0, join="outer", ignore_index=True) # display concatenated DataFrames print(combined)
We can see that all the columns has been retained and
NaN has been used to fill up columns which are not common.
- Merging matches rows between two DataFrames based on specified columns. In this tutorial, we performed merging based on
- We studied four different kinds of merge:
- Inner Join: Only rows common to both the DataFrames are preserved. Also known as taking the intersection of the two DataFrames.
- Outer Join: Rows appearing in either DataFrame are preserved. Also known as taking the union of the two DataFrames.
- Left Join: All rows from the left DataFrame, and data from matching rows from the right DataFrame are preserved.
- Right Join: All rows from the right DataFrame, and data from matching rows from the left DataFrame are preserved.
- Concatenation glues DataFrames side by side (along rows) or below one another (along columns).
- We studied two different kinds of concatenation:
- Inner Join: Only rows / columns common to both the DataFrames are preserved.
- Outer Join: Rows / columns appearing in either DataFrame are preserved.
merged_df = pd.merge(df_merge1, df_merge2, how=join_type, on=column_names) merged_df = pd.merge(df_merge1, df_merge2, how=join_type, left_on=column_names, right_on=column_names) # join_type can be 'inner', 'outer', 'left' or 'right'
combined_df = pandas.concat(list_of_dataframes, axis=0, join=join_type, ignore_index=False) # join_type can be 'inner' or 'outer'