Pandas Concat Two or More DataFrames

How Does Pandas Concat Work?

The pandas.concat( ) function combines the data from multiple Series and/or DataFrames fast and in an intuitive manner. It is one of the most basic data wrangling operations used in Pandas. In general, we draw some conclusions from the data by analyzing it. The confidence in our conclusions increases as we include more variables or meta-data about our data. This is achieved by combining data from a variety of different data sources. The basic Pandas objects, Series, and DataFrames are created by keeping these relational operations in mind. For example, pd.concat([df1, df2]) concatenates two DataFrames df1, df2 together horizontally and results in a new DataFrame.

Pandas Concat Two or More DataFrames

The most important and widely used use-case of Pandas concat – pd.concat( ) is to concatenate DataFrames.

For example, when you’re buying a new smartphone, often you might like to compare the specifications and price of the phones. This makes you take an informed decision. Such a comparison can be viewed below as an example from the amazon website for recent OnePlus phones.

In the above image, the data about four different smartphones are concatenated with their features as an index.

Let us construct two DataFrames and combine them to see how it works.

>>> import pandas as pd

>>> df1 = pd.DataFrame(

... {"Key": ["A", "B", "A", "C"], "C1":[1, 2, 3, 4], "C2": [10, 20, 30, 40]})

>>> df1.index = ["L1", "L2", "L3", "L4"]

>>> print(df1) Key C1 C2

L1 A 1 10

L2 B 2 20

L3 A 3 30

L4 C 4 40

>>> df2 = pd.DataFrame(

... {"Key": ["A", "B", "C", "D"], "C3": [100, 200, 300, 400]})

>>> df2.index = ["R1", "R2", "R3", "R4"]

>>> print(df2) Key C3

R1 A 100

R2 B 200

R3 C 300

R4 D 400

From the official Pandas documentation of Pandas concat;

The two major arguments used in pandas.concat( ) from the above image are,

objs – A sequence of Series and/or DataFrame objects
axis – Axis along which objs are concatenated

Out of the two arguments, objs remains constant. But, based on the value of the axis, the concatenation operation differs. Possible values of the axis are,

axis = 0 – Concatenate or stack the DataFrames down the rows
axis = 1 – Concatenate or stack the DataFrames along the columns

Remember this axis argument functionality, because it comes in many other Pandas functions. Let us see them in action using the above created Dataframes.

1. Row-Wise Concatenation (axis = 0 / ’index’)

>>> df3 = pd.concat([df1, df2], axis=0)

>>> print(df3) Key C1 C2 C3

L1 A 1.0 10.0 NaN

L2 B 2.0 20.0 NaN

L3 A 3.0 30.0 NaN

L4 C 4.0 40.0 NaN

R1 A NaN NaN 100.0

R2 B NaN NaN 200.0

R3 C NaN NaN 300.0

R4 D NaN NaN 400.0

>>> df3_dash = pd.concat([df1, df2])

>>> print(df3_dash) Key C1 C2 C3

L1 A 1.0 10.0 NaN

L2 B 2.0 20.0 NaN

L3 A 3.0 30.0 NaN

L4 C 4.0 40.0 NaN

R1 A NaN NaN 100.0

R2 B NaN NaN 200.0

R3 C NaN NaN 300.0

R4 D NaN NaN 400.0

>>> print(len(df3) == len(df1) + len(df2))

True

Any number of DataFrames can be given in the first argument which has a list of DataFrames like [df1, df2, df3, ..., dfn].

Some observations from the above results:

Note the outputs of df3 and df3_dash are the same. So, we need not explicitly mention the axis when we want to concatenate down the rows.
The number of rows in the output DataFrame = Total number of rows in all the input DataFrames.
The columns of the output DataFrame = Combination of distinct columns of all the input DataFrames.
There are unique columns present in the input DataFrames. The corresponding values at the row labels of different input DataFrames are filled with NaNs (Not a Number – missing values) in the output DataFrame.

Let’s visualize the above process in the following animation:

2. Column-Wise Concatenation (axis = 1 / ’columns’)

>>> df3 = pd.concat([df1, df2], axis=1)

>>> print(df3) Key C1 C2 Key C3

L1 A 1.0 10.0 NaN NaN

L2 B 2.0 20.0 NaN NaN

L3 A 3.0 30.0 NaN NaN

L4 C 4.0 40.0 NaN NaN

R1 NaN NaN NaN A 100.0

R2 NaN NaN NaN B 200.0

R3 NaN NaN NaN C 300.0

R4 NaN NaN NaN D 400.0

>>> print("The unique row indexes of df1 and df2:", '\n\t', df1.index.append(df2.index).unique())

The unique row indexes of df1 and df2: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object')

>>> print("The row indexes of df3:", "\n\t", df3.index)

The row indexes of df3: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object')

>>> print("The column indexes of df1 and df2:", "\n\t", df1.columns.append(df2.columns))

The column indexes of df1 and df2: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')

>>> print("The column indexes of df3:", "\n\t", df3.columns)

The column indexes of df3: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')

Some observations from the above results:

The DataFrames are concatenated side by side.
The columns in the output DataFrame = Total columns in all the input DataFrames.
Rows in the output DataFrame = Unique rows in all the input DataFrames.
There are unique rows present in all the input DataFrames. The corresponding values at the column labels of different input DataFrames are filled with NaNs (Not a Number – missing values) in the output DataFrame.

Let’s visualize the above process in the following animation:

Pandas Concat Columns

Please take a look at the initial OnePlus phones comparison table from the amazon website. A column in that table constitutes all the specifications of a given smartphone. Such all equivalent specifications (row labels) of all varieties (phones – column labels) are concatenated as columns to form the final comparison table.

So, to concatenate columns, we should have the same row indexes. In Pandas, the Series data structure is exactly designed to represent the columns and their combination forms the DataFrame data structure.

Let us construct two Series and concatenate them as columns to form a resultant DataFrame.

>>> ser1 = pd.Series([10, 20, 30, 40], name='C1')

>>> ser2 = pd.Series([100, 200, 300, 400], name='C2')

>>> print("Series 1:", "\n", ser1, "\n\n", "Series 2:", "\n", ser2)

Series 1:

0 10

1 20

2 30

3 40

Name: C1, dtype: int64 Series 2:

0 100

1 200

2 300

3 400

Name: C2, dtype: int64

>>> df = pd.concat([ser1, ser2], axis=1)

>>> print("DataFrame:", "\n", df)

DataFrame: C1 C2

0 10 100

1 20 200

2 30 300

3 40 400

Pandas Concat MultiIndex

Let us consider a use-case where we have hourly weather data for 4 hours about two cities. The data that we have are only the temperature (degC) and wind speed (kmph). One way of storing their data is to store them in different DataFrames per city. It can be done the following way,

>>> Date_Hourly = pd.date_range(start = '2020-11-20', periods = 4, freq = 'H')

>>> df_city1 = pd.DataFrame(

... {"temp(degC)": [27, 24, 22, 20],

... "windspeed(kmph)": [18, 17, 17, 18]},

... index = Date_Hourly

... )

>>> df_city2 = pd.DataFrame(

... {"temp(degC)": [30, 33, 33, 34],

... "windspeed(kmph)": [23, 25, 27, 30]},

... index = Date_Hourly

... )

>>> print("Weather Data of City 1:", "\n", df_city1)

Weather Data of City 1: temp(degC) windspeed(kmph)

2020-11-20 00:00:00 27 18

2020-11-20 01:00:00 24 17

2020-11-20 02:00:00 22 17

2020-11-20 03:00:00 20 18

>>> print("Weather Data of City 2:", "\n", df_city2)

Weather Data of City 2: temp(degC) windspeed(kmph)

2020-11-20 00:00:00 30 23

2020-11-20 01:00:00 33 25

2020-11-20 02:00:00 33 27

2020-11-20 03:00:00 34 30

Now, we might want to collect data of two cities into one DataFrame for easier analysis. MultiIndex keys serve as identifiers to specify the source of the data. This can be achieved by MultiIndex concatenation.

Multi-Index Concatenation is done in two ways;

1. Row-Wise Concatenation (axis = 0 / ’index’)

>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=0, keys=['City1', 'City2'])

>>> print("Row-Wise Multi-Index Concatenation:", "\n", df_concat_rowwise)

Row-Wise Multi-Index Concatenation: temp(degC) windspeed(kmph)

City1 2020-11-20 00:00:00 27 18 2020-11-20 01:00:00 24 17 2020-11-20 02:00:00 22 17 2020-11-20 03:00:00 20 18

City2 2020-11-20 00:00:00 30 23 2020-11-20 01:00:00 33 25 2020-11-20 02:00:00 33 27 2020-11-20 03:00:00 34 30

2. Column-Wise Concatenation (axis = 1 / ’columns’)

>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=1, keys=['City1', 'City2']) >>> print("Column-Wise Multi-Index Concatenation:", "\n", df_concat_colwise)

Column-Wise Multi-Index Concatenation: City1 City2 temp(degC) windspeed(kmph) temp(degC) windspeed(kmph)

2020-11-20 00:00:00 27 18 30 23

2020-11-20 01:00:00 24 17 33 25

2020-11-20 02:00:00 22 17 33 27

2020-11-20 03:00:00 20 18 34 30

The same can be achieved for many cities. After concatenation, all of the data is in one single DataFrame. This makes us analyze the weather efficiently instead of fetching data from multiple sources.

Pandas concat vs append

Concatenation along the rows (axis = 0) is very common. If you observe the weather data scenario after each hour data gets appended in the next row. So, for that purpose, a method called append( ) is built on top of DataFrame to append another DataFrame row-wise. This makes you achieve the same results as pd.concat( ) with few keystrokes.

It can be implemented as follows,

>>> df1 = pd.DataFrame({'C1': ['A', 'B', 'C', 'D']})

>>> df2 = pd.DataFrame({'C1': ['E', 'F', 'G', 'H']})

>>> print("DataFrame 1:", "\n", df1)

DataFrame 1: C1

0 A

1 B

2 C

3 D

>>> print("DataFrame 2:", "\n", df2)

DataFrame 2: C1

0 E

1 F

2 G

3 H

>>> pd.concat([df1, df2]) C1

0 A

1 B

2 C

3 D

0 E

1 F

2 G

3 H

>>> df1.append(df2) C1

0 A

1 B

2 C

3 D

0 E

1 F

2 G

3 H

You can observe above the same results for pd.concat([df1, df2]) and df1.append(df2).

Pandas concat slow

Each and every time we do a concatenation operation, it creates a new DataFrame. DataFrame concatenation operates equivalent to an SQL join operation. So, the output DataFrame’s index is formed first by join operation. Resolving all the mismatches between indexes of input DataFrames makes it slow. In some scenarios, indexes might not be of importance. In such cases, we can ignore indexes to make the concat operation faster.

Ignoring the index is done by the following way,

>>> df = pd.DataFrame({'C1': [10, 20, 30, 40]}, index=['R1', 'R2', 'R3', 'R4'])

>>> df C1

R1 10

R2 20

R3 30

R4 40

>>> df.reset_index(drop=True) C1

0 10

1 20

2 30

3 40

Along with concat, all other Pandas functions are executed by utilizing only a single core in the CPU. Operations on smaller datasets run in a seamless manner. As the dataset size increases, the functions of Pandas start to throttle because they do only one operation at once.

Modin is the python package created to speed up the execution of Pandas functions. It distributes the computation load to all the available cores. It does so by fragmenting the DatFrame and making the function run on DataFrame fragments in other cores parallelly. Please look after this article to know about it in detail.

The post How Does Pandas Concat Work? first appeared on Finxter.

https://www.sickgaming.net/blog/2020/11/...ncat-work/

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	[Tut] Making $65 per Hour on Upwork with Pandas	xSicKxBot	0	1,435	05-24-2023, 08:16 PM Last Post: xSicKxBot
	[Tut] Pandas Series Object – A Helpful Guide with Examples	xSicKxBot	0	1,440	05-01-2023, 01:30 AM Last Post: xSicKxBot
	[Tut] Pandas Boolean Indexing	xSicKxBot	0	1,410	04-16-2023, 10:54 AM Last Post: xSicKxBot
	[Tut] Python List of Dicts to Pandas DataFrame	xSicKxBot	0	1,687	04-11-2023, 04:15 AM Last Post: xSicKxBot
	[Tut] How to Filter Data from an Excel File in Python with Pandas	xSicKxBot	0	1,330	10-31-2022, 05:36 AM Last Post: xSicKxBot
	[Tut] Solidity Bytes and String Arrays, Concat, Allocating Memory, and Array Literals	xSicKxBot	0	1,301	10-25-2022, 09:13 AM Last Post: xSicKxBot
	[Tut] How to Convert Pandas DataFrame/Series to NumPy Array?	xSicKxBot	0	1,313	10-24-2022, 02:13 PM Last Post: xSicKxBot
	[Tut] How to Apply a Function to Each Cell in a Pandas DataFrame?	xSicKxBot	0	1,161	08-23-2022, 05:25 PM Last Post: xSicKxBot
	[Tut] How to Get the Last N Rows of a Pandas DataFrame?	xSicKxBot	0	1,280	07-12-2022, 02:10 PM Last Post: xSicKxBot
	[Tut] pd.agg() – Aggregating Data in Pandas	xSicKxBot	0	1,226	07-01-2022, 12:41 PM Last Post: xSicKxBot

xSicKxBot

Pandas Concat Two or More DataFrames

1. Row-Wise Concatenation (axis = 0 / ’index’)

2. Column-Wise Concatenation (axis = 1 / ’columns’)

Pandas Concat Columns

Pandas Concat MultiIndex

1. Row-Wise Concatenation (axis = 0 / ’index’)

2. Column-Wise Concatenation (axis = 1 / ’columns’)

Pandas concat vs append

Pandas concat slow