Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] How Does Pandas Concat Work?

#1
How Does Pandas Concat Work?

The pandas.concat( ) function combines the data from multiple Series and/or DataFrames fast and in an intuitive manner. It is one of the most basic data wrangling operations used in Pandas. In general, we draw some conclusions from the data by analyzing it. The confidence in our conclusions increases as we include more variables or meta-data about our data. This is achieved by combining data from a variety of different data sources. The basic Pandas objects, Series, and DataFrames are created by keeping these relational operations in mind. For example, pd.concat([df1, df2]) concatenates two DataFrames df1, df2 together horizontally and results in a new DataFrame.

Pandas Concat Two or More DataFrames


The most important and widely used use-case of Pandas concat – pd.concat( ) is to concatenate DataFrames.

For example, when you’re buying a new smartphone, often you might like to compare the specifications and price of the phones. This makes you take an informed decision. Such a comparison can be viewed below as an example from the amazon website for recent OnePlus phones.


In the above image, the data about four different smartphones are concatenated with their features as an index.

Let us construct two DataFrames and combine them to see how it works.

>>> import pandas as pd
>>> df1 = pd.DataFrame(
... {"Key": ["A", "B", "A", "C"], "C1":[1, 2, 3, 4], "C2": [10, 20, 30, 40]})
>>> df1.index = ["L1", "L2", "L3", "L4"]
>>> print(df1) Key C1 C2
L1 A 1 10
L2 B 2 20
L3 A 3 30
L4 C 4 40
>>> df2 = pd.DataFrame(
... {"Key": ["A", "B", "C", "D"], "C3": [100, 200, 300, 400]})
>>> df2.index = ["R1", "R2", "R3", "R4"]
>>> print(df2) Key C3
R1 A 100
R2 B 200
R3 C 300
R4 D 400

From the official Pandas documentation of Pandas concat;


The two major arguments used in pandas.concat( ) from the above image are,

  • objs – A sequence of Series and/or DataFrame objects
  • axis – Axis along which objs are concatenated

Out of the two arguments, objs remains constant. But, based on the value of the axis, the concatenation operation differs. Possible values of the axis are,

  • axis = 0 – Concatenate or stack the DataFrames down the rows
  • axis = 1 – Concatenate or stack the DataFrames along the columns

Remember this axis argument functionality, because it comes in many other Pandas functions. Let us see them in action using the above created Dataframes.

1. Row-Wise Concatenation (axis = 0 / ’index’)


>>> df3 = pd.concat([df1, df2], axis=0)
>>> print(df3) Key C1 C2 C3
L1 A 1.0 10.0 NaN
L2 B 2.0 20.0 NaN
L3 A 3.0 30.0 NaN
L4 C 4.0 40.0 NaN
R1 A NaN NaN 100.0
R2 B NaN NaN 200.0
R3 C NaN NaN 300.0
R4 D NaN NaN 400.0
>>> df3_dash = pd.concat([df1, df2])
>>> print(df3_dash) Key C1 C2 C3
L1 A 1.0 10.0 NaN
L2 B 2.0 20.0 NaN
L3 A 3.0 30.0 NaN
L4 C 4.0 40.0 NaN
R1 A NaN NaN 100.0
R2 B NaN NaN 200.0
R3 C NaN NaN 300.0
R4 D NaN NaN 400.0
>>> print(len(df3) == len(df1) + len(df2))
True

Any number of DataFrames can be given in the first argument which has a list of DataFrames like [df1, df2, df3, ..., dfn].

Some observations from the above results:

  • Note the outputs of df3 and df3_dash are the same. So, we need not explicitly mention the axis when we want to concatenate down the rows.
  • The number of rows in the output DataFrame = Total number of rows in all the input DataFrames.
  • The columns of the output DataFrame = Combination of distinct columns of all the input DataFrames.
  • There are unique columns present in the input DataFrames. The corresponding values at the row labels of different input DataFrames are filled with NaNs (Not a Number – missing values) in the output DataFrame.

Let’s visualize the above process in the following animation:


2. Column-Wise Concatenation (axis = 1 / ’columns’)


>>> df3 = pd.concat([df1, df2], axis=1)
>>> print(df3) Key C1 C2 Key C3
L1 A 1.0 10.0 NaN NaN
L2 B 2.0 20.0 NaN NaN
L3 A 3.0 30.0 NaN NaN
L4 C 4.0 40.0 NaN NaN
R1 NaN NaN NaN A 100.0
R2 NaN NaN NaN B 200.0
R3 NaN NaN NaN C 300.0
R4 NaN NaN NaN D 400.0
>>> print("The unique row indexes of df1 and df2:", '\n\t', df1.index.append(df2.index).unique())
The unique row indexes of df1 and df2: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object')
>>> print("The row indexes of df3:", "\n\t", df3.index)
The row indexes of df3: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object')
>>> print("The column indexes of df1 and df2:", "\n\t", df1.columns.append(df2.columns))
The column indexes of df1 and df2: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')
>>> print("The column indexes of df3:", "\n\t", df3.columns)
The column indexes of df3: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')

Some observations from the above results:

  • The DataFrames are concatenated side by side.
  • The columns in the output DataFrame = Total columns in all the input DataFrames.
  • Rows in the output DataFrame = Unique rows in all the input DataFrames.
  • There are unique rows present in all the input DataFrames. The corresponding values at the column labels of different input DataFrames are filled with NaNs (Not a Number – missing values) in the output DataFrame.

Let’s visualize the above process in the following animation:


Pandas Concat Columns


Please take a look at the initial OnePlus phones comparison table from the amazon website. A column in that table constitutes all the specifications of a given smartphone. Such all equivalent specifications (row labels) of all varieties (phones – column labels) are concatenated as columns to form the final comparison table.

So, to concatenate columns, we should have the same row indexes. In Pandas, the Series data structure is exactly designed to represent the columns and their combination forms the DataFrame data structure.

Let us construct two Series and concatenate them as columns to form a resultant DataFrame.

>>> ser1 = pd.Series([10, 20, 30, 40], name='C1')
>>> ser2 = pd.Series([100, 200, 300, 400], name='C2')
>>> print("Series 1:", "\n", ser1, "\n\n", "Series 2:", "\n", ser2)
Series 1:
0 10
1 20
2 30
3 40
Name: C1, dtype: int64 Series 2:
0 100
1 200
2 300
3 400
Name: C2, dtype: int64
>>> df = pd.concat([ser1, ser2], axis=1)
>>> print("DataFrame:", "\n", df)
DataFrame: C1 C2
0 10 100
1 20 200
2 30 300
3 40 400

Pandas Concat MultiIndex


Let us consider a use-case where we have hourly weather data for 4 hours about two cities. The data that we have are only the temperature (degC) and wind speed (kmph). One way of storing their data is to store them in different DataFrames per city. It can be done the following way,

>>> Date_Hourly = pd.date_range(start = '2020-11-20', periods = 4, freq = 'H')
>>> df_city1 = pd.DataFrame(
... {"temp(degC)": [27, 24, 22, 20],
... "windspeed(kmph)": [18, 17, 17, 18]},
... index = Date_Hourly
... )
>>> df_city2 = pd.DataFrame(
... {"temp(degC)": [30, 33, 33, 34],
... "windspeed(kmph)": [23, 25, 27, 30]},
... index = Date_Hourly
... )
>>> print("Weather Data of City 1:", "\n", df_city1)
Weather Data of City 1: temp(degC) windspeed(kmph)
2020-11-20 00:00:00 27 18
2020-11-20 01:00:00 24 17
2020-11-20 02:00:00 22 17
2020-11-20 03:00:00 20 18
>>> print("Weather Data of City 2:", "\n", df_city2)
Weather Data of City 2: temp(degC) windspeed(kmph)
2020-11-20 00:00:00 30 23
2020-11-20 01:00:00 33 25
2020-11-20 02:00:00 33 27
2020-11-20 03:00:00 34 30

Now, we might want to collect data of two cities into one DataFrame for easier analysis. MultiIndex keys serve as identifiers to specify the source of the data. This can be achieved by MultiIndex concatenation.

Multi-Index Concatenation is done in two ways;

1. Row-Wise Concatenation (axis = 0 / ’index’)


>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=0, keys=['City1', 'City2'])
>>> print("Row-Wise Multi-Index Concatenation:", "\n", df_concat_rowwise)
Row-Wise Multi-Index Concatenation: temp(degC) windspeed(kmph)
City1 2020-11-20 00:00:00 27 18 2020-11-20 01:00:00 24 17 2020-11-20 02:00:00 22 17 2020-11-20 03:00:00 20 18
City2 2020-11-20 00:00:00 30 23 2020-11-20 01:00:00 33 25 2020-11-20 02:00:00 33 27 2020-11-20 03:00:00 34 30

2. Column-Wise Concatenation (axis = 1 / ’columns’)


>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=1, keys=['City1', 'City2']) >>> print("Column-Wise Multi-Index Concatenation:", "\n", df_concat_colwise)
Column-Wise Multi-Index Concatenation: City1 City2 temp(degC) windspeed(kmph) temp(degC) windspeed(kmph)
2020-11-20 00:00:00 27 18 30 23
2020-11-20 01:00:00 24 17 33 25
2020-11-20 02:00:00 22 17 33 27
2020-11-20 03:00:00 20 18 34 30

The same can be achieved for many cities. After concatenation, all of the data is in one single DataFrame. This makes us analyze the weather efficiently instead of fetching data from multiple sources.

Pandas concat vs append


Concatenation along the rows (axis = 0) is very common. If you observe the weather data scenario after each hour data gets appended in the next row. So, for that purpose, a method called append( ) is built on top of DataFrame to append another DataFrame row-wise. This makes you achieve the same results as pd.concat( ) with few keystrokes.

It can be implemented as follows,

>>> df1 = pd.DataFrame({'C1': ['A', 'B', 'C', 'D']})
>>> df2 = pd.DataFrame({'C1': ['E', 'F', 'G', 'H']})
>>> print("DataFrame 1:", "\n", df1)
DataFrame 1: C1
0 A
1 B
2 C
3 D
>>> print("DataFrame 2:", "\n", df2)
DataFrame 2: C1
0 E
1 F
2 G
3 H
>>> pd.concat([df1, df2]) C1
0 A
1 B
2 C
3 D
0 E
1 F
2 G
3 H
>>> df1.append(df2) C1
0 A
1 B
2 C
3 D
0 E
1 F
2 G
3 H

You can observe above the same results for pd.concat([df1, df2]) and df1.append(df2).

Pandas concat slow


Each and every time we do a concatenation operation, it creates a new DataFrame. DataFrame concatenation operates equivalent to an SQL join operation. So, the output DataFrame’s index is formed first by join operation. Resolving all the mismatches between indexes of input DataFrames makes it slow. In some scenarios, indexes might not be of importance. In such cases, we can ignore indexes to make the concat operation faster.

Ignoring the index is done by the following way,

>>> df = pd.DataFrame({'C1': [10, 20, 30, 40]}, index=['R1', 'R2', 'R3', 'R4'])
>>> df C1
R1 10
R2 20
R3 30
R4 40
>>> df.reset_index(drop=True) C1
0 10
1 20
2 30
3 40

Along with concat, all other Pandas functions are executed by utilizing only a single core in the CPU. Operations on smaller datasets run in a seamless manner. As the dataset size increases, the functions of Pandas start to throttle because they do only one operation at once.

Modin is the python package created to speed up the execution of Pandas functions. It distributes the computation load to all the available cores. It does so by fragmenting the DatFrame and making the function run on DataFrame fragments in other cores parallelly. Please look after this article to know about it in detail.

The post How Does Pandas Concat Work? first appeared on Finxter.



https://www.sickgaming.net/blog/2020/11/...ncat-work/
Reply



Possibly Related Threads…
Thread Author Replies Views Last Post
  [Tut] Making $65 per Hour on Upwork with Pandas xSicKxBot 0 1,323 05-24-2023, 08:16 PM
Last Post: xSicKxBot
  [Tut] Pandas Series Object – A Helpful Guide with Examples xSicKxBot 0 1,313 05-01-2023, 01:30 AM
Last Post: xSicKxBot
  [Tut] Pandas Boolean Indexing xSicKxBot 0 1,311 04-16-2023, 10:54 AM
Last Post: xSicKxBot
  [Tut] Python List of Dicts to Pandas DataFrame xSicKxBot 0 1,534 04-11-2023, 04:15 AM
Last Post: xSicKxBot
  [Tut] How to Filter Data from an Excel File in Python with Pandas xSicKxBot 0 1,227 10-31-2022, 05:36 AM
Last Post: xSicKxBot
  [Tut] Solidity Bytes and String Arrays, Concat, Allocating Memory, and Array Literals xSicKxBot 0 1,204 10-25-2022, 09:13 AM
Last Post: xSicKxBot
  [Tut] How to Convert Pandas DataFrame/Series to NumPy Array? xSicKxBot 0 1,219 10-24-2022, 02:13 PM
Last Post: xSicKxBot
  [Tut] How to Apply a Function to Each Cell in a Pandas DataFrame? xSicKxBot 0 1,076 08-23-2022, 05:25 PM
Last Post: xSicKxBot
  [Tut] How to Get the Last N Rows of a Pandas DataFrame? xSicKxBot 0 1,175 07-12-2022, 02:10 PM
Last Post: xSicKxBot
  [Tut] pd.agg() – Aggregating Data in Pandas xSicKxBot 0 1,132 07-01-2022, 12:41 PM
Last Post: xSicKxBot

Forum Jump:


Users browsing this thread:
1 Guest(s)

Forum software by © MyBB Theme © iAndrew 2016