Create an account


Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
[Tut] How to Filter Data from an Excel File in Python with Pandas

#1
How to Filter Data from an Excel File in Python with Pandas

5/5 – (1 vote)

Problem Formulation and Solution Overview


This article will show different ways to read and filter an Excel file in Python.

To make it more interesting, we have the following scenario:

Sven is a Senior Coder at K-Paddles. K-Paddles manufactures Kayak Paddles made of Kevlar for the White Water Rafting Community. Sven has been asked to read an Excel file and run reports. This Excel file contains two (2) worksheets, Employees and Sales.


To follow along, download the kp_data.xlsx file and place it into the current working directory.


? Question: How would we write code to filter an Excel file in Python?

We can accomplish this task by one of the following options:


Method 1: Use read_excel() and the & operator


This method uses the read_excel() function to read an XLSX file into a DataFrame and an expression to filter the results.

This example imports the above-noted Excel file into a DataFrame. The Employees worksheet is accessed, and the following filter is applied:

? Give me the DataFrame rows for all employees who work in the Sales Department, and earn more than $55,000/annum.

Let’s convert this to Python code.

import pandas as pd cols = ['First', 'Last', 'Dept', 'Salary']
df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols)
df_salary = df_emps[(df_emps['Dept'] == 'Sales') & (df_emps['Salary'] > 55000)] df_salary.to_excel('sales_55.xlsx', sheet_name='Sales Salaries Greater Than 55K')

The first line in the above code snippet imports the Pandas library. This allows access to and manipulation of the XLSX file. Just so you know, the openpyxl library must be installed before continuing.

The following line defines the four (4) columns to retrieve from the XLSX file and saves them to the variable cols as a List.

?Note: Open the Excel file and review the data to follow along.

Import the Excel File to Python


On the next line in the code snippet, read_excel() is called and passed three (3) arguments:

  • The name of the Excel file to import (kp_data.xlsx).
  • The worksheet name. The first worksheet in the Excel file is always read unless stated otherwise. For this example, our Excel file contains two (2) worksheets: Employees and Sales. The Employees worksheet can be referenced using sheet_name=0 or sheet_name='Employees'. Both produce the same result.
  • The columns to retrieve from the Excel workheet (usecols=cols).

The results save to df_emps.

Filter the DataFrame


The highlighted line applies a filter that references the DataFrame columns to base the filter on and the & operator to allow for more than one (1) filter criteria.

In Python, this filter:

df_salary = df_emps[(df_emps['Dept'] == 'Sales') & (df_emps['Salary'] > 55000)]

Equates to this:

? Give me the DataFrame rows for all employees who work in the Sales Department, and earn more than $55,000/annum.

These results save to sales_55.xlsx with a worksheet ‘Sales Salaries Greater Than 55K‘ and placed into the current working directory.

Contents of Filtered Excel File



YouTube Video


Method 2: Use read_excel() and loc[]


This method uses the read_excel() function to read an XLSX file into a DataFrame and loc[] to filter the results. The loc[] function can access either a group of rows or columns based on their label names.

This example imports the above-noted Excel file into a DataFrame. The Employees worksheet is accessed, and the following filter is applied:

? Give me the DataFrame rows for all employees who work in the IT Department, and live in the United States.

Let’s convert this to Python code.

import pandas as pd
from openpyxl import load_workbook cols = ['First', 'Last', 'Dept', 'Country']
df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols)
df_it = df_emps.loc[(df_emps.Dept == 'IT') & (df_emps.Country == 'United States')] book = load_workbook('kp_data.xlsx')
writer = pd.ExcelWriter('kp_data.xlsx', engine='openpyxl')
writer.book = book
df_it.to_excel(writer, sheet_name = 'IT - US')
writer.save()
writer.close()

The first line in the above code snippet imports the Pandas library. This allows access to and manipulation of the XLSX file. Just so you know, the openpyxl library must be installed before continuing.

The following line imports openpyxl. This is required, in this case, to save the filtered results to a new worksheet in the same Excel file.

The following line defines the four (4) columns to retrieve from the XLSX file and saves them to the variable cols as a List.

?Note: Open the Excel file and review the data to follow along.

Import the Excel File to Python


On the next line in the code snippet, read_excel() is called and passed three (3) arguments:

  • The name of the Excel file to import (kp_data.xlsx).
  • The worksheet name. The first worksheet in the Excel file is always read unless stated otherwise. For this example, our Excel file contains two (2) worksheets: Employees and Sales. The Employees worksheet can be referenced using sheet_name=0 or sheet_name='Employees'. Both produce the same result.
  • The columns to retrieve from the Excel worksheet (usecols=cols).

The results save to df_it.

Filter the DataFrame


The highlighted line applies a filter using loc[] and passes the filter to return specific rows from the DataFrame.

In Python, this filter:

df_it = df_emps.loc[(df_emps.Dept == 'IT') & (df_emps.Country == 'United States')]

Equates to this:

? Give me the DataFrame rows for all employees who work in the IT Department, and live in the United States.

Saves Results to Worksheet in Same Excel File


In the bottom highlighted section of the above code, the Excel file is re-opened using load_workbook(). Then, a writer object is declared, the results filtered, and written to a new worksheet, called IT - US and the file is saved and closed.


YouTube Video


Method 3: Use read_excel() and iloc[]


This method uses the read_excel() function to read an XLSX file into a DataFrame and iloc[] to filter the results. The iloc[] function accesses either a group of rows or columns based on their location (integer value).

This example imports required Pandas library and the above-noted Excel file into a DataFrame. The Sales worksheet is then accessed.

This worksheet contains the yearly sale totals for K-Paddles paddles. These results are filtered to the first six (6) rows in the DataFrame and columns shown below.

import pandas as pd cols = ['Month', 'Aspire', 'Adventurer', 'Maximizer']
df_sales = pd.read_excel('kp_data.xlsx', sheet_name='Sales', usecols=cols)
df_aspire = df_sales.iloc[0:6]
print(df_aspire)

The results are output to the terminal.


Month Aspire Adventurer Maximizer
0 1 2500 5200 21100
1 2 2630 5100 18330
2 3 2140 4550 22470
3 4 3400 5870 22270
4 5 3600 4560 20960
5 6 2760 4890 20140


Method 4: Use read_excel(), index[] and loc[]


This method uses the read_excel() function to read an XLSX file into a DataFrame in conjunction with index[] and loc[] to filter the results. The loc[] function can access either a group of rows or columns based on their label names.

This example imports the required Pandas library and the above-noted Excel file into a DataFrame. The Sales worksheet is then accessed. This worksheet contains the yearly sale totals for K-Paddles paddles.

These results are filtered to view the results for the Pinnacle paddle using index[] and passing it a start and stop position (stop-1).

import pandas as pd cols = ['Month', 'Pinnacle']
df_pinnacle = pd.read_excel('kp_data.xlsx', sheet_name='Sales', usecols=cols)
print(df_pinnacle.loc[df_pinnacle.index[0:5], ['Month', 'Pinnacle']])

The results are output to the terminal.


Month Pinnacle
0 1 1500
1 2 1200
2 3 1340
3 4 1130
4 5 1740

YouTube Video


Method 5: Use read_excel() and isin()


This method uses the read_excel() function to read an XLSX file into a DataFrame using isin() to filter the results. The isin() function filters the results down to the records that match the criteria passed as an argument.

This example imports required Pandas library and the above-noted Excel file into a DataFrame. The Employees worksheet is then accessed.

These results are filtered to view the results for all employees who reside in Chicago.

import pandas as pd cols = ['First', 'Last', 'City']
df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols)
print(df_emps[df_emps.City.isin(['Chicago'])])

The results are output to the terminal.


First Last City
2 Luna Sanders Chicago
3 Penelope Jordan Chicago
9 Madeline Walker Chicago
34 Caroline Jenkins Chicago


Summary


This article has provided five (5) ways to filter data from an Excel file using Python to select the best fit for your coding requirements.

Good Luck & Happy Coding!


Programmer Humor – Blockchain


“Blockchains are like grappling hooks, in that it’s extremely cool when you encounter a problem for which they’re the right solution, but it happens way too rarely in real life.” source xkcd



https://www.sickgaming.net/blog/2022/10/...th-pandas/
Reply



Possibly Related Threads…
Thread Author Replies Views Last Post
  [Tut] Making $65 per Hour on Upwork with Pandas xSicKxBot 0 1,320 05-24-2023, 08:16 PM
Last Post: xSicKxBot
  [Tut] Pandas Series Object – A Helpful Guide with Examples xSicKxBot 0 1,311 05-01-2023, 01:30 AM
Last Post: xSicKxBot
  [Tut] Pandas Boolean Indexing xSicKxBot 0 1,309 04-16-2023, 10:54 AM
Last Post: xSicKxBot
  [Tut] Python List of Dicts to Pandas DataFrame xSicKxBot 0 1,529 04-11-2023, 04:15 AM
Last Post: xSicKxBot
  [Tut] How to Create and Run a Batch File That Runs a Python Script? xSicKxBot 0 1,240 11-09-2022, 09:53 PM
Last Post: xSicKxBot
  [Tut] Python Create JSON File xSicKxBot 0 1,252 11-03-2022, 01:09 PM
Last Post: xSicKxBot
  [Tut] How to Convert Pandas DataFrame/Series to NumPy Array? xSicKxBot 0 1,213 10-24-2022, 02:13 PM
Last Post: xSicKxBot
  [Tut] How to Return a File From a Function in Python? xSicKxBot 0 1,327 10-21-2022, 09:47 AM
Last Post: xSicKxBot
  [Tut] Python Time Series Forecast on Bitcoin Data (Part II) xSicKxBot 0 1,350 10-03-2022, 07:46 AM
Last Post: xSicKxBot
  [Tut] How to Delete a Line from a File in Python? xSicKxBot 0 1,216 09-24-2022, 10:31 AM
Last Post: xSicKxBot

Forum Jump:


Users browsing this thread:
1 Guest(s)

Forum software by © MyBB Theme © iAndrew 2016