05-24-2022, 07:11 PM
The Ultimate Guide to Data Cleaning in Python and Pandas
<div><div class="kk-star-ratings kksr-valign-top kksr-align-left " data-payload="{"align":"left","id":"375538","slug":"default","valign":"top","reference":"auto","count":"1","readonly":"","score":"5","best":"5","gap":"5","greet":"Rate this post","legend":"5\/5 - (1 vote)","size":"24","width":"142.5","_legend":"{score}\/{best} - ({count} {votes})"}">
<div class="kksr-stars">
<div class="kksr-stars-inactive">
<div class="kksr-star" data-star="1" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="2" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="3" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="4" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="5" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
</p></div>
<div class="kksr-stars-active" style="width: 142.5px;">
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
</p></div>
</div>
<div class="kksr-legend"> 5/5 – (1 vote) </div>
</div>
<h2>What is Data Cleaning?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="548" height="367" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-214.png" alt="" class="wp-image-375665" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-214.png 548w, https://blog.finxter.com/wp-content/uplo...00x201.png 300w" sizes="(max-width: 548px) 100vw, 548px" /></figure>
</div>
<p>Data cleaning describes the process of turning messy data into clean datasets that can be used for research and data science purposes. For example, tidy data will be in a wide format: every column contains a variable, and every row contains one case. Also, data cleaning means getting rid of corrupt data with very little utility.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f4c4.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Related Tutorial:</strong> <a rel="noreferrer noopener" href="https://blog.finxter.com/data-preparation-in-data-science/" data-type="post" data-id="339885" target="_blank">Data Preparation in Python</a></p>
<h2>How to Clean Unstructured Data in Python?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="457" height="685" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-215.png" alt="" class="wp-image-375667" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-215.png 457w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 457px) 100vw, 457px" /></figure>
</div>
<p>Most data in the real world is messy and unstructured or semi-structured. Working in <a href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478">d</a><a href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478" target="_blank" rel="noreferrer noopener">a</a><a href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478">ta science</a>, most of your time will be spent on cleaning and structuring data.</p>
<p>In research, data from surveys or experiments is mostly already structured into <a href="https://blog.finxter.com/python-excel-styling-your-worksheets/" data-type="post" data-id="37112" target="_blank" rel="noreferrer noopener">Excel</a> or <a href="https://blog.finxter.com/csv-to-pdf-part-1/" data-type="post" data-id="191878" target="_blank" rel="noreferrer noopener">CSV</a> tables.</p>
<p>In companies, data can be stored in databases, Excel files, or distributed all over the company. This data can be in emails, documents, folders, images, and note apps. </p>
<p>This <a rel="noreferrer noopener" href="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" data-type="URL" data-id="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" target="_blank">study</a> shows, that most companies are having problems handling unstructured or semi-structured data, and almost half of them don’t even know where their data is located.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" target="_blank" rel="noopener"><img loading="lazy" width="1024" height="538" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-187-1024x538.png" alt="" class="wp-image-375552" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-187-1024x538.png 1024w, https://blog.finxter.com/wp-content/uplo...00x158.png 300w, https://blog.finxter.com/wp-content/uplo...68x403.png 768w, https://blog.finxter.com/wp-content/uplo...ge-187.png 1276w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" target="_blank" rel="noopener"><img loading="lazy" width="1024" height="259" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-188-1024x259.png" alt="" class="wp-image-375560" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-188-1024x259.png 1024w, https://blog.finxter.com/wp-content/uplo...300x76.png 300w, https://blog.finxter.com/wp-content/uplo...68x194.png 768w, https://blog.finxter.com/wp-content/uplo...ge-188.png 1091w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>
</div>
<p>Unstructured data includes videos, images, and text or speech messages. Unstructured data from the web is mainly acquired by <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank" rel="noreferrer noopener">web scraping</a>.</p>
<p>Semi-structured data is data found in documents, emails, social media posts, and if acquired from the web, it can be in <a rel="noreferrer noopener" href="https://blog.finxter.com/generating-html-documents-in-python/" data-type="post" data-id="310531" target="_blank">HTML</a>, <a href="https://blog.finxter.com/parse-json-data-in-python/" data-type="post" data-id="197286">JSON</a>, or any other web format. </p>
<h2>Is Web Scraping Legal?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="682" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-216-1024x682.png" alt="" class="wp-image-375668" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-216-1024x682.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-216.png 1028w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f6d1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Important</strong>: Even though web scraping is possible does not mean it is always legal! </p>
<p>If the data is publicly available and not copyrighted, it is mostly safe to scrape. But also pay attention to data privacy laws and do not scrape personal data. </p>
<p>Scraping data from social media websites, for example, is mostly illegal, as it is not publicly available without logging in and contains personal data. </p>
<p>There are also many services to get data via an API. To be safe, refer to <a rel="noreferrer noopener" href="https://www.scraperapi.com/featured/is-web-scraping-legal/" data-type="URL" data-id="https://www.scraperapi.com/featured/is-web-scraping-legal/" target="_blank">this guide</a> about what is legal when web scraping:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="467" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-213-1024x467.png" alt="" class="wp-image-375646" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-213-1024x467.png 1024w, https://blog.finxter.com/wp-content/uplo...00x137.png 300w, https://blog.finxter.com/wp-content/uplo...68x351.png 768w, https://blog.finxter.com/wp-content/uplo...ge-213.png 1216w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption><a href="https://www.scraperapi.com/featured/is-web-scraping-legal/" data-type="URL" data-id="https://www.scraperapi.com/featured/is-web-scraping-legal/" target="_blank" rel="noreferrer noopener">Source</a></figcaption></figure>
</div>
<h2>What is the Goal of Data Cleaning?</h2>
<p>The goal of data cleaning and cleaning unstructured or semi-structured data is to create tidy data with which you can work. Tidy data will be in a wide format: every column contains a variable, and every row contains one case.</p>
<p>To demonstrate both perspectives, this article is divided into two parts: </p>
<ul>
<li>First, we will scrape, load, and wrangle some semi-structured data from the web.</li>
<li>Second, we will clean this data. This second step is also valid for structured data as it is about finding missing, outliers and duplicates.</li>
</ul>
<p>I recommend doing an <a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" data-type="post" data-id="335731" target="_blank" rel="noreferrer noopener">exploratory data analysis</a> before or during cleaning data to get a good feeling of the data you have. You can easily combine exploring and cleaning.</p>
<figure class="wp-block-embed is-type-wp-embed is-provider-finxter wp-block-embed-finxter">
<div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="rxeMafg5VW"><p><a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/">Easy Exploratory Data Analysis (EDA) in Python with Visualization</a></p></blockquote>
<p><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="“Easy Exploratory Data Analysis (EDA) in Python with Visualization” — Finxter" src="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/embed/#?secret=adYXatEd60#?secret=rxeMafg5VW" data-secret="rxeMafg5VW" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div>
</figure>
<h2>Python HTML scraping</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="575" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-217-1024x575.png" alt="" class="wp-image-375677" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-217-1024x575.png 1024w, https://blog.finxter.com/wp-content/uplo...00x168.png 300w, https://blog.finxter.com/wp-content/uplo...68x431.png 768w, https://blog.finxter.com/wp-content/uplo...ge-217.png 1220w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>First, we will scrape a table from Wikipedia of the largest cities in the world. Scraping from Wikipedia is legal because</p>
<ol>
<li>The data is publicly available </li>
<li>The data works under a creative commons deed, which means the content is free to copy, share and adapt</li>
</ol>
<p>First, we load our packages: </p>
<ul>
<li><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank">pandas</a> as our go-to library for data wrangling and analysis, </li>
<li>the standard library for <a href="https://blog.finxter.com/python-requests-library/" data-type="post" data-id="37796" target="_blank" rel="noreferrer noopener">HTTP requests</a>, and </li>
<li><a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank" rel="noreferrer noopener">BeautifulSoup</a> for easily reading HTML data.</li>
</ul>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd # pandas for data wrangling, cleaning, and analysis
import requests # for http requests
from bs4 import BeautifulSoup # html reading</pre>
<p>The wiki page that holds the table we are looking for can be found here: </p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/List_of_largest_cities">https://en.wikipedia.org/wiki/List_of_largest_cities</a> </li>
</ul>
<p>We pass this URL into <a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-library-2/" data-type="post" data-id="37804" target="_blank">requests</a>. The table in the HTML page is within <code><table class></code> , and the tables in wiki pages are called wiki tables. We can check this by looking at the HTML page in our browser or the HTML text file later to confirm we pulled the right table. </p>
<p>With <code>requests.get(url).text</code>, we pull the HTML from the page. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">url='https://en.wikipedia.org/wiki/List_of_largest_cities'
url_response=requests.get(url).text</pre>
<p>BeautifulSoup will pull the data table from the HTML file and save us time. We will pass the <code>url_response</code> from our request into the <code>html.parser</code>. With <code>soup.find()</code> we can tell it to look exactly for the <code>wikitable</code>. The output also tells us the name of the table.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">soup = BeautifulSoup(url_response.text, 'html.parser')
html_listofcities=soup.find('table',{'class':'wikitable'})
html_listofcities
</pre>
<p>Output: </p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="264" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-189-1024x264.png" alt="" class="wp-image-375571" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-189-1024x264.png 1024w, https://blog.finxter.com/wp-content/uplo...300x77.png 300w, https://blog.finxter.com/wp-content/uplo...68x198.png 768w, https://blog.finxter.com/wp-content/uplo...ge-189.png 1391w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Then we use pandas to <a href="https://blog.finxter.com/python-input-output-html/" data-type="post" data-id="53450" target="_blank" rel="noreferrer noopener">read the HTML</a> file and turn it into a pandas data frame, just like we would load any other data into Python.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df=pd.read_html(str(listofcities))
cities=pd.DataFrame(df[0])
print(cities.head())
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="642" height="453" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-190.png" alt="" class="wp-image-375572" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-190.png 642w, https://blog.finxter.com/wp-content/uplo...00x212.png 300w" sizes="(max-width: 642px) 100vw, 642px" /></figure>
</div>
<p>The data looks messy. Now we get to clean!</p>
<h2>Python HTML Cleaning</h2>
<p>Let us inspect our newfound data:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Cities</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="444" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-191-1024x444.png" alt="" class="wp-image-375573" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-191-1024x444.png 1024w, https://blog.finxter.com/wp-content/uplo...00x130.png 300w, https://blog.finxter.com/wp-content/uplo...68x333.png 768w, https://blog.finxter.com/wp-content/uplo...ge-191.png 1378w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>I will drop the columns of the metropolitan area and the urban area because I am just interested in the population of the actual city. This can be done in several ways. </p>
<p>Here are two:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.drop(cities.columns[[7, 8, 9, 10, 11, 12]], axis=1, inplace=True)
# or
cities= cities.drop(cities.iloc[:,7:], axis = 1)
</pre>
<p>Then we will drop the first header column as it does not contain any useful information and rename the remaining header column.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Dropping the first index column
cities.columns = cities.columns.droplevel(0) </pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="1010" height="430" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-192.png" alt="" class="wp-image-375575" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-192.png 1010w, https://blog.finxter.com/wp-content/uplo...00x128.png 300w, https://blog.finxter.com/wp-content/uplo...68x327.png 768w" sizes="(max-width: 1010px) 100vw, 1010px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># rename the header column
cities.columns = ['city', 'country', 'UN_2018_population_estimates', 'city_definition', 'city_population', 'city_area_km2', ' city_density/km2']</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="414" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-193-1024x414.png" alt="" class="wp-image-375576" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-193-1024x414.png 1024w, https://blog.finxter.com/wp-content/uplo...00x121.png 300w, https://blog.finxter.com/wp-content/uplo...68x310.png 768w, https://blog.finxter.com/wp-content/uplo...ge-193.png 1054w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The heading looks clean. Now we explore the dataset to find information to clean.</p>
<p>With <code>df.info()</code> and <code><a href="https://blog.finxter.com/pandas-dataframe-describe-method/" data-type="post" data-id="343482" target="_blank" rel="noreferrer noopener">df.describe()</a></code> we get a quick overview of the data we scraped. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.info()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="503" height="283" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-194.png" alt="" class="wp-image-375578" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-194.png 503w, https://blog.finxter.com/wp-content/uplo...00x169.png 300w" sizes="(max-width: 503px) 100vw, 503px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.describe()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="526" height="293" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-195.png" alt="" class="wp-image-375579" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-195.png 526w, https://blog.finxter.com/wp-content/uplo...00x167.png 300w" sizes="(max-width: 526px) 100vw, 526px" /></figure>
</div>
<p>It is immediately clear that the <code>city_density/km2</code> is not a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-float-function/" data-type="post" data-id="22782" target="_blank">float</a> even though it is supposed to be numerical. </p>
<p>Inspecting the data frame, you might have already noticed that the columns contain numbers following numbers in brackets, like <code>[12]</code>. This turns this data into an object, so we will have to get rid of this.</p>
<p>There are different ways to <a href="https://blog.finxter.com/how-to-remove-everything-after-the-last-character-in-a-string/" data-type="post" data-id="22625" target="_blank" rel="noreferrer noopener">remove characters from a string in Python</a>. We could just remove the last three characters of each string. </p>
<p>However, this would not work if some of our data points do not have the brackets at the end or more than that. So we’ll use the <a rel="noreferrer noopener" href="https://blog.finxter.com/introduction-to-slicing-in-python/" data-type="post" data-id="731" target="_blank">slicing</a> method <a rel="noreferrer noopener" href="https://blog.finxter.com/python-string-partition/" data-type="URL" data-id="https://blog.finxter.com/python-string-partition/" target="_blank"><code>str.partition()</code></a> to cut the brackets from our numbers. </p>
<p>First, we make sure our object type is a string that we can work string operations on. Then we apply the <code>str.partition()</code> method and advise the function to cut off at the first bracket <code>[</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities[' city_density/km2'] = cities['city_density/km2'].astype('string')
city_density_str = cities['city_density/km2'].str.partition('[')
print(city_density_str)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="227" height="431" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-196.png" alt="" class="wp-image-375580" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-196.png 227w, https://blog.finxter.com/wp-content/uplo...58x300.png 158w" sizes="(max-width: 227px) 100vw, 227px" /></figure>
</div>
<p>What we want is the first column, so we pick this one [0] and write it back on a variable. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_density_str_col_1 = city_density_str[0]
print(city_density_str_col_1)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="296" height="236" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-197.png" alt="" class="wp-image-375581"/></figure>
</div>
<p>The commas in the variable will prevent us from converting the string into a float, so we’ll remove the comma with <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-string-replace-2/" data-type="post" data-id="26083" target="_blank">str.replace()</a></code> before turning the string to a float with <code><a href="https://blog.finxter.com/python-string-to-float/" data-type="URL" data-id="https://blog.finxter.com/python-string-to-float/">s.astype('float')</a></code> and assigning it back to our data frame.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_density_str_col_1 = city_density_str_col_1.str.replace(',','')
cities['city_density/km2'] = city_density_str_col_1.astype('float')
print(cities['city_density/km2'])
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="425" height="243" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-198.png" alt="" class="wp-image-375585" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-198.png 425w, https://blog.finxter.com/wp-content/uplo...00x172.png 300w" sizes="(max-width: 425px) 100vw, 425px" /></figure>
</div>
<p>The variable now shows up when we look at <code><a href="https://blog.finxter.com/pandas-dataframe-describe-method/" data-type="post" data-id="343482" target="_blank" rel="noreferrer noopener">df.describe()</a></code> and we’ll want the results rounded for better readability:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.describe().round(1)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="645" height="303" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-199.png" alt="" class="wp-image-375587" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-199.png 645w, https://blog.finxter.com/wp-content/uplo...00x141.png 300w" sizes="(max-width: 645px) 100vw, 645px" /></figure>
</div>
<h2>Cleaning Structured Data in Python</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="577" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-218-1024x577.png" alt="" class="wp-image-375680" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-218-1024x577.png 1024w, https://blog.finxter.com/wp-content/uplo...00x169.png 300w, https://blog.finxter.com/wp-content/uplo...68x433.png 768w, https://blog.finxter.com/wp-content/uplo...ge-218.png 1216w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Following the cleaning of the scraped data we can now use it like a structured data frame with data we collected or downloaded. </p>
<p>This also can be cleaned of <a href="https://blog.finxter.com/pandas-nan/" data-type="post" data-id="16311" target="_blank" rel="noreferrer noopener">missing data</a>, <a href="https://blog.finxter.com/how-to-find-outliers-in-python-easily/" data-type="post" data-id="2998" target="_blank" rel="noreferrer noopener">outliers</a> and <a href="https://blog.finxter.com/how-to-remove-duplicates-from-a-python-list/" data-type="post" data-id="7686" target="_blank" rel="noreferrer noopener">duplicates</a> but does not always need data wrangling. However, with a data frame with many strings the cleaning process also often involves a lot of <a href="https://blog.finxter.com/python-string-methods/" data-type="post" data-id="25974" target="_blank" rel="noreferrer noopener">string manipulation</a>.</p>
<p><strong>Important note:</strong></p>
<p class="has-global-color-8-background-color has-background">If you want to apply <a href="https://blog.finxter.com/cheat-sheet-6-pillar-machine-learning-algorithms/" data-type="post" data-id="2613" target="_blank" rel="noreferrer noopener">machine learning algorithms</a> to your data, do split your dataset before feature engineering and data transformation as this can create data leakage!</p>
<p>Dropping duplicates is <a href="https://blog.finxter.com/pandas-dataframe-drop_duplicates-method/" data-type="post" data-id="343624" target="_blank" rel="noreferrer noopener">easy</a>: </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities = cities.drop_duplicates()</pre>
<p>There didn’t seem to be duplicates in our <code>df</code>, as the size remained the same.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="184" height="43" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-200.png" alt="" class="wp-image-375589"/></figure>
</div>
<p>When dealing with missing values, we must decide how to handle them based on our data. </p>
<p>We can either</p>
<ul>
<li>Drop missing values</li>
<li>Replace or impute the values</li>
<li>Leave missing values in the dataset </li>
<li>Transform the information that they’re missing into a new variable</li>
</ul>
<p>First, we inspect our missing data. The function <code><a href="https://blog.finxter.com/pandas-dataframe-isna-and-isnull-method/" data-type="post" data-id="343839" target="_blank" rel="noreferrer noopener">df.isnull()</a></code> is a boolean function, that tells us for the whole data frame if data is missing or not. </p>
<p>We can sum it up to determine, how many values are missing in each column.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.isnull().sum()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="287" height="169" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-201.png" alt="" class="wp-image-375594"/></figure>
</div>
<p>We can drop rows with missing values completely. </p>
<p>This will cause us to lose useful information in other columns. But as the first row is completely empty anyway, we can drop this one. </p>
<p>The <code>df.dropna()</code> function has useful features that help us pick what missing data we want to remove. So, I just want to remove the one row, or all of them if there are more, with all missing values.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities = cities.dropna(how='all')</pre>
<p>This will look like this:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="417" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-202-1024x417.png" alt="" class="wp-image-375595" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-202-1024x417.png 1024w, https://blog.finxter.com/wp-content/uplo...00x122.png 300w, https://blog.finxter.com/wp-content/uplo...68x313.png 768w, https://blog.finxter.com/wp-content/uplo...ge-202.png 1082w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>What is left are the missing values for 8 cities for population, area and density. We will replace those. </p>
<p>Of course, you can look up the data on Wikipedia and reinsert them. For the sake of the exercise and because most of the time it is not possible to look up missing data, we will not do this.</p>
<p>The dataset now contains the data of the 73 biggest cities in the world, using the average of these to impute the missing values in the other 8 is the only and closest guess we have. This does not create much more information but keeps us from losing other information from these 8 cities. </p>
<p>The alternative option would be to drop those 8 cities completely.</p>
<p>So, we’ll replace the missing values in the area column with the average area size of all the other cities. First, we create the mean of the city area sizes, then we fill the missing values in the column with this value. </p>
<p>Pandas has the right function for this: <code><a href="https://blog.finxter.com/pandas-dataframe-fillna-method/" data-type="post" data-id="343812" target="_blank" rel="noreferrer noopener">df.fillna()</a></code></p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_area_mean = cities['city_area_km2'].mean()
cities['city_area_km2'] = cities['city_area_km2'].fillna(value=city_area_mean)
cities.isnull().sum()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="283" height="166" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-203.png" alt="" class="wp-image-375597"/></figure>
</div>
<p>The output shows that we now have replaced and eliminated several missing values.</p>
<p>For the population, we luckily have another column that shows the population estimates of the UN in 2018 for each city. </p>
<p>So, we can use these to impute the missing population data, as it is as close as we can get to replacing them accurately. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities['city_population'] = cities['city_population'].fillna(cities['UN_2018_population_estimates'])
cities['city_population'].describe().round(1)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="322" height="175" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-204.png" alt="" class="wp-image-375599" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-204.png 322w, https://blog.finxter.com/wp-content/uplo...00x163.png 300w" sizes="(max-width: 322px) 100vw, 322px" /></figure>
</div>
<p>Now we still have missing values in the city density column. This one we can calculate now by dividing the population by the area. </p>
<p>So, we create a new variable to calculate the density with our new imputed data. Then we fill the missing values with this calculated density.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_density = cities['city_population']/cities['city_area_km2']
cities['city_density/km2'] = cities['city_density/km2'].fillna(value=city_density)
cities['city_density/km2'].describe().round(1)
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="323" height="189" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-205.png" alt="" class="wp-image-375600" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-205.png 323w, https://blog.finxter.com/wp-content/uplo...00x176.png 300w" sizes="(max-width: 323px) 100vw, 323px" /></figure>
</div>
<p>We can check back our missing values and the description of our dataset.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.isnull().sum()</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="299" height="167" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-206.png" alt="" class="wp-image-375601"/></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.describe().round(1)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="654" height="303" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-207.png" alt="" class="wp-image-375602" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-207.png 654w, https://blog.finxter.com/wp-content/uplo...00x139.png 300w" sizes="(max-width: 654px) 100vw, 654px" /></figure>
</div>
<p>There is still one value missing in our city definition. Let’s have a look at these categories. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities['city_definition'].value_counts()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="301" height="526" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-208.png" alt="" class="wp-image-375603" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-208.png 301w, https://blog.finxter.com/wp-content/uplo...72x300.png 172w" sizes="(max-width: 301px) 100vw, 301px" /></figure>
</div>
<p>As we don’t know if the missing city is a municipality or a capital, we could just replace the missing value with the generic description of “city”, as we know they all are cities. </p>
<p>If you’d want to calculate the differences between these categories, it would be useful to categorize and merge these single entries into bigger categories. </p>
<p>For now, we will just replace the missing value with “city”, as I am more interested in the size of the cities than the category.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities['city_definition'] = cities['city_definition'].fillna('City')
cities.isnull().sum()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="281" height="166" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-209.png" alt="" class="wp-image-375604"/></figure>
</div>
<p>Great! We got rid of all the missing values.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Info</strong>: For many statistical operations, missing values will be dropped by default and don’t create a problem. For machine learning algorithms missing values must be removed before modelling.</p>
<p>We can also create dummy variables (information is missing/ not missing) as the fact that the data is missing might be useful information. This way, the fact that they’re missing can be included in the data analysis process. </p>
<h2>Visualization</h2>
<p>Now we visualize our data and check for outliers with a seaborn scatterplot.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import seaborn as sns
sns.scatterplot(data=cities, x="city_population", y="city_area_km2", size="city_population")
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="525" height="336" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-210.png" alt="" class="wp-image-375605" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-210.png 525w, https://blog.finxter.com/wp-content/uplo...00x192.png 300w" sizes="(max-width: 525px) 100vw, 525px" /></figure>
</div>
<p>The city in the right top corner is clearly an outlier, but not one we would want to remove or equalize as it is not a measurement error. It is just the biggest city (or metropolitan area) in the world! </p>
<p>Let’s find out which one it is with <code><a href="https://blog.finxter.com/pandas-dataframe-sort_values-method/" data-type="post" data-id="343891" target="_blank" rel="noreferrer noopener">df.sort_values()</a></code>, using <code>ascending=False</code> to sort the city population from high to low.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.sort_values('city_population', ascending=False)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="427" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-211-1024x427.png" alt="" class="wp-image-375608" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-211-1024x427.png 1024w, https://blog.finxter.com/wp-content/uplo...00x125.png 300w, https://blog.finxter.com/wp-content/uplo...68x321.png 768w, https://blog.finxter.com/wp-content/uplo...ge-211.png 1054w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The biggest city on earth by size and city population is the municipality Chongqing in China with over 32 million inhabitants!</p>
<p>To visualize our cities more beautifully, we can use a <code>sns.relplot()</code> with color and different sizing.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sns.relplot(x="city_population", y="city_area_km2", hue="country", size="city_population", sizes=(40, 400), alpha=.5, palette="muted", height=6, data=cities)
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="531" height="721" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-212.png" alt="" class="wp-image-375611" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-212.png 531w, https://blog.finxter.com/wp-content/uplo...21x300.png 221w" sizes="(max-width: 531px) 100vw, 531px" /></figure>
</div>
<p>Now keep on cleaning!</p>
</div>
https://www.sickgaming.net/blog/2022/05/...nd-pandas/
<div><div class="kk-star-ratings kksr-valign-top kksr-align-left " data-payload="{"align":"left","id":"375538","slug":"default","valign":"top","reference":"auto","count":"1","readonly":"","score":"5","best":"5","gap":"5","greet":"Rate this post","legend":"5\/5 - (1 vote)","size":"24","width":"142.5","_legend":"{score}\/{best} - ({count} {votes})"}">
<div class="kksr-stars">
<div class="kksr-stars-inactive">
<div class="kksr-star" data-star="1" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="2" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="3" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="4" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" data-star="5" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
</p></div>
<div class="kksr-stars-active" style="width: 142.5px;">
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
<div class="kksr-star" style="padding-right: 5px">
<div class="kksr-icon" style="width: 24px; height: 24px;"></div>
</p></div>
</p></div>
</div>
<div class="kksr-legend"> 5/5 – (1 vote) </div>
</div>
<h2>What is Data Cleaning?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="548" height="367" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-214.png" alt="" class="wp-image-375665" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-214.png 548w, https://blog.finxter.com/wp-content/uplo...00x201.png 300w" sizes="(max-width: 548px) 100vw, 548px" /></figure>
</div>
<p>Data cleaning describes the process of turning messy data into clean datasets that can be used for research and data science purposes. For example, tidy data will be in a wide format: every column contains a variable, and every row contains one case. Also, data cleaning means getting rid of corrupt data with very little utility.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f4c4.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Related Tutorial:</strong> <a rel="noreferrer noopener" href="https://blog.finxter.com/data-preparation-in-data-science/" data-type="post" data-id="339885" target="_blank">Data Preparation in Python</a></p>
<h2>How to Clean Unstructured Data in Python?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="457" height="685" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-215.png" alt="" class="wp-image-375667" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-215.png 457w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 457px) 100vw, 457px" /></figure>
</div>
<p>Most data in the real world is messy and unstructured or semi-structured. Working in <a href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478">d</a><a href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478" target="_blank" rel="noreferrer noopener">a</a><a href="https://blog.finxter.com/data-scientist-income-and-opportunity/" data-type="post" data-id="332478">ta science</a>, most of your time will be spent on cleaning and structuring data.</p>
<p>In research, data from surveys or experiments is mostly already structured into <a href="https://blog.finxter.com/python-excel-styling-your-worksheets/" data-type="post" data-id="37112" target="_blank" rel="noreferrer noopener">Excel</a> or <a href="https://blog.finxter.com/csv-to-pdf-part-1/" data-type="post" data-id="191878" target="_blank" rel="noreferrer noopener">CSV</a> tables.</p>
<p>In companies, data can be stored in databases, Excel files, or distributed all over the company. This data can be in emails, documents, folders, images, and note apps. </p>
<p>This <a rel="noreferrer noopener" href="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" data-type="URL" data-id="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" target="_blank">study</a> shows, that most companies are having problems handling unstructured or semi-structured data, and almost half of them don’t even know where their data is located.</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" target="_blank" rel="noopener"><img loading="lazy" width="1024" height="538" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-187-1024x538.png" alt="" class="wp-image-375552" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-187-1024x538.png 1024w, https://blog.finxter.com/wp-content/uplo...00x158.png 300w, https://blog.finxter.com/wp-content/uplo...68x403.png 768w, https://blog.finxter.com/wp-content/uplo...ge-187.png 1276w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>
</div>
<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://docs.sailpoint.com/pdf/?file=https://docs.sailpoint.com/wp-content/uploads/Dimensional-Research-Unstructured-Data-Report.pdf&_gl=1*90kytj*_ga*MTQ5NTE5MTEwOS4xNjI3MjkzMDA4*_ga_SS72Z4HXJM*MTYyNzI5MzAwNy4xLjAuMTYyNzI5MzAxMy41NA..&_ga=2.51530153.1269593442.1627293008-1495191109.1627293008" target="_blank" rel="noopener"><img loading="lazy" width="1024" height="259" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-188-1024x259.png" alt="" class="wp-image-375560" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-188-1024x259.png 1024w, https://blog.finxter.com/wp-content/uplo...300x76.png 300w, https://blog.finxter.com/wp-content/uplo...68x194.png 768w, https://blog.finxter.com/wp-content/uplo...ge-188.png 1091w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>
</div>
<p>Unstructured data includes videos, images, and text or speech messages. Unstructured data from the web is mainly acquired by <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank" rel="noreferrer noopener">web scraping</a>.</p>
<p>Semi-structured data is data found in documents, emails, social media posts, and if acquired from the web, it can be in <a rel="noreferrer noopener" href="https://blog.finxter.com/generating-html-documents-in-python/" data-type="post" data-id="310531" target="_blank">HTML</a>, <a href="https://blog.finxter.com/parse-json-data-in-python/" data-type="post" data-id="197286">JSON</a>, or any other web format. </p>
<h2>Is Web Scraping Legal?</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="682" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-216-1024x682.png" alt="" class="wp-image-375668" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-216-1024x682.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-216.png 1028w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f6d1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Important</strong>: Even though web scraping is possible does not mean it is always legal! </p>
<p>If the data is publicly available and not copyrighted, it is mostly safe to scrape. But also pay attention to data privacy laws and do not scrape personal data. </p>
<p>Scraping data from social media websites, for example, is mostly illegal, as it is not publicly available without logging in and contains personal data. </p>
<p>There are also many services to get data via an API. To be safe, refer to <a rel="noreferrer noopener" href="https://www.scraperapi.com/featured/is-web-scraping-legal/" data-type="URL" data-id="https://www.scraperapi.com/featured/is-web-scraping-legal/" target="_blank">this guide</a> about what is legal when web scraping:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="467" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-213-1024x467.png" alt="" class="wp-image-375646" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-213-1024x467.png 1024w, https://blog.finxter.com/wp-content/uplo...00x137.png 300w, https://blog.finxter.com/wp-content/uplo...68x351.png 768w, https://blog.finxter.com/wp-content/uplo...ge-213.png 1216w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption><a href="https://www.scraperapi.com/featured/is-web-scraping-legal/" data-type="URL" data-id="https://www.scraperapi.com/featured/is-web-scraping-legal/" target="_blank" rel="noreferrer noopener">Source</a></figcaption></figure>
</div>
<h2>What is the Goal of Data Cleaning?</h2>
<p>The goal of data cleaning and cleaning unstructured or semi-structured data is to create tidy data with which you can work. Tidy data will be in a wide format: every column contains a variable, and every row contains one case.</p>
<p>To demonstrate both perspectives, this article is divided into two parts: </p>
<ul>
<li>First, we will scrape, load, and wrangle some semi-structured data from the web.</li>
<li>Second, we will clean this data. This second step is also valid for structured data as it is about finding missing, outliers and duplicates.</li>
</ul>
<p>I recommend doing an <a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/" data-type="post" data-id="335731" target="_blank" rel="noreferrer noopener">exploratory data analysis</a> before or during cleaning data to get a good feeling of the data you have. You can easily combine exploring and cleaning.</p>
<figure class="wp-block-embed is-type-wp-embed is-provider-finxter wp-block-embed-finxter">
<div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="rxeMafg5VW"><p><a href="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/">Easy Exploratory Data Analysis (EDA) in Python with Visualization</a></p></blockquote>
<p><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="“Easy Exploratory Data Analysis (EDA) in Python with Visualization” — Finxter" src="https://blog.finxter.com/easy-exploratory-data-analysis-eda-in-python-with-visualization/embed/#?secret=adYXatEd60#?secret=rxeMafg5VW" data-secret="rxeMafg5VW" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div>
</figure>
<h2>Python HTML scraping</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="575" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-217-1024x575.png" alt="" class="wp-image-375677" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-217-1024x575.png 1024w, https://blog.finxter.com/wp-content/uplo...00x168.png 300w, https://blog.finxter.com/wp-content/uplo...68x431.png 768w, https://blog.finxter.com/wp-content/uplo...ge-217.png 1220w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>First, we will scrape a table from Wikipedia of the largest cities in the world. Scraping from Wikipedia is legal because</p>
<ol>
<li>The data is publicly available </li>
<li>The data works under a creative commons deed, which means the content is free to copy, share and adapt</li>
</ol>
<p>First, we load our packages: </p>
<ul>
<li><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank">pandas</a> as our go-to library for data wrangling and analysis, </li>
<li>the standard library for <a href="https://blog.finxter.com/python-requests-library/" data-type="post" data-id="37796" target="_blank" rel="noreferrer noopener">HTTP requests</a>, and </li>
<li><a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank" rel="noreferrer noopener">BeautifulSoup</a> for easily reading HTML data.</li>
</ul>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd # pandas for data wrangling, cleaning, and analysis
import requests # for http requests
from bs4 import BeautifulSoup # html reading</pre>
<p>The wiki page that holds the table we are looking for can be found here: </p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/List_of_largest_cities">https://en.wikipedia.org/wiki/List_of_largest_cities</a> </li>
</ul>
<p>We pass this URL into <a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-library-2/" data-type="post" data-id="37804" target="_blank">requests</a>. The table in the HTML page is within <code><table class></code> , and the tables in wiki pages are called wiki tables. We can check this by looking at the HTML page in our browser or the HTML text file later to confirm we pulled the right table. </p>
<p>With <code>requests.get(url).text</code>, we pull the HTML from the page. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">url='https://en.wikipedia.org/wiki/List_of_largest_cities'
url_response=requests.get(url).text</pre>
<p>BeautifulSoup will pull the data table from the HTML file and save us time. We will pass the <code>url_response</code> from our request into the <code>html.parser</code>. With <code>soup.find()</code> we can tell it to look exactly for the <code>wikitable</code>. The output also tells us the name of the table.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">soup = BeautifulSoup(url_response.text, 'html.parser')
html_listofcities=soup.find('table',{'class':'wikitable'})
html_listofcities
</pre>
<p>Output: </p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="264" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-189-1024x264.png" alt="" class="wp-image-375571" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-189-1024x264.png 1024w, https://blog.finxter.com/wp-content/uplo...300x77.png 300w, https://blog.finxter.com/wp-content/uplo...68x198.png 768w, https://blog.finxter.com/wp-content/uplo...ge-189.png 1391w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Then we use pandas to <a href="https://blog.finxter.com/python-input-output-html/" data-type="post" data-id="53450" target="_blank" rel="noreferrer noopener">read the HTML</a> file and turn it into a pandas data frame, just like we would load any other data into Python.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df=pd.read_html(str(listofcities))
cities=pd.DataFrame(df[0])
print(cities.head())
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="642" height="453" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-190.png" alt="" class="wp-image-375572" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-190.png 642w, https://blog.finxter.com/wp-content/uplo...00x212.png 300w" sizes="(max-width: 642px) 100vw, 642px" /></figure>
</div>
<p>The data looks messy. Now we get to clean!</p>
<h2>Python HTML Cleaning</h2>
<p>Let us inspect our newfound data:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">Cities</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="444" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-191-1024x444.png" alt="" class="wp-image-375573" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-191-1024x444.png 1024w, https://blog.finxter.com/wp-content/uplo...00x130.png 300w, https://blog.finxter.com/wp-content/uplo...68x333.png 768w, https://blog.finxter.com/wp-content/uplo...ge-191.png 1378w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>I will drop the columns of the metropolitan area and the urban area because I am just interested in the population of the actual city. This can be done in several ways. </p>
<p>Here are two:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.drop(cities.columns[[7, 8, 9, 10, 11, 12]], axis=1, inplace=True)
# or
cities= cities.drop(cities.iloc[:,7:], axis = 1)
</pre>
<p>Then we will drop the first header column as it does not contain any useful information and rename the remaining header column.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Dropping the first index column
cities.columns = cities.columns.droplevel(0) </pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="1010" height="430" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-192.png" alt="" class="wp-image-375575" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-192.png 1010w, https://blog.finxter.com/wp-content/uplo...00x128.png 300w, https://blog.finxter.com/wp-content/uplo...68x327.png 768w" sizes="(max-width: 1010px) 100vw, 1010px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># rename the header column
cities.columns = ['city', 'country', 'UN_2018_population_estimates', 'city_definition', 'city_population', 'city_area_km2', ' city_density/km2']</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="414" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-193-1024x414.png" alt="" class="wp-image-375576" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-193-1024x414.png 1024w, https://blog.finxter.com/wp-content/uplo...00x121.png 300w, https://blog.finxter.com/wp-content/uplo...68x310.png 768w, https://blog.finxter.com/wp-content/uplo...ge-193.png 1054w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The heading looks clean. Now we explore the dataset to find information to clean.</p>
<p>With <code>df.info()</code> and <code><a href="https://blog.finxter.com/pandas-dataframe-describe-method/" data-type="post" data-id="343482" target="_blank" rel="noreferrer noopener">df.describe()</a></code> we get a quick overview of the data we scraped. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.info()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="503" height="283" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-194.png" alt="" class="wp-image-375578" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-194.png 503w, https://blog.finxter.com/wp-content/uplo...00x169.png 300w" sizes="(max-width: 503px) 100vw, 503px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.describe()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="526" height="293" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-195.png" alt="" class="wp-image-375579" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-195.png 526w, https://blog.finxter.com/wp-content/uplo...00x167.png 300w" sizes="(max-width: 526px) 100vw, 526px" /></figure>
</div>
<p>It is immediately clear that the <code>city_density/km2</code> is not a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-float-function/" data-type="post" data-id="22782" target="_blank">float</a> even though it is supposed to be numerical. </p>
<p>Inspecting the data frame, you might have already noticed that the columns contain numbers following numbers in brackets, like <code>[12]</code>. This turns this data into an object, so we will have to get rid of this.</p>
<p>There are different ways to <a href="https://blog.finxter.com/how-to-remove-everything-after-the-last-character-in-a-string/" data-type="post" data-id="22625" target="_blank" rel="noreferrer noopener">remove characters from a string in Python</a>. We could just remove the last three characters of each string. </p>
<p>However, this would not work if some of our data points do not have the brackets at the end or more than that. So we’ll use the <a rel="noreferrer noopener" href="https://blog.finxter.com/introduction-to-slicing-in-python/" data-type="post" data-id="731" target="_blank">slicing</a> method <a rel="noreferrer noopener" href="https://blog.finxter.com/python-string-partition/" data-type="URL" data-id="https://blog.finxter.com/python-string-partition/" target="_blank"><code>str.partition()</code></a> to cut the brackets from our numbers. </p>
<p>First, we make sure our object type is a string that we can work string operations on. Then we apply the <code>str.partition()</code> method and advise the function to cut off at the first bracket <code>[</code>.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities[' city_density/km2'] = cities['city_density/km2'].astype('string')
city_density_str = cities['city_density/km2'].str.partition('[')
print(city_density_str)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="227" height="431" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-196.png" alt="" class="wp-image-375580" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-196.png 227w, https://blog.finxter.com/wp-content/uplo...58x300.png 158w" sizes="(max-width: 227px) 100vw, 227px" /></figure>
</div>
<p>What we want is the first column, so we pick this one [0] and write it back on a variable. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_density_str_col_1 = city_density_str[0]
print(city_density_str_col_1)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="296" height="236" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-197.png" alt="" class="wp-image-375581"/></figure>
</div>
<p>The commas in the variable will prevent us from converting the string into a float, so we’ll remove the comma with <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-string-replace-2/" data-type="post" data-id="26083" target="_blank">str.replace()</a></code> before turning the string to a float with <code><a href="https://blog.finxter.com/python-string-to-float/" data-type="URL" data-id="https://blog.finxter.com/python-string-to-float/">s.astype('float')</a></code> and assigning it back to our data frame.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_density_str_col_1 = city_density_str_col_1.str.replace(',','')
cities['city_density/km2'] = city_density_str_col_1.astype('float')
print(cities['city_density/km2'])
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="425" height="243" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-198.png" alt="" class="wp-image-375585" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-198.png 425w, https://blog.finxter.com/wp-content/uplo...00x172.png 300w" sizes="(max-width: 425px) 100vw, 425px" /></figure>
</div>
<p>The variable now shows up when we look at <code><a href="https://blog.finxter.com/pandas-dataframe-describe-method/" data-type="post" data-id="343482" target="_blank" rel="noreferrer noopener">df.describe()</a></code> and we’ll want the results rounded for better readability:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.describe().round(1)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="645" height="303" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-199.png" alt="" class="wp-image-375587" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-199.png 645w, https://blog.finxter.com/wp-content/uplo...00x141.png 300w" sizes="(max-width: 645px) 100vw, 645px" /></figure>
</div>
<h2>Cleaning Structured Data in Python</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="577" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-218-1024x577.png" alt="" class="wp-image-375680" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-218-1024x577.png 1024w, https://blog.finxter.com/wp-content/uplo...00x169.png 300w, https://blog.finxter.com/wp-content/uplo...68x433.png 768w, https://blog.finxter.com/wp-content/uplo...ge-218.png 1216w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Following the cleaning of the scraped data we can now use it like a structured data frame with data we collected or downloaded. </p>
<p>This also can be cleaned of <a href="https://blog.finxter.com/pandas-nan/" data-type="post" data-id="16311" target="_blank" rel="noreferrer noopener">missing data</a>, <a href="https://blog.finxter.com/how-to-find-outliers-in-python-easily/" data-type="post" data-id="2998" target="_blank" rel="noreferrer noopener">outliers</a> and <a href="https://blog.finxter.com/how-to-remove-duplicates-from-a-python-list/" data-type="post" data-id="7686" target="_blank" rel="noreferrer noopener">duplicates</a> but does not always need data wrangling. However, with a data frame with many strings the cleaning process also often involves a lot of <a href="https://blog.finxter.com/python-string-methods/" data-type="post" data-id="25974" target="_blank" rel="noreferrer noopener">string manipulation</a>.</p>
<p><strong>Important note:</strong></p>
<p class="has-global-color-8-background-color has-background">If you want to apply <a href="https://blog.finxter.com/cheat-sheet-6-pillar-machine-learning-algorithms/" data-type="post" data-id="2613" target="_blank" rel="noreferrer noopener">machine learning algorithms</a> to your data, do split your dataset before feature engineering and data transformation as this can create data leakage!</p>
<p>Dropping duplicates is <a href="https://blog.finxter.com/pandas-dataframe-drop_duplicates-method/" data-type="post" data-id="343624" target="_blank" rel="noreferrer noopener">easy</a>: </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities = cities.drop_duplicates()</pre>
<p>There didn’t seem to be duplicates in our <code>df</code>, as the size remained the same.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="184" height="43" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-200.png" alt="" class="wp-image-375589"/></figure>
</div>
<p>When dealing with missing values, we must decide how to handle them based on our data. </p>
<p>We can either</p>
<ul>
<li>Drop missing values</li>
<li>Replace or impute the values</li>
<li>Leave missing values in the dataset </li>
<li>Transform the information that they’re missing into a new variable</li>
</ul>
<p>First, we inspect our missing data. The function <code><a href="https://blog.finxter.com/pandas-dataframe-isna-and-isnull-method/" data-type="post" data-id="343839" target="_blank" rel="noreferrer noopener">df.isnull()</a></code> is a boolean function, that tells us for the whole data frame if data is missing or not. </p>
<p>We can sum it up to determine, how many values are missing in each column.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.isnull().sum()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="287" height="169" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-201.png" alt="" class="wp-image-375594"/></figure>
</div>
<p>We can drop rows with missing values completely. </p>
<p>This will cause us to lose useful information in other columns. But as the first row is completely empty anyway, we can drop this one. </p>
<p>The <code>df.dropna()</code> function has useful features that help us pick what missing data we want to remove. So, I just want to remove the one row, or all of them if there are more, with all missing values.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities = cities.dropna(how='all')</pre>
<p>This will look like this:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="417" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-202-1024x417.png" alt="" class="wp-image-375595" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-202-1024x417.png 1024w, https://blog.finxter.com/wp-content/uplo...00x122.png 300w, https://blog.finxter.com/wp-content/uplo...68x313.png 768w, https://blog.finxter.com/wp-content/uplo...ge-202.png 1082w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>What is left are the missing values for 8 cities for population, area and density. We will replace those. </p>
<p>Of course, you can look up the data on Wikipedia and reinsert them. For the sake of the exercise and because most of the time it is not possible to look up missing data, we will not do this.</p>
<p>The dataset now contains the data of the 73 biggest cities in the world, using the average of these to impute the missing values in the other 8 is the only and closest guess we have. This does not create much more information but keeps us from losing other information from these 8 cities. </p>
<p>The alternative option would be to drop those 8 cities completely.</p>
<p>So, we’ll replace the missing values in the area column with the average area size of all the other cities. First, we create the mean of the city area sizes, then we fill the missing values in the column with this value. </p>
<p>Pandas has the right function for this: <code><a href="https://blog.finxter.com/pandas-dataframe-fillna-method/" data-type="post" data-id="343812" target="_blank" rel="noreferrer noopener">df.fillna()</a></code></p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_area_mean = cities['city_area_km2'].mean()
cities['city_area_km2'] = cities['city_area_km2'].fillna(value=city_area_mean)
cities.isnull().sum()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="283" height="166" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-203.png" alt="" class="wp-image-375597"/></figure>
</div>
<p>The output shows that we now have replaced and eliminated several missing values.</p>
<p>For the population, we luckily have another column that shows the population estimates of the UN in 2018 for each city. </p>
<p>So, we can use these to impute the missing population data, as it is as close as we can get to replacing them accurately. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities['city_population'] = cities['city_population'].fillna(cities['UN_2018_population_estimates'])
cities['city_population'].describe().round(1)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="322" height="175" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-204.png" alt="" class="wp-image-375599" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-204.png 322w, https://blog.finxter.com/wp-content/uplo...00x163.png 300w" sizes="(max-width: 322px) 100vw, 322px" /></figure>
</div>
<p>Now we still have missing values in the city density column. This one we can calculate now by dividing the population by the area. </p>
<p>So, we create a new variable to calculate the density with our new imputed data. Then we fill the missing values with this calculated density.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">city_density = cities['city_population']/cities['city_area_km2']
cities['city_density/km2'] = cities['city_density/km2'].fillna(value=city_density)
cities['city_density/km2'].describe().round(1)
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="323" height="189" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-205.png" alt="" class="wp-image-375600" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-205.png 323w, https://blog.finxter.com/wp-content/uplo...00x176.png 300w" sizes="(max-width: 323px) 100vw, 323px" /></figure>
</div>
<p>We can check back our missing values and the description of our dataset.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.isnull().sum()</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="299" height="167" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-206.png" alt="" class="wp-image-375601"/></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.describe().round(1)</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="654" height="303" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-207.png" alt="" class="wp-image-375602" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-207.png 654w, https://blog.finxter.com/wp-content/uplo...00x139.png 300w" sizes="(max-width: 654px) 100vw, 654px" /></figure>
</div>
<p>There is still one value missing in our city definition. Let’s have a look at these categories. </p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities['city_definition'].value_counts()</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="301" height="526" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-208.png" alt="" class="wp-image-375603" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-208.png 301w, https://blog.finxter.com/wp-content/uplo...72x300.png 172w" sizes="(max-width: 301px) 100vw, 301px" /></figure>
</div>
<p>As we don’t know if the missing city is a municipality or a capital, we could just replace the missing value with the generic description of “city”, as we know they all are cities. </p>
<p>If you’d want to calculate the differences between these categories, it would be useful to categorize and merge these single entries into bigger categories. </p>
<p>For now, we will just replace the missing value with “city”, as I am more interested in the size of the cities than the category.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities['city_definition'] = cities['city_definition'].fillna('City')
cities.isnull().sum()
</pre>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="281" height="166" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-209.png" alt="" class="wp-image-375604"/></figure>
</div>
<p>Great! We got rid of all the missing values.</p>
<p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Info</strong>: For many statistical operations, missing values will be dropped by default and don’t create a problem. For machine learning algorithms missing values must be removed before modelling.</p>
<p>We can also create dummy variables (information is missing/ not missing) as the fact that the data is missing might be useful information. This way, the fact that they’re missing can be included in the data analysis process. </p>
<h2>Visualization</h2>
<p>Now we visualize our data and check for outliers with a seaborn scatterplot.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import seaborn as sns
sns.scatterplot(data=cities, x="city_population", y="city_area_km2", size="city_population")
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="525" height="336" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-210.png" alt="" class="wp-image-375605" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-210.png 525w, https://blog.finxter.com/wp-content/uplo...00x192.png 300w" sizes="(max-width: 525px) 100vw, 525px" /></figure>
</div>
<p>The city in the right top corner is clearly an outlier, but not one we would want to remove or equalize as it is not a measurement error. It is just the biggest city (or metropolitan area) in the world! </p>
<p>Let’s find out which one it is with <code><a href="https://blog.finxter.com/pandas-dataframe-sort_values-method/" data-type="post" data-id="343891" target="_blank" rel="noreferrer noopener">df.sort_values()</a></code>, using <code>ascending=False</code> to sort the city population from high to low.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">cities.sort_values('city_population', ascending=False)</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img loading="lazy" width="1024" height="427" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-211-1024x427.png" alt="" class="wp-image-375608" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-211-1024x427.png 1024w, https://blog.finxter.com/wp-content/uplo...00x125.png 300w, https://blog.finxter.com/wp-content/uplo...68x321.png 768w, https://blog.finxter.com/wp-content/uplo...ge-211.png 1054w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>The biggest city on earth by size and city population is the municipality Chongqing in China with over 32 million inhabitants!</p>
<p>To visualize our cities more beautifully, we can use a <code>sns.relplot()</code> with color and different sizing.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">sns.relplot(x="city_population", y="city_area_km2", hue="country", size="city_population", sizes=(40, 400), alpha=.5, palette="muted", height=6, data=cities)
</pre>
<p>Output:</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" width="531" height="721" src="https://blog.finxter.com/wp-content/uploads/2022/05/image-212.png" alt="" class="wp-image-375611" srcset="https://blog.finxter.com/wp-content/uploads/2022/05/image-212.png 531w, https://blog.finxter.com/wp-content/uplo...21x300.png 221w" sizes="(max-width: 531px) 100vw, 531px" /></figure>
</div>
<p>Now keep on cleaning!</p>
</div>
https://www.sickgaming.net/blog/2022/05/...nd-pandas/