01-26-2023, 06:39 AM
Basketball Statistics – Page Scraping Using Python and BeautifulSoup
<div>
<p>In this blog series, powerful Python libraries are leveraged to help uncover some hidden statistical truths in basketball. The first step in any data-driven approach is to identify and collect the data needed.</p>
<p>Luckily for us, <a href="https://www.basketball-reference.com/">Basketball-Reference.com</a> hosts pages of basketball data that can be easily scraped. The processes of this walkthrough can be easily applied to any number of their pages, but for this case, we plan on scraping seasonal statistics of multiple rookie classes.</p>
<h2>Project Overview</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="576" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-293.png" alt="" class="wp-image-1081116" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-293.png 576w, https://blog.finxter.com/wp-content/uplo...95x300.png 195w" sizes="(max-width: 576px) 100vw, 576px" /></figure>
</div>
<p><strong>The Objectives:</strong></p>
<ol type="1" start="1">
<li>Identify the Data Source</li>
<li>Download the Page</li>
<li>Identify Important Page Elements</li>
<li>Pre-Clean and Extract</li>
<li>Archive</li>
</ol>
<p><strong>The Tools:</strong></p>
<ul>
<li>Requests</li>
<li>Beautiful Soup</li>
<li>Pandas</li>
</ul>
<p>Though we will inevitably be working with many specialized libraries throughout this project, the above packages will suffice for now.</p>
<h2>Identifying the Data Source</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-294.png" alt="" class="wp-image-1081117" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-294.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<p><a rel="noreferrer noopener" href="https://www.basketball-reference.com/" target="_blank">Basketball-Reference.com</a> hosts hundreds of curated pages on basketball statistics that range from seasonal averages of typical box score categories like points, rebounds, and shooting percentages, all the way down to the play-by-play action of each game played in the last 20 or so years. One can easily lose their way in this statistical tsunami if there isn’t a clear goal set on what exactly to look for.</p>
<p>The goal here in this post is simple: get rookie data that will help in assessing a young player’s true value and potential.</p>
<p>The following link is one such page. It lists all the relevant statistics of rookies in a particular season.</p>
<p><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Link</strong>: <a href="https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html" target="_blank" rel="noreferrer noopener">https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html</a></p>
<p>In order to accumulate enough data to make solid statistical inferences on players, one year of data won’t cut it. There need to be dozens of years’ worth of data collected to help filter through the noise and come to a conclusion on a player’s future potential.</p>
<p>If an action can be manually repeated, it makes itself a great candidate for automation. In this case, the number in the URL above corresponds to the respective year of that rookie class. Powered by that knowledge, let’s start putting together our first lines of code.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
import pandas as pd
from bs4 import BeautifulSoup

years = list(range(1990, 2017))
url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"</pre>
<p>In creating the two variables referenced above, our thought process is as follows.</p>
<ol type="1" start="1">
<li>The appropriate packages are imported</li>
<li><code>url_base</code> serves to store the pre-formatted string variable of the target URL</li>
<li>The <code>years</code> list variable specifies the range of the desired years: 1990 through 2016, since <code>range()</code> excludes its upper bound of 2017</li>
</ol>
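<p>As a quick sanity check on the two variables described above, here is a minimal sketch that expands the template into the full list of URLs before anything is downloaded. The <code>urls</code> name is introduced here purely for illustration and is not part of the original code.</p>

```python
# Same template and year range as in the walkthrough.
url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"
years = list(range(1990, 2017))  # range() stops before 2017

# Build every URL up front so the list can be inspected before downloading.
# "urls" is an illustrative name, not part of the original code.
urls = [url_base.format(year) for year in years]

print(len(urls))  # number of rookie classes covered
print(urls[0])    # first season in the range (1990)
print(urls[-1])   # last season in the range (2016)
```

<p>Printing the first and last entries is a cheap way to confirm the range boundaries before sending a single request.</p>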
<h2>Downloading the Page Data</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="623" height="872" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-295.png" alt="" class="wp-image-1081118" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-295.png 623w, https://blog.finxter.com/wp-content/uplo...14x300.png 214w" sizes="(max-width: 623px) 100vw, 623px" /></figure>
</div>
<p>In scraping web pages, it’s imperative to remove as much overhead as possible. Since the site serves all of its information directly in the page’s HTML, each page can be easily downloaded and locally stored in its entirety.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># iterates through each year and downloads page into an HTML file
for year in years:
    url = url_base.format(year)
    data = requests.get(url)

    # page is saved as an HTML file and placed in the Rookies folder
    with open("notebooks/Rookies/{}.html".format(year), "w+") as f:
        f.write(data.text)</pre>
<p>The <code>for</code> loop iterates through the <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a> variable <code>years</code>.</p>
<p>The curly braces in the <code>url_base</code> string act as a placeholder: <code>format()</code> substitutes the currently iterated year for them.</p>
<p>For example, in its first iteration, the <code>url</code> value will be <code>'https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html'</code>. </p>
<p>On its second iteration, the subsequent year would be referenced instead (<a href="https://www.basketball-reference.com/leagues/NBA_1991_rookies-season-stats.html" target="_blank" rel="noreferrer noopener">https://www.basketball-reference.com/leagues/NBA_1991_rookies-season-stats.html</a>)</p>
<p>The <code>data</code> variable holds the response returned by <code>requests.get()</code> for the currently iterated <code>url</code> string value.</p>
<p>In other words, <code>requests.get()</code> uses the newly formatted URL string to retrieve the page in question.</p>
<p>The subsequent <code>with open()</code> opens a file for writing (<code>w+</code>, which also permits reading and overwrites any existing file), writes the page’s HTML (<code>data.text</code>) into it, and thereby stores each page locally as a new HTML file.</p>
<p>Why download the page and store it locally?</p>
<p>To avoid a common growing pain in site scraping, we store these pages as local HTML files. </p>
<p>See, when you visit a page, the server hosting it has to honor your request and send the appropriate data back to your browser. But having one specific client asking for the same information over and over puts undue strain on the server. </p>
<p>The server admin is well within their rights to block these persistent requests for the sake of being able to optimally provide this service to others online.</p>
<p>By downloading these HTML files on your local machine, you avoid two things:</p>
<ol type="1" start="1">
<li>Having to wait longer than usual to collect the same data</li>
<li>Being blocked from visiting the page, halting data collection altogether</li>
</ol>
<h2>Identifying Important Page Elements</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-296.png" alt="" class="wp-image-1081119" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-296.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<p>To scrape data elements of these recently downloaded pages using Python, there needs to be a means to understand what properties these HTML elements have. In order to identify these properties, we need to inspect the page itself.</p>
<h3>How to Inspect</h3>
<p>We’ll need to dive deeper into the inner workings of this document, but I promise I won’t make this an exercise on learning HTML.</p>
<p>If you know how to inspect HTML objects, feel free to jump ahead. Otherwise, please follow along on how to inspect page elements.</p>
<h4>Option 1: Developer Tools</h4>
<ol type="1" start="1">
<li>Click on the three vertical dots on Chrome’s top menu bar</li>
<li>Choose “More tools”</li>
<li>Select Developer tools.</li>
</ol>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="604" height="508" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-288.png" alt="" class="wp-image-1081092" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-288.png 604w, https://blog.finxter.com/wp-content/uplo...00x252.png 300w" sizes="(max-width: 604px) 100vw, 604px" /></figure>
</div>
<h4>Option 2: Menu Select</h4>
<ol type="1" start="1">
<li>Right-click on the web page</li>
<li>Choose “Inspect” to access the Developer tools panel</li>
</ol>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="489" height="493" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-289.png" alt="" class="wp-image-1081093" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-289.png 489w, https://blog.finxter.com/wp-content/uplo...98x300.png 298w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w" sizes="(max-width: 489px) 100vw, 489px" /></figure>
</div>
<h3>Inspecting the Page</h3>
<p>Seeing that all of these pages are locally stored, we can choose to view them by either going into the file system to open them in our desired browser, or, we can continue to build our code by implementing the following snippet of code.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open("notebooks/Rookies/2000.html") as f:
    page = f.read()</pre>
<p>Below is the loaded page with <strong>Developer Tools</strong> docked to the right. Notice how hovering the mouse cursor over the HTML line containing the ID <code>rookies</code> highlights the table element on the page.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="624" height="217" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-290.png" alt="" class="wp-image-1081095" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-290.png 624w, https://blog.finxter.com/wp-content/uplo...00x104.png 300w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<p>All the desired data of this page is housed in that table element. Before hastily sucking up all of this data as is, now is the best time to consider whether everything on this table is worth collecting.</p>
<h2>Pre-Clean</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-297.png" alt="" class="wp-image-1081120" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-297.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<p>Pre-cleaning might not be a frequent word in your vocabulary, but for those of you who see yourselves scraping data regularly, it <em>should</em> be. If you want to avoid the frustration of wasted hours of progress on a data collection project, it’s best to first separate the wheat from the chaff.</p>
<p>For instance, take note of the three elements boxed in red.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="624" height="397" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-291.png" alt="" class="wp-image-1081096" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-291.png 624w, https://blog.finxter.com/wp-content/uplo...00x191.png 300w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<p>One row serves as the “main” table header. The other two rows are duplicate instances of the same artifacts found at the top. This pattern repeats every 20th row.</p>
<p>Upon further inspection of these elements, it’s revealed that all of these rows have the same <code>tr</code> (table row) HTML tag. What distinguishes them from one another, and from the data rows, is their class names.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="624" height="319" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-292.png" alt="" class="wp-image-1081097" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-292.png 624w, https://blog.finxter.com/wp-content/uplo...00x153.png 300w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<ol>
<li><strong>Main Header Row</strong><br />a. <code>Class = over_header</code></li>
<li><strong>Repeat Header Rows</strong><br />a. <code>Class = over_header thead</code></li>
<li><strong>Statistics Category Row</strong><br />a. <code>Class = thead</code></li>
</ol>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># array to house list of dataframes
dfs = []

# unnecessary table rows to be removed
classes = ["over_header", "over_header thead", "thead"]
</pre>
<ol type="1" start="1">
<li><code>dfs</code> will be used later on to house several data frames</li>
<li>The <code>classes</code> list will hold the class names of all the unwanted table row elements.</li>
</ol>
<p>Knowing that these elements provide no statistical value, rather than simply “skipping over” them in our parse, they should instead be completely omitted. That’s to say, permanently removed from any future considerations.</p>
<p>The <code>decompose</code> method serves to remove unwanted elements from a page. As per the <a rel="noreferrer noopener" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose" target="_blank">official Beautiful Soup documentation</a>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">decompose()</pre>
<p><code>Tag.decompose()</code> removes a tag from the tree, then <em>completely destroys it and its contents</em>.</p>
<p>Below is a snippet of code where the <code>decompose</code> method is applied inside nested <code>for</code> loops.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># for loop to iterate through the years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")

    # for loop cleans up unnecessary table
    # headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()
</pre>
<ol type="1" start="1">
<li>The first <code>for</code> loop iterates through the values of our <code>years</code> list</li>
<li>The <code>with</code> statement opens each locally stored HTML file and reads its contents into the <code>page</code> variable</li>
<li>A <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank">BeautifulSoup</a> object is created by passing in both the <code>page</code> string and the name of the parser to use, <code>html.parser</code></li>
<li>The second <code>for</code> loop iterates through the values in the <code>classes</code> list</li>
<li>The third <code>for</code> loop uses Beautiful Soup’s <code>find_all</code> method to identify <code>tr</code> elements whose class name matches the current entry in <code>classes</code></li>
<li><code>tr.decompose()</code> removes each of the identified table row elements from the parsed page entirely</li>
</ol>
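<p>To see the effect of these loops in isolation, here is a small sketch that runs the same logic on a hand-written table instead of a downloaded page. The sample markup and the player names are invented for illustration; only the class names and the loop structure come from the walkthrough.</p>

```python
from bs4 import BeautifulSoup

# Invented stand-in for the rookies table, with header rows mixed into the data.
sample = """
<table id="rookies">
  <tr class="over_header"><th>Totals</th></tr>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Player A</td></tr>
  <tr class="over_header thead"><th>Totals</th></tr>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Player B</td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
classes = ["over_header", "over_header thead", "thead"]

# Same nested loops as in the walkthrough: find each unwanted row and destroy it.
for class_name in classes:
    for tr in soup.find_all("tr", {"class": class_name}):
        tr.decompose()

# Only the rows without a header class are left in the tree.
remaining = [tr.get_text(strip=True) for tr in soup.find_all("tr")]
print(remaining)
```

<p>Because the rows are removed from the tree itself, any later step that reads the table only ever sees the data rows.</p>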
<p>Let’s look to build on this by extracting the data we <em>do</em> want.</p>
<h2>Extracting the Data</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="623" height="412" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-298.png" alt="" class="wp-image-1081121" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-298.png 623w, https://blog.finxter.com/wp-content/uplo...00x198.png 300w" sizes="(max-width: 623px) 100vw, 623px" /></figure>
</div>
<p>We can finally start working on the part of the code that actually extracts data from the table. </p>
<p>Remember that the table holding all of the relevant data has the unique HTML ID <code>rookies</code>. The following additions to our code will serve to parse the data of this table.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># the years we wish to parse for
years = list(range(1990, 2017))

# array to house list of dataframes
dfs = []

# unnecessary table headers to be removed
classes = ["over_header", "over_header thead", "thead"]

for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")

    # for loop cleans up unnecessary table headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()

    ### Start Scraping Block ###
    # identifies, scrapes, and loads rookie tables into one dataframe
    rookie_table = soup.find(id="rookies")
    rookies = pd.read_html(str(rookie_table))[0]
    rookies["Year"] = year
    dfs.append(rookies)

# new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)
</pre>
<p>For what follows <code>### Start Scraping Block ###</code></p>
<ol type="1" start="1">
<li>The <code>rookie_table</code> variable serves to help identify this, and only this table on the page</li>
<li>Seeing that the <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a> package can read HTML tables, the rookie table is loaded into Pandas using the <code>read_html</code> method, passing the <code>rookie_table</code> as a string</li>
<li><code>[0]</code> is tacked on to the end to turn the result from a <a href="https://blog.finxter.com/python-join-list-of-dataframes/" data-type="post" data-id="9780" target="_blank" rel="noreferrer noopener">list of dataframes</a> into a single dataframe</li>
<li>A “<code>Year</code>” column is added to the <code>rookies</code> dataframe</li>
<li><code>dfs.append(rookies)</code> collects the table of every rookie year, in the order they were iterated, into a list of dataframes</li>
<li>The Pandas method <code><a href="https://blog.finxter.com/how-does-pandas-concat-work/" data-type="post" data-id="17172" target="_blank" rel="noreferrer noopener">concat</a></code> is used to combine that list of dataframes into one single dataframe: <code>all_rookies</code></li>
</ol>
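<p>The steps above can be tried on a toy page before running them on the real files. This is a sketch under a few assumptions: the markup and numbers are invented, and the table string is wrapped in <code>StringIO</code>, a small variation on the original code, because newer pandas releases emit a deprecation warning when <code>read_html</code> receives a literal HTML string. Note that <code>read_html</code> also needs a parser backend installed (<code>lxml</code>, or <code>bs4</code> together with <code>html5lib</code>).</p>

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Invented stand-in for one cached page: two tables, only one of them wanted.
page = """
<table id="other"><tr><th>X</th></tr><tr><td>1</td></tr></table>
<table id="rookies">
  <thead><tr><th>Player</th><th>G</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>82</td><td>1200</td></tr>
    <tr><td>Player B</td><td>60</td><td>450</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# Select the one table whose id is "rookies" and ignore everything else.
rookie_table = soup.find(id="rookies")

# read_html always returns a list of dataframes, one per table it finds.
tables = pd.read_html(StringIO(str(rookie_table)))
rookies = tables[0]

# Tag every row with the season it came from (2000 is just an example value).
rookies["Year"] = 2000
print(rookies)
```

<p>Since only the <code>rookies</code> table is passed in, the returned list has exactly one entry, which is why indexing with <code>[0]</code> is safe here.</p>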
<h2>Archiving</h2>
<p>Our final step involves taking all of this useful, clean information and archiving it in an easily readable CSV format. Adding this line to the end of our code (outside of any loops!) makes it easy to come back and reference the data collected.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")
</pre>
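<p>One detail worth knowing about the archiving step: by default, <code>pd.concat</code> keeps each dataframe’s own row labels, and <code>to_csv</code> writes those labels as an extra first column. The sketch below shows an optional variation, not the original code, that renumbers the rows and leaves the index out of the file. The two small dataframes are invented, and the CSV is written to an in-memory buffer rather than to disk.</p>

```python
from io import StringIO

import pandas as pd

# Two tiny invented frames standing in for per-year rookie tables.
dfs = [
    pd.DataFrame({"Player": ["Player A", "Player B"], "Year": 1990}),
    pd.DataFrame({"Player": ["Player C"], "Year": 1991}),
]

# Default concat keeps each frame's own row labels, so they repeat.
stacked = pd.concat(dfs)

# ignore_index=True renumbers the combined rows from zero instead.
renumbered = pd.concat(dfs, ignore_index=True)

# Writing to an in-memory buffer here; a file path works the same way.
buffer = StringIO()
renumbered.to_csv(buffer, index=False)
print(buffer.getvalue())
```

<p>Whether to keep the index is a judgment call; dropping it simply means the CSV reloads without a stray unnamed column.</p>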
<h2>Final Product</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-299.png" alt="" class="wp-image-1081124" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-299.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
import pandas as pd
from bs4 import BeautifulSoup

# the years we wish to parse for
years = list(range(1990, 2017))

# array to house list of dataframes
dfs = []

# unnecessary table headers to be removed
classes = ["over_header", "over_header thead", "thead"]

# loop iterates through years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")

    # second for loop clears unnecessary table headers
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()

    # identifies, scrapes, and loads rookie tables into one dataframe
    table_rookies = soup.find(id="rookies")
    rookies = pd.read_html(str(table_rookies))[0]
    rookies["Year"] = year
    dfs.append(rookies)

# new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)

# dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")</pre>
<h2>Closing</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="623" height="415" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-300.png" alt="" class="wp-image-1081125" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-300.png 623w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w" sizes="(max-width: 623px) 100vw, 623px" /></figure>
</div>
<p>Again, the process followed in this walkthrough should apply to most other pages on <em>Basketball-Reference.com</em>.</p>
<p>There are five simple steps worth taking in each instance.</p>
<ol type="1" start="1">
<li>Identify the Page URL</li>
<li>Download the Page</li>
<li>Identify the Elements</li>
<li>Pre-Clean and Extract</li>
<li>Archive</li>
</ol>
<p>Following these five steps will help guarantee a quick and successful scraping experience. </p>
<p>Next up in this series will be actually using this data to gain insight into future player potential. So be on the lookout for future installments!</p>
<p>We’ll share them here:</p>
</div>
https://www.sickgaming.net/blog/2023/01/...tifulsoup/
<div>
<p>In this blog series, powerful Python libraries are leveraged to help uncover some hidden statistical truths in basketball. The first step in any data-driven approach is to identify and collect the data needed.</p>
<p>Luckily for us, <a href="https://www.basketball-reference.com/">Basketball-Reference.com</a> hosts pages of basketball data that can be easily scraped. The processes of this walkthrough can be easily applied to any number of their pages, but for this case, we plan on scraping seasonal statistics of multiple rookie classes.</p>
<h2>Project Overview</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="576" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-293.png" alt="" class="wp-image-1081116" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-293.png 576w, https://blog.finxter.com/wp-content/uplo...95x300.png 195w" sizes="(max-width: 576px) 100vw, 576px" /></figure>
</div>
<p><strong>The Objectives:</strong></p>
<ol type="1" start="1">
<li>Identify the Data Source</li>
<li>Download the Page</li>
<li>Identify Important Page Elements</li>
<li>Pre-Clean and Extract</li>
<li>Archive</li>
</ol>
<p><strong>The Tools:</strong></p>
<ul>
<li>Requests</li>
<li>Beautiful Soup</li>
<li>Pandas</li>
</ul>
<p>Though we will inevitably be working with many specialized libraries throughout this project, the above packages will suffice for now.</p>
<h2>Identifying the Data Source</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-294.png" alt="" class="wp-image-1081117" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-294.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<p><a rel="noreferrer noopener" href="https://www.basketball-reference.com/" target="_blank">Basketball-Reference.com</a> hosts hundreds of curated pages on basketball statistics that range from seasonal averages of typical box score categories like points, rebounds, and shooting percentages, all the way down to the play-by-play action of each game played in the last 20 or so years. One can easily lose their way in this statistical tsunami if there isn’t a clear goal set on what exactly to look for.</p>
<p>The goal here in this post is simple: get rookie data that will help in assessing a young player’s true value and potential.</p>
<p>The following link is one such page. It lists all the relevant statistics of rookies in a particular season.</p>
<p><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f449.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Link</strong>: <a href="https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html" target="_blank" rel="noreferrer noopener">https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html</a></p>
<p>In order to accumulate enough data to make solid statistical inferences on players, one year of data won’t cut it. There need to be dozens of years’ worth of data collected to help filter through the noise and come to a conclusion on a player’s future potential.</p>
<p>If an action can be manually repeated, it makes itself a great candidate for automation. In this case, the number in the URL above corresponds to the respective year of that rookie class. Powered by that knowledge, let’s start putting together our first lines of code.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
import pandas as pd
from bs4 import BeautifulSoup years = list(range(1990, 2017)) url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"</pre>
<p>In creating the two variables referenced above, our thought process is as follows.</p>
<ol type="1" start="1">
<li>The appropriate packages are imported</li>
<li><code>url_base</code> serves to store the pre-formatted string variable of the target URL</li>
<li>The <code>years</code> list variable specifies the ranged of the desired years, 1990 up to 2017</li>
</ol>
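<p>As a quick sanity check, here is a minimal sketch of what those two variables produce. The <code>urls</code> list is a name introduced only for illustration and is not part of the script itself.</p>

```python
years = list(range(1990, 2017))
url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"

# `urls` is a hypothetical helper list, used here only to preview the result
urls = [url_base.format(year) for year in years]

# range() stops one short of its end value, so the last year is 2016
print(years[0], years[-1], len(years))  # 1990 2016 27
print(urls[0])
```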
<h2>Downloading the Page Data</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="623" height="872" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-295.png" alt="" class="wp-image-1081118" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-295.png 623w, https://blog.finxter.com/wp-content/uplo...14x300.png 214w" sizes="(max-width: 623px) 100vw, 623px" /></figure>
</div>
<p>In scraping web pages, it’s important to remove as much overhead as possible. Since the site serves its statistics directly in each page’s HTML, the page can be easily downloaded and locally stored in its entirety.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># iterates through each year and downloads page into an HTML file
for year in years:
    url = url_base.format(year)
    data = requests.get(url)

    # page is saved as an HTML file and placed in the Rookies folder
    with open("notebooks/Rookies/{}.html".format(year), "w+") as f:
        f.write(data.text)</pre>
<p>The <code>for</code> loop iterates through the <a href="https://blog.finxter.com/python-lists/" data-type="post" data-id="7332" target="_blank" rel="noreferrer noopener">list</a> variable <code>years</code>.</p>
<p>The curly braces in the <code>url_base</code> string act as a placeholder that <code>format()</code> replaces with the currently iterated year.</p>
<p>For example, in its first iteration, the <code>url</code> value will be <code>'https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html'</code>. </p>
<p>On its second iteration, the subsequent year would be referenced instead (<a href="https://www.basketball-reference.com/leagues/NBA_1991_rookies-season-stats.html" target="_blank" rel="noreferrer noopener">https://www.basketball-reference.com/leagues/NBA_1991_rookies-season-stats.html</a>)</p>
<p>The <code>data</code> variable stores the response that <code>requests.get()</code> returns for the currently iterated <code>url</code> string value.</p>
<p>In other words, the newly formatted URL string is what <code>requests</code> uses to retrieve the page in question.</p>
<p>The subsequent <code>with open()</code> block opens a file in write mode (<code>w+</code>, which also permits reading), writes the page’s HTML (<code>data.text</code>) to it, and so stores each page locally as a newly created HTML file.</p>
<p>Why download the page and store it locally?</p>
<p>To avoid a common growing pain in site scraping, we store these pages as local HTML files. </p>
<p>See, when you visit a web page, the server hosting said page has to honor your request and send back the appropriate data to your browser. But having one specific client asking for the same information over and over puts undue strain on the server. </p>
<p>The server admin is well within their rights to block these persistent requests for the sake of being able to optimally provide this service to others online.</p>
<p>By downloading these HTML files on your local machine, you avoid two things:</p>
<ol type="1" start="1">
<li>Having to wait longer than usual to collect the same data</li>
<li>Being blocked from visiting the page, halting data collection altogether</li>
</ol>
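<p>One way to act on this, sketched under a few assumptions: skip any year whose file already exists, and pause between requests. The function name <code>download_pages</code>, the <code>fetch</code> parameter, and the three-second default delay are illustrative choices, not something the site prescribes.</p>

```python
import time
from pathlib import Path


def download_pages(years, url_base, out_dir, fetch, delay=3.0):
    """Download one page per year, skipping files that already exist.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page text, for example: lambda url: requests.get(url).text
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for year in years:
        target = out_dir / "{}.html".format(year)
        if target.exists():
            # already stored locally, so no request is sent
            continue
        target.write_text(fetch(url_base.format(year)), encoding="utf-8")
        downloaded.append(year)
        # pause between requests to go easy on the server
        time.sleep(delay)
    return downloaded
```

<p>Running it a second time over the same folder should send no requests at all, since every file is already in place.</p>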
<h2>Identifying Important Page Elements</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-296.png" alt="" class="wp-image-1081119" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-296.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<p>To scrape data elements of these recently downloaded pages using Python, there needs to be a means to understand what properties these HTML elements have. In order to identify these properties, we need to inspect the page itself.</p>
<h3>How to Inspect</h3>
<p>We’ll need to dive deeper into the inner workings of this document, but I promise I won’t make this an exercise on learning HTML.</p>
<p>If you know how to inspect HTML objects, feel free to jump ahead. Otherwise, please follow along on how to inspect page elements.</p>
<h4>Option 1: Developer Tools</h4>
<ol type="1" start="1">
<li>Click on the three vertical dots on Chrome’s top menu bar</li>
<li>Choose “More tools”</li>
<li>Select “Developer tools”</li>
</ol>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="604" height="508" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-288.png" alt="" class="wp-image-1081092" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-288.png 604w, https://blog.finxter.com/wp-content/uplo...00x252.png 300w" sizes="(max-width: 604px) 100vw, 604px" /></figure>
</div>
<h4>Option 2: Menu Select</h4>
<ol type="1" start="1">
<li>Right-click on the web page</li>
<li>Choose “Inspect” to access the Developer tools panel</li>
</ol>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="489" height="493" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-289.png" alt="" class="wp-image-1081093" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-289.png 489w, https://blog.finxter.com/wp-content/uplo...98x300.png 298w, https://blog.finxter.com/wp-content/uplo...50x150.png 150w" sizes="(max-width: 489px) 100vw, 489px" /></figure>
</div>
<h3>Inspecting the Page</h3>
<p>Seeing that all of these pages are locally stored, we can view them either by opening them from the file system in our desired browser, or by continuing to build our code with the following snippet, which reads one of them into a string.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open("notebooks/Rookies/2000.html") as f:
    page = f.read()</pre>
<p>Below is the loaded page with <strong>Developer Tools</strong> docked to the right. Notice how hovering the mouse cursor on the HTML line containing the ID <code>rookies</code> highlights the table element on the page.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="624" height="217" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-290.png" alt="" class="wp-image-1081095" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-290.png 624w, https://blog.finxter.com/wp-content/uplo...00x104.png 300w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<p>All the desired data of this page is housed in that table element. Before hastily sucking up all of this data as is, now is the best time to consider whether everything on this table is worth collecting.</p>
<h2>Pre-Clean</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-297.png" alt="" class="wp-image-1081120" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-297.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<p>Pre-cleaning might not be a frequent word in your vocabulary, but for those of you seeing yourself scraping data regularly, it <em>should </em>be. If you want to avoid the frustration of wasted hours of progress on a data collection project, it’s best to first separate the chaff from the wheat.</p>
<p>For instance, take note of the three elements boxed in red.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="624" height="397" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-291.png" alt="" class="wp-image-1081096" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-291.png 624w, https://blog.finxter.com/wp-content/uplo...00x191.png 300w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<p>One row serves as the “main” table header. The other two rows are duplicate instances of the same artifacts found at the top. This pattern repeats every 20th row.</p>
<p>Upon further inspection of these elements, it’s revealed that all of these rows have the same <code>tr</code> (table row) HTML tag. What distinguishes each of these elements from any others is its class name.</p>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="624" height="319" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-292.png" alt="" class="wp-image-1081097" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-292.png 624w, https://blog.finxter.com/wp-content/uplo...00x153.png 300w" sizes="(max-width: 624px) 100vw, 624px" /></figure>
</div>
<ol>
<li><strong>Main Header Row</strong><br />a. <code>Class = over_header</code></li>
<li><strong>Repeat Header Rows</strong><br />a. <code>Class = over_header thead</code></li>
<li><strong>Statistics Category Row</strong><br />a. <code>Class = thead</code></li>
</ol>
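<p>If clicking through Developer Tools row by row gets tedious, the same survey can be roughed out in code. This is only a sketch: <code>count_row_classes</code> is a hypothetical helper, and the <code>sample</code> markup is an invented stand-in that is far smaller than the real page.</p>

```python
from collections import Counter
from bs4 import BeautifulSoup


def count_row_classes(html, table_id="rookies"):
    """Tally the class attribute of every <tr> inside one table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    counts = Counter()
    for tr in table.find_all("tr"):
        # Beautiful Soup returns class as a list of names; rows without
        # a class are counted under the empty string
        counts[" ".join(tr.get("class", []))] += 1
    return counts


# invented stand-in markup, not copied from the site
sample = """
<table id="rookies">
  <tr class="over_header"><th>Totals</th></tr>
  <tr><td>Player A</td></tr>
  <tr class="over_header thead"><th>Totals</th></tr>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Player B</td></tr>
</table>
"""
print(count_row_classes(sample))
```

<p>Any class name that shows up in the tally but carries no statistics is a candidate for removal.</p>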
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># array to house list of dataframes
dfs = []

# unnecessary table rows to be removed
classes = ["over_header", "over_header thead", "thead"]
</pre>
<ol type="1" start="1">
<li><code>dfs</code> will be used later on to house several data frames</li>
<li>The <code>classes</code> list will hold the class names of all the unwanted table row elements.</li>
</ol>
<p>Since these elements provide no statistical value, they should be removed from the page entirely rather than simply “skipped over” in our parse. That’s to say, permanently removed from any future considerations.</p>
<p>The <code>decompose</code> method serves to remove unwanted elements in a page. As per the <a rel="noreferrer noopener" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose" target="_blank">official Beautiful Soup documentation</a>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">decompose()</pre>
<p><code>Tag.decompose()</code> removes a tag from the tree, then <em>completely destroys it and its contents</em>.</p>
<p>Below is a snippet of code where the <code>decompose</code> method is applied using nested <code>for</code> loops.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># for loop to iterate through the years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")

    # for loop cleans up unnecessary table
    # headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()
</pre>
<ol type="1" start="1">
<li>The first <code>for</code> loop iterates through the values of our <code>years</code> list object</li>
<li>The <code>with</code> statement opens the locally stored HTML file for the current year and reads its contents into the <code>page</code> variable</li>
<li>A <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="post" data-id="17311" target="_blank">BeautifulSoup</a> object is created by passing in both the page string and the name of the parser, <code>html.parser</code></li>
<li>The second <code>for</code> loop iterates through the values in the <code>classes</code> list</li>
<li>The third <code>for</code> loop uses Beautiful Soup’s <code>find_all</code> method to identify elements that have both a <code>tr</code> tag and a class name matching one of those in <code>classes</code></li>
<li><code>tr.decompose()</code> removes each of the identified table row elements from the page entirely</li>
</ol>
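<p>To see the effect in isolation, here is a small sketch that runs the same two inner loops against a made-up table. The markup and player names are invented for illustration; only the class names come from the page inspected above.</p>

```python
from bs4 import BeautifulSoup

# invented stand-in markup that mimics the repeating header rows
html = """
<table id="rookies">
  <tr class="over_header"><th>Totals</th></tr>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Player A</td></tr>
  <tr class="over_header thead"><th>Totals</th></tr>
  <tr class="thead"><th>Player</th></tr>
  <tr><td>Player B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
classes = ["over_header", "over_header thead", "thead"]

for name in classes:
    for tr in soup.find_all("tr", {"class": name}):
        # removes the row from the tree and destroys it
        tr.decompose()

# only the rows without one of the unwanted classes are left
remaining = [tr.get_text(strip=True) for tr in soup.find_all("tr")]
print(remaining)  # ['Player A', 'Player B']
```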
<p>Let’s look to build on this by extracting the data we <em>do</em> want.</p>
<h2>Extracting the Data</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="623" height="412" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-298.png" alt="" class="wp-image-1081121" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-298.png 623w, https://blog.finxter.com/wp-content/uplo...00x198.png 300w" sizes="(max-width: 623px) 100vw, 623px" /></figure>
</div>
<p>We can finally start working on the part of the code that actually extracts data from the table. </p>
<p>Remember that the table with all of the relevant data has the unique HTML ID <code>rookies</code>. The following additions to our code will serve to parse the data of this table.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># the years we wish to parse for
years = list(range(1990, 2017))

# array to house list of dataframes
dfs = []

# unnecessary table headers to be removed
classes = ["over_header", "over_header thead", "thead"]

for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")

    # for loop cleans up unnecessary table headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()

    ### Start Scraping Block ###
    # identifies, scrapes, and loads rookie tables into one dataframe
    rookie_table = soup.find(id="rookies")
    rookies = pd.read_html(str(rookie_table))[0]
    rookies["Year"] = year
    dfs.append(rookies)

# new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)
</pre>
<p>For what follows <code>### Start Scraping Block ###</code>:</p>
<ol type="1" start="1">
<li>The <code>rookie_table</code> variable identifies this, and only this, table on the page</li>
<li>Seeing that the <a href="https://blog.finxter.com/pandas-quickstart/" data-type="post" data-id="16511" target="_blank" rel="noreferrer noopener">Pandas</a> package can read HTML tables, the rookie table is loaded into Pandas using the <code>read_html</code> method, passing the <code>rookie_table</code> as a string</li>
<li><code>[0]</code> is tacked on to the end to select the first (and only) table that <code>read_html</code> returns, turning it from a <a href="https://blog.finxter.com/python-join-list-of-dataframes/" data-type="post" data-id="9780" target="_blank" rel="noreferrer noopener">list of dataframes</a> into a single dataframe</li>
<li>A “<code>Year</code>” column is added to the <code>rookies</code> dataframe</li>
<li><code>dfs.append(rookies)</code> collects the table of every rookie year, in the order they were iterated, into a list of dataframes</li>
<li>The Pandas method <code><a href="https://blog.finxter.com/how-does-pandas-concat-work/" data-type="post" data-id="17172" target="_blank" rel="noreferrer noopener">concat</a></code> is used to combine that list of dataframes into one single dataframe: <code>all_rookies</code></li>
</ol>
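<p>The last three steps can be tried out without any HTML at all. The sketch below uses two tiny hand-built dataframes (invented players and point totals) in place of the scraped tables, and also shows one optional tweak that the walkthrough itself does not use: <code>ignore_index=True</code>, which renumbers the rows of the combined dataframe.</p>

```python
import pandas as pd

# invented stand-ins for the tables that read_html would return
tables = {
    1990: pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": [100, 200]}),
    1991: pd.DataFrame({"Player": ["Player C"], "PTS": [300]}),
}

dfs = []
for year, rookies in tables.items():
    rookies["Year"] = year
    dfs.append(rookies)

# as in the walkthrough: each table keeps its own 0-based row labels
all_rookies = pd.concat(dfs)
print(all_rookies.index.tolist())  # [0, 1, 0]

# optional variation: renumber the rows from 0 to n-1
renumbered = pd.concat(dfs, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```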
<h2>Archiving</h2>
<p>Our final step involves taking all of this useful, clean information and archiving it in easily readable CSV format. Tacking on this line to the end of our code (outside of any loops!) <em>will</em> serve to be useful when deciding to come back and reference the data collected.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")
</pre>
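<p>Two details are worth knowing here: <code>to_csv</code> does not create missing folders, and by default it writes the dataframe’s index as an extra unnamed first column. A minimal sketch that handles both follows; the <code>archive</code> helper is a hypothetical name, and dropping the index with <code>index=False</code> is a choice rather than a requirement.</p>

```python
from pathlib import Path

import pandas as pd


def archive(df, path):
    """Write df to a CSV file, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves the row labels out of the file
    df.to_csv(path, index=False)
    return path


# example call, assuming all_rookies has been built as above:
# archive(all_rookies, "archive/NBA_Rookies_1990-2016.csv")
```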
<h2>Final Product</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="592" height="888" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-299.png" alt="" class="wp-image-1081124" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-299.png 592w, https://blog.finxter.com/wp-content/uplo...00x300.png 200w" sizes="(max-width: 592px) 100vw, 592px" /></figure>
</div>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import requests
import pandas as pd
from bs4 import BeautifulSoup

# the years we wish to parse for
years = list(range(1990, 2017))

# array to house list of dataframes
dfs = []

# unnecessary table headers to be removed
classes = ["over_header", "over_header thead", "thead"]

# loop iterates through years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")

    # second for loop clears unnecessary table headers
    for i in classes:
        for tr in soup.find_all("tr", {"class": i}):
            tr.decompose()

    # identifies, scrapes, and loads rookie tables into one dataframe
    table_rookies = soup.find(id="rookies")
    rookies = pd.read_html(str(table_rookies))[0]
    rookies["Year"] = year
    dfs.append(rookies)

# new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)

# dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")</pre>
<h2>Closing</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="623" height="415" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-300.png" alt="" class="wp-image-1081125" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-300.png 623w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w" sizes="(max-width: 623px) 100vw, 623px" /></figure>
</div>
<p>The process followed in this walkthrough should apply, with minor adjustments, to most other pages on <em>Basketball-Reference.com</em>.</p>
<p>There are five simple steps worth taking in each instance.</p>
<ol type="1" start="1">
<li>Identify the Page URL</li>
<li>Download the Page</li>
<li>Identify the Elements</li>
<li>Pre-Clean and Extract</li>
<li>Archive</li>
</ol>
<p>Following these five steps should make for a quick and successful scraping experience. </p>
<p>Next up in this series will be actually using this data to gain insight into future player potential. So be on the lookout for future installments!</p>
<p>We’ll share them here:</p>
</div>
https://www.sickgaming.net/blog/2023/01/...tifulsoup/