[Tut] Scrape a Bookstore in 5 Steps Python [Learn Project] - xSicKxBot - 06-17-2022 Scrape a Bookstore in 5 Steps Python [Learn Project] <div>
<p><em><strong>Story</strong>: This series of articles assumes you work in the IT Department of Mason Books. The owner asks you to scrape the website of a competitor. He would like this information to gain insight into his pricing structure.</em></p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f4a1.png" alt="💡" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: Before continuing, we recommend you possess, at minimum, a basic knowledge of <a rel="noreferrer noopener" href="https://www.w3schools.com/html/" target="_blank">HTML</a> and <a rel="noreferrer noopener" href="https://www.w3schools.com/css/default.asp" target="_blank">CSS</a> and have reviewed our articles on <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-scrape-html-tables-part-1/" target="_blank">How to Scrape HTML tables</a>.</p>
<h2>What You’ll Build in This Project</h2>
<p>Let’s navigate to <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" data-type="URL" data-id="https://books.toscrape.com/index.html" target="_blank">Books to Scrape</a> and review the format.
</p>
<div class="wp-block-image"> <figure class="aligncenter size-large"><img loading="lazy" width="1024" height="564" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png" alt="" class="wp-image-224055" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-1024x564.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-300x165.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a-768x423.png 768w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-01a.png 1247w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure> </div>
<p>At first glance, you will notice:</p>
<ul>
<li>Book categories are listed on the left-hand side.</li>
<li>The website lists 1,000 books in total.</li>
<li>Each web page shows 20 books.</li>
<li>Each price is shown in pounds sterling (£).</li>
<li>Each book displays <strong>minimal</strong> details.</li>
<li>To view <strong>complete</strong> details for a book, click on the image or the <code>Book Title</code> hyperlink. This hyperlink leads to a page containing additional details for the selected book.</li>
<li>The footer shows the total number of pages (<code>Page 1 of 50</code>).</li>
</ul>
<h2 class="wp-embed-aspect-16-9 wp-has-aspect-ratio" id="getting-started">Step 1: Install and Import Libraries for Project</h2>
<p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">Before you can work with the data, three (3) libraries need to be installed.</p>
<ul>
<li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="URL" data-id="https://blog.finxter.com/pandas-quickstart/" target="_blank">Pandas</a></em> library provides the <em>DataFrame</em>, a structure for working with tabular data.</li>
<li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-requests-tutorials/" data-type="URL" data-id="https://blog.finxter.com/best-python-requests-tutorials/" target="_blank">Requests</a></em> library lets you send HTTP requests from Python.</li>
<li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" data-type="URL" data-id="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank">Beautiful Soup</a> library extracts data from HTML and XML files.</li>
</ul>
<p>To install these libraries, open a terminal in your <a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-ide/" data-type="post" data-id="8106" target="_blank">IDE</a> and run the commands below. In this example, the command prompt is a dollar sign (<code>$</code>).
Your terminal prompt may be different.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install pandas</pre>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install requests</pre>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install beautifulsoup4</pre>
<p>Hit the <code>&lt;Enter&gt;</code> key after each command to start the installation process.</p>
<p>If an installation succeeds, a message saying so displays in the terminal.</p>
<hr class="wp-block-separator has-css-opacity"/>
<p>Feel free to view the PyCharm installation guides for the required libraries.</p>
<ul>
<li><a href="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-pandas-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install Pandas on PyCharm</a></li>
<li><a href="https://blog.finxter.com/how-to-install-requests-in-python/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-requests-in-python/" target="_blank" rel="noreferrer noopener">How to install Requests on PyCharm</a></li>
<li><a href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install BeautifulSoup4 on PyCharm</a></li>
</ul>
<hr class="wp-block-separator has-css-opacity"/>
<p>Add the following imports to the top of each code snippet. They allow the code in this article to run without import errors.</p>
<pre class="EnlighterJSRAW wp-embed-aspect-16-9 wp-has-aspect-ratio" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import urllib.request
from csv import reader, writer</pre>
<ul>
<li>The <code>time</code> library is built into Python and does not require installation. This library contains <a rel="noreferrer noopener" href="https://blog.finxter.com/time-delay-in-python/" data-type="URL" data-id="https://blog.finxter.com/time-delay-in-python/" target="_blank"><code>time.sleep()</code></a>, which is used to set a delay between page scrapes.</li>
<li>The <code>urllib</code> library is built into Python and does not require installation. Its <code>urllib.request</code> module is used to save images.</li>
<li>The <code>csv</code> library is built into Python (it is not part of Pandas) and does not require installation. This library contains the <code>reader</code> and <code>writer</code> functions, which are used to read from and save data to a CSV file.</li>
</ul>
<h2>Step 2: Understand Basics and Scrape Your First Results</h2>
<div class="wp-block-image"> <figure class="aligncenter size-full"><img loading="lazy" width="909" height="462" src="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png" alt="" class="wp-image-224220" srcset="https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a.png 909w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-300x152.png 300w, https://blog.finxter.com/wp-content/uploads/2022/03/kmc-books-04a-768x390.png 768w" sizes="(max-width: 909px) 100vw, 909px" /></figure> </div>
<p>In this step, you’ll perform the following tasks:</p>
<ul id="block-990dfa6f-f2e6-423a-84d3-3fbfcb432a12">
<li>Reviewing the website to scrape.</li>
<li>Understanding HTTP status codes.</li>
<li>Connecting to the <a rel="noreferrer noopener" href="https://books.toscrape.com/index.html" target="_blank">Books to Scrape</a> website using the <code><a rel="noreferrer noopener" href="https://blog.finxter.com/python-requests-library/" target="_blank">requests</a></code> library.</li>
<li>Retrieving the total number of pages to scrape.</li>
<li>Closing the open connection.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍"
class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-1/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-1/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
<h2>Step 3: Configure URL to Scrape and Avoid Spamming the Server</h2>
<div class="wp-block-cover aligncenter is-light"><span aria-hidden="true" class="wp-block-cover__background has-background-dim"></span><img loading="lazy" width="886" height="672" class="wp-block-cover__image-background wp-image-422310" alt="" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png" data-object-fit="cover" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-122.png 886w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-300x228.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-122-768x583.png 768w" sizes="(max-width: 886px) 100vw, 886px" />
<div class="wp-block-cover__inner-container"> <p class="has-text-align-center has-base-3-color has-text-color has-large-font-size"><strong>Rule: Don’t Spam the Server!</strong></p> </div> </div>
<p>In this step, you’ll perform the following tasks:</p>
<ul id="block-30f20a4a-690b-43a9-bf02-27dbdcbfb3a7">
<li>Configuring a page URL for scraping.</li>
<li>Setting a delay: <a href="https://blog.finxter.com/time-delay-in-python/"><code>time.sleep()</code></a> to pause between page scrapes.</li>
<li><a href="https://blog.finxter.com/python-loops/" target="_blank" rel="noreferrer noopener">Looping</a> through two (2) pages for testing purposes.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-2/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-2/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
<h2>Step 4: Save Book Details in a Python List</h2>
<div class="wp-block-image"> <figure class="aligncenter size-large"><img loading="lazy" width="1024" height="709" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png" alt="" class="wp-image-422311" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-123-1024x709.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-300x208.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123-768x532.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-123.png 1268w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure> </div>
<p>In this step, you’ll perform the following tasks:</p>
<ul>
<li>Locating book details.</li>
<li>Writing code to retrieve this information for all books.</li>
<li>Saving <code>Book</code> details to a <a href="https://blog.finxter.com/python-lists/" target="_blank" rel="noreferrer noopener">List</a>.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍"
class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-3/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-3/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
<h2>Step 5: Clean and Save the Scraped Output</h2>
<div class="wp-block-image"> <figure class="aligncenter size-large"><img loading="lazy" width="1024" height="340" src="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png" alt="" class="wp-image-422312" srcset="https://blog.finxter.com/wp-content/uploads/2022/06/image-124-1024x340.png 1024w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-300x100.png 300w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124-768x255.png 768w, https://blog.finxter.com/wp-content/uploads/2022/06/image-124.png 1030w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure> </div>
<p>In this step, you’ll perform the following tasks:</p>
<ul>
<li>Cleaning up the scraped code.</li>
<li>Saving the output to a <a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-read-a-csv-file-into-a-python-list/" target="_blank">CSV</a> file.</li>
</ul>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f30d.png" alt="🌍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Learn More</strong>: Learn everything you need to know to reproduce this step in the <a href="https://blog.finxter.com/scraping-a-bookstore-part-4/" data-type="URL" data-id="https://blog.finxter.com/scraping-a-bookstore-part-4/" target="_blank" rel="noreferrer noopener">in-depth Finxter blog tutorial</a>.</p>
<h2>Conclusion</h2>
<p>This tutorial has guided you through the steps to create your first practical web scraping project: scraping the contents of a bookstore!
</p>
<p>Now, go out and use your skills wisely and to the benefit of humanity, my friend! <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<hr class="wp-block-separator has-alpha-channel-opacity"/>
</div> https://www.sickgaming.net/blog/2022/06/14/scrape-a-bookstore-in-5-steps-python-learn-project/
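<p>As a rough companion to Steps 2 through 5, here is a minimal sketch of the supporting logic: reading the page count from the footer text, building page URLs, pausing between requests, cleaning a price, and writing rows to CSV. It is not the code from the in-depth tutorials. The function names (<code>total_pages</code>, <code>page_urls</code>, <code>crawl</code>, <code>clean_price</code>, <code>books_to_csv</code>) are hypothetical, and the <code>catalogue/page-N.html</code> URL pattern is an assumption about how the site paginates. The sketch uses only the standard library and sends no requests on its own; the download function is passed in as <code>fetch</code>.</p>

```python
import csv
import io
import re
import time

# Assumption: result pages follow this pattern on Books to Scrape.
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def total_pages(footer_text):
    """Return N from footer text such as 'Page 1 of 50'."""
    match = re.search(r"Page\s+\d+\s+of\s+(\d+)", footer_text)
    if match is None:
        raise ValueError(f"no page counter found in {footer_text!r}")
    return int(match.group(1))


def page_urls(last_page, first_page=1):
    """Build one URL per result page, first_page through last_page."""
    return [BASE_URL.format(n) for n in range(first_page, last_page + 1)]


def crawl(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests.

    The pause is skipped before the first request. `fetch` and `sleep`
    are parameters so the loop can be tested without network access.
    """
    pages = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay)
        pages.append(fetch(url))
    return pages


def clean_price(raw):
    """Turn a scraped price string such as '£51.77' into a float."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return float(match.group())


def books_to_csv(books, out):
    """Write (title, raw_price) pairs to the open text file `out`."""
    csv_writer = csv.writer(out)
    csv_writer.writerow(["title", "price"])
    for title, raw_price in books:
        csv_writer.writerow([title, clean_price(raw_price)])


if __name__ == "__main__":
    # Offline demo with made-up sample data; no request is sent.
    print(page_urls(total_pages("Page 1 of 50"))[:2])
    buffer = io.StringIO()
    books_to_csv([("Example Book", "£10.00")], buffer)
    print(buffer.getvalue())
```

<p>To scrape for real, you could pass something like <code>fetch=lambda url: requests.get(url).text</code> to <code>crawl()</code> and parse each returned page with Beautiful Soup. Keeping the delay inside the loop is one simple way to follow the rule from Step 3: don’t spam the server.</p>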