[Tut] Web Scraping with PHP – Tutorial to Scrape Web Pages - xSicKxBot - 08-17-2023
<div style="margin: 5px 5% 10px 5%;"><img src="https://www.sickgaming.net/blog/wp-content/uploads/2023/08/web-scraping-with-php-tutorial-to-scrape-web-pages.jpg" width="550" height="367" title="" alt="" /></div>
<div><div class="modified-on" readability="7.0697674418605"> by <a href="https://phppot.com/about/">Vincy</a>. Last modified on July 21st, 2023.</div>
<p>Web scraping is a mechanism to extract content from web pages using software tools or utilities. It reads the content of a website's pages over the network.</p>
<p>This technique is also known as web crawling or data extraction. In a previous tutorial, we learned <a href="https://phppot.com/php/extract-content-using-php-and-preview-like-facebook/">how to extract page content by its URL</a>.<br /><a class="demo" href="https://phppot.com/demo/web-scraping-php">View Demo</a></p>
<p>Several PHP libraries support web scraping. In this tutorial, we will see one of the popular web-scraping components, named <strong>DomCrawler</strong>.</p>
<p>This component is part of the PHP Symfony framework. This article has the code for integrating and using this component to crawl web pages.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-large wp-image-20924" src="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-550x367.jpg" alt="web scraping php" width="550" height="367" srcset="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-550x367.jpg 550w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-300x200.jpg 300w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-768x512.jpg 768w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php.jpg 1200w" sizes="(max-width: 550px) 100vw, 550px"></p>
<p>We can also create custom utilities to scrape content from remote pages.
<a href="https://phppot.com/php/php-curl/">PHP provides built-in cURL functions</a> to process the network request-response cycle.</p>
<h2>About DomCrawler</h2>
<p>The DomCrawler component of the Symfony library parses and navigates HTML and XML content.</p>
<p>It builds a crawler object that can reach any node of an HTML tree structure. It accepts queries to filter specific nodes from the input HTML or XML.</p>
<p>It provides many crawling utilities and features.</p>
<ol>
<li>Node filtering by XPath queries.</li>
<li>Node traversal by the position of a node among the matched nodes.</li>
<li>Node name and value reading.</li>
<li>Adding HTML or XML content to the crawler.</li>
</ol>
<h2>Steps to create a web scraping tool in PHP</h2>
<ol>
<li>Install and instantiate an HTTP client library.</li>
<li>Install and instantiate the crawler library to parse the response.</li>
<li>Prepare parameters and bundle them with the request to scrape the remote content.</li>
<li>Crawl the response data and read the content.</li>
</ol>
<p>This example uses the Symfony HttpClient library to send the request.</p>
<h2>Web scraping PHP example</h2>
<p>This example creates a client instance and sends a request to the target URL. Then, it receives the web content in a response object.</p>
<p>The DomCrawler component parses the response data to filter out specific web content.</p>
<p>In this example, the crawler reads the site title by parsing the <em>h1</em> text.
It also reads the text of every node matched by the <em>paragraph</em> tag.</p>
<p>The image below shows the example project structure with the PHP script to scrape the web content.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-20923" src="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-project-structure.jpg" alt="web scraping php project structure" width="313" height="134" srcset="https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-project-structure.jpg 313w, https://phppot.com/wp-content/uploads/2023/06/web-scraping-php-project-structure-300x128.jpg 300w" sizes="(max-width: 313px) 100vw, 313px"></p>
<h3>How to install the Symfony framework library</h3>
<p>We use the popular Symfony components to scrape the web content. They can be installed via Composer.<br />The following commands install the dependencies.</p>
<pre class="prettyprint"><code>composer require symfony/http-client symfony/dom-crawler
composer require symfony/css-selector
</code></pre>
<p>After running these Composer commands, the vendor folder contains the required dependencies along with an autoload.php file. The script below loads the dependencies through this file.</p>
<p class="code-heading">index.php</p>
<pre class="prettyprint"><code class="language-php"><?php
require 'vendor/autoload.php';

use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\DomCrawler\Crawler;

$httpClient = HttpClient::create();

// Website to be scraped
$website = 'https://example.com';

// Send an HTTP GET request and store the response body
$httpResponse = $httpClient->request('GET', $website);
$websiteContent = $httpResponse->getContent();

$domCrawler = new Crawler($websiteContent);

// Read the H1 text; text() throws if no node matches, so check the count first
$h1Nodes = $domCrawler->filter('h1');
$h1Text = $h1Nodes->count() > 0 ? $h1Nodes->text() : '';

// Collect the text of every paragraph tag
$paragraphText = $domCrawler->filter('p')->each(function (Crawler $node) {
    return $node->text();
});

// Scraped result
echo "H1: " . $h1Text . "\n";
echo "Paragraphs:\n";
foreach ($paragraphText as $paragraph) {
    echo $paragraph . "\n";
}
?>
</code></pre>
<h2>Ways to process the web-scraped data</h2>
<p>What will people do with the web-scraped data? The example code created for this article prints the content to the browser. In an actual application, this data can be used for many purposes.</p>
<ol>
<li>It provides data for finding popular trends in scraped news site content.</li>
<li>It supplies input for showing charts or statistics.</li>
<li>It helps to extract images and store them in the application’s backend.</li>
</ol>
<p>If you want to see <a href="https://phppot.com/php/extract-images-from-url-in-excel-with-php-using-phpspreadsheet/">how to extract images from the pages</a>, the linked article has simple code for it.</p>
<h2>Caution</h2>
<p>Web scraping is theft if you scrape against a website’s usage policy. You should read a website’s policy before scraping it. If the terms are unclear, you should get explicit permission from the website’s owner. Also, commercializing web-scraped content without permission can be illegal. Get permission before doing any such activities.</p>
<p>Before crawling a site’s content, it is essential to read the website’s terms to confirm that its public content may be scraped.</p>
<p>Many sites provide API access or feeds for reading their content. Extracting data through such an API is the fair way to do it.
In an earlier article, we saw how to <a href="https://phppot.com/php/extracting-title-description-thumbnail-using-youtube-data-api/">extract the title, description and video thumbnail using the YouTube API</a>.</p>
<p>For learning purposes, you may host a dummy website with lorem ipsum content and scrape it.<br /><a class="demo" href="https://phppot.com/demo/web-scraping-php">View Demo</a></p>
</div>
https://www.sickgaming.net/blog/2023/06/05/web-scraping-with-php-tutorial-to-scrape-web-pages/
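<p>The tutorial mentions that custom scraping utilities can be built on PHP’s cURL functions instead of a library. The following is a minimal sketch of that idea, not the article’s own code. It assumes the curl and dom extensions are enabled; the function names <em>fetchPage</em> and <em>extractHeadingAndParagraphs</em> and the sample HTML are hypothetical and introduced only for illustration. To stay runnable offline, the usage part parses an inline HTML string rather than a live page.</p>

```php
<?php
// Hypothetical helpers for illustration; they are not part of the original tutorial.

/**
 * Fetch a page body with PHP's built-in cURL functions.
 * Returns null on a transport error or a non-2xx HTTP status.
 */
function fetchPage(string $url): ?string
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

/**
 * Read the first h1 text and all paragraph texts from an HTML string
 * using the built-in DOMDocument and DOMXPath classes.
 */
function extractHeadingAndParagraphs(string $html): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // The first h1, or null when the page has none
    $h1 = $xpath->query('//h1')->item(0);

    $paragraphs = [];
    foreach ($xpath->query('//p') as $node) {
        $paragraphs[] = trim($node->textContent);
    }

    return [
        'h1' => $h1 !== null ? trim($h1->textContent) : null,
        'paragraphs' => $paragraphs,
    ];
}

// Sample input (made up). For a live page, pass the result of fetchPage($url) instead.
$sampleHtml = '<html><body><h1>Example Domain</h1>'
    . '<p>First paragraph.</p><p>Second paragraph.</p></body></html>';

$result = extractHeadingAndParagraphs($sampleHtml);

echo 'H1: ' . $result['h1'] . "\n";
echo "Paragraphs:\n";
foreach ($result['paragraphs'] as $paragraph) {
    echo $paragraph . "\n";
}
```

<p>Separating the fetch step from the parse step keeps the parser testable without network access. As with the DomCrawler example, check a site’s terms before pointing <em>fetchPage</em> at it.</p>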