{"id":134202,"date":"2023-06-05T11:17:00","date_gmt":"2023-06-05T11:17:00","guid":{"rendered":"https:\/\/phppot.com\/?p=20883"},"modified":"2023-06-05T11:17:00","modified_gmt":"2023-06-05T11:17:00","slug":"web-scraping-with-php-tutorial-to-scrape-web-pages","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2023\/06\/05\/web-scraping-with-php-tutorial-to-scrape-web-pages\/","title":{"rendered":"Web Scraping with PHP \u2013 Tutorial to Scrape Web Pages"},"content":{"rendered":"<div class=\"modified-on\" readability=\"7.0697674418605\"> by <a href=\"https:\/\/phppot.com\/about\/\">Vincy<\/a>. Last modified on July 21st, 2023.<\/div>\n<p>Web scraping is a mechanism to crawl web pages using software tools or utilities. It reads the content of the website pages over a network stream.<\/p>\n<p>This technology is also known as web crawling or data extraction. In a previous tutorial, we learned <a href=\"https:\/\/phppot.com\/php\/extract-content-using-php-and-preview-like-facebook\/\">how to extract pages by their URLs<\/a>.<br \/><a class=\"demo\" href=\"https:\/\/phppot.com\/demo\/web-scraping-php\">View Demo<\/a><\/p>\n<p>Several PHP libraries support this feature. In this tutorial, we will see one of the popular web-scraping components, named <strong>DomCrawler<\/strong>.<\/p>\n<p>This component is part of the PHP Symfony framework. 
This article has the code for integrating and using this component to crawl web pages.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-large wp-image-20924\" src=\"https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-550x367.jpg\" alt=\"web scraping php\" width=\"550\" height=\"367\" srcset=\"https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-550x367.jpg 550w, https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-300x200.jpg 300w, https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-768x512.jpg 768w, https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php.jpg 1200w\" sizes=\"auto, (max-width: 550px) 100vw, 550px\"><\/p>\n<p>We can also create custom utilities to scrape the content from remote pages. <a href=\"https:\/\/phppot.com\/php\/php-curl\/\">PHP provides built-in cURL functions<\/a> to process the network request-response cycle.<\/p>\n<h2>About DomCrawler<\/h2>\n<p>The DomCrawler component of the Symfony library parses HTML and XML content.<\/p>\n<p>It constructs a crawler handle to reach any node of an HTML tree structure. 
It accepts queries to filter specific nodes from the input HTML or XML.<\/p>\n<p>It provides many crawling utilities and features.<\/p>\n<ol>\n<li>Node filtering by XPath queries.<\/li>\n<li>Node traversing by specifying the HTML selector and its position.<\/li>\n<li>Node name and value reading.<\/li>\n<li>HTML or XML insertion into the specified container tag.<\/li>\n<\/ol>\n<h2>Steps to create a web scraping tool in PHP<\/h2>\n<ol>\n<li>Install and instantiate an HTTP client library.<\/li>\n<li>Install and instantiate the crawler library to parse the response.<\/li>\n<li>Prepare parameters and bundle them with the request to scrape the remote content.<\/li>\n<li>Crawl the response data and read the content.<\/li>\n<\/ol>\n<p>This example uses the Symfony HttpClient library for sending the request.<\/p>\n<h2>Web scraping PHP example<\/h2>\n<p>This example creates a client instance and sends a request to the target URL. Then, it receives the web content in a response object.<\/p>\n<p>The DomCrawler component parses the response data to extract specific web content.<\/p>\n<p>In this example, the crawler reads the site title by parsing the <em>h1<\/em> text. 
It also reads the content of the site HTML filtered by the <em>paragraph<\/em> tag.<\/p>\n<p>The image below shows the example project structure with the PHP script to scrape the web content.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-20923\" src=\"https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-project-structure.jpg\" alt=\"web scraping php project structure\" width=\"313\" height=\"134\" srcset=\"https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-project-structure.jpg 313w, https:\/\/phppot.com\/wp-content\/uploads\/2023\/06\/web-scraping-php-project-structure-300x128.jpg 300w\" sizes=\"auto, (max-width: 313px) 100vw, 313px\"><\/p>\n<h3>How to install the Symfony framework library<\/h3>\n<p>We are using the popular Symfony components to scrape the web content. They can be installed via Composer.<br \/>The following commands install the dependencies. The css-selector package is needed because the crawler\u2019s filter() method takes CSS selectors.<\/p>\n<pre class=\"prettyprint\"><code>composer require symfony\/http-client symfony\/dom-crawler\ncomposer require symfony\/css-selector\n<\/code><\/pre>\n<p>After running these Composer commands, the vendor folder contains the required dependencies along with an autoload.php file. 
The script below imports the dependencies through this file.<\/p>\n<p class=\"code-heading\">index.php<\/p>\n<pre class=\"prettyprint\"><code class=\"language-php\">&lt;?php\nrequire 'vendor\/autoload.php';\n\nuse Symfony\\Component\\HttpClient\\HttpClient;\nuse Symfony\\Component\\DomCrawler\\Crawler;\n\n$httpClient = HttpClient::create();\n\n\/\/ Website to be scraped\n$website = 'https:\/\/example.com';\n\n\/\/ Send an HTTP GET request and store the response\n$httpResponse = $httpClient-&gt;request('GET', $website);\n$websiteContent = $httpResponse-&gt;getContent();\n\n$domCrawler = new Crawler($websiteContent);\n\n\/\/ Filter the H1 tag text\n$h1Text = $domCrawler-&gt;filter('h1')-&gt;text();\n\n\/\/ Collect the text of every paragraph tag\n$paragraphText = $domCrawler-&gt;filter('p')-&gt;each(function (Crawler $node) {\n    return $node-&gt;text();\n});\n\n\/\/ Scraped result\necho \"H1: \" . $h1Text . \"\\n\";\necho \"Paragraphs:\\n\";\nforeach ($paragraphText as $paragraph) {\n    echo $paragraph . \"\\n\";\n}\n?&gt;\n<\/code><\/pre>\n<h2>Ways to process the web-scraped data<\/h2>\n<p>What will people do with the web-scraped data? The example code created for this article prints the content to the browser. In an actual application, this data can be used for many purposes.<\/p>\n<ol>\n<li>It provides data for finding popular trends in scraped news site content.<\/li>\n<li>It generates leads for showing charts or statistics.<\/li>\n<li>It helps to extract images and store them in the application\u2019s backend.<\/li>\n<\/ol>\n<p>If you want to see <a href=\"https:\/\/phppot.com\/php\/extract-images-from-url-in-excel-with-php-using-phpspreadsheet\/\">how to extract images from the pages<\/a>, the linked article has simple example code.<\/p>\n<h2>Caution<\/h2>\n<p>Web scraping is theft if you scrape against a website\u2019s usage policy. You should read a website\u2019s policy before scraping it. If the terms are unclear, ask the website\u2019s owner for explicit permission. Also, commercializing web-scraped content is a crime in most cases. 
Get permission before doing any such activities.<\/p>\n<p>Before crawling a site\u2019s content, it is essential to read the website terms. This ensures that the publicly available content is permitted to be scraped.<\/p>\n<p>Many sites provide API access or feeds to read their content. It is fair to extract data through properly provisioned API access. We have seen how to <a href=\"https:\/\/phppot.com\/php\/extracting-title-description-thumbnail-using-youtube-data-api\/\">extract the title, description and video thumbnail using the YouTube API<\/a>.<\/p>\n<p>For learning purposes, you may host a dummy website with lorem ipsum content and scrape it.<br \/><a class=\"demo\" href=\"https:\/\/phppot.com\/demo\/web-scraping-php\">View Demo<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>by Vincy. Last modified on July 21st, 2023. Web scraping is a mechanism to crawl web pages using software tools or utilities. It reads the content of the website pages over a network stream. This technology is also known as web crawling or data extraction. 
In a previous tutorial, we learned how to extract pages [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":134203,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[65],"tags":[],"class_list":["post-134202","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-php-updates"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/134202","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=134202"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/134202\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media\/134203"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=134202"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=134202"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=134202"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}