[Tut] Python BS4 – How to Scrape Absolute URL Instead of Relative Path - xSicKxBot - 12-02-2025
<div>
<p class="has-global-color-8-background-color has-background"><strong>Summary: </strong>Use <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin" target="_blank" rel="noreferrer noopener"><code data-enlighter-language="generic" class="EnlighterJSRAW">urllib.parse.urljoin()</code></a> to join the base URL of the page with each scraped relative path and obtain the complete (<strong>absolute</strong>) URL. You can also concatenate the base URL and the relative path as plain strings to build the absolute URL, but then you have to handle error-prone cases such as an extra forward slash yourself.</p>
<h2 class="wp-block-heading">Quick Answer</h2>
<p>When web scraping with BeautifulSoup in Python, you may encounter relative URLs (e.g., <code>/page2.html</code>) instead of absolute URLs (e.g., <code>http://example.com/page2.html</code>).
To convert relative URLs to absolute URLs, you can use the <code>urljoin()</code> function from the <code>urllib.parse</code> module.</p>
<p>Below is an example of how to extract absolute URLs from the <code>a</code> tags on a webpage using <code>BeautifulSoup</code> and <code>urljoin</code>:</p>
<div class="wp-block-image"> <figure class="aligncenter size-full"><img decoding="async" fetchpriority="high" width="816" height="757" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-130.png" alt="" class="wp-image-1651860" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-130.png 816w, https://blog.finxter.com/wp-content/uploads/2023/09/image-130-300x278.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-130-768x712.png 768w" sizes="(max-width: 816px) 100vw, 816px" /></figure> </div>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

# URL of the webpage you want to scrape
url = 'http://example.com'

# Send an HTTP request to the URL
response = requests.get(url)
response.raise_for_status()  # Raise an error for bad responses

# Parse the webpage content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the 'a' tags on the webpage
for a_tag in soup.find_all('a'):
    # Get the href attribute from the 'a' tag
    href = a_tag.get('href')
    # Use urljoin to convert the relative URL to an absolute URL
    absolute_url = urljoin(url, href)
    # Print the absolute URL
    print(absolute_url)</pre>
<p>In this example:</p>
<ul>
<li><code>url</code> is the URL of the webpage you want to scrape.</li>
<li><code>response</code> is the HTTP response obtained by sending an HTTP GET request to the URL.</li>
<li><code>soup</code> is a <code>BeautifulSoup</code> object that contains the parsed HTML content of
the webpage.</li>
<li><code>soup.find_all('a')</code> finds all the <code>a</code> tags on the webpage.</li>
<li><code>a_tag.get('href')</code> gets the <code>href</code> attribute from an <code>a</code> tag, which may be a relative URL.</li>
<li><code>urljoin(url, href)</code> converts the relative URL to an absolute URL by joining it with the base URL.</li>
<li><code>absolute_url</code> is the absolute URL, which is printed to the console.</li>
</ul>
<p>Now that you have a quick overview, let’s look at the problem in more detail and discuss two methods to solve it. <img decoding="async" src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f447.png" alt="👇" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<h2 class="wp-block-heading">Problem Formulation</h2>
<p><strong>Problem: </strong>How do you extract all the absolute URLs from an HTML page?</p>
<p><strong>Example: </strong>Consider the following webpage, which has numerous links:</p>
<div class="wp-block-image"> <figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="355" src="https://blog.finxter.com/wp-content/uploads/2023/09/image-129-1024x355.png" alt="" class="wp-image-1651858" srcset="https://blog.finxter.com/wp-content/uploads/2023/09/image-129-1024x355.png 1024w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129-300x104.png 300w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129-768x266.png 768w, https://blog.finxter.com/wp-content/uploads/2023/09/image-129.png 1266w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure> </div>
<div class="wp-block-image"> <figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-1024x319.png" alt="" class="wp-image-22847" style="object-fit:contain;width:784px;height:243px" width="784" height="243" srcset="https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-1024x319.png 1024w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-300x93.png 300w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-768x239.png 768w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links-150x47.png 150w, https://blog.finxter.com/wp-content/uploads/2021/02/relative_links.png 1305w" sizes="auto, (max-width: 784px) 100vw, 784px" /><figcaption class="wp-element-caption"><img decoding="async" src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Link</strong>: <a href="https://sayonshubham.github.io/">https://sayonshubham.github.io/</a></figcaption></figure> </div>
<p>Now, when you try to <a href="https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python">scrape</a> the links highlighted above, you find that only the relative paths are extracted instead of the complete absolute URLs.
Let us have a look at the code given below, which demonstrates what happens when you read the <code>'href'</code> attributes directly.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}

# get() Request
response = requests.get(web_url, headers=headers)

# Store the webpage contents
webpage = response.content

# Check Status Code (Optional)
# print(response.status_code)

# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")

# Print the raw href value of every link inside each nav element
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        print(url['href'])</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">/
/about
/blog
/finxter
/</pre>
<p>This output is not what we want. The goal is to extract the absolute URLs, as shown below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>
<p>Without further delay, let us go ahead and extract the absolute paths instead of the relative paths.
</p>
<h2 class="wp-block-heading">Method 1: Using <span class="has-inline-color has-luminous-vivid-orange-color">urllib.parse.urljoin()</span></h2>
<p>The easiest solution to our problem is to use the <a href="https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin" target="_blank" rel="noreferrer noopener"><code>urllib.parse.urljoin()</code></a> function.</p>
<p>According to the Python documentation, <code data-enlighter-language="generic" class="EnlighterJSRAW">urllib.parse.urljoin()</code> is used to construct a full/absolute URL by combining the “base URL” with another URL. The advantage of using <code>urljoin()</code> is that it properly resolves the relative path, whether <code>BASE_URL</code> is only the domain or the full URL of the webpage.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from urllib.parse import urljoin

URL_1 = 'http://www.example.com'
URL_2 = 'http://www.example.com/something/index.html'

print(urljoin(URL_1, '/demo'))
print(urljoin(URL_2, '/demo'))</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">http://www.example.com/demo
http://www.example.com/demo</pre>
<p>Now that we have an idea about <code data-enlighter-language="generic" class="EnlighterJSRAW">urljoin</code>, let us have a look at the following code, which solves our problem and extracts the complete/absolute paths from the HTML page.</p>
<p><strong>Solution:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}

# get() Request
response = requests.get(web_url, headers=headers)

# Store the webpage contents
webpage = response.content

# Check Status Code (Optional)
# print(response.status_code)

# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")

# Resolve every href inside each nav element against the page URL
for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        print(urljoin(web_url, url.get('href')))</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>
<h2 class="wp-block-heading">Method 2: Concatenate The Base URL And Relative URL Manually</h2>
<p>Another workaround to our problem is to concatenate the base part of the URL and the relative URLs manually, just like two ordinary strings.
The problem with this approach is that joining the strings by hand easily leads to off-by-one mistakes. Try to spot the extra forward slash <code>/</code> below:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">URL_1 = 'http://www.example.com/'
print(URL_1+'/demo')
# Output --> http://www.example.com//demo</pre>
<p>Therefore, to concatenate correctly, you have to remove any extra character that would produce a malformed URL. The following code joins the base and the relative paths without leaving an extra forward slash.</p>
<p><strong><em>Solution:</em></strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from bs4 import BeautifulSoup
import requests

web_url = 'https://sayonshubham.github.io/'
headers = {"User-Agent": "Mozilla/5.0 (CrKey armv7l 1.5.16041) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/31.0.1650.0 Safari/537.36"}

# get() Request
response = requests.get(web_url, headers=headers)

# Store the webpage contents
webpage = response.content

# Check Status Code (Optional)
# print(response.status_code)

# Create a BeautifulSoup object out of the webpage content
soup = BeautifulSoup(webpage, "html.parser")

for i in soup.find_all('nav'):
    for url in i.find_all('a'):
        # extract the href string
        x = url['href']
        # remove the extra forward-slash if present
        # (startswith also handles an empty href without an IndexError)
        if x.startswith('/'):
            print(web_url + x[1:])
        else:
            print(web_url + x)</pre>
<p><strong>Output:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">https://sayonshubham.github.io/
https://sayonshubham.github.io/about
https://sayonshubham.github.io/blog
https://sayonshubham.github.io/finxter
https://sayonshubham.github.io/</pre>
<p class="has-global-color-8-background-color has-background"><img decoding="async" src="https://s.w.org/images/core/emoji/14.0.0/72x72/26a0.png" alt="⚠" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong><span style="background-color:rgba(0, 0, 0, 0)" class="has-inline-color has-vivid-red-color">Caution:</span></strong> This is not the recommended way of extracting the absolute path from a given HTML page. If an automated script has to resolve URLs for websites that are not known when the script is written, this method will not serve your purpose, and your go-to method should be <code data-enlighter-language="generic" class="EnlighterJSRAW">urljoin</code>. Nevertheless, this method deserves a mention because, in our case, it serves the purpose and extracts the absolute URLs.</p>
<h2 class="wp-block-heading">Conclusion</h2>
<p>In this article, we learned how to extract the absolute links from a given HTML page using BeautifulSoup.
If you want to master Python’s BeautifulSoup library with the help of examples and video lessons, have a look at the following link and follow the articles one by one; they explain every aspect of BeautifulSoup in great detail.</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube"><a href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/"><img decoding="async" src="https://blog.finxter.com/wp-content/plugins/wp-youtube-lyte/lyteCache.php?origThumbUrl=https%3A%2F%2Fi.ytimg.com%2Fvi%2FEGdHdWtVe6E%2Fhqdefault.jpg" alt="YouTube Video"></a><figcaption></figcaption></figure>
<p><img decoding="async" src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/web-scraping-with-beautifulsoup-in-python/" target="_blank" rel="noreferrer noopener">Web Scraping With BeautifulSoup In Python</a></p>
<p>With that, we come to the end of this tutorial! Please <strong><a href="http://blog.finxter.com/subscribe" target="_blank" rel="noreferrer noopener">stay tuned</a></strong> and <strong><a href="https://www.youtube.com/channel/UCRlWL2q80BnI4sA5ISrz9uw" target="_blank" rel="noreferrer noopener">subscribe</a></strong> for more interesting content in the future.</p>
<p>The post <a rel="nofollow" href="https://blog.finxter.com/scraping-the-absolute-url-of-instead-of-the-relative-path-using-beautifulsoup/">Python BS4 – How to Scrape Absolute URL Instead of Relative Path</a> appeared first on <a rel="nofollow" href="https://blog.finxter.com">Be on the Right Side of Change</a>.</p>
</div> https://www.sickgaming.net/blog/2023/09/28/python-bs4-how-to-scrape-absolute-url-instead-of-relative-path/ |
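<p>As a supplement to the caution under Method 2, here is a minimal sketch of how <code>urljoin()</code> treats kinds of <code>href</code> values that plain string concatenation would get wrong. The base URL and the <code>href</code> values are made up for illustration and are not taken from the page scraped in this tutorial; only the standard library is used.</p>

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration.
base = "http://www.example.com/something/index.html"

# Hypothetical href values of the kinds commonly found in <a> tags.
hrefs = [
    "/demo",                       # root-relative path
    "demo.html",                   # relative to the current directory
    "../up.html",                  # relative to the parent directory
    "//cdn.example.com/lib.js",    # protocol-relative URL
    "https://other.example.org/",  # already absolute
    "#top",                        # fragment only
]

# Resolve every href against the page URL.
resolved = {href: urljoin(base, href) for href in hrefs}

for href, absolute in resolved.items():
    print(f"{href!r:32} -> {absolute}")
```

<p>With manual concatenation, each of these cases would need its own special handling, whereas <code>urljoin()</code> resolves them by the same rule. This is one reason to prefer it when the scraped website is not known in advance.</p>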