[Tut] How to Read an XLS File in Python? - Printable Version +- Sick Gaming (https://www.sickgaming.net) +-- Forum: Programming (https://www.sickgaming.net/forum-76.html) +--- Forum: Python (https://www.sickgaming.net/forum-83.html) +--- Thread: [Tut] How to Read an XLS File in Python? (/thread-99288.html) |
[Tut] How to Read an XLS File in Python? - xSicKxBot - 04-26-2022 How to Read an XLS File in Python? <div><h2 class="wp-embed-aspect-16-9 wp-has-aspect-ratio">Problem Formulation and Solution Overview</h2> <p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">In this article, you’ll learn how to read an XML file and format the output in Python. </p> <p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">To make it more fun, we have the following running scenario:</p> <p><em>Arman, a Music Appreciation student at the <a href="https://www.rcmusic.com/learning/oscar-peterson-school-of-music/music-appreciation" data-type="URL" data-id="https://www.rcmusic.com/learning/oscar-peterson-school-of-music/music-appreciation" target="_blank" rel="noreferrer noopener">Royal Conservatory of Music</a>, has been given course materials in an XML file format. Arman needs to purchase these books immediately. He’s into music, not computers. He needs you to format the output into a readable format.</em></p> <p><em>Navigate to the Appendix Data</em> section<em> and download the XML file to follow along. Then, move this file to the current working directory.</em></p> <p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f4ac.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Question</strong>: How would we read in ax XML file and format the output?</p> <p>We can accomplish this task by one of the following options:</p> <ul> <li><strong>Method 1</strong>: Use <a rel="noreferrer noopener" href="https://blog.finxter.com/installing-beautiful-soup/" data-type="URL" data-id="https://blog.finxter.com/installing-beautiful-soup/" target="_blank"><em>Beautiful Soup</em></a></li> <li><strong>Method 2</strong>: Use <a href="https://docs.python.org/3/library/xml.etree.elementtree.html" data-type="URL" data-id="https://docs.python.org/3/library/xml.etree.elementtree.html">XML eTree</a></li> <li><strong>Method 3</strong>: Use <a rel="noreferrer noopener" href="https://docs.python.org/3/library/xml.dom.minidom.html" data-type="URL" data-id="https://docs.python.org/3/library/xml.dom.minidom.html" target="_blank">Minidom</a></li> <li><strong>Method 4</strong>: Use <a href="https://blog.finxter.com/category/pandas-library/" data-type="URL" data-id="https://blog.finxter.com/category/pandas-library/" target="_blank" rel="noreferrer noopener">Pandas</a></li> </ul> <hr class="wp-block-separator"/> <p class="wp-embed-aspect-16-9 wp-has-aspect-ratio">Before any data manipulation can occur, two (2) new libraries will require installation.</p> <ul> <li>The <em><a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="URL" data-id="https://blog.finxter.com/pandas-quickstart/" target="_blank">Pandas</a> </em>library enables access to/from a DataFrame.</li> <li>The <a rel="noreferrer noopener" href="https://blog.finxter.com/installing-beautiful-soup/" data-type="URL" data-id="https://blog.finxter.com/installing-beautiful-soup/" target="_blank"><em>Beautiful Soup</em></a> library enables the parsing of XML and HTML files.</li> </ul> <p>To install these libraries, navigate to an <a rel="noreferrer noopener" href="https://blog.finxter.com/best-python-ide/" data-type="post" data-id="8106" target="_blank">IDE</a> terminal. At the command prompt (<code>$</code>), execute the code below. For the terminal used in this example, the command prompt is a dollar sign (<code>$</code>). Your terminal prompt may be different.</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install pandas</pre> <p>Hit the <code><Enter></code> key on the keyboard to start the installation process.</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">$ pip install bs4</pre> <p>Hit the <code><Enter></code> key on the keyboard to start the installation process.</p> <p>If the installations were successful, a message displays in the terminal indicating the same.</p> <hr class="wp-block-separator"/> <p>Feel free to view the PyCharm installation guide for the required libraries.</p> <ul> <li><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-a-library-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-a-library-on-pycharm/" target="_blank">How to install</a><a rel="noreferrer noopener" href="https://blog.finxter.com/how-to-install-pandas-in-python/" target="_blank"> Pandas on PyCharm</a></li> <li><a href="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" data-type="URL" data-id="https://blog.finxter.com/how-to-install-beautifulsoup-on-pycharm/" target="_blank" rel="noreferrer noopener">How to install Beautiful Soup on PyCharm</a></li> </ul> <hr class="wp-block-separator"/> <p id="block-3b5c9c73-276c-4b84-a1bc-d5ba1de38d56">Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.</p> <pre class="EnlighterJSRAW wp-embed-aspect-16-9 wp-has-aspect-ratio" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import pandas as pd from bs4 import BeautifulSoup import xml.etree.ElementTree as ET import base64, xml.dom.minidom from xml.dom.minidom import Node</pre> <p class="has-global-color-8-background-color has-background"><img src="https://s.w.org/images/core/emoji/13.1.0/72x72/1f4a1.png" alt="?" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Note</strong>: The additional libraries indicated above do not require installation as they come built-in to Python.</p> <hr class="wp-block-separator"/> <h2>Method 1: Use Beautiful Soup</h2> <p class="has-global-color-8-background-color has-background">A clean, compact way to read an XML file is to use Python’s <a rel="noreferrer noopener" href="https://blog.finxter.com/installing-beautiful-soup/" data-type="URL" data-id="https://blog.finxter.com/installing-beautiful-soup/" target="_blank">Beautiful Soup</a> library. A “go-to” tool for web scraping and XML data extraction.</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="1" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">all_books = BeautifulSoup(open('books.xml'), 'xml') pretty_xml = all_books.prettify() print(pretty_xml)</pre> <ul> <li>The <em>books.xml</em> file is read and parsed using Beautiful Soup’s <a rel="noreferrer noopener" href="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" data-type="URL" data-id="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" target="_blank">XML parser</a>. The results are saved to <code>all_books</code>.</li> <li>Next, Beautiful Soup’s <a rel="noreferrer noopener" href="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" data-type="URL" data-id="https://blog.finxter.com/parsing-xml-using-beautifulsoup-in-python/" target="_blank"><code>prettify()</code></a> method is used to improve the appearance of the output.</li> <li>Finally, the formatted output is sent to the terminal.</li> </ul> <p><strong>Output (snippet)</strong></p> <figure class="wp-block-table is-style-stripes"> <table> <tbody> <tr> <td><code><?xml version="1.0"?><br /><catalog> <br /> <book> <br /> <isbn>978-0393050714</isbn><br /> <title>Johann Sebastian Bach</title><br /> <price>$5.99</price><br /> </book><br /> ......<br /></catalog></code></td> </tr> </tbody> </table> </figure> <hr class="wp-block-separator"/> <h2>Method 2: Use XML eTree</h2> <p class="has-global-color-8-background-color has-background">The ElementTree library is built-in to Python and contains functions to read and parse XML and XML-like data structures. The hierarchical data format is based on a tree structure: a <em>root </em>representing the tree and <em>elements </em>representing the nodes. </p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="1" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">all_books = ET.parse('books.xml').getroot() for b in all_books.findall('book'): print ("{:<20} {:<30} {:<30}".format(b.find('isbn').text, b.find('title').text, b.find('price').text))</pre> <ul> <li>The <em>books.xml</em> file is read in and parsed using the <code><a rel="noreferrer noopener" href="https://docs.python.org/3/library/xml.etree.elementtree.html" data-type="URL" data-id="https://docs.python.org/3/library/xml.etree.elementtree.html" target="_blank">eTree parse()</a></code> function. The results are saved to <code>all_books</code>.</li> <li>Next, a <a rel="noreferrer noopener" href="https://blog.finxter.com/python-loops/" data-type="URL" data-id="https://blog.finxter.com/python-loops/" target="_blank">For</a> loop is instantiated. It traverses through each book in <code>all_books</code>. <ul> <li>Each book’s details are formatted into columns and output to the terminal.</li> </ul> </li> </ul> <p><strong>Output (snippet)</strong></p> <figure class="wp-block-table is-style-stripes"> <table> <tbody> <tr> <td>978-0393050714</td> <td>Johann Sebastian Bach</td> <td>$5.99</td> </tr> <tr> <td>978-1721260522</td> <td>Ludwig van Beethoven</td> <td>$9.99</td> </tr> <tr> <td>978-0679745822</td> <td>Johannes Brahms</td> <td>$7.99</td> </tr> <tr> <td>979-8653086533</td> <td>Frederic Chopin</td> <td>$7.99</td> </tr> <tr> <td>978-1580469036</td> <td>Claude Debussy</td> <td>$13.99</td> </tr> </tbody> </table> </figure> <hr class="wp-block-separator"/> <h2>Method 3: Use minidom</h2> <p class="has-global-color-8-background-color has-background">Minidom is a smaller version of DOM and comes with an API similar to other programming languages. However, feedback indicates this method is slow and a memory hogger. </p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="5" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">with open("books.xml",'r') as fp: data = fp.read() i = 0 all_books = xml.dom.minidom.parseString(data) for book in all_books.getElementsByTagName('book'): isbn = all_books.getElementsByTagName('isbn')[i].firstChild.nodeValue title = all_books.getElementsByTagName('title')[i].firstChild.nodeValue price = all_books.getElementsByTagName('price')[i].firstChild.nodeValue print ("{:<20} {:<25} {:<20}".format(isbn, title, price)) i +=1</pre> <ul> <li>The <em>books.xml</em> file is opened, and a file object, <code>fp</code> is created. <ul> <li>The contents of this file are read in and saved to <code>data</code>.</li> </ul> </li> <li>A counter variable <code>i</code> is created to loop through <code>all_books</code> and is assigned the value 0.</li> <li>Then <code>data</code> is read and parsed. The results save to <code>all_books</code>.</li> <li>A <a rel="noreferrer noopener" href="https://blog.finxter.com/python-loops/" data-type="URL" data-id="https://blog.finxter.com/python-loops/" target="_blank">For</a> loop is instantiated. It traverses through each book in <code>all_books</code>. <ul> <li>Four (4) variables are used to locate and save the appropriate values.</li> <li>They are formatted and output to the terminal in columns.</li> <li>The counter variable is increased by one (1).</li> </ul> </li> </ul> <p><strong>Output (snippet)</strong></p> <figure class="wp-block-table is-style-stripes"> <table> <tbody> <tr> <td>978-0393050714</td> <td>Johann Sebastian Bach</td> <td>$5.99</td> </tr> <tr> <td>978-1721260522</td> <td>Ludwig van Beethoven</td> <td>$9.99</td> </tr> <tr> <td>978-0679745822</td> <td>Johannes Brahms</td> <td>$7.99</td> </tr> <tr> <td>979-8653086533</td> <td>Frederic Chopin</td> <td>$7.99</td> </tr> <tr> <td>978-1580469036</td> <td>Claude Debussy</td> <td>$13.99</td> </tr> </tbody> </table> </figure> <hr class="wp-block-separator"/> <h2>Method 4: Use Pandas read_xml()</h2> <p class="has-global-color-8-background-color has-background">The <a rel="noreferrer noopener" href="https://blog.finxter.com/pandas-quickstart/" data-type="URL" data-id="https://blog.finxter.com/pandas-quickstart/" target="_blank">Pandas</a> library has an option to read in an XML file and convert it to a DataFrame in one easy step.</p> <pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="1" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">df = pd.read_xml('books.xml') print(df)</pre> <figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"> <div class="wp-block-embed__wrapper"> <iframe loading="lazy" title="Reading and writing XML with Pandas" width="780" height="439" src="https://www.youtube.com/embed/zIzH5GfHHD8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </div> </figure> <ul> <li>The <em>books.xml</em> file is read in and saved to the DataFrame <code>df</code>.</li> <li>The output automatically formats into columns (including a header row) and is output to the terminal.</li> </ul> <p><strong>Output (snippet)</strong></p> <figure class="wp-block-table is-style-stripes"> <table> <tbody> <tr> <td></td> <td class="has-text-align-right" data-align="right">isbn</td> <td class="has-text-align-right" data-align="right">title</td> <td class="has-text-align-right" data-align="right">price</td> </tr> <tr> <td>0</td> <td class="has-text-align-right" data-align="right">978-0393050714</td> <td class="has-text-align-right" data-align="right">Johann Sebastian Bach</td> <td class="has-text-align-right" data-align="right">$5.99</td> </tr> <tr> <td>1</td> <td class="has-text-align-right" data-align="right">978-1721260522</td> <td class="has-text-align-right" data-align="right">Ludwig van Beethoven</td> <td class="has-text-align-right" data-align="right">$9.99</td> </tr> <tr> <td>2</td> <td class="has-text-align-right" data-align="right">978-0679745822</td> <td class="has-text-align-right" data-align="right">Johannes Brahms</td> <td class="has-text-align-right" data-align="right">$7.99</td> </tr> <tr> <td>3</td> <td class="has-text-align-right" data-align="right">979-8653086533</td> <td class="has-text-align-right" data-align="right">Frederic Chopin</td> <td class="has-text-align-right" data-align="right">$7.99</td> </tr> <tr> <td>4</td> <td class="has-text-align-right" data-align="right">978-1580469036</td> <td class="has-text-align-right" data-align="right">Claude Debussy</td> <td class="has-text-align-right" data-align="right">$13.99</td> </tr> </tbody> </table> </figure> <hr class="wp-block-separator"/> <h2>Appendix Data</h2> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""><?xml version="1.0"?> <catalog> <book> <isbn>978-0393050714</isbn> <title>Johann Sebastian Bach</title> <price>$5.99</price> </book> <book> <isbn>978-1721260522</isbn> <title>Ludwig van Beethoven</title> <price>$9.99</price> </book> <book> <isbn>978-0679745822</isbn> <title>Johannes Brahms</title> <price>$7.99</price> </book> <book> <isbn>979-8653086533</isbn> <title>Frederic Chopin</title> <price>$7.99</price> </book> <book> <isbn>978-1580469036</isbn> <title>Claude Debussy</title> <price>$13.99</price> </book> <book> <isbn>978-0520043176</isbn> <title>Joseph Haydn</title> <price>$25.99</price> </book> <book> <isbn>978-1981659968</isbn> <title>Wolfgang Amadeus Mozart</title> <price>$8.99</price> </book> <book> <isbn>978-1482379990</isbn> <title>Franz Schubert</title> <price>$26.99</price> </book> <book> <isbn>978-0486257488</isbn> <title>Robert Schumann</title> <price>$14.99</price> </book> <book> <isbn>978-0486442723</isbn> <title>Peter Tchaikovsky</title> <price>$12.95</price> </book> </catalog></pre> <hr class="wp-block-separator"/> <h2>Summary</h2> <p>After reviewing the above methods in conjunction with Arman’s requirements, we decide that Method 4 best meets his needs.</p> <p>Problem Solved! Happy Coding!</p> <hr class="wp-block-separator"/> </div> https://www.sickgaming.net/blog/2022/04/22/how-to-read-an-xls-file-in-python/ |