[Tut] Python Regex Split - xSicKxBot - 01-23-2020 Python Regex Split <div><p>Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!</p> <p>This article is all about the <strong>re.split(pattern, string)</strong> method of Python’s <a rel="noreferrer noopener" target="_blank" href="https://docs.python.org/3/library/re.html">re library</a>.</p> <p>Let’s answer the following question:</p> <h2>How Does re.split() Work in Python?</h2> <p><strong>The <strong>re.split(pattern, string, maxsplit=0, flags=0)</strong> method finds all non-overlapping matches of the pattern in the string, uses them as delimiters, and returns the list of substrings between them.</strong></p> <p>Here’s a minimal example:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> string = 'Learn Python with\t Finxter!'
>>> re.split(r'\s+', string)
['Learn', 'Python', 'with', 'Finxter!']</pre> <p>The string contains four words separated by whitespace characters (here, the space <code>' '</code> and the tab character <code>'\t'</code>). The regular expression <code>\s+</code> matches every run of one or more consecutive whitespace characters. The matched substrings serve as delimiters, and the result is the string divided along those delimiters.</p> <p>But that’s not all! 
Let’s have a look at the formal definition of the split method.</p> <p><strong>Specification</strong></p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.split(pattern, string, maxsplit=0, flags=0)</pre> <p>The method takes four arguments. The first two are required; the last two are optional.</p> <ul> <li><strong>pattern</strong>: the regular expression pattern you want to use as a delimiter.</li> <li><strong>string</strong>: the text you want to break up into a list of strings.</li> <li><strong>maxsplit</strong> (optional argument): the maximum number of split operations. With maxsplit=n, the returned list has at most n + 1 elements, and the last element holds the unsplit remainder of the string. The default, maxsplit=0, means there is no limit.</li> <li><strong>flags </strong>(optional argument): a more advanced modifier that allows you to customize the behavior of the function. By default, no flags are set. Want to know how to use those flags? Check out <a href="https://blog.finxter.com/python-regex-flags/">this detailed article</a> on the Finxter blog.</li> </ul> <p>You’ll learn about those arguments in more detail later. 
</p> <p><strong>Return Value:</strong></p> <p>The regex split method returns a list of substrings obtained by using the regex as a delimiter.</p> <h2>Regex Split Minimal Example</h2> <p>Let’s study some more examples, from simple to more complex.</p> <p>The easiest use is with only two arguments: the delimiter regex and the string to be split.</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> string = 'fgffffgfgPythonfgisfffawesomefgffg'
>>> re.split('[fg]+', string)
['', 'Python', 'is', 'awesome', '']</pre> <p>You use any run of one or more ‘f’ or ‘g’ characters as the delimiter. How do you accomplish this? By combining the character class regex <code>[A]</code> and the one-or-more regex <code>A+</code> into the following regex: <code>[fg]+</code>. The strings in between are added to the return list. The empty strings at both ends appear because the string starts and ends with a delimiter.</p> <h2>How to Use the maxsplit Argument?</h2> <p>What if you don’t want to split the whole string but only a limited number of times? Here’s an example:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'
>>> re.split('-', string, maxsplit=5)
['a', 'bird', 'in', 'the', 'hand', 'is-worth-two-in-the-bush']
>>> re.split('-', string, maxsplit=2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']</pre> <p>We use the simple delimiter regex <code>'-'</code> to divide the string into substrings. In the first method call, we set maxsplit=5 to obtain six list elements. In the second method call, we set maxsplit=2 to obtain three list elements. 
Can you see the pattern? The returned list has at most maxsplit + 1 elements.</p> <p>You can also pass maxsplit as a positional argument to save some characters:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.split('-', string, 2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']</pre> <p>But many coders don’t know about the maxsplit argument, and Python 3.13 deprecates passing it positionally, so you should prefer the keyword argument.</p> <h2>How to Use the Optional Flag Argument?</h2> <p>As you’ve seen in the specification, the re.split() method comes with an optional fourth ‘flags’ argument:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.split(pattern, string, maxsplit=0, flags=0)</pre> <p>What’s the purpose of the <a href="https://blog.finxter.com/python-regex-flags/">flags argument</a>?</p> <p>Flags let you control the regular expression engine by switching certain features on or off (for example, whether to ignore capitalization when matching your regex). </p> <figure class="wp-block-table is-style-stripes"> <table> <tbody> <tr> <td><strong>Syntax</strong></td> <td><strong>Meaning</strong></td> </tr> <tr> <td> <strong>re.ASCII</strong></td> <td>If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters, as the name suggests. </td> </tr> <tr> <td> <strong>re.A</strong> </td> <td>Same as re.ASCII </td> </tr> <tr> <td> <strong>re.DEBUG</strong> </td> <td>If you use this flag, Python will print some useful information to the shell that helps you debug your regex. 
</td> </tr> <tr> <td> <strong>re.IGNORECASE</strong> </td> <td>If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z]. </td> </tr> <tr> <td> <strong>re.I</strong> </td> <td>Same as re.IGNORECASE </td> </tr> <tr> <td> <strong>re.LOCALE</strong> </td> <td>Avoid this flag. Its use is discouraged: the idea was to make \w, \W, \b, \B and case-insensitive matching depend on your current locale, but the locale mechanism isn’t reliable, and in Python 3 the flag only works with bytes patterns. </td> </tr> <tr> <td> <strong>re.L</strong> </td> <td>Same as re.LOCALE </td> </tr> <tr> <td> <strong>re.MULTILINE</strong> </td> <td>This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now matches also at the end of each line in a multi-line string. </td> </tr> <tr> <td> <strong>re.M</strong> </td> <td>Same as re.MULTILINE </td> </tr> <tr> <td> <strong>re.DOTALL</strong> </td> <td>Without using this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character. </td> </tr> <tr> <td> <strong>re.S</strong> </td> <td>Same as re.DOTALL </td> </tr> <tr> <td> <strong>re.VERBOSE</strong> </td> <td>To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace in the pattern is ignored (except inside a character class or when escaped with a backslash), and everything from an unescaped ‘#’ to the end of the line is treated as a comment. 
</td> </tr> <tr> <td> <strong>re.X</strong> </td> <td>Same as re.VERBOSE </td> </tr> </tbody> </table> </figure> <p>Here’s how you’d use it in a practical example:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = 'theXXXYYYrussiansXXxareyYxcoming'  # assumed input: the original definition of text is missing
>>> re.split('[xy]+', text, flags=re.I)
['the', 'russians', 'are', 'coming']</pre> <p>Although your regex is lowercase, we ignore the capitalization by using the flag re.I, which is short for re.IGNORECASE. If we didn’t, the result would be quite different:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.split('[xy]+', text)
['theXXXYYYrussiansXX', 'are', 'Y', 'coming']</pre> <p>As the character class [xy] only contains the lowercase characters ‘x’ and ‘y’, their uppercase variants appear in the returned list rather than being used as delimiters.</p> <h2>What’s the Difference Between re.split() and string.split() Methods in Python?</h2> <p>The method re.split() is much more powerful. The re.split(pattern, string) method can split a string along all occurrences of a matched pattern. The pattern can be arbitrarily complicated. This is in contrast to the string.split(delimiter) method, which also splits a string into substrings along the delimiter. However, the delimiter must be a literal string, not a pattern. </p> <p>An example where the more powerful re.split() method is superior is in splitting a text along any whitespace characters:</p> <pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely Frost
    Upon the sweetest flower of all the field.
'''

# Split on every run of whitespace (spaces, tabs, newlines).
print(re.split(r'\s+', text))
'''
['', 'Ha!', 'let', 'me', 'see', 'her:', 'out,', 'alas!', "he's", 'cold:', 'Her', 'blood', 'is', 'settled,', 'and', 'her', 'joints', 'are', 'stiff;', 'Life', 'and', 'these', 'lips', 'have', 'long', 'been', 'separated:', 'Death', 'lies', 'on', 'her', 'like', 'an', 'untimely', 'Frost', 'Upon', 'the', 'sweetest', 'flower', 'of', 'all', 'the', 'field.', '']
'''</pre> <p>The re.split() method divides the string along any run of one or more whitespace characters. You couldn’t achieve this with string.split(delimiter) and an explicit delimiter, because the delimiter must be a fixed literal string. (Called without arguments, string.split() does split on runs of whitespace, but that special case doesn’t generalize to other patterns.)</p> <h2>Related Re Methods</h2> <p>There are five important regular expression methods which you should master:</p> <ul> <li>The <strong>re.findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li> <li>The <strong>re.search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li> <li>The <strong>re.match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li> <li>The <strong>re.fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li> <li>The <strong>re.compile(pattern)</strong> method prepares the regular expression pattern and returns a regex object which you can use multiple times in your code. 
Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li> </ul> <p>These five methods cover most of what you need to know to get started with Python’s regular expression functionality.</p> <h2>Where to Go From Here?</h2> <p><strong>You’ve learned about the re.split(pattern, string) method that divides the string along the matched pattern occurrences and returns a list of substrings.</strong></p> <p>Learning Python is hard. But if you cheat, it doesn’t have to be as hard:</p> <p><a href="https://blog.finxter.com/subscribe/">Download 8 Free Python Cheat Sheets now!</a></p> </div> https://www.sickgaming.net/blog/2020/01/22/python-regex-split/ |
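<p>As a recap, the following sketch pulls the tutorial’s main points into one runnable script: the effect of maxsplit, the re.IGNORECASE flag, and the difference from string.split(). It reuses the bird sentence from the maxsplit section; the strings <code>'oneXtwoythree'</code> and <code>padded</code> are made up for illustration and are not part of the original article.</p>

```python
import re

# Sentence reused from the maxsplit section of the tutorial.
sentence = 'a-bird-in-the-hand-is-worth-two-in-the-bush'

# maxsplit=n performs at most n splits, so the list has at most n + 1 items.
for n in (1, 2, 5):
    assert len(re.split('-', sentence, maxsplit=n)) == n + 1

first_two = re.split('-', sentence, maxsplit=2)
print(first_two)

# Illustrative string (an assumption, not from the article): the lowercase
# class [xy] matches 'X' only when re.IGNORECASE is switched on.
ignore_case = re.split('[xy]+', 'oneXtwoythree', flags=re.IGNORECASE)
case_sensitive = re.split('[xy]+', 'oneXtwoythree')
print(ignore_case)
print(case_sensitive)

# Illustrative padded string (also an assumption): re.split keeps the empty
# strings produced by leading and trailing delimiters, whereas str.split()
# without arguments drops them.
padded = '  Learn Python\twith  Finxter!  '
regex_parts = re.split(r'\s+', padded)
plain_parts = padded.split()
print(regex_parts)
print(plain_parts)
```

<p>Passing maxsplit and flags by keyword, as done here, keeps each call readable and avoids relying on their positional order.</p>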