[Tut] Python Regex Split - Printable Version

[Tut] Python Regex Split - Printable Version

+- Sick Gaming (https://www.sickgaming.net)
+-- Forum: Programming (https://www.sickgaming.net/forum-76.html)
+--- Forum: Python (https://www.sickgaming.net/forum-83.html)
+--- Thread: [Tut] Python Regex Split (/thread-93278.html)

[Tut] Python Regex Split - xSicKxBot - 01-23-2020

Python Regex Split

<div>Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!
This article is all about the re.split(pattern, string) method of Python’s <a rel="noreferrer noopener" target="_blank" href="https://docs.python.org/3/library/re.html">re library</a>.
Let’s answer the following question:
<h2>How Does re.split() Work in Python?</h2>
The re.split(pattern, string, maxsplit=0, flags=0) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those.
Here’s a minimal example:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> string = 'Learn Python with\t Finxter!'
>>> re.split('\s+', string)
['Learn', 'Python', 'with', 'Finxter!']</pre>
The string contains four words that are separated by whitespace characters (in particular: the empty space ‘ ‘ and the tabular character ‘\t’). You use the regular expression ‘\s+’ to match all occurrences of a positive number of subsequent whitespaces. The matched substrings serve as delimiters. The result is the string divided along those delimiters.
But that’s not all! Let’s have a look at the formal definition of the split method.
Specification
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.split(pattern, string, maxsplit=0, flags=0)</pre>
The method has four arguments—two of which are optional.
<ul>
<li>pattern: the regular expression pattern you want to use as a delimiter.</li>
<li>string: the text you want to break up into a list of strings.</li>
<li>maxsplit (optional argument): the maximum number of split operations (= the size of the returned list). Per default, the maxsplit argument is 0, which means that it’s ignored.</li>
<li>flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function. Per default the regex module does not consider any flags. Want to know <a href="https://blog.finxter.com/python-regex-flags/">how to use those flags? Check out this detailed article</a> on the Finxter blog.</li>
</ul>
The first and second arguments are required. The third and fourth arguments are optional. 
You’ll learn about those arguments in more detail later. 
Return Value:
The regex split method returns a list of substrings obtained by using the regex as a delimiter.
<h2>Regex Split Minimal Example</h2>
Let’s study some more examples—from simple to more complex.
The easiest use is with only two arguments: the delimiter regex and the string to be split.
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> string = 'fgffffgfgPythonfgisfffawesomefgffg'
>>> re.split('[fg]+', string)
['', 'Python', 'is', 'awesome', '']</pre>
You use an arbitrary number of ‘f’ or ‘g’ characters as regular expression delimiters. How do you accomplish this? By combining the character class regex <code>[A]</code> and the one-or-more regex <code>A+</code> into the following regex: <code>[fg]+</code>. The strings in between are added to the return list.
<h2>How to Use the maxsplit Argument?</h2>
What if you don’t want to split the whole string but only a limited number of times. Here’s an example:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'
>>> re.split('-', string, maxsplit=5)
['a', 'bird', 'in', 'the', 'hand', 'is-worth-two-in-the-bush']
>>> re.split('-', string, maxsplit=2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']</pre>
We use the simple delimiter regex ‘-‘ to divide the string into substrings. In the first method call, we set maxsplit=5 to obtain six list elements. In the second method call, we set maxsplit=3 to obtain three list elements. Can you see the pattern?
You can also use positional arguments to save some characters:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""> >>> re.split('-', string, 2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']</pre>
But as many coders don’t know about the maxsplit argument, you probably should use the keyword argument for readability.
<h2>How to Use the Optional Flag Argument?</h2>
As you’ve seen in the specification, the re.split() method comes with an optional fourth ‘flag’ argument:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.split(pattern, string, maxsplit=0, flags=0)</pre>
What’s the purpose of the <a href="https://blog.finxter.com/python-regex-flags/">flags argument</a>?
Flags allow you to control the regular expression engine. Because regular expressions are so powerful, they are a useful way of switching on and off certain features (for example, whether to ignore capitalization when matching your regex). 
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td>Syntax</td>
<td>Meaning</td>
</tr>
<tr>
<td> re.ASCII</td>
<td>If you don’t use this flag, the special Python regex symbols w, W, b, B, d, D, s and S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters — as the name suggests. </td>
</tr>
<tr>
<td> re.A </td>
<td>Same as re.ASCII </td>
</tr>
<tr>
<td> re.DEBUG </td>
<td>If you use this flag, Python will print some useful information to the shell that helps you debugging your regex. </td>
</tr>
<tr>
<td> re.IGNORECASE </td>
<td>If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z]. </td>
</tr>
<tr>
<td> re.I </td>
<td>Same as re.IGNORECASE </td>
</tr>
<tr>
<td> re.LOCALE </td>
<td>Don’t use this flag — ever. It’s depreciated—the idea was to perform case-insensitive matching depending on your current locale. But it isn’t reliable. </td>
</tr>
<tr>
<td> re.L </td>
<td>Same as re.LOCALE </td>
</tr>
<tr>
<td> re.MULTILINE </td>
<td>This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now matches also at the end of each line in a multi-line string. </td>
</tr>
<tr>
<td> re.M </td>
<td>Same as re.MULTILINE </td>
</tr>
<tr>
<td> re.DOTALL </td>
<td>Without using this flag, the dot regex ‘.’ matches all characters except the newline character ‘n’. Switch on this flag to really match all characters including the newline character. </td>
</tr>
<tr>
<td> re.S </td>
<td>Same as re.DOTALL </td>
</tr>
<tr>
<td> re.VERBOSE </td>
<td>To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: all whitespace characters and lines that start with the character ‘#’ are ignored in the regex. </td>
</tr>
<tr>
<td> re.X </td>
<td>Same as re.VERBOSE </td>
</tr>
</tbody>
</table>
</figure>
Here’s how you’d use it in a practical example:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.split('[xy]+', text, flags=re.I)
['the', 'russians', 'are', 'coming']</pre>
Although your regex is lowercase, we ignore the capitalization by using the flag re.I which is short for re.IGNORECASE. If we wouldn’t do it, the result would be quite different:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.split('[xy]+', text)
['theXXXYYYrussiansXX', 'are', 'Y', 'coming']</pre>
As the character class [xy] only contains lowerspace characters ‘x’ and ‘y’, their uppercase variants appear in the returned list rather than being used as delimiters.
<h2>What’s the Difference Between re.split() and string.split() Methods in Python?</h2>
The method re.split() is much more powerful. The re.split(pattern, string) method can split a string along all occurrences of a matched pattern. The pattern can be arbitrarily complicated. This is in contrast to the string.split(delimiter) method which also splits a string into substrings along the delimiter. However, the delimiter must be a normal string. 
An example where the more powerful re.split() method is superior is in splitting a text along any whitespace characters:
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re text = ''' Ha! let me see her: out, alas! he's cold: Her blood is settled, and her joints are stiff; Life and these lips have long been separated: Death lies on her like an untimely Frost Upon the sweetest flower of all the field. ''' print(re.split('\s+', text)) '''
['', 'Ha!', 'let', 'me', 'see', 'her:', 'out,', 'alas!', "he's", 'cold:', 'Her', 'blood', 'is', 'settled,', 'and', 'her', 'joints', 'are', 'stiff;', 'Life', 'and', 'these', 'lips', 'have', 'long', 'been', 'separated:', 'Death', 'lies', 'on', 'her', 'like', 'an', 'untimely', 'Frost', 'Upon', 'the', 'sweetest', 'flower', 'of', 'all', 'the', 'field.', ''] '''</pre>
The re.split() method divides the string along any positive number of whitespace characters. You couldn’t achieve such a result with string.split(delimiter) because the delimiter must be a constant-sized string.
<h2>Related Re Methods</h2>
There are five important regular expression methods which you should master:
<ul>
<li>The re.findall(pattern, string) method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The re.search(pattern, string) method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li>
<li>The re.match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The re.fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li>
<li>The re.compile(pattern) method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li>
</ul>
These five methods are 80% of what you need to know to get started with Python’s regular expression functionality.
<h2>Where to Go From Here?</h2>
You’ve learned about the re.split(pattern, string) method that divides the string along the matched pattern occurrences and returns a list of substrings.
Learning Python is hard. But if you cheat, it isn’t as hard as it has to be:
<a href="https://blog.finxter.com/subscribe/">Download 8 Free Python Cheat Sheets now!</a>
</div>

https://www.sickgaming.net/blog/2020/01/22/python-regex-split/