[Tut] Python Regex Split


<div><p>Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!</p>
<p>This article is all about the <strong>re.split(pattern, string)</strong> method of Python’s <a rel="noreferrer noopener" target="_blank" href="https://docs.python.org/3/library/re.html">re library</a>.</p>
<p>Let’s answer the following question:</p>
<h2>How Does re.split() Work in Python?</h2>
<p><strong>The re.split(pattern, string, maxsplit=0, flags=0) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string at those matches.</strong></p>
<p>Here’s a minimal example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> string = 'Learn Python with\t Finxter!'
>>> re.split(r'\s+', string)
['Learn', 'Python', 'with', 'Finxter!']</pre>
<p>The string contains four words that are separated by whitespace characters (in particular, the space <code>' '</code> and the tab character <code>'\t'</code>). The regular expression <code>\s+</code> matches every run of one or more consecutive whitespace characters. The matched substrings serve as delimiters, and the result is the string divided along those delimiters.</p>
<p>But that’s not all! Let’s have a look at the formal definition of the split method.</p>
<p><strong>Specification</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.split(pattern, string, maxsplit=0, flags=0)</pre>
<p>The method has four arguments—two of which are optional.</p>
<ul>
<li><strong>pattern</strong>: the regular expression pattern you want to use as a delimiter.</li>
<li><strong>string</strong>: the text you want to break up into a list of strings.</li>
<li><strong>maxsplit</strong> (optional argument): the maximum number of split operations, so the returned list has at most maxsplit + 1 elements. The default value is 0, which means there is no limit and the string is split at every match.</li>
<li><strong>flags </strong>(optional argument): a more advanced modifier that allows you to customize the behavior of the function. By default, no flags are set. Want to know <a href="https://blog.finxter.com/python-regex-flags/">how to use those flags? Check out this detailed article</a> on the Finxter blog.</li>
</ul>
<p>The first and second arguments are required. The third and fourth arguments are optional. </p>
<p>You’ll learn about those arguments in more detail later. </p>
<p><strong>Return Value:</strong></p>
<p>The regex split method returns a list of substrings obtained by using the regex as a delimiter.</p>
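<p>As a rough illustration of the specification, the following sketch passes all four arguments in one call. The sample text and the separator pattern are made up for this example and are not part of the original article.</p>

```python
import re

# Hypothetical sample text: items separated by the word "and" in mixed case.
text = 'apples AND pears and plums And figs'

# pattern and string are required; maxsplit and flags are optional and
# are passed as keyword arguments here for readability.
parts = re.split(r'\s+and\s+', text, maxsplit=2, flags=re.IGNORECASE)

print(parts)  # ['apples', 'pears', 'plums And figs']
```

<p>Only two splits are performed because of maxsplit=2, so the third "And" stays inside the last list element.</p>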
<h2>Regex Split Minimal Example</h2>
<p>Let’s study some more examples—from simple to more complex.</p>
<p>The easiest use is with only two arguments: the delimiter regex and the string to be split.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> string = 'fgffffgfgPythonfgisfffawesomefgffg'
>>> re.split('[fg]+', string)
['', 'Python', 'is', 'awesome', '']</pre>
<p>You use an arbitrary number of ‘f’ or ‘g’ characters as regular expression delimiters. How do you accomplish this? By combining the character class regex <code>[A]</code> and the one-or-more regex <code>A+</code> into the following regex: <code>[fg]+</code>. The strings in between are added to the return list.</p>
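<p>The empty strings at both ends of the result appear because the string starts and ends with a delimiter run. If you only care about the words, one possible way to drop them is a list comprehension, sketched here with the same input:</p>

```python
import re

string = 'fgffffgfgPythonfgisfffawesomefgffg'

# The string begins and ends with a run of 'f'/'g' characters, so
# re.split() reports an empty string on each side.
raw_parts = re.split(r'[fg]+', string)

# Keep only the non-empty pieces.
words = [part for part in raw_parts if part]

print(raw_parts)  # ['', 'Python', 'is', 'awesome', '']
print(words)      # ['Python', 'is', 'awesome']
```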
<h2>How to Use the maxsplit Argument?</h2>
<p>What if you don’t want to split the whole string but only a limited number of times? Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'
>>> re.split('-', string, maxsplit=5)
['a', 'bird', 'in', 'the', 'hand', 'is-worth-two-in-the-bush']
>>> re.split('-', string, maxsplit=2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']</pre>
<p>We use the simple delimiter regex <code>-</code> to divide the string into substrings. In the first call, we set maxsplit=5 to obtain six list elements. In the second call, we set maxsplit=2 to obtain three list elements. Can you see the pattern?</p>
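<p>To make the relationship explicit, this small sketch records the length of the result for several maxsplit values. The particular values tried are arbitrary choices for illustration.</p>

```python
import re

string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'

lengths = {}
for maxsplit in (1, 2, 5, 50):
    parts = re.split('-', string, maxsplit=maxsplit)
    # At most `maxsplit` splits happen, so the list holds at most
    # maxsplit + 1 elements (fewer if the string runs out of delimiters).
    lengths[maxsplit] = len(parts)

print(lengths)  # {1: 2, 2: 3, 5: 6, 50: 11}
```

<p>The string contains only ten hyphens, so a maxsplit of 50 still yields just eleven elements.</p>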
<p>You can also use positional arguments to save some characters:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.split('-', string, 2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']</pre>
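<p>But as many coders don’t know about the maxsplit argument, you probably should use the keyword argument for readability. Python 3.13 and later also deprecate passing maxsplit and flags to re.split() as positional arguments.</p>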
<h2>How to Use the Optional Flag Argument?</h2>
<p>As you’ve seen in the specification, the re.split() method comes with an optional fourth <code>flags</code> argument:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">re.split(pattern, string, maxsplit=0, flags=0)</pre>
<p>What’s the purpose of the <a href="https://blog.finxter.com/python-regex-flags/">flags argument</a>?</p>
<p>Flags allow you to control the regular expression engine. Because regular expressions are so powerful, flags are a useful way of switching certain features on and off (for example, whether to ignore capitalization when matching your regex). </p>
<figure class="wp-block-table is-style-stripes">
<table>
<tbody>
<tr>
<td><strong>Syntax</strong></td>
<td><strong>Meaning</strong></td>
</tr>
<tr>
<td> <strong>re.ASCII</strong></td>
<td>If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters, as the name suggests. </td>
</tr>
<tr>
<td> <strong>re.A</strong> </td>
<td>Same as re.ASCII </td>
</tr>
<tr>
<td> <strong>re.DEBUG</strong> </td>
<td>If you use this flag, Python will print some useful information to the shell that helps you debug your regex. </td>
</tr>
<tr>
<td> <strong>re.IGNORECASE</strong> </td>
<td>If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z]. </td>
</tr>
<tr>
<td> <strong>re.I</strong> </td>
<td>Same as re.IGNORECASE </td>
</tr>
<tr>
<td> <strong>re.LOCALE</strong> </td>
<td>Avoid this flag. It makes \w, \W, \b, \B and case-insensitive matching depend on your current locale, and it can only be used with bytes patterns. The locale mechanism isn’t reliable, and the official documentation discourages its use. </td>
</tr>
<tr>
<td> <strong>re.L</strong> </td>
<td>Same as re.LOCALE </td>
</tr>
<tr>
<td> <strong>re.MULTILINE</strong> </td>
<td>This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’, which now also matches at the end of each line in a multi-line string. </td>
</tr>
<tr>
<td> <strong>re.M</strong> </td>
<td>Same as re.MULTILINE </td>
</tr>
<tr>
<td> <strong>re.DOTALL</strong> </td>
<td>Without this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character. </td>
</tr>
<tr>
<td> <strong>re.S</strong> </td>
<td>Same as re.DOTALL </td>
</tr>
<tr>
<td> <strong>re.VERBOSE</strong> </td>
<td>To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace in the pattern is ignored (except inside a character class or when escaped), and everything from an unescaped ‘#’ to the end of the line is treated as a comment. </td>
</tr>
<tr>
<td> <strong>re.X</strong> </td>
<td>Same as re.VERBOSE </td>
</tr>
</tbody>
</table>
</figure>
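<p>To see how two of these flags change the behavior of re.split(), consider the following sketch. The input strings and patterns are invented for this illustration.</p>

```python
import re

# Hypothetical two-line input; each line starts with a "# " marker.
text = '# first line\n# second line'

# Without re.MULTILINE, '^' matches only at the very start of the string.
without_flag = re.split(r'^# ', text)
print(without_flag)  # ['', 'first line\n# second line']

# With re.MULTILINE, '^' also matches right after each newline.
with_flag = re.split(r'^# ', text, flags=re.MULTILINE)
print(with_flag)  # ['', 'first line\n', 'second line']

# re.VERBOSE lets you spread a pattern over several lines with comments.
separator = r'''
    \s*    # optional whitespace before the separator
    [,;]   # a comma or a semicolon
    \s*    # optional whitespace after the separator
'''
verbose_parts = re.split(separator, 'a , b;c', flags=re.VERBOSE)
print(verbose_parts)  # ['a', 'b', 'c']
```

<p>Several flags can be combined with the bitwise OR operator, for example flags=re.I | re.M.</p>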
<p>Here’s how you’d use it in a practical example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> text = 'theXXXYYYrussiansXXxyareyYxcoming'  # sample input chosen to match the outputs shown
>>> re.split('[xy]+', text, flags=re.I)
['the', 'russians', 'are', 'coming']</pre>
<p>Although your regex is lowercase, we ignore the capitalization by using the flag re.I, which is short for re.IGNORECASE. If we didn’t, the result would be quite different:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.split('[xy]+', text)
['theXXXYYYrussiansXX', 'are', 'Y', 'coming']</pre>
<p>As the character class [xy] only contains the lowercase characters ‘x’ and ‘y’, their uppercase variants appear in the returned list rather than being used as delimiters.</p>
<h2>What’s the Difference Between re.split() and string.split() Methods in Python?</h2>
<p>The method re.split() is much more powerful. The re.split(pattern, string) method can split a string along all occurrences of a matched pattern, and the pattern can be arbitrarily complicated. The string.split(delimiter) method also splits a string into substrings along the delimiter, but the delimiter must be a literal string rather than a pattern. </p>
<p>An example where the more powerful re.split() method is superior is in splitting a text along any whitespace characters:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely Frost
    Upon the sweetest flower of all the field.
'''

# Split along every run of one or more whitespace characters.
print(re.split(r'\s+', text))

'''
['', 'Ha!', 'let', 'me', 'see', 'her:', 'out,', 'alas!', "he's", 'cold:', 'Her', 'blood', 'is', 'settled,', 'and', 'her', 'joints', 'are', 'stiff;', 'Life', 'and', 'these', 'lips', 'have', 'long', 'been', 'separated:', 'Death', 'lies', 'on', 'her', 'like', 'an', 'untimely', 'Frost', 'Upon', 'the', 'sweetest', 'flower', 'of', 'all', 'the', 'field.', '']
'''</pre>
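<p>The re.split() method divides the string along any run of one or more whitespace characters. You couldn’t achieve such a result with string.split(delimiter) because an explicit delimiter must be a fixed literal string. Calling string.split() without any argument does split on runs of whitespace, but that special case can’t be extended to other patterns.</p>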
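<p>A short side-by-side comparison may help here. The sample strings are made up for this sketch.</p>

```python
import re

text = 'one  two\tthree\nfour'

# A fixed delimiter splits at every single space and nothing else.
by_space = text.split(' ')
print(by_space)  # ['one', '', 'two\tthree\nfour']

# Called without an argument, str.split() splits on runs of whitespace.
by_whitespace = text.split()
print(by_whitespace)  # ['one', 'two', 'three', 'four']

# re.split() gives the same result for whitespace here ...
by_regex = re.split(r'\s+', text)
print(by_regex)  # ['one', 'two', 'three', 'four']

# ... and also handles delimiters that no single literal string can describe.
mixed = 'one, two;three ,four'
by_pattern = re.split(r'\s*[,;]\s*', mixed)
print(by_pattern)  # ['one', 'two', 'three', 'four']
```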
<h2>Related Re Methods</h2>
<p>There are five important regular expression methods which you should master:</p>
<ul>
<li>The <strong>re.findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The <strong>re.search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li>
<li>The <strong>re.match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The <strong>re.fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li>
<li>The <strong>re.compile(pattern)</strong> method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li>
</ul>
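<p>Here is a compact sketch that calls each of the five methods on the same input, so you can compare their return values. The sample sentence is an assumption made for this example.</p>

```python
import re

text = 'Order 66 shipped 12 items'

# findall: every non-overlapping match as a list of strings.
all_numbers = re.findall(r'\d+', text)
print(all_numbers)  # ['66', '12']

# search: a match object for the first match anywhere in the string.
first_number = re.search(r'\d+', text).group()
print(first_number)  # 66

# match: only succeeds if the pattern matches at the start of the string.
at_start = re.match(r'\d+', text)
print(at_start)  # None

# fullmatch: only succeeds if the pattern matches the whole string.
whole = re.fullmatch(r'[\w ]+', text)
print(whole is not None)  # True

# compile: build a reusable regex object, here used for splitting.
digits = re.compile(r'\d+')
pieces = digits.split(text)
print(pieces)  # ['Order ', ' shipped ', ' items']
```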
<p>These five methods cover most of what you need to know to get started with Python’s regular expression functionality.</p>
<h2>Where to Go From Here?</h2>
<p><strong>You’ve learned about the re.split(pattern, string) method that divides the string along the matched pattern occurrences and returns a list of substrings.</strong></p>
</div>


https://www.sickgaming.net/blog/2020/01/...gex-split/