{"id":107829,"date":"2020-01-22T20:40:29","date_gmt":"2020-01-22T20:40:29","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=5830"},"modified":"2020-01-22T20:40:29","modified_gmt":"2020-01-22T20:40:29","slug":"python-regex-split","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2020\/01\/22\/python-regex-split\/","title":{"rendered":"Python Regex Split"},"content":{"rendered":"<p>Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!<\/p>\n<p>This article is all about the <strong>re.split(pattern, string)<\/strong> method of Python&#8217;s\u00a0<a rel=\"noreferrer noopener\" target=\"_blank\" href=\"https:\/\/docs.python.org\/3\/library\/re.html\">re library<\/a>.<\/p>\n<p>Let&#8217;s answer the following question:<\/p>\n<h2>How Does re.split() Work in Python?<\/h2>\n<p><strong>The re.split(pattern, string, maxsplit=0, flags=0) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string at those matches.<\/strong><\/p>\n<p>Here&#8217;s a minimal example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> string = 'Learn Python with\\t Finxter!'\n>>> re.split('\\s+', string)\n['Learn', 'Python', 'with', 'Finxter!']<\/pre>\n<p>The string contains four words that are separated by whitespace characters (in particular: the space character &#8216; &#8217; and the tab character &#8216;\\t&#8217;). You use the regular expression &#8216;\\s+&#8217; to match every run of one or more consecutive whitespace characters. The matched substrings serve as delimiters. 
The result is the string divided along those delimiters.<\/p>\n<p>But that&#8217;s not all! Let&#8217;s have a look at the formal definition of the split method.<\/p>\n<p><strong>Specification<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">re.split(pattern, string, maxsplit=0, flags=0)<\/pre>\n<p>The method has four arguments: the first two are required, the last two are optional.<\/p>\n<ul>\n<li><strong>pattern<\/strong>: the regular expression pattern you want to use as a delimiter.<\/li>\n<li><strong>string<\/strong>: the text you want to break up into a list of strings.<\/li>\n<li><strong>maxsplit<\/strong> (optional argument): the maximum number of split operations, so the returned list has at most maxsplit + 1 elements. By default, maxsplit is 0, which means that there is no limit.<\/li>\n<li><strong>flags <\/strong>(optional argument): a more advanced modifier that allows you to customize the behavior of the function. By default, no flags are set. Want to know <a href=\"https:\/\/blog.finxter.com\/python-regex-flags\/\">how to use those flags? Check out this detailed article<\/a> on the Finxter blog.<\/li>\n<\/ul>\n<p>You&#8217;ll learn about those arguments in more detail later. 
<\/p>\n<p><strong>Return Value:<\/strong><\/p>\n<p>The regex split method returns a list of substrings obtained by using the regex as a delimiter.<\/p>\n<h2>Regex Split Minimal Example<\/h2>\n<p>Let&#8217;s study some more examples&#8212;from simple to more complex.<\/p>\n<p>The easiest use is with only two arguments: the delimiter regex and the string to be split.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> string = 'fgffffgfgPythonfgisfffawesomefgffg'\n>>> re.split('[fg]+', string)\n['', 'Python', 'is', 'awesome', '']<\/pre>\n<p>You use any run of one or more &#8216;f&#8217; or &#8216;g&#8217; characters as the delimiter. How do you accomplish this? By combining the character class regex <code>[A]<\/code> and the one-or-more regex <code>A+<\/code> into the following regex: <code>[fg]+<\/code>. The strings in between are added to the returned list. Because the string starts and ends with a delimiter, the list starts and ends with an empty string.<\/p>\n<h2>How to Use the maxsplit Argument?<\/h2>\n<p>What if you don&#8217;t want to split the whole string but only a limited number of times? Here&#8217;s an example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'\n>>> re.split('-', string, maxsplit=5)\n['a', 'bird', 'in', 'the', 'hand', 'is-worth-two-in-the-bush']\n>>> re.split('-', string, maxsplit=2)\n['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']<\/pre>\n<p>We use the simple delimiter regex &#8216;-&#8217; to divide the string into substrings. In the first method call, we set maxsplit=5 to obtain six list elements. 
In the second method call, we set maxsplit=2 to obtain three list elements. Can you see the pattern? The list has at most maxsplit + 1 elements.<\/p>\n<p>You can also use positional arguments to save some characters:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.split('-', string, 2)\n['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']<\/pre>\n<p>But as many coders don&#8217;t know about the maxsplit argument, you probably should use the keyword argument for readability. Newer Python versions (3.13 and later) also deprecate passing maxsplit positionally.<\/p>\n<h2>How to Use the Optional Flag Argument?<\/h2>\n<p>As you&#8217;ve seen in the specification, the re.split() method comes with an optional fourth &#8216;flags&#8217; argument:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">re.split(pattern, string, maxsplit=0, flags=0)<\/pre>\n<p>What&#8217;s the purpose of the <a href=\"https:\/\/blog.finxter.com\/python-regex-flags\/\">flags argument<\/a>?<\/p>\n<p>Flags allow you to control the regular expression engine. Because regular expressions are so powerful, flags are a useful way of switching certain features on and off (for example, whether to ignore capitalization when matching your regex). <\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table>\n<tbody>\n<tr>\n<td><strong>Syntax<\/strong><\/td>\n<td><strong>Meaning<\/strong><\/td>\n<\/tr>\n<tr>\n<td> <strong>re.ASCII<\/strong><\/td>\n<td>If you don&#8217;t use this flag, the special Python regex symbols \\w, \\W, \\b, \\B, \\d, \\D, \\s and \\S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters, as the name suggests. 
<\/td>\n<\/tr>\n<tr>\n<td> <strong>re.A<\/strong> <\/td>\n<td>Same as re.ASCII <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.DEBUG<\/strong> <\/td>\n<td>If you use this flag, Python will print some useful information to the shell that helps you debug your regex. <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.IGNORECASE<\/strong> <\/td>\n<td>If you use this flag, the regex engine will perform case-insensitive matching. So if you&#8217;re searching for [A-Z], it will also match [a-z]. <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.I<\/strong> <\/td>\n<td>Same as re.IGNORECASE <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.LOCALE<\/strong> <\/td>\n<td>Avoid this flag. Its use is discouraged: the idea was to perform case-insensitive matching depending on your current locale, but the locale mechanism isn&#8217;t reliable. <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.L<\/strong> <\/td>\n<td>Same as re.LOCALE <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.MULTILINE<\/strong> <\/td>\n<td>This flag switches on the following feature: the start-of-the-string regex &#8216;^&#8217; matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex &#8216;$&#8217;, which now also matches at the end of each line in a multi-line string. <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.M<\/strong> <\/td>\n<td>Same as re.MULTILINE <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.DOTALL<\/strong> <\/td>\n<td>Without this flag, the dot regex &#8216;.&#8217; matches all characters except the newline character &#8216;\\n&#8217;. Switch on this flag to really match all characters, including the newline character. <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.S<\/strong> <\/td>\n<td>Same as re.DOTALL <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.VERBOSE<\/strong> <\/td>\n<td>To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. 
This is possible with this flag: most whitespace in the regex is ignored, as is everything from an unescaped &#8216;#&#8217; character to the end of its line. <\/td>\n<\/tr>\n<tr>\n<td> <strong>re.X<\/strong> <\/td>\n<td>Same as re.VERBOSE <\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>Here&#8217;s how you&#8217;d use it in a practical example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> text = 'theXXXYYYrussiansXXxarexYycoming'\n>>> re.split('[xy]+', text, flags=re.I)\n['the', 'russians', 'are', 'coming']<\/pre>\n<p>Although the regex is lowercase, the flag re.I (short for re.IGNORECASE) makes the match ignore capitalization. Without the flag, the result would be quite different:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.split('[xy]+', text)\n['theXXXYYYrussiansXX', 'are', 'Y', 'coming']<\/pre>\n<p>As the character class [xy] only contains the lowercase characters &#8216;x&#8217; and &#8216;y&#8217;, their uppercase variants appear in the returned list rather than being used as delimiters.<\/p>\n<h2>What&#8217;s the Difference Between re.split() and string.split() Methods in Python?<\/h2>\n<p>The method re.split() is much more powerful. The re.split(pattern, string) method can split a string along all occurrences of a matched pattern. The pattern can be arbitrarily complicated. This is in contrast to the string.split(delimiter) method which also splits a string into substrings along the delimiter. However, the delimiter must be a fixed, literal string. 
<\/p>\n<p>An example where the more powerful re.split() method is superior is in splitting a text along any whitespace characters:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import re\n\ntext = '''\nHa! let me see her: out, alas! he's cold:\nHer blood is settled, and her joints are stiff;\nLife and these lips have long been separated:\nDeath lies on her like an untimely Frost\nUpon the sweetest flower of all the field.\n'''\n\nprint(re.split('\\s+', text))\n'''\n['', 'Ha!', 'let', 'me', 'see', 'her:', 'out,', 'alas!', \"he's\", 'cold:', 'Her', 'blood', 'is', 'settled,', 'and', 'her', 'joints', 'are', 'stiff;', 'Life', 'and', 'these', 'lips', 'have', 'long', 'been', 'separated:', 'Death', 'lies', 'on', 'her', 'like', 'an', 'untimely', 'Frost', 'Upon', 'the', 'sweetest', 'flower', 'of', 'all', 'the', 'field.', '']\n'''<\/pre>\n<p>The re.split() method divides the string along every run of one or more whitespace characters. You couldn&#8217;t achieve this by passing an explicit delimiter to string.split(delimiter), because the delimiter must be a fixed string. (Called without arguments, string.split() also splits on runs of whitespace, but that is the only pattern it supports.)<\/p>\n<h2>Related Re Methods<\/h2>\n<p>There are five important regular expression methods which you should master:<\/p>\n<ul>\n<li>The <strong>re.findall(pattern, string)<\/strong> method returns a list of string matches. Read more in <a href=\"https:\/\/blog.finxter.com\/python-re-findall\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.search(pattern, string)<\/strong> method returns a match object of the first match. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-search\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.match(pattern, string)<\/strong> method returns a match object if the regex matches at the beginning of the string. 
Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-match\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.fullmatch(pattern, string)<\/strong> method returns a match object if the regex matches the whole string. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-fullmatch\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.compile(pattern)<\/strong> method prepares the regular expression pattern\u2014and returns a regex object which you can use multiple times in your code. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-compile\/\">our blog tutorial<\/a>.<\/li>\n<\/ul>\n<p>These five methods are 80% of what you need to know to get started with Python&#8217;s regular expression functionality.<\/p>\n<h2>Where to Go From Here?<\/h2>\n<p><strong>You&#8217;ve learned about the re.split(pattern, string) method that divides the string along the matched pattern occurrences and returns a list of substrings.<\/strong><\/p>\n<p>Learning Python is hard. But if you cheat, it isn&#8217;t as hard as it has to be:<\/p>\n<p><a href=\"https:\/\/blog.finxter.com\/subscribe\/\">Download 8 Free Python Cheat Sheets now!<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens! This article is all about the re.split(pattern, string) method of Python&#8217;s\u00a0re library. 
Let&#8217;s answer the following question: [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-107829","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/107829","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=107829"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/107829\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=107829"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=107829"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=107829"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}