{"id":109429,"date":"2020-02-20T15:47:52","date_gmt":"2020-02-20T15:47:52","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=6316"},"modified":"2020-02-20T15:47:52","modified_gmt":"2020-02-20T15:47:52","slug":"python-regex-greedy-vs-non-greedy-quantifiers","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2020\/02\/20\/python-regex-greedy-vs-non-greedy-quantifiers\/","title":{"rendered":"Python Regex Greedy vs Non-Greedy Quantifiers"},"content":{"rendered":"<p>Ready to earn the black belt of your <a href=\"https:\/\/blog.finxter.com\/python-regex\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"regex superpower (opens in a new tab)\">regex superpower<\/a>? This tutorial shows you the subtle but important difference between greedy and non-greedy regex quantifiers. <\/p>\n<p>But first things first: <strong>what are &#8220;quantifiers&#8221; anyway?<\/strong> Great question &#8211; I&#8217;m glad you asked! So let&#8217;s dive into Python&#8217;s three main regex quantifiers.<\/p>\n<h2>Python Regex Quantifiers<\/h2>\n<p>The word &#8220;<a href=\"https:\/\/www.merriam-webster.com\/dictionary\/quantity\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">quantifier<\/a>&#8221; originates from latin: it&#8217;s meaning is <strong>quantus = how much \/ how often<\/strong>.<\/p>\n<p><strong>This is precisely what a regular expression quantifier means: you tell the regex engine how often you want to match a given pattern. <\/strong><\/p>\n<p>If you think you don&#8217;t define any quantifier, you do it implicitly: no quantifier means to match the regular expression exactly once.<\/p>\n<p>So what are the regex quantifiers in Python?<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table>\n<tbody>\n<tr>\n<td>Quantifier<\/td>\n<td>Meaning<\/td>\n<\/tr>\n<tr>\n<td><code>A?<\/code><\/td>\n<td>Match regular expression <code>A<\/code> zero or one times<\/td>\n<\/tr>\n<tr>\n<td><code>A*<\/code><\/td>\n<td>Match regular expression <code>A<\/code> zero or more times<\/td>\n<\/tr>\n<tr>\n<td><code>A+<\/code><\/td>\n<td>Match regular expression <code>A<\/code> one or more times<\/td>\n<\/tr>\n<tr>\n<td><code>A{m}<\/code><\/td>\n<td>Match regular expression <code>A<\/code> exactly m times<\/td>\n<\/tr>\n<tr>\n<td><code>A{m,n}<\/code><\/td>\n<td>Match regular expression <code>A<\/code> between m and n times (included)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>Note that in this tutorial, I assume you have at least a remote idea of what regular expressions actually are. If you haven&#8217;t, no problem, check out my <a rel=\"noreferrer noopener\" aria-label=\"detailed regex tutorial on this blog (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex\/\" target=\"_blank\">detailed regex tutorial on this blog<\/a>.<\/p>\n<p>You see in the table that the quantifiers <code>?<\/code>, <code>*<\/code>, <code>+<\/code>, <code>{m}<\/code>, and <code>{m,n}<\/code> define how often you repeat the matching of regex <code>A<\/code>. <\/p>\n<p>Let&#8217;s have a look at some examples&#8212;one for each quantifier:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('a?', 'aaaa')\n['a', 'a', 'a', 'a', '']\n>>> re.findall('a*', 'aaaa')\n['aaaa', '']\n>>> re.findall('a+', 'aaaa')\n['aaaa']\n>>> re.findall('a{3}', 'aaaa')\n['aaa']\n>>> re.findall('a{1,2}', 'aaaa')\n['aa', 'aa']<\/pre>\n<p>In each line, you try a different quantifier on the same text <code>'aaaa'<\/code>. And, interestingly, each line leads to a different output:<\/p>\n<ul>\n<li>The <a rel=\"noreferrer noopener\" aria-label=\"zero-or-one (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-question-mark\/\" target=\"_blank\">zero-or-one<\/a> regex <code>'a?'<\/code> matches four times one <code>'a'<\/code>. Note that it doesn&#8217;t match zero characters if it can avoid doing so.<\/li>\n<li>The <a rel=\"noreferrer noopener\" href=\"https:\/\/blog.finxter.com\/python-re-question-mark\/\" target=\"_blank\">zero-or-more<\/a> regex <code>'a*'<\/code> matches once four <code>'a'<\/code>s and consumes them. At the end of the string, it can still match the empty string.<\/li>\n<li>The <a rel=\"noreferrer noopener\" href=\"https:\/\/blog.finxter.com\/python-re-question-mark\/\" target=\"_blank\">one-or-more<\/a> regex <code>'a+'<\/code> matches once four <code>'a'<\/code>s. In contrast to the previous quantifier, it cannot match an empty string.<\/li>\n<li>The repeating regex <code>'a{3}'<\/code> matches up to three <code>'a'<\/code>s in a single run. It can do so only once.<\/li>\n<li>The repeating regex <code>'a{1,2}'<\/code> matches one or two <code>'a'<\/code>s. It tries to match as many as possible.<\/li>\n<\/ul>\n<p>You&#8217;ve learned the basic quantifiers of Python regular expressions. Now, it&#8217;s time to explore the meaning of the term greedy. Shall we?<\/p>\n<h2>Python Regex Greedy Match<\/h2>\n<p><strong>A greedy match means that the regex engine (the one which tries to find your pattern in the string) matches as many characters as possible.<\/strong><\/p>\n<p>For example, the regex <code>'a+'<\/code> will match as many <code>'a'<\/code>s as possible in your string <code>'aaaa'<\/code>. Although the substrings <code>'a'<\/code>, <code>'aa'<\/code>, <code>'aaa'<\/code> all match the regex <code>'a+'<\/code>, it&#8217;s not enough for the regex engine. It&#8217;s always hungry and tries to match even more.<\/p>\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/66.media.tumblr.com\/afcca057d2243b8de45c61035d32ae9a\/tumblr_mxpukxBkoH1qbt7i1o1_640.gifv\" alt=\"\" width=\"176\" height=\"228\"\/><\/figure>\n<p>In other words, the greedy quantifiers give you the <strong>longest match<\/strong> from a given position in the string.<\/p>\n<p>As it turns out, all default quantifiers <code>?<\/code>, <code>*<\/code>, <code>+<\/code>, <code>{m}<\/code>, and <code>{m,n}<\/code> you&#8217;ve learned above are greedy: they &#8220;consume&#8221; or match as many characters as possible so that the regex pattern is still satisfied.<\/p>\n<p>Here are the above examples again that all show how greedy the regex engine is:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('a?', 'aaaa')\n['a', 'a', 'a', 'a', '']\n>>> re.findall('a*', 'aaaa')\n['aaaa', '']\n>>> re.findall('a+', 'aaaa')\n['aaaa']\n>>> re.findall('a{3}', 'aaaa')\n['aaa']\n>>> re.findall('a{1,2}', 'aaaa')\n['aa', 'aa']<\/pre>\n<p>In all cases, a shorter match would also be valid. But as the regex engine is greedy per default, those are not enough for the regex engine.<\/p>\n<p>Okay, so how can we do a non-greedy match?<\/p>\n<h2>Python Regex Non-Greedy Match<\/h2>\n<p><strong>A non-greedy match means that the regex engine matches as few characters as possible&#8212;so that it still can match the pattern in the given string.<\/strong><\/p>\n<p>For example, the regex <code>'a+?'<\/code> will match as few <code>'a'<\/code>s as possible in your string <code>'aaaa'<\/code>. Thus, it matches the first character <code>'a'<\/code> and is done with it. Then, it moves on to the second character (which is also a match) and so on.<\/p>\n<p>In other words, the non-greedy quantifiers give you the <strong>shortest possible match<\/strong> from a given position in the string.<\/p>\n<p>You can make the default quantifiers <code>?<\/code>, <code>*<\/code>, <code>+<\/code>, <code>{m}<\/code>, and <code>{m,n}<\/code> non-greedy by appending a question mark symbol <code>'?'<\/code> to them: <code>??<\/code>, <code>*?<\/code>, <code>+?<\/code>, and <code>{m,n}?<\/code>. they &#8220;consume&#8221; or match as few characters as possible so that the regex pattern is still satisfied.<\/p>\n<p>Here are some examples that show how non-greedy matching works:<\/p>\n<h2>Non-Greedy Question Mark Operator (??)<\/h2>\n<p>Let&#8217;s start with the question mark (zero-or-one operator):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('a?', 'aaaa')\n['a', 'a', 'a', 'a', '']\n>>> re.findall('a??', 'aaaa')\n['', 'a', '', 'a', '', 'a', '', 'a', '']<\/pre>\n<p>In the first instance, you use the zero-or-one regex <code>'a?'<\/code>. It&#8217;s greedy so it matches one <code>'a'<\/code> character if possible. <\/p>\n<p>In the second instance, you use the non-greedy zero-or-one version <code>'a??'<\/code>. It matches zero <code>'a'<\/code>s if possible. Note that it moves from left to right so it matches the empty string and &#8220;consumes&#8221; it. Only then, it cannot match the empty string anymore so it is forced to match the first <code>'a'<\/code> character. But after that, it&#8217;s free to match the empty string again. This pattern of first matching the empty string and only then matching the <code>'a'<\/code> if it is absolutely needed repeats. That&#8217;s why this strange pattern occurs.<\/p>\n<h2>Non-Greedy Asterisk Operator (*?)<\/h2>\n<p>Let&#8217;s start with the asterisk (zero-or-more operator):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall('a*', 'aaaa')\n['aaaa', '']\n>>> re.findall('a*?', 'aaaa')\n['', 'a', '', 'a', '', 'a', '', 'a', '']<\/pre>\n<p>First, you use the zero-or-more asterisk regex <code>'a*'<\/code>. It&#8217;s greedy so it matches as many <code>'a'<\/code> characters as it can. <\/p>\n<p>Second, you use the non-greedy zero-or-one version <code>'a*?'<\/code>. Again, it matches zero <code>'a'<\/code>s if possible. Only if it has already matched zero characters at a certain position, it matches one character at that position, &#8220;consumes&#8221; it, and moves on.<\/p>\n<h2>Non-Greedy Plus Operator (+?)<\/h2>\n<p>Let&#8217;s start with the plus (one-or-more operator):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall('a+', 'aaaa')\n['aaaa']\n>>> re.findall('a+?', 'aaaa')\n['a', 'a', 'a', 'a']<\/pre>\n<p>First, you use the one-or-more plus regex <code>'a+'<\/code>. It&#8217;s greedy so it matches as many <code>'a'<\/code> characters as it can (but at least one). <\/p>\n<p>Second, you use the non-greedy one-or-more version <code>'a+?'<\/code>. In this case, the regex engine matches only one character <code>'a'<\/code>, consumes it, and moves on with the next match.<\/p>\n<p>Let&#8217;s summarize what you&#8217;ve learned so far:<\/p>\n<h2>Greedy vs Non-Greedy Match &#8211; What&#8217;s the Difference?<\/h2>\n<p>Given a pattern with a quantifier (e.g. the <a href=\"https:\/\/blog.finxter.com\/python-re-asterisk\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"asterisk operator (opens in a new tab)\">asterisk operator<\/a>) that allows the regex engine to match the pattern multiple times. <\/p>\n<p>A given string may match the regex in multiple ways. For example, both substrings <code>'a'<\/code> and <code>'aaa'<\/code> are valid matches when matching the pattern <code>'a*'<\/code> in the string <code>'aaaa'<\/code>. <\/p>\n<p>So the difference between the greedy and the non-greedy match is the following: The greedy match will try to match as many repetitions of the quantified pattern as possible. The non-greedy match will try to match as few repetitions of the quantified pattern as possible. <\/p>\n<h2>Examples Greedy vs Non-Greedy Match<\/h2>\n<p>Let&#8217;s consider a range of examples that help you understand the difference between greedy and non-greedy matches in Python:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('a+', 'aaaa')\n['aaaa']\n>>> re.findall('a+?', 'aaaa')\n['a', 'a', 'a', 'a']\n>>> re.findall('a*', 'aaaa')\n['aaaa', '']\n>>> re.findall('a*?', 'aaaa')\n['', 'a', '', 'a', '', 'a', '', 'a', '']\n>>> re.findall('a?', 'aaaa')\n['a', 'a', 'a', 'a', '']\n>>> re.findall('a??', 'aaaa')\n['', 'a', '', 'a', '', 'a', '', 'a', '']<\/pre>\n<p>Make sure you completely understand those examples before you move on. If you don&#8217;t, please read the previous paragraphs again.<\/p>\n<h2>Which is Faster: Greedy vs Non-Greedy?<\/h2>\n<p>Considering that greedy quantifiers match a maximal and non-greedy a minimal number of patterns, is there any performance difference?<\/p>\n<p>Great question! <\/p>\n<p>Indeed, some benchmarks suggest that there&#8217;s a significant performance difference: the greedy quantifier is 100% slower in <a href=\"https:\/\/www.rexegg.com\/regex-quantifiers.html\">realistic experiments on benchmark data<\/a>. <\/p>\n<p><strong>So if you optimize for speed and you don&#8217;t care about greedy or non-greedy matches&#8212;and you don&#8217;t know anything else&#8212;go for the non-greedy quantifier!<\/strong><\/p>\n<p>However, the truth is not as simple. For example, consider the following basic experiment that falsifies the previous hypothesis that the non-greedy version is faster:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import timeit\n>>> timeit.timeit('import re;re.findall(\"a*\", \"aaaaaaaaaaaa\")')\n1.0579840000000331\n>>> timeit.timeit('import re;re.findall(\"a*?\", \"aaaaaaaaaaaa\")')\n3.7830938000000742<\/pre>\n<p>I used the speed testing tool <a rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\" href=\"https:\/\/docs.python.org\/3.8\/library\/timeit.html\" target=\"_blank\">timeit <\/a>that allows to throw in some simple Python statements and check how long they run. Per default, the passed statement is executed 1,000,000 times. <\/p>\n<p>You can see a notable performance difference of more than 300%! The non-greedy version is three times slower than the greedy version. <\/p>\n<p>Why is that?<\/p>\n<p>The reason is the re.findall() method that returns a list of matching substrings. Here&#8217;s the output both statements would produce:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall(\"a*\", \"aaaaaaaaaaaa\")\n['aaaaaaaaaaaa', '']\n>>> re.findall(\"a*?\", \"aaaaaaaaaaaa\")\n['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']<\/pre>\n<p>You can see that the greedy version finds one match and is done with it. The non-greedy version finds 25 matches which leads to far more processing and memory overhead.<\/p>\n<p>So what happens if you use the <a rel=\"noreferrer noopener\" aria-label=\"re.search() (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex-search\/\" target=\"_blank\">re.search()<\/a> method that returns only the first match rather than the <a rel=\"noreferrer noopener\" aria-label=\"re.findall() (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-findall\/\" target=\"_blank\">re.findall()<\/a> method that returns all matches?<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> timeit.timeit('import re;re.search(\"a*\", \"aaaaaaaaaaaa\")')\n0.8420328999998219\n>>> timeit.timeit('import re;re.search(\"a*?\", \"aaaaaaaaaaaa\")')\n0.7955709000000297<\/pre>\n<p>As expected, this changes things again. Both regex searches yield a single result, but the non-greedy match is much shorter: it matches the empty string <code>''<\/code> rather than the whole string <code>'aaaaaaaaaaaa'<\/code>. Of course, this is a bit faster.<\/p>\n<p>However, the difference is negligible in this minimal example.<\/p>\n<h2>There&#8217;s More: Greedy, Docile, Lazy, Helpful, Possessive Match<\/h2>\n<p>In this article, I&#8217;ve classified the regex world into greedy and non-greedy quantifiers. But you can differentiate the &#8220;non-greedy&#8221; class even more!<\/p>\n<p>Next, I&#8217;ll give you a short overview based on <a href=\"https:\/\/www.rexegg.com\/regex-quantifiers.html\">this great article<\/a> of the most important terms in this regard:<\/p>\n<ul>\n<li><strong>Greedy<\/strong>: match as many instances of the quantified pattern as you can.<\/li>\n<li><strong>Docile<\/strong>: match as many instances of the quantified pattern as long as it still matches the overall pattern&#8212;if this is possible. Note that what I called &#8220;greedy&#8221; in this article is really &#8220;docile&#8221;.<\/li>\n<li><strong>Lazy<\/strong>: match as few instances of the quantified pattern as needed. This is what I called &#8220;non-greedy&#8221; in this article.<\/li>\n<li><strong>Possessive<\/strong>: never gives up a partial match. So the regex engine may not even find a match that actually exist&#8212;just because it&#8217;s so greedy. This is very unusual and you won&#8217;t see it a lot in practice.<\/li>\n<\/ul>\n<p>If you want to learn more about those, I&#8217;d recommend that you read this excellent online tutorial<a rel=\"noreferrer noopener\" aria-label=\". (opens in a new tab)\" href=\"https:\/\/www.rexegg.com\/regex-quantifiers.html\" target=\"_blank\">. <\/a><\/p>\n<h2>Where to Go From Here<\/h2>\n<p><strong>Summary: <\/strong>You&#8217;ve learned that the greedy quantifiers <code>?<\/code>, <code>*<\/code>, and <code>+<\/code> match as many repetitions of the quantified pattern as possible. The non-greedy quantifiers <code>??<\/code>, <code>*?<\/code>, and <code>+?<\/code> match as few repetitions of the quantified pattern as possible. <\/p>\n<p>If you want to master Python and regular expressions, <a href=\"https:\/\/blog.finxter.com\/subscribe\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"join my free email academy (opens in a new tab)\">join my free email academy<\/a>&#8212;it&#8217;s fun!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to earn the black belt of your regex superpower? This tutorial shows you the subtle but important difference between greedy and non-greedy regex quantifiers. But first things first: what are &#8220;quantifiers&#8221; anyway? Great question &#8211; I&#8217;m glad you asked! So let&#8217;s dive into Python&#8217;s three main regex quantifiers. Python Regex Quantifiers The word &#8220;quantifier&#8221; [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-109429","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/109429","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=109429"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/109429\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=109429"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=109429"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=109429"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}