[Tut] How to Find All Lines Not Containing a Regex in Python? - xSicKxBot - 03-06-2020
How to Find All Lines Not Containing a Regex in Python?
<div><p>Today, I stumbled upon this beautiful regex problem:</p>
<p><em><strong>Given a multi-line string and a regex pattern, how do you find all lines that do NOT contain the regex pattern?</strong></em></p>
<p>I’ll give you a short answer and a long answer.</p>
<p><strong>The short answer is to use the pattern <code>'^((?!regex).)*$'</code> together with the <code>re.MULTILINE</code> flag to match all lines that do not contain the pattern <code>regex</code>. Its core is <code>'((?!regex).)*'</code>: the expression <code>'(?! ...)'</code> is a negative lookahead that ensures that the enclosed pattern <code>...</code> does not match starting at the current position. </strong></p>
<p>So let’s discuss this solution in greater detail. (You can also watch my explainer video if you prefer video format.)</p>
<figure class="wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"> <div class="wp-block-embed__wrapper"> <div class="ast-oembed-container"><iframe title="How to Find All Lines Not Containing a Regex in Python?" width="1100" height="619" src="https://www.youtube.com/embed/RFyzd9xP7cM?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div> </div> </figure>
<h2>Detailed Example</h2>
<p>Let’s consider a practical code snippet. 
I’ll show you the code first and explain it afterwards:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re

s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer
42'''

for match in re.finditer('^((?!42).)*$', s, flags=re.M):
    print(match)

'''
<re.Match object; span=(49, 58), match='43 is not'>
<re.Match object; span=(59, 69), match='the answer'>
'''</pre>
<p>You can see that the code successfully matches only the lines that do not contain the string <code>'42'</code>. </p>
<p><strong>How can you do it? </strong></p>
<p>The general idea is to match a line that doesn’t contain the string <code>'42'</code>, print it to the shell, and move on to the next line. </p>
<p>The <code>re.finditer(pattern, string)</code> accomplishes this easily by returning an iterator over all match objects. </p>
<p>The regex pattern <code>'^((?!42).)*$'</code> matches the whole line from the first position <code>'^'</code> to the last position <code>'$'</code>. If you need a refresher on the start-of-the-line and <a rel="noreferrer noopener" aria-label=" end-of-the-line metacharacters, read this 5-min tutorial (opens in a new tab)" href="https://blog.finxter.com/python-regex-start-of-line-and-end-of-line/" target="_blank">end-of-the-line metacharacters, read this 5-min tutorial</a>.</p>
<p>In between, you match an arbitrary number of characters: the asterisk quantifier does that for you. <a rel="noreferrer noopener" aria-label="If you need help understanding the asterisk quantifier, check out this blog tutorial. (opens in a new tab)" href="https://blog.finxter.com/python-re-asterisk/" target="_blank">If you need help understanding the asterisk quantifier, check out this blog tutorial.</a></p>
<p>Which characters do you match? 
Only those characters at which the forbidden string <code>'42'</code> does not begin, which is exactly what the negative lookahead checks. <a rel="noreferrer noopener" aria-label="If you need a refresher on lookaheads, check out this tutorial. (opens in a new tab)" href="https://blog.finxter.com/python-re-groups/" target="_blank">If you need a refresher on lookaheads, check out this tutorial. </a></p>
<p>The lookahead itself doesn’t consume a character. Thus, you need to consume one manually by adding the dot metacharacter <code>.</code>, which matches any character except the newline character <code>'\n'</code>. <a rel="noreferrer noopener" aria-label="As it turns out, there's also a blog tutorial on the dot metacharacter. (opens in a new tab)" href="https://blog.finxter.com/python-re-dot/" target="_blank">As it turns out, there’s also a blog tutorial on the dot metacharacter.</a></p>
<p>Finally, you need to set the <code>re.MULTILINE</code> flag (<code>re.M</code> for short), because it allows the start <code>^</code> and end <code>$</code> metacharacters to also match at the start and end of each line (not only at the start and end of the whole string). <a href="https://blog.finxter.com/python-regex-flags/" target="_blank" rel="noreferrer noopener" aria-label="You can read more about the flags argument at this blog tutorial. (opens in a new tab)">You can read more about the flags argument at this blog tutorial.</a></p>
<p>Together, this regular expression matches all lines that do not contain the specific string <code>'42'</code>. </p>
<p>In case you had some problems understanding the concept of lookahead (and why it doesn’t consume anything), have a look at this explanation from the <a href="https://blog.finxter.com/python-re-groups/" target="_blank" rel="noreferrer noopener" aria-label="matching group tutorial (opens in a new tab)">matching group tutorial</a> on this blog:</p>
<h2>Positive Lookahead (?=…)</h2>
<p>The concept of lookahead is a very powerful one and any advanced coder should know it. 
A friend recently told me that he had written a complicated regex that ignores the order of occurrences of two words in a given text. It’s a challenging problem and without the concept of lookahead, the resulting code will be complicated and hard to understand. However, the concept of lookahead makes this problem simple to write and read. </p> <p>But first things first: <strong>how does the lookahead assertion work?</strong> </p> <p>In normal regular expression processing, the regex is matched from left to right. The regex engine “consumes” partially matching substrings. The consumed substring cannot be matched by any other part of the regex.</p> <p><img src="https://lh5.googleusercontent.com/0-XaBHQmCSQh3Wn6sRZUJyC8wmLLc08tnc89lxXQ3bVPFL8k-MyWQwaORoWB3GGB20U9lZE9dMlOcPZmzTDX8zWpEzngCTYCK6lK89vDW8T_VBS6tL41vCO1BAhVplDqpz_zweOv" width="602" height="339"><strong><em>Figure:</em></strong> <em>A simple example of lookahead. The regular expression engine matches (“consumes”) the string partially. Then it checks whether the remaining pattern could be matched without actually matching it.</em></p> <p>Think of the lookahead assertion as a non-consuming pattern match. The regex engine goes from the left to the right—searching for the pattern. At each point, it has one “current” position to check if this position is the first position of the remaining match. In other words, the regex engine tries to “consume” the next character as a (partial) match of the pattern.</p> <p>The advantage of the lookahead expression is that it doesn’t consume anything. It just “looks ahead” starting from the current position whether what follows would theoretically match the lookahead pattern. If it doesn’t, the regex engine cannot move on. 
Next, it “backtracks”, which is just a fancy way of saying that it goes back to a previous decision and tries to match something else.</p>
<h3>Positive Lookahead Example: How to Match Two Words in Arbitrary Order?</h3>
<p>What if you want to search a given text for pattern A AND pattern B, but in no particular order? If both patterns appear anywhere in the string, the whole string should be returned as a match.</p>
<p>Now, this is a bit more complicated because any regular expression pattern is ordered from left to right. A simple solution is to use the lookahead assertion <code>(?=.*A)</code> to check whether regex A appears anywhere in the string. (Note that we assume a single-line string, as the <code>.*</code> pattern doesn’t match the newline character by default.)</p>
<p>Let’s first have a look at the minimal solution to check for two patterns anywhere in the string (say, patterns ‘hi’ AND ‘you’).</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are yo?')
[]
>>> re.findall(pattern, 'hi how are you?')
['']
</pre>
<p>In the first example, the word ‘you’ does not appear, so there is no match. In the second example, both words appear.</p>
<p>Let’s go back to the expression <code>(?=.*hi)(?=.*you)</code> to match strings that contain both ‘hi’ and ‘you’. Why does it work?</p>
<p>The reason is that the lookahead expressions don’t consume anything. You first search for an arbitrary number of characters <code>.*</code>, followed by the word <code>hi</code>. But because the regex engine hasn’t consumed anything, it’s still in the <strong>same position at the beginning of the string</strong>. 
So, you can repeat the same for the word you.</p>
<p>Note that this method doesn’t care about the order of the two words:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are you?')
['']
>>> re.findall(pattern, 'you are how? hi!')
['']</pre>
<p>No matter which word “hi” or “you” appears first in the text, the regex engine finds both.</p>
<p>You may ask: why’s the output the empty string? The reason is that the regex engine hasn’t consumed any character. It just checked the lookaheads. So the easy fix is to consume all characters as follows:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> pattern = '(?=.*hi)(?=.*you).*'
>>> re.findall(pattern, 'you fly high')
['you fly high']</pre>
<p>Now, the whole string is a match because after checking the lookahead with ‘(?=.*hi)(?=.*you)’, you also consume the whole string ‘.*’.</p>
<h2>Negative Lookahead (?!…)</h2>
<p>The negative lookahead works just like the positive lookahead, only it checks that the given regex pattern does <strong>not </strong>occur going forward from a certain position. 
</p>
<p>Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.search('(?!.*hi.*)', 'hi say hi?')
<re.Match object; span=(8, 8), match=''></pre>
<p>The negative lookahead pattern <code>(?!.*hi.*)</code> ensures that, going forward in the string, there’s no occurrence of the substring <code>'hi'</code>. The first position where this holds is position 8 (right after the second <code>'h'</code>). Like the positive lookahead, the negative lookahead does not consume any character, so the result is the empty string (which is a valid match of the pattern). </p>
<p>You can even combine multiple negative lookaheads like this:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.search(r'(?!.*hi.*)(?!\?).', 'hi say hi?')
<re.Match object; span=(8, 9), match='i'></pre>
<p>You search for a position where ‘hi’ does not occur anywhere ahead and where the question mark character does not follow immediately. This time, we consume an arbitrary character, so the resulting match is the character <code>'i'</code>. </p>
<h2>Where to Go From Here?</h2>
<p><strong>Summary</strong>: You’ve learned that you can match all lines that do not contain a certain <code>regex</code> by using the negative lookahead pattern <code>((?!regex).)*</code>, anchored as <code>^((?!regex).)*$</code> and combined with the <code>re.MULTILINE</code> flag. </p>
<p>Now this was a lot of theory! Let’s get some practice.</p>
<p>In my Python freelancer bootcamp, I’ll teach you how to build a new success skill as a Python freelancer with the potential of earning six figures online. 
The next recession is coming for sure and you want to be able to create your own economy so that you can take care of your loved ones.</p> <p><a rel="noreferrer noopener" aria-label="Check out my free webinar now. (opens in a new tab)" href="https://blog.finxter.com/webinar-freelancer/" target="_blank">Check out my free “Python Freelancer” webinar now!</a></p> <p><strong>Join 20,000+ ambitious coders for free!</strong></p> </div> https://www.sickgaming.net/blog/2020/03/05/how-to-find-all-lines-not-containing-a-regex-in-python/ |
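<p>As a minimal sketch of the summary above, the anchored pattern <code>^((?!regex).)*$</code> can be wrapped in a small helper. The function names <code>lines_without</code> and <code>lines_without_simple</code> are hypothetical and not part of the original tutorial. The second one is a plainer alternative, assuming you don’t need a single regex: it splits the string into lines and keeps those where <code>re.search</code> finds nothing.</p>

```python
import re
from typing import List


def lines_without(pattern: str, text: str) -> List[str]:
    """Return the lines of `text` that contain no match for `pattern`.

    Hypothetical helper: uses the negative lookahead idiom from the tutorial.
    """
    # The inner (?:...) keeps alternations such as 'a|b' inside the lookahead.
    line_regex = re.compile(rf'^(?:(?!(?:{pattern})).)*$', flags=re.MULTILINE)
    return [match.group(0) for match in line_regex.finditer(text)]


def lines_without_simple(pattern: str, text: str) -> List[str]:
    """Alternative without lookahead: test each line separately."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if not regex.search(line)]


# Sample string taken from the detailed example in the tutorial.
sample = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer
42'''

print(lines_without('42', sample))
print(lines_without_simple('42', sample))
```

<p>On the sample string, both helpers return the two lines <code>'43 is not'</code> and <code>'the answer'</code>. Note that with <code>re.MULTILINE</code> an empty line (or the position after a trailing newline) also counts as a matching line for the lookahead version, so the two helpers may differ on input that ends with a newline.</p>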