[Tut] How to Match an Exact Word in Python Regex? (Answer: Don’t) - xSicKxBot - 03-08-2020
<div><p>This morning, I read over an actual <a href="https://www.quora.com/How-do-you-match-an-exact-word-with-Regex-Python" target="_blank" rel="noreferrer noopener" aria-label="Quora (opens in a new tab)">Quora</a> thread with this precise question. While there’s no dumb question, this one reveals a gap in understanding the basics of Python and Python’s <a rel="noreferrer noopener" aria-label="regular expression library (opens in a new tab)" href="https://blog.finxter.com/python-regex/" target="_blank">regular expression library</a>.</p>
<p>So if you’re an impatient person, here’s the short answer:</p>
<p><strong><em>How to match an exact word/string using a regular expression in Python?</em></strong></p>
<p><strong>You don’t! Well, you can do it by using the straightforward regex <code>'hello'</code> to match it in <code>'hello world'</code>. But there’s no need to use a slower and less readable regex to check for an exact substring in a given string. Instead, simply use the pure Python expression <code>'hello' in 'hello world'</code>.</strong></p>
<p>So far so good. But let’s dive into some more specific questions, because you may not have been looking for this simplistic answer. In fact, there are multiple ways of understanding the question, and I have tried to find all interpretations and answer them one by one:</p>
<p>(You can also watch my <a href="https://www.youtube.com/embed/Lj_QGc7zWUA?feature=oembed" target="_blank" rel="noreferrer noopener">tutorial video</a> as you go over the article.)</p>
<h2>How to Check Membership of a Word in a String (Python Built-In)?</h2>
<p>This is the simple answer you’ve already learned. Instead of matching an exact string with a regex, it’s often enough to use Python’s <code>in</code> keyword to check membership. As this is a very efficient built-in operation, it’s faster, more readable, and doesn’t require importing any module.</p>
<p>Thus, you should rely on this method if possible:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> 'hello' in 'hello world'
True</pre>
<p>This example shows the most straightforward way of doing it: simply ask Python whether a string is “in” another string. This is called the <a rel="noreferrer noopener" aria-label="membership operator (opens in a new tab)" href="https://docs.python.org/3/reference/expressions.html" target="_blank">membership operator</a> and it’s very efficient.</p>
<p>You can also check whether a string does <em>not</em> occur in another string. Here’s how:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> 'hi' not in 'hello world'
True</pre>
<p>The negative membership operator <code>s1 not in s2</code> returns <code>True</code> if string <code>s1</code> does not occur in string <code>s2</code>.</p>
<p>But there’s a problem with the membership operator. 
The return value is only a Boolean. In contrast, Python’s <a rel="noreferrer noopener" aria-label="regular expression library (opens in a new tab)" href="https://blog.finxter.com/python-regex/" target="_blank">regular expression library</a> <code>re</code> returns a <a rel="noreferrer noopener" aria-label="match object (opens in a new tab)" href="https://blog.finxter.com/python-regex-match/" target="_blank">match object</a> which contains more information, such as the exact location of the matching substring.</p>
<p>So let’s explore the problem of exact string matching using the regex library next:</p>
<h2>How to Match an Exact String (Regex)?</h2>
<p>Here’s how you can match an exact substring in a given string:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.search('hello', 'hello world')
<re.Match object; span=(0, 5), match='hello'></pre>
<p>After importing Python’s library for regular expression processing <code>re</code>, you use the <code>re.search(pattern, string)</code> method to find the first occurrence of the <code>pattern</code> in the <code>string</code>. If you’re unsure about this method, check out my <a rel="noreferrer noopener" aria-label="detailed tutorial (opens in a new tab)" href="https://blog.finxter.com/python-regex-search/" target="_blank">detailed tutorial</a> on this blog.</p>
<p>This returns a match object that wraps useful information such as the start and end positions of the match and the matching substring. As you’re looking for exact string matches, the matching substring will always be the same as your searched word.</p>
<p>But wait, there’s another problem: you wanted an exact match, right? 
But a plain pattern also matches when your search word is only the prefix of a longer word:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.search('good', 'goodbye')
<re.Match object; span=(0, 4), match='good'></pre>
<p>When searching for the exact word <code>'good'</code> in the string <code>'goodbye'</code>, the regex actually matches the prefix of the longer word. Is this what you wanted?</p>
<p>If not, read on:</p>
<h2>How to Match a Word in a String (Word Boundary \b)?</h2>
<p>So how can we fix the problem that a search for a word also finds matching substrings inside other words?</p>
<p>Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> 'no' in 'nobody knows'
True</pre>
<p>And another example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.search('see', 'dfjkyldsssseels')
<re.Match object; span=(10, 13), match='see'></pre>
<p>What if you want to match only whole words, not arbitrary substrings? The answer is simple: use the word boundary metacharacter <code>'\b'</code>. This metacharacter matches at the beginning and end of each word, but it doesn’t consume any characters. In other words, it simply checks whether a word starts or ends at this position: a word character (letter, digit, or underscore) must sit on one side, and a non-word character or the start or end of the string on the other. 
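</p>
<p>To make this zero-width check concrete, here is a minimal sketch. The helper name <code>contains_word</code> is hypothetical (it is not part of the article or of the <code>re</code> module). It assumes the search word starts and ends with a word character, and it uses <code>re.escape</code> so that any special characters inside the word are treated literally:</p>

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word (hypothetical helper)."""
    # re.escape treats every character of `word` literally;
    # \b on both sides requires a word boundary before and after it.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

print(contains_word('good', 'goodbye'))     # False: 'good' is only a prefix here
print(contains_word('good', 'a good day'))  # True: 'good' stands on its own

# The boundaries add no width: the span covers just the four letters.
print(re.search(r'\bgood\b', 'a good day').span())  # (2, 6)
```

<p>Because <code>\b</code> consumes nothing, the span of the match is exactly as long as the word itself.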
</p>
<p>Here’s how you use the word boundary character to ensure that only whole words match:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.search(r'\bno\b', 'nobody knows')
>>> 
>>> re.search(r'\bno\b', 'nobody knows nothing - no?')
<re.Match object; span=(23, 25), match='no'></pre>
<p>In both examples, you use the same regex <code>r'\bno\b'</code> that searches for the exact word <code>'no'</code> but only if the word boundary character <code>'\b'</code> matches before and after. In other words, the word <code>'no'</code> must appear on its own as a separate word. It is not allowed to appear within another sequence of word characters.</p>
<p>As a result, the regex doesn’t match in the string <code>'nobody knows'</code> but it matches in the string <code>'nobody knows nothing - no?'</code>.</p>
<p>Note that we use the raw string <code>r'...'</code> to write the regex so that the escape sequence <code>'\b'</code> reaches the regex engine unchanged. Without the raw string, Python itself would translate <code>'\b'</code> into a single backspace character before the regex engine ever sees it, and the pattern would search for a literal backspace. With the raw string, all backslashes will just be that: backslashes. The regex engine then interprets the two characters as one special metacharacter: the word boundary <code>'\b'</code>.</p>
<p>But what if you don’t care whether the word is upper or lowercase or capitalized? In other words:</p>
<h2>How to Match a Word in a String (Case Insensitive)?</h2>
<p>You can search for an exact word in a string but ignore capitalization. This way, it’ll be irrelevant whether the word’s characters are lowercase or uppercase. 
Here’s how:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.search('no', 'NONONON', flags=re.IGNORECASE)
<re.Match object; span=(0, 2), match='NO'>
>>> re.search('no', 'NONONON', flags=re.I)
<re.Match object; span=(0, 2), match='NO'>
>>> re.search('(?i)no', 'NONONON')
<re.Match object; span=(0, 2), match='NO'></pre>
<p>All three ways are equivalent: they all ignore the capitalization of the word’s letters. If you need to learn more about the <code>flags</code> argument in Python, check out my <a rel="noreferrer noopener" aria-label="detailed tutorial on this blog (opens in a new tab)" href="https://blog.finxter.com/python-regex-flags/" target="_blank">detailed tutorial on this blog</a>. The third example uses the in-regex flag <code>(?i)</code> that also means: “ignore the capitalization”.</p>
<h2>How to Find All Occurrences of a Word in a String?</h2>
<p>Okay, you’re never satisfied, are you? So let’s explore how you can find all occurrences of a word in a string.</p>
<p>In the previous examples, you used the <code>re.search(pattern, string)</code> method to find the first match of the <code>pattern</code> in the <code>string</code>.</p>
<p>Next, you’ll learn how to find all occurrences (not only the first match) by using the <code>re.findall(pattern, string)</code> method. 
You can also read my <a rel="noreferrer noopener" aria-label="blog tutorial about the findall() method (opens in a new tab)" href="https://blog.finxter.com/python-re-findall/" target="_blank">blog tutorial about the findall() method</a> that explains all the details.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('no', 'nononono')
['no', 'no', 'no', 'no']</pre>
<p>Your code retrieves all matching substrings. If you need to find all match objects rather than matching substrings, you can use the <a href="https://blog.finxter.com/python-regex-how-to-count-the-number-of-matches/" target="_blank" rel="noreferrer noopener" aria-label="re.finditer(pattern, string) (opens in a new tab)">re.finditer(pattern, string)</a> method:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> for match in re.finditer('no', 'nonononono'):
        print(match)

<re.Match object; span=(0, 2), match='no'>
<re.Match object; span=(2, 4), match='no'>
<re.Match object; span=(4, 6), match='no'>
<re.Match object; span=(6, 8), match='no'>
<re.Match object; span=(8, 10), match='no'>
>>> </pre>
<p>The <code>re.finditer(pattern, string)</code> method creates an iterator that iterates over all matches and returns the match objects. This way, you can find all matches and get the match objects as well.</p>
<h2>How to Find All Lines Containing an Exact Word?</h2>
<p>Say you want to find all lines that contain the word <code>'42'</code> in a multi-line string in Python. How would you do it?</p>
<p>The answer makes use of a fine Python regex specialty: by default, the dot regex matches all characters except the newline character. 
Thus, the regex <code>.*</code> will match all characters in a given line (but then stop).</p>
<p>Here’s how you can use this fact to get all lines that contain a certain word:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> s = '''the answer is 42
the answer: 42
42 is the answer
43 is not'''
>>> re.findall('.*42.*', s)
['the answer is 42', 'the answer: 42', '42 is the answer']</pre>
<p>Three out of four lines contain the word <code>'42'</code>. The <code>findall()</code> method returns these as strings.</p>
<h2>How to Find All Lines Not Containing an Exact Word?</h2>
<p>In the previous section, you’ve learned how to find all lines that contain an exact word. In this section, you’ll learn how to do the opposite: find all lines that do NOT contain an exact word.</p>
<p>This is a bit more complicated. I’ll show you the code first and explain it afterwards:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re
s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer
42'''

for match in re.finditer('^((?!42).)*$', s, flags=re.M):
    print(match)

'''
<re.Match object; span=(49, 58), match='43 is not'>
<re.Match object; span=(59, 69), match='the answer'>
'''</pre>
<p>You can see that the code successfully matches only the lines that do not contain the string <code>'42'</code>.</p>
<p>How does it work?</p>
<p>The general idea is to match a line that doesn’t contain the string <code>'42'</code>, print it to the shell, and move on to the next line. The <code>re.finditer(pattern, string)</code> method accomplishes this easily by returning an iterator over all match objects. 
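</p>
<p>For comparison, the same general idea can be sketched without any regex, in the spirit of this article’s title. The helper name <code>lines_without</code> is hypothetical, and the string <code>s</code> is assumed to be the six-line example from above. The last lines run the regex on the same input to check that both approaches agree:</p>

```python
import re

s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer
42'''

def lines_without(word, text):
    """Return all lines of `text` that do not contain `word` (hypothetical helper)."""
    # splitlines() cuts the string at newlines; `not in` is the membership test.
    return [line for line in text.splitlines() if word not in line]

plain_lines = lines_without('42', s)
print(plain_lines)  # ['43 is not', 'the answer']

# Same result via the regex, keeping only the matched text of each match object.
regex_lines = [m.group() for m in re.finditer(r'^((?!42).)*$', s, flags=re.M)]
print(regex_lines == plain_lines)  # True
```

<p>The plain version returns only the line strings. The regex version earns its extra complexity when you need the match objects with their positions.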
</p>
<p>The regex pattern <code>'^((?!42).)*$'</code> matches the whole line from the first position <code>'^'</code> to the last position <code>'$'</code>. If you need a refresher on the start-of-the-line and <a rel="noreferrer noopener" aria-label=" end-of-the-line metacharacters, read this 5-min tutorial (opens in a new tab)" href="https://blog.finxter.com/python-regex-start-of-line-and-end-of-line/" target="_blank">end-of-the-line metacharacters, read this 5-min tutorial</a>.</p>
<p>In between, you match an arbitrary number of characters: the asterisk quantifier does that for you. <a rel="noreferrer noopener" aria-label="If you need help understanding the asterisk quantifier, check out this blog tutorial. (opens in a new tab)" href="https://blog.finxter.com/python-re-asterisk/" target="_blank">If you need help understanding the asterisk quantifier, check out this blog tutorial.</a></p>
<p>Which characters do you match? Only those at positions where the forbidden word <code>'42'</code> does not begin: the negative lookahead <code>(?!42)</code> checks this before every single character. <a rel="noreferrer noopener" aria-label="If you need a refresher on lookaheads, check out this tutorial. (opens in a new tab)" href="https://blog.finxter.com/python-re-groups/" target="_blank">If you need a refresher on lookaheads, check out this tutorial.</a></p>
<p>As the lookahead itself doesn’t consume a character, we need to consume it manually by adding the dot metacharacter <code>.</code> which matches all characters except the newline character <code>'\n'</code>. <a rel="noreferrer noopener" aria-label="As it turns out, there's also a blog tutorial on the dot metacharacter. 
(opens in a new tab)" href="https://blog.finxter.com/python-re-dot/" target="_blank">As it turns out, there’s also a blog tutorial on the dot metacharacter.</a></p>
<p>Finally, you need to set the <code>re.MULTILINE</code> flag, in short: <code>re.M</code>, because it allows the start <code>^</code> and end <code>$</code> metacharacters to match also at the start and end of each line (not only at the start and end of the whole string).</p>
<p>Together, this regular expression matches all lines that do not contain the specific word <code>'42'</code>.</p>
<h2>Where to Go From Here?</h2>
<p>Summary: You’ve learned multiple ways of matching an exact word in a string. You can use the simple Python membership operator. You can use a plain regex with no special metacharacters. You can use the word boundary metacharacter <code>'\b'</code> to match only whole words. You can match case-insensitively by using the flags argument <code>re.IGNORECASE</code>. You can match not only one but all occurrences of a word in a string by using the <code>re.findall()</code> or <code>re.finditer()</code> methods. And you can match all lines containing and not containing a certain word.</p>
<p>Phew. This was some theory-heavy stuff. 
Do you feel like you need some more practical stuff next?</p>
<p>Then check out my practice-heavy <a href="https://blog.finxter.com/become-python-freelancer-course/" target="_blank" rel="noreferrer noopener" aria-label="Python freelancer course (opens in a new tab)">Python freelancer course</a> that helps you prepare for the worst and create a second income stream by creating your thriving coding side-business online.</p>
<p><a href="https://blog.finxter.com/become-python-freelancer-course/" target="_blank" rel="noreferrer noopener" aria-label="https://blog.finxter.com/become-python-freelancer-course/ (opens in a new tab)">https://blog.finxter.com/become-python-freelancer-course/</a></p>
</div>
https://www.sickgaming.net/blog/2020/03/04/how-to-match-an-exact-word-in-python-regex-answer-dont/