{"id":110030,"date":"2020-03-04T11:42:08","date_gmt":"2020-03-04T11:42:08","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=6547"},"modified":"2020-03-04T11:42:08","modified_gmt":"2020-03-04T11:42:08","slug":"how-to-match-an-exact-word-in-python-regex-answer-dont","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2020\/03\/04\/how-to-match-an-exact-word-in-python-regex-answer-dont\/","title":{"rendered":"How to Match an Exact Word in Python Regex? (Answer: Don\u2019t)"},"content":{"rendered":"<p>This morning, I read over an actual <a href=\"https:\/\/www.quora.com\/How-do-you-match-an-exact-word-with-Regex-Python\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Quora (opens in a new tab)\">Quora<\/a> thread with this precise question. While there&#8217;s no dumb question, the question reveals that there may be some gap in understanding the basics in Python and Python&#8217;s <a rel=\"noreferrer noopener\" aria-label=\"regular expression library (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex\/\" target=\"_blank\">regular expression library<\/a>.<\/p>\n<p>So if you&#8217;re an impatient person, here&#8217;s the short answer:<\/p>\n<p><strong><em>How to match an exact word\/string using a regular expression in Python?<\/em><\/strong><\/p>\n<p><strong>You don&#8217;t! Well, you can do it by using the straightforward regex <code>'hello'<\/code> to match it in <code>'hello world'<\/code>. But there&#8217;s no need to use an expensive and less readable regex to match an exact substring in a given string. Instead, simply use the pure Python expression <code>'hello'<\/code> in <code>'hello world'<\/code>. <\/strong><\/p>\n<p>So far so good. But let&#8217;s dive into some more specific questions&#8212;because you may not exactly have looked for this simplistic answer. 
In fact, there are multiple ways of understanding your question, and I have tried to find all interpretations and answer them one by one:<\/p>\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-rich is-provider-embed-handler wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\">\n<div class=\"ast-oembed-container\"><iframe loading=\"lazy\" title=\"How to Match an Exact Word in Python Regex? (Answer: Don\u2019t)\" width=\"1400\" height=\"788\" src=\"https:\/\/www.youtube.com\/embed\/Lj_QGc7zWUA?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/p><\/div>\n<\/figure>\n<p>(You can also watch my tutorial video as you go over the article)<\/p>\n<h2>How to Check Membership of a Word in a String (Python Built-In)?<\/h2>\n<p>This is the simple answer you&#8217;ve already learned. Instead of matching an exact string with a regex, it&#8217;s often enough to use Python&#8217;s <code>in<\/code> keyword to check membership. As this is a built-in operator, it&#8217;s fast, readable, and doesn&#8217;t require importing any library. <\/p>\n<p>Thus, you should rely on this method if possible:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> 'hello' in 'hello world'\nTrue<\/pre>\n<p>The first example shows the most straightforward way of doing it: simply ask Python whether a string is &#8220;in&#8221; another string. This is called the <a rel=\"noreferrer noopener\" aria-label=\"membership operator (opens in a new tab)\" href=\"https:\/\/docs.python.org\/3\/reference\/expressions.html\" target=\"_blank\">membership operator<\/a> and it&#8217;s very efficient. 
<\/p>\n<p>You can also check whether a string does <em>not <\/em>occur in another string. Here&#8217;s how:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> 'hi' not in 'hello world'\nTrue<\/pre>\n<p>The negative membership operator <code>s1 not in s2<\/code> returns <code>True<\/code> if string <code>s1<\/code> does not occur in string <code>s2<\/code>. <\/p>\n<p>But there&#8217;s a problem with the membership operator. The return value is only a Boolean value. However, the advantage of Python&#8217;s <a rel=\"noreferrer noopener\" aria-label=\"regular expression library (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex\/\" target=\"_blank\">regular expression library<\/a> <code>re<\/code> is that it returns a <a rel=\"noreferrer noopener\" aria-label=\"match object (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex-match\/\" target=\"_blank\">match object<\/a> which contains more interesting information such as the exact location of the matching substring. 
<\/p>\n<p>So let&#8217;s explore the problem of exact string matching using the regex library next:<\/p>\n<h2>How to Match an Exact String (Regex)?<\/h2>\n<p>Here&#8217;s how you can match an exact substring in a given string:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.search('hello', 'hello world')\n&lt;re.Match object; span=(0, 5), match='hello'><\/pre>\n<p>After importing Python&#8217;s library for regular expression processing <code>re<\/code>, you use the <code>re.search(pattern, string)<\/code> method to find the first occurrence of the <code>pattern<\/code> in the <code>string<\/code>. If you&#8217;re unsure about this method, check out my <a rel=\"noreferrer noopener\" aria-label=\"detailed tutorial (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex-search\/\" target=\"_blank\">detailed tutorial<\/a> on this blog.<\/p>\n<p>This returns a match object that wraps a lot of useful information such as the start and stop positions of the match and the matching substring. As you&#8217;re looking for exact string matches, the matching substring will always be the same as your searched word.<\/p>\n<p>But wait, there&#8217;s another problem: you wanted an exact match, right? A plain pattern also matches when your searched word is only part of a longer word, for example as its prefix:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.search('good', 'goodbye')\n&lt;re.Match object; span=(0, 4), match='good'><\/pre>\n<p>When searching for the exact word <code>'good'<\/code> in the string <code>'goodbye'<\/code>, it actually matches the prefix of the word. 
Is this what you wanted?<\/p>\n<p>If not, read on:<\/p>\n<h2>How to Match a Word in a String (Word Boundary \\b)?<\/h2>\n<p>So how can we fix the problem that an exact match of a word will also retrieve matching substrings that occur anywhere in the string?<\/p>\n<p>Here&#8217;s an example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> 'no' in 'nobody knows'\nTrue<\/pre>\n<p>And another example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.search('see', 'dfjkyldsssseels')\n&lt;re.Match object; span=(10, 13), match='see'><\/pre>\n<p>What if you want to match only whole words&#8212;not exact substrings? The answer is simple: use the word boundary metacharacter <code>'\\b'<\/code>. This metacharacter matches at the beginning and end of each word&#8212;but it doesn&#8217;t consume anything. In other words, it simply checks whether the word starts or ends at this position (by checking for whitespace or non-word characters). 
<\/p>\n<p>Here&#8217;s how you use the word boundary character to ensure that only whole words match:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.search(r'\\bno\\b', 'nobody knows')\n>>> re.search(r'\\bno\\b', 'nobody knows nothing - no?')\n&lt;re.Match object; span=(23, 25), match='no'>\n<\/pre>\n<p>In both examples, you use the same regex <code>'\\bno\\b'<\/code> that searches for the exact word <code>'no'<\/code> but only if the word boundary character <code>'\\b'<\/code> matches before and after. In other words, the word <code>'no'<\/code> must appear on its own as a separate word. It is not allowed to appear within another sequence of word characters.<\/p>\n<p>As a result, the regex doesn&#8217;t match in the string <code>'nobody knows'<\/code> (the first call returns <code>None<\/code>, so the shell prints nothing) but it matches in the string <code>'nobody knows nothing - no?'<\/code>. <\/p>\n<p>Note that we use a raw string <code>r'...'<\/code> to write the regex so that the escape sequence <code>'\\b'<\/code> reaches the regex engine unchanged. Without the raw string, Python would interpret <code>'\\b'<\/code> as the backspace character before the regex engine ever sees it. With the raw string, all backslashes will just be that: backslashes. The regex engine then interprets the two characters as one special metacharacter: the word boundary <code>'\\b'<\/code>. <\/p>\n<p>But what if you don&#8217;t care whether the word is upper or lowercase or capitalized? In other words:<\/p>\n<h2>How to Match a Word in a String (Case Insensitive)?<\/h2>\n<p>You can search for an exact word in a string&#8212;but ignore capitalization. This way, it&#8217;ll be irrelevant whether the word&#8217;s characters are lowercase or uppercase. 
Here&#8217;s how:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.search('no', 'NONONON', flags=re.IGNORECASE)\n&lt;re.Match object; span=(0, 2), match='NO'>\n>>> re.search('no', 'NONONON', flags=re.I)\n&lt;re.Match object; span=(0, 2), match='NO'>\n>>> re.search('(?i)no', 'NONONON')\n&lt;re.Match object; span=(0, 2), match='NO'><\/pre>\n<p>All three ways are equivalent: they all ignore the capitalization of the word&#8217;s letters. If you need to learn more about the <code>flags<\/code> argument in Python, check out my <a rel=\"noreferrer noopener\" aria-label=\"detailed tutorial on this blog (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex-flags\/\" target=\"_blank\">detailed tutorial on this blog<\/a>. The third example uses the in-regex flag <code>(?i)<\/code> that also means: &#8220;ignore the capitalization&#8221;.<\/p>\n<h2>How to Find All Occurrences of a Word in a String?<\/h2>\n<p>Okay, you&#8217;re never satisfied, are you? So let&#8217;s explore how you can find all occurrences of a word in a string.<\/p>\n<p>In the previous examples, you used the <code>re.search(pattern, string)<\/code> method to find the first match of the <code>pattern<\/code> in the <code>string<\/code>. <\/p>\n<p>Next, you&#8217;ll learn how to find all occurrences (not only the first match) by using the <code>re.findall(pattern, string)<\/code> method. 
You can also read my <a rel=\"noreferrer noopener\" aria-label=\"blog tutorial about the findall() method (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-findall\/\" target=\"_blank\">blog tutorial about the findall() method<\/a> that explains all the details.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('no', 'nononono')\n['no', 'no', 'no', 'no']<\/pre>\n<p>Your code retrieves all matching substrings. If you need to find all match objects rather than matching substrings, you can use the <a href=\"https:\/\/blog.finxter.com\/python-regex-how-to-count-the-number-of-matches\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"re.finditer(pattern, string) (opens in a new tab)\">re.finditer(pattern, string)<\/a> method:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> for match in re.finditer('no', 'nonononono'):\n...     print(match)\n...\n&lt;re.Match object; span=(0, 2), match='no'>\n&lt;re.Match object; span=(2, 4), match='no'>\n&lt;re.Match object; span=(4, 6), match='no'>\n&lt;re.Match object; span=(6, 8), match='no'>\n&lt;re.Match object; span=(8, 10), match='no'>\n>>> <\/pre>\n<p>The <code>re.finditer(pattern, string)<\/code> method creates an iterator that iterates over all matches and returns the match objects. This way, you can find all matches and get the match objects as well.<\/p>\n<h2>How to Find All Lines Containing an Exact Word?<\/h2>\n<p>Say you want to find all lines that contain the word <code>'42'<\/code> from a multi-line string in Python. 
How would you do it?<\/p>\n<p>The answer makes use of a fine Python regex specialty: by default, the dot regex matches all characters except the newline character. Thus, the regex <code>.*<\/code> will match all characters in a given line (but then stop). <\/p>\n<p>Here&#8217;s how you can use this fact to get all lines that contain a certain word:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> s = '''the answer is 42\nthe answer: 42\n42 is the answer\n43 is not'''\n>>> re.findall('.*42.*', s)\n['the answer is 42', 'the answer: 42', '42 is the answer']<\/pre>\n<p>Three out of four lines contain the word <code>'42'<\/code>. The <code>findall()<\/code> method returns these as strings.<\/p>\n<h2>How to Find All Lines Not Containing an Exact Word?<\/h2>\n<p>In the previous section, you&#8217;ve learned how to find all lines that contain an exact word. In this section, you&#8217;ll learn how to do the opposite: find all lines that do NOT contain an exact word. <\/p>\n<p>This is a bit more complicated. I&#8217;ll show you the code first and explain it afterwards:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import re\ns = '''the answer is 42\nthe answer: 42\n42 is the answer\n43 is not\nthe answer\n42'''\n\nfor match in re.finditer('^((?!42).)*$', s, flags=re.M):\n    print(match)\n\n'''\n&lt;re.Match object; span=(49, 58), match='43 is not'>\n&lt;re.Match object; span=(59, 69), match='the answer'>\n'''<\/pre>\n<p>You can see that the code successfully matches only the lines that do not contain the string <code>'42'<\/code>. <\/p>\n<p>How does it work? 
<\/p>\n<p>The general idea is to match a line that doesn&#8217;t contain the string <code>'42'<\/code>, print it to the shell, and move on to the next line. The <code>re.finditer(pattern, string)<\/code> method accomplishes this easily by returning an iterator over all match objects. <\/p>\n<p>The regex pattern <code>'^((?!42).)*$'<\/code> matches the whole line from the first position <code>'^'<\/code> to the last position <code>'$'<\/code>. If you need a refresher on the start-of-the-line and<a rel=\"noreferrer noopener\" aria-label=\" end-of-the-line metacharacters, read this 5-min tutorial (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-regex-start-of-line-and-end-of-line\/\" target=\"_blank\"> end-of-the-line metacharacters, read this 5-min tutorial<\/a>.<\/p>\n<p>In between, you match an arbitrary number of characters: the asterisk quantifier does that for you. <a rel=\"noreferrer noopener\" aria-label=\"If you need help understanding the asterisk quantifier, check out this blog tutorial. (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-asterisk\/\" target=\"_blank\">If you need help understanding the asterisk quantifier, check out this blog tutorial.<\/a><\/p>\n<p>Which characters do you match? Only those at positions where the negative lookahead <code>(?!42)<\/code> succeeds, that is, where the next two characters are not <code>'42'<\/code>. <a rel=\"noreferrer noopener\" aria-label=\"If you need a refresher on lookaheads, check out this tutorial. (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-groups\/\" target=\"_blank\">If you need a refresher on lookaheads, check out this tutorial. <\/a><\/p>\n<p>As the lookahead itself doesn&#8217;t consume a character, we need to consume it manually by adding the dot metacharacter <code>.<\/code> which matches all characters except the newline character <code>'\\n'<\/code>. <a rel=\"noreferrer noopener\" aria-label=\"As it turns out, there's also a blog tutorial on the dot metacharacter. 
(opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-dot\/\" target=\"_blank\">As it turns out, there&#8217;s also a blog tutorial on the dot metacharacter.<\/a><\/p>\n<p>Finally, you need to define the <code>re.MULTILINE<\/code> flag, in short: <code>re.M<\/code>, because it allows the start <code>^<\/code> and end <code>$<\/code> metacharacters to match also at the start and end of each line (not only at the start and end of each string). <\/p>\n<p>Together, this regular expression matches all lines that do not contain the specific word <code>'42'<\/code>. <\/p>\n<h2>Where to Go From Here?<\/h2>\n<p>Summary: You&#8217;ve learned multiple ways of matching an exact word in a string. You can use the simple Python membership operator. You can use a default regex with no special metacharacters. You can use the word boundary metacharacter <code>'\\b'<\/code> to match only whole words. You can match case-insensitive by using the flags argument <code>re.IGNORECASE<\/code>. You can match not only one but all occurrences of a word in a string by using the <code>re.findall()<\/code> or <code>re.finditer()<\/code> methods. And you can match all lines containing and not containing a certain word.<\/p>\n<p>Pheww. This was some theory-heavy stuff. 
Do you feel like you need some more practical stuff next?<\/p>\n<p>Then check out my practice-heavy <a href=\"https:\/\/blog.finxter.com\/become-python-freelancer-course\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Python freelancer course (opens in a new tab)\">Python freelancer course<\/a> that helps you prepare for the worst and create a second income stream by creating your thriving coding side-business online.<\/p>\n<p><a href=\"https:\/\/blog.finxter.com\/become-python-freelancer-course\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"https:\/\/blog.finxter.com\/become-python-freelancer-course\/ (opens in a new tab)\">https:\/\/blog.finxter.com\/become-python-freelancer-course\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This morning, I read over an actual Quora thread with this precise question. While there&#8217;s no dumb question, the question reveals that there may be some gap in understanding the basics in Python and Python&#8217;s regular expression library. 
So if you&#8217;re an impatient person, here&#8217;s the short answer: How to match an exact word\/string using [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-110030","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/110030","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=110030"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/110030\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=110030"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=110030"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=110030"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}