{"id":109643,"date":"2020-02-26T11:23:40","date_gmt":"2020-02-26T11:23:40","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=6421"},"modified":"2020-02-26T11:23:40","modified_gmt":"2020-02-26T11:23:40","slug":"regex-special-characters-examples-in-python-re","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2020\/02\/26\/regex-special-characters-examples-in-python-re\/","title":{"rendered":"Regex Special Characters \u2013 Examples in Python Re"},"content":{"rendered":"<p>Regular expressions are a strange animal. Many students find them difficult to understand &#8211; do you? <\/p>\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-rich is-provider-embed-handler wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\">\n<div class=\"ast-oembed-container\"><iframe loading=\"lazy\" title=\"Regex Special Characters - Examples in Python Re\" width=\"1100\" height=\"619\" src=\"https:\/\/www.youtube.com\/embed\/hSy0xea-8p8?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/p><\/div>\n<\/figure>\n<p>I realized that a major reason for this is simply that they don&#8217;t understand the special regex characters. To put it differently: understand the special characters and everything else in the regex space will come much easier to you.<\/p>\n<p><a href=\"https:\/\/blog.finxter.com\/python-regex\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Regular expressions (opens in a new tab)\">Regular expressions<\/a> are built from characters. 
There are <a href=\"https:\/\/www.regular-expressions.info\/characters.html\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">two types<\/a> of characters: <strong>literal characters<\/strong> and <strong>special characters<\/strong>.<\/p>\n<h2>Literal Characters<\/h2>\n<p>Let&#8217;s start with the absolute first thing you need to know about regular expressions: a regular expression (short: <em>regex<\/em>) searches for a given pattern in a given string.<\/p>\n<p>What&#8217;s a pattern? In its most basic form, a pattern can be a literal character. So the literal characters <code>'a'<\/code>, <code>'b'<\/code>, and <code>'c'<\/code> are all valid regex patterns. <\/p>\n<p>For example, you can search for the regex pattern <code>'a'<\/code> in the string <code>'hello world'<\/code>, but it won&#8217;t find a <em>match<\/em>. You can also search for the pattern <code>'a'<\/code> in the string <code>'hello woman'<\/code>, and there is a match: the second-to-last character in the string.<\/p>\n<p>Based on the simple insight that a literal character is a valid regex pattern, you&#8217;ll find that a combination of literal characters is also a valid regex pattern. For example, the regex pattern <code>'an'<\/code> matches the last two characters in the string <code>'hello woman'<\/code>. <\/p>\n<p><strong>Summary<\/strong>: Regular expressions are built from characters. An important class of characters is the literal characters. <a href=\"https:\/\/docs.python.org\/3\/library\/re.html\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"In principle (opens in a new tab)\">In principle<\/a>, you can use all <a href=\"https:\/\/en.wikipedia.org\/wiki\/Unicode\">Unicode<\/a> literal characters in your regex pattern. <\/p>\n<h2>Special Characters<\/h2>\n<p>However, the power of regular expressions comes from their abstraction capability. 
Instead of writing the <a rel=\"noreferrer noopener\" aria-label=\"character class (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-character-set-regex-tutorial\/\" target=\"_blank\">character set<\/a> <code>[abcdefghijklmnopqrstuvwxyz]<\/code>, you&#8217;d write <code>[a-z]<\/code> or, if uppercase letters, digits, and the underscore should match too, <code>\\w<\/code>. The latter is a special regex character, and pros know these by heart. In fact, regex experts seldom match only literal characters. In most cases, they use more advanced constructs or special characters for various reasons such as brevity, expressiveness, or generality. <\/p>\n<p><strong>So what are the special characters you can use in your regex patterns?<\/strong><\/p>\n<p>Let&#8217;s have a look at the following table that contains the most important special characters in Python&#8217;s <code>re<\/code> package for regular expression processing. <\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table>\n<thead>\n<tr>\n<th>Special Character<\/th>\n<th>Meaning<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>\\n<\/code><\/td>\n<td>The <strong>newline<\/strong> symbol is not particular to regex; it\u2019s actually one of the most widely used standard characters. However, you\u2019ll see the newline character so often that I just couldn\u2019t write this list without including it. For example, the regex <code>'hello\\nworld'<\/code> matches a string where the string <code>'hello'<\/code> is placed in one line and the string <code>'world'<\/code> is placed in the next line.\u00a0<\/td>\n<\/tr>\n<tr>\n<td><code>\\t<\/code><\/td>\n<td>The <strong>tab<\/strong> character is, like the newline character, not a \u201cregex-specific\u201d symbol. It just encodes the tab space <code>'\u00a0\u00a0\u00a0'<\/code>, which is different from a sequence of spaces (even if it doesn&#8217;t look different over here). 
For example, the regex <code>'hello\\n\\tworld'<\/code> matches the string that consists of <code>'hello'<\/code> in the first line and <code>' world'<\/code> in the second line (with a leading tab character).<\/td>\n<\/tr>\n<tr>\n<td><code>\\s<\/code><\/td>\n<td>The <strong>whitespace<\/strong> character is, in contrast to the newline character, a special symbol of the regex libraries. You\u2019ll find it in many other programming languages, too. The problem is that you often don\u2019t know which type of whitespace is used: tab characters, simple spaces, or even newlines. The whitespace character <code>'\\s'<\/code> simply matches any of them. For example, the regex <code>'\\s*hello\\s+world'<\/code> matches the string <code>' \\t \\n hello \\n \\n \\t world'<\/code>, as well as <code>'hello world'<\/code>.<\/td>\n<\/tr>\n<tr>\n<td><code>\\S<\/code><\/td>\n<td>The <strong>whitespace-negation<\/strong> character matches everything that does not match <code>\\s<\/code>. <\/td>\n<\/tr>\n<tr>\n<td><code>\\w<\/code><\/td>\n<td>The <strong>word<\/strong> character regex simplifies text processing significantly. It represents the class of all characters used in typical words (<code>A-Z<\/code>, <code>a-z<\/code>, <code>0-9<\/code>, and <code>'_'<\/code>). For Unicode string patterns in Python 3, it also matches letters and digits from other scripts unless you set the <code>re.ASCII<\/code> flag. For example, the regex <code>'\\w+'<\/code> matches the strings <code>'hello'<\/code>, <code>'bye'<\/code>, <code>'Python'<\/code>, and <code>'Python_is_great'<\/code>.\u00a0<\/td>\n<\/tr>\n<tr>\n<td><code>\\W<\/code><\/td>\n<td>The <strong>word-character-negation<\/strong>. It matches any character that is not a word character.<\/td>\n<\/tr>\n<tr>\n<td><code>\\b<\/code><\/td>\n<td>The <strong>word boundary<\/strong> is also a special symbol used in many regex tools. You can use it to match, as the name suggests, the boundary between a word character (<code>\\w<\/code>) and a non-word (<code>\\W<\/code>) character. 
But note that it matches only the empty string! You may ask: why does it exist if it doesn\u2019t match any character? The reason is that it doesn\u2019t \u201cconsume\u201d the character right before or right after a word. This way, you can search for whole words (or parts of words) and return only the word but not the delimiting characters that separate it from other words. In Python code, write the pattern as a raw string such as <code>r'\\bcat\\b'<\/code>, because <code>'\\b'<\/code> in a normal string literal is the backspace character.<\/td>\n<\/tr>\n<tr>\n<td><code>\\d<\/code><\/td>\n<td>The <strong>digit character <\/strong>matches any decimal digit from 0 to 9 (for Unicode string patterns, also decimal digits of other scripts unless the <code>re.ASCII<\/code> flag is set). You can use it to match integers with an arbitrary number of digits: the regex <code>'\\d+'<\/code> matches integer numbers <code>'10'<\/code>, <code>'1000'<\/code>, <code>'942'<\/code>, and <code>'99999999999'<\/code>.<\/td>\n<\/tr>\n<tr>\n<td><code>\\D<\/code><\/td>\n<td>Matches any <strong>non-digit character<\/strong>. This is the inverse of <code>\\d<\/code>; in ASCII mode, it&#8217;s equivalent to <code>[^0-9]<\/code>. <\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>But these are not all characters you can use in a regular expression. <\/p>\n<p>There are also <em>meta characters<\/em> for the regex engine that allow you to do much more powerful stuff. <\/p>\n<p>A good example is the asterisk operator that matches &#8220;zero or more&#8221; occurrences of the preceding regex. For example, the pattern <code>.*txt<\/code> matches any number of arbitrary characters (except newlines) followed by the suffix <code>'txt'<\/code>. This pattern has two special regex meta characters: the dot <code>.<\/code> and the asterisk operator <code>*<\/code>. 
You&#8217;ll now learn about those meta characters:<\/p>\n<h2>Regex Meta Characters<\/h2>\n<p>Feel free to watch the short video about the most important regex meta characters:<\/p>\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-rich is-provider-embed-handler wp-embed-aspect-4-3 wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\">\n<div class=\"ast-oembed-container\"><iframe loading=\"lazy\" title=\"Python Regex Syntax [15-Minute Primer]\" width=\"1100\" height=\"825\" src=\"https:\/\/www.youtube.com\/embed\/G1JLUpc-bvY?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/p><\/div>\n<\/figure>\n<p>Next, you\u2019ll get a quick and dirty overview of the most important regex operations and how to use them in Python.<\/p>\n<p>Here are the most important regex operators:<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table>\n<thead>\n<tr>\n<th>Meta Character<\/th>\n<th>Meaning<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>.<\/code><\/td>\n<td>The <strong>wild-card<\/strong> operator (<em>dot<\/em>) matches any character in a string except the newline character <code>'\\n'<\/code>. For example, the regex <code>'...'<\/code> matches all words with three characters such as <code>'abc'<\/code>, <code>'cat'<\/code>, and <code>'dog'<\/code>.\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td><code>*<\/code><\/td>\n<td>The <strong>zero-or-more<\/strong> asterisk operator matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex. For example, the regex &#8216;cat*&#8217; matches the strings <code>'ca'<\/code>, <code>'cat'<\/code>, <code>'catt'<\/code>, <code>'cattt'<\/code>, and <code>'catttttttt'<\/code>. <\/td>\n<\/tr>\n<tr>\n<td><code>?<\/code><\/td>\n<td>The <strong>zero-or-one<\/strong> operator matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. 
For example, the regex \u2018cat?\u2019 matches both strings <code>\u2018ca\u2019<\/code> and <code>\u2018cat\u2019<\/code> &#8212; but not <code>\u2018catt\u2019<\/code>, <code>\u2018cattt\u2019<\/code>, and <code>\u2018catttttttt\u2019<\/code>.\u00a0<\/td>\n<\/tr>\n<tr>\n<td><code>+<\/code><\/td>\n<td>The <strong>at-least-one<\/strong> operator matches one or more occurrences of the immediately preceding regex. For example, the regex <code>\u2018cat+\u2019<\/code> does not match the string <code>\u2018ca\u2019<\/code> but matches all strings with at least one trailing character <code>\u2018t\u2019<\/code> such as <code>\u2018cat\u2019<\/code>, <code>\u2018catt\u2019<\/code>, and <code>\u2018cattt\u2019<\/code>.\u00a0<\/td>\n<\/tr>\n<tr>\n<td><code>^<\/code><\/td>\n<td>The <strong>start-of-string<\/strong> operator matches the beginning of a string. For example, the regex <code>\u2018^p\u2019<\/code> would match the strings <code>\u2018python\u2019<\/code> and <code>\u2018programming\u2019<\/code> but not <code>\u2018lisp\u2019<\/code> and <code>\u2018spying\u2019<\/code> where the character <code>\u2018p\u2019<\/code> does not occur at the start of the string.<\/td>\n<\/tr>\n<tr>\n<td><code>$<\/code><\/td>\n<td>The <strong>end-of-string<\/strong> operator matches the end of a string. For example, the regex <code>\u2018py$\u2019<\/code> would match the strings <code>\u2018main.py\u2019<\/code> and <code>\u2018pypy\u2019<\/code> but not the strings <code>\u2018python\u2019<\/code> and <code>\u2018pypi\u2019<\/code>.\u00a0<\/td>\n<\/tr>\n<tr>\n<td><code>A|B<\/code><\/td>\n<td>The <strong>OR<\/strong> operator matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. For example, the regex <code>\u2018(hello)|(hi)\u2019<\/code> matches strings <code>\u2018hello world\u2019<\/code> and <code>\u2018hi python\u2019<\/code>. 
It wouldn\u2019t make sense to try to match both of them at the same time.<\/td>\n<\/tr>\n<tr>\n<td><code>AB<\/code>\u00a0<\/td>\n<td>The <strong>AND<\/strong> operator matches first the regex A and second the regex B, in this sequence. We\u2019ve already seen it trivially in the regex <code>\u2018ca\u2019<\/code> that matches first regex <code>\u2018c\u2019<\/code> and second regex <code>\u2018a\u2019<\/code>.\u00a0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the <code>\u2018^\u2019<\/code> operator is usually denoted as the \u2018caret\u2019 operator. Those names are not descriptive so I came up with more kindergarten-like words such as the \u201cstart-of-string\u201d operator.<\/p>\n<p>Let\u2019s dive into some examples!<\/p>\n<h2>Examples<\/h2>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import re\n\ntext = '''\n    Ha! let me see her: out, alas! he's cold:\n    Her blood is settled, and her joints are stiff;\n    Life and these lips have long been separated:\n    Death lies on her like an untimely frost\n    Upon the sweetest flower of all the field.\n'''\n\nprint(re.findall('.a!', text))\n'''\nFinds all occurrences of an arbitrary character that is\nfollowed by the character sequence 'a!'.\n['Ha!']\n'''\n\nprint(re.findall('is.*and', text))\n'''\nFinds all occurrences of the word 'is',\nfollowed by an arbitrary number of characters\nand the word 'and'.\n['is settled, and']\n'''\n\nprint(re.findall('her:?', text))\n'''\nFinds all occurrences of the word 'her',\nfollowed by zero or one occurrences of the colon ':'.\n['her:', 'her', 'her']\n'''\n\nprint(re.findall('her:+', text))\n'''\nFinds all occurrences of the word 'her',\nfollowed by one or more occurrences of the colon ':'.\n['her:']\n'''\n\nprint(re.findall('^Ha.*', text))\n'''\nFinds all occurrences where the string starts with\nthe character sequence 'Ha', followed by an arbitrary\nnumber of characters except for the new-line character.\nCan you figure out why Python doesn't find any?\n[]\n'''\n\nprint(re.findall('\\n$', text))\n'''\nFinds all occurrences where the new-line character '\\n'\noccurs at the end of the string.\n['\\n']\n'''\n\nprint(re.findall('(Life|Death)', text))\n'''\nFinds all occurrences of either the word 'Life' or the\nword 'Death'.\n['Life', 'Death']\n'''\n<\/pre>\n<p>In these examples, you\u2019ve already seen the special symbol <code>\\n<\/code> which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions.<\/p>\n<h2>Where to Go From Here<\/h2>\n<p>You&#8217;ve learned all special characters of regular expressions, as well as meta characters. This will give you a strong basis for improving your regex skills.<\/p>\n<p>If you want to accelerate your skills, you need a good foundation. 
Check out my brand-new Python book &#8220;<a href=\"https:\/\/www.amazon.com\/gp\/product\/B07ZY7XMX8\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Python One-Liners (Amazon Link) (opens in a new tab)\">Python One-Liners (Amazon Link)<\/a>&#8221; which boosts your skills from zero to hero&#8212;in a single line of Python code!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Regular expressions are a strange animal. Many students find them difficult to understand &#8211; do you? I realized that a major reason for this is simply that they don&#8217;t understand the special regex characters. To put it differently: understand the special characters and everything else in the regex space will come much easier to you. [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-109643","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/109643","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=109643"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/109643\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=109643"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=109643"},{"taxonomy":"post_tag","emb
eddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=109643"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}