{"id":109244,"date":"2020-02-15T19:04:14","date_gmt":"2020-02-15T19:04:14","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=6208"},"modified":"2020-02-15T19:04:14","modified_gmt":"2020-02-15T19:04:14","slug":"python-character-set-regex-tutorial","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2020\/02\/15\/python-character-set-regex-tutorial\/","title":{"rendered":"Python Character Set [Regex Tutorial]"},"content":{"rendered":"<p>This tutorial makes you a master of <strong>character sets<\/strong> in Python. (I know, I know, it feels awesome to see your deepest desires finally come true.)<\/p>\n<figure class=\"wp-block-embed-youtube wp-block-embed is-type-video is-provider-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\">\n<div class=\"ast-oembed-container\"><iframe loading=\"lazy\" title=\"Python Character Set [Regex Tutorial]\" width=\"1100\" height=\"825\" src=\"https:\/\/www.youtube.com\/embed\/lrI5wmZo-mY?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/div>\n<\/div>\n<\/figure>\n<p>As I wrote this article, I saw a lot of different terms describing this same powerful concept such as &#8220;<strong>character class<\/strong>&#8220;, &#8220;<strong>character range<\/strong>&#8220;, or &#8220;<strong>character group<\/strong>&#8220;. However, the most precise term is &#8220;<strong>character set<\/strong>&#8221; as introduced in the <a href=\"https:\/\/docs.python.org\/3\/library\/re.html\">official <\/a>Python regex docs. So in this tutorial, I&#8217;ll use this term throughout.<\/p>\n<h2>Python Regex &#8211; Character Set<\/h2>\n<p><strong>So, what is a character set in regular expressions?<\/strong><\/p>\n<p>The character set is (surprise) a set of characters: if you use a character set in a regular expression pattern, you tell the regex engine to choose one arbitrary character from the set. 
As you may know, a <em><a href=\"https:\/\/blog.finxter.com\/sets-in-python\/\">set is an unordered collection of unique elements<\/a><\/em>. So listing a character twice has no extra effect, and the order of the characters doesn&#8217;t matter (with a few exceptions: a ^ in first position negates the set, and a - between two characters defines a range). <\/p>\n<p>Here&#8217;s an example of a character set as used in a regular expression:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('[abcde]', 'hello world!')\n['e', 'd']<\/pre>\n<p>You use the <a rel=\"noreferrer noopener\" aria-label=\"re.findall(pattern, string) method (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-findall\/\" target=\"_blank\">re.findall(pattern, string) method<\/a> to match the pattern <code>'[abcde]'<\/code> in the string <code>'hello world!'<\/code>. You can think of the characters a, b, c, d, and e as being in an OR relation: any one of them is a valid match.<\/p>\n<p>The regex engine scans the string &#8216;hello world!&#8217; from left to right and tries to match the (character set) pattern at each position. Two characters from the text &#8216;hello world!&#8217; are in the character set, so they are valid matches and are returned by the re.findall() method.<\/p>\n<p>You can simplify many character sets by using the range symbol &#8216;-&#8217;, which has a special meaning within square brackets: [a-z] reads &#8220;match any character from a to z&#8221;, while [0-9] reads &#8220;match any character from 0 to 9&#8221;. 
<\/p>\n<p>Here&#8217;s the previous example, simplified:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall('[a-e]', 'hello world!')\n['e', 'd']<\/pre>\n<p>You can even combine multiple character ranges in a single character set:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall('[a-eA-E0-4]', 'hello WORLD 42!')\n['e', 'D', '4', '2']<\/pre>\n<p>Here, you match three ranges: lowercase letters from a to e, uppercase letters from A to E, and digits from 0 to 4. Note that the ranges are inclusive, so both the start and the end character belong to the range.<\/p>\n<h2>Python Regex Negative Character Set<\/h2>\n<p>But what if you want to match all characters except some? 
You can achieve this with a negative character set!<\/p>\n<p>The negative character set works just like a character set, but with one difference: it matches all characters that are <strong><em>not<\/em><\/strong> in the character set.<\/p>\n<p>Here&#8217;s an example where you match all sequences of characters that do not contain the characters a, b, c, d, or e:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import re\n>>> re.findall('[^a-e]+', 'hello world')\n['h', 'llo worl']<\/pre>\n<p>The example uses the &#8220;at-least-once quantifier +&#8221;, which matches one or more occurrences of the preceding regex (if you&#8217;re unsure about how it works, check out my detailed <a rel=\"noreferrer noopener\" aria-label=\"Finxter tutorial about the plus operator (opens in a new tab)\" href=\"https:\/\/blog.finxter.com\/python-re-plus\/\" target=\"_blank\">Finxter tutorial about the plus operator<\/a>). <\/p>\n<p>There are only two such sequences: the one-character sequence &#8216;h&#8217; and the eight-character sequence &#8216;llo worl&#8217;. You can see that even the space character matches the negative character set. <\/p>\n<p><strong>Summary: the negative character set matches any single character that is not listed between the brackets.<\/strong><\/p>\n<h2>How to Fix <em>&#8220;re.error: unterminated character set at position&#8221;<\/em>?<\/h2>\n<p>Now that you know character sets, you can probably fix this error easily: it occurs if your pattern contains an opening bracket &#8216;[&#8217; without a matching closing bracket &#8216;]&#8217;. Maybe you want to match the character &#8216;[&#8217; in your string?<\/p>\n<p>But Python assumes that you&#8217;ve just opened a character set and forgot to close it. 
<\/p>\n<p>Here&#8217;s an example:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall('[', 'hello [world]')\nTraceback (most recent call last):\n  File \"&lt;pyshell#5>\", line 1, in &lt;module>\n    re.findall('[', 'hello [world]')\n  File \"C:\\Users\\xcent\\AppData\\Local\\Programs\\Python\\Python37\\lib\\re.py\", line 223, in findall\n    return _compile(pattern, flags).findall(string)\n  File \"C:\\Users\\xcent\\AppData\\Local\\Programs\\Python\\Python37\\lib\\re.py\", line 286, in _compile\n    p = sre_compile.compile(pattern, flags)\n  File \"C:\\Users\\xcent\\AppData\\Local\\Programs\\Python\\Python37\\lib\\sre_compile.py\", line 764, in compile\n    p = sre_parse.parse(p, flags)\n  File \"C:\\Users\\xcent\\AppData\\Local\\Programs\\Python\\Python37\\lib\\sre_parse.py\", line 930, in parse\n    p = _parse_sub(source, pattern, flags &amp; SRE_FLAG_VERBOSE, 0)\n  File \"C:\\Users\\xcent\\AppData\\Local\\Programs\\Python\\Python37\\lib\\sre_parse.py\", line 426, in _parse_sub\n    not nested and not items))\n  File \"C:\\Users\\xcent\\AppData\\Local\\Programs\\Python\\Python37\\lib\\sre_parse.py\", line 532, in _parse\n    source.tell() - here)\nre.error: unterminated character set at position 0<\/pre>\n<p>The error happens because you used the bracket character &#8216;[&#8217; as if it was a normal symbol. <\/p>\n<p>So, how to fix it? 
Escape the special bracket character with a single backslash, &#8216;\\[&#8217;, and write the pattern as a raw string so that Python passes the backslash to the regex engine unchanged:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> re.findall(r'\\[', 'hello [world]')\n['[']<\/pre>\n<p>This removes the &#8220;special&#8221; meaning of the bracket symbol.<\/p>\n<h2>Related Re Methods<\/h2>\n<p>There are seven important regular expression methods which you must master:<\/p>\n<ul>\n<li>The <strong>re.findall(pattern, string)<\/strong> method returns a list of all matching substrings. Read more in <a href=\"https:\/\/blog.finxter.com\/python-re-findall\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.search(pattern, string)<\/strong> method returns a match object of the first match. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-search\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.match(pattern, string)<\/strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-match\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.fullmatch(pattern, string)<\/strong> method returns a match object if the regex matches the whole string. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-fullmatch\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.compile(pattern)<\/strong> method prepares the regular expression pattern and returns a regex object which you can use multiple times in your code. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-compile\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.split(pattern, string)<\/strong> method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. 
Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-split\/\">our blog tutorial<\/a>.<\/li>\n<li>The <strong>re.sub(pattern, repl, string, count=0, flags=0)<\/strong> method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in <a href=\"https:\/\/blog.finxter.com\/python-regex-sub\/\">our blog tutorial<\/a>.<\/li>\n<\/ul>\n<p>These seven methods are 80% of what you need to know to get started with Python&#8217;s regular expression functionality. If you want to learn more, check out <a href=\"https:\/\/blog.finxter.com\/python-regex\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"the most comprehensive Python regex tutorial in the world (opens in a new tab)\">the most comprehensive Python regex tutorial in the world<\/a>!<\/p>\n<h2>Where to Go From Here?<\/h2>\n<p>You&#8217;ve learned everything you need to know about the <strong><em>Python Regex Character Set<\/em><\/strong> Operator. <\/p>\n<p><em><strong>Summary<\/strong>: <\/em><\/p>\n<p><strong>If you use a character set [XYZ] in a regular expression pattern, you tell the regex engine to match exactly one character from the set: X, Y, or Z. <\/strong><\/p>\n<hr class=\"wp-block-separator\"\/>\n<p><strong>Want to earn money while you learn Python?<\/strong> Average Python programmers earn more than $50 per hour. You can certainly become average, can&#8217;t you?<\/p>\n<p>Join the free webinar that shows you how to become a thriving coding business owner online!<\/p>\n<p><a href=\"https:\/\/blog.finxter.com\/webinar-freelancer\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">[Webinar] Become a Six-Figure Freelance Developer with Python<\/a><\/p>\n<p>Join us. It&#8217;s fun! 
<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/12.0.0-1\/72x72\/1f642.png\" alt=\"\ud83d\ude42\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This tutorial makes you a master of character sets in Python. (I know, I know, it feels awesome to see your deepest desires finally come true.) As I wrote this article, I saw a lot of different terms describing this same powerful concept such as &#8220;character class&#8220;, &#8220;character range&#8220;, or &#8220;character group&#8220;. However, the most [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-109244","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/109244","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=109244"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/109244\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=109244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=109244"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=109244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{r
el}","templated":true}]}}