[Tut] Python Character Set [Regex Tutorial] - xSicKxBot - 02-18-2020

Python Character Set [Regex Tutorial]

<div><p>This tutorial makes you a master of <strong>character sets</strong> in Python. (I know, I know, it feels awesome to see your deepest desires finally come true.)</p>
<p>As I wrote this article, I saw a lot of different terms describing this same powerful concept, such as “<strong>character class</strong>”, “<strong>character range</strong>”, or “<strong>character group</strong>”. However, the most precise term is “<strong>character set</strong>”, as introduced in the <a href="https://docs.python.org/3/library/re.html">official</a> Python regex docs. So in this tutorial, I’ll use this term throughout.</p>
<h2>Python Regex – Character Set</h2>
<p><strong>So, what is a character set in regular expressions?</strong></p>
<p>The character set is (surprise) a set of characters: if you use a character set in a regular expression pattern, you tell the regex engine to match exactly one character, which may be any one of the characters in the set. As you may know, a <em><a href="https://blog.finxter.com/sets-in-python/">set is an unordered collection of unique elements</a></em>. So listing a character twice has no extra effect, and the order doesn’t really matter (with a few minor exceptions, such as the position of <code>^</code>, <code>-</code>, and <code>]</code>).</p>
<p>Here’s an example of a character set as used in a regular expression:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('[abcde]', 'hello world!')
['e', 'd']</pre>
<p>You use the <a rel="noreferrer noopener" aria-label="re.findall(pattern, string) method (opens in a new tab)" href="https://blog.finxter.com/python-re-findall/" target="_blank">re.findall(pattern, string) method</a> to match the pattern <code>'[abcde]'</code> in the string <code>'hello world!'</code>. You can think of all characters a, b, c, d, and e as being in an OR relation: any one of them would be a valid match.</p>
<p>The regex engine goes from left to right, scanning over the string <code>'hello world!'</code> and trying to match the (character set) pattern at each position. Two characters from the text <code>'hello world!'</code> are in the character set, so they are valid matches and are returned by the <code>re.findall()</code> method.</p>
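<p>To make the left-to-right scan visible, here is a minimal sketch that uses <code>re.finditer()</code> to report the position of each match. The variable names <code>text</code> and <code>matches</code> are only illustrative and not part of the original example.</p>

```python
import re

text = 'hello world!'

# finditer yields one match object per match, in left-to-right order.
matches = [(m.start(), m.group()) for m in re.finditer('[abcde]', text)]
print(matches)  # [(1, 'e'), (10, 'd')]

# The set behaves like an alternation of its single characters.
alternation = re.findall('a|b|c|d|e', text)
print(alternation == re.findall('[abcde]', text))  # True
```

<p>The second check illustrates the OR relation described above: the set and the explicit alternation return the same matches.</p>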
<p>You can simplify many character sets by using the range symbol <code>-</code>, which has a special meaning within square brackets: <code>[a-z]</code> reads “match any character from a to z”, while <code>[0-9]</code> reads “match any character from 0 to 9”.</p>
<p>Here’s the previous example, simplified:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[a-e]', 'hello world!')
['e', 'd']</pre>
<p>You can even combine multiple character ranges in a single character set:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[a-eA-E0-4]', 'hello WORLD 42!')
['e', 'D', '4', '2']</pre>
<p>Here, you match three ranges: lowercase characters from a to e, uppercase characters from A to E, and digits from 0 to 4. Note that the ranges are inclusive, so both the start and the stop symbol are included in the range.</p>
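<p>The following sketch checks a few of these properties on a made-up sample string: a range matches the same characters as the spelled-out set, order and duplicates inside the brackets make no difference, and a hyphen placed first in the set is treated as a literal hyphen rather than a range symbol. All variable names are illustrative.</p>

```python
import re

text = 'a-e: hello WORLD 42!'

# A range is shorthand for listing every character between its endpoints.
range_hits = re.findall('[a-e]', text)
listed_hits = re.findall('[abcde]', text)

# Order and duplicates inside the brackets do not change the result.
shuffled_hits = re.findall('[edcbaabc]', text)

# A hyphen placed first in the set is a literal hyphen, not a range.
literal_hyphen_hits = re.findall('[-ae]', text)

print(range_hits)           # ['a', 'e', 'e']
print(literal_hyphen_hits)  # ['a', '-', 'e', 'e']
```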
<h2>Python Regex Negative Character Set</h2>
<p>But what if you want to match all characters—except some? You can achieve this with a negative character set!</p>
<p>The negative character set works just like a character set, but with one difference: it matches all characters that are <strong><em>not</em></strong> in the character set.</p>
<p>Here’s an example where you match all maximal sequences of characters that do not contain any of the characters a, b, c, d, or e:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> import re
>>> re.findall('[^a-e]+', 'hello world')
['h', 'llo worl']</pre>
<p>We use the “at-least-once quantifier +” in the example that matches at least one occurrence of the preceding regex (if you’re unsure about how it works, check out my detailed <a rel="noreferrer noopener" aria-label="Finxter tutorial about the plus operator (opens in a new tab)" href="https://blog.finxter.com/python-re-plus/" target="_blank">Finxter tutorial about the plus operator</a>). </p>
<p>There are only two such sequences: the one-character sequence ‘h’ and the eight-character sequence ‘llo worl’. You can see that even the empty space matches the negative character set. </p>
<p><strong>Summary: the negative character set matches all characters that are not enclosed in the brackets.</strong></p>
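<p>As a small sketch of how the negative set behaves with and without the quantifier, the snippet below compares both forms. It also shows that the caret only negates when it is the first character in the brackets. The variable names are illustrative.</p>

```python
import re

text = 'hello world'

# Without a quantifier, every character outside a-e is its own match.
singles = re.findall('[^a-e]', text)

# With +, neighbouring matches are merged into the longest possible runs.
runs = re.findall('[^a-e]+', text)

# A caret that is not the first character in the set is an ordinary character.
carets = re.findall('[a-e^]', 'a^b')

print(singles)  # ['h', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l']
print(runs)     # ['h', 'llo worl']
print(carets)   # ['a', '^', 'b']
```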
<h2>How to Fix <em>“re.error: unterminated character set at position”</em>?</h2>
<p>Now that you know character sets, you can probably fix this error easily: it occurs if your pattern contains an opening bracket <code>[</code> without a matching closing bracket <code>]</code>. (A lone closing bracket is treated as a literal character and does not raise this error.) Maybe you want to match the character <code>[</code> in your string?</p>
<p>But Python assumes that you’ve just opened a character set and forgot to close it.</p>
<p>Here’s an example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall('[', 'hello [world]')
Traceback (most recent call last):
  File "&lt;pyshell#5>", line 1, in &lt;module>
    re.findall('[', 'hello [world]')
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags &amp; SRE_FLAG_VERBOSE, 0)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 532, in _parse
    source.tell() - here)
re.error: unterminated character set at position 0</pre>
<p>The error happens because you used the bracket character <code>[</code> as if it were a normal symbol.</p>
<p>So, how to fix it? Just escape the special bracket character with a single backslash, <code>\[</code>, and write the pattern as a raw string so that Python passes the backslash through to the regex engine unchanged:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">>>> re.findall(r'\[', 'hello [world]')
['[']</pre>
<p>This removes the “special” meaning of the bracket symbol.</p>
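<p>If the bracket comes from user input or another variable, one option is to let <code>re.escape()</code> add the backslash for you. The sketch below, with illustrative variable names, reproduces the error in a <code>try</code> block and then shows both ways of matching a literal bracket.</p>

```python
import re

text = 'hello [world]'

# An unescaped '[' opens a set that is never closed, so compiling fails.
try:
    re.compile('[')
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Two ways to match a literal opening bracket:
escaped = re.findall(r'\[', text)        # manual escape in a raw string
auto = re.findall(re.escape('['), text)  # re.escape adds the backslash

# A lone closing bracket is already a literal and needs no escape.
closing = re.findall(']', text)

print(error_message)
print(escaped, auto, closing)
```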
<h2>Related Re Methods</h2>
<p>There are seven important regular expression methods which you must master:</p>
<ul>
<li>The <strong>re.findall(pattern, string)</strong> method returns a list of string matches. Read more in <a href="https://blog.finxter.com/python-re-findall/">our blog tutorial</a>.</li>
<li>The <strong>re.search(pattern, string)</strong> method returns a match object of the first match. Read more in <a href="https://blog.finxter.com/python-regex-search/">our blog tutorial</a>.</li>
<li>The <strong>re.match(pattern, string)</strong> method returns a match object if the regex matches at the beginning of the string. Read more in <a href="https://blog.finxter.com/python-regex-match/">our blog tutorial</a>.</li>
<li>The <strong>re.fullmatch(pattern, string)</strong> method returns a match object if the regex matches the whole string. Read more in <a href="https://blog.finxter.com/python-regex-fullmatch/">our blog tutorial</a>.</li>
<li>The <strong>re.compile(pattern)</strong> method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in <a href="https://blog.finxter.com/python-regex-compile/">our blog tutorial</a>.</li>
<li>The <strong>re.split(pattern, string)</strong> method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in <a href="https://blog.finxter.com/python-regex-split/">our blog tutorial</a>.</li>
<li>The <strong>re.sub(pattern, repl, string, count=0, flags=0)</strong> method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in <a href="https://blog.finxter.com/python-regex-sub/">our blog tutorial</a>.</li>
</ul>
<p>These seven methods are 80% of what you need to know to get started with Python’s regular expression functionality. If you want to learn more, check out <a href="https://blog.finxter.com/python-regex/" target="_blank" rel="noreferrer noopener" aria-label="the most comprehensive Python regex tutorial in the world (opens in a new tab)">the most comprehensive Python regex tutorial in the world</a>!</p>
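<p>As a rough illustration of how the seven methods behave with a character set, here is a small sketch that applies each of them to the same string. The vowel set and the variable names are assumptions chosen for the example.</p>

```python
import re

text = 'hello world'
vowel = '[aeiou]'

found = re.findall(vowel, text)                # ['e', 'o', 'o']
first = re.search(vowel, text).group()         # 'e'
at_start = re.match(vowel, text)               # None: 'h' is not a vowel
whole = re.fullmatch('[a-z ]+', text).group()  # 'hello world'
compiled = re.compile(vowel)                   # reusable regex object
pieces = re.split(vowel, text)                 # ['h', 'll', ' w', 'rld']
replaced = re.sub(vowel, '*', text)            # 'h*ll* w*rld'

print(found, first, at_start, whole)
print(compiled.findall(text) == found)         # True
print(pieces, replaced)
```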
<h2>Where to Go From Here?</h2>
<p>You’ve learned everything you need to know about the <strong><em>Python Regex Character Set</em></strong> Operator. </p>
<p><em><strong>Summary</strong>: </em></p>
<p><strong>If you use a character set [XYZ] in a regular expression pattern, you tell the regex engine to match exactly one character from the set: X, Y, or Z.</strong></p>
</div>


https://www.sickgaming.net/blog/2020/02/15/python-character-set-regex-tutorial/