[Tut] Python | Split Text into Sentences


<div>
<p class="has-background" style="background-color:#e8caff"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" /><strong>Summary: </strong>This article covers four ways to split a text into sentences:<br /><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Using the <code>nltk</code> module<br /><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Using <code>re.split()</code><br /><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Using <code>re.findall()</code><br /><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Using <code>replace()</code></p>
<h3><strong>Minimal Example</strong></h3>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">text = "God is Great! I won a lottery."

# Method 1: nltk keeps the punctuation and strips the whitespace
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))  # ['God is Great!', 'I won a lottery.']

# Method 2: re.split() with a character class
import re
res = [x for x in re.split(r"[.!?]", text) if x != ""]
print(res)  # ['God is Great', ' I won a lottery']

# Method 3: re.findall()
res = re.findall(r"[^.!?]+", text)
print(res)  # ['God is Great', ' I won a lottery']

# Method 4: replace() followed by split()
def splitter(txt, delim):
    for i in delim:
        txt = txt.replace(i, ',')
    res = txt.split(',')
    res.pop()  # drop the trailing empty string
    return res

sep = ['.', '!']
print(splitter(text, sep))  # ['God is Great', ' I won a lottery']</pre>
<hr class="wp-block-separator has-alpha-channel-opacity" />
<h2>Problem Formulation</h2>
<p><strong>Problem: </strong>Given a string containing several sentences, how do you split it into a list of sentences?</p>
<p><strong>Example: </strong>Let’s visualize the problem with the help of an example.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Input
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
# output
['This is sentence 1', ' This is sentence 2', ' This is sentence 3']</pre>
<h2>Method 1: <strong>Using <a href="https://www.nltk.org/" target="_blank" rel="noreferrer noopener">nltk.tokenize</a></strong></h2>
<p>Natural Language Processing (NLP) uses a process known as tokenization to divide a large quantity of text into smaller parts called tokens. The Natural Language Toolkit (NLTK) provides the <strong><em>nltk.tokenize</em></strong> package, whose <code>sent_tokenize()</code> function splits a given text into sentences.</p>
<p><strong>Code:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">from nltk.tokenize import sent_tokenize
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(sent_tokenize(text)) # ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']</pre>
<p><strong>Explanation:&nbsp;</strong></p>
<ul>
<li>Import the <code>sent_tokenize</code> function from <code>nltk.tokenize</code>.</li>
<li><code>sent_tokenize()</code> parses the given text and breaks it into individual sentences at sentence-ending punctuation such as periods, exclamation marks, and question marks. Unlike the other methods in this article, it keeps the punctuation and strips the whitespace between sentences.</li>
</ul>
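<p><strong>Caution: </strong>You might get a <code>LookupError</code> the first time you call <code>sent_tokenize()</code>, because the tokenizer data it relies on is downloaded separately from the <code>nltk</code> package. So, here’s the entire process to set up <code>nltk</code> on your system.</p>
<p>Install nltk using <code>pip install nltk</code>.</p>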
<p>Then go ahead and type the following in your Python shell:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import nltk
nltk.download('punkt')</pre>
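<p>That’s it! You are now ready to use <code>sent_tokenize()</code> in your code. If a newer NLTK release still reports a missing resource, the error message names the package to download (for example <code>punkt_tab</code>).</p>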
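<p>If you would rather not let a missing install or missing tokenizer data stop your script, one option is a fallback. The following is only a sketch under assumptions: <code>split_sentences</code> is a hypothetical helper name, and the regex fallback is a simplification that does not handle abbreviations the way NLTK does.</p>

```python
import re


def split_sentences(text):
    """Hypothetical helper: use NLTK when it is usable, else a simple regex."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: nltk is installed, but the tokenizer data is missing.
        # Fallback: split at whitespace that follows '.', '!' or '?'.
        return re.split(r"(?<=[.!?])\s+", text.strip())


text = "God is Great! I won a lottery."
print(split_sentences(text))  # ['God is Great!', 'I won a lottery.']
```

<p>Both branches keep the punctuation on each sentence, so code that consumes the result does not need to know which branch ran for simple inputs like this one.</p>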
<h2>Method 2: <strong>Using re.split</strong></h2>
<p>The <code>re.split(pattern, string)</code> method matches all occurrences of the pattern in the string and divides the string along the matches resulting in a list of strings between the matches. For example, <code>re.split('a', 'bbabbbab')</code> results in the list of strings <code>['bb', 'bbb', 'b']</code>.</p>
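<p>One detail worth seeing in action: when a match sits at the very start or end of the string, <code>re.split()</code> returns an empty string on that side. The short sketch below illustrates this; the input <code>'abba'</code> is an illustrative string chosen for this example.</p>

```python
import re

# Matches in the middle of the string: no empty strings in the result.
print(re.split("a", "bbabbbab"))  # ['bb', 'bbb', 'b']

# Matches at the start and end leave an empty string on each side.
edges = re.split("a", "abba")
print(edges)  # ['', 'bb', '']

# Filtering keeps only the non-empty pieces.
pieces = [x for x in edges if x != ""]
print(pieces)  # ['bb']
```

<p>This is why a text that ends with a punctuation mark produces a trailing empty string when you split on that mark.</p>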
<p><strong>Approach: </strong>Split the given string at the sentence-ending punctuation marks by listing them in a character class: <code>re.split(r"[.!?]", text)</code>. Inside square brackets each character stands for itself, so neither the either-or <code>(|)</code> metacharacter nor any escaping is needed. Whenever the script encounters any of the characters in the class, it splits the string there. Because the text ends with a punctuation mark, the last item of the result is an empty string; the condition <code>x!=""</code> filters it out.</p>
<p><strong>Code:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = [x for x in re.split(r"[.!?]", text) if x!=""]
print(res) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']</pre>
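<p>Inside square brackets, characters such as <code>|</code> and <code>/</code> are ordinary members of the set, so adding them as "separators" between the punctuation marks makes the pattern split on them too. The sketch below compares the two character classes on an illustrative sentence (the sample text is made up for this example).</p>

```python
import re

text = "Is it a yes/no question? Pick a|b. Done!"

# '/' and '|' are literal inside [...], so this class also splits on them.
noisy = [x for x in re.split(r"[/.|!?]", text) if x != ""]
print(noisy)  # ['Is it a yes', 'no question', ' Pick a', 'b', ' Done']

# Listing only the sentence terminators gives the intended result.
clean = [x for x in re.split(r"[.!?]", text) if x != ""]
print(clean)  # ['Is it a yes/no question', ' Pick a|b', ' Done']
```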
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f9e9.png" alt="🧩" class="wp-smiley" style="height: 1em; max-height: 1em;" /><strong>Recommended Read: </strong><a href="https://blog.finxter.com/python-regex-split/"><strong>Python Regex Split</strong></a></p>
<h2>Method 3: <strong>Using findall</strong></h2>
<p>The <code>re.findall(pattern, string)</code> method scans the string from left to right, searching for all non-overlapping matches of the pattern. It returns a list of strings in the matching order when scanning the string from left to right.</p>
<p><strong>Code:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import re
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
res = re.findall(r"[^.!?]+", text)
print(res) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']</pre>
<p><strong>Explanation: </strong>In the expression <code>re.findall(r"[^.!?]+", text)</code>, the character class <code>[^.!?]</code> matches any single character except (given by <code>^</code>) ‘<code>.</code>’, ‘<code>!</code>’, and ‘<code>?</code>’, and the <code>+</code> quantifier extends the match over one or more such characters. The script therefore groups characters until it reaches one of the punctuation marks, ends the current match there, skips the punctuation mark, and starts the next match with the following character.</p>
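<p>If you want each sentence to keep its closing punctuation, one possible variation is to let the match continue over the terminators as well. This is a sketch, not part of the original method: <code>sentences_with_punctuation</code> is a hypothetical helper name and the sample text is invented for illustration.</p>

```python
import re


def sentences_with_punctuation(text):
    """Hypothetical helper: keep terminators attached to each sentence."""
    # One or more non-terminators, followed by any run of terminators
    # (so "?!" and "..." stay with the sentence they close).
    parts = re.findall(r"[^.!?]+[.!?]*", text)
    # Strip the whitespace left between sentences and drop blank pieces.
    return [p.strip() for p in parts if p.strip()]


result = sentences_with_punctuation("Wait... What?! This is sentence 3")
print(result)  # ['Wait...', 'What?!', 'This is sentence 3']
```

<p>Because the terminator part uses <code>*</code>, a final sentence without any closing punctuation is still returned.</p>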
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f9e9.png" alt="🧩" class="wp-smiley" style="height: 1em; max-height: 1em;" /><strong>Related Read:</strong> <a href="https://blog.finxter.com/python-re-findall/"><strong>Python re.findall() – Everything You Need to Know</strong></a></p>
<h2>Method 4: <strong>Using replace</strong></h2>
<p><strong>Approach: </strong>The idea here is to replace all the punctuation marks (<code>‘!’, ‘?’,</code> and <code>‘.’</code>) present in the given string with a comma (<code>,</code>) and then split the modified string at the commas to get the list of substrings. Because the text ends with a punctuation mark, the last element returned is an empty string, which you can remove from the list with the <code>pop()</code> method. Note that this approach also splits at any comma already present in the text.</p>
<p><strong>Code:</strong></p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def splitter(txt, delim):
    # Replace every delimiter with a comma, then split at the commas
    for i in delim:
        txt = txt.replace(i, ',')
    res = txt.split(',')
    res.pop()  # drop the trailing empty string
    return res

sep = ['.', '!', '?']
text = "This is sentence 1. This is sentence 2! This is sentence 3?"
print(splitter(text, sep)) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']</pre>
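<p>Two assumptions are baked into the function above: the text contains no commas of its own, and it ends with a punctuation mark. A possible variation that relaxes both is sketched below. The name <code>safe_splitter</code> and the choice of <code>"\x00"</code> as a placeholder are assumptions made for this example; any character that cannot occur in your text would do.</p>

```python
def safe_splitter(txt, delims, marker="\x00"):
    """Hypothetical variant of splitter() that tolerates commas in the text."""
    # Replace each delimiter with a placeholder that should not occur in txt.
    for d in delims:
        txt = txt.replace(d, marker)
    # Filter out blank pieces instead of always popping the last element,
    # so a final sentence without closing punctuation is kept.
    return [part for part in txt.split(marker) if part.strip()]


text = "Well, this is sentence 1. This is sentence 2! No final mark"
result = safe_splitter(text, [".", "!", "?"])
print(result)
# ['Well, this is sentence 1', ' This is sentence 2', ' No final mark']
```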
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f9e9.png" alt="🧩" class="wp-smiley" style="height: 1em; max-height: 1em;" /><strong>Related Read: <a href="https://blog.finxter.com/python-string-replace-2">Python String replace()</a></strong></p>
<h2>Conclusion</h2>
<p>We have successfully solved the given problem using different approaches. I hope this <strong><a rel="noreferrer noopener" href="https://blog.finxter.com/" target="_blank">article</a></strong> helped you in your Python coding journey. Please <a rel="noreferrer noopener" href="https://blog.finxter.com/subscribe" target="_blank"><strong>subscribe and stay tuned</strong></a> for more interesting articles. </p>
<p>Happy coding! <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f40d.png" alt="🐍" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p>
</div>


https://www.sickgaming.net/blog/2022/12/...sentences/