{"id":130510,"date":"2022-12-13T14:21:52","date_gmt":"2022-12-13T14:21:52","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=974815"},"modified":"2022-12-13T14:21:52","modified_gmt":"2022-12-13T14:21:52","slug":"python-split-text-into-sentences","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2022\/12\/13\/python-split-text-into-sentences\/","title":{"rendered":"Python | Split Text into Sentences"},"content":{"rendered":"\n<p class=\"has-background\" style=\"background-color:#e8caff\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/2728.png\" alt=\"\u2728\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/><strong>Summary: <\/strong>There are four different ways to split a text into sentences:<br \/><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> Using <code>nltk<\/code> module<br \/><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> Using <code>re.split()<br \/><\/code><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> Using 
<code>re.findall()<\/code> <br \/><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> Using <code>replace<\/code><\/p>\n<h3><strong>Minimal Example<\/strong><\/h3>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">text = \"God is Great! I won a lottery.\"\n\n# Method 1\nfrom nltk.tokenize import sent_tokenize\nprint(sent_tokenize(text))\n\n# Method 2\nimport re\nres = [x for x in re.split(r\"[.!?]\", text) if x != \"\"]\nprint(res)\n\n# Method 3\nres = re.findall(r\"[^.!?]+\", text)\nprint(res)\n\n# Method 4\ndef splitter(txt, delim):\n    for i in txt:\n        if i in delim:\n            txt = txt.replace(i, ',')\n    res = txt.split(',')\n    res.pop()  # drop the trailing empty string\n    return res\n\nsep = ['.', '!']\nprint(splitter(text, sep))\n\n# Output of Methods 2-4: ['God is Great', ' I won a lottery']<\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2>Problem Formulation<\/h2>\n<p><strong>Problem: <\/strong>Given a string of text containing several sentences, how do you split it into individual sentences?<\/p>\n<p><strong>Example: <\/strong>Let\u2019s visualize the problem with the help of an example.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Input\ntext = \"This is sentence 1. This is sentence 2! 
This is sentence 3?\"\n# Output\n['This is sentence 1', ' This is sentence 2', ' This is sentence 3']<\/pre>\n<h2>Method 1: <strong>Using <a href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">nltk.tokenize<\/a><\/strong><\/h2>\n<p>In Natural Language Processing (NLP), tokenization is the process of dividing a large quantity of text into smaller parts called tokens. The Natural Language Toolkit (NLTK) provides the <code>nltk.tokenize<\/code> package, whose <code>sent_tokenize()<\/code> function treats whole sentences as tokens. We can use this function to split a given text into sentences.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from nltk.tokenize import sent_tokenize\ntext = \"This is sentence 1. This is sentence 2! This is sentence 3?\"\nprint(sent_tokenize(text))\n\n# ['This is sentence 1.', 'This is sentence 2!', 'This is sentence 3?']<\/pre>\n<p><strong>Explanation:&nbsp;<\/strong><\/p>\n<ul>\n<li>Import the <code>sent_tokenize()<\/code> function from <code>nltk.tokenize<\/code>.<\/li>\n<li>The function parses the given text and breaks it into individual sentences at punctuation marks such as periods, exclamation marks, and question marks. Unlike the other methods in this article, it keeps the closing punctuation mark of each sentence and drops the whitespace between sentences.<\/li>\n<\/ul>\n<p><strong>Caution: <\/strong>You might get a <code>LookupError<\/code> the first time you call <code>sent_tokenize()<\/code> after installing the <code>nltk<\/code> package, because the tokenizer data is downloaded separately. 
So, here\u2019s the entire process to set up <code>nltk<\/code> on your system.<\/p>\n<p>Install <code>nltk<\/code> with <code>pip install nltk<\/code>.<\/p>\n<p>Then go ahead and type the following in your Python shell:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import nltk\nnltk.download('punkt')<\/pre>\n<p>That\u2019s it! You are now ready to use the <code>sent_tokenize()<\/code> function in your code. Newer NLTK releases may ask for the <code>punkt_tab<\/code> resource instead; if so, the error message names the resource to pass to <code>nltk.download()<\/code>.<\/p>\n<h2>Method 2: <strong>Using re.split<\/strong><\/h2>\n<p>The <code>re.split(pattern, string)<\/code> method matches all occurrences of the pattern in the string and divides the string along the matches, resulting in a list of the strings between the matches. For example, <code>re.split('a', 'bbabbbab')<\/code> results in the list of strings <code>['bb', 'bbb', 'b']<\/code>.<\/p>\n<p><strong>Approach: <\/strong>Split the given string at the sentence-ending punctuation marks by listing them in a character class: <code>re.split(r\"[.!?]\", text)<\/code>. Inside square brackets, <code>.<\/code> and <code>?<\/code> are literal characters, so they need no escaping, and the either-or <code>(|)<\/code> metacharacter is not needed. Whenever the script encounters any of the listed characters, it splits the string there. Because the text ends with a punctuation mark, the last item of the result is an empty string; the condition <code>x != \"\"<\/code> filters it out.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import re\ntext = \"This is sentence 1. This is sentence 2! This is sentence 3?\"\nres = [x for x in re.split(r\"[.!?]\", text) if x != \"\"]\nprint(res)\n\n# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']<\/pre>\n<p class=\"has-base-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f9e9.png\" alt=\"\ud83e\udde9\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/><strong>Recommended Read:\u00a0 <\/strong><a href=\"https:\/\/blog.finxter.com\/python-regex-split\/\"><strong>Python Regex Split<\/strong><\/a><\/p>\n<h2>Method 3: <strong>Using findall<\/strong><\/h2>\n<p>The <code>re.findall(pattern, string)<\/code> method scans the string from left to right, searching for all non-overlapping matches of the pattern. It returns the matches as a list of strings, in the order in which they were found.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import re\ntext = \"This is sentence 1. This is sentence 2! This is sentence 3?\"\nres = re.findall(r\"[^.!?]+\", text)\nprint(res)\n\n# ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']<\/pre>\n<p><strong>Explanation: <\/strong>In the expression <code>re.findall(r\"[^.!?]+\", text)<\/code>, all characters except the punctuation marks are grouped. The character class <code>[^.!?]<\/code> matches any single character except (given by <code>^<\/code>) \u2018<code>!<\/code>\u2019, \u2018<code>?<\/code>\u2019, and \u2018<code>.<\/code>\u2019, and the <code>+<\/code> quantifier extends the match to one or more such characters in a row. Thus, the script groups consecutive characters until it reaches any of the characters listed within the square brackets. 
As soon as one of those characters is found, the current match ends and the search continues with the next group of characters.<\/p>\n<p class=\"has-base-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f9e9.png\" alt=\"\ud83e\udde9\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/><strong>Related Read:<\/strong> <a href=\"https:\/\/blog.finxter.com\/python-re-findall\/\"><strong>Python re.findall() \u2013 Everything You Need to Know<\/strong><\/a><\/p>\n<h2>Method 4: <strong>Using replace<\/strong><\/h2>\n<p><strong>Approach: <\/strong>The idea here is to replace all the punctuation marks (<code>\u2018!\u2019, \u2018?\u2019,<\/code> and <code>\u2018.\u2019<\/code>) present in the given string with a comma (<code>,<\/code>) and then split the modified string at the commas to get the list of substrings. Because the text ends with a punctuation mark, the last element returned is an empty string. You can use the <code>pop()<\/code> method to remove this last element from the list of substrings. Note that this approach assumes the text contains no commas of its own and ends with one of the punctuation marks; otherwise a sentence is split at its commas, or <code>pop()<\/code> removes a real sentence.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def splitter(txt, delim):\n    # Replace every delimiter character with a comma\n    for i in txt:\n        if i in delim:\n            txt = txt.replace(i, ',')\n    res = txt.split(',')\n    res.pop()  # drop the trailing empty string\n    return res\n\nsep = ['.', '!', '?']\ntext = \"This is sentence 1. This is sentence 2! 
This is sentence 3?\"\nprint(splitter(text, sep)) # ['This is sentence 1', ' This is sentence 2', ' This is sentence 3']<\/pre>\n<p class=\"has-base-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f9e9.png\" alt=\"\ud83e\udde9\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/><strong>Related Read: <a href=\"https:\/\/blog.finxter.com\/python-string-replace-2\">Python String replace()<\/a><\/strong><\/p>\n<h2>Conclusion<\/h2>\n<p>We have successfully solved the given problem using different approaches. I hope this <strong><a rel=\"noreferrer noopener\" href=\"https:\/\/blog.finxter.com\/\" target=\"_blank\">article<\/a><\/strong> helped you in your Python coding journey. Please <a rel=\"noreferrer noopener\" href=\"https:\/\/blog.finxter.com\/subscribe\" target=\"_blank\"><strong>subscribe and stay tuned<\/strong><\/a> for more interesting articles. <\/p>\n<p>Happy coding! <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f40d.png\" alt=\"\ud83d\udc0d\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> <\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p><strong><em>Do you want to master the regex superpower?<\/em><\/strong> Check out my new book <em><strong><a href=\"https:\/\/blog.finxter.com\/ebook-the-smartest-way-to-learn-python-regex\/\" target=\"_blank\" rel=\"noreferrer noopener\" title=\"[eBook] The Smartest Way to Learn Python Regex\">The Smartest Way to Learn Regular Expressions in Python<\/a><\/strong><\/em> with the innovative 3-step approach for active learning: (1) study a book chapter, (2) solve a code puzzle, and (3) watch an educational chapter video. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Rate this post Summary: There are four different ways to split a text into sentences: Using nltk module Using re.split() Using re.findall() Using replace Minimal Example text = &#8220;God is Great! I won a lottery.&#8221; # Method 1 from nltk.tokenize import sent_tokenize print(sent_tokenize(text)) # Method 2 import re res = [x for x in re.split(&#8220;[\/\/.|\/\/!|\/\/?]&#8221;, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-130510","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/130510","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=130510"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/130510\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=130510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=130510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=130510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}