{"id":131353,"date":"2023-01-22T16:02:45","date_gmt":"2023-01-22T16:02:45","guid":{"rendered":"https:\/\/blog.finxter.com\/?p=1077176"},"modified":"2023-01-22T16:02:45","modified_gmt":"2023-01-22T16:02:45","slug":"python-video-to-text-speech-recognition","status":"publish","type":"post","link":"https:\/\/sickgaming.net\/blog\/2023\/01\/22\/python-video-to-text-speech-recognition\/","title":{"rendered":"Python Video to Text \u2013 Speech Recognition"},"content":{"rendered":"\n<p>A good friend and his wife recently founded an AI startup in the lifestyle niche that uses <a rel=\"noreferrer noopener\" href=\"https:\/\/blog.finxter.com\/machine-learning-engineer-income-and-opportunity\/\" data-type=\"post\" data-id=\"306050\" target=\"_blank\">machine learning<\/a> to discover specific real-world patterns from videos.<\/p>\n<p>For their business system, they need a pipeline that takes a video file, converts it to audio, and transcribes the audio to standard text that is then used for further processing. I couldn\u2019t help but work on a basic solution to help fix their business problem. 
<\/p>\n<h2>Project Overview<\/h2>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"568\" src=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-276-1024x568.png\" alt=\"\" class=\"wp-image-1077229\" srcset=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-276-1024x568.png 1024w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-276-300x166.png 300w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-276-768x426.png 768w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-276.png 1453w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<p>I finished the project in three steps:<\/p>\n<ul>\n<li>First, install the necessary libraries.<\/li>\n<li>Second, <strong>convert the video to an audio file<\/strong> (<code>.mp4<\/code> to <code>.wav<\/code>).<\/li>\n<li>Third, <strong>convert the audio file to a text file<\/strong> (<code>.wav<\/code> to <code>.txt<\/code>). 
We first split the large audio file into smaller chunks and transcribe each chunk separately because the speech recognition API limits the size of the audio it accepts.<\/li>\n<\/ul>\n<p>Let&#8217;s get started!<\/p>\n<h2>Step 1: Install Libraries<\/h2>\n<p>We need the following <code>import<\/code> statements in our code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Import libraries\nimport speech_recognition as sr\nimport os\nfrom pydub import AudioSegment\nfrom pydub.silence import split_on_silence\nimport moviepy.editor as mp<\/pre>\n<p>These imports rely on three third-party libraries. Install them with <code>pip<\/code> in your shell, assuming you run <a href=\"https:\/\/blog.finxter.com\/how-to-check-your-python-version\/\" data-type=\"post\" data-id=\"1371\" target=\"_blank\" rel=\"noreferrer noopener\">Python version<\/a> 3.9:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">pip3.9 install pydub\npip3.9 install SpeechRecognition\npip3.9 install moviepy<\/pre>\n<p>The <code>moviepy.editor<\/code> import works with MoviePy 1.x. MoviePy 2.x removed this module, so there you import <code>VideoFileClip<\/code> directly from <code>moviepy<\/code>.<\/p>\n<p>The <code><a href=\"https:\/\/blog.finxter.com\/exploring-pythons-os-module\/\" data-type=\"post\" data-id=\"19050\" target=\"_blank\" rel=\"noreferrer noopener\">os<\/a><\/code> module is part of the Python standard library, so you don&#8217;t need to install it.<\/p>\n<p>If you need an additional guide on how to install Python libraries, check out this tutorial:<\/p>\n<p class=\"has-base-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f449.png\" alt=\"\ud83d\udc49\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> <strong>Recommended<\/strong>: <a 
href=\"https:\/\/blog.finxter.com\/how-to-install-xxx-in-python\/\" data-type=\"post\" data-id=\"653128\" target=\"_blank\" rel=\"noreferrer noopener\">Python Install Library Guide<\/a><\/p>\n<h2>Step 2: Video to Audio<\/h2>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"690\" height=\"460\" src=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-252.png\" alt=\"\" class=\"wp-image-1075726\" srcset=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-252.png 690w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-252-300x200.png 300w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/figure>\n<\/div>\n<p>Before we can run speech recognition on the video, we need to extract its audio track as a <code>.wav<\/code> file. To do so, we call <code>write_audiofile()<\/code> on the <code>audio<\/code> attribute of a <code>moviepy.editor.VideoFileClip<\/code> object.<\/p>\n<p>Here&#8217;s the code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def video_to_audio(in_path, out_path):\n    \"\"\"Convert video file to audio file\"\"\"\n    video = mp.VideoFileClip(in_path)\n    video.audio.write_audiofile(out_path)<\/pre>\n<p class=\"has-base-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f449.png\" alt=\"\ud83d\udc49\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> <strong>Recommended<\/strong>: <a href=\"https:\/\/blog.finxter.com\/python-video-to-audio\/\" data-type=\"post\" data-id=\"1077175\" target=\"_blank\" rel=\"noreferrer noopener\">Python Video to Audio<\/a><\/p>\n<h2>Step 3: Audio to Text<\/h2>\n<p>After extracting the audio file, we can start transcribing the speech from the <code>.wav<\/code> file. We use the <code>SpeechRecognition<\/code> library, which sends the audio to Google&#8217;s speech recognition service, and apply it to chunks of the potentially large audio file.<\/p>\n<p>Passing chunks instead of the whole audio file avoids errors for large files, because Google restricts the size of the audio it accepts.<\/p>\n<p>The code splits the audio wherever it finds at least 700 ms of silence. You can tune this threshold up or down, depending on your file.<\/p>\n<p>Here&#8217;s the audio-to-text function that worked for me. It expects a global <code>sr.Recognizer<\/code> object named <code>r<\/code>, which we create in Step 4:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def large_audio_to_text(path):\n    \"\"\"Split audio into chunks and apply speech recognition\"\"\"\n    # Open audio file with pydub\n    sound = AudioSegment.from_wav(path)\n\n    # Split audio where silence is 700ms or greater and get chunks\n    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)\n\n    # Create folder to store audio chunks\n    folder_name = \"audio-chunks\"\n    if not os.path.isdir(folder_name):\n        os.mkdir(folder_name)\n\n    whole_text = \"\"\n    # Process each chunk\n    for i, audio_chunk in enumerate(chunks, start=1):\n        # Export chunk and save in folder\n        chunk_filename = os.path.join(folder_name, f\"chunk{i}.wav\")\n        audio_chunk.export(chunk_filename, format=\"wav\")\n\n        # Recognize chunk\n        with sr.AudioFile(chunk_filename) as source:\n            audio_listened = r.record(source)\n            # Convert to text\n            try:\n                text = r.recognize_google(audio_listened)\n            except sr.UnknownValueError as e:\n                print(\"Error:\", str(e))\n            else:\n                text = f\"{text.capitalize()}. \"\n                print(chunk_filename, \":\", text)\n                whole_text += text\n\n    # Return text for all chunks\n    return whole_text<\/pre>\n<p>Need more info? 
Check out the following deep dive:<\/p>\n<p class=\"has-base-background-color has-background\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/1f449.png\" alt=\"\ud83d\udc49\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\" \/> <strong>Recommended<\/strong>: <a rel=\"noreferrer noopener\" href=\"https:\/\/blog.finxter.com\/large-audio-to-text-heres-my-speech-recognition-solution-in-python\/\" data-type=\"post\" data-id=\"1075593\" target=\"_blank\">Large Audio to Text? Here\u2019s My Speech Recognition Solution in Python<\/a><\/p>\n<h2>Step 4: Putting It Together<\/h2>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"683\" src=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-255-1024x683.png\" alt=\"\" class=\"wp-image-1075808\" srcset=\"https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-255-1024x683.png 1024w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-255-300x200.png 300w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-255-768x512.png 768w, https:\/\/blog.finxter.com\/wp-content\/uploads\/2023\/01\/image-255.png 1282w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n<p>Finally, we can combine our functions. First, we extract the audio from the video. 
Second, we split the audio into smaller chunks and recognize speech independently on each chunk using Google&#8217;s speech recognition service.<\/p>\n<p>I added comments to annotate the most important parts of this code:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Import libraries\nimport speech_recognition as sr\nimport os\nfrom pydub import AudioSegment\nfrom pydub.silence import split_on_silence\nimport moviepy.editor as mp\n\n\ndef video_to_audio(in_path, out_path):\n    \"\"\"Convert video file to audio file\"\"\"\n    video = mp.VideoFileClip(in_path)\n    video.audio.write_audiofile(out_path)\n\n\ndef large_audio_to_text(path):\n    \"\"\"Split audio into chunks and apply speech recognition\"\"\"\n    # Open audio file with pydub\n    sound = AudioSegment.from_wav(path)\n\n    # Split audio where silence is 700ms or greater and get chunks\n    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)\n\n    # Create folder to store audio chunks\n    folder_name = \"audio-chunks\"\n    if not os.path.isdir(folder_name):\n        os.mkdir(folder_name)\n\n    whole_text = \"\"\n    # Process each chunk\n    for i, audio_chunk in enumerate(chunks, start=1):\n        # Export chunk and save in folder\n        chunk_filename = os.path.join(folder_name, f\"chunk{i}.wav\")\n        audio_chunk.export(chunk_filename, format=\"wav\")\n\n        # Recognize chunk\n        with sr.AudioFile(chunk_filename) as source:\n            audio_listened = r.record(source)\n            # Convert to text\n            try:\n                text = r.recognize_google(audio_listened)\n            except sr.UnknownValueError as e:\n                print(\"Error:\", str(e))\n            else:\n                text = f\"{text.capitalize()}. \"\n                print(chunk_filename, \":\", text)\n                whole_text += text\n\n    # Return text for all chunks\n    return whole_text\n\n\n# Create a speech recognition object\nr = sr.Recognizer()\n\n# Video to audio to text\nvideo_to_audio('sample_video.mp4', 'sample_audio.wav')\nresult = large_audio_to_text('sample_audio.wav')\n\n# Print to shell and file\nprint(result)\nwith open('result.txt', 'w') as f:\n    print(result, file=f)\n<\/pre>\n<p>Store this code in the same folder as your video file <code>'sample_video.mp4'<\/code> and run it. It creates the audio file <code>'sample_audio.wav'<\/code>, splits it into chunks, and prints the transcription of the video to the shell as well as to a file called <code>'result.txt'<\/code>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A good friend and his wife recently founded an AI startup in the lifestyle niche that uses machine learning to discover specific real-world patterns from videos. For their business system, they need a pipeline that takes a video file, converts it to audio, and transcribes the audio to standard text that 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[857],"tags":[73,468,528],"class_list":["post-131353","post","type-post","status-publish","format-standard","hentry","category-python-tut","tag-programming","tag-python","tag-tutorial"],"_links":{"self":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/131353","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/comments?post=131353"}],"version-history":[{"count":0,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/posts\/131353\/revisions"}],"wp:attachment":[{"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/media?parent=131353"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/categories?post=131353"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sickgaming.net\/blog\/wp-json\/wp\/v2\/tags?post=131353"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}