Practical Speech Recognition with Python: The Basics


Have you ever wanted to try out a speech recognition project but found it all just too intimidating?

What about creating something a few steps beyond and a bit more complex, like a full audio chatbot or a voice assistant? Putting together skeleton code for this kind of project is actually quite simple, thanks to a few open source libraries which we can lean on. With that in mind, let's look at how to start creating a basic toy speech recognition app with Python. Once we get the basics down we can discuss ways to make it much more useful.

Our toy Python app will be pretty useless, to be honest. But it will introduce us to a few concepts which will be useful for building more complex things afterwards. If we build this toy properly, modifying it to do anything more should be relatively painless. At least, to an extent.

Here is exactly what our app will do when we are done: it will listen to what we say and parrot it back to us. That's it! The pair of useful things we will take away are building speech recognition and audio playback into our app.

First, let's import the few libraries that we need:

import os
import speech_recognition as sr
from pydub import AudioSegment
from pydub.playback import play
from gtts import gTTS as tts

Here is the reasoning:

  • speech_recognition – “Library for performing speech recognition, with support for several engines and APIs, online and offline”
  • pydub – “Manipulate audio with a simple and easy high level interface”
  • gTTS – “Python library and CLI tool to interface with Google Translate’s text-to-speech API”
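
Before going further, it can help to confirm that these libraries are importable. The sketch below is one way to check, using only the standard library. The missing_modules helper is a hypothetical name, not part of any of the packages above. The names it checks are the import names used in this article, plus pyaudio, which speech_recognition relies on for microphone access.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the libraries used in this article, plus PyAudio
    required = ["speech_recognition", "pydub", "gtts", "pyaudio"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Anything reported as missing would need to be installed before the rest of the code can run.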

The next thing to do, and likely the most important for a speech recognition app, is to recognize speech. To do so, we will first need to capture incoming audio from the microphone, and then perform the speech recognition. This is all handled via the speech_recognition library.

Here is a function to capture speech.

def capture():
    """Capture audio"""
    rec = sr.Recognizer()

    with sr.Microphone() as source:
        print("I'M LISTENING...")
        # Stop listening after at most 5 seconds of speech
        audio = rec.listen(source, phrase_time_limit=5)

    try:
        text = rec.recognize_google(audio, language='en-US')
        return text
    except (sr.UnknownValueError, sr.RequestError):
        # Speech was unintelligible or the recognition request failed
        speak('Sorry, I could not understand what you said.')
        return 0

That's it. Speech captured and recognized. Still think this is intimidating?

Note that once this app starts running, it will listen in 5 second intervals, and process those 5 second intervals one at a time. Practical? No, not really, but once we do something more complex we can tweak this to, perhaps, listen for an activation keyword, and then listen for the full duration of our speaking, whatever the length. Still, this is a simple enough way to start.
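
The activation keyword idea can be sketched without touching the microphone at all: once a chunk of speech has been recognized as text, check whether it contains the keyword and keep whatever follows it. This is only a minimal sketch on already-recognized text. The extract_command helper and the keyword "computer" are hypothetical choices for illustration, and listening for the full duration of an utterance would still need changes to the capture step.

```python
def extract_command(text, wake_word="computer"):
    """Return the words spoken after the wake word, or None if it is absent.

    Matching is case-insensitive and uses the first occurrence of the
    wake word as a whole word.
    """
    words = text.lower().split()
    wake = wake_word.lower()
    if wake not in words:
        return None
    # Keep everything after the first occurrence of the wake word
    index = words.index(wake)
    return " ".join(words[index + 1:])
```

Because it compares whole words, this sketch would miss a keyword that arrives with punctuation attached, so it assumes the recognized text is plain, unpunctuated words.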

So, what will we do once we capture speech? We will process it. What exactly does this mean?

What kind of app you are building will largely determine what "process it" means. This time around, our processing will more or less be a placeholder function to do other things in the future. So for now, our toy app will process captured speech by parroting it back to us (and outputting it to the console, for good measure).

Here is a simple function for our processing.

def process_text(name, input):
    """Process what is said"""
    # For now, simply repeat the captured text back to the user
    speak(name + ', you said: "' + input + '".')
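
As a hedged illustration of how this placeholder could grow, the sketch below separates deciding what to say from saying it: a function builds the reply string, and the caller passes that string on to the speech function. The build_reply name and the "what time" rule are hypothetical examples, not part of the app in this article.

```python
from datetime import datetime


def build_reply(name, said, now=None):
    """Return the sentence the app should speak in response to `said`."""
    now = now or datetime.now()
    if "what time" in said.lower():
        # Example of a tiny "skill": report the current time
        return name + ", it is " + now.strftime("%H:%M") + "."
    # Default behavior: parrot the input back, as in the toy app
    return name + ', you said: "' + said + '".'
```

Keeping the decision in a function that returns a string means it can be tried from the console without a microphone or speakers.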


We also want our app to speak, so let's write a function which uses the Google text-to-speech engine to accomplish this.

def speak(text):
    """Say something"""
    # Write output to console
    print(text)

    # Save audio file
    speech = tts(text=text, lang='en')
    speech_file = 'input.mp3'
    speech.save(speech_file)

    # Play audio file
    sound = AudioSegment.from_mp3(speech_file)
    play(sound)

    # Delete the audio file now that it has been played
    os.remove(speech_file)

First we print what was passed to the function to the console; then Google text-to-speech is used to create an audio file from the text; the audio file is saved to disk; then the file is re-opened and played using the pydub library; and finally the file is deleted.
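
The save, play, delete sequence can also be sketched in a way that cleans up the file even if playback fails. In the sketch below, synthesize and play_file are hypothetical callables standing in for the text-to-speech and playback steps, so the flow can be tried without network access or audio hardware. The speak_with_cleanup name is likewise an assumption made for illustration.

```python
import os
import tempfile


def speak_with_cleanup(text, synthesize, play_file):
    """Print text, synthesize it to a temporary file, play it, then delete it.

    synthesize(text, path) must write audio for `text` to `path`;
    play_file(path) must play the audio stored at `path`.
    Returns the path that was used, which no longer exists afterwards.
    """
    print(text)
    # Create a uniquely named temporary file and close our handle to it
    fd, path = tempfile.mkstemp(suffix=".mp3")
    os.close(fd)
    try:
        synthesize(text, path)
        play_file(path)
    finally:
        # Remove the file whether or not synthesis and playback succeeded
        os.remove(path)
    return path
```

With the libraries from this article, synthesize could wrap gTTS's save() method and play_file could wrap pydub's AudioSegment.from_mp3() and play().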

That's the "difficult" stuff taken care of. Now we just need a few lines to drive the process.

if __name__ == "__main__":
    # First get name, asking again if nothing was recognized
    name = 0
    while name == 0:
        speak('What is your name?')
        name = capture()
    speak('Hello, ' + name + '.')

    # Then just keep listening & responding
    while 1:
        speak('What do you want to say?')
        captured_text = capture()

        # Nothing was recognized, so ask again
        if captured_text == 0:
            continue

        captured_text = captured_text.lower()

        if 'quit' in captured_text:
            speak('OK, bye, ' + name + '.')
            break

        # Process captured text
        process_text(name, captured_text)

Both the speak() and capture() functions are used to get the user's name when prompted, and then greet them. Then a while loop is entered which cycles between capturing speech input and performing some very basic checks to ensure that something was captured and the user did not say 'quit' to exit. The captured text is passed to the process_text() function, which echoes what was said. This is then repeated ad infinitum.
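
The control flow just described can also be written as a function that receives its capture and speech functions as arguments, which makes the loop easy to try without a microphone. This is a sketch, not the app's actual code: run_conversation is a hypothetical name, and it assumes the capture function returns 0 when nothing was recognized.

```python
def run_conversation(capture, speak, name):
    """Echo captured speech until the user says 'quit'."""
    while True:
        speak("What do you want to say?")
        captured = capture()
        if captured == 0:
            # Nothing was recognized, so ask again
            continue
        captured = captured.lower()
        if "quit" in captured:
            speak("OK, bye, " + name + ".")
            break
        # Parrot the captured text back to the user
        speak(name + ', you said: "' + captured + '".')
```

Passing in a fake capture function that returns canned strings, and a speech function that only prints, is enough to exercise every branch of the loop.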

I'll say it again: there isn't anything of much complexity going on here.

Save all of the above code to a file.

Now let's look at a conversation with our minimalist speech recognition app. Run it with the following line and see the results below (while imagining I'm talking and having my words repeated back to me, of course).

  $ python

What is your name?
I'M LISTENING...
Hello, Matthew.
What do you want to say?
I'M LISTENING...
Matthew, you said: "where are you from".
What do you want to say?
I'M LISTENING...
Matthew, you said: "i'd like some pizza".
What do you want to say?
I'M LISTENING...
Matthew, you said: "what is the meaning of life".
What do you want to say?
I'M LISTENING...
OK, bye, Matthew.

Process finished with exit code 0

Pretty cool. Of course, it would be way cooler if it actually did something. So let's turn our attention to that next.

For next time, let's ease into something more complex like integrating spaCy into our code and trying some simple NLP tasks, such as spoken sentence classification, sentiment analysis, and named entity recognition.

We can then look at something more practically useful such as creating a personal voice assistant, which will require some additional tweaks to our interface. But one thing at a time…
