If you’re a blogger who’s not a writer, you know that writing a blog post can be very time-consuming.

The most time-consuming part for me is writing the initial draft.

Imagine you could write a post by just talking about it. How cool would that be? I would write so many more blog posts.


Luckily for me, the latest buzzword, “AI”, could assist me in pursuing my lazy blogger’s dream. I have to admit, I had never really read much about “AI” until now. A colleague who had used Wit and Snowboy in one of his side projects suggested them for mine.

So I decided to give it a go. I chose to write a simple Python script - a speech-to-text converter with a hotword detector.

How it works


The converter waits for the hotword before it starts recording. It then records until the hotword is detected again. After that, it does the conversion from speech to text. In short, hotword-speech-hotword, where only the speech gets converted.
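The hotword-speech-hotword flow can be pictured as a small toggle. The following is only a minimal sketch of that idea, not the code from the project: the class name HotwordToggleRecorder and its on_audio and on_hotword methods are hypothetical names made up for illustration.

```python
class HotwordToggleRecorder:
    """Toggle recording on each hotword: hotword-speech-hotword."""

    def __init__(self):
        self.recording = False
        self._chunks = []

    def on_audio(self, chunk):
        # Audio is only kept while a recording is in progress.
        if self.recording:
            self._chunks.append(chunk)

    def on_hotword(self):
        """Start on the first hotword; return the speech on the second."""
        if not self.recording:
            # First hotword: start a fresh recording.
            self.recording = True
            self._chunks = []
            return None
        # Second hotword: stop and hand back everything recorded in between.
        self.recording = False
        return b"".join(self._chunks)
```

Anything passed to on_audio before the first hotword is ignored, and the bytes returned by the second on_hotword call are what would be sent off for conversion.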

The output text is written to an output.txt file. It appends the recordings on new lines. Note that the converter recognizes the words ‘comma’ and ‘full stop’ and converts them to ‘,’ and ‘.’.
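One way the spoken punctuation could be handled is a simple substitution pass over the converted text. This is a rough sketch under my own assumptions: the function name apply_spoken_punctuation and the decision to swallow the space before the spoken word are illustrative choices, not necessarily what the project does.

```python
import re

# Spoken phrases and the punctuation marks they stand for.
SPOKEN_PUNCTUATION = {"full stop": ".", "comma": ","}


def apply_spoken_punctuation(text):
    """Replace the spoken words 'comma' and 'full stop' with ',' and '.'."""
    for spoken, symbol in SPOKEN_PUNCTUATION.items():
        # Also drop whitespace before the phrase so the mark hugs the previous word.
        pattern = r"\s*\b" + re.escape(spoken) + r"\b"
        text = re.sub(pattern, symbol, text, flags=re.IGNORECASE)
    return text
```

With this sketch, apply_spoken_punctuation("hello comma world full stop") gives "hello, world.".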

Implementation


Speech to text API

The first step is to register with Wit so that you can use their API. You need a unique Wit access token.

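import json
import requests

# URL is the Wit speech endpoint and audio holds the bytes of the .wav file.
headers = {
    'Authorization': 'Bearer ' + WIT_ACCESS_TOKEN,
    'Content-Type': 'audio/wav',
}

response = requests.post(URL, data=audio, headers=headers)

# The converted text is read from the '_text' field of the JSON response.
data = json.loads(response.content)
text = data.get('_text')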

Hotword detection

Create a hotword .pmdl file using Snowboy. Their instructions are pretty straightforward.

To start the hotword detection, use the following code:

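import snowboydecoder

detector = snowboydecoder.HotwordDetector('hotword.pmdl')
detector.start()      # listen for the hotword (blocks while listening)
detector.terminate()  # release the audio resources afterwards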

Record speech before hotword detection

Detecting the hotword is easy, but recording and saving the recording before the hotword is detected requires modifying the Snowboy toolkit slightly.

Snowboy uses a ring buffer to save the recorded audio and to detect the hotword. It clears the ring buffer after a hotword check is completed. This repeats. For this reason we can’t use the ring buffer to hold the whole speech. We need to add a second ring buffer which never clears so that we can retrieve the complete recording ready for conversion to text.
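To make the two-buffer idea concrete, here is a minimal sketch. It is not the actual modification to Snowboy: ClearingRingBuffer only imitates the clear-on-read behaviour described above, and RecordingBuffer is a hypothetical stand-in for the second buffer, written here as a plain growing bytearray rather than a true ring buffer.

```python
import collections


class ClearingRingBuffer:
    """Fixed-size buffer that is emptied every time it is read."""

    def __init__(self, size=4096):
        self._buf = collections.deque(maxlen=size)

    def extend(self, data):
        self._buf.extend(data)

    def get(self):
        chunk = bytes(bytearray(self._buf))
        self._buf.clear()  # this is why it cannot hold the whole speech
        return chunk


class RecordingBuffer:
    """Keeps every byte it receives until reset() is called."""

    def __init__(self):
        self._buf = bytearray()

    def extend(self, data):
        self._buf.extend(data)

    def get(self):
        return bytes(self._buf)  # reading does not clear anything

    def reset(self):
        self._buf = bytearray()


def feed(chunks, detect_buf, record_buf):
    """Push each audio chunk into both buffers, reading the first after each chunk."""
    for chunk in chunks:
        detect_buf.extend(chunk)
        record_buf.extend(chunk)
        detect_buf.get()  # the hotword check would run on this data
```

After feeding a few chunks, the clearing buffer is empty again while the recording buffer still holds the complete audio, which is what gets converted once the second hotword arrives.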

Find the modifications here.

Prerequisite

A prerequisite is Pysox. We require it to transform the raw data obtained from the ring buffer to a .wav file. The Wit API accepts .wav files. You can find the conversion code here.

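# Describe the raw input: 16 kHz, 16-bit, mono, signed-integer samples.
self.tfm = sox.Transformer()
self.tfm.set_input_format(rate=16000, bits=16, channels=1, encoding='signed-integer')

# Convert the raw recording to a .wav file.
self.tfm.build('audio.raw', 'output.wav')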
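If installing Pysox is a problem, the same conversion can probably be done with Python's built-in wave module instead. This is a different technique from the one used in the project, shown only as a sketch. It assumes the raw data is 16-bit signed little-endian mono PCM at 16 kHz, and the function name raw_to_wav is made up for this example.

```python
import wave


def raw_to_wav(raw_bytes, wav_target, rate=16000, sample_width=2, channels=1):
    """Wrap raw PCM bytes in a WAV header.

    wav_target is a file path or a writable binary file object.
    sample_width is in bytes, so 2 means 16-bit samples.
    """
    with wave.open(wav_target, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(rate)
        wav_file.writeframes(raw_bytes)
```

Calling raw_to_wav(raw_bytes, 'output.wav') would then produce a file with the same format parameters as the sox call above.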

Using the speech converter

The output text is written to output.txt. The text is appended to the end of the file unless you pass the -c flag, which clears the output file before recording.

This is just a sample implementation. Run the following commands after cloning the repository to try it!

After activating the virtual environment, run the converter Python script.

python speech_to_text_converter.py

The converter accepts a -c flag to clear output.txt before the next recording.

python speech_to_text_converter.py -c  
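For the curious, the -c handling could look roughly like the sketch below. It is an assumption about how such a flag might be wired up with argparse, not a copy of the script: the long option --clear and the helper names parse_args, prepare_output and append_recording are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; -c asks for the output file to be cleared."""
    parser = argparse.ArgumentParser(description="Speech-to-text converter")
    parser.add_argument("-c", "--clear", action="store_true",
                        help="clear the output file before recording")
    return parser.parse_args(argv)


def prepare_output(path, clear):
    """Empty the output file once at startup when clearing was requested."""
    if clear:
        open(path, "w").close()


def append_recording(text, path):
    """Append one converted recording on its own line."""
    with open(path, "a") as output_file:
        output_file.write(text + "\n")
```

The file is cleared only once at startup, so every recording made during the same run is still appended on a new line.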

Example output

Listening... Press Ctrl+C to exit
INFO:snowboy:Keyword 1 detected at time: 2018-01-14 15:33:37
INFO:snowboy:Keyword 1 detected at time: 2018-01-14 15:34:29
Converting...
Speech converted

This post was written using the speech to text converter.


I will admit that I can’t see myself using this project to write all my future posts. Either its accuracy isn’t very high or my accent is throwing the API off. Either way, I had to rewrite a lot of the converted text.

It was an interesting introduction to “AI” though. I thoroughly enjoyed the project.