FWIW, IBM has a wonderful speech to text API...I've put together a repo of examples and Python code:<p><a href="https://github.com/dannguyen/watson-word-watcher" rel="nofollow">https://github.com/dannguyen/watson-word-watcher</a><p>One of the great things about it is its word-level time stamp and confidence data that it returns...here's a few super cuts I've made from the presidential primary debates:<p><a href="https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-LoO73FrSa6yn8gsPpi7J9TJb7&index=14" rel="nofollow">https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...</a><p>It's not perfect by any means, but the granular results give you a place to start from...here's a super cut of cuss words from a well known episode of The Wire...only 59 such words were heard by Watson even though one scene contains 30+ F-bombs alone:<p><a href="https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be" rel="nofollow">https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be</a><p>The service is free for the first 1000 minutes each month.
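To make the word-level data concrete, here is a minimal sketch of how a supercut might pick its clips. It assumes the JSON response carries parallel `timestamps` (`[word, start, end]`) and `word_confidence` (`[word, confidence]`) lists under each result's first alternative. The sample response, the function name `find_word_clips`, and the 0.5 cutoff are made up for illustration and are not taken from the linked repo.

```python
def find_word_clips(response, targets, min_confidence=0.5):
    """Return (word, start, end, confidence) for each confident hit on a target word."""
    targets = {t.lower() for t in targets}
    clips = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]  # assumption: the first alternative is the preferred one
        timestamps = best.get("timestamps", [])
        confidences = best.get("word_confidence", [])
        # Assumption: both lists are in the same word order, so zip pairs them up.
        for (word, start, end), (_, conf) in zip(timestamps, confidences):
            if word.lower() in targets and conf >= min_confidence:
                clips.append((word, start, end, conf))
    return clips


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE_RESPONSE = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "we love freedom freedom",
                    "timestamps": [
                        ["we", 0.0, 0.2],
                        ["love", 0.2, 0.6],
                        ["freedom", 0.6, 1.3],
                        ["freedom", 2.0, 2.7],
                    ],
                    "word_confidence": [
                        ["we", 0.99],
                        ["love", 0.95],
                        ["freedom", 0.91],
                        ["freedom", 0.32],
                    ],
                }
            ]
        }
    ]
}

clips = find_word_clips(SAMPLE_RESPONSE, ["freedom"])
```

Each returned tuple gives a start and end time in seconds that a video tool could cut on; the low-confidence second hit is dropped, which is the kind of miss that would explain undercounting in noisy scenes.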
Kids, it's called "speech recognition". Voice recognition also exists, but it's the task of identifying a user based on his/her voice, not the task of transcribing spoken input as text.
It really would be amazing to have voice recognition software that recognizes at least a small but useful fraction of our language without having to reach the cloud. It is definitely a dream I hope we achieve one day. Thanks for the article; I will test it on my day off and play with it a bit.
Don't expect this to be anything like modern "good" speech recognition. Sphinx is definitely from the 00's when it seemed like speech recognition would never be solved.<p>Apparently Kaldi is a lot better, but good luck setting it up!
Another project along similar lines is the Jasper Project[0], which has received some HN coverage in the past several years[1]. It interfaces with many of the same speech recognition and text-to-speech libraries.<p>[0] <a href="https://jasperproject.github.io/" rel="nofollow">https://jasperproject.github.io/</a><p>[1] <a href="https://hn.algolia.com/?query=Jasper%20Project&sort=byPopularity&prefix&page=0&dateRange=all&type=story" rel="nofollow">https://hn.algolia.com/?query=Jasper%20Project&sort=byPopula...</a>
Very cool! I just started playing with speech recognition in Python for home automation this week. I'm controlling some WeMo switches and my PC with an Android tablet using Autovoice, and it works well as a proof-of-concept, but Autovoice doesn't always register commands, and the "Okay, Google" speech to text can be slow sometimes. I'd like it to take less than 5 seconds between saying "TV Off" and the TV actually turning off; with Autovoice it's anywhere from 3s to 25s depending on the lag. I also figure with real code, I can get commands that are more flexible than Autovoice's regex.<p>Aside from circumventing lag, I can also give it some personality. I want to name it Marvin, after the robot from H2G2, so that I can say:<p>"Marvin, turn the TV off"<p>"Here I am, brain the size of a planet, and you ask me to turn off the TV. Call that job satisfaction, 'cause I don't."
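One way "more flexible than a regex" could look in practice is keyword matching that ignores word order. This is only a rough sketch: the wake word check, the device and action tables, and the name `parse_command` are all hypothetical, and the transcript is assumed to arrive as plain text from whatever recognizer is in use.

```python
import re

WAKE_WORD = "marvin"  # hypothetical wake word, taken from the comment above
DEVICES = {"tv": "tv", "television": "tv", "lamp": "lamp", "light": "lamp"}
ACTIONS = {"on": "on", "off": "off"}


def parse_command(transcript):
    """Return (device, action) if the transcript is a recognized command, else None."""
    # Lowercase and split into words so punctuation and capitalization don't matter.
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    if not words or words[0] != WAKE_WORD:
        return None
    # Take the first known device and action, wherever they appear in the sentence.
    device = next((DEVICES[w] for w in words if w in DEVICES), None)
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    if device is None or action is None:
        return None
    return device, action


command = parse_command("Marvin, turn the TV off")
```

Because the lookup does not depend on position, "turn the TV off" and "turn off the television" resolve to the same command, and anything without the wake word is ignored.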
For folks who want to try this at home on Mac OS X, you'll need to change 'sapi5' to 'nsss' on the line "speech_engine = pyttsx.init('sapi5')".<p>I also had to run "brew install portaudio flac swig" and install a bunch of other Python libs. By the time it ran, "pip freeze" returned:<p><pre><code> altgraph==0.12
macholib==1.7
modulegraph==0.12.1
py2app==0.9
PyAudio==0.2.9
pyobjc==3.0.4
pyttsx==1.1
SpeechRecognition==3.3.0
pocketsphinx==0.0.9
</code></pre>
My fork of the gist is here: <a href="https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90" rel="nofollow">https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90</a>
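Rather than editing the driver name by hand on each machine, the choice could be made from `sys.platform`. A small sketch, assuming pyttsx's three driver names ('sapi5' on Windows, 'nsss' on OS X, 'espeak' elsewhere); the helper name `pick_tts_driver` is made up, and the pyttsx call is left as a comment so the snippet runs without the library installed.

```python
import sys


def pick_tts_driver(platform=None):
    """Map a sys.platform string to a pyttsx driver name."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        return "sapi5"   # Windows speech API
    if platform == "darwin":
        return "nsss"    # OS X NSSpeechSynthesizer
    return "espeak"      # assumption: every other platform falls back to eSpeak


driver_name = pick_tts_driver()

# Usage in the script (requires pyttsx to be installed):
#   import pyttsx
#   speech_engine = pyttsx.init(pick_tts_driver())
```

As far as I know, calling `pyttsx.init()` with no driver name also falls back to a platform default, so this helper mainly matters when the name is hard-coded as in the original gist.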
Nice work, ggulati. I did some roughly similar but more basic stuff a while ago, using the same or similar libraries (though you have researched more libs):<p>Recognizing speech (speech-to-text) with the Python speech module<p><a href="https://code.activestate.com/recipes/579115-recognizing-speech-speech-to-text-with-the-python-/?in=user-4173351" rel="nofollow">https://code.activestate.com/recipes/579115-recognizing-spee...</a><p>and<p>Python text-to-speech with pyttsx<p><a href="https://code.activestate.com/recipes/578839-python-text-to-speech-with-pyttsx/?in=user-4173351" rel="nofollow">https://code.activestate.com/recipes/578839-python-text-to-s...</a><p>Good stuff. I like this area.
Microsoft's translation API has a free tier of 1 million characters/month for text to speech, with male and female voices.<p>The quality is good enough, and it is a good start for those who cannot afford to pay for Google's API.
Excellent post. Very interesting. I see how it works, but I am using Python 2.7, so based on your headline I suppose it won't work for me. This is the first real lead I've seen for integrating it easily. Pricing isn't terrible, if it goes to production. Too bad there is no way to test it first for development. But we're lucky to have this at all.<p>The link to the VLC library is pretty handy.
I have had a problem with using the speech_recognition library in that it does not stop listening when silence occurs.<p>After trying to tweak the threshold parameters without success I just figured I'd add a custom key-command to break the listening loop in my project.
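For anyone hitting the same thing, it may help to see why silence detection can fail to trigger. The sketch below is a simplified model of energy-based endpointing, not the library's actual code: listening stops after a run of frames whose energy falls below a threshold. The function name, the frame energies, and the `max_frames` cap are all illustrative assumptions; the library's own knobs for this are the `energy_threshold` and `pause_threshold` attributes on the recognizer.

```python
def frames_until_stop(frame_energies, energy_threshold=300.0, pause_frames=3, max_frames=None):
    """Return how many frames are consumed before listening stops.

    Listening stops after `pause_frames` consecutive frames below
    `energy_threshold`, or after `max_frames` frames if a cap is given.
    """
    quiet_run = 0
    count = 0
    for count, energy in enumerate(frame_energies, start=1):
        # A loud frame resets the run of quiet frames.
        quiet_run = quiet_run + 1 if energy < energy_threshold else 0
        if quiet_run >= pause_frames:
            return count
        if max_frames is not None and count >= max_frames:
            return count
    return count  # input ran out before any stop condition was met


# Speech followed by a real pause: stops once three quiet frames in a row are seen.
speech_then_pause = [500, 600, 100, 700, 100, 100, 100, 900]
stop_quiet_room = frames_until_stop(speech_then_pause)

# Noisy room: background energy never drops below the threshold, so only the cap stops it.
noisy_room = [400] * 10
stop_noisy_uncapped = frames_until_stop(noisy_room)
stop_noisy_capped = frames_until_stop(noisy_room, max_frames=4)
```

If the room's background level sits above the threshold, the quiet run never builds up and the loop only ends when the input does, which would match the behavior described. A hard cap on phrase length plays the same role as the manual key command.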
Does this work without an internet connection (once downloaded)? If yes, how big is the downloaded footprint? I still haven't gone through the webpage carefully.