For people who want simple, out-of-the-box tooling (not necessarily in Python) for just getting phonemes, I can also recommend [0]. Not amazing recognition quality, but dead simple to set up, and it is possible to integrate a language model as well (I never needed one for my task). The author shows it in [1] too, but kind of skims right past it - to me, if you want to understand speech recognition in detail, pocketsphinx-python is one of the best ways in. Customizing the language model is a <i>huge</i> boost for domain-specific recognition.<p>Large-company APIs will usually be better at generic-speaker, generic-language recognition - but if you can do speaker adaptation and customize the language model, there are some insane gains possible, since you prune out a lot of uncertainty and complexity.<p>If you are more interested in recognition and alignment to a script, "gentle" is great [2][3]. Its guts also include raw Kaldi recognition, which is pretty good as a generic speech recognizer, but you would need to do some coding to pull that part out on its own.<p>For a decent-performing deep model, check out Mozilla's version of Baidu's DeepSpeech [4].<p>If doing full-on development, my colleague has been using a bridge between PyTorch (for training) and Kaldi (to use their decoders) with good results [5].<p>[0] how I use pocketsphinx to get phonemes, <a href="https://github.com/kastnerkyle/ez-phones" rel="nofollow">https://github.com/kastnerkyle/ez-phones</a><p>[1] <a href="https://github.com/cmusphinx/pocketsphinx-python" rel="nofollow">https://github.com/cmusphinx/pocketsphinx-python</a><p>[2] <a href="https://github.com/lowerquality/gentle" rel="nofollow">https://github.com/lowerquality/gentle</a><p>[3] how I use gentle for forced alignment, <a href="https://github.com/kastnerkyle/raw_voice_cleanup/tree/master/alignment" rel="nofollow">https://github.com/kastnerkyle/raw_voice_cleanup/tree/master...</a><p>[4] <a href="https://github.com/mozilla/DeepSpeech" rel="nofollow">https://github.com/mozilla/DeepSpeech</a><p>[5] <a href="https://github.com/mravanelli/pytorch-kaldi" rel="nofollow">https://github.com/mravanelli/pytorch-kaldi</a>
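To make the language-model point concrete: a toy bigram model trained on a small, hypothetical set of domain phrases (not pocketsphinx's actual API, just an illustration) assigns much higher probability to in-domain word sequences than to arbitrary ones, which is exactly the uncertainty pruning a customized LM gives a decoder:

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def logprob(sent, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    toks = ["<s>"] + sent + ["</s>"]
    lp = 0.0
    for a, b in zip(toks, toks[1:]):
        lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return lp

# Hypothetical domain corpus: voice commands for a music player.
corpus = [
    "play the next song".split(),
    "play the previous song".split(),
    "pause the song".split(),
    "stop the music".split(),
]
uni, bi = train_bigram(corpus)
V = len(uni)

in_domain = logprob("play the next song".split(), uni, bi, V)
out_domain = logprob("song next the play".split(), uni, bi, V)
print(in_domain, out_domain)  # the in-domain phrase scores far higher
```

In a real system you would train an ARPA-format model on your domain text and hand it to the decoder, but the scoring idea is the same.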
No mention of DNN-based ASR like DeepSpeech? There are even open source Python implementations available from Mozilla and Paddle.<p>These models are way easier to train, have surprisingly good accuracy, and are robust to noise.
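For a sense of how these models produce text: DeepSpeech-style networks emit a per-frame distribution over characters plus a CTC "blank", and the simplest decoding step collapses repeats and drops blanks. A pure-Python sketch of greedy CTC decoding with made-up frame probabilities (real toolkits use beam search, often with a language model):

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding: take the argmax symbol per frame,
    collapse consecutive repeats, then drop blank symbols."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Alphabet index 0 is the CTC blank; toy per-frame probabilities.
alphabet = ["_", "a", "c", "t"]
frames = [
    [0.1, 0.1, 0.7, 0.1],    # c
    [0.1, 0.1, 0.6, 0.2],    # c (repeat, collapsed)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.7, 0.1, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
]
print(ctc_greedy_decode(frames, alphabet))  # -> "cat"
```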
The SpeechRecognition module is pretty popular, but it has some important API design flaws. The thing is that speech is a continuous stream of data, and for a proper user experience you need a streaming-like API - you need to respond to events as soon as they appear. You need to filter out silence and wait for actual words. You need to delay reacting to input until the user has clearly expressed their goal. Such a streaming API is provided by major engines like Google's and CMUSphinx and enables a natural, responsive experience. Unfortunately, the SpeechRecognition module does not support streaming, so developers often restrict themselves. A proper guide should cover Google's streaming API in more depth.
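The silence-filtering/endpointing part of a streaming pipeline can be sketched with a simple energy-based detector over audio chunks (toy thresholds and pure Python here - not Google's or CMUSphinx's actual API, just the idea of turning a raw stream into utterance events):

```python
def detect_utterances(chunks, threshold=500.0, min_silence_chunks=3):
    """Group a stream of audio chunks into utterances: speech starts when
    chunk energy exceeds the threshold, and an utterance ends after a run
    of min_silence_chunks quiet chunks."""
    utterances, current, quiet = [], [], 0
    for chunk in chunks:
        energy = sum(s * s for s in chunk) / len(chunk)
        if energy >= threshold:
            current.append(chunk)
            quiet = 0
        elif current:
            quiet += 1
            current.append(chunk)
            if quiet >= min_silence_chunks:
                utterances.append(current[:-quiet])  # trim trailing silence
                current, quiet = [], 0
    if current:  # flush a final, unterminated utterance
        utterances.append(current[:len(current) - quiet] if quiet else current)
    return utterances

# Toy 4-sample chunks: loud chunks are "speech", quiet ones are "silence".
loud = [100, -100, 100, -100]  # mean energy 10000
soft = [1, -1, 1, -1]          # mean energy 1
stream = [soft, loud, loud, soft, soft, soft, loud]
utts = detect_utterances(stream)
print(len(utts))  # two separate utterances detected
```

A streaming recognizer wires logic like this (usually a proper VAD) to callbacks, so the app reacts the moment an utterance ends rather than after a fixed-length recording.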
I had experimented with Python libraries for both speech recognition and speech synthesis a while ago. It was very basic stuff, but fun:<p>Speech recognition with the Python "speech" module:<p><a href="https://jugad2.blogspot.in/2014/03/speech-recognition-with-python-speech.html" rel="nofollow">https://jugad2.blogspot.in/2014/03/speech-recognition-with-p...</a><p>Speech synthesis in Python with pyttsx:<p><a href="https://jugad2.blogspot.in/2014/03/speech-synthesis-in-python-with-pyttsx.html" rel="nofollow">https://jugad2.blogspot.in/2014/03/speech-synthesis-in-pytho...</a><p>Check out the synthetic voice announcing an arriving train in Sweden (near top of 2nd post above).
> Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM).<p>This is not correct. Most modern speech recognition systems are based on deep neural networks (DNNs).