TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Guide to Speech Recognition with Python

143 pointsby hn17about 7 years ago

5 comments

kastnerkyleabout 7 years ago
For people who want simple, out of the box stuff (not necessarily in Python) for just getting phonemes I can also recommend [0]. Not amazing recognition quality, but dead simple setup, and it is possible to integrate a language model as well (I never needed one for my task). The author showed it as well in [1], but kind of skimmed right by - but to me if you want to know speech recognition in detail, pocketsphinx-python is one of the best ways. Customizing the language model is a <i>huge</i> boost in domain specific recognition.<p>Large company APIs will usually be better at generic speaker, generic language recognition - but if you can do speaker adaptation and customize the language model, there are some insane gains possible since you prune out a lot of uncertainty and complexity.<p>If you are more interested in recognition and alignment to a script, &quot;gentle&quot; is great [2][3]. The guts also have raw Kaldi recognition, which is pretty good for a generic speech recognizer but you would need to do some coding to pull out that part on its own.<p>For a decent performing deep model, check into Mozilla&#x27;s version of Baidu&#x27;s DeepSpeech [4].<p>If doing full-on development, my colleague has been using a bridge between PyTorch (for training) and Kaldi (to use their decoders) to good success [5].<p>[0] how I use pocketsphinx to get phonemes, <a href="https:&#x2F;&#x2F;github.com&#x2F;kastnerkyle&#x2F;ez-phones" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;kastnerkyle&#x2F;ez-phones</a><p>[1] <a href="https:&#x2F;&#x2F;github.com&#x2F;cmusphinx&#x2F;pocketsphinx-python" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;cmusphinx&#x2F;pocketsphinx-python</a><p>[2] <a href="https:&#x2F;&#x2F;github.com&#x2F;lowerquality&#x2F;gentle" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;lowerquality&#x2F;gentle</a><p>[3] how I use gentle for foreced alignment, <a href="https:&#x2F;&#x2F;github.com&#x2F;kastnerkyle&#x2F;raw_voice_cleanup&#x2F;tree&#x2F;master&#x2F;alignment" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;kastnerkyle&#x2F;raw_voice_cleanup&#x2F;tree&#x2F;master...</a><p>[4] <a href="https:&#x2F;&#x2F;github.com&#x2F;mozilla&#x2F;DeepSpeech" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;mozilla&#x2F;DeepSpeech</a><p>[5] <a href="https:&#x2F;&#x2F;github.com&#x2F;mravanelli&#x2F;pytorch-kaldi" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;mravanelli&#x2F;pytorch-kaldi</a>
capo64about 7 years ago
No mention of DNN based ASR like DeepSpeech? There’s even open source python implementations available from Mozilla and Paddle.<p>These models are way easier to train, have surprisingly good accuracy, and are robust to noise.
评论 #16672354 未加载
评论 #16672453 未加载
评论 #16672358 未加载
评论 #16672173 未加载
nshmabout 7 years ago
The SpeechRecognition module is pretty popular but it has some important API design flows. The thing is that speech is always continuous stream of data and you need a streaming-like API for proper user experience - you need to respond on events as soon as they appear. You need to filter silence and wait for actual words. You need to delay input reaction until the user clearly expressed the goal. Such streaming API is provided by major engines like Google and CMUSphinx and enables natural and responsive experience. Unfortunately SpeechRecognition module does not support streaming so developers often restrict themselves. A proper guide should better cover Google&#x27;s streaming API.
评论 #16674960 未加载
vram22about 7 years ago
I had experimented with Python libraries for both speech recognition and speech synthesis a while ago. It was very basic stuff, but fun:<p>Speech recognition with the Python &quot;speech&quot; module:<p><a href="https:&#x2F;&#x2F;jugad2.blogspot.in&#x2F;2014&#x2F;03&#x2F;speech-recognition-with-python-speech.html" rel="nofollow">https:&#x2F;&#x2F;jugad2.blogspot.in&#x2F;2014&#x2F;03&#x2F;speech-recognition-with-p...</a><p>Speech synthesis in Python with pyttsx:<p><a href="https:&#x2F;&#x2F;jugad2.blogspot.in&#x2F;2014&#x2F;03&#x2F;speech-synthesis-in-python-with-pyttsx.html" rel="nofollow">https:&#x2F;&#x2F;jugad2.blogspot.in&#x2F;2014&#x2F;03&#x2F;speech-synthesis-in-pytho...</a><p>Check out the synthetic voice announcing an arriving train in Sweden (near top of 2nd post above).
payne92about 7 years ago
&gt; Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM).<p>This not correct. Most modern speech recognition systems are based on deep neural nets (DNN).
评论 #16672665 未加载
评论 #16674605 未加载