
Guide to Speech Recognition with Python

143 points by hn17 about 7 years ago

5 comments

kastnerkyle about 7 years ago
For people who want simple, out-of-the-box stuff (not necessarily in Python) for just getting phonemes, I can also recommend [0]. Not amazing recognition quality, but dead simple setup, and it is possible to integrate a language model as well (I never needed one for my task). The author showed it as well in [1], but kind of skimmed right by - but to me, if you want to know speech recognition in detail, pocketsphinx-python is one of the best ways. Customizing the language model is a *huge* boost in domain-specific recognition.

Large company APIs will usually be better at generic-speaker, generic-language recognition - but if you can do speaker adaptation and customize the language model, there are some insane gains possible, since you prune out a lot of uncertainty and complexity.

If you are more interested in recognition and alignment to a script, "gentle" is great [2][3]. The guts also have raw Kaldi recognition, which is pretty good for a generic speech recognizer, but you would need to do some coding to pull out that part on its own.

For a decent-performing deep model, check into Mozilla's version of Baidu's DeepSpeech [4].

If doing full-on development, my colleague has been using a bridge between PyTorch (for training) and Kaldi (to use their decoders) to good success [5].

[0] how I use pocketsphinx to get phonemes, https://github.com/kastnerkyle/ez-phones

[1] https://github.com/cmusphinx/pocketsphinx-python

[2] https://github.com/lowerquality/gentle

[3] how I use gentle for forced alignment, https://github.com/kastnerkyle/raw_voice_cleanup/tree/master/alignment

[4] https://github.com/mozilla/DeepSpeech

[5] https://github.com/mravanelli/pytorch-kaldi
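(For context, a minimal sketch of "allphone"-style phoneme decoding with pocketsphinx-python - this is not the ez-phones code from [0], and the model paths and beam settings below are assumptions that depend on where the pocketsphinx model files are installed on your system:)

    # Phoneme ("allphone") decoding with pocketsphinx-python.
    # MODEL_DIR is an assumed install location; adjust to your system.
    import os
    from pocketsphinx.pocketsphinx import Decoder

    MODEL_DIR = "/usr/local/share/pocketsphinx/model/en-us"

    config = Decoder.default_config()
    config.set_string("-hmm", os.path.join(MODEL_DIR, "en-us"))                    # acoustic model
    config.set_string("-allphone", os.path.join(MODEL_DIR, "en-us-phone.lm.bin"))  # phoneme language model
    config.set_float("-lw", 2.0)
    config.set_float("-beam", 1e-20)
    config.set_float("-pbeam", 1e-20)

    decoder = Decoder(config)
    decoder.start_utt()
    with open("utterance.raw", "rb") as f:  # expects 16 kHz, 16-bit mono PCM
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()

    # Each segment is one hypothesized phone with its start/end frame.
    for seg in decoder.seg():
        print(seg.word, seg.start_frame, seg.end_frame)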
capo64 about 7 years ago
No mention of DNN-based ASR like DeepSpeech? There are even open-source Python implementations available from Mozilla and Paddle.

These models are way easier to train, have surprisingly good accuracy, and are robust to noise.
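(As a rough sketch of what inference with Mozilla's DeepSpeech Python package looks like - the model and scorer file names are placeholders for files you would download from the DeepSpeech releases page:)

    # Offline inference with the "deepspeech" Python package.
    # Model/scorer file names are placeholders from a typical release download.
    import wave
    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-models.pbmm")
    model.enableExternalScorer("deepspeech-models.scorer")  # optional external language-model scorer

    with wave.open("utterance.wav", "rb") as w:  # expects 16 kHz, 16-bit mono
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(model.stt(audio))  # prints the decoded transcript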
nshm about 7 years ago
The SpeechRecognition module is pretty popular, but it has some important API design flaws. The thing is that speech is always a continuous stream of data, and you need a streaming-like API for a proper user experience - you need to respond to events as soon as they appear. You need to filter silence and wait for actual words. You need to delay reacting to input until the user has clearly expressed the goal. Such a streaming API is provided by major engines like Google and CMUSphinx and enables a natural, responsive experience. Unfortunately, the SpeechRecognition module does not support streaming, so developers often restrict themselves. A proper guide should better cover Google's streaming API.
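(A rough illustration of the streaming style described above, using the google-cloud-speech client rather than the SpeechRecognition module; the microphone generator is a placeholder you would implement with real audio capture, and credentials are assumed to be configured:)

    # Streaming recognition sketch with the google-cloud-speech package.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,  # partial hypotheses while the user is still speaking
    )

    def audio_chunks():
        # Placeholder: yield raw 16-bit PCM chunks captured from the microphone.
        raise NotImplementedError

    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks()
    )

    # Responses arrive as the audio streams in, so the app can react to
    # interim results instead of waiting for the whole utterance to finish.
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            print(result.is_final, result.alternatives[0].transcript)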
vram22 about 7 years ago
I had experimented with Python libraries for both speech recognition and speech synthesis a while ago. It was very basic stuff, but fun:

Speech recognition with the Python "speech" module:

https://jugad2.blogspot.in/2014/03/speech-recognition-with-python-speech.html

Speech synthesis in Python with pyttsx:

https://jugad2.blogspot.in/2014/03/speech-synthesis-in-python-with-pyttsx.html

Check out the synthetic voice announcing an arriving train in Sweden (near the top of the 2nd post above).
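(The synthesis side of this comes down to a few lines; a minimal sketch using pyttsx3, the maintained fork of pyttsx, which exposes the same init/say/runAndWait calls for this basic case - the spoken sentence is just an example string:)

    # Basic text-to-speech with pyttsx3 (maintained fork of pyttsx).
    import pyttsx3

    engine = pyttsx3.init()            # picks a platform driver (SAPI5, NSSpeechSynthesizer, espeak)
    engine.setProperty("rate", 150)    # speaking rate in words per minute
    engine.say("The train to Stockholm departs from platform three.")
    engine.runAndWait()                # blocks until the queued utterance has been spoken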
payne92 about 7 years ago
> Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM).

This is not correct. Most modern speech recognition systems are based on deep neural nets (DNNs).