TechEcho
Building an end-to-end Speech Recognition model in PyTorch

288 points by makaimc about 5 years ago

6 comments

zerop about 5 years ago
Good article. Real-time speech recognition really needs a working open source solution. I have been evaluating DeepSpeech, which is okay, but a lot of work is needed to bring it close to Google's speech engine. Apart from a good deep neural network, a good speech recognition system needs two important things:

1. Tons of diverse, real-world datasets
2. A solution for noise: either de-noise and train, or train with noise

Speech recognition also has to solve several extra challenges that are not common in other deep learning problems:

1. Pitch
2. Speed of conversation
3. Accents (can be solved with more data, I think)
4. Real-time inference (low latency)
5. On the edge (i.e. offline on mobile devices)
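To make the "train with noise" option concrete, here is a minimal sketch of additive noise augmentation at a target signal-to-noise ratio. It is an illustration only: the function names (`mix_at_snr`, `measured_snr_db`) are hypothetical, the signals are synthetic stand-ins rather than real speech, and nothing here is taken from the article or from DeepSpeech.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mix has the requested SNR in dB."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    if clean_power == 0 or noise_power == 0:
        return list(clean)  # nothing to scale against
    # Solve SNR_dB = 10 * log10(clean_power / (scale**2 * noise_power)) for scale.
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(clean, noise)]


def measured_snr_db(clean, mixed):
    """SNR of a mix, recovered by subtracting the known clean signal."""
    residual = [m - c for m, c in zip(mixed, clean)]
    clean_power = sum(x * x for x in clean)
    residual_power = sum(r * r for r in residual)
    return 10 * math.log10(clean_power / residual_power)


rng = random.Random(0)
# Stand-in signals (assumed): one second of a 440 Hz tone at 16 kHz, plus Gaussian noise.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

In a training pipeline, the SNR would typically be sampled at random per utterance so the model sees a range of noise levels.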
albertzeyer about 5 years ago
This seems to be a CTC model. CTC is not really the best option for a good end-to-end system. Encoder-decoder-attention models or RNN-T models are both better alternatives.

There is also not really a problem with the availability of open source code. Countless open source projects already have this mostly ready to use, for all the common DL frameworks: TF, PyTorch, JAX, MXNet, whatever. For anyone with a bit of ML experience, this should really not be too hard to set up.

But to get good performance on your own dataset, what you really need is experience. Taking some existing pipeline will probably get you a model with an okay-ish word error rate, but then you should tune it. In any case, even without tuning, encoder-decoder-attention models will probably perform better than CTC models.
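For readers unfamiliar with what a CTC model outputs, the sketch below shows greedy CTC decoding: take the best label per frame, collapse consecutive repeats, then drop the blank symbol. Greedy decoding is assumed here purely for illustration (the article may decode differently), and the alphabet, the `one_hot` helper, and the frame scores are made up.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, remove blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


def one_hot(index, size):
    """Hypothetical helper: fake per-frame scores that favour one label."""
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = ["-", "c", "a", "t"]  # index 0 is the CTC blank
path = [1, 1, 0, 2, 2, 0, 3, 3]  # frames: c c - a a - t t
frames = [one_hot(i, len(alphabet)) for i in path]
text = ctc_greedy_decode(frames, alphabet)  # "cat"
```

The blank is what lets CTC emit a doubled letter: the frame sequence a, blank, a decodes to "aa", while a, a decodes to "a".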
spzb about 5 years ago
This is probably really good, but the linked Colab notebook fails on the first step with some unresolvable dependencies. This does seem to be a bit of a common theme whenever I try running example ML projects.

Edit: I think I've fixed it by changing the pip command to:

!pip install torchaudio comet_ml==3.0
option about 5 years ago
Have a look at NeMo (https://github.com/nvidia/NeMo). It comes with QuartzNet (only 19M weights and better accuracy than DeepSpeech2), pretrained on thousands of hours of speech.
coder543 about 5 years ago
It was mentioned once in the other comments here without a link, but another open source speech recognition model I heard about recently is Mozilla DeepSpeech:

https://github.com/mozilla/DeepSpeech

https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/

I haven't had a chance to test it, and I wish there were a client-side WASM demo of it that I could just visit on Mozilla's site.
komuher about 5 years ago
Dunno why (probably the dataset), but open source speech recognition models perform very poorly on real-world data compared to Google Speech-to-Text or Azure Cognitive Services.
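Comparisons like this are usually quantified with word error rate (WER), the metric mentioned earlier in the thread. As a rough sketch, WER is the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The function name and the sample sentences below are illustrative assumptions, not results from any of the systems named here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up example: one substituted word and one dropped word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.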