Good article. Speech recognition for real-time use cases badly needs a solid open source solution. I have been evaluating DeepSpeech, which is okay, but a lot of work is still needed to get it close to the Google Speech engine. Apart from a good deep neural network, a good speech recognition system needs two important things:<p>1. Tons of diverse (real-world) data sets<p>2. A solution for noise - either de-noise and train, OR train with noise<p>There are also extra challenges that speech recognition has to solve which are not common in other deep learning problems:<p>1. Pitch<p>2. Speed of conversation<p>3. Accents (can be solved with more data, I think)<p>4. Real-time inference (low latency)<p>5. On the edge (i.e. offline on mobile devices)
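On the "train with noise" point: the usual trick is to mix clean training audio with random noise scaled to a target signal-to-noise ratio. A minimal sketch in plain Python (real pipelines would do this with numpy or torchaudio; the function names here are just for illustration):

```python
import math
import random

def rms(signal):
    # Root-mean-square energy of a signal.
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so the clean/noise power ratio equals snr_db,
    # then add it sample-by-sample to the clean signal.
    scale = rms(clean) / (rms(noise) * 10 ** (snr_db / 20))
    return [c + scale * n for c, n in zip(clean, noise)]

random.seed(0)
# 1 second of a 440 Hz tone at 16 kHz stands in for "clean speech".
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Training on a range of SNRs (say 0-20 dB) rather than one fixed value tends to generalize better, since real-world noise levels vary.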
This seems to be a CTC model. CTC is not really the best option for a good end-to-end system; encoder-decoder-attention models and RNN-T models are both better alternatives.<p>There is also not really a problem with available open source code. There are countless open source projects which already have this mostly ready to use, for all the common DL frameworks: TF, PyTorch, JAX, MXNet, whatever. For anyone with a bit of ML experience, this should really not be too hard to set up.<p>But to get good performance on your own dataset, what you really need is experience. Taking some existing pipeline will probably get you a model with an okayish word error rate, but then you should tune it. In any case, even without tuning, encoder-decoder-attention models will probably perform better than CTC models.
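For anyone unfamiliar with CTC: the model emits one label (or a special blank) per frame, and decoding collapses consecutive repeats and then drops the blanks. A toy greedy decoder showing the collapsing rule (real systems use beam search, usually with a language model; the blank symbol choice here is arbitrary):

```python
BLANK = "_"  # CTC blank symbol (typically index 0 in real implementations)

def ctc_collapse(frames):
    # Greedy CTC decoding: merge consecutive duplicate labels,
    # then remove blanks.
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Per-frame argmax for "cat": repeats and blanks collapse away.
print(ctc_collapse(list("cc_aaa_t")))  # -> cat
# A blank between identical labels keeps them distinct ("t_t" -> "tt").
print(ctc_collapse(list("ca_t_t")))    # -> catt
```

This per-frame independence (no conditioning on previously emitted labels) is exactly why encoder-decoder-attention and RNN-T models, which do condition on label history, tend to win.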
This is probably really good but the linked Colab notebook is failing on the first step with some unresolvable dependencies. This does seem to be a bit of a common theme whenever I try running example ML projects.<p>Edit: I think I've fixed it by changing the pip command to:<p>!pip install torchaudio comet_ml==3.0
Have a look at NeMo <a href="https://github.com/nvidia/NeMo" rel="nofollow">https://github.com/nvidia/NeMo</a> - it comes with QuartzNet (only 19M weights, and better accuracy than DeepSpeech2), pretrained on thousands of hours of speech.
Mentioned once in the other comments here without any link, but another open source speech recognition model I heard about recently is Mozilla DeepSpeech:<p><a href="https://github.com/mozilla/DeepSpeech" rel="nofollow">https://github.com/mozilla/DeepSpeech</a><p><a href="https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/" rel="nofollow">https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-sp...</a><p>I haven't had a chance to test it, and I wish there were a client-side WASM demo of it that I could just visit on Mozilla's site.
Dunno why (probably the training data), but open source speech recognition models perform very poorly on real-world data compared to Google Speech-to-Text or Azure Cognitive Services.
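The standard way to quantify "performing poorly" is word error rate: word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A rough sketch, assuming whitespace tokenization (real evaluations also normalize casing and punctuation first):

```python
def wer(reference, hypothesis):
    # Word error rate: word-level Levenshtein distance divided by
    # the number of words in the reference transcript.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of 6 words
```

Running the same held-out real-world recordings through an open source model and a commercial API and comparing WER this way makes the gap concrete instead of anecdotal.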