All I see is """Sorry, this content isn't available right now
The link you followed may have expired, or the page may only be visible to an audience you're not in.
Go back to the previous page · Go to News Feed · Visit our Help Center"""<p>Edit: found a link that works <a href="https://github.com/facebookresearch/wav2letter" rel="nofollow">https://github.com/facebookresearch/wav2letter</a>
So by "open sourced" I assume this means there are absolutely no Facebook dependencies, i.e. no audio passing through a Facebook server? Sorry, I have to ask, as my trust level is low. Otherwise, awesome!
Online speech recognition <i>for English</i>.<p>The framework should be generalizable, but the models they are making available are only for English. Actually adapting this for any other language would be a huge amount of additional work.
How does this compare to Mozilla's DeepSpeech?<p>And does anyone know when Mozilla will release the updated Common Voice dataset from <a href="https://voice.mozilla.org" rel="nofollow">https://voice.mozilla.org</a> ?
I'd love a tutorial that shows a normal guy like me how to use this tool with the pre-trained models to transcribe my audio files. I'm not finding anything of that kind in the repo.
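For what it's worth, here is roughly what invoking the repo's streaming inference example looks like, wrapped in Python. The binary name and flag names are taken from my reading of the repo's inference tutorial, so treat them as assumptions and check the current README before relying on them:<p><pre><code>import subprocess

# Assumed invocation of the streaming inference example that ships with
# wav2letter@anywhere; the binary name and flags follow my reading of the
# repo's inference tutorial and may differ in the current release.
MODEL_DIR = "/path/to/pretrained/model"   # acoustic model, LM, lexicon, etc.
AUDIO_FILE = "/path/to/recording.wav"     # 16 kHz, 16-bit mono PCM WAV

result = subprocess.run(
    [
        "simple_streaming_asr_example",
        f"--input_files_base_path={MODEL_DIR}",
        f"--input_audio_file={AUDIO_FILE}",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the transcript is printed as the audio streams through
</code></pre>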
The preprint: <a href="https://research.fb.com/wp-content/uploads/2020/01/Scaling-up-online-speech-recognition-using-ConvNets.pdf" rel="nofollow">https://research.fb.com/wp-content/uploads/2020/01/Scaling-u...</a><p>Interestingly, the baselines are all systems that model graphemes directly rather than acoustic units (phonemes).
I'd be really interested in how accurate this tool is at solving Google's audio CAPTCHAs. I'm assuming the price of solving CAPTCHAs will drop further.
If I may insert a relevant plug: we (MERL) just put out a paper last week with SOTA 7.0% WER on LibriSpeech test-other (vs. wav2letter@anywhere's 7.5%) with 590 ms theoretical latency, using a joint CTC-Transformer with parallel time-delayed LSTM and triggered attention.
Check it out: <a href="https://arxiv.org/abs/2001.02674" rel="nofollow">https://arxiv.org/abs/2001.02674</a>
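For anyone wondering what "joint CTC-Transformer" means concretely: the usual hybrid CTC/attention recipe (Watanabe et al.) trains one encoder with both a CTC head and an attention decoder and interpolates the two losses. A minimal PyTorch sketch of just that objective, with illustrative tensor shapes; this is not our exact code, and it omits the triggered-attention and time-delayed-LSTM parts:<p><pre><code>import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                             input_lengths, target_lengths, lam=0.3):
    """Hybrid CTC/attention objective: L = lam * L_ctc + (1 - lam) * L_att.

    ctc_log_probs:  (T, B, V) log-probabilities from the CTC head
    att_logits:     (B, U, V) logits from the attention decoder
    targets:        (B, U) target token ids (0 assumed as blank/padding here)
    lam:            interpolation weight; 0.3 is a typical value in
                    hybrid CTC/attention recipes, not necessarily ours
    """
    l_ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                       blank=0, zero_infinity=True)
    l_att = F.cross_entropy(att_logits.transpose(1, 2), targets,
                            ignore_index=0)
    return lam * l_ctc + (1.0 - lam) * l_att
</code></pre>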
I'm about to start as a professor in CS education, and am hoping we're getting close to the point where I can easily transcribe interviews and high-quality dialogue audio using open-sourced models running on machines in my lab. I'm tired of paying $1/minute for human transcription that's not great anyway, and would love to undertake research that would require processing a lot more audio than is affordable on those terms.<p>I haven't kept up with developments over the last two years--anyone have a sense of whether this is close to being a reality?<p>(I've taken a bunch of Stanford's graduate AI courses on NLP and speech recognition; I can read documentation and deploy/configure models but don't have much appetite for getting into the weeds.)
Given that this uses a beam search decoder to find the most likely word pattern, is it possible small perturbations in audio could cause it to improperly decode certain word strings? Sort of like the audio equivalent of adversarial attacks, but on ASR?
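Yes, this is a known attack surface: Carlini &amp; Wagner's "Audio Adversarial Examples" paper demonstrated targeted attacks against DeepSpeech. Intuitively, beam search just keeps the top-k highest scoring hypotheses, so a perturbation only has to nudge a few frame-level posteriors across a decision boundary to flip the winning hypothesis. A toy illustration of that mechanism (not wav2letter's actual decoder, which also folds in lexicon and language-model scores and handles CTC blanks):<p><pre><code>import math

def beam_search(step_log_probs, beam_size=3):
    """Toy beam search over per-step token log-probabilities.

    step_log_probs: list of {token: log_prob} dicts, one per decoding step.
    Returns the highest-scoring token sequence.
    """
    beams = [((), 0.0)]
    for step in step_log_probs:
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in step.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

# A tiny nudge to the step-1 posteriors flips which hypothesis wins the beam.
clean     = [{"cat": math.log(0.51), "cap": math.log(0.49)},
             {"sat": math.log(0.90), "sap": math.log(0.10)}]
perturbed = [{"cat": math.log(0.49), "cap": math.log(0.51)},
             {"sat": math.log(0.90), "sap": math.log(0.10)}]
print(beam_search(clean))      # ('cat', 'sat')
print(beam_search(perturbed))  # ('cap', 'sat')
</code></pre>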
The name must be a nod to Word2Vec[1]. A cool naming scheme IMO.<p>[1] <a href="https://en.m.wikipedia.org/wiki/Word2vec" rel="nofollow">https://en.m.wikipedia.org/wiki/Word2vec</a>
Do the pretrained models work decently on landline phone quality recordings? I can see massive value for this if it can transcribe corporate call center audio.
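One caveat to keep in mind: landline audio is typically 8 kHz, while most published acoustic models (presumably including these, though I haven't checked the model documentation) are trained on 16 kHz speech. A common stopgap is to resample before inference, though upsampling can't recover the missing 4-8 kHz band, so accuracy will likely be worse than on native 16 kHz recordings. A minimal sketch with torchaudio, assuming the model expects 16 kHz input:<p><pre><code>import torchaudio

# Call-center/landline audio is usually 8 kHz; upsample to the 16 kHz the
# acoustic model presumably expects. This widens the sample rate but cannot
# restore frequency content the telephone channel already discarded.
waveform, sample_rate = torchaudio.load("call_center_clip.wav")  # e.g. 8000 Hz
if sample_rate != 16000:
    resample = torchaudio.transforms.Resample(orig_freq=sample_rate,
                                              new_freq=16000)
    waveform = resample(waveform)
torchaudio.save("call_center_clip_16k.wav", waveform, 16000)
</code></pre>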