One addendum to the linked post's notes:

> SOTA in the low-resource Libri-light setting by a lot on WER, clean test, 100h labeled: others ~4 vs. theirs ~2.5

> SOTA on high-resource noisy data (3.3 vs. 3.4), close to SOTA on clean data

This note isn't very specific, but if I'm understanding it correctly it's outdated. To my knowledge, the SOTA on this data is held by Conformer 1B (a 1-billion-parameter model), at 1.4 WER clean and 2.6 noisy.

Conformer 1B is roughly wav2vec 2.0 pretraining + Conformer + noisy student training + SpecAugment.

https://arxiv.org/pdf/2010.10504.pdf

--

Wav2vec 2.0 is very cool, but I've had some trouble reproducing the pretraining and fine-tuning reliably. It may simply need a lot of resources (e.g. hundreds of clustered GPUs).

I think wav2vec-U is extremely cool.
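That said, inference with the released checkpoints is much lighter-weight than reproducing the training. A minimal sketch using the HuggingFace transformers wrappers (the checkpoint name is the public LibriSpeech-fine-tuned release; "sample.wav" is a placeholder, and 16 kHz mono input is assumed):

    # Greedy CTC transcription with a pretrained wav2vec 2.0 checkpoint.
    import torch
    import librosa
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Resample to the 16 kHz the model expects.
    waveform, _ = librosa.load("sample.wav", sr=16000)
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)

    # Argmax per frame; batch_decode collapses repeats and CTC blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids))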
I wonder how much better this would be at capturing information that doesn't translate well into text representations of speech.

Consider how word2vec produces an embedding space with relationships between semantically related words. I would expect the classic word2vec examples (e.g. king -> queen being a similar translation as man -> woman) to apply here too, but can it also do things like place regular questions and rhetorical questions in different regions of the embedding space based on the inflection in the speech?

It would also be interesting to see what relationships exist between equivalent words in different languages within the embedding space. Something like that is presumably already used in neural text-translation models, but maybe some notable differences exist when dealing with speech directly.
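For reference, the word2vec analogy above is literally vector arithmetic in the embedding space, so the speech-level version of the question is whether prosody ends up linearly encoded the same way. A quick sketch of the text-side baseline with gensim's pretrained Google News vectors (the model name is gensim's standard download, ~1.6 GB):

    # king - man + woman ~= queen, as a nearest-neighbor search over embeddings.
    import gensim.downloader

    wv = gensim.downloader.load("word2vec-google-news-300")  # 300-dim vectors

    # most_similar ranks words by cosine similarity to (king + woman - man).
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # -> [('queen', ~0.71)]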
Does anyone know of good open-source projects for OCR? Tesseract always seems to be the default, and then Google Cloud and other services are miles ahead. For those who don't want to rely on the big tech companies, are there any comparable alternatives?
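For calibration, the Tesseract baseline everyone compares against is only a few lines in Python (a sketch via the pytesseract wrapper; "scan.png" is a placeholder path, and the tesseract binary itself has to be installed separately):

    # Baseline OCR with Tesseract through the pytesseract wrapper.
    # Requires the tesseract binary on PATH (e.g. apt install tesseract-ocr).
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open("scan.png"))
    print(text)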
As someone who's an idiot about machine learning: is it possible to run this code in reverse, i.e. take the generated (or novel) vectors and convert them back into audio/waveforms?
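Not directly, as far as I know: the learned vectors discard phase and fine acoustic detail, so mapping them back to waveforms generally means training a separate decoder/vocoder. The closest off-the-shelf analogue is inverting classical features, which is only approximate; a sketch using librosa's Griffin-Lim-based mel inversion (the audio clip is just librosa's bundled demo):

    # Mel spectrograms are lossy, but Griffin-Lim can reconstruct a
    # plausible waveform by iteratively estimating the missing phase.
    import librosa

    y, sr = librosa.load(librosa.example("trumpet"))
    mel = librosa.feature.melspectrogram(y=y, sr=sr)            # forward (lossy)
    y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)    # approximate inverse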
Great summary! I also recently wrote a post digging into the internals of wav2vec with illustrations:

The Illustrated Wav2vec - https://jonathanbgn.com/2021/06/29/illustrated-wav2vec.html