One addendum to the linked post's notes:<p>> SoTa in low-resource setting Libri-light by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5<p>> SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on clean data<p>This note isn't very specific, but if I'm understanding it correctly, it's outdated. To my understanding, the SOTA on this data is held by Conformer 1B (a 1-billion-parameter model), at 1.4 WER on the clean test set and 2.6 on the noisy one.<p>Conformer 1B is roughly wav2vec 2.0-style pretraining + a Conformer encoder + noisy student self-training + SpecAugment.<p><a href="https://arxiv.org/pdf/2010.10504.pdf" rel="nofollow">https://arxiv.org/pdf/2010.10504.pdf</a><p>--<p>Wav2vec 2.0 is very cool, but I've had some trouble reproducing the pretraining and fine-tuning reliably. It may simply require a lot of compute (e.g. hundreds of clustered GPUs).<p>I think wav2vec-U is extremely cool.
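Of the ingredients in that recipe, SpecAugment is the simplest to illustrate: it masks random frequency bands and time spans of the log-mel spectrogram during training. Below is a minimal NumPy sketch of that masking step; the function name and default mask widths are illustrative, not the exact Conformer 1B policy.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Illustrative SpecAugment-style masking (not the exact paper recipe).

    spec: log-mel spectrogram of shape (num_mel_bins, num_frames).
    Returns a copy with random frequency bands and time spans zeroed.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Zero out `num_freq_masks` random horizontal bands of mel bins.
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, freq_mask_width + 1))   # band height
        f0 = int(rng.integers(0, max(1, n_mels - f)))   # band start
        out[f0:f0 + f, :] = 0.0
    # Zero out `num_time_masks` random vertical spans of frames.
    for _ in range(num_time_masks):
        t = int(rng.integers(0, min(time_mask_width, n_frames) + 1))
        t0 = int(rng.integers(0, max(1, n_frames - t)))
        out[:, t0:t0 + t] = 0.0
    return out
```

Applied on the fly to each training example, this forces the model to rely on context rather than any single band or span, which is part of why the noisy-student loop works well on top of it.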