One addendum to the linked post's notes:<p>> SoTa in low-resource setting Libri-light by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5<p>> SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on clean data<p>This note isn't very specific, but if I'm understanding it correctly, it's outdated. To my understanding, the SOTA on this data is held by Conformer 1B (a 1-billion-parameter model), at 1.4 WER on the clean test set and 2.6 on the noisy one.<p>Conformer 1B is roughly wav2vec 2.0-style pretraining + a Conformer encoder + noisy student self-training + SpecAugment.<p><a href="https://arxiv.org/pdf/2010.10504.pdf" rel="nofollow">https://arxiv.org/pdf/2010.10504.pdf</a><p>--<p>Wav2vec 2.0 is very cool, but I've had some trouble reproducing the pretraining and fine-tuning reliably. It may simply require a lot of compute (e.g. hundreds of clustered GPUs).<p>I think wav2vec-U is extremely cool.
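Of the ingredients in that recipe, SpecAugment is the simplest to illustrate: it masks random frequency bands and time spans of the log-mel spectrogram during training. Below is a minimal NumPy sketch of that masking step; the function name and default mask widths are illustrative, not the exact Conformer 1B policy.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Illustrative SpecAugment-style masking (not the exact paper recipe).

    spec: log-mel spectrogram of shape (num_mel_bins, num_frames).
    Returns a copy with random frequency bands and time spans zeroed.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Zero out `num_freq_masks` random horizontal bands of mel bins.
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, freq_mask_width + 1))   # band height
        f0 = int(rng.integers(0, max(1, n_mels - f)))   # band start
        out[f0:f0 + f, :] = 0.0
    # Zero out `num_time_masks` random vertical spans of frames.
    for _ in range(num_time_masks):
        t = int(rng.integers(0, min(time_mask_width, n_frames) + 1))
        t0 = int(rng.integers(0, max(1, n_frames - t)))
        out[:, t0:t0 + t] = 0.0
    return out
```

Applied on the fly to each training example, this forces the model to rely on context rather than any single band or span, which is part of why the noisy-student loop works well on top of it.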