Wav2vec Overview: Semi and Unsupervised Speech Recognition

162 points by vackosar almost 4 years ago

5 comments

lunixbochs almost 4 years ago
One addendum to the linked post's notes:

> SoTa in low-resource setting Libri-light by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5

> SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on clean data

This note isn't super specific, but it's outdated if I'm understanding it correctly. To my understanding, the SOTA on this data is held by Conformer 1B (a 1 billion parameter model), at 1.4 clean, 2.6 noisy.

Conformer 1B is something like wav2vec 2.0 pretraining + conformer + noisy student + specaugment.

https://arxiv.org/pdf/2010.10504.pdf

--

Wav2vec 2.0 is very cool, but I've had some trouble reproducing the pretraining and fine tuning reliably. It might need a lot of resources (e.g. hundreds of clustered GPUs).

I think Wav2vec-U is extremely cool.
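For readers less familiar with the metric: the figures quoted above (~4 vs ~2.5, 1.4 clean, 2.6 noisy) are word error rates (WER), in percent. Below is a minimal sketch of how WER is commonly computed, as word-level edit distance divided by the number of reference words. The function name `word_error_rate` is illustrative and not taken from any of the cited papers or toolkits, and real evaluation pipelines usually normalize the text (casing, punctuation) first, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,                  # insertion: extra word in hypothesis
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Multiply the result by 100 to compare it with the percentages quoted in the thread.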
WillDaSilva almost 4 years ago
I wonder how much better this would be at capturing information that doesn't translate well into text representations of speech.

Consider how with word2vec there are relationships in the embedding space between semantically related words. I would expect the examples of that for word2vec (e.g. king -> queen being a similar translation as man -> woman) to apply here too, but can it also do things like place regular questions and rhetorical questions in different regions of the embedding space based on the inflection in the speech?

It would also be interesting to see what relationships exist between equivalent words in different languages within the embedding space. I suppose something like that is probably already used for text translation neural networks, but maybe some notable differences exist when dealing with speech directly.
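The king -> queen / man -> woman relationship mentioned above can be illustrated with a toy sketch. The two-dimensional vectors below are hand-made for illustration (roughly a "royalty" axis and a "gender" axis). They are assumptions, not real word2vec or wav2vec embeddings, and the helper names `cosine` and `analogy` are hypothetical. Whether speech embeddings behave this way is exactly the open question in the comment.

```python
import math

# Hand-made toy vectors: (royalty, gender). Purely illustrative values.
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Answer 'a is to b as c is to ?' by finding the word closest to b - a + c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the query words themselves, as is usual for this kind of lookup.
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


if __name__ == "__main__":
    print(analogy("man", "woman", "king"))
```

With these toy values the offset from man to woman is the same as the offset from king to queen, so the lookup lands on "queen". Learned embeddings only approximate this kind of structure.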
theropost almost 4 years ago
Does anyone know of some good open source projects for OCR? Tesseract always seems to be the default, and then it seems Google Cloud and other services are miles ahead. However, for those who don't want to rely on the big tech companies, are there any comparable alternatives?
spijdar almost 4 years ago
As someone who's an idiot about machine learning, is it possible to run this code in reverse? e.g. take the generated (or novel) vectors and convert them back into audio/waveforms?
jonathanbgn almost 4 years ago
Great summary! I also recently wrote a post digging into the internals of wav2vec with illustrations:

The Illustrated Wav2vec - https://jonathanbgn.com/2021/06/29/illustrated-wav2vec.html