
Is Word Error Rate a Good Metric for Speech Recognition Models?

29 points · by dylanbfox · over 3 years ago

6 comments

lunixbochs · over 3 years ago
I ship speech recognition to users for full computer control (mixed commands and dictation) with a very tight feedback loop. I get a lot of direct feedback about any common issues.

One time I beta tested a new speech model I trained that scored very well on WER. Something like 1/2 to 1/3 as many errors as the previous model.

This new model frustrated so many users, because the _nature_ of errors was much worse than before, despite fewer overall errors. The worst characteristic of this new model was word deletions. They occurred far more often. This makes me think we should consider reporting insertion/replacement/deletion as separate % metrics (which I found some older whitepapers did!)

We have CER (Character Error Rate), which is more granular and helps give a sense of whether entire words are wrong (CER = WER) or mostly just single letters (CER much lower than WER).

-

I'd welcome some ideas for new metrics, even if they only make sense for evaluating my own models against each other.

GPT2 perplexity?

Phoneme-aware WER that penalizes errors more if they don't sound "alike" to the ground truth? (Because humans can in some cases read a transcription where every word is wrong, 100% WER, and still figure out by the sound of each incorrect word what the "right" words would have been.)

"Edge" error rate, that is, the likelihood that errors occur at the beginning/end of an utterance rather than the middle?

Some kind of word histogram, to demonstrate which specific words tend to result in errors / which words tend to be recognized well? One of the tasks I've found hardest is predicting single words in isolation. I'd love a good/standard (demographically distributed) dataset around this, e.g. 100,000 English words spoken in isolation by speakers with a good accent/dialect distribution. I built a small version of this myself and I've seen WER >50% on it for many publicly available models.

More focus on accent/dialect-aware evaluation datasets?

+ From one of my other comments here: some ways to detect error clustering? I think ideally you want errors to be randomly distributed rather than clustered on adjacent words or focused on specific parts of an utterance (e.g. tend to mess up the last word in the utterance).
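[Editor's note: a minimal sketch of the separate insertion/substitution/deletion reporting the comment asks for, using a standard word-level Levenshtein alignment. Illustrative only; this is not code from the thread.]

```python
# Minimal sketch: word-level edit-distance alignment that reports
# substitution / deletion / insertion rates separately instead of a
# single WER number.
def error_breakdown(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, substitutions, deletions, insertions)
    # for aligning ref[:i] against hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)              # delete all remaining reference words
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)              # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match: no edit
                continue
            sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
            best = min(sub[0], dele[0], ins[0])
            if best == sub[0]:
                dp[i][j] = (sub[0] + 1, sub[1] + 1, sub[2], sub[3])
            elif best == dele[0]:
                dp[i][j] = (dele[0] + 1, dele[1], dele[2] + 1, dele[3])
            else:
                dp[i][j] = (ins[0] + 1, ins[1], ins[2], ins[3] + 1)
    cost, subs, dels, inss = dp[len(ref)][len(hyp)]
    n = max(len(ref), 1)
    return {"wer": cost / n, "sub_rate": subs / n,
            "del_rate": dels / n, "ins_rate": inss / n}

# A deletion-heavy model like the one described shows up clearly:
print(error_breakdown("please open the second file", "please open file"))
# {'wer': 0.4, 'sub_rate': 0.0, 'del_rate': 0.4, 'ins_rate': 0.0}
```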
thebiss · over 3 years ago
I work in this domain, dealing exclusively with recognition for assistants, which is different from dictation. We measure three things, top down:

- Whole-phrase intent recognition rates. Run the transcribed phrase through a classifier to identify what the phrase is asking for, compare that to what was expected, and calculate an F1 score. Keep track of phrases that score poorly: they need to be improved.

- "Domain term" error rate. Identify a list of key words that are important to the domain and must be recognized well, such as location names, products to buy, drug names, terms of art. For every transcribed utterance, measure the F1 score for those terms, and track the alternatives in a confusion matrix. This results in a distilled list of words the system gets wrong and what is heard instead.

- Overall word error rate, to provide a general view of model performance.
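[Editor's note: a rough sketch of the "domain term" check described above, not the commenter's code. The `domain_terms` set and the positional "heard instead" heuristic are assumptions for illustration.]

```python
# Sketch: score how well domain terms survive transcription, and record
# what the recognizer produced instead of each missed term.
from collections import Counter

def domain_term_report(reference: str, hypothesis: str, domain_terms: set) -> dict:
    ref = [w.lower() for w in reference.split()]
    hyp = [w.lower() for w in hypothesis.split()]
    expected = [w for w in ref if w in domain_terms]
    produced = [w for w in hyp if w in domain_terms]

    ref_counts, hyp_counts = Counter(expected), Counter(produced)
    tp = sum(min(ref_counts[w], hyp_counts[w]) for w in ref_counts)
    precision = tp / max(len(produced), 1)
    recall = tp / max(len(expected), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)

    # Crude "heard instead" list: domain terms absent from the hypothesis,
    # paired with whatever hypothesis word sits at roughly the same position.
    confusions = []
    for i, w in enumerate(ref):
        if w in domain_terms and hyp_counts[w] == 0 and i < len(hyp):
            confusions.append((w, hyp[i]))
    return {"f1": f1, "confusions": confusions}
```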
blululu · over 3 years ago
Feels like a lot of the counter-examples listed involve contractions and conjugation errors. Saying 'like' and 'liked' are different words is a strong interpretation. Similarly, 'I am' and 'I'm' are really not distinct words, so counting that toward an error rate is a bit too literal. The objections could be solved by a decent parser. That said, weighting insertions and deletions equally is clearly a problem. Certain words ought to have more weight in a model. Weighting words by something like 1/log(frequency) might be a good start, since less common words tend to be more important for meaning.
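[Editor's note: a sketch of the 1/log(frequency) weighting suggested above, illustrative only. It scores missed reference words as a bag of words rather than doing a full alignment, and `word_freq` is assumed to hold corpus counts.]

```python
# Sketch: weight each reference word by 1/log(frequency) so rare,
# meaning-carrying words count more toward the error rate than filler words.
import math

def weighted_wer(reference: str, hypothesis: str, word_freq: dict) -> float:
    ref = reference.lower().split()
    hyp_set = set(hypothesis.lower().split())

    def weight(word: str) -> float:
        # Unseen words get the maximum weight; clamp to avoid log(1) = 0.
        freq = word_freq.get(word, 2)
        return 1.0 / math.log(max(freq, 2))

    total = sum(weight(w) for w in ref)
    # Bag-of-words approximation: a reference word counts as an error
    # if it never appears in the hypothesis (ignores ordering and insertions).
    missed = sum(weight(w) for w in ref if w not in hyp_set)
    return missed / max(total, 1e-9)
```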
CornCobs · over 3 years ago
I'm working in a similar domain, music transcription. The challenge is to estimate note values (how many beats is a note supposed to be, as represented in the score?), and I'm not sure what would be a good way to measure transcription accuracy. A naive note error rate cannot capture whether my model successfully detects certain musical structures: syncopation, dotted rhythms, etc.
tgv · over 3 years ago
If the target of ASR were document retrieval, it would make sense to apply the same (easy) transformations before calculating WER. Think: function word removal, unsplitting remaining contractions, and stemming. That would take out some of the problems while staying true to the target. Aren't there any old-school linguists working on this?
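[Editor's note: a sketch of that normalize-then-score pipeline, illustrative only. It assumes NLTK's PorterStemmer is available; the stop-word and contraction tables are tiny samples, and contractions are normalized by expanding them, one possible canonical direction.]

```python
# Sketch: normalize both reference and hypothesis the same way before
# computing WER, so retrieval-irrelevant differences stop counting as errors.
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "or"}   # tiny sample
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}     # tiny sample
stemmer = PorterStemmer()

def normalize(text: str) -> str:
    words = []
    for w in text.lower().split():
        w = CONTRACTIONS.get(w, w)            # normalize contractions
        for part in w.split():
            if part not in STOPWORDS:         # drop function words
                words.append(stemmer.stem(part))
    return " ".join(words)

# WER is then computed on normalize(reference) vs normalize(hypothesis).
```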
gok · over 3 years ago
Rare example where Betteridge's law of headlines is wrong.

One clever metric that Google mentioned in their early ASR papers: "WebScore". Basically, they consider a hypothesis transcription to have errors only if it produces a different top web search result. [1] WebScore and WER always seemed to track each other, though.

[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36758.pdf
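[Editor's note: a sketch of the WebScore idea as described above. `top_search_result` is a hypothetical stand-in for whatever search backend is available; it is not a real API.]

```python
# Sketch: a hypothesis only counts as wrong if it changes the top search result
# relative to the reference transcript.
def web_score(pairs, top_search_result) -> float:
    """pairs: list of (reference, hypothesis) transcript pairs."""
    same = sum(1 for ref, hyp in pairs
               if top_search_result(ref) == top_search_result(hyp))
    return same / max(len(pairs), 1)
```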