HellaSwag: 36% of this popular large language model benchmark contains errors

49 points by echen, over 2 years ago

6 comments

TillE, over 2 years ago
I'm not entirely clear on what ActivityNet is (one of the primary sources for HellaSwag), but it looks like amateurish descriptions of videos, like you would write for audio descriptions for the blind, except written very badly.

I'm guessing it's just Mechanical Turk content which wasn't even spellchecked.
wging, over 2 years ago
I'm not so sure the input "errors" called out in this post qualify as errors in the dataset. I wouldn't necessarily call an input prompt with errors a dataset problem. It's important to be robust to minor input errors, rather than requiring perfection on the part of the user.

I'm thinking here about "People is around the field watching the game", and other input errors, not necessarily output errors, but maybe if I thought about it a little more I'd be able to make similar arguments for accepting weirder outputs? Not as confident about that. For inputs, the hopeful effect of training/validating against such examples would be to make the model somewhat able to deal with imperfect inputs when the overall meaning is clear.
PraetorianGourd, over 2 years ago
We are certainly at the “throw money at the buzzwords” stage of ML, especially LLMs. And while this is certainly caused by the gold-rush mentality hype cycle, there is an issue of those in this field wildly over-promising what this tech can do.

The scary thing about this hype cycle is that AI and ML are both being deployed in life-and-death scenarios like automated driving and health-care settings. This isn’t the normal web hype of “Uber for X” that we are used to.
carbocation, over 2 years ago
This article is written such that you have to read the article twice to understand what it's conveying. It could benefit from a two-sentence introduction that addresses the context.
0xblood, over 2 years ago
> More and more researchers are starting to see the importance of good data.

Let me just leave this here, and I won't comment any further on this great progress within the research community.
sinuhe69, over 2 years ago
Perhaps the 36% errors help to beat humans' evaluation ;)