科技回声

6 comments

TillE, more than 2 years ago

I'm not entirely clear on what ActivityNet is (one of the primary sources for HellaSwag), but it looks like amateurish descriptions of videos, like you would write for audio descriptions for the blind, except written very badly. I'm guessing it's just Mechanical Turk content which wasn't even spellchecked.

wging, more than 2 years ago

I'm not so sure the input "errors" called out in this post qualify as errors in the dataset. I wouldn't necessarily call an input prompt with errors a dataset problem. It's important to be robust to minor input errors, rather than requiring perfection on the part of the user.

I'm thinking here about "People is around the field watching the game" and other input errors, not necessarily output errors, but maybe if I thought about it a little more I'd be able to make similar arguments for accepting weirder outputs? Not as confident about that. For inputs, the hoped-for effect of training and validating against such examples would be to make the model somewhat able to deal with imperfect inputs when the overall meaning is clear.

PraetorianGourd, more than 2 years ago

We are certainly at the “throw money at the buzzwords” stage of ML, especially LLMs. And while this is certainly caused by the gold-rush mentality of the hype cycle, there is an issue of those in this field wildly over-promising what this tech can do.

The scary thing about this hype cycle is that AI and ML are both being deployed in life-and-death scenarios like automated driving and health-care settings. This isn’t the normal web hype of “Uber for X” that we are used to.

carbocation, more than 2 years ago

This article is written such that you have to read it twice to understand what it's conveying. It could benefit from a two-sentence introduction that addresses the context.

0xblood, more than 2 years ago

> More and more researchers are starting to see the importance of good data.

Let me just leave this here and not comment any further on this great progress within the research community.

sinuhe69, more than 2 years ago

Perhaps the 36% of errors help it beat human evaluation ;)

HellaSwag: 36% of this popular large language model benchmark contains errors
