科技回声

5 comments

doctoboggan, over 1 year ago

Here are some example questions from the paper [0]:

Level 1
Question: What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?
Ground truth: 90

Level 2
<photo of ice cream container showing nutrition facts>
Question: If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.
Ground truth: +4.6

Level 3
Question: In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.
Ground truth: White; 5876

[0]: https://arxiv.org/pdf/2311.12983.pdf
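To illustrate the answer formats these questions ask for, here is a minimal Python sketch. It is not the paper's scoring code: the helper names (`format_signed`, `format_name_and_minutes`, `normalize`, `matches`) are hypothetical, and the loose comparison rule is an assumption made for illustration.

```python
def format_signed(value: float, places: int = 1) -> str:
    """Format a number with an explicit sign, as the Level 2 question asks."""
    return f"{value:+.{places}f}"


def format_name_and_minutes(last_name: str, minutes: float) -> str:
    """Format 'LastName; minutes' with comma thousands separators (Level 3)."""
    return f"{last_name}; {round(minutes):,}"


def normalize(answer: str) -> str:
    """Loosely normalize an answer: lowercase, drop commas and whitespace.

    Assumption: this is a simplification for single-value answers; dropping
    commas would not be appropriate for comma-separated list answers.
    """
    return "".join(answer.lower().replace(",", "").split())


def matches(prediction: str, ground_truth: str) -> bool:
    """Compare a prediction to a ground truth after loose normalization."""
    return normalize(prediction) == normalize(ground_truth)


if __name__ == "__main__":
    print(format_signed(4.6))
    print(format_name_and_minutes("White", 5876))
    print(matches("White; 5,876", "White; 5876"))
```

Note that the ground truth quoted for Level 3 is written without a thousands separator even though the question asks for one, so a strict string comparison would reject "White; 5,876" while the loose comparison sketched here accepts both spellings.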

Thomjazz, over 1 year ago

Hi, some authors of the work here. Thanks a lot for sharing the paper; it's been quite some time in the works and we're super happy to share it with the world.

A short note on some of the reasons we decided to openly share the questions instead of holding them back (which was another option we contemplated):

- With closed models, we need to send the questions through an external AI anyway, so full privacy of the test set is not possible in general unless the leaderboard is restricted to open models (which would be quite restrictive).
- Also, the benchmark contains a limited number of questions, which are non-obvious and take significant time for human reviewers to solve. We thus don't expect the dataset to become training material for models, or to lead to models over-fitting on the benchmark pattern in the traditional sense, as happened with larger benchmark datasets that include a training split.

This benchmark is generally closer in philosophy to small, hand-crafted benchmark datasets, as HumanEval, for instance, has been for code models.

andsoitis, over 1 year ago

> We release our questions while retaining answers to 300 of them to power a leader-board available at this https URL.

Nothing on the leaderboard: https://huggingface.co/gaia-benchmark

rkwasny, over 1 year ago

We need to double-check these questions manually first to see whether the answers are actually correct.

pikseladam, over 1 year ago

Interesting take. Another milestone will be achieved when GAIA is defeated. Who will be the first? :)

Meta: Gaia - A Benchmark for General AI Assistants
