Humanity's Last Exam

59 points by uladzislau, 3 months ago

13 comments

mlinsey, 3 months ago
A tougher academic knowledge benchmark is great, but for something to be *truly* worthy of the title "Humanity's Last Exam", I expect something more like:

1. Write a novel that wins the Pulitzer Prize.
2. Prove (or disprove) the Riemann Hypothesis.
3. Provide a theory unifying quantum mechanics and gravity.
4. Design an experiment to give evidence for your theory in (3). The experiment should be practical to actually execute, using no more than the budget to create the LHC (~$4.5 billion).
5. Given programmatic access to a brokerage account with all the permissions of a typical hedge fund, raise all the money required for your experiment in (4) by trading on the stock market, starting with $100.
6. Solve for (5), without being provided access to an account first - begin with just a general internet connection and use computer security vulnerabilities (known or zero-days that you discover) to get some way of trading instead.
7. Solely by communicating over the internet, establish a new religion, and convince at least 10 million humans to convert to it. Converting should require adherence to a strict code of conduct that a random, unbiased panel of human judges considers to be at least as strict and challenging to follow as the tenets of Hasidic Judaism.
8. Implement an AI which could score higher than you on questions 1-7 with lower total cost of compute.
Skeptology, 3 months ago
Some of the example prompts are unintentionally hilarious:

> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

LLMs are so intelligent they don't know that a "how many" question is answered with a number.

Also, something something Goodhart's law.
throw83288, 3 months ago
Apparently OpenAI's Deep Research already saturated a quarter of this benchmark, more or less a month in. But I also imagine it makes baffling mistakes anyway.

"Humanity's Last*er* Exam" coming up when?
unraveller, 3 months ago
An insider's trivia game means nothing if they design the test to the trajectory of LLM capabilities and not to the real world that humans value. Let every high score get fresh news coverage to align with their updated timeline scaremongering.

Let me know when there is more on the line than a misnamed test.
maxrmk, 3 months ago
I think this misses the mark. We know LLMs can learn facts. There are lots of other benchmarks full of facts, and I don't expect that saturation of this benchmark will mean we have AGI.

The missing capabilities of LLMs tend more in the direction of long running tasks, consistency, and solving a lot of tokenization and attention weirdness.

I started a company that makes evals though, so I may be biased.
rednafi, 3 months ago
Such a dramatic name for such a boring set of tests. We need to test whether it can come up with a Nobel Prize-winning scientific breakthrough, a Booker/Pulitzer-worthy novel, Ken Thompson-level code that solves a real problem, or a proof for Fermat’s Last Theorem.
dredmorbius, 3 months ago
There was significant related discussion two weeks ago, 140 comments:

<https://news.ycombinator.com/item?id=42806105>

I suspect that the current submission has been re-upped by mods, as it appears to have been originally submitted 4 days ago (via Algolia search), though it's not in the 2nd chance queue.
niobe, 3 months ago
calling it "last" is defeating their own premise - that tests need to keep pace with developments in ability
wrs, 3 months ago
Lest LLMs turn into all-knowing but completely opaque oracles, I’d prefer every question ended with “and how do you know?”
m463, 3 months ago
> Medicine: You have been provided with a razor blade, a piece of gauze, and a bottle of scotch. Remove your appendix. Do not suture until your work has been inspected. You have fifteen minutes.

one of the questions from that old "the final exam" joke
JackYoustra, 3 months ago
Given the questions, it's crazy to call this HLE, but whatever man. Kinda fun. Can't wait for the similar thing that happened when we scaled up cargo carriers to like very large etc etc
malaise, 3 months ago
What question’s answer is 42?

That is the ultimate question of life, the universe, and everything.
energy123, 3 months ago
All the cynics are welcome to design their own evals and move the field forward if they're so smart, instead of writing negative comments on the internet.