科技回声

13 comments

mlinsey 3 months ago

A tougher academic knowledge benchmark is great, but for something to truly be worthy of the title "Humanity's Last Exam", I expect something more like:

1. Write a novel that wins the Pulitzer Prize.
2. Prove (or disprove) the Riemann Hypothesis.
3. Provide a theory unifying quantum mechanics and gravity.
4. Design an experiment to give evidence for your theory in (3). The experiment should be practical to actually execute, using no more than the budget to create the LHC (~$4.5 billion).
5. Given programmatic access to a brokerage account with all the permissions of a typical hedge fund, raise all the money required for your experiment in (4) by trading on the stock market, starting with $100.
6. Solve for (5), without being provided access to an account first - begin with just a general internet connection and use computer security vulnerabilities (known or zero-days that you discover) to get some way of trading instead.
7. Solely by communicating over the internet, establish a new religion, and convince at least 10 million humans to convert to it. Converting should require adherence to a strict code of conduct that a random, unbiased panel of human judges considers to be at least as strict and challenging to follow as the tenets of Hasidic Judaism.
8. Implement an AI which could score higher than you on questions 1-7 with lower total cost of compute.

Skeptology 3 months ago

Some of the example prompts are unintentionally hilarious:

> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

LLMs are so intelligent they don't know that a "how many" question is answered with a number.

Also, something something Goodhart's law.

throw83288 3 months ago

Apparently OpenAI's Deep Research already saturated a quarter of this benchmark, more or less a month in. But I also imagine it makes baffling mistakes anyway.

"Humanity's Laster Exam" coming up when?

unraveller 3 months ago

An insider's trivia game means nothing if they design the test to the trajectory of LLM capabilities and not to the real world that humans value. Let every high score get fresh news coverage to align with their updated timeline scaremongering.

Let me know when there is more on the line than a misnamed test.

maxrmk 3 months ago

I think this misses the mark. We know LLMs can learn facts. There are lots of other benchmarks full of facts, and I don't expect that saturation of this benchmark will mean we have AGI.

The missing capabilities of LLMs tend more in the direction of long running tasks, consistency, and solving a lot of tokenization and attention weirdness.

I started a company that makes evals though, so I may be biased.

rednafi 3 months ago

Such a dramatic name for such a boring set of tests. We need to test whether it can come up with a Nobel Prize-winning scientific breakthrough, a Booker/Pulitzer-worthy novel, Ken Thompson-level code that solves a real problem, or a proof for Fermat’s Last Theorem.

dredmorbius 3 months ago

There was significant related discussion two weeks ago, 140 comments:

<https://news.ycombinator.com/item?id=42806105>

I suspect that the current submission has been re-upped by mods, as it appears to have been originally submitted 4 days ago (via Algolia search), though it's not in the 2nd chance queue.

niobe 3 months ago

calling it "last" is defeating their own premise - that tests need to keep pace with developments in ability

wrs 3 months ago

Lest LLMs turn into all-knowing but completely opaque oracles, I’d prefer every question ended with “and how do you know?”

m463 3 months ago

> Medicine: You have been provided with a razor blade, a piece of gauze, and a bottle of scotch. Remove your appendix. Do not suture until your work has been inspected. You have fifteen minutes.

one of the questions from that old "the final exam" joke

JackYoustra 3 months ago

Given the questions, it's crazy to call this HLE, but whatever man. Kinda fun. Can't wait for the similar thing that happened when we scaled up cargo carriers to like very large etc etc

malaise 3 months ago

What question’s answer is 42?

That is the ultimate question of life, the universe, and everything.

energy123 3 months ago

All the cynics are welcome to design their own evals and move the field forward if they're so smart, instead of writing negative comments on the internet.
