“… we have a verbal agreement that these materials will not be used in model training”<p>Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is a verbal agreement? And you didn’t disclose this?<p>At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.<p>Now we see that in reality I should’ve been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.<p>[0: <a href="https://news.ycombinator.com/threads?id=agnosticmantis#42476268">https://news.ycombinator.com/threads?id=agnosticmantis#42476...</a> ]
A co-founder of Epoch left a note in the comments:<p>> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.<p>Ouch. A verbal agreement. As the saying goes, those aren't worth the paper they're written on, and that's doubly true when you're dealing with someone with a reputation like Altman's.<p>And aside from the obvious flaw in it being a verbal agreement, there are many ways in which OpenAI could <i>technically</i> comply with this agreement while still gaining a massive unfair advantage on the benchmarks to the point of rendering them meaningless. For just one example, knowing the benchmark questions can help you select training data that is tailored to excelling at the benchmarks without technically including the actual question in the training data.
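To make that concrete, here is a minimal, purely hypothetical sketch (not a claim about what OpenAI actually did) of how benchmark-adjacent training data could be selected: embed the benchmark questions, embed a candidate training corpus, and keep the documents closest to the questions. The embed() function is a toy stand-in for any off-the-shelf text-embedding model.<p><pre><code>import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Toy placeholder: hash characters into a fixed-size vector so the
    # sketch runs end to end. A real pipeline would call an embedding model.
    vecs = np.zeros((len(texts), 256))
    for i, t in enumerate(texts):
        for j, ch in enumerate(t):
            vecs[i, (ord(ch) * (j + 1)) % 256] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

def select_adjacent_docs(benchmark_questions, corpus, top_k=1000):
    """Return the corpus documents most similar to any benchmark question."""
    q = embed(benchmark_questions)   # (num_questions, dim) unit-ish vectors
    d = embed(corpus)                # (num_docs, dim)
    sims = d @ q.T                   # cosine similarity
    closest = sims.max(axis=1)       # each doc vs. its nearest question
    order = np.argsort(-closest)[:top_k]
    return [corpus[i] for i in order]
</code></pre><p>None of the benchmark text ends up verbatim in the training set, but the training mixture is now skewed toward exactly the topics and techniques the benchmark probes.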
Why do people keep taking OpenAI's marketing spin at face value? This keeps happening, like when they neglected to mention that their most impressive Sora demo involved extensive manual editing/cleanup work because the studio couldn't get Sora to generate what they wanted.<p><a href="https://news.ycombinator.com/item?id=40359425">https://news.ycombinator.com/item?id=40359425</a>
> Tamay from Epoch AI here. We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.<p>Not sure "integrity of the benchmarks" should even be something you negotiate over. What's the value of a benchmark if the results can't be trusted because of undisclosed relationships and data sharing? Why would they be restricted from disclosing things you would normally disclose, and how did that not raise all sorts of warning flags when it was proposed?
A lot of the comments suggest some type of deliberate cheating on the benchmark. However, even without intentionally trying to game it, if anybody can repeatedly take the same test, they'll be nudged toward overfitting/p-hacking.<p>For instance, suppose they conduct an experiment and find that changing some hyper-parameter yields a 2% boost. That could just be noise, it could be a genuine small improvement, or it may be a mix of a genuine boost and some fortunate noise. An effect may be small enough that researchers have to rely on their gut to interpret it. Researchers may jump on noise while believing they have discovered true optimizations. Enough of these nudges, and some serious benchmark gains can materialize.<p>(Hopefully my comment isn't entirely misguided; I don't know how they actually do testing or how often they probe their test set.)
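A toy numpy simulation of that dynamic (all numbers made up, not a claim about any lab's actual process): every candidate tweak below is worthless by construction, yet keeping whichever one happens to score highest on the same fixed test set still inflates the reported number.<p><pre><code>import numpy as np

rng = np.random.default_rng(0)
n_questions = 300        # size of the fixed benchmark
true_accuracy = 0.25     # every model variant is equally good by construction
n_tweaks = 50            # hyper-parameter experiments evaluated on the test set

def measured_score():
    # Score observed on one run: true accuracy plus sampling noise
    # from grading a finite set of questions.
    return rng.binomial(n_questions, true_accuracy) / n_questions

best = measured_score()
for _ in range(n_tweaks):
    candidate = measured_score()
    if candidate > best:   # "the tweak helped" -- keep it
        best = candidate

print(f"true accuracy:     {true_accuracy:.3f}")
print(f"reported accuracy: {best:.3f}")  # drifts upward on pure noise
</code></pre><p>With a 300-question benchmark and 50 tries, the "best" run typically lands several points above the true accuracy purely from sampling noise.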
OpenAI played themselves here. Now nobody is going to take any of their results on this benchmark seriously, ever again. That o3 result has just disappeared in a poof of smoke. If they had blinded themselves properly then that wouldn't be the case.<p>Whereas other AI companies now have the opportunity to be first to get a significant result on FrontierMath.
Do people actually think OpenAI is gaming benchmarks?<p>I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work.<p>What works for enterprise is very different from “does it beat this benchmark”.<p>No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”.<p>In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works.<p>I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant.<p>Final thought - why are these contractors owed a right to know where funding came from? I would definitely be proud to know I contributed to the advancement of the field of AI if I was included in this group.
People on here were mocking me openly when I pointed out that you can't be sure LLMs (or any AIs) are actually smart unless you CAN PROVE that the question you're asking isn't in the training set (or adjacent like in this case).<p>So with this in mind now, let me repeat: Unless you know that the question AND/OR answer are not in the training set or adjacent, do not claim that the AI or similar black box is smart.
There's something gross about OpenAI constantly misleading the public.<p>This maneuver by their CEO will destroy FrontierMath and Epoch AI's reputation.
> Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.<p>Man, this is huge.
My takeaways<p>(1) Companies will probably increasingly invest in building their own evals for their use cases, because it's becoming clear that public (and allegedly private) benchmarks have misaligned incentives, with labs sponsoring or gaming them
(2) Those evals will probably be proprietary "IP" - guarded as closely as the code or research itself
(3) Conversely, public benchmarks are exhausted, and SOMEONE has to invest in funding new frontier benchmarks, so this is probably going to continue.
So in conclusion, any evaluation of OpenAI models on FrontierMath is thoroughly invalidated.<p>I would even go so far as to say this invalidates not only FrontierMath but also anything Epoch AI has touched or will touch.<p>An academic misjudgement like this, with a massive conflict of interest and cheating, makes you untrustworthy in an academic context.
This kind of thing is so avoidable by anyone who has not sold their soul. The answer is: if a company wants you to do a deal but requires as a condition that you not reveal to anyone that you are doing a deal with that company, you just say no. It's that simple.
My guess is that OpenAI didn't cheat as blatantly as just training on the test set. If they had, surely they could have gotten themselves an even higher mark than 25%. But I do buy the comment that they soft-cheated by using elements of the dataset for validation (which is absolutely still a form of data leakage). Even so, I suspect their reported number is roughly legit, because they report numbers on many benchmarks, and they have a good track record of those numbers holding up to private test sets.<p>What's much more concerning to me than the integrity of the benchmark number is the general pattern of behavior here from OpenAI and Epoch. We shouldn't accept a benchmark whose creation is secretly funded (secret even to the people doing the creating!). I also don't see how we can trust in the integrity of Epoch AI going forward. This is basically their only meaningful output, and this is how they handled it?
Elon definitely still has a grudge against Altman and OpenAI, so when Elon uses his new political power to bludgeon OpenAI to bankruptcy with new regulations and lawsuits, it won't be for the right reasons, but I'll still think Altman and the remaining employees deserve it.
Many of these evals are quite easy to game. Often the actual evaluation part of benchmarking is left up to a good-faith actor, which was usually reasonable in academic settings less polluted by capital. AI labs, however, are disincentivized from doing a thorough or impartial job, so IMO we should never take their word for it. To verify, we need to be able to run these evals ourselves; this is only sometimes possible, since even when the datasets are public, the exact mechanisms of evaluation often are not. In the long run, to be completely resilient to gaming via training, we probably need to follow the lead of other fields and have third-party, non-profit, accredited (!!) evaluators whose entire premise is to evaluate, red-team, and generally keep AI safe and competent.
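For what it's worth, the minimal shape of an independently runnable eval isn't complicated; the hard part is access and incentives, not code. A rough sketch (the file format, grading rule, and model call are all placeholders, not any particular benchmark's actual harness):<p><pre><code>import json

def query_model(prompt: str) -> str:
    # Placeholder for whatever API or local model is under test.
    raise NotImplementedError

def grade(predicted: str, reference: str) -> bool:
    # Exact-match grading; a real math benchmark needs a stronger checker
    # (symbolic equivalence, proof verification, etc.).
    return predicted.strip() == reference.strip()

def run_eval(dataset_path: str, transcript_path: str) -> float:
    correct = total = 0
    with open(dataset_path) as f, open(transcript_path, "w") as out:
        for line in f:                    # one JSON object per line
            item = json.loads(line)       # {"question": ..., "answer": ...}
            answer = query_model(item["question"])
            ok = grade(answer, item["answer"])
            correct += ok
            total += 1
            # Log the full transcript so third parties can audit every item.
            out.write(json.dumps({**item, "model_answer": answer, "correct": ok}) + "\n")
    return correct / total
</code></pre><p>Publishing the per-item transcript alongside the headline score lets anyone audit the grading, which is exactly what's missing when a lab reports a single number on a dataset it helped fund.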
I have been taking a course in AI policy, and o1's result on the FrontierMath dataset has been an important marker I use to emphasize the world we are moving toward. It is incredibly sad to learn about the conflict of interest here. For those more knowledgeable: can you explain in plain words whether this revelation compromises OpenAI's claims regarding o3's performance on FrontierMath problems?
It's increasingly odd to see HN activity that assumes the premise: if the latest benchmark results involve a benchmark containing any data OpenAI could have accessed, then the results must have been intentionally faked.<p>Last time around, this confused a bunch of people who didn't understand the difference between test and train data, and it resulted in a particular luminary complaining on Twitter, to much guffawing, about how troubling the situation was.<p>Literally every comment currently, modulo one [1], assumes this and then goes several steps further, and a majority are wildly misusing terms that have precise meanings, which explains at least part of the confusion.<p>[1] The exception is the comment saying this is irrelevant because we'll know whether the model is good once it's released; to be fair, that doesn't narrowly address the suspicion that the FrontierMath results are invalid because the model trained on (most of) the solutions.
Tim Gowers, one of the Fields medallists who contributed problems to the benchmark dataset, isn't happy about being misled about OpenAI's involvement. He retweeted this: <a href="https://x.com/Mihonarium/status/1880944026603376865?t=QN3i_XlSqlPPpi2vnnV3tw&s=19" rel="nofollow">https://x.com/Mihonarium/status/1880944026603376865?t=QN3i_X...</a>
OpenAI continues to muddy the benchmarks, while Claude continues to improve its intelligence. Claude will win long term. It'd be wise not to rely on OpenAI at all. They are the first movers who will just burn cash and crash out, I suspect.
The problem is that no benchmark of a closed model can stay private even in theory: the model has to be called to run the benchmark, which exposes the contents to whoever owns the model from then on.<p>HN loves to speculate that OpenAI is some big scam whose seeming ascendance is based on deceptive marketing hype, but o1, to anyone who has tried it seriously, is undoubtedly very much within the ballpark of what OpenAI claims it is able to do. If everything they are doing really is just overfitting and gaming the tests, that discrepancy will eventually catch up to them, and people will stop using the APIs and ChatGPT.
They should at least clarify it. The reason they don’t, I feel, is simply the hype and mystique.<p>There are ways you could game the benchmark without adding it to the training set. By repeatedly evaluating on the dataset, it regresses into a validation set rather than a test set, even in a black-box setting: you can simply evaluate 100 checkpoints and pick the one that performs best, rinse and repeat.<p>I still believe o3 is the real deal, BUT this gimmick kind of sours my appetite a bit toward those who run the company.
So basically when you need to look good in benchmarks you fund an organization that does benchmarks in which you look good.<p>Just like toothpaste manufacturers fund dentist's associations etc.
Unrelated to anything but what software is this blog running on? I love the sidenote feature.<p>Why does it have a customer service popover chat assistant?
Even if OpenAI does not use these materials to directly train its models, it can collect or construct more data based on the concepts and techniques these questions test, gaining an unfair competitive advantage.
It's like a teacher reading some of the Gaokao questions before the exam and then marking the relevant sections in your textbook for you. This is cheating.
I wonder if more companies should open source their eval model outputs alongside the eval results<p>We tried doing that here at Skyvern (eval.skyvern.com)
This isn't news; the other popular benchmarks are just as gamed and worthless, and it would be shocking if this one <i>wasn't</i>. The other frontier model providers game them just as hard; it's not an OpenAI thing. Any benchmark that a provider itself mentions is not worth the pixels it's written on.
“we now know how to build AGI” --Sam Altman.<p>Which should really be “we now know how to improve associative reasoning, but we still need to cheat when it comes to math, because the bottom line is that the models can only capture logic associatively, not synthesize deductively, which is what’s needed for math beyond recipe-based reasoning”