
Grok 3: Another win for the bitter lesson

132 points by kiyanwang, 3 months ago

24 comments

bambax, 3 months ago
This article is weak and just general speculation.

Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:

> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.

https://x.com/skdh/status/1892432032644354192

Which shows that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it improves scope, maybe, or flexibility, or coverage, or something, but not "intelligence".
bccdee, 3 months ago
The creation of a model which is "co-state-of-the-art" (assuming it wasn't trained on the benchmarks directly) is not a win for scaling laws. I could just as easily claim that xAI's failure to significantly outperform existing models despite "throwing more compute at Grok 3 than even OpenAI could" is further evidence that hyper-scaling is a dead end which will only yield incremental improvements.

Obviously more computing power makes the computer better. That is a completely banal observation. The rest of this 2000-word article is groping around for a way to take an insight based on the difference between '70s symbolic AI and the neural networks of the 2010s and apply it to the difference between GPT-4 and Grok 3, off the back of a single set of benchmarks. It's a bad article.
smy20011, 3 months ago
Did they? DeepSeek spent about 17 months achieving SOTA results with a significantly smaller budget. While xAI's model isn't a substantial leap beyond DeepSeek R1, it utilizes 100 times more compute.

If each had $3 billion, xAI would choose to invest $2.5 billion in GPUs and $0.5 billion in talent. DeepSeek would invest $1 billion in GPUs and $2 billion in talent.

I would argue that the latter approach (DeepSeek's) is more scalable. It's extremely difficult to increase compute by 100 times, but with sufficient investment in talent, achieving a 10x increase in compute is more feasible.
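A minimal back-of-envelope sketch of this trade-off, assuming a purely hypothetical model in which GPU spend buys raw compute linearly and talent spend multiplies how efficiently that compute is used (the function, its parameters, and the multiplier value are illustrative assumptions, not figures from either lab):

    # Toy model of the hypothetical $3B budget splits described above.
    # All numbers, including the talent multiplier, are illustrative assumptions.

    def effective_compute(gpu_billions: float, talent_billions: float,
                          efficiency_per_talent_billion: float = 3.0) -> float:
        """Raw compute scales with GPU spend; talent multiplies how well it is used."""
        raw_compute = gpu_billions  # arbitrary units, linear in GPU spend
        efficiency = 1.0 + efficiency_per_talent_billion * talent_billions
        return raw_compute * efficiency

    gpu_heavy = effective_compute(gpu_billions=2.5, talent_billions=0.5)     # GPU-heavy split
    talent_heavy = effective_compute(gpu_billions=1.0, talent_billions=2.0)  # talent-heavy split

    print(f"GPU-heavy split:    {gpu_heavy:.2f} effective-compute units")
    print(f"Talent-heavy split: {talent_heavy:.2f} effective-compute units")
    # Which split "wins" depends entirely on the assumed multiplier; the point is
    # only that efficiency gains multiply compute rather than add to it.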
rfoo, 3 months ago
I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model. Hope that xAI can make the Grok 3 API available next week so I can run it against some private evaluations to see if it's really this good.

Another nit-pick: I don't think DeepSeek had 50k Hopper GPUs. Maybe they have 50k now, after getting the world's attention and having a nationally sponsored grey market back them, but that 50k number is certainly dreamed up. During the past year DeepSeek's intern recruitment ads always just mentioned "unlimited access to 10k A100s", suggesting that they may have very limited H100/H800s, and most of their research ideas were validated on smaller models on an Ampere cluster. The 10k A100 number matches a cluster their parent hedge fund company announced a few years ago. All in all my estimation is they had more (maybe 20k) A100s, and single-digit thousands of H800s.
petesergeant, 3 months ago
The author has come up with their own, _unusual_ definition of the bitter lesson. In fact, as a direct quote, the original is:

> Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.

e.g.: "the study of linguistics doesn't help you build an LLM" or "you don't need to know about chicken physiology to make a vision system that tells you how old a chicken is".

The author then uses a narrow and _unusual_ definition of what computation _means_, by saying it simply means access to fast chips, rather than the work you can perform on them, which would obviously include how efficiently you use them.

In short, this article misuses two terms to more simply say "looks like the scaling laws still work".
viraptor, 3 months ago
This is a weird takeaway from the recent changes. Right now companies can scale because there's a stupid amount of stupid money flowing into the AI craze, but that's going to end. Companies are already discovering the issues with monetising those systems. Sure, they can "let go" and burn the available cash, but the investors will eventually come knocking. Since everyone figures out similar tech anyway, it's the people with the most tech-improvement experience who will be in the best position long term, while OpenAI will be stuck trying to squeeze adverts and monitoring into their chat for cash flow.
nickfromseattle, 3 months ago
Side question: let's say Grok is comparable in intelligence to other leading models. Will any serious business switch their default AI capabilities to Grok?
GaggiX, 3 months ago
The bitter lesson is about the fact that general methods that leverage computation are ultimately the most effective. Grok 3 is not more general than DeepSeek or OpenAI models, so mentioning the bitter lesson here doesn't make much sense; it's just the scaling law.
user14159265, 3 months ago
It will be interesting to see how talent acquisition evolves. Many great engineers were put off by strong DEI-focused PR, and even more oppose the sudden opportunistic shift to the right. Will Muslims continue to want to work for Google? Will Europeans work for X? Some may have previously avoided close relations with China for ethical reasons—will the same soon apply to the US?
aqueueaqueue, 3 months ago
How bitter is the bitter lesson when throwing more compute at the problem is costing billions? Maybe the bitter lesson is more about money now than the hardware. You are scaling up investments, not just relying on Moore's law. But I think there is a path for less power-hungry models that people can run affordably without VC money.
ArtTimeInvestor, 3 months ago
It looks like the USA is bringing in-house all the technology needed to build AI.

TSMC has a factory in the USA now, ASML too. OpenAI, Google, xAI and Nvidia are natively in the USA.

Meanwhile, no other country is even close to building AI on its own.

Is the USA going to "own" the world by becoming the keeper of AI? Or is there an alternative future that has a probability > 0?
Amekedl, 3 months ago
Another AI hype blog entry. Not even a mention of the differently colored bars on the benchmark results. For me, Grok 3 does not prove/disprove scaling laws in any meaningful capacity.
Rochus, 3 months ago
Interesting, but I think the article's argument for the "bitter lesson" relies on logical fallacies. First, it misrepresents critics of scaling as dismissing compute entirely, then frames scaling and optimization as mutually exclusive strategies (which creates a false dilemma), ignoring their synergy. E.g. DeepSeek's algorithmic innovations under export constraints augmented - and did not replace - the scaling efforts. The article also overgeneralizes from limited cases, asserting that compute will dominate the "post-training era" while overlooking potential disruptors like efficient architectures. The CEO's statements are barely suited to support its claims. A balanced view aligning with the "bitter lesson" should recognize that scaling general methods (e.g. learning algorithms) inherently requires both compute and innovation.
graycat, 3 months ago
> Grok 3 performs at a level comparable to, and in some cases even exceeding, models from more mature labs like OpenAI, Google DeepMind, and Anthropic. It tops all categories in the LMSys arena and the reasoning version shows strong results—o3-level—in math, ...

"Math"? Fields Medal level? Tenure? Ph.D.? ... high school plane geometry???

As in

'Grok 3 AI and Some Plane Geometry'

at

https://news.ycombinator.com/item?id=43113949

Grok 3 failed at a plane geometry exercise.
_giorgio_, 3 months ago
Grok is the best LLM on https://lmarena.ai/.

---

No benchmarks involved, just user preference.

Rank* (UB) | Rank (StyleCtrl) | Model | Arena Score | 95% CI | Votes | Organization | License
1 | 1 | chocolate (Early Grok-3) | 1402 | +7/-6 | 7829 | xAI | Proprietary
2 | 4 | Gemini-2.0-Flash-Thinking-Exp-01-21 | 1385 | +5/-5 | 13336 | Google | Proprietary
2 | 2 | Gemini-2.0-Pro-Exp-02-05 | 1379 | +5/-6 | 11197 | Google | Proprietary
PaulHoule, 3 months ago
Inference cost rules everything around me.
greyjoyduck, 3 months ago
Your reasoning is extremely vague and you praise the hell out of Musk and xAI for some weird reason...
s1mplicissimus, 3 months ago
Oh what a surprise, a new model performs better on bar charts than the old models. Yawn.
thatgerhard, 3 months ago
I've been using Grok 3 with deep think for 2 days now, and the things it has built have been waaaaay past any other LLM I've tried.
readthenotes1, 3 months ago
I had to ask Grok 3 what the bitter lesson was. It gave a plausible answer (compute scale beats human cleverness)
cowpig, 3 months ago
I haven't seen Grok 3 on any benchmark leaderboard other than LM Arena. Has anyone else?
dubeye, 3 months ago
I use ChatGPT for general brain dumping.

I've compared my last week's queries and prefer Grok 3.
sylware, 3 months ago
Is the next step ML-inference fusion? aka artificial small brain?
vasco, 3 months ago
That's not what "the exception that proves the rule" means.