
OK, I can partly explain the LLM chess weirdness now

524 points by dmazin, 6 months ago

54 comments

tromp, 6 months ago
> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires "understanding" chess.

Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions (in which neither side is checkmated yet). Such positions can be generated using the ChessPositionRanking project at [1]. Does it still rarely suggest illegal moves in these totally weird positions, which will be completely unlike any it has seen in training (and in which the choice of legal moves is often highly restricted)?

While good for testing the legality of next moves, these positions are not so useful for distinguishing move quality, since one side usually already has an overwhelming advantage.

[1] https://github.com/tromp/ChessPositionRanking
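A rough sketch of this harness using the python-chess library. The random-placement generator below is only a crude stand-in for ChessPositionRanking (which samples uniformly over legal positions), and `ask_model` is a hypothetical hook for whatever completion call is being tested:

```python
import random
import chess

def random_sparse_position(max_tries=1000):
    """Crude stand-in for ChessPositionRanking: scatter both kings plus
    a random handful of other pieces, keeping only valid positions."""
    pieces = [chess.Piece(chess.KING, chess.WHITE),
              chess.Piece(chess.KING, chess.BLACK)]
    pieces += [chess.Piece.from_symbol(s)
               for s in "QRBNPqrbnp" if random.random() < 0.3]
    for _ in range(max_tries):
        board = chess.Board(None)  # start from an empty board
        for piece, square in zip(pieces, random.sample(chess.SQUARES, len(pieces))):
            board.set_piece_at(square, piece)
        board.turn = random.choice([chess.WHITE, chess.BLACK])
        if board.is_valid() and not board.is_game_over():
            return board
    raise RuntimeError("no valid position found")

def illegal_move_rate(ask_model, n=1000):
    """ask_model(fen) -> SAN move string; a placeholder for the LLM call."""
    illegal = 0
    for _ in range(n):
        board = random_sparse_position()
        try:
            board.parse_san(ask_model(board.fen()))
        except ValueError:  # unparseable, ambiguous, or illegal here
            illegal += 1
    return illegal / n
```
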
sourcepluck, 6 months ago
> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game.

It's claimed (here in the comments) that this model "understands" chess, can "reason", and does "actual logic".

I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move. Anyone familiar with chess can confirm that it just doesn't happen.

Is there a link to the games in which the illegal moves were made?
wavemode, 6 months ago
I have the exact same problem with this article that I had with the previous one: the author fails to provide any data on the frequency of illegal moves.

That makes it impossible to draw meaningful conclusions. It would be as if I claimed an LLM is an expert doctor, but filtered out of my data every instance where it gave incorrect medical advice.
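The bookkeeping being asked for here is cheap to add. A minimal sketch (python-chess; both callbacks are hypothetical stand-ins for the LLM and the opponent engine) that records illegal suggestions instead of silently filtering them:

```python
import chess

def play_one_game(get_llm_move, get_opponent_move, max_plies=200):
    """Play a single game, counting illegal LLM suggestions.
    get_llm_move(board) -> SAN string, get_opponent_move(board) -> chess.Move."""
    board = chess.Board()
    stats = {"llm_moves": 0, "illegal": 0}
    while not board.is_game_over() and board.ply() < max_plies:
        if board.turn == chess.WHITE:  # the LLM plays White here
            san = get_llm_move(board)
            stats["llm_moves"] += 1
            try:
                board.push_san(san)
            except ValueError:
                stats["illegal"] += 1  # record it, then re-prompt or resign
                break
        else:
            board.push(get_opponent_move(board))
    return stats
```
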
xg15, 6 months ago
> In many ways, this feels less like engineering and more like a search for spells.

This is still my impression of LLMs in general. It's amazing that they work, but for the next tech disruption, I'd appreciate something that doesn't make you feel like you're in a bad sci-fi movie all the time.
codeflo, 6 months ago
> everyone is wrong!

Well, not everyone. I wasn't the only one to mention this, so I'm surprised it didn't show up in the list of theories, but here's e.g. me, seven days ago (source: https://news.ycombinator.com/item?id=42145710):

> At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training.

This is not the same thing as cheating/replacing the LLM output, the theory that's mentioned and debunked in the article. And now the follow-up adds weight to this guess:

> Here's my best guess for what is happening: ... OpenAI trains its base models on datasets with more/better chess games than those used by open models. ... Meanwhile, in section A.2 of this paper (h/t Gwern) some OpenAI authors mention that GPT-4 was trained on chess games in PGN notation, filtered to only include players with an Elo of at least 1800.

To me, it makes complete sense that OpenAI would "spike" their training data with data for tasks that people might actually try. There's nothing unethical about this. No dataset is ever truly "neutral"; you make choices either way, so why not go out of your way to train the model on potentially useful answers?
marcus_holmes, 6 months ago
I notice there's no prompt saying "you should try to win the game", yet the results are measured by how much the LLM wins.

Is this implicit in the "you are a grandmaster chess player" prompt?

Is there some part of the LLM's training that says "if this is a game, then I will always try to win"?

Could the author improve the LLM's odds of winning just by telling it to try to win?
viraptor, 6 months ago
I'm glad he improved the prompting, but he's still leaving out two likely huge improvements.

1. Explain the current board position and the plan going forwards, before proposing a move. This lets the model actually think more, kind of like o1, but here it would guarantee more focused processing.

2. Actually draw the ASCII board at each step (sketched below). This should hopefully produce more valid moves, since board + move is easier to process reliably than 20 × move.
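Both suggestions are easy to wire up with python-chess, which can replay the move list and render the board as ASCII; a sketch (the prompt wording is invented):

```python
import chess

def board_prompt(moves_san):
    """Replay the game so far, then show the current board as ASCII,
    so the model doesn't have to reconstruct state from the move list."""
    board = chess.Board()
    for san in moves_san:
        board.push_san(san)
    return (
        "Moves so far: " + " ".join(moves_san) + "\n"
        "Current position (uppercase = White, dots = empty):\n"
        + str(board) + "\n"
        + ("White" if board.turn == chess.WHITE else "Black")
        + " to move. Explain the position and your plan, then give one move in SAN."
    )

print(board_prompt(["e4", "d5", "exd5", "Qxd5", "Nc3"]))
```
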
Jean-Papoulos, 6 months ago
> According to that figure, fine-tuning helps. And examples help. But it's examples that make fine-tuning redundant, not the other way around.

This is extremely interesting. In this specific case at least, simply giving examples is equivalent to fine-tuning. That's a great discovery for me; I'll try using examples more often.
PaulHoule, 6 months ago
People have to quit this kind of stumbling in the dark with commercial LLMs.

To get to the bottom of this, it would be interesting to train LLMs on nothing but chess games (you can synthesize them endlessly by having Stockfish play against itself), with maybe a side helping of chess commentary and examples of chess dialogs ("how many pawns are on the board?", "where are my rooks?", "draw the board"), competence at which would demonstrate that the model has a representation of the board.

I don't believe in "emergent phenomena", or that general linguistic competence (or the ability to feign competence) is necessary for chess playing; being smart at chess doesn't mean you are smart at other things, and vice versa. With experiments like this you might prove me wrong, though.

This paper came out about a week ago and seems to get good results with a fine-tuned Llama:

https://arxiv.org/pdf/2411.06655

I also like this one, as it is about competence in chess commentary:

https://arxiv.org/abs/2410.20811
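The endless-synthesis part is straightforward with python-chess's UCI bindings. A sketch assuming a `stockfish` binary on $PATH (for a varied corpus you would also want to randomize openings or move-time):

```python
import chess
import chess.engine
import chess.pgn

def selfplay_pgn(engine, seconds_per_move=0.05):
    """One Stockfish-vs-Stockfish game, returned as PGN text."""
    board = chess.Board()
    while not board.is_game_over(claim_draw=True):
        result = engine.play(board, chess.engine.Limit(time=seconds_per_move))
        board.push(result.move)
    return str(chess.pgn.Game.from_board(board))

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # assumes stockfish is installed
try:
    for _ in range(3):  # crank this up to synthesize a training corpus
        print(selfplay_pgn(engine))
finally:
    engine.quit()
```
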
jey, 6 months ago
It could be interesting to create a tokenizer that's optimized for representing chess moves and then train an LLM (from scratch?) on Stockfish games. Using a custom tokenizer should improve quality for a given model size, since the LLM doesn't have to waste a lot of layers on encoding and decoding, and the "natural" latent representation is more straightforward.
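A move-level vocabulary of this kind is tiny and trivial to build; a sketch (the square-pair enumeration is illustrative, not a tuned design — most pairs are never a legal move for any piece):

```python
import chess

# One token per UCI move (from-square + to-square + optional promotion piece).
vocab = {}
for frm in chess.SQUARES:
    for to in chess.SQUARES:
        if frm == to:
            continue
        uci = chess.square_name(frm) + chess.square_name(to)
        vocab[uci] = len(vocab)
        for promo in "qrbn":  # promotions get their own tokens
            vocab[uci + promo] = len(vocab)

def encode(moves_uci):
    return [vocab[m] for m in moves_uci]

print(len(vocab))                        # ~20k tokens: tiny as LLM vocabs go
print(encode(["e2e4", "d7d5", "e4d5"]))  # a whole game is one id per ply
```
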
code51, 6 months ago
Initially, LLM researchers were saying that training on code samples made "reasoning" better. Now, if the "language to world model" thesis holds, shouldn't chess be the smallest test case for it?

I can't understand why no research group is going hard at this.
amrrs, 6 months ago
> Theory 1: Large enough base models are good at chess, but this doesn't persist through instruction tuning to chat models.

I lean mostly towards this, and also towards the chess notation itself: I'm not sure whether it gets chopped up during tokenization unless it's very precisely processed.

It's like designing an LLM just for predicting protein sequences, because the sequencing matters. The base data might have it, but I don't think the intention is for that to carry through.
derefr, 6 months ago
> Many, many people suggested that there must be some special case in gpt-3.5-turbo-instruct that recognizes chess notation and calls out to an external chess engine.

Not that I think there's anything inherently unreasonable about an LLM understanding chess, but I think the author missed a variant hypothesis here:

What if that specific model, when it recognizes chess notation, is trained to silently "tag out" for another, more specialized LLM that is specifically trained on a majority-chess dataset? (Or, perhaps even more likely, the model is trained to recognize the need to activate a chess-playing LoRA adapter?)

It would still be an LLM, so things like "changing how you prompt it changes how it plays" would still make sense. Yet it would be one that has spent a lot more time modelling chess than other things, and never ran into anything that distracted it enough to catastrophically forget how chess works (i.e. to reallocate some of the latent-space vocabulary on certain layers from modelling chess to things that matter more to the training objective).

And I could certainly see "playing chess" as a good proving ground for testing the ability of OpenAI's backend to recognize the need to "loop in" a LoRA during the inference of a response. It's something LLM base models suck at, but it's also something you intuitively could train an LLM to do (at least to a proficient-ish level, as seen here) if you had a model focus on just learning that.

Thus, "the ability of our [framework-mediated] model to play chess" is easy to keep an eye on, long-term, as a proxy metric for "how well our LoRA-activation system is working", without needing to worry that your next generation of base models might suddenly invalidate the metric by getting good at playing chess without any "help". (At least not any time soon.)
tmalsburg2, 6 months ago
Why not use temperature 0 for sampling? If the top-ranked move is not legal, it can't play chess.
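What that test might look like against the completions endpoint; a sketch assuming the official `openai` client, that gpt-3.5-turbo-instruct is still served, and a PGN-style prompt approximating the article's (the exact format there may differ):

```python
import chess
from openai import OpenAI

client = OpenAI()

def pgn_prompt(moves_san):
    """Format the game as bare PGN movetext, cueing the next move."""
    parts = []
    for i, san in enumerate(moves_san):
        if i % 2 == 0:
            parts.append(f"{i // 2 + 1}.")
        parts.append(san)
    if len(moves_san) % 2 == 0:  # White to move: cue the next move number
        parts.append(f"{len(moves_san) // 2 + 1}.")
    return " ".join(parts) + " "

def greedy_move(moves_san):
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=pgn_prompt(moves_san),
        temperature=0,   # greedy: always the top-ranked continuation
        max_tokens=6,
    )
    return resp.choices[0].text.split()[0]

moves = ["e4", "d5", "exd5", "Qxd5", "Nc3"]
board = chess.Board()
for m in moves:
    board.push_san(m)
suggestion = greedy_move(moves)
try:
    board.parse_san(suggestion)
    print(suggestion, "is legal")
except ValueError:
    print(suggestion, "is illegal here")
```
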
kibwen, 6 months ago
> I was astonished that half the internet is convinced that OpenAI is cheating.

If you have a problem and all of your potential solutions are unlikely, then it's fine to assume the least unlikely solution while acknowledging that it's statistically probable you're still wrong. In other words, if you have ten potential solutions to a problem and you estimate that the most likely one has an 11% chance of being true, it's fine to assume that solution despite the fact that, by your own estimate, you have an 89% chance of being wrong.

The "OpenAI is secretly calling out to a chess engine" hypothesis always seemed unlikely to me (you'd think it would play much better, if so), but it was the simplest explanation (Occam's razor), and I wouldn't have been surprised to learn it was true (it's not as if OpenAI has a reputation for being trustworthy).
ChrisArchitect, 6 months ago
Related, from last week: "Something weird is happening with LLMs and Chess"

https://news.ycombinator.com/item?id=42138276
MisterTea, 6 months ago
This happened to a friend who was trying to simulate basketball games. The model kept forgetting who had the ball, or outright made illegal or confusing moves. After a few days of wrestling with the AI, he gave up. GPT is amazing at following a linear conversation but has no cognitive ability to keep track of a dynamic scenario.
throw310822, 6 months ago
> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires "understanding" chess. If this doesn't convince you, I encourage you to write a program that can take strings like 1. e4 d5 2. exd5 Qxd5 3. Nc3 and then say whether the last move was legal.

This alone should put to rest all the arguments that LLMs lack a world model, that they're just manipulating words, that they're just spitting out probabilistic answers, and so on.
sourcepluck, 6 months ago
> Since gpt-3.5-turbo-instruct has been measured at around 1800 Elo

Where's the source for this? What's the reasoning? I don't see it. I have just relooked, and still can't see it.

Is it 1800 Lichess "Elo" or 1800 FIDE that's being claimed? And 1800 at what time control? Different time controls have different ratings, as one would imagine/hope the author knows.

I'm guessing it's not 1800 FIDE, as the quality of the games seems far too bad for that. So any clarity here would be appreciated.
fijiaarone, 6 months ago
It's very clear that GPT-3.5-turbo-whatever is cheating and that LLMs cannot, in fact, play chess. It was trained on sequences of chess moves and has explicit coding to recognize chess moves and respond accordingly. If you only define "cheating" as calling a chess engine like Stockfish, then your definition of cheating is too narrow.

It's exactly like the strawberry problem. No LLM can count the letters in a word. But when shown that, they were explicitly taught to recognize the prompt and count the letters in the word. They didn't create a letter-counting algorithm, but they did build a table of words and letter counts. And every "new" LLM explicitly looks for a phrase like "how many Rs are in strawberry" and then looks in the "letters in words" neural network instead of the "what is the next likely word in this sentence" net.

All "new" LLMs (in the next few weeks) will suddenly become decent at chess, because they will have a weighted preference for looking at the "chess moves" neural net instead of the "random numbers and letters sequence" neural net when they detect a sentence that looks like "d4, d5; nd3, ?" etc.
deadbabe, 6 months ago
If you randomly position pieces on the board and then ask the LLM to play chess, where each piece still moves according to its normal rules, does it still know how to play?
furyofantares, 6 months ago
LLMs are fundamentally text completion. The chat-based tuning that goes on top of that is impressive, but they are fundamentally text completion; that's where most of the training energy goes. I keep this in mind with a lot of my prompting and get good results.

Regurgitation and examples are both ways to lean into that, and to try to recover whatever has been lost to chat-based tuning.
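What "regurgitation" amounts to in practice, as a hedged sketch: the wording, the extraction step, and the model name are all invented here, and the article's actual trick may differ in detail:

```python
from openai import OpenAI

client = OpenAI()

def regurgitated_move(pgn_so_far, model="gpt-4o-mini"):
    """Ask a chat model to repeat the game verbatim and keep going,
    approximating completion mode at the cost of extra output tokens."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Repeat the PGN below exactly as given, then continue "
                       "it with the single best next move:\n\n" + pgn_so_far,
        }],
    )
    text = resp.choices[0].message.content
    # Crude extraction: drop the echoed prefix, keep the first new token.
    return text.split(pgn_so_far)[-1].split()[0]

print(regurgitated_move("1. e4 d5 2. exd5 Qxd5 3. Nc3"))
```
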
GaggiX, 6 months ago
You should not fine-tune the models on Stockfish's strongest setting, as its moves will not be understandable unless you really dig deep into the position, and the model would not be able to find a pattern to make sense of them. Instead, I suggest training on human games of a certain Elo (below grandmaster).
subarctic, 6 months ago
The author either didn't read the Hacker News comments last time, or he missed the top theory, which said they probably used chess as a benchmark when they developed the model that is good at chess, for whatever business reasons they had at the time.
blixt, 6 months ago
Really interesting findings around fine-tuning. It goes to show that it doesn't really affect the deeper "functionality" of the LLM (if you think of the LLM as running a set of small functions on very high-dimensional numbers to produce a token).

Using regurgitation to get around the assistant/user token separation is another fun tool for the toolbox, relevant whenever you want a model that doesn't support continuation to actually perform continuation (at the cost of a lot of latency).

I wonder if any kind of reflection or chain of thought would help it play better. I wouldn't be surprised if getting the LLM to write an analysis of the game in English is more likely to move it out of distribution than to make it pick better chess moves.
Animats, 6 months ago
The main insight is that the LLM has to be trained on good chess games. If the training set is of random games, that's not helpful.

I'm still amazed that this works at all, since the model lacks actual board state.
kqr, 6 months ago
I get that it would make evals even more expensive, but I would also try chain of thought! Have it explain its goals and reasoning for the next move before making it. It might be an awful idea for something like chess, but it seems to help elsewhere.
joshka, 6 months ago
Why tell it not to explain? Allowing the LLM space to "think" may be helpful, and would definitely be worth exploring.

And why manually guess at ways to improve this? Why not let the LLMs do it themselves and iteratively find better prompts?
tech_ken, 6 months ago
> It's ridiculously hard to find the optimal combination of prompts and examples and fine-tuning, etc. It's a very large space, there are no easy abstractions to allow you to search through the space, LLMs are unpredictable and fragile, and these experiments are slow and expensive.

Regardless of the actual experiment outcome, I think this is a super valuable insight. The "Should we provide legal moves?" section is an excellent case study of this: an extremely prudent idea actually degrades model performance, and quite badly. It's like that crocodile game where you keep pushing teeth until it clamps onto your hand.
phkahler, 6 months ago
You can easily construct a game board from a sequence of moves by maintaining the game state somewhere. But you can also know where a piece is based only on its last move. I'm curious what happens if you don't feed it a position, but instead feed it a sequence of moves that includes illegal ones yet ends up at a valid position. The author mentions that LLMs will play differently when the same position is arrived at via different sequences. I'm suggesting to really play with that by putting illegal moves in the sequence.

I doubt it's doing much more than a static analysis of the board position, or even moving based mostly on just a few recent moves by key pieces.
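The transposition half of this experiment is easy to set up: two different histories, one position. A sketch (extending it with an illegal move in one history is then a one-line edit to one of the lists):

```python
import chess

def final_position(moves_san):
    board = chess.Board()
    for san in moves_san:
        board.push_san(san)
    return board

# Two move orders that transpose into the identical position:
a = final_position(["e4", "e5", "Nf3", "Nc6"])
b = final_position(["Nf3", "Nc6", "e4", "e5"])

# Same placement, side to move, castling and en-passant rights:
assert a.fen().split()[:4] == b.fen().split()[:4]

# Prompt the model with each history and compare its replies; per the
# article, answers often differ even though the position is the same.
```
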
elif, 6 months ago
The problem I see with this prompting is that he goes to great lengths instructing the LLM that it is a grandmaster and makes great moves, etc., but nowhere in the prompt does he instruct the LLM to attempt to win the game of chess.

It may just be coming up with "good moves" that a "grandmaster" would make; after all, grandmasters still lose 49% of their games, all the while making "good moves".

I would suppose that the LLM is actually wholly uninterested in victory, given this prompting.
torginus, 6 months ago
Sorry, I have a somewhat tangential question: is it possible to train models as instruct models straight away? Previously, LLMs were trained on raw text data, but now we can generate instruct data directly, either from "teacher" LLMs or by asking existing LLMs to convert raw data into instruct format.

Alternatively, if chat tuning diminishes some of a model's capability, would it make sense to have a smaller chat model prompt a large base model and convert back the outputs?
gallerdude, 6 months ago
Very interesting. Have you tried using `o1` yet? I made a program that makes LLMs complete Wordle puzzles, and the difference between `4o` and `o1` is absolutely astonishing.
bee_rider, 6 months ago
Extremely tangential, but how do chess engines do when playing from illegal board states? Could the LLM have a chance of competing with a real chess engine from there?

Understanding is a funny concept to try to apply to computer programs anyway. But playing from an illegal state seems (to me at least) to indicate something interesting about the ability to comprehend the general idea of chess.
bob1029, 6 months ago
I find it amusing that we would frame an ensemble of models as "cheating". Routing to a collection of specialized models via classification layers seems like the most obvious path for adding practical value to these solutions.

Why conflate the parameters of chess with checkers and Go if you already have high-quality models for each? I thought tool use and RAG were fair game.
atemerev, 6 months ago
Ah, half of the commentariat still thinks that "LLMs can't reason", even when the models have enough state space for reasoning and clearly demonstrate it.
drivingmenuts, 6 months ago
Why would a chess-playing AI be tuned to do anything except play chess? That just seems like a waste. A bunch of small, specialized AIs seems like a better idea than spending time trying to build a new general one.

Maybe less morally challenging as well: you wouldn't be trying to install "sentience".
boesboes, 6 months ago
It would be interesting to see if it can also play chess with altered rules, or a genuinely novel 'game' that relies on logic and reasoning. I'm still not sure that would 'prove' LLMs do reasoning, but I'd be pretty close to convinced.
copperroof, 6 months ago
I just want a Hacker News no-LLM filter. The site has been almost unusable for a year now.
sourcepluck, 6 months ago
I don't like being directly critical; people learning in public can be good and instructive. But I regret the time I've put into both this article and the last one, and perhaps someone else can be saved the same time.

This is someone with limited knowledge of chess, statistics, and LLMs writing a series of public articles as they learn a tiny bit about chess, statistics, and LLMs. And it garners upvotes and attention off the coat-tails of AI excitement. Which is fair enough, it's the (semi-)public internet, but it sort of masquerades as half-serious "research". It kind of held together for the first article, but this one really is thrown together to keep the buzz of the last one going.

The TL;DR: one of the AIs being just above terrible, compared to all the others being completely terrible (a fact already of dubious interest), is down to... we don't know. Maybe a difference in training sets. Tons of speculation. A few graphs.
__MatrixMan__, 6 months ago
It would be fun to play against an LLM without having to think about the prompting, if only as a novel way to get a "feel" for how they "think".
amelius, 6 months ago
I wonder what would happen if they changed the prompt to ask the LLM to explain its strategy first. Or to explain its opponent's strategy.
cma, 6 months ago
One thing missing from the graphs is whether 3.5-turbo-instruct also gets better with these techniques. Is fine-tuning available for it?
bambax, 6 months ago
Very good follow-up to the original article. Thank you!
XenophileJKO, 6 months ago
So this article is what happens when people who don't really understand the models "test" things.

There are several fatal flaws.

First, he isn't clearly and concisely displaying the current board state; he is expecting the model to attend over a move sequence to figure out the board state.

Second, he isn't allowing the model to think elastically using chain of thought or other strategies.

Honestly, I am shocked it works at all. He has basically formulated the problem in the worst possible way.
Palmik, 6 months ago
It might be worth trying the experiment where the prompt is formatted such that each chess turn corresponds to one chat message.
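What that format might look like, as a sketch (the role assignment and system wording are assumptions, not taken from the article):

```python
def moves_to_chat(moves_san,
                  system="You are a grandmaster. Reply with a single SAN move."):
    """Model plays White: its own past moves arrive as 'assistant' turns,
    the opponent's as 'user' turns."""
    messages = [{"role": "system", "content": system}]
    for i, san in enumerate(moves_san):
        role = "assistant" if i % 2 == 0 else "user"
        messages.append({"role": role, "content": san})
    return messages

print(moves_to_chat(["e4", "d5", "exd5", "Qxd5"]))
# -> system, then e4 (assistant), d5 (user), exd5 (assistant), Qxd5 (user)
```
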
leumassuehtam, 6 months ago
I'm convinced that "completion" models are much more useful (and smarter) than "chat" models, being able to provide more nuanced and original outputs. When GPT-4 came out, text-davinci-003 would still provide better completions with the correct prompt. Of course, that model was later replaced by gpt-3.5-turbo-instruct, which is explored in this post.

I believe the reason such models were later deprecated was "alignment".
qnleigh, 6 months ago
Two other theories that could explain why OpenAI's models do so well:

1. They generate chess games from chess-engine self-play and add them to the training data (similar to the already-stated theory about their training data).

2. They added chess reinforcement learning to the training at some stage and actually got it to work (though not very well).
koolala, 6 months ago
Next, test an image + text model! Chess is way easier when you can see the board.
keskival, 6 months ago
> I'm not sure, because OpenAI doesn't deign to share gpt-4-base, nor to allow queries of gpt-4o in completion mode.

I would guess GPT-4o isn't first pre-trained and then instruct-tuned, but trained directly on refined instruction-following material.

That material probably contains far fewer chess games.
timzaman, 6 months ago
"All LLMs"? The OP only tested OpenAI LLMs. Try Gemini.
byyoung3, 6 months ago
Sometimes new training techniques lead to regressions on certain tasks. My guess is that's exactly what happened here.
seizethecheese, 6 months ago
All the hand-wringing about OpenAI cheating suggests a question: why so much mistrust?

My guess would be that the persona of the OpenAI team on platforms like Twitter is very cliquey. This, I think, naturally leads to mistrust. A clique feels more likely to cheat than some other sort of group.
herbst, 6 months ago
Is it just me, or is answer quality rapidly decreasing for the public models anyway (o-mini and whatever it degrades to)?

I have been shopping for things for a bigger DIY project for a few months now, and recently it started hallucinating products with the specifications I need.

In fact, it returns mainly broken code, hallucinates functions that never existed (zero Google results), and so on.

I'm not sure if I am using it more, or it just got much more useless.