TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Tracing the thoughts of a large language model

1072 点作者 Philpax大约 2 个月前

48 条评论

marcelsalathe大约 2 个月前
I’ve only skimmed the paper - a long and dense read - but it’s already clear it’ll become a classic. What’s fascinating is that engineering is transforming into a science, trying to understand precisely how its own creations work<p>This shift is more profound than many realize. Engineering traditionally applied our understanding of the physical world, mathematics, and logic to build predictable things. But now, especially in fields like AI, we’ve built systems so complex we no longer fully understand them. We must now use scientific methods - originally designed to understand nature - to comprehend our own engineered creations. Mindblowing.
评论 #43496419 未加载
评论 #43499768 未加载
评论 #43498900 未加载
评论 #43496730 未加载
评论 #43496636 未加载
评论 #43500283 未加载
评论 #43497932 未加载
评论 #43504305 未加载
评论 #43499247 未加载
评论 #43499382 未加载
评论 #43497804 未加载
评论 #43504766 未加载
评论 #43499518 未加载
评论 #43497795 未加载
评论 #43515101 未加载
评论 #43501415 未加载
评论 #43501964 未加载
评论 #43498918 未加载
评论 #43500650 未加载
评论 #43498430 未加载
评论 #43500875 未加载
评论 #43502425 未加载
评论 #43497834 未加载
评论 #43501792 未加载
评论 #43509031 未加载
评论 #43499580 未加载
cadamsdotcom大约 2 个月前
So many highlights from reading this. One that stood out for me is their discovery that refusal works by inhibition:<p>&gt; It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is &quot;on&quot; by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing &quot;known entities&quot; activates and inhibits this default circuit<p>Many cellular processes work similarly ie. there will be a process that runs as fast as it can and one or more companion “inhibitors” doing a kind of “rate limiting”.<p>Given both phenomena are emergent it makes you wonder if do-but-inhibit is a favored technique of the universe we live in, or just coincidence :)
评论 #43499750 未加载
评论 #43508808 未加载
评论 #43505265 未加载
polygot大约 2 个月前
There needs to be some more research on what path the model takes to reach its goal, perhaps there is a lot of overlap between this and the article. The most efficient way isn&#x27;t always the best way.<p>For example, I asked Claude-3.7 to make my tests pass in my C# codebase. It did, however, it wrote code to detect if a test runner was running, then return true. The tests now passed, so, it achieved the goal, and the code diff was very small (10-20 lines.) The actual solution was to modify about 200-300 lines of code to add a feature (the tests were running a feature that did not yet exist.)
评论 #43496818 未加载
评论 #43498742 未加载
评论 #43496831 未加载
评论 #43497162 未加载
评论 #43499672 未加载
评论 #43496806 未加载
评论 #43500317 未加载
aithrowawaycomm大约 2 个月前
I struggled reading the papers - Anthropic’s white papers reminds me of Stephen Wolfram, where it’s a huge pile of suggestive empirical evidence, but the claims are extremely vague - no definitions, just vibes - the empirical evidence seems selectively curated, and there’s not much effort spent building a coherent general theory.<p>Worse is the impression that they are begging the question. The rhyming example was especially unconvincing since they didn’t rule out the possibility that Claude activated “rabbit” simply because it wrote a line that said “carrot”; later Anthropic claimed Claude was able to “plan” when the concept “rabbit” was replaced by “green,” but the poem fails to rhyme because Claude arbitrarily threw in the word “green”! What exactly was the plan? It looks like Claude just hastily autocompleted. And Anthropic made zero effort to reproduce this experiment, so how do we know it’s a general phenomenon?<p>I don’t think either of these papers would be published in a reputable journal. If these papers are honest, they are incomplete: they need more experiments and more rigorous methodology. Poking at a few ANN layers and making sweeping claims about the output is not honest science. But I don’t think Anthropic is being especially honest: these are pseudoacademic infomercials.
评论 #43496643 未加载
评论 #43496572 未加载
评论 #43539061 未加载
评论 #43497794 未加载
评论 #43496916 未加载
smath大约 2 个月前
Reminds me of the term &#x27;system identification&#x27; from old school control systems theory, which meant poking around a system and measuring how it behaves, - like sending an input impulse and measuring its response, does it have memory, etc.<p><a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;System_identification" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;System_identification</a>
评论 #43496858 未加载
matthiaspr大约 2 个月前
Interesting paper arguing for deeper internal structure (&quot;biology&quot;) beyond pattern matching in LLMs. The examples of abstraction (language-agnostic features, math circuits reused unexpectedly) are compelling against the &quot;just next-token prediction&quot; camp.<p>It sparked a thought: how to test this abstract reasoning directly? Try a prompt with a totally novel rule:<p>“Let&#x27;s define a new abstract relationship: &#x27;To habogink&#x27; something means to perform the action typically associated with its primary function, but in reverse. Example: The habogink of &#x27;driving a car&#x27; would be &#x27;parking and exiting the car&#x27;. Now, considering a standard hammer, what does it mean &#x27;to habogink a hammer&#x27;? Describe the action.”<p>A sensible answer (like &#x27;using the claw to remove a nail&#x27;) would suggest real conceptual manipulation, not just stats. It tests if the internal circuits enable generalizable reasoning off the training data path. Fun way to probe if the suggested abstraction is robust or brittle.
评论 #43499910 未加载
评论 #43500444 未加载
评论 #43508520 未加载
评论 #43501329 未加载
评论 #43503300 未加载
fpgaminer大约 2 个月前
&gt; This is powerful evidence that even though models are trained to output one word at a time<p>I find this oversimplification of LLMs to be frequently poisonous to discussions surrounding them. No user facing LLM today is trained on next token prediction.
评论 #43496651 未加载
评论 #43496360 未加载
评论 #43496149 未加载
评论 #43496284 未加载
评论 #43499942 未加载
评论 #43496209 未加载
评论 #43499414 未加载
评论 #43496080 未加载
jacooper大约 2 个月前
So it turns out, it&#x27;s not just simple next token generation, there is intelligence and self developed solution methods (Algorithms) in play, particularly in the math example.<p>Also the multi language finding negates, at least partially, the idea that LLMs, at least large ones, don&#x27;t have an understanding of the world beyond the prompt.<p>This changed my outlook regarding LLMs, ngl.
评论 #43539352 未加载
modeless大约 2 个月前
&gt; In the poetry case study, we had set out to show that the model didn&#x27;t plan ahead, and found instead that it did.<p>I&#x27;m surprised their hypothesis was that it doesn&#x27;t plan. I don&#x27;t see how it could produce good rhymes without planning.
评论 #43498533 未加载
indigoabstract大约 2 个月前
While reading the article I enjoyed pretending that a powerful LLM just crash landed on our planet and researchers at Anthropic are now investigating this fascinating piece of alien technology and writing about their discoveries. It&#x27;s a black box, nobody knows how its inhuman brain works, but with each step, we&#x27;re finding out more and more.<p>It seems like quite a paradox to build something but to not know how it actually works and yet it works. This doesn&#x27;t seem to happen very often in classical programming, does it?
评论 #43497247 未加载
评论 #43497086 未加载
评论 #43498295 未加载
评论 #43497134 未加载
评论 #43499229 未加载
评论 #43497258 未加载
评论 #43497032 未加载
评论 #43498263 未加载
评论 #43497024 未加载
评论 #43539373 未加载
评论 #43498816 未加载
评论 #43497180 未加载
评论 #43539397 未加载
评论 #43497029 未加载
TechDebtDevin大约 2 个月前
&gt;&gt;Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.<p>This always seemed obvious to me or that LLMs were completing the next most likely sentence or multiple words.
评论 #43497734 未加载
deadbabe大约 2 个月前
We really need to work on popularizing better, non-anthropomorphic terms for LLMs, as they don’t really have “thoughts” the way people think. Such terms make people more susceptible to magical thinking.
评论 #43539647 未加载
评论 #43507807 未加载
评论 #43501190 未加载
sgt101大约 2 个月前
&gt;Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.<p>Models aren&#x27;t trained to do next word prediction though - they are trained to do missing word in this text prediction.
评论 #43504740 未加载
osigurdson大约 2 个月前
&gt;&gt; Claude can speak dozens of languages. What language, if any, is it using &quot;in its head&quot;?<p>I would have thought that there would be some hints in standard embeddings. I.e., the same concept, represented in different languages translates to vectors that are close to each other. It seems reasonable that an LLM would create its own embedding models implicitly.
评论 #43496954 未加载
评论 #43498233 未加载
jaakl大约 2 个月前
My main takeaway here is that the models cannot tell know how they really work, and asking it from them is just returning whatever training dataset would suggest: how a human would explain it. So it does not have self-consciousness, which is of course obvious and we get fooled just like the crowd running away from the arriving train in Lumiére&#x27;s screening. LLM just fails the famous old test &quot;cogito ergo sum&quot;. It has no cognition, ergo they are not agents in more than metaphorical sense. Ergo we are pretty safe from AI singularity.
评论 #43504812 未加载
评论 #43504745 未加载
zerop大约 2 个月前
The explanation of &quot;hallucination&quot; is quite simplified, I am sure there is more there.<p>If there is one problem I have to pick to to trace in LLMs, I would pick hallucination. More tracing of &quot;how much&quot; or &quot;why&quot; model hallucinated can lead to correct this problem. Given the explanation in this post about hallucination, I think degree of hallucination can be given as part of response to the user?<p>I am facing this in RAG use case quite - How do I know model is giving right answer or Hallucinating from my RAG sources?
评论 #43496382 未加载
评论 #43539442 未加载
评论 #43498670 未加载
mvATM99大约 2 个月前
What a great article, i always like how much Anthropic focuses on explainability, something vastly ignored by most. The multi-step reasoning section is especially good food for thought.
SkyBelow大约 2 个月前
&gt;Claude speaks dozens of languages fluently—from English and French to Chinese and Tagalog. How does this multilingual ability work? Is there a separate &quot;French Claude&quot; and &quot;Chinese Claude&quot; running in parallel, responding to requests in their own language? Or is there some cross-lingual core inside?<p>I have an interesting test case for this.<p>Take a popular enough Japanese game that has been released for long enough for social media discussions to be in the training data, but not so popular to have an English release yet. Then ask it a plot question, something major enough to be discussed, but enough of a spoiler that it won&#x27;t show up in marketing material. Does asking in Japanese have it return information that is lacking when asked in English, or can it answer the question in English based on the information in learned in Japanese?<p>I tried this recently with a JRPG that was popular enough to have a fan translation but not popular enough to have a simultaneous English release. English did not know the plot point, but I didn&#x27;t have the Japanese skill to confirm if the Japanese version knew the plot point, or if discussion was too limited for the AI to be aware of it. It did know of the JRPG and did know of the marketing material around it, so it wasn&#x27;t simply a case of my target being too niche.
0x70run大约 2 个月前
I would pay to watch James Mickens comment on this stuff.
diedyesterday大约 2 个月前
Regarding the conclusion about language-invariant reasoning (conceptual universality vs. multilingual processing) it helps understanding and becomes somewhat obvious if we regard each language as just a basis of some semantic&#x2F;logical&#x2F;thought space in the mind (analogous to the situation in linear algebra and duality of tensors and bases).<p>The thoughts&#x2F;ideas&#x2F;concepts&#x2F;scenarios are invariant states&#x2F;vector&#x2F;points in the (very high dimensional) space of meanings in the mind and each language is just a basis to reference&#x2F;define&#x2F;express&#x2F;manipulate those ideas&#x2F;vectors. A coordinatization of that semantic space.<p>Personally, I&#x27;m a multilingual person with native-level command of several languages. Many times it happens, I remember having a specific thought, but don&#x27;t remember in what language it was. So I can personally sympathize with this finding of the Anthropic researchers.
JPLeRouzic大约 2 个月前
This is extremely interesting: The authors look at features (like making poetry, or calculating) of LLM production, make hypotheses about internal strategies to achieve the result, and experiment with these hypotheses.<p>I wonder if there is somewhere an explanation linking the logical operations made on a on dataset, are resulting in those behaviors?
评论 #43496101 未加载
YeGoblynQueenne大约 2 个月前
&gt;&gt; Language models like Claude aren&#x27;t programmed directly by humans—instead, they‘re trained on large amounts of data.<p>Gee, I wonder where this data comes from.<p>Let&#x27;s think about this step by step.<p>So, what do we know? Language models like Claud are not programmed directly.<p>Wait, does that mean they are programmed indirectly?<p>If so, by whom?<p>Aha, I got it. They are not programmed, directly or indirectly. They are trained on large amounts of data.<p>But that is the question, right? Where does all that data come from?<p>Hm, let me think about it.<p>Oh hang on I got it!<p>Language models are trained on data.<p>But they are language models so the data is language.<p>Aha! And who generates language?<p>Humans! Humans generate language!<p>I got it! Language models are trained on language data generated by humans!<p>Wait, does that mean that language models like Claud are indirectly programmed by humans?<p>That&#x27;s it! Language models like Claude aren&#x27;t programmed directly by humans because they are indirectly programmed by humans when they are trained on large amounts of language data generated by humans!
评论 #43500217 未加载
评论 #43500622 未加载
评论 #43500561 未加载
评论 #43500344 未加载
ofrzeta大约 2 个月前
Back to the &quot;language of thought&quot; question, this time with LLMs :) <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Language_of_thought_hypothesis" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Language_of_thought_hypothesis</a>
annoyingnoob大约 2 个月前
Do LLMs &quot;think&quot;? I have trouble with the title, claiming that LLMs have thoughts.
评论 #43507800 未加载
davidmurphy大约 2 个月前
On a somewhat related note, check out the video of Tuesday&#x27;s Computer History Museum x IEEE Spectrum event, &quot;The Great Chatbot Debate: Do LLMs Really Understand?&quot;<p>Speakers: Sébastien Bubeck (OpenAI) and Emily M. Bender (University of Washington). Moderator: Eliza Strickland (IEEE Spectrum).<p>Video: <a href="https:&#x2F;&#x2F;youtu.be&#x2F;YtIQVaSS5Pg" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;YtIQVaSS5Pg</a> Info: <a href="https:&#x2F;&#x2F;computerhistory.org&#x2F;events&#x2F;great-chatbot-debate&#x2F;" rel="nofollow">https:&#x2F;&#x2F;computerhistory.org&#x2F;events&#x2F;great-chatbot-debate&#x2F;</a>
a3w大约 2 个月前
Article and papers looks good. Video seems misleading, since I can use optimization pressure and local minima to explain the model behaviour. No &quot;thinking&quot; required, which the video claims is proven.
d--b大约 2 个月前
&gt; This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.<p>Suggesting that an awful lot of calculations are unnecessary in LLMs!
评论 #43499493 未加载
jaehong747大约 2 个月前
I’m skeptical of the claim that Claude “plans” its rhymes. The original example—“He saw a carrot and had to grab it, &#x2F; His hunger was like a starving rabbit”—is explained as if Claude deliberately chooses “rabbit” in advance. However, this might just reflect learned statistical associations. “Carrot” strongly correlates with “rabbit” (people often pair them), and “grab it” naturally rhymes with “rabbit,” so the model’s activations could simply be surfacing common patterns.<p>The research also modifies internal states—removing “rabbit” or injecting “green”—and sees Claude shift to words like “habit” or end lines with “green.” That’s more about rerouting probabilistic paths than genuine “adaptation.” The authors argue it shows “planning,” but a language model can maintain multiple candidate words at once without engaging in human-like strategy.<p>Finally, “planning ahead” implies a top-down goal and a mechanism for sustaining it, which is a strong assumption. Transformative evidence would require more than observing feature activations. We should be cautious before anthropomorphizing these neural nets.
评论 #43505367 未加载
评论 #43505804 未加载
0xbadcafebee大约 2 个月前
AI &quot;thinks&quot; like a piece of rope in a dryer &quot;thinks&quot; in order to come to an advanced knot: a whole lot of random jumbling that eventually leads to a complex outcome.
评论 #43506571 未加载
评论 #43499607 未加载
teleforce大约 2 个月前
Another review on the paper from MIT Technology Review [1].<p>[1] Anthropic can now track the bizarre inner workings of a large language model:<p><a href="https:&#x2F;&#x2F;www.technologyreview.com&#x2F;2025&#x2F;03&#x2F;27&#x2F;1113916&#x2F;anthropic-can-now-track-the-bizarre-inner-workings-of-a-large-language-model&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.technologyreview.com&#x2F;2025&#x2F;03&#x2F;27&#x2F;1113916&#x2F;anthropi...</a>
评论 #43539461 未加载
jasonjmcghee大约 2 个月前
I&#x27;m completely hooked. This is such a good paper.<p>It hallucinating how it thinks through things is particularly interesting - not surprising, but cool to confirm.<p>I would LOVE to see Anthropic feed the replacement features output to the model itself and fine tune the model on how it thinks through &#x2F; reasons internally so it can accurately describe how it arrived at its solutions - and see how it impacts its behavior &#x2F; reasoning.
trhway大约 2 个月前
&gt;We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model.<p>While it was already generally noticeable, still one more time confirmed that larger model generalizes better instead of using its bigger numbers of parameters just to “memorize by rote” (overfitting).
kazinator大约 2 个月前
&gt; <i>Claude writes text one word at a time. Is it only focusing on predicting the next word or does it ever plan ahead?</i><p>When a LLM outputs a word, it commits to that word, without knowing what the next word is going to be. Commits meaning once it settles on that token, it will not backtrack.<p>That is kind of weird. Why would you do that, and how would you be sure?<p>People can sort of do that too. Sometimes?<p>Say you&#x27;re asked to describe a 2D scene in which a blue triangle partially occludes a red circle.<p>Without thinking about the relationship of the objects at all, you know that your first word is going to be &quot;The&quot; so you can output that token into your answer. And then that the sentence will need a subject which is going to be &quot;blue&quot;, &quot;triangle&quot;. You can commit to the tokens &quot;The blue triangle&quot; just from knowing that you are talking about a 2D scene with a blue triangle in it, without considering how it relates to anything else, like the red circle. You can perhaps commit to the next token &quot;is&quot;, if you have a way to express any possible relationship using the word &quot;to be&quot;, such as &quot;the blue circle is partially covering the red circle&quot;.<p>I don&#x27;t think this analogy necessarily fits what LLMs are doing.
评论 #43497241 未加载
评论 #43496888 未加载
评论 #43499396 未加载
评论 #43496874 未加载
评论 #43497685 未加载
westurner大约 2 个月前
XAI: Explainable artificial intelligence: <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Explainable_artificial_intelligence" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Explainable_artificial_intelli...</a>
Hansenq大约 2 个月前
I wonder how much of these conclusions are Claude-specific (given that Anthropic only used Claude as a test subject) or if they extrapolate to other transformer-based models as well. Would be great to see the research tested on Llama and the Deepseek models, if possible!
hbarka大约 2 个月前
Dario Amodei was in an interview where he said that OpenAI beat them (Anthropic) by mere days to be the first to release. That first move ceded the recognition to ChatGPT but according to Dario it could have been them just the same.
评论 #43505839 未加载
HocusLocus大约 2 个月前
[Tracing the thoughts of a large language model]<p>&quot;What have I gotten myself into??&quot;
darkhorse222大约 2 个月前
Once we are aware of these neural pathways I see no reason there shouldn&#x27;t be a watcher and influencer of the pathways. A bit like a dystopian mind watcher. Shape the brain.
评论 #43509919 未加载
EncomLab大约 2 个月前
This is very interesting - but like all of these discussions it sidesteps the issues of abstractions, compilation, and execution. It&#x27;s fine to say things like &quot;aren&#x27;t programmed directly by humans&quot;, but the abstracted code is not the program that is running - the compiled code is - and that is code is executing within the tightly bounded constraints of the ISA it is being executed in.<p>Really this is all so much slight of hand - as an esolang fanatic this all feels very familiar. Most people can&#x27;t look a program written in Whitespace and figure it out either, but once compiled it is just like every other program as far as the processor is concerned. LLM&#x27;s are no different.
评论 #43496319 未加载
twoodfin大约 2 个月前
I say this at least 82.764% in jest:<p>Don’t these LLM’s have The Bitter Lesson in their training sets? What are they doing building specialized structures to handle specific needs?
teleforce大约 2 个月前
Oh the irony of not able to download the entire paper in one compact PDF format referred in the article, while apparently all the reference citations have PDF of the cited article to be downloaded and accessible from the provided online links [1].<p>Come on Anthropic, you can do much better than this unconventional and bizarre approach to publication.<p>[1] On the Biology of a Large Language Model:<p><a href="https:&#x2F;&#x2F;transformer-circuits.pub&#x2F;2025&#x2F;attribution-graphs&#x2F;biology.html" rel="nofollow">https:&#x2F;&#x2F;transformer-circuits.pub&#x2F;2025&#x2F;attribution-graphs&#x2F;bio...</a>
jxjnskkzxxhx大约 2 个月前
CaNt UnDeRsTaNd WhY pEoPlE aRe BuLlIsH.<p>ItS jUsT a StOcHaStIc PaRtOt.
LoganDark大约 2 个月前
LLMs don&#x27;t think, and LLMs don&#x27;t have strategies. Maybe it could be argued that LLMs have &quot;derived meaning&quot;, but all LLMs do is predict the next token. Even RL just tweaks the next-token prediction process, but the math that drives an LLM makes it impossible for there to be anything that could reasonably be called thought.
评论 #43496703 未加载
评论 #43496759 未加载
评论 #43496657 未加载
greesil大约 2 个月前
What is a &quot;thought&quot;?
kittikitti大约 2 个月前
What&#x27;s the point of this when Claude isn&#x27;t open sourced and we just have to take Anthropic&#x27;s word for it?
评论 #43496370 未加载
评论 #43496621 未加载
alach11大约 2 个月前
Fascinating papers. Could deliberately suppressing memorization during pretraining help force models to develop stronger first-principles reasoning?
navaed01大约 2 个月前
When and how to do stop saying LLMs are predicting the nect set of tokens and start saying they are thinking? Is this the point?
rambambram大约 2 个月前
When I want to trace the &#x27;thoughts&#x27; of my programs, I just read the code and comments I wrote.<p>Stop LLM anthropomorphizing, please. #SLAP