Chain of Recursive Thoughts: Make AI think harder by making it argue with itself

539 点作者 miles大约 1 个月前

68 条评论

dudeinhawaii大约 1 个月前

I see a lot of threads pitting models against each other (or whole swarms of them) in the hope that "wisdom of crowds" will magically appear. After a stack of experiments of my own—and after watching the recent ASU/Microsoft-Research work [1].. I've landed on a simpler takeaway:An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.[1] <a href="https://www.youtube.com/watch?v=0u2hdSpNS2o" rel="nofollow">https://www.youtube.com/watch?v=0u2hdSpNS2o</a> - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)[2] <a href="https://arxiv.org/abs/2402.08115" rel="nofollow">https://arxiv.org/abs/2402.08115</a>[3] <a href="https://arxiv.org/abs/2402.01817" rel="nofollow">https://arxiv.org/abs/2402.01817</a> (related to the talk in #1)

评论 #43842083 未加载

评论 #43841207 未加载

评论 #43840833 未加载

评论 #43842076 未加载

评论 #43853446 未加载

评论 #43848090 未加载

评论 #43840850 未加载

评论 #43848830 未加载

评论 #43846984 未加载

odo1242大约 1 个月前

Something I do sometimes is:- Have an AI chat model come up with an answer to a problem.- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.It's super clunky but has given pretty good results in the cases where I tried it lol

评论 #43839530 未加载

评论 #43838163 未加载

评论 #43838897 未加载

评论 #43838522 未加载

评论 #43837692 未加载

评论 #43838677 未加载

评论 #43837680 未加载

评论 #43838493 未加载

评论 #43839444 未加载

评论 #43841323 未加载

评论 #43840124 未加载

Lerc大约 1 个月前

I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case by case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results , make suggestions. Even potentially a tree of analysts where suggestions only gets passed up the tree when the parent node thinks a lower analysis is particularly insightful.I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.

评论 #43837003 未加载

评论 #43837562 未加载

评论 #43836867 未加载

评论 #43837385 未加载

评论 #43837623 未加载

评论 #43843025 未加载

评论 #43839732 未加载

评论 #43838964 未加载

评论 #43837046 未加载

cube2222大约 1 个月前

This is really cool!One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).

评论 #43838977 未加载

评论 #43836316 未加载

评论 #43838152 未加载

electroly大约 1 个月前

This seems to be different than I expected from the title. I thought it would be explicitly adversarial.1. You are the assistant. Please answer the question directly.2. You are the cross-examiner. The assistant is wrong. Explain why.3. You are the assistant. The cross-examiner is wrong. Defend your claim.4. You are a judge. Did either party make their case, or is another round of argumentation required?I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.

评论 #43837482 未加载

评论 #43837815 未加载

评论 #43838329 未加载

评论 #43836834 未加载

hnuser123456大约 1 个月前

I'm having a lot of fun experimenting with stuff like this. I'm trying to put together an unrealengine blueprints style graph editor to allow people to design workflows like this where you start with the user prompt input, which goes to one agent, which makes an initial attempt, and then that conversation history gets passed to another "agent" with a different system prompt telling it to be a harsh critic, but to also give a pass/fail signal, and loop back until the critic judges pass, then send that back to the user as output. Ideally as a little website that can call your own LLM endpoints and save/load/share workflow graphs.Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.

评论 #43835984 未加载

评论 #43836749 未加载

评论 #43835865 未加载

评论 #43836444 未加载

评论 #43836520 未加载

jedberg大约 1 个月前

We're really going to need to figure out how to power all these GPUs with green power real quick, or we're going to melt the planet having AIs debate with themselves on the optimal solution to tik-tac-toe...

评论 #43836887 未加载

评论 #43838700 未加载

评论 #43853498 未加载

Xcelerate大约 1 个月前

I think this is how we get ML models to come up with novel ideas. Diagonalize against all the ideas they’ve already tried and dismissed via self-argument but keep certain consistency constraints. (Obviously much easier said than done.)

评论 #43835965 未加载

评论 #43835912 未加载

albertgoeswoof大约 1 个月前

How far is this going to go? Are we going to have a team of AI agents that runs a scrum team and meets for stand ups every couple of hours?Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?

评论 #43841666 未加载

评论 #43838630 未加载

评论 #43842615 未加载

faramarz大约 1 个月前

That's cool! thanks for making it easy to fork and play with this!I've just begun my own iteration of adding Nash Equilibrium (NECoRT?) and reframing the "prompt engineering" to be a multi-agent negotiation. Curious what others think? <a href="https://github.com/faramarz/NECoRT/">https://github.com/faramarz/NECoRT/</a>my reasoning is that enterprise LLMs wont have any issue with the extra compute costs and would rather reconcile complex financials with various modeling optimizations.I'm very new to public repo and contributions, and hope someone can point out if I'm doing it wrong.my intention was to fork the ops codebase so I can test out my theory, and push as PR eventually

alexmolas大约 1 个月前

There are two examples in the repo, one with CoRT and another one without. And the one without it it's much better than the one that uses it. Weird choice of examples...

评论 #43836996 未加载

joshstrange大约 1 个月前

I've thought about trying this cross-model as well. Have Claude generate something, have OpenAI check it, have Gemini check that check. Firing multiple of these in parallel.There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.

K0balt大约 1 个月前

I’ll second this. I often use a “research assistant “ and skeptical“department head” personas working together/against each other as a research team. It works well and is occasionally hilarious, replete with the occasional HR complaint when things go off the rails. ( I typically use local uncensored models)

k2xl大约 1 个月前

I've done something similar for learning about a controversial topic. I ask it to act as if it is called Bob is a well informed supporter of one side (like Ukraine) and then act as if it is something named Alice who is a well informed supporter of another side (Russia) and they have to debate each other over a few prompts with a moderator named 'Sue'Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.Really fun, and helps me understand different sides of issues.

评论 #43838130 未加载

caseyy大约 1 个月前

I tried something similar when Llama2 came out, pitting two assistants, who each believed the other is the user, against each other. Ultimately, it was the same model talking with itself. The system prompts for both had various instructions to disagree and criticise the opinion of the user. I provided the first message to get things started. Usually, it’s be along the lines of “nuclear proliferation is harmful to humanity”.After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.Happy someone else made it work.

评论 #43839718 未加载

评论 #43837229 未加载

评论 #43838986 未加载

bilekas大约 1 个月前

This is an interesting approach, it reminds me of YT creator actually. I'll find the YT creator, but basically he would make some script that would play the game like a race-course, with the goal being the finish line and iterate it N number of times, the script would keep iterating until it found the fastest solution.I believe they called that machine learning.. Or re-enforced training.I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?<a href="https://www.youtube.com/watch?v=SX08NT55YhA" rel="nofollow">https://www.youtube.com/watch?v=SX08NT55YhA</a>

WhitneyLand大约 1 个月前

Why try this idea on base models only?The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.It would be interesting to see if this is doing anything that’s not already being exploited.

ChadMoran大约 1 个月前

Fast Agent has this as a first-class citizen called "Evaluator Optimizer" pattern. Where it in a loop with a defined number of max refinements judge itself and give the output a rating, demanding it improve it's output.Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.<a href="https://github.com/evalstate/fast-agent">https://github.com/evalstate/fast-agent</a>

Der_Einzige大约 1 个月前

Debate as a reasoning tactic is massively undervalued. There's tons of papers on this at places like NeurIPS, ICML, ICLR, etc.Hell, even a whole quanta article. <a href="https://www.quantamagazine.org/debate-may-help-ai-models-converge-on-truth-20241108/" rel="nofollow">https://www.quantamagazine.org/debate-may-help-ai-models-con...</a>I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!

lepisma大约 1 个月前

Debates have worked good for me while learning something new:<a href="https://lepisma.xyz/2024/10/19/interventional-debates-for-studying-gray-topics/index.html" rel="nofollow">https://lepisma.xyz/2024/10/19/interventional-debates-for-st...</a>I believe there are researches on this too.

aaroninsf大约 1 个月前

Question: has the the adversarial approach been roled into any coding copilots/assistant frameworks?Costs of various kinds aside I've wanted that from assistance's inception — with precisely the features many call out and home-roll here, difference by both model/provider, and, "role"...It seems like if you have the money/compute to burn, and can live with the reasoning wall-clock time,this has got to be the best approach for the foreseeable future, for a lot of specific requirements.(I also have wondered if this would illuminate the edges of what modern production models are capable of, "aggregating and integrating" over a variety of contributions might make more clear what the limits of their abilities are.)

badmonster大约 1 个月前

Have you experimented with weighting the self-evaluations based on specific criteria (e.g., correctness, clarity, creativity), or using external validators to guide the AI’s final choice? Curious how much tuning the evaluation step impacts overall performance.

mortarion大约 1 个月前

I think Gemini 2.5 already does something similar. If you read the "thinking descriptions" that it outputs it often thinks about going back to older thoughts to verify and criticize.

yieldcrv大约 1 个月前

Reminds me of baby agi from 2 years agobut I guess that was before chain of thought models

zekenie大约 1 个月前

I feel like itd be cool to try prompts based on an adversarial justice system… attorney agents arguing both sides, a judge ruling on “the law”—adherence to instructions etc

评论 #43838741 未加载

hu3大约 1 个月前

Here's some related challenge I'm facing. Maybe someone can help me:I also managed to make AI critique itself and that improved code generation a ton.For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?Docker works but I like to keep things simple.Deno supports revoking file access but I'd like to keep using Bun.

评论 #43844428 未加载

评论 #43838063 未加载

评论 #43838082 未加载

评论 #43838819 未加载

评论 #43839733 未加载

schnitzelstoat大约 1 个月前

I probably don't understand the modern, complex models. But doesn't it basically predict the next token given the context and the better models use more training data and can consider a larger context, and have more parameters to better retain information from the training data etc.But the fundamental way they operate is the same - predicting the next token given previous tokens. Where/how does reasoning happen here?

评论 #43844021 未加载

stormfather大约 1 个月前

I made a trading bot that ingested news. The prompt to assess impact was to simulate a debate between Charlie Munger and Warren Buffet on whether to invest.

评论 #43838953 未加载

thunderbong大约 1 个月前

A lot of the comments here are reminiscent of the early Google days when everyone was finding ways to search better!

j45大约 1 个月前

There appear to be no shortage of token saving attempts that can end up using more tokens, whether it's a monthly paid plan or API.Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.

pkdpic大约 1 个月前

So glad to see a write up on this finally. I'm no machine learning phd but I always wondered why this wasn't more of a thing. Like an extension of a GAN conceptually, sort of, not really at all Im sure.Also I think I kind of assumed OpenAI might be doing this behind the curtain?

mritchie712大约 1 个月前

Did something similar (OverkiLLM) to this waayyyy back in August with open LLMs. I'm sure it'd work much better now:<a href="https://www.definite.app/blog/overkillm" rel="nofollow">https://www.definite.app/blog/overkillm</a>

rriley大约 1 个月前

Makes me wonder what would happen if we combine LLMs with recursive genetic algorithms. Similar to <a href="https://github.com/DivergentAI/dreamGPT">https://github.com/DivergentAI/dreamGPT</a>

noworriesnate大约 1 个月前

I’ve had success telling the model it really needs to poop and if it gets to the point quickly it’ll be able to leave the meeting and go do that. It actually works amazingly well.It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.Programming isn’t what it used to be.

评论 #43837606 未加载

ausbah大约 1 个月前

at some point this doesn’t make LLMs feel useful. I have to wait 10x as long just so my LLM can have a somewhat higher chance of actually answer my question correctly?

cwillu大约 1 个月前

Any api that lets you constrain output to a formal syntax should let you do away with the “first output a number, and only then explain yourself” boilerplate.

killerstorm大约 1 个月前

This is similar to Tree-of-Thought with self-evaluation.

daxfohl大约 1 个月前

Maybe have a "reconcile" option, for it to see if it can mix and match the best parts of each alternative rather than just choosing one.

grzracz大约 1 个月前

Your readme demo images are wrong: the terminal one is the non-CoRT one and the GUI one is the one with CoRT. Confused me for a while

Svoka大约 1 个月前

Oh. I was just asking "Use dialectic method on your solution" in the end of the prompt... It does make it think harder.

ashoeafoot大约 1 个月前

Give it reward and punishment evaluations, exploring the noise in parallel, extinction for the non rewarding answers ?

keyle大约 1 个月前

When will we get the `4o` vs `o3` background conversation in "thinking" leading to a more correct result?

kevinrineer大约 1 个月前

This sounds like the zeitgeist is approaching genetic algorithms, which are super fun. Adversarial stuff is great.

throwawayForMe2大约 1 个月前

I wonder if the Scholastic method of the Schoolmen would be useful with its argument and counter argument style.

alex1138大约 1 个月前

Every single one of my prompts would be "Are you suuuuuuure you're not hallucinating that?"

Garlef大约 1 个月前

Similarly, letting the LLM generate a socratic dialogue can work pretty well to get deeper into a topic.

mangoman大约 1 个月前

a paper with a similar idea on scaling test time reasoning, this is sorta how all the thinking models work under the hood. <a href="https://arxiv.org/abs/2501.19393" rel="nofollow">https://arxiv.org/abs/2501.19393</a>

gnarlouse大约 1 个月前

This seems like low hanging fruit; are we seriously supposed to believe this is new and novel?

评论 #43843823 未加载

irthomasthomas大约 1 个月前

my favourite pattern rn: llm "write a savage, yet grounded roast of: $content" llm -c "Write an equally savage rebuttal" llm -c "first arbitrate and then synthesize a final review."

asdfman123大约 1 个月前

And when I do this people say I'm overanalyzing

评论 #43838827 未加载

animitronix大约 1 个月前

Adversarial networks have been a thing for a while

stevefan1999大约 1 个月前

That is just reinforcement learning in disguise

akomtu大约 1 个月前

The modern Alchemy: the belief that you can extract gold (intelligence) from iron (autocomplete by imitation) by mixing iron with itself.

csours大约 1 个月前

Yes, give the computers anxiety too!

lonetripper大约 1 个月前

all this hard thinking yet humanity fails to come up with just one girlfriend for me

robofanatic大约 1 个月前

soon there will be AI debates. Different models debating with each other on a topic

mparnisari大约 1 个月前

So like rubber ducking for AI?

评论 #43837151 未加载

评论 #43837035 未加载

jbellis大约 1 个月前

does it actually make a difference to do M rounds of N vs one round of M*N?

评论 #43839747 未加载

firgrove大约 1 个月前

this is amazing - I love seeing novel approaches to optimizing

celltalk大约 1 个月前

One of my doctoral propositions is, dialog leads to true artificial intelligence.

getcrunk大约 1 个月前

Hello cnn’s

parrit大约 1 个月前

I want to see "Meh" vs. "Holy crap" as a benchmark in a paper published by Google. Or more likely I suspect, Andrej.

codr7大约 1 个月前

Better yet, let it argue with another AI, preferably using voice; instant entertainment.

antisthenes大约 1 个月前

Cool. Now I can justify talking to myself.

m3kw9大约 1 个月前

Isn’t this best of n?

评论 #43836481 未加载

lenerdenator大约 1 个月前

I, too, like to give Terminator lite anxiety.

hansmayer大约 1 个月前

Right, so... but you do realise its still just producing random output based on how you reconfigured it's weights, right? Sometimes it will happen to resonate with what you need. But it still neither thinking nor arguing with itself.

DyslexicAtheist大约 1 个月前

> "I made my AI think" ...utterly moronic.They don't “think” ... not even in the most autistic sense of the word.They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.

评论 #43842516 未加载