It's nice to see a paper that confirms what anyone who has spent real time with LLM tools already knows heuristically: keeping your context clean matters. "Conversations" are only a construct of product interfaces; they hurt the quality of responses from the LLM itself, and once your context is "poisoned" it will not recover; you need to start fresh with a new chat.
Seems like this is an aspect of their well-known overconfidence and their inability to self-reflect and recognize that they have to ask for more details because their priors are too low. If you look at the output of reasoning models, it's clear that the idea of asking for clarification very rarely occurs to them; when they're confused, it's just endless speculation about what the user *might* have meant.

This, of course, has certain implications for the wisdom of "replacing human programmers", given that one of the *hard* parts of the trade is turning vague and often confused ideas into precise specifications by *interacting* with the stakeholders.
I often ask the LLM for a concise summary of the discussion so far—formatted as a prompt. I then edit it appropriately and use it to start a new conversation without the baggage. I have found this to be a very effective technique, but I imagine it will be automated sometime soon.
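A minimal sketch of what that flow can look like in code, assuming the OpenAI Python client; the model name and the summarization wording are placeholders, not a recommendation:

```python
# Sketch of the "summarize into a prompt, then restart" technique.
from openai import OpenAI

client = OpenAI()

SUMMARIZE_INSTRUCTION = (
    "Summarize this conversation as a single standalone prompt that captures "
    "the task, all agreed requirements, and relevant constraints. "
    "Output only the prompt text."
)

def restart_with_summary(history: list[dict]) -> list[dict]:
    """Collapse an existing conversation into one prompt and start a fresh one."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=history + [{"role": "user", "content": SUMMARIZE_INSTRUCTION}],
    )
    summary_prompt = resp.choices[0].message.content
    # Edit summary_prompt by hand here if needed, then begin a clean conversation.
    return [{"role": "user", "content": summary_prompt}]
```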
Why I came up with TSCE (Two-Step Contextual Enrichment).

+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.

Free open framework; check the repo and try it yourself: https://github.com/AutomationOptimization/tsce_demo

I tested this another 300 times with gpt-4.1 to remove those obtrusive em-dashes everyone hates. Tested a single-pass baseline vs. TSCE, with the same exact instructions and prompt: "Remove the em-dashes from my linkedin post. . .".

Out of the 300 tests, the baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove them 18/300 times.

It works; all the data, as well as the entire script used for testing, is in the repo.
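This isn't the repo's actual API (see the link above for that); it's just a rough illustration of the general two-pass shape, where a first call builds enriched context and a second call answers with it, assuming the OpenAI Python client:

```python
# Rough illustration of a two-pass call pattern (not the tsce_demo API itself).
from openai import OpenAI

client = OpenAI()

def two_step(task: str, model: str = "gpt-4.1") -> str:
    # Pass 1: derive an enriched "context scaffold" for the task.
    scaffold = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Before answering, write out the constraints, edge cases, "
                       "and success criteria for this task:\n\n" + task,
        }],
    ).choices[0].message.content

    # Pass 2: answer the task with the scaffold prepended.
    answer = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Context scaffold:\n" + scaffold},
            {"role": "user", "content": task},
        ],
    ).choices[0].message.content
    return answer
```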
I've been working on solving this with quite a bit of success, and I'll be sharing more on it soon. It involves two systems: the first is the LLM itself, and the second acts as a 'curator' of thoughts, you could say.

The curator dynamically swaps portions of the context in and out. It isn't based on explicit definitions; it relies on the LLM 'filling the gaps'. The system helps the LLM break problems down into small tasks, which eventually aggregate into the full task.
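A hypothetical sketch of what a "curator" layer could look like; the chunk structure and the budget-based selection are assumptions for illustration, with relevance scoring left to some separate step (an LLM call or embeddings):

```python
# Hypothetical "curator" that decides which past chunks stay in the working context.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    relevance: float = 0.0  # filled in by some scoring step (LLM or embeddings)

def curate(chunks: list[Chunk], budget_chars: int) -> list[Chunk]:
    """Keep the most relevant chunks that fit the context budget; swap out the rest."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        if used + len(chunk.text) > budget_chars:
            continue
        kept.append(chunk)
        used += len(chunk.text)
    return kept
```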
It's amazing that branching/forking isn't a core aspect of the main chat tools.

You can edit responses, sure, but then a bunch of other context is lost.

My flow is basically:

1. plan

2. build

3. branch (into some feature/esoteric dependency issue)

4. goto #2

Prompt pruning/branching should be a first-class tool for any LLM usage.
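A minimal sketch of the branching idea, assuming a plain list-of-messages history; each branch copies the history up to a chosen turn so the main thread stays untouched:

```python
# Fork the conversation at a given turn index and explore independently.
def branch(history: list[dict], at_turn: int) -> list[dict]:
    return [dict(m) for m in history[:at_turn]]

main = [
    {"role": "user", "content": "Plan the feature."},
    {"role": "assistant", "content": "Here's a plan..."},
    {"role": "user", "content": "Start building step 1."},
]

# Dig into an esoteric dependency issue on a side branch, then discard it
# (or merge just its conclusion back) without polluting the main context.
side = branch(main, at_turn=2)
side.append({"role": "user", "content": "Why does this dependency fail to resolve?"})
```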
There is a noticeable issue when one builds LLM interfaces around single-turn conversations: the majority of people expect linear conversations.

I've built a Telegram bot, http://t.me/experai_bot, as a universal UI to LLMs (with somewhat reduced functionality) exactly around the idea that "a non-reply message means a new conversation". Want to keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.

--

Also, I observed that OpenAI models performed worse replying to the same questions (for example, the list of options in a reply got shorter) even with the smallest system message. That was the case with 3.5 and 4o; I don't know how modern ones behave. That made me decide not to include any system messages by default. I still give the option to add them if you need, and you can even toggle them to mix and match.
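A rough sketch of that routing rule, using a made-up message shape rather than any particular Telegram library:

```python
# Hypothetical routing: a reply to one of the bot's messages continues that
# thread's context; any other message starts a fresh conversation.
threads: dict[int, list[dict]] = {}  # context keyed by the bot message being replied to

def route(user_text: str, reply_to_id: int | None) -> list[dict]:
    history = threads.get(reply_to_id, []) if reply_to_id is not None else []
    return history + [{"role": "user", "content": user_text}]

# After the LLM answers, store the updated history under the id of the bot's
# new message, so that replying to that message continues the same thread.
```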
This was the main reason I wrote promptdown. I want to be able to edit the full chat history every turn, and the standard append-only chat interfaces don't make that easy.

https://github.com/t-kalinowski/promptdown
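Not promptdown's actual format or API (see the repo for that), but the core idea of re-sending an edited full history each turn, instead of appending, looks roughly like this:

```python
# Illustration of "edit the whole transcript, don't just append".
from openai import OpenAI

client = OpenAI()

def run_turn(history_text: str, model: str = "gpt-4o") -> str:
    """history_text is a hand-editable transcript; rebuild the messages from it each turn."""
    messages = []
    for block in history_text.strip().split("\n\n"):
        role, _, content = block.partition(":")
        messages.append({"role": role.strip(), "content": content.strip()})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content
```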
I always felt the derision around the term "prompt engineering" was partly due to people overestimating the importance of the initial prompt and underestimating the importance of managing the ongoing context.

With experience, you develop a knack for when to steer the model and when to start a new conversation. The system or initial prompt is important, but nothing will save you if you naively keep a conversation going too long.
I'd like to see more research on context understanding beyond NIAH (needle-in-a-haystack). I don't believe LLMs support the context lengths companies say they support, but I need to know this to use the tools effectively, at least for coding.

Stuff like this:

1. Do: best practice for model X is to include at most 10k lines of code + the task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).

2. Don't: start a project without a clearly defined architecture in *this format*. Don't ask for tasks that require X amount of reading hops to understand the logic.

I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's all the more ironic that some people think of these AIs as employees. Employees can work with their boss on the best way to achieve things! With LLMs you don't even know how to communicate with them, and as a result their output is unreliable.
This is the best paper on machine psychology [1] I've yet seen. Rigorous, empirical, insightful — and very practical.

[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstract
Why do LLMs struggle so much with recovering from early wrong turns in multi-turn conversations — even when all prior context is available and tokenized?

Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?

It feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?
That's no surprise. When I was working on game theory and agent reasoning, I reached the same conclusion a year ago.

My conclusion was that context needs to be managed well for LLMs to maintain accuracy in their replies. It also helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.

This also raises the question of general-purpose vs. workflow-specific agent implementations: in the former, it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
This is very interesting, and I like that the conversation is not only about the technology itself but also about the interface as a user experience and where/how it fits the paradigm.

We've been working on a lot of data processing and generation tasks, primarily through an API. Sometimes, though, I end up prototyping in a chat window: I first chat through the requirements for the data analysis/processing, and once I'm done I'd like the whole conversation summarised into a one-prompt process that I can reuse (because I can't really process new inputs via the chat).

Even when you do get it down to a single prompt and then ask the chat to keep producing new data (imagine a blog post in a certain style, where the base content is given as input and I'm making twenty of them), producing them in the chat has notable benefits: if something is wrong with the post the chat suggests, you can immediately edit it. The trouble is that the context window becomes so big that the chat starts to forget the original instruction, and eventually you have to create a new chat.

One way to solve this is a chat with selective memory: keep the task in memory, but have the chat forget (not include) all the generated data in the context so it stays clean, and only bring a piece back into the context when the user refers to it.

Has anyone else done data-processing tasks in chats and had issues like this? Are there other tools or tricks for this?
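A hypothetical sketch of that selective-memory idea: the task instruction always stays in context, generated artifacts are stored aside, and one is only pulled back in when the new user message mentions it. The naming scheme and the crude reference check are made up for illustration:

```python
# Selective-memory context builder: keep the task, drop generated outputs
# unless the user refers to one of them by name.
def build_context(task: str, outputs: dict[str, str], user_msg: str) -> list[dict]:
    messages = [{"role": "system", "content": task}]
    for name, text in outputs.items():
        if name.lower() in user_msg.lower():  # crude reference check
            messages.append({"role": "assistant", "content": f"[{name}]\n{text}"})
    messages.append({"role": "user", "content": user_msg})
    return messages

# outputs = {"post-07": "..."}; asking "tweak post-07's intro" pulls only that
# artifact back into context and keeps the window clean otherwise.
```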
Why haven't AI code editors built this into their core yet: automatically consolidating previous conversational turns into a more structured context summary? Instead of relying solely on the model's memory of all prior exchanges, surely these tools should take responsibility for intermittently "restating" the clarified requirements so the model doesn't have to reconstruct context from scratch (or worse, pick up mistakes). This might mitigate compounding errors and reduce verbosity.
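One way such a consolidation step could work, sketched with the OpenAI Python client; the turn counts, the keep-the-last-few heuristic, and the restatement prompt are all assumptions:

```python
# Hypothetical rolling consolidation: every N turns, replace the older exchanges
# with a model-written restatement of the requirements agreed so far.
from openai import OpenAI

client = OpenAI()

def consolidate(history: list[dict], every: int = 8, model: str = "gpt-4o") -> list[dict]:
    if len(history) < every:
        return history
    old, recent = history[:-4], history[-4:]  # keep the last few turns verbatim
    restated = client.chat.completions.create(
        model=model,
        messages=old + [{
            "role": "user",
            "content": "Restate the current requirements and decisions as a concise, "
                       "structured summary. Do not add anything new.",
        }],
    ).choices[0].message.content
    return [{"role": "system", "content": "Requirements so far:\n" + restated}] + recent
```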
Kind of wild how even the best models still struggle with keeping context straight over time. Definitely feels like a big challenge if we want these things to hold real conversations.
The more we chat, the more irrelevant details pile up. For example, a small mention early on might get repeated or build on itself, leading to a lot of unnecessary context.
As the conversation continues, it becomes harder for the model to focus on the main point because it gets tangled in all the extra information. Unlike humans, who can intuitively filter out the noise, LLMs struggle to keep track of what’s truly important in longer, more complex exchanges.
My take: multi-turn evals are hard because, to do them really correctly, you have to simulate a user. User simulation is not yet modeled well enough for multi-turn to work as well as it could.
I believe we're already using LLMs to evaluate LLM output for training; I wonder if there's some variation of that which could be used to identify when an LLM gets "stuck".

I guess chain of thought should, in theory, do that, but variations on prompt and context might behave differently?
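A rough sketch of an LLM-as-judge check along those lines, assuming the OpenAI Python client; the judge prompt, model, and yes/no protocol are invented for illustration:

```python
# Ask a cheap judge model whether the recent transcript looks "stuck".
from openai import OpenAI

client = OpenAI()

def seems_stuck(history: list[dict], model: str = "gpt-4o-mini") -> bool:
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history[-10:])
    verdict = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Does the assistant in this transcript appear stuck "
                       "(repeating itself, circling a wrong assumption, ignoring "
                       "corrections)? Answer YES or NO.\n\n" + transcript,
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")

# If seems_stuck(...) returns True, summarize the goal and restart a fresh conversation.
```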
Ha, kind of funny to see this right now. I've been fighting Copilot in VS Code, trying to get it to output anything once I take the context down to a very specific problem. At a certain point it feels like I have to reset and almost re-ground the model in what I'm trying to accomplish.
I've seen deepseek-coder running locally get into an infinite loop, generating the same line over and over, which I assume (without evidence) is some sort of feedback from the generated line back into the generation process. So it kind of gets lost in thought and goes off topic from the simple .h API that my prompt asked for.
Any reason not to call bullshit on this paper?

One of the biggest developments in language models over the last year has been test-time reasoning (aka inference scaling, or "thinking"). Most of the vendors tested offer such a model. It's plausible it could make a huge difference here, and they did not bother to test it or even mention it?

Things like CoT and planning can really affect this, and those are just a couple of things that happen automatically in more advanced models.

It seems like it wouldn't have been hard to add this to the experiment; failing that, they could've called it out in a "Limitations" or "Future Work" section, or at least with a single sentence like "We did not test chain-of-thought prompting, which may mitigate some of these issues".
Humans also often get lost in multi-turn conversation.

I have experienced that in person many, many times: jumps in context that seem easy for one person to follow, but very hard for others.

So, assuming the paper is legit (arXiv, you never know...), this is more like something that could be improved than a fundamental difference from human beings.