
The unreasonable effectiveness of an LLM agent loop with tool use

433 points · by crawshaw · 3 days ago · 40 comments

libraryofbabel · 3 days ago

Strongly recommend this blog post too, which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from zero: https://ampcode.com/how-to-build-an-agent

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
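[Editor's note] The loop this thread keeps describing is small enough to sketch directly. Below is an illustrative skeleton, not any vendor's actual code: `model` stands in for a real LLM API call, and the message/reply format is invented for the example.

```python
def agent_loop(model, tools, user_message, max_turns=10):
    """Run the model, execute any tool it requests, feed the result back,
    and repeat until the model answers in plain text."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = model(messages)            # hypothetical: returns a dict
        messages.append({"role": "assistant", "content": reply})
        if reply.get("tool") is None:      # no tool requested -> final answer
            return reply["text"], messages
        result = tools[reply["tool"]](**reply["args"])  # execute the tool
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_turns")
```

Swapping `model` for a function that calls a real chat endpoint (and mapping the messages to that vendor's format) is essentially all there is; the rest of any coding agent is polish around this core.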
kgeist · 3 days ago

Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register that the syntax was off - it just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash; it sounds way too dangerous.
tqwhite · 2 days ago

I've been using Claude Code, i.e., a terminal interface to Sonnet 3.7, since the day it came out in mid-March. I have done substantial CLI apps, full-stack web systems and a ton of utility crap. I am much more ambitious because of it, much as I was in the past when I was running a programming team.

I'm sure it is much the same as this under the hood, though Anthropic has added many insanely useful features.

Nothing is perfect. Producing good code requires about the same effort as it did when I was running said team. It is possible to get complicated things working and find oneself in a mess where adding the next feature is really problematic. As I have learned to drive it, I have had to do much less remediation and refactoring. That will never go away.

I cannot imagine what happened to poor kgeist. I have had Claude make choices I wouldn't and do some stupid stuff, but never enough that I would even think about giving up on it. Almost always, it does a decent job and, for most stuff, the amount of work it takes off of my brain is IMMENSE.

And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better, and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.

And, just for fun, I opened it in a non-code archive directory. It was a junk drawer that I've been filling for thirty years. "What's in this directory?" "Read the old resumes and write a new one." "What are my children's names?" Also amazing.

And this is still early days. I am so happy.
simonw · 2 days ago

I'm *very* excited about tool use for LLMs at the moment.

The trick isn't new - I first encountered it with the ReAct paper two years ago - https://til.simonwillison.net/llms/python-react-pattern - and it's since been used for ChatGPT plugins, and recently for MCP, and all of the models have been trained with tool use / function calls in mind.

What's interesting today is how GOOD the models have got at it. o3/o4-mini's amazing search performance is all down to tool calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my Mac) can do tool calling reasonably well now.

I gave a workshop at PyCon US yesterday about building software on top of LLMs - https://simonwillison.net/2025/May/15/building-on-llms/ - and used that as an excuse to finally add tool usage to an alpha version of my LLM command-line tool. Here's the section of the workshop that covered that:

https://building-with-llms-pycon-2025.readthedocs.io/en/latest/tools.html

My LLM package can now reliably count the Rs in strawberry as a shell one-liner:

    llm --functions '
    def count_char_in_string(char: str, string: str) -> int:
        """Count the number of times a character appears in a string."""
        return string.lower().count(char.lower())
    ' 'Count the number of Rs in the word strawberry' --td
suninsight · 2 days ago

It only seems effective until you start using it for actual work. The biggest issue: context. All tool use creates context. Large code bases come with large context right off the bat. LLMs seem to work, unless they are hit with a sizeable context. Anything above 10k and the quality seems to deteriorate.

The other issue is that LLMs can go off on a tangent. As context builds up, they forget what their objective was. One wrong turn, and down the rabbit hole they go, never to recover.

The reason I know is that we started solving these problems a year back. And we aren't done yet. But we have covered a lot of distance.

[Plug]: Try it out at https://nonbios.ai:

- Agentic memory → long-horizon coding

- Full Linux box → real runtime, not just toy demos

- Transparent → see & control every command

- Free beta — no invite needed. Works with throwaway email (mailinator etc.)
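[Editor's note] One common mitigation for the context-growth problem described above is to trim the history before each model call: keep the objective and the newest messages, drop the middle. A rough sketch under stated assumptions - word counts stand in for a real tokenizer, and this is illustrative, not nonbios's actual approach:

```python
def trim_history(messages, budget):
    """Keep the first (objective) message plus as many of the newest
    messages as fit within `budget` words, dropping the middle."""
    def cost(m):
        return len(m["content"].split())  # crude proxy for token count
    if not messages:
        return []
    kept_first = messages[0]              # preserve the objective/system message
    total = cost(kept_first)
    tail = []
    for m in reversed(messages[1:]):      # walk newest-first
        if total + cost(m) > budget:
            break
        tail.append(m)
        total += cost(m)
    return [kept_first] + tail[::-1]
```

A real agent would also summarize the dropped middle rather than discard it, but even this blunt version keeps the objective from scrolling out of view.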
benoau · 2 days ago

This morning I used Cursor to extract a few complex parts of my game prototype's "main loop", and then generate a suite of tests for those parts. In total I have 341 tests written by Cursor covering all the core math and other components.

It has been a bit like herding cats sometimes - it will run away with a bad idea real fast - but the more constraints I give it (telling it what to use, where to put it, giving it a file for a template, telling it what not to do), the better the results I get.

In total it's given me 3,500 lines of test code that I didn't need to write, don't need to fix, and can delete and regenerate if underlying assumptions change. It's also helped tune difficulty curves, generate mission variations and more.
cadamsdotcom · 2 days ago

> "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.

Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is given instructions to keep the main LLM behaving according to instructions.

The companion LLM could - in real time - ban the coding LLM from emitting "let's just skip it" by seeing the tokens "let's just" and then biasing the output such that the word "skip" becomes impossible to emit.

Banning the word "skip" from following "let's just" forces the LLM down a new path, away from the undesired behavior.

It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.

If the idea works, you could prompt the companion LLM to do more advanced stuff - e.g. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will follow would go into the companion LLM's prompt instead.

Wonder what the Outlines folks think of this!
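[Editor's note] The "ban a continuation" idea above can be sketched without any LLM at all, as a streaming filter over decoder candidates. A real implementation would bias logits inside the decoder; here we approximate by rejecting a banned token and taking the model's next-best alternative. All names are illustrative.

```python
# Bigrams the policy refuses to let through.
BANNED_BIGRAMS = {("just", "skip"), ("just", "disable")}

def policy_filter(token_stream):
    """Yield tokens, swapping in the alternative when the top candidate
    would complete a banned bigram.

    `token_stream` yields (best_token, alternative_token) pairs, standing in
    for a decoder's top-2 candidates at each step.
    """
    prev = None
    for best, alt in token_stream:
        tok = alt if (prev, best) in BANNED_BIGRAMS else best
        yield tok
        prev = tok
```

The same shape generalizes to the companion-LLM version: replace the static `BANNED_BIGRAMS` lookup with a model call that scores the pending continuation against a policy prompt.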
magicalhippo · 2 days ago

Assuming the title is a play on the paper "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"[1][2] by Eugene Wigner.

[1]: https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences

[2]: https://www.hep.upenn.edu/~johnda/Papers/wignerUnreasonableEffectiveness.pdf
outworlder · 2 days ago

> If you don't have some tool installed, it'll install it.

Terrifying. LLMs are very 'accommodating' and all they need is someone asking them to do something. This is like SQL injection, but worse.
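[Editor's note] A minimal guard in the spirit of this concern: before executing a model-proposed shell command, check it against an allowlist instead of trusting the model. Purely illustrative - real sandboxing needs OS-level isolation (containers, seccomp), not string checks alone.

```python
import shlex

# Programs the agent is allowed to invoke; everything else is refused.
ALLOWED = {"ls", "cat", "git", "grep"}

def is_permitted(command: str) -> bool:
    """Allow only commands whose program is on the allowlist, with no shell
    metacharacters that could chain or redirect into extra commands."""
    if any(ch in command for ch in ";|&`$><"):
        return False
    parts = shlex.split(command)
    return bool(parts) and parts[0] in ALLOWED
```

This catches the obvious "install it for me" and pipe-to-shell patterns, though it is deliberately conservative: anything surprising gets bounced back to the human.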
_bin_ · 3 days ago

I've found Sonnet 3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off track and run off and do weird things.

3.5 is better for this, IME. I hooked Claude Desktop up to an MCP server to fake claude-code minus the extortionate pricing, and it works decently. I've been trying to apply it to Rust work; it's not great yet (still doesn't really seem to "understand" Rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if it doesn't.

I expect something like o3-high is the best out there (Aider leaderboards support this), either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost on to me.
jbellis · 3 days ago

Yes!

Han Xiao at Jina wrote a great article that goes into a lot more detail on how to turn this into a production-quality agentic search: https://jina.ai/news/a-practical-guide-to-implementing-deepsearch-deepresearch/

This is the same principle that we use at Brokk for Search and for Architect. (https://brokk.ai/)

The biggest caveat: some models just suck at tool calling, even "smart" models like o3. I only really recommend Gemini Pro 2.5 for Architect (smart + good tool calls); Search doesn't require as high a degree of intelligence and lots of models work (Sonnet 3.7, GPT-4.1, Grok 3 are all fine).
kuahyeow · 2 days ago

What protection do people use when enabling an LLM to run `bash` on your machine? Do you run it in a Docker container / LXC boundary? `chroot`?
stpedgwdgfhgdd · 2 days ago

Bit off-topic, but worth sharing:

Yesterday was a milestone for me: I connected Claude Code through MCP with Jira (SSE). I asked it to create a plan for a specific Jira issue - ah, excuse me, work item.

CC created the plan based on the item's description and started coding. It created a branch (wrong naming convention, needs a fix), made the code changes and pushed. Since the Jira item had a good description, the plan was solid, and the code so far is as well.

Disclaimer: this was a simple problem to solve, but the code base is pretty large.
mtaras · 2 days ago

Just a couple of days ago I discovered this truth myself while building a proactive personal assistant. It boiled down to just giving it access to managing notes and messaging me, and calling it periodically with the chat history and its notes provided. It's surprisingly intelligent and helpful, even though I'm using a model that's far from SOTA (Gemini Flash 2.5).
rbren · 2 days ago

If you're interested in hacking on agent loops, come join us in the OpenHands community!

Here's our (slightly more complicated) agent loop: https://github.com/All-Hands-AI/OpenHands/blob/f7cb2d0f64666e1f090a5152d7c002aa6f28caf9/openhands/controller/agent_controller.py#L771
mukesh610 · 2 days ago

I built this very same thing today! The only difference is that I pushed the tool call outputs into the conversation history and re-sent it to the LLM for it to summarize, or perform further tool calls if necessary, automagically.

I used Ollama to build this, and Ollama supports tool calling natively by passing `tools=[...]` in the Python SDK. The tools can be regular Python functions with docstrings that describe the tool use. The SDK handles converting the docstrings into a format the LLM can recognize, so my tool's code documentation becomes the model's source of truth. I can also include usage examples right in the docstring to guide the LLM to work closely with all my available tools. No system prompt needed!

Moreover, I wrote all my tools in a separate module, and just use `inspect.getmembers` to construct the `tools` list that I pass to Ollama. So when I need to write a new tool, I just write another function in the tools module and it Just Works™.

Paired with Qwen 32B running locally, I was fairly satisfied with the output.
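[Editor's note] The `inspect.getmembers` pattern described above looks roughly like this. The module and function names are invented for illustration; Ollama's Python SDK does accept plain functions in `tools=[...]` and reads their docstrings.

```python
import inspect
import types

# Stand-in for a separate `tools` module; in real use this would be
# `import tools as tools_module`.
tools_module = types.ModuleType("tools")

def word_count(text: str) -> int:
    """Count whitespace-separated words in `text`.

    Example: word_count("hello world") -> 2
    """
    return len(text.split())

tools_module.word_count = word_count

def collect_tools(module):
    """Return every public function defined in `module`."""
    return [fn for name, fn in inspect.getmembers(module, inspect.isfunction)
            if not name.startswith("_")]

tools = collect_tools(tools_module)
# In real use: ollama.chat(model="qwen3:32b", messages=..., tools=tools)
```

Adding a tool is then just adding a function to the module; the docstring doubles as the schema the model sees.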
lacker · 2 days ago

I've been using Claude Code, and I really prefer the command line to the IDE-integrated ones. I'm curious about Gemini's increased context size, though. Is anyone successfully using one of the open-source CLI agents together with Gemini, and has something to recommend there?
Koshima · 2 days ago

It's fascinating how quickly the ecosystem around LLM agents is evolving. I think a big part of this "unreasonable effectiveness" comes from the fact that most of these tools are essentially chaining high-confidence steps together without requiring perfect outputs at each stage. The trick is finding the right balance between autonomy and supervision. I wonder if we'll soon see an "agent stack" emerge, similar to the full-stack frameworks in web development, where different layers handle prompts, memory, tool calls, and state management.
SafeDusk · 2 days ago

Sharing an agent framework (from scratch) that works very well with just 7 composable tools: read, write, diff, browse, command, think and ask.

Just pushed an update this week for OpenAI compatibility too!

https://github.com/aperoc/toolkami
jawns · 2 days ago

Not only can this be an effective strategy for coding tasks, it can also be used for data querying. Picture a text-to-SQL agent that can query database schemas, construct queries, run explain plans, inspect the error outputs, and then loop several times to refine. That's the basic architecture behind a tool I built, and I have been amazed at how well it works. There have been multiple times when I've thought, "Surely it couldn't handle THIS prompt," but it does!

Here's an AWS post that goes into detail about this approach: https://aws.amazon.com/blogs/machine-learning/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources/
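[Editor's note] The self-correcting query loop described above fits the same agent-loop shape. A sketch using sqlite3 so it runs anywhere; `propose_sql` stands in for the LLM call, and none of this is the commenter's or AWS's actual code:

```python
import sqlite3

def query_with_retries(conn, propose_sql, question, max_attempts=3):
    """Ask for SQL, run it, and feed any error message back for refinement."""
    error = None
    for _ in range(max_attempts):
        sql = propose_sql(question, error)       # LLM (or stub) drafts a query
        try:
            return conn.execute(sql).fetchall()  # success: return the rows
        except sqlite3.Error as e:
            error = str(e)                       # failure: loop with the error
    raise RuntimeError(f"gave up after {max_attempts} attempts: {error}")
```

The database engine's own error messages ("no such table", "no such column") are exactly the feedback signal that lets the model correct a hallucinated name on the next pass.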
Quenby · 2 days ago

Although AI can help us with many repetitive tasks, can it always remain reliable when handling complex situations? As the article suggests, simple loops and tool integration are effective, but once complexity increases, the limitations of AI may become apparent. So we should continue to improve these systems, ensuring they can work reliably in more scenarios.
BrandiATMuhkuh · 2 days ago

That's really cool. A week ago I implemented an SQL tool and it works really well. But sometimes it still just makes up table/column names. Luckily it can read the error and correct itself.

But today I went to the next level. I gave the LLM two tools: one web search tool and one REST tool.

I told it at what URL it can find the API docs. Then I asked it to perform some tasks for me.

It was really cool to watch an AI read docs, make API calls and try again (REPL) until it worked.
baalimago · 2 days ago

I'm just going to shamelessly self-plug the blog post I wrote about this in August last year: https://lorentz.app/blog-item.html?id=clai&heading=tooling
danjc · 2 days ago

We built tools to give context to an AI chat assistant embedded in our product. Included is the ability for it to see recent activity logs, the definition of the current object, and the ability to search and read help articles.

The quality of the chats still amazes me months later.

Where we find it got something wrong, we add more detail to the relevant help articles.
neumann · 2 days ago

This is great, and I like seeing all the implementations people are making for themselves.

Is anyone using any open-source tooling that bundles this effectively and allows different local models to be used in this fashion?

I am thinking it would be nice to run this fully locally to access my code or my private GitHub repos from the command line and switch models out (presumably through llama.cpp or Ollama).
hbbio · 2 days ago

Yes, agent loops are simple, except, as the article says, a bit of "pump and circumstance"!

If anyone is interested, I tried to put together a minimal library (no dependencies) for TypeScript: https://github.com/hbbio/nanoagent
themichaellai · 2 days ago

I've also been defining agents as "an LLM call in a while loop with tools" to my co-workers - I'd add that if you provide it something like a Slack tool, you can enable the LLM to ask for help (human in the (while) loop).
bicepjai · 2 days ago

Which agents are token-hungry? I notice Cline is at the top of the list. Roo eats less than Cline. Are there agents where we can configure how the interactions go? How does Claude Code compare to other agents?
bhouston · 3 days ago

I found this out too - it is quite easy and effective:

https://benhouston3d.com/blog/building-an-agentic-code-from-scratch
mips_avatar · about 19 hours ago

Really good explanation of agent loops!
andes314 · 2 days ago

This is what the no-code API-to-MCP creator at usetexture.com uses! I was surprised to find out this is not what the Claude client uses (as of May 2025).
amelius · 2 days ago

Huh, isn't this already built in to most chat UIs?
bdbenton5255 · 2 days ago

Woke up this morning to start on a new project.

Started with a math visualizer for machine learning, saw an HN post for this soon after and scrapped it. It was better done by someone else.

Started on an LLM app that looped outputs, saw this post soon after and scrapped it. It was better done by someone else.

It is like every single original notion I have is immediately done by someone else at the exact same time.

I think I will just move on to rudimentary systems-programming stuff and avoid creative and original thinking; I just need basic and low-profile employment.
tom_m · 1 day ago

Definitely going to have some "super worms" out there in the future.
amunozo · 2 days ago

Has anyone tried to build something like this with a local model such as Qwen 3?
polishdude20 · 2 days ago

I'd love to try this to help me handle a complex off-by-one-cent problem I'm having.
stavros · 2 days ago

Unfortunately, I haven't had a good experience with tool use. I built a simple agent that can access my calendar and add events, and GPT-4.1 regularly gaslights me, saying "I've added this to your calendar" when it hasn't. It takes a lot of prompting for it to actually add the event, but it will insist it has, even though it never called the tool.

Does anyone know of a fix? I'm using the OpenAI Agents SDK.
pawanjswal · 2 days ago

Wild how something so powerful boils down to just 9 lines of code.
subtlesoftware · 2 days ago

Link to the full script is broken?
jonstewart · 2 days ago

Maybe I'm just writing code differently than many people, but I don't spend much time executing complicated or unique shell commands. I write some code, I run make check (alias "mkc"), I run git add --update (alias "gau"), I review my commit with git diff --cached ("gdc"), and I commit ("gcm").

I can see how an LLM is useful when needing to research which tool arguments to use for a particular situation, but particular situations are infrequent. And based on how frequently wrong coding assistants are with their suggestions, I am leery of letting them run commands against my filesystem.

What am I missing?