Wasting Inferences with Aider

139 points by Stwerner about 1 month ago

21 comments

fxtentacle about 1 month ago
For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same beginner's mistake as the last person did the day before. Eventually, I'd rather spend half an hour of my own time than explain the problem once more...

Why anyone thinks having 3 different PRs for each Jira ticket might boost productivity is beyond me.

Related anime: I May Be a Guild Receptionist, But I'll Solo Any Boss to Clock Out on Time
denidoman about 1 month ago
The current challenge is not to create a patch, but to verify it.

Testing a fix in a big application is a very complex task. First of all, you have to reproduce the issue and verify the steps (or create them, because many issues don't contain a clear description). Then you should switch to the fixed version and make sure that the issue no longer exists. Finally, you should apply a little exploratory testing to make sure that the fix hasn't corrupted neighbouring logic (deep application knowledge is required for this).

To perform these steps you have to deploy staging with the original/fixed versions, or run everything locally and do the pre-setup (create users, entities, etc. to reach the corrupted state).

This is a very challenging area for the current agents. Right now they simply can't do these steps - their mental models just aren't ready for such a level of integration into the app and infra. And creating 3/5/10/100 unverified pull requests just slows down the software development process.
wrs about 1 month ago
I’ve been using Cursor and Code regularly for a few months now and the idea of letting three of them run free on the codebase seems insane. The reason for the chat interface is that the agent goes off the rails on a regular basis. At least 25% of the time I have to hit the stop button and go back to a checkpoint because the automatic lawnmower has started driving through the flowerbed again. And paradoxically, the more capable the model gets, the more likely it seems to get random ideas of how to fix things that aren’t broken.
tekacs about 1 month ago
Over the last two days, I've built out support for autonomy in Aider (a lot like Claude Code) that hybridizes with the rest of the app:

https://github.com/Aider-AI/aider/pull/3781

Edit: In case anyone wants to try it, I uploaded it to PyPI as `navigator-mode`, until (and if!) the PR is accepted. By I, I mean that it uploaded itself. You can see the session where it did that here: https://asciinema.org/a/9JtT7DKIRrtpylhUts0lr3EfY

Edit 2: And as a Show HN, too: https://news.ycombinator.com/item?id=43674180

And, because Aider's already an amazing platform without the autonomy, it's very easy to use the rest of Aider's options, like using `/ask` first, or using `/code` or `/architect` for specific tasks [1]. But if you start in `/navigator` mode (which I built, here), you can just... ask for a particular task to be done and... wait, and it'll often 'just get done'.

It's... decidedly expensive to run an LLM this way right now (Gemini 2.5 Pro is your best bet), but if it's $N today, I don't doubt that it'll be $0.N by next year.

I don't mean to speak in meaningless hype, but I think that a lot of folks who are speaking to LLMs' 'inability' to do things are also spending relatively cautiously on them, when tomorrow's capabilities are often here, just pricey.

I'm definitely still intervening as it goes (as in the Devin demos, say), but I'm also having LLMs relatively autonomously build out large swathes of functionality, the kind that I would put off or avoid without them. I wouldn't call it a programmer-replacement any time soon (it feels far from that), but I'm solo finishing architectures now that I know how to build, but where delegating them to a team of senior devs would've resulted in chaos.

[1]: Also, for anyone who hasn't tried it and doesn't like TUIs, do note that Aider has a web mode and a 'watch mode', where you can use your normal editor and if you leave a comment like '# make this darker ai!', Aider will step in and apply the change. This is even fancier with navigator/autonomy.
pton_xd about 1 month ago
The trend with LLMs so far has been: if you have an issue with the AI, wait 6 months for a more advanced model. Cobbling together workarounds for their deficiencies is basically a waste of effort.
danenania about 1 month ago
Plandex[1] uses a similar "wasteful" approach for file edits (note: I'm the creator). It orchestrates a race between diff-style replacements plus validation, writing the whole file with edits incorporated, and (on the cloud service) a specialized model plus validation.

While it sounds wasteful, the calls are all very cheap since most of the input tokens are cached, and once a valid result is achieved, the other in-flight requests are cancelled. It's working quite well, allowing for quick results on easy edits, with fallbacks for more complex changes/large files that don't feel incredibly slow.

1 - https://github.com/plandex-ai/plandex
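The race-and-cancel pattern described above can be sketched with `asyncio` (a minimal illustration with hypothetical strategy names, not Plandex's actual code, which is written in Go):

```python
import asyncio

async def race_edit_strategies(strategies):
    """Run several edit strategies concurrently; return the first result
    that passes validation and cancel the remaining in-flight attempts."""
    tasks = [asyncio.ensure_future(s()) for s in strategies]
    try:
        for done in asyncio.as_completed(tasks):
            result = await done
            if result is not None:  # None means the attempt failed validation
                return result
        return None  # every strategy failed
    finally:
        for t in tasks:
            if not t.done():
                t.cancel()

# Hypothetical strategies: a cheap diff-style edit vs. a full-file rewrite.
async def diff_edit():
    await asyncio.sleep(0.01)   # fast, but may not apply cleanly
    return None                 # pretend validation rejected the diff

async def whole_file_rewrite():
    await asyncio.sleep(0.05)   # slower, more reliable fallback
    return "edited file contents"

result = asyncio.run(race_edit_strategies([diff_edit, whole_file_rewrite]))
print(result)  # → edited file contents
```

The key property is in the `finally` block: as soon as any strategy produces a validated result, the slower (and potentially more expensive) requests are cancelled, so the "waste" is bounded.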
kgeist about 1 month ago
I've noticed that large models from different vendors often end up converging on more or less the same ideas (probably because they're trained on more or less the same data). A few days ago, I asked both Grok and ChatGPT to produce several stories with an absurd twist, and they consistently generated the same twists, differing only in minor details. Often, they even used identical wording!

Is there any research into this phenomenon? Is code generation any different? Isn't there a chance that several "independent" models might produce the same (say, faulty) result?
joshstrange about 1 month ago
This is a very interesting idea, and I really should consider Aider in the "scriptable" sense more; I only use it interactively.

I might add another step after each PR is created where other agents review and compare the results (maybe have the other 2 agents review the first agent's code?).
DeathArrow about 1 month ago
I don't really think having an agent fleet is a much better solution than having a single agent.

We would like to think that having 10 agents working on the same task will improve the chances of success 10x. But I would argue that some classes of problems are hard for LLMs, and where one agent will fail, 10 agents or 100 agents will fail too.

As an easy example I suggest LeetCode hard problems.
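This intuition can be made concrete with a toy model (my numbers and formula, not from the article): if agents fail independently with probability p, the chance that all n fail is p^n, so redundancy helps enormously. But if the agents share the same training data and blind spots, their failures are correlated, and extra agents barely move the needle.

```python
def p_all_fail(p_fail, n, correlation=0.0):
    """Toy model of an n-agent fleet all failing on one task.
    `correlation` interpolates between fully independent failures (0.0)
    and perfectly correlated failures (1.0), where one shared blind spot
    means every agent fails together."""
    independent = p_fail ** n   # all fail by separate bad luck
    correlated = p_fail         # shared blind spot: one fails => all fail
    return correlation * correlated + (1 - correlation) * independent

# Easy task, independent agents: redundancy helps a lot.
print(p_all_fail(0.5, 10))                    # → 0.0009765625
# Hard task where models share blind spots: 10 agents ~ 1 agent.
print(p_all_fail(0.9, 10, correlation=0.9))   # ~0.845, barely better than 0.9
```

Under high correlation, going from 1 agent to 10 only drops the all-fail probability from 0.9 to roughly 0.845, which is the "leetcode hard" scenario the comment describes.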
IshKebab about 1 month ago
We're going to have no traditional programming in 2 years? Riiight.

It would also be nice to see a demo where the task was something that I couldn't have done myself in essentially no time. Like, what happens if you say "tasks should support tags, and you should be able to filter/group tasks by tag"?
canterburry about 1 month ago
I wouldn't be surprised if someone tries to leverage this in their customer feature request tool.

Imagine having your customers write feature requests for your SaaS that immediately trigger code generation and a PR. A virtual environment with that PR is spun up and served to that customer for feedback and refinement. Loop until the customer has implemented the feature they would like to see in your product.

Enterprise plan only, obviously.
aqme28 about 1 month ago
It's cute, but I don't see the benefit. In my experience, if one LLM fails to solve a problem, the other ones won't be too different.

If you picked a problem where LLMs are good, now you have to review 3 PRs instead of just 1. If you picked a problem where they're bad, now you have 3 failures.

I think there are not many cases where throwing more attempts at the problem is useful.
emorning3 about 1 month ago
I see 'Wasting Inferences' as a form of abductive reasoning.

I see LLMs as a form of inductive reasoning, and so I can see how WI could extend LLMs. Also, I have no doubt that there are problems that can't be solved with just an LLM but would need abductive extensions.

The same comments apply to deductive (logical) extensions to LLMs.
phamilton about 1 month ago
Sincere question: has anyone figured out how we're going to code review the output of an agent fleet?
precompute about 1 month ago
Feels like a way to live with a bad decision rather than getting rid of it.
lherron about 1 month ago
I love this! I have a similar automation for moving a feature through ideation/requirements/technical design, but I usually dump the result into Cursor for the last mile and to save on inference. Seeing the cost analysis is eye-opening.

There's probably also some upside to running the same model multiple times. I find Sonnet will sometimes fail; I'll roll back and try again with the same prompt but a clean context, and it will succeed.
KTibow about 1 month ago
I wonder if using thinking models would work better here. They generally have less variance and consider more options, which could achieve the same goal.
billmalarky about 1 month ago
I've been lucky enough to have a few conversations with Scott a month or so ago, and he is doing some really compelling work around the AI SDLC and creating a factory-line approach to building software. Seriously folks, I recommend following this guy closely.

There's another guy in this space I know who's doing similar incredible things, but he doesn't really speak about it publicly, so I don't want to discuss it w/o his permission. I'm happy to make an introduction for those interested, just hmu (check my profile for how).

Really excited to see you on the FP of HN, Scott!
evertedsphere about 1 month ago
Love to see "Why It Matters" turn into the heading equivalent of "delve" in body text (although different, in that the latter is a legitimate word while the former is a "we need to talk about…"-level turn of phrase).
dimal about 1 month ago
Makes me think of The Sorcerer's Apprentice.
charlie0 about 1 month ago
The 10 cents is BS. It was only that cheap because it was a trivial bug. A non-trivial bug requires context, and the more context something requires, the more expensive it gets. Also, once you are working with larger apps you have to pick the context yourself, especially with LLMs that have smaller windows.