
AI agents: Less capability, more reliability, please

423 points by serjester, about 1 month ago

59 comments

simonw, about 1 month ago

Yeah, the "book a flight" agent thing is a running joke now - it was a punchline in the Swyx keynote for the recent AI Engineer event in NYC: https://www.latent.space/p/agent

I think this piece is underestimating the difficulty involved here, though. If only it were as easy as "just pick a single task and make the agent really good at that"!

The problem is that if your UI involves human beings typing or talking to you in a human language, there is an unbounded set of ways things could go wrong. You can't test against every possible variant of what they might say. Humans are bad at clearly expressing things, but even worse is the challenge of ensuring they have a concrete, accurate mental model of what the software can and cannot do.
wiradikusuma, about 1 month ago

Booking a flight is actually a task I cannot outsource to a human assistant, let alone AI. Maybe it's a third-world problem, or just me being cheap, but there are heuristics involved when booking flights for a family trip, or even just for myself.

Check the official website, compare pricing with aggregators, check other dates, check people's availability on the cheap dates. Sometimes I only do the first step, if the official price is reasonable (I travel 1-2x a month, so I have an expectation of how much it should cost).

Don't get me started if I also consider which credit card to use for the points rewards.
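That early-exit heuristic is concrete enough to sketch as an explicit workflow. A minimal sketch, with every data source stubbed out: `official_price`, `aggregator_price`, and the fare prior are invented stand-ins, not a real API.

```python
EXPECTED_FARE = 250.0  # rough prior from flying 1-2x a month

def official_price(route: str, date: str) -> float:
    # stub: would check the airline's own website
    return {"2025-05-02": 310.0, "2025-05-04": 230.0}.get(date, 280.0)

def aggregator_price(route: str, date: str) -> float:
    # stub: would query an aggregator for the same itinerary
    return official_price(route, date) - 10.0

def pick_fare(route: str, dates: list[str]) -> tuple[str, float, str]:
    """Return (date, price, source) using the early-exit heuristic."""
    # Step 1: if the official fare on the preferred date looks reasonable, stop.
    base = official_price(route, dates[0])
    if base <= EXPECTED_FARE:
        return dates[0], base, "official"
    # Step 2: otherwise compare aggregator quotes across alternative dates.
    price, date = min((aggregator_price(route, d), d) for d in dates)
    return date, price, "aggregator"

print(pick_fare("CGK-SIN", ["2025-05-02", "2025-05-04"]))
```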
extr, about 1 month ago

The problem I find in many cases is that people are constrained by their imagination of what's possible, so they target existing workflows for AI. But existing workflows exist for a reason: someone already wanted to do that, and countless man-hours have gone into optimizing the UX/UI. And by definition those workflows were possible before AI, so using AI for them is a bit of a solution in search of a problem.

Flights are a good example, but I often cite Uber as another. Nobody wants to tell their assistant to book them an Uber - the UX/UI is so streamlined and easy, it's almost always easier to just do it yourself (or if you are too important for that, you probably have a private driver already). Basically anything you can do with an iPhone and the top 20 apps is in this category. You are literally competing against hundreds of engineers/product designers whose only goal was to build the best possible experience for accomplishing X. Even if LLMs would have been helpful a priori, they aren't after every edge case has already been enumerated and planned for.
peterjliu, about 1 month ago

We (ex-Google DeepMind researchers) have been doing research on increasing the reliability of agents and realized it is pretty non-trivial, but there are a lot of techniques to improve it. The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks; we made our own benchmarks to measure progress.

Plug: we just posted a demo of our agent doing sophisticated reasoning over a huge dataset (the JFK assassination files -- 80,000 PDF pages): https://x.com/peterjliu/status/1906711224261464320

Even on small numbers of files, I think there's quite a palpable difference in reliability/accuracy vs. the big AI players.
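A minimal sketch of what a product-representative eval looks like, as opposed to an academic benchmark: cases sampled from real user tasks, scored on every release. `run_agent` and the cases are invented stand-ins.

```python
# Cases sampled from real product usage, not an academic benchmark.
cases = [
    {"task": "find the 1963-11-22 cable mentioning Oswald", "expected": "doc_1042"},
    {"task": "list memos sent to the director in Dec 1963", "expected": "doc_0317"},
]

def run_agent(task: str) -> str:
    return "doc_1042"  # stub: replace with a real call to the agent

def pass_rate(cases: list[dict]) -> float:
    passed = sum(run_agent(c["task"]) == c["expected"] for c in cases)
    return passed / len(cases)

print(f"pass rate: {pass_rate(cases):.0%}")  # track this number per release
```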
joshdavham, about 1 month ago

My rule of thumb has thus far been: if I'm going to allow AI to write any bit of code for me, then I must, at a bare minimum, be able to understand that code.

There's no way I could do what some of these "vibe coders" are doing, letting AI write code for them that they don't even understand.
twotwotwo, about 1 month ago

FWIW, work has pushed use of Cursor and I quickly came around to a related conclusion: given a reliability-vs-anything tradeoff, you more or less always have to prefer reliability. For example, even ignoring subtle head-scratcher bugs, a faster model's output on average needs more revision before it basically works, and on average you end up spending more energy on that than you save by reducing time to first response. Up-front work that decreases the chance of trouble -- detailing how you want something done, explicitly pulling specific libraries into context -- also tends to be worth it on net, even if the agent might have gotten there by searching (or you could have gotten it there through follow-up requests).

That's my experience working with a largeish mature codebase (all on non-prod code), where you can't get far if you can't use various internal libraries correctly. With standalone (or small greenfield) projects, where results can lean more on public info from pre-training and there's not as much project-specific info to pull in, you might see different outcomes.

Maybe the tech and surrounding practice will change over time, but in my short experience it's mostly been about trying to just get to "acceptable" for this kind of task.
gcp123, about 1 month ago

I've spent the last six months building a coding agent at work, and the reliability issues are killing us. Our users don't want 'superhuman' results 10% of the time - they want predictable behavior they can trust.

When we tried the 'full agent' approach (letting it roam freely through our codebase), we ended up with some impressive demos but constant production incidents. We've since pivoted to more constrained workflows with human checkpoints, and while less flashy, user satisfaction has gone way up.

The Cursor wipeout incident is a perfect example. It's not about blaming users who don't understand git - it's about tools that should know better. When I hand my code to another developer, they understand the implied contract of 'don't delete all my shit without asking.' Why should AI get a pass?

Reliable > clever. It's the difference between a senior engineer who delivers consistently and a junior who occasionally writes brilliant code but breaks the build every other week.
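A minimal sketch of that "constrained workflow with human checkpoints" shape: the agent proposes, a human approves, and nothing mutates state before approval. `propose_patch` and `apply_patch` are hypothetical stubs.

```python
def propose_patch(task: str) -> str:
    # stub: the agent's proposed change, surfaced as a reviewable diff
    return "--- a/app.py\n+++ b/app.py\n-    retries = 0\n+    retries = 3\n"

def apply_patch(patch: str) -> None:
    print("applied:\n" + patch)  # stub: would write to the working tree

def run_with_checkpoint(task: str) -> None:
    patch = propose_patch(task)
    print(patch)
    # Nothing mutates state until a human explicitly approves.
    if input("apply this change? [y/N] ").strip().lower() == "y":
        apply_patch(patch)
    else:
        print("discarded; nothing was modified")

run_with_checkpoint("make HTTP retries configurable")
```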
getnormality, about 1 month ago

"Less capability, more reliability, please" is what I want to say about everything that's happened in the past 20 years. Of everything that's arrived in that time, I'm happy to have a few new capabilities: smartphones, driving directions, cloud storage, real-time collaborative editing of documents. I don't need anything else. Now I just want my gadget batteries to last longer, and working parental controls on my kids' devices.
danso, about 1 month ago

I think the replies [0] to the mentioned Reddit thread sum up my (perhaps complacent?) feelings about the current state of automated AI programming:

> *Does it terrify anyone else that there is an entire cohort of new engineers who are getting into programming because of AI, but missing these absolute basic bare necessities?*

> > *Terrify? No, it's reassuring that I might still have a place in the world.*

[0] https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdoksrj/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
dfxm12, about 1 month ago

> *Google Flights already nails this UX perfectly*

Often when using an AI agent, I think to myself that a web search gets me what I need more reliably and just as quickly. Maybe AI has to learn to crawl before it learns to walk, but each agent I use leaves me without confidence that it will ever be useful, and I genuinely wonder whether they've ever been tested before being published...
bhu8, about 1 month ago

I have been thinking about this exact problem for a while and was literally hours away from publishing a blog post on the subject.

+100 on the footnote:

> agents or workflows?

Workflows. Workflows, all the way.

Agents can start using these workflows once they are actually ready to execute stuff with high precision. And by then we will have figured out how to create effective, accurate and easily diagnosable workflows, so people will stop complaining about "I want to know what's going on inside the black box".
SkyPuncher, about 1 month ago

Unfortunately, the chosen example kind of weighs down the point. Cursor has an *extremely* vocal minority (beginner coders) that isn't really representative of its heavyweight users (professional coders). These beginner users face significant issues that come from being new to programming in general. Cursor gives them amazing capabilities, but it also lets them make the same dumb mistakes that most professional developers have made once or twice in their careers.

That being said, back in February I was trying out a bunch of AI personal assistant apps/tools. I found, without fail, every single one of them was advertising features their LLMs could theoretically accomplish but in practice couldn't. Even worse, many of these "assistants" would proactively suggest they could accomplish something, but when you sent them out to do it, they'd tell you they couldn't:

* "Would you like me to call that restaurant?" ... "Sorry, I don't have support for that yet"

* "Would you like me to create a reminder?" ... Created the reminder, but never executed it

* "Do you want me to check their website?" ... "Sorry, I don't support that yet"

Of all the promised features, the only thing I ended up using any of them for was a text-message interface to an LLM. Now that Siri has native ChatGPT support, even that isn't necessary.
narmiouh, about 1 month ago

I feel like OP would have been better off not referencing the viral thread about a developer not using any version control and being surprised when the AI made changes. I don't think anyone who doesn't understand version control should be using a tool like Cursor; there are other SaaS apps that build and deploy apps using AI, and for people with the skill level demonstrated in the thread, that is what they should be using.

It's like saying `rm -rf /` should have more safeguards built in. It feels unfair to call out the AI-based tools for this.
LeifCarrotson, about 1 month ago

Unfortunately, LLMs, natural language, and human cognition largely are what they are. Mix the three together and you don't get reliability as a result.

It's not like there's a lever in Cursor HQ where one side is "Capability" and the other is "Reliability", and they can make things better just by tipping it toward the latter.

You can bias designs and efforts in that direction, and get your tool to output reversible steps or bake sanity checks into blessed actions, but that doesn't change the nature of the problem.
BrenBarn, about 1 month ago

Seems related to another recent post: https://news.ycombinator.com/item?id=43542259

I tend to think that what this article is asking for isn't achievable, because what people mean by "AI" is precisely "we don't know how it works".

An analogy I've sometimes used when talking with people about AI is the "I know a guy" situation. Someone you know comes and tells you "I know a guy who can do X for you", where "do X" is "write your class paper" or "book a flight" or "describe what a supernova is" or "invest your life savings". In this situation, the more important the task, the more you would probably want to know about this "guy". What are his credentials? Has he done this before? How often has he failed? What were the consequences? Can he be trusted? Etc.

The thing that "a guy" and an AI have in common is that you don't know what they're doing. Where they differ is in your ability to gradually gain knowledge. In real life, "know a guy" situations become transformed into something more specific as you gain information about who the person is and how they do what they do, and especially as you understand more about the system of consequences in which they are embedded (e.g., "if this painter had ruined many people's houses he would have been sued into oblivion, or at least I would have heard about it"). And real people are unavoidably embedded in the system of physical reality, which imposes constraints that bound plausibility (e.g., if someone tells you "I know a guy who can paint your entire house in five seconds", you will smell a rat).

Asking for "reliability" means asking for a network of causes and effects that surrounds and supports whatever "guy" or AI you're relying on. At this point I don't see any mechanism to provide that other than social and ultimately legal pressure, and I don't see any strong action being taken in that direction.
jlaneve, about 1 month ago

I appreciate the distinction between agents and workflows - it seems to be commonly overlooked and, in my opinion, helps ground people in reliability vs. capability. Today (and in the near future) there's not going to be "one agent to rule them all", so these LLM workflows don't need to be incredibly capable. They just need to do what they're intended to do _reliably_, and nothing more.

I've started taking a very data-engineering-centric approach to the problem, where you treat an LLM call as you would any other tool in a pipeline, and it's crazy (or maybe not so crazy) what LLM workflows are capable of doing, all with increased reliability. So much so that I've packaged my thoughts / opinions up into an AI SDK for Apache Airflow [1] (one of the more popular orchestration tools data engineers use). This feels like the right approach, and in our customer base / community it also maps perfectly to the organizations that have been most successful. The number of times I've seen companies stand up an AI team without really understanding _what problem they want to solve_...

[1] https://github.com/astronomer/airflow-ai-sdk
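The linked SDK's interface isn't shown in the comment, so here is the same idea sketched in plain Airflow 2.x TaskFlow style (requires apache-airflow): the LLM call is just one task in a DAG, scheduled and retried like any other. `summarize` is a hypothetical, stubbed-out LLM step.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def ticket_triage():
    @task
    def extract() -> list[str]:
        return ["ticket 1 text", "ticket 2 text"]  # stub: pull from a queue

    @task
    def summarize(tickets: list[str]) -> list[str]:
        # stub: one LLM API call, retried and scheduled like any other task
        return [f"summary of: {t}" for t in tickets]

    @task
    def load(summaries: list[str]) -> None:
        print(summaries)  # stub: write results to a table

    load(summarize(extract()))

ticket_triage()
```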
bendyBus, about 1 month ago

"If your task can be expressed as a workflow, build a workflow." 100% true, but the thing all these 'agent pattern' or 'workflow' diagrams miss is that real tasks require back-and-forth with a user, not just a Rube Goldberg machine triggered in response to a _single user message_. What you need is not 'tool use' but something like 'process use'. This is what we did at Rasa, giving you a declarative way to define multi-step processes. An LLM lets you have a fluent conversation, but the execution of the task is pre-defined and deterministic: https://rasa.com/docs/learn/concepts/calm/ The fact that every framework starts with a `while` loop around an LLM and then duct-tapes on some "guardrails" betrays a lack of imagination.
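For reference, the criticized skeleton really is this small. A sketch with `llm` and `run_tool` as hypothetical stubs; real frameworks differ mainly in the duct tape.

```python
def llm(history: list) -> dict:
    return {"action": "done", "answer": "booked"}  # stub: one model call

def run_tool(action: str) -> str:
    return f"result of {action}"                   # stub: tool dispatch

def agent_loop(user_message: str, max_steps: int = 10) -> str:
    history = [user_message]
    for _ in range(max_steps):        # "guardrail": a step cap
        step = llm(history)
        if step["action"] == "done":
            return step["answer"]
        history.append(run_tool(step["action"]))
    return "gave up"                  # "guardrail": a fallback answer

print(agent_loop("book me a flight"))
```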
jedberg, about 1 month ago

I've been working on this problem for a while. There are whole companies that do this. They all work by having a human review a sample of the results and score them (with various uses of magic to make that more efficient), and then suggest changes to make the system more accurate in the future.

The best companies can get up to 90% accuracy. Most are closer to 80%.

But it's important to remember that we're expecting perfection here. Think about this: have you ever asked someone to book a flight for you? How did it go?

At least in my experience, there are usually a few back-and-forth emails, and then something is always not quite right or as good as if you did it yourself, but you're OK with that because it saved you time. The one thing that makes it better is if the same person does it for you a couple of times and learns your specific habits and what you care about.

I think the biggest problem in AI accuracy is expecting the AI to be better than a human.
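A minimal sketch of that review loop, with made-up numbers: humans score a random slice of agent outputs, and the sample estimates overall accuracy with an error bar.

```python
import math
import random

outputs = [{"id": i} for i in range(1000)]        # one day of agent results
sample = random.sample(outputs, 50)               # slice sent to human reviewers

scores = [random.random() < 0.85 for _ in sample] # stub: human pass/fail marks
p = sum(scores) / len(scores)
stderr = math.sqrt(p * (1 - p) / len(scores))     # normal-approximation error bar
print(f"estimated accuracy: {p:.0%} +/- {1.96 * stderr:.0%}")
```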
_cs2017_, about 1 month ago

Does anyone have AI agent use cases that you think might happen within this year and that feel very exciting to you?

I personally struggle to find a new one (AI agent coding assistants already exist, and of course I'm excited about them, especially as they get better). I will not, any time soon, trust unsupervised AI to send emails on my behalf, make travel reservations, or perform other actions that are very costly to fix. AI as a shopping agent just isn't too exciting for me, since I don't believe I actually know what features in a speaker / laptop / car I want until I do my own research by reading what experts and users say.
hirako2000, about 1 month ago

The problem with Devin wasn't that it was a black box doing too much. It's that the demoed outcomes were fake and what was inside the box wasn't an "AI engineer."

Transparency? If it worked even unreliably, nobody would care what it does. The problem is that stochastic machines aren't engineers, don't reason, and are not intelligent.

I find articles that attack AI but blame some mouse rather than pointing at the elephant exhausting.
tristor, about 1 month ago

The thing I most want an AI agent to do is something I can't trust to any third party: it'd need to be local, and it's something well within LLM capabilities today. I just want a "secretary in my pocket" to take notes during conversations and produce minutes, but do so in a way that's secure and privacy-respecting (e.g. I can use it at work or at home).
kuil009, about 1 month ago

It's natural to expect reliability from AI agents -- but I don't think Cursor is a fair example. It's a developer tool deeply integrated with git, where every action can have serious consequences, as in any software development context.

Rather than blaming the agent, we should recognize that this behavior is expected. It's not that AI is uniquely flawed -- it's that we're automating a class of human communication problems that already exist.

This is less about broken tools and more about adjusting our expectations. Just as hunters had to learn to manage gunpowder weapons after using bows, we're now figuring out how to responsibly wield this new power.

After all, when something works exactly as intended, we already have a word for that: software.
ankit219, about 1 month ago

Agents in their current form are unlikely to go beyond current levels of reliability. I believe agents are a good fit for low-trust environments (outside of coding, where you see errors quickly through testing or deployment), such as inter-company communication and tasks, where systems of checks already exist for when things go wrong. That might be a hot space in some time. Intra-company, a high-trust environment cannot be just workflow automation, given that any error would force the knowledge worker to redo the whole thing to check whether it's correct. We can verify via other agents -- less chance of things going wrong -- but more chance it screws up in the same place as the previous one.
rambambram, about 1 month ago

We heard you, so we decided to tweak the dials a bit. The dial for 'capability' we can turn back a little, no problem, but the dial for 'reliability', uhm, yeah... I'm sorry, but we couldn't find that dial. Sorry.
killjoywashere, about 1 month ago

We have been looking at Hamming distance vs. time-to-signature for ambient note generation in medicine. Any other metrics? There are lots of metrics in the ML papers, but many of them seem sus: they take a lot of work to reproduce, or they're designed around some strategy like maxing out the easy true negatives (so you get desirable accuracy and F1 scores), etc. As someone trying to build validation protocols I can get vendors to enable (we need them to write certain data from memory to a DB table we can access), I'd welcome that discussion. Right now the MBAs running the hospital systems are doing whatever their ML buddies say, without regard to patient or provider.
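As a concrete sketch of that metric: compare the AI draft against the note the clinician actually signs. Hamming distance is only defined for equal-length strings, so the shorter one is padded here; a real protocol might prefer an edit distance such as Levenshtein. All text below is invented.

```python
def hamming(a: str, b: str) -> int:
    n = max(len(a), len(b))
    a, b = a.ljust(n), b.ljust(n)            # pad: Hamming needs equal lengths
    return sum(x != y for x, y in zip(a, b))

draft  = "Patient reports mild chest pain since Tuesday."   # AI-generated note
signed = "Patient reports mild chest pain since Monday."    # what was signed

dist = hamming(draft, signed)
print(dist, dist / max(len(draft), len(signed)))  # raw and length-normalized
# Pair this with time-to-signature to see whether edit burden shrinks per release.
```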
janalsncm, about 1 month ago

I think many people share the same sentiment. We don't need agents that can *kind of* do many things. We need reliable programs that are really good at doing a single thing. I said as much about Manus when it came out:

https://news.ycombinator.com/item?id=43350950

There are mistakes in the Manus demo if you actually look at it. As with so many AI demos, they never want you to look too closely, because the thing that was created is fairly mediocre. No one is asking for the tsunami of sludge, except for VCs apparently.
cryptoz, about 1 month ago

This is refreshing to read. I, like everyone apparently, am working on my own coding agent [1]. And I suppose it's not that capable yet, but it sure is getting more reliable. I have it modify only one file at a time. It generates tickets for itself to complete - though never enough tickets to really get all the work done. The tickets it does generate, however, it can often complete (at least in simple cases, haha). File modification is done by parsing ASTs and modifying those, so the AI doesn't go off and do all kinds of things to your whole codebase.

And I'm so sick of everything trying for 100% automation and failing. There's a place for the human in the loop: *quickly* identifying bugs the AI doesn't have the context for, large-scale vision, security, a product-focused mindset, etc.

It's going to be AI and humans collaborating. The solutions that figure that out best are going to win, IMO. AI won't be doing everything, and humans won't be doing it all either. The tools with the best human-AI collaboration are where it's at.

[1] https://codeplusequalsai.com
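The comment's tool targets its own stack, but the AST-editing idea is easy to illustrate with Python's stdlib: a minimal sketch that renames one function by transforming the tree and regenerating source, leaving everything else untouched.

```python
import ast

source = """
def fetch(url):
    return url

print(fetch("x"))
"""

class Rename(ast.NodeTransformer):
    """Rename one function and its call sites by editing the tree, not the text."""
    def visit_FunctionDef(self, node):
        if node.name == "fetch":
            node.name = "fetch_page"
        return self.generic_visit(node)

    def visit_Name(self, node):
        if node.id == "fetch":
            node.id = "fetch_page"
        return node

tree = Rename().visit(ast.parse(source))
print(ast.unparse(tree))  # regenerate source (Python 3.9+); rest of file untouched
```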
whatnow37373, about 1 month ago

Agents introduce causality, reflection, necessity and various other sub-components never to be found in purely stochastic completion engines. This is an improvement, but it does require breaking down what each "agent" needs to do. What are the "core components" of cognition?

That's why I claim that any sufficiently complicated cognitive architecture contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Immanuel Kant's work.
wg0, about 1 month ago

Totally agree with the author here. Reliability is also pretty hard to achieve when the underlying models are all mountains of probability: no one yet understands how they do exactly what they do, or how to precisely fix one problem without affecting other parts.

Meanwhile, CNBC is pushing the greedy line that these aren't AI wrappers but the next best thing after fire, bread and the axe [0].

[0] https://youtu.be/mmws6Oqtq9o
Havoc, about 1 month ago

What has me slightly puzzled is why there isn't a sharp pivot toward typed languages for vibe coding.

It would be much easier for the AI/IDE to confirm the code is likely good - or at least better than untyped. The whole Rust "if it compiles, it probably works" thing.

Instead it's all Python/JS: let the LLM write code and pray you don't hit runtime errors on a novel code path.

I get that there is more Python training data, but it still seems like the inferior fit for LLM-assisted coding.
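To keep this thread's examples in one language, here is the point illustrated with Python type hints rather than Rust: a checker such as mypy flags the bad call statically, while untyped execution only fails when the bad branch actually runs.

```python
def retry_delay_ms(attempt: int) -> int:
    """Exponential backoff delay in milliseconds."""
    return 100 * 2 ** attempt

# A plausible LLM slip: passing a string where an int is required.
# `mypy` reports this call without running anything; plain CPython raises
# only when the branch actually executes - a runtime error on a novel path.
try:
    retry_delay_ms("3")
except TypeError as exc:
    print("caught at runtime, not before:", exc)
```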
qoez, about 1 month ago

You get more reliability from better capability, though. More capability means being better at not misclassifying subtle tasks, which is what causes reliability issues.
andreash, about 1 month ago

We are building this with https://lets.dev. We believe there will be great demand for less capable but much more deterministic agents. I also recommend everyone read "What is an agent?" by Harrison Chase: https://blog.langchain.dev/what-is-an-agent/
genevra, about 1 month ago

I agree up until the coding example. If someone doesn't know about version control, I don't think that's any fault of the company trying to stretch the technology to its limits and let people experiment. Cursor is a really cool step in a direction, and it's weird to say we should clamp what it's doing because people might not be competent enough to fix its mistakes.
cadamsdotcom, about 1 month ago

Models aren't great at deciding whether an action is irreversible - and thus whether to stop and ask for input/advice/approval. Hence agentic systems are usually given a policy to follow.

Perhaps the question "is this irreversible?" should be delegated to a separate model invocation.

There could be a future in which agentic systems are a tree of model and tool invocations, maybe with a shared scratchpad.
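A sketch of that delegation, with the second model stubbed out as a keyword heuristic: `classify_llm` is a hypothetical stand-in for the dedicated "is this irreversible?" invocation.

```python
def classify_llm(action: str) -> bool:
    """Stub for a dedicated 'is this irreversible?' model invocation."""
    return any(word in action for word in ("delete", "send", "pay"))

def execute(action: str) -> None:
    print("executed:", action)

def guarded_execute(action: str) -> None:
    if classify_llm(action):
        # Irreversible-looking steps pause for approval; the rest run freely.
        if input(f"{action!r} looks irreversible; proceed? [y/N] ") != "y":
            print("skipped")
            return
    execute(action)

guarded_execute("delete branch main")    # pauses for approval
guarded_execute("draft reply to Alice")  # runs straight through
```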
YetAnotherNick, about 1 month ago

I think the author is making an apples-to-oranges comparison. If you have AI acting agentically, capability is likely positively correlated with reliability. If you don't have AI agents, it is more reliable.

AI agents are not there yet; even Cursor doesn't select agent mode by default. I have seen the Cursor agent do quite a bit worse than the raw model with human-selected context.
jappwilson, about 1 month ago

Can't wait for this to be a plot point in a murder mystery: someone gamed the AI agent to create a planned "accident".
nottorp, about 1 month ago

But but...

People don't get promoted for reliability. They get promoted for new capabilities. Everyone thinks they're the next Google.
prng2021, about 1 month ago

I think the best shot we have at solving this problem is an explosion of specialized agents. That will limit how far off the rails each one can go when interpreting or performing some type of task. The end user still just needs to interact with one agent, as long as it can delegate properly to subagents.
piokoch, about 1 month ago

Funny note about Cursor: a commercial, rather expensive product that cannot figure out it would be good to use, say, version control so as not to break somebody's work. That's why I prefer Aider (free), which simply commits whatever it does, so any change can be reverted. Easily.
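The Aider behavior described - every AI edit landing as its own commit - fits in a few lines. A sketch that assumes it runs inside an existing git repository; `ai_edit` is a hypothetical stand-in for the tool's edit step.

```python
import subprocess

def ai_edit(path: str) -> None:
    with open(path, "a") as f:        # stub: the model's actual modification
        f.write("# edited by the agent\n")

def edit_and_commit(path: str, task: str) -> None:
    ai_edit(path)
    # Every edit becomes its own commit, so `git revert` undoes any one step.
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", f"ai: {task}"], check=True)

edit_and_commit("app.py", "add retry logic")
```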
rglover, about 1 month ago

> Given the intensifying competition within AI, teams face a difficult balance: move fast and risk breaking things, or prioritize reliability and risk being left behind.

Can we please retire this dichotomy? Part of why teams do this in the first place is this language of "being left behind."

We badly need to return to a world in which rigorous engineering is applauded and *expected* -- not treated as a nice-to-have or "old-world thinking."
mentalgear, about 1 month ago

Capability demos (like the Rabbit R1 vaporware) will keep coming as long as the market is hot and investors (like lemmings) foolishly chase the companies that are best at hype.
shireboy, about 1 month ago

"It's easy to blame the user's missing grasp of basic version control, but that misses the deeper point."

Uhh, no, that's pretty much the point. A developer without a basic understanding of version control is like a pilot without a basic understanding of landing. A ton of problems with AI (or any other tool, including your own brain) get fixed by iterating on small commits and branching. Throw away the commit or branch if it really goes sideways. I can't fathom working on something for four months without noticing a problem or having any way to roll back.

That said, the one argument I could see is if Cursor (or Copilot, etc.) had something built in to suggest "this project isn't in source control; we should probably fix that before getting too far ahead of ourselves", then helped the user set up source control, the repo, commits, etc. The topic _is_ tricky, and I do remember not totally grasping git, branching, etc.
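That suggested guard is cheap to build. A sketch, assuming the `git` CLI is available on the machine:

```python
import subprocess

def in_git_repo() -> bool:
    result = subprocess.run(
        ["git", "rev-parse", "--is-inside-work-tree"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"

if not in_git_repo():
    if input("No repo found. Set up git before editing? [Y/n] ").lower() != "n":
        subprocess.run(["git", "init"], check=True)
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", "baseline before AI edits"], check=True)
```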
vivzkestrel, about 1 month ago

Remember 2016 chatbots? Sounds like the same thing all over again, except this time we've got hallucinations and unpredictability.
xg15, about 1 month ago

> *If your task can be expressed as a workflow, build a workflow.*

And miss out on the sweet, sweet VC millions? Naah.
fullstackwife, about 1 month ago

Are we reinventing software engineering? What happened to the "write code for error" principle?
marban, about 1 month ago

Giving up accuracy for a bit of convenience - if any at all - almost never pays off. Looking at you, Alexa.
cnst, about 1 month ago

This is my biggest complaint about AI.

Instead of creating easy-to-navigate help sections on the website and explaining the product clearly, the flashy vendors simply put everything behind an opaque model, as if that's somehow better.

Then you have to guess what to type to get the most basic info about fees, terms and procedures of a service.

You want to see how the pros are doing it? Well, they're not using any AI! Tesla, for example, still has a regular PDF and a regular section-based manual (in HTML) where you can read the details about your car.

$TSLA is priced as the most innovative auto manufacturer, and they're clearly proficient with AI (Autopilot/FSD), yet when it comes to the user's manual, they're following the same process all the legacy automakers always have (besides not hiding the PDF behind a parts paywall, and having an open-access HTML version of the manual too, of course). Why? Because that actually works!
amogul, about 1 month ago

Reliability, consistency and accuracy are the next frontier we all have to tackle, and it sucks. A friend of mine is building Empromptu.ai to tackle exactly this. From what she told me, they built a model that lets you define accuracy based on your use case, and their models optimize your whole system toward it.
donfotto, about 1 month ago

> choosing a small number of tasks to execute exceptionally well

And that is the Unix philosophy.
bobosha, about 1 month ago

I think this agents-vs-workflows framing is a false dichotomy. A workflow - at least as I understand it - is the atomic unit of an agent, i.e. an agent stitches workflow(s) together.
segh, about 1 month ago

Lots of people are building at the edge of current AI capabilities, where things don't quite work, because in six months, when the AI labs release a more capable model, you will just be able to plug it in and have it work consistently.
techblaze3, about 1 month ago

Appreciate the effort in writing this.
daxfohl, about 1 month ago

We can barely make deterministic distributed services reliable, and microservices now have a bad reputation for being expensive distributed spaghetti. I'm not holding my breath for distributed AI agents to be a thing.
asdev, about 1 month ago

Want reliability? Build automation instead of using non-deterministic models to complete tasks.
anishpalakurT, about 1 month ago

Check out BAML at boundaryml.com
revskill, about 1 month ago

AI can understand its output.
aucisson_masque, about 1 month ago

Can you actually make the LLM more reliable, though?

As far as I know, LLM hallucinations are inherent to them and will never be completely removed. If I book a flight, I want 100.0% reliability, not 99% (and we are still far from that today).

People have got to take LLMs for what they are: good bullshitters, awesome for translating text or reformulating words, but not designed to have thoughts or be an alternate secretary. Merely a secretary's tool.
fennecbutt, about 1 month ago

Lmao, training models on what is essentially a process directly inspired by imperfect, "good enough" biological processes and expecting a calculator.

Of course I'm not defending all the hype, and I look forward to more advanced models that get it right more often.

But I do laugh at the tech people and managers who expect ML based on an analog process to be sterile and clean like a digital environment.
ramesh31, about 1 month ago

More capability, less reliability, please. I want something that can achieve superhuman results 1 out of 10 times, not something that gives mediocre human results 9 out of 10 times.

All of reality is probabilistic. Expecting that to map deterministically onto solving open-ended complex problems is absurd. It's vectors all the way down.