
Building reliable systems out of unreliable agents

295 points by fredsters_s about 1 year ago

14 comments

mritchie712 about 1 year ago
This is a great write-up! I nodded my head thru the whole post. Very much aligns with our experience over the past year.

I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.

I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations, then uses head-to-head voting to pick the best ones.

This all runs locally / free using ollama.

0 - https://www.definite.app/blog/overkillm
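(A minimal sketch of the idea described above, not the linked script itself: sample many candidate H1s from a local model through Ollama's HTTP API, then run random head-to-head votes and keep the winners. The endpoint and model name assume a local Ollama install.)

```python
import random
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama is running locally
MODEL = "llama3"  # illustrative; any local model works

def ask(prompt: str) -> str:
    resp = requests.post(OLLAMA_URL, json={"model": MODEL, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"].strip()

def generate_candidates(product: str, n: int = 20) -> list[str]:
    # Many unreliable outputs: one prompt, n samples
    prompt = f"Write one punchy H1 headline for {product}. Reply with the headline only."
    return [ask(prompt) for _ in range(n)]

def tournament(candidates: list[str], rounds: int = 100) -> list[tuple[str, int]]:
    # Head-to-head voting: the model judges random pairs and wins are tallied
    wins = {c: 0 for c in candidates}
    for _ in range(rounds):
        a, b = random.sample(candidates, 2)
        verdict = ask(f"Which headline is better? Answer with exactly A or B.\nA: {a}\nB: {b}")
        wins[a if verdict.upper().startswith("A") else b] += 1
    return sorted(wins.items(), key=lambda kv: -kv[1])

for headline, score in tournament(generate_candidates("a data analytics platform", n=10), rounds=50)[:3]:
    print(score, headline)
```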
maciejgryka about 1 year ago
This is a bunch of lessons we learned as we built our AI-assisted QA. I've seen a bunch of people circle around similar processes, but didn't find a single source explaining it, so thought it might be worth writing down.

Super curious whether anyone has similar/conflicting/other experiences and happy to answer any questions.
serjester about 1 year ago
Some of these points are very controversial. Having done quite a bit with RAG pipelines, avoiding strongly typed code is asking for a terrible time. Same with avoiding instructor. LLMs are already stochastic, so why make your application even more opaque? It's such a minimal time investment.
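(For readers unfamiliar with instructor: a minimal sketch of what the typed approach looks like with instructor + pydantic. Assumes an OpenAI API key; the model name and schema are illustrative.)

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Ticket(BaseModel):
    title: str
    severity: int       # 1 (minor) through 5 (critical)
    tags: list[str]

# instructor wraps the client so the response is parsed into the schema,
# retrying automatically when the model's output fails validation
client = instructor.from_openai(OpenAI())

ticket = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=Ticket,
    messages=[{"role": "user", "content": "Users report the export button crashes the app."}],
)
print(ticket.severity, ticket.tags)  # typed attributes, no manual JSON parsing
```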
cpursley about 1 year ago
If you're using Elixir, I thought I'd point out how great this library is:

https://github.com/thmsmlr/instructor_ex

It piggybacks on Ecto schemas and works really well (if instructed correctly).
ThomPete about 1 year ago
We went through a two-tier process before we got to something useful. First we built a prompting system so you could do things like:

Get the content from news.ycombinator.com using gpt-4

- or -

Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com

But then we realized that it was better to teach the agents than human beings, and so we created a fairly solid agent setup. Some of the agents we got can be seen here, all done via instruct:

Paul Graham https://www.youtube.com/watch?v=5H0GKsBcq0s

Moneypenny https://www.youtube.com/watch?v=I7hj6mzZ5X4

V33 https://www.youtube.com/watch?v=O8APNbindtU
viksit about 1 year ago
this is a great write up! i was curious about the verifier and planner agents. has anyone used them in a similar way in production? any examples?

for instance: do you give the same llm the verifier and planner prompt? or have a verifier agent process the output of a planner and have a threshold which needs to be passed?

feels like there may be a DAG in there somewhere for decision making..
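(One plausible shape, sketched under assumptions since the post's own wiring isn't shown here: separate planner and verifier prompts, possibly against the same model, gated by a numeric threshold. `llm` is a stand-in for whatever completion call you use.)

```python
PLANNER_PROMPT = "Draft a step-by-step plan for this task:\n{task}"
VERIFIER_PROMPT = (
    "Rate this plan for the task '{task}' from 0 to 10 for feasibility. "
    "Reply with just the number.\n\nPlan:\n{plan}"
)

def llm(prompt: str) -> str:
    raise NotImplementedError("replace with your completion call")

def plan_with_verification(task: str, threshold: float = 7.0, max_tries: int = 3) -> str:
    for _ in range(max_tries):
        plan = llm(PLANNER_PROMPT.format(task=task))
        score = float(llm(VERIFIER_PROMPT.format(task=task, plan=plan)))  # naive parse
        if score >= threshold:
            return plan
    raise RuntimeError("no plan passed verification")
```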
tedtimbrell about 1 year ago
On the topic of wrappers, as someone that's forced to use GPT-3.5 (or the like) for cost reasons, anything that starts modifying the prompt without explicitly showing me how is an instant no-go. It makes things really hard to debug.

Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
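(A sketch of that plain-client pattern, with an illustrative schema and model name: the prompt goes to the API exactly as written, and pydantic does the validation, so failures stay visible.)

```python
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError

class Extraction(BaseModel):
    company: str
    sentiment: str

client = OpenAI()
prompt = (
    "Extract JSON with keys 'company' and 'sentiment' from this text: "
    "'Acme's new release is great.' Reply with JSON only."
)

raw = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],  # sent verbatim, nothing injected
).choices[0].message.content

try:
    result = Extraction(**json.loads(raw))
except (json.JSONDecodeError, ValidationError) as err:
    print("bad output:", raw, err)  # the raw output is right there to debug
```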
liampulles about 1 year ago
Agree with lots of this.

As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.

What I've found kind of works is indexing by African names, e.g. Thandokazi, Ntokozo, etc. Then the AI seems to have less bias.

Curious what others have done in this case.
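(A sketch of that indexing trick: key the options with arbitrary names instead of 1..N, then map the model's answer back to the real option. The names beyond the two in the comment are made up.)

```python
import random

def label_options(options: list[str], labels: list[str]) -> dict[str, str]:
    # Shuffle the label-to-option assignment so position carries no signal either
    return dict(zip(random.sample(labels, len(options)), options))

labels = ["Thandokazi", "Ntokozo", "Sipho", "Lindiwe"]
keyed = label_options(["refund the order", "escalate to support", "close the ticket"], labels)

prompt = "Pick the most applicable option; answer with its key only:\n" + "\n".join(
    f"{key}: {option}" for key, option in keyed.items()
)
# Send `prompt` to the model, then: chosen = keyed[model_answer.strip()]
```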
tmm84 about 1 year ago
Unlike the author of this article, I have had success with RAGatouille. It was my main tool when I was limited on resources and working with non-Romanized languages that don't follow the usual token rules (spaces, periods, line breaks, triplet word groups, etc.). However, I have had to move past RAGatouille and use embedding + vector DB for a more portable solution.
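(A minimal sketch of that embedding + vector DB route, with plain numpy standing in for the vector DB; the multilingual model choice is illustrative, but it matters for non-Romanized text.)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = ["文書その一", "문서 둘", "document three"]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # unit vectors

def search(query: str, k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since everything is normalized
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(search("the second document"))
```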
jongjong about 1 year ago
My experience with AI agents is that they don't understand nuance. This makes sense since they are trained on a wide range of data produced by the masses, and the masses aren't good with nuance. That's why, if you put 10 experts together, they will often make worse decisions than they would have made individually.

In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it didn't understand nuance and kept breaking stuff that it had fixed previously, even with Claude, where it kept our entire conversation context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, because it just didn't get it no matter how precise I was with my wording. It became like playing a game of whac-a-mole: fix one thing, break two others.
CuriouslyC about 1 year ago
Prompt engineering is honestly not long for this world. It's not hard to build an agent that can iteratively optimize a prompt given an objective function, and it's not hard to make that agent general purpose. DSPy already does some prompt optimization via multi-shot learning/chain of thought; I'm quite certain we'll see an optimizer that can actually rewrite the base prompt as well.
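(A toy version of that optimizer loop, under the obvious assumptions: one model proposes rewrites, an objective function scores each rewrite against an eval set, and the best prompt survives. `llm` and `score` are stand-ins.)

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("replace with your completion call")

def score(prompt: str) -> float:
    raise NotImplementedError("objective function, e.g. accuracy on a labeled dev set")

def optimize_prompt(base_prompt: str, iterations: int = 10) -> str:
    best, best_score = base_prompt, score(base_prompt)
    for _ in range(iterations):
        candidate = llm(
            "Rewrite this prompt so a model follows it more reliably. "
            f"Keep the task identical.\n\n{best}"
        )
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```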
jasontlouro about 1 year ago
Very tactical guide, which I appreciate. This is basically our experience as well. Output can be wonky, but can also be pretty easily validated and honed.
iamleppert about 1 year ago
A better way is to threaten the agent:

“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”

Increases accuracy and performance by an order of magnitude.
caseyy about 1 year ago
Interesting ideas, but it didn't mention priming, which is a prompt-engineering way to improve consistency in answers.

Basically, in the context window, you provide your model with 5 or more example inputs and outputs. If you're running in chat mode, that'd be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user, and the assistant will follow the rhythm and style of the previous answers in the context window.

It works so well that I was able to take answer-reformatting logic out of some of my programs that query llama2 7b. And it's a lot cheaper than fine-tuning, which may be overkill for simple applications.
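(A sketch of that priming pattern against a local model through Ollama's chat API, with the endpoint, model tag, and example format all assumed: seed the history with example pairs and the model copies their shape.)

```python
import requests

examples = [
    ("Summarize: the deploy failed twice.", "STATUS: fail | RETRIES: 2"),
    ("Summarize: all tests green on main.", "STATUS: pass | RETRIES: 0"),
]  # in practice, five or more pairs, per the comment above

messages = []
for user_msg, assistant_msg in examples:
    messages.append({"role": "user", "content": user_msg})
    messages.append({"role": "assistant", "content": assistant_msg})
messages.append({"role": "user", "content": "Summarize: one flaky test, fixed on rerun."})

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama2:7b", "messages": messages, "stream": False},
)
print(resp.json()["message"]["content"])  # should follow the STATUS/RETRIES pattern
```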