This is a great write-up! I nodded my head through the whole post. Very much aligns with our experience over the past year.<p>I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.<p>I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations then uses head-to-head voting to pick the best ones.<p>This all runs locally / free using ollama.<p>0 - <a href="https://www.definite.app/blog/overkillm" rel="nofollow">https://www.definite.app/blog/overkillm</a>
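For anyone curious what that looks like in practice, here's a rough sketch of the generate-then-vote idea (not the actual overkiLLM script; the model name and prompts are placeholders, assuming the ollama Python client and a local model):

    # Rough sketch of "generate many, vote head-to-head" with a local ollama model.
    import itertools
    import random
    import ollama

    MODEL = "llama3"  # placeholder; any local ollama model

    def generate_variations(task: str, n: int = 8) -> list[str]:
        """Ask the model for n independent takes on the same short text."""
        variations = []
        for _ in range(n):
            resp = ollama.chat(model=MODEL, messages=[
                {"role": "user", "content": f"Write one H1 headline for: {task}"}
            ])
            variations.append(resp["message"]["content"].strip())
        return variations

    def vote(a: str, b: str, task: str) -> str:
        """Head-to-head vote: the model picks A or B; unclear answers go randomly."""
        resp = ollama.chat(model=MODEL, messages=[{
            "role": "user",
            "content": f"Task: {task}\nA: {a}\nB: {b}\nWhich headline is better? Answer A or B only."
        }])
        answer = resp["message"]["content"].strip().upper()
        return a if answer.startswith("A") else b if answer.startswith("B") else random.choice([a, b])

    def best_headline(task: str) -> str:
        candidates = generate_variations(task)
        scores = {c: 0 for c in candidates}
        for a, b in itertools.combinations(candidates, 2):
            scores[vote(a, b, task)] += 1
        return max(scores, key=scores.get)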
These are lessons we learned as we built our AI-assisted QA. I've seen a bunch of people circle around similar processes, but I didn't find a single source explaining them, so I thought it might be worth writing down.<p>Super curious whether anyone has similar/conflicting/other experiences, and happy to answer any questions.
Some of these points are very controversial. Having done quite a bit with RAG pipelines, I can say that avoiding strong typing in your code is asking for a terrible time. Same with avoiding instructor. LLMs are already stochastic, so why make your application even more opaque? It's such a minimal time investment.
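For reference, the typed-output pattern is only a few lines with instructor + pydantic; a minimal sketch assuming an OpenAI-compatible backend (the schema and model name are made up):

    # Typed LLM output via instructor + pydantic. instructor wraps the OpenAI
    # client and retries until the response parses into the declared model.
    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class TicketTriage(BaseModel):  # hypothetical schema
        severity: int
        component: str
        summary: str

    client = instructor.from_openai(OpenAI())

    triage = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        response_model=TicketTriage,
        messages=[{"role": "user", "content": "Triage this bug report: ..."}],
    )
    print(triage.severity, triage.component)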
If you’re using Elixir, I thought I’d point out how great this library is:<p><a href="https://github.com/thmsmlr/instructor_ex">https://github.com/thmsmlr/instructor_ex</a><p>It piggybacks on Ecto schemas and works really well (if instructed correctly).
We went through a two-tier process before we got to something useful. First we built a prompting system so you could do things like:<p>Get the content from news.ycombinator.com using gpt-4<p>- or -<p>Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com<p>but then we realized that it was better to teach the agents than the human beings, so we created a fairly solid agent setup.<p>Some of the agents we got can be seen here, all done via instruct:<p>Paul Graham
<a href="https://www.youtube.com/watch?v=5H0GKsBcq0s" rel="nofollow">https://www.youtube.com/watch?v=5H0GKsBcq0s</a><p>Moneypenny
<a href="https://www.youtube.com/watch?v=I7hj6mzZ5X4" rel="nofollow">https://www.youtube.com/watch?v=I7hj6mzZ5X4</a><p>V33
<a href="https://www.youtube.com/watch?v=O8APNbindtU" rel="nofollow">https://www.youtube.com/watch?v=O8APNbindtU</a>
This is a great write-up! I was curious about the verifier and planner agents. Has anyone used them in a similar way in production? Any examples?<p>For instance: do you give the same LLM both the verifier and planner prompts? Or have a verifier agent process the output of a planner, with a threshold that needs to be passed (rough sketch of what I mean below)?<p>Feels like there may be a DAG in there somewhere for decision making.
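One possible shape of that loop; every function, prompt, and the 0.8 threshold here are arbitrary stand-ins, not anything from the article:

    # Planner/verifier loop with a pass threshold. call_llm() is a stand-in
    # for whatever client you use.
    import json

    def call_llm(prompt: str) -> str:  # hypothetical wrapper around your LLM client
        raise NotImplementedError

    def plan(task: str) -> str:
        return call_llm(f"Break this task into numbered steps:\n{task}")

    def verify(task: str, plan_text: str) -> float:
        raw = call_llm(
            "Score how well this plan accomplishes the task from 0 to 1. "
            f"Reply with JSON {{\"score\": <float>}}.\nTask: {task}\nPlan:\n{plan_text}"
        )
        return float(json.loads(raw)["score"])

    def plan_until_good(task: str, threshold: float = 0.8, max_tries: int = 3) -> str:
        best, best_score = None, -1.0
        for _ in range(max_tries):
            candidate = plan(task)
            score = verify(task, candidate)
            if score >= threshold:
                return candidate
            if score > best_score:
                best, best_score = candidate, score
        return best  # fall back to the best-scoring plan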
On the topic of wrappers, as someone that's forced to use GPT-3.5 (or the like) for cost reasons, anything that starts modifying the prompt without explicitly showing me how is an instant no-go. It makes things really hard to debug.<p>Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
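For what it's worth, the no-wrapper version is pretty short; a sketch assuming the openai v1 client, with a made-up output schema:

    # Call the client directly and let pydantic validate the JSON.
    from openai import OpenAI
    from pydantic import BaseModel, ValidationError

    class Extraction(BaseModel):  # hypothetical output schema
        title: str
        tags: list[str]

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Reply with JSON: {\"title\": str, \"tags\": [str]}"},
            {"role": "user", "content": "Summarize this article: ..."},
        ],
    )
    try:
        data = Extraction.model_validate_json(resp.choices[0].message.content)
    except ValidationError as e:
        print("model returned malformed output:", e)  # you see exactly what went wrong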
Agree with lots of this.<p>As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.<p>What I've found kind of works is indexing by African names, e.g. Thandokazi, Ntokozo, etc. Then the AI seems to have less bias.<p>Curious what others have done in this case.
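A sketch of how that trick could be wired up, with the labels and prompt wording purely illustrative: label the options with distinctive strings instead of 1..N so the model can't fall back on counting upward, then map the answer back.

    LABELS = ["Thandokazi", "Ntokozo", "Sibusiso", "Nomvula"]

    def build_prompt(question: str, options: list[str]) -> tuple[str, dict[str, str]]:
        label_map = dict(zip(LABELS, options))
        listing = "\n".join(f"{label}: {opt}" for label, opt in label_map.items())
        prompt = f"{question}\n{listing}\nAnswer with the label of every option that applies."
        return prompt, label_map

    def parse_answer(answer: str, label_map: dict[str, str]) -> list[str]:
        return [opt for label, opt in label_map.items() if label in answer]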
Unlike the author of this article, I have had success with RAGatouille. It was my main tool when I was limited on resources and working with non-Romanized languages that don't follow the usual token rules (spaces, periods, line breaks, triplet word groups, etc.). However, I have since had to move past RAGatouille and use embeddings + a vector DB for a more portable solution.
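The portable version can be quite small; a bare-bones sketch where embed() is a stand-in for whatever embedding model you use, and a real vector DB would replace the brute-force search:

    import numpy as np

    def embed(text: str) -> np.ndarray:  # hypothetical embedding call
        raise NotImplementedError

    class TinyVectorStore:
        def __init__(self):
            self.texts: list[str] = []
            self.vectors: list[np.ndarray] = []

        def add(self, text: str) -> None:
            self.texts.append(text)
            self.vectors.append(embed(text))

        def search(self, query: str, k: int = 5) -> list[str]:
            q = embed(query)
            sims = [
                float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors
            ]
            top = np.argsort(sims)[::-1][:k]
            return [self.texts[i] for i in top]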
My experience with AI agents is that they don't understand nuance. This makes sense, since they are trained on a wide range of data produced by the masses, and the masses aren't good with nuance. That's why, if you put 10 experts together, they will often make worse decisions than they would have made individually.<p>In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it didn't understand nuance and kept breaking things it had fixed previously, even with Claude keeping our entire conversation context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, no matter how precise I was with my wording; it became a game of whack-a-mole: fix one thing, break two others.
Prompt engineering is honestly not long for this world. It's not hard to build an agent that can iteratively optimize a prompt given an objective function, and it's not hard to make that agent general purpose. DSPy already does some prompt optimization via multi-shot learning/chain of thought; I'm quite certain we'll see an optimizer that can actually rewrite the base prompt as well.
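A toy version of that optimizer loop, where call_llm() and the objective function are stand-ins (this is the general idea, not DSPy's actual mechanism):

    def call_llm(prompt: str) -> str:  # hypothetical LLM client
        raise NotImplementedError

    def optimize_prompt(base_prompt: str, objective, dataset, rounds: int = 5) -> str:
        """objective(prompt, dataset) -> float in [0, 1], higher is better."""
        best_prompt, best_score = base_prompt, objective(base_prompt, dataset)
        for _ in range(rounds):
            candidate = call_llm(
                "Rewrite this prompt to better satisfy its task. "
                f"Current score: {best_score:.2f}. Prompt:\n{best_prompt}"
            )
            score = objective(candidate, dataset)
            if score > best_score:
                best_prompt, best_score = candidate, score
        return best_prompt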
Very tactical guide, which I appreciate. This is basically our experience as well. Output can be wonky, but can also be pretty easily validated and honed.
A better way is to threaten the agent:<p>“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”<p>Increases accuracy and performance by an order of magnitude.
Interesting ideas, but it didn’t mention priming, which is a prompt-engineering way to improve consistency in answers.<p>Basically, in the context window, you provide your model with 5 or more example inputs and outputs. If you’re running in chat mode, that’d be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user, and the assistant will follow the rhythm and style of previous answers in the context window.<p>It works so well I was able to take answer-reformatting logic out of some of my programs that query llama2 7b. And it’s a lot cheaper than fine-tuning, which may be overkill for simple applications.
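A sketch of what that priming looks like against a local llama2 via the ollama Python client; the example pairs and task are placeholders:

    # Seed the context with example user/assistant pairs so the model copies
    # their format, then append the real prompt at the end.
    import ollama

    FEW_SHOT = [
        {"role": "user", "content": "Sentiment: 'The update broke my workflow.'"},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Sentiment: 'Setup took two minutes, love it.'"},
        {"role": "assistant", "content": "positive"},
        # ...more pairs in the same shape
    ]

    def classify(text: str) -> str:
        messages = FEW_SHOT + [{"role": "user", "content": f"Sentiment: '{text}'"}]
        resp = ollama.chat(model="llama2", messages=messages)
        return resp["message"]["content"].strip()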