I saw a lot of initial buzz about the promise of agent-based workflows; they seemed like the obvious way to push LLMs to the edge of decision making and to leverage many specialized models. The chatter seems to have died down, but there are growing projects in the space. Before I invest the time to explore and work with these tools, I’d love some feedback from the community on whether and how they are being used.
I’ve seen a lot of attempts but nothing that worked really well. Using an agent as a glorified search engine can work, but trying to replace actual humans to handle anything but the most standard use cases is still incredibly hard. There’s a lot of overhyped rhetoric at the moment around this tech, and it looks like we’re heading into another period of post-hype disillusionment.<p>Legal angles here are also super interesting. There’s a growing body of scenarios where companies are held accountable for the goofs of their AI “assistants.” Thus we’re likely heading for some comical train wrecks as companies that don’t properly vet this stuff set themselves up for some expensive disasters (e.g. the AI assistant doing things that will get the company into trouble).<p>I’m bullish on the tech, but bearish on the ability of folks to deploy it at scale without making a big expensive mess.
I built an AI-agents tech demo[1], and am now pivoting. A few thoughts:<p>* I was able to make a simple AI agent that could control my Spotify account and make playlists based on its world knowledge (rather than Spotify's recommendation algos), which was really cool. I used it pretty frequently to guide Spotify toward my music tastes, and would say I got value out of it.<p>* GPT-4 worked quite well actually; GPT-3.5 worked maybe 80% of the time. Mixtral did not work at all, even after the hacks/workarounds needed to get function-calling working in the first place.<p>* It was very slow and VERY expensive. Needing CoT was a limitation. Could easily rack up $30/day just testing it.<p>My overall takeaway: it's too early: too expensive, too slow, too unreliable. Unless you somehow have a breakthrough with a custom model.<p>From the marketing side, people just don't "get it." I've since niched down, and it's very, very promising from a business perspective.<p>[1] <a href="https://konos.ai" rel="nofollow">https://konos.ai</a>
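The core pattern was small. A minimal sketch, assuming the openai and spotipy packages with OAuth credentials in the environment; the create_playlist tool and prompt are illustrative, not the demo's actual code:

    import json
    from openai import OpenAI
    import spotipy
    from spotipy.oauth2 import SpotifyOAuth

    llm = OpenAI()
    sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="playlist-modify-private"))

    tools = [{"type": "function", "function": {
        "name": "create_playlist",
        "description": "Create a private playlist from free-text track queries",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "tracks": {"type": "array", "items": {"type": "string"},
                           "description": "Queries like 'Artist - Title'"}},
            "required": ["name", "tracks"]}}}]

    def create_playlist(name, tracks):
        user_id = sp.current_user()["id"]
        playlist = sp.user_playlist_create(user_id, name, public=False)
        uris = []
        for query in tracks:  # resolve each free-text query via Spotify search
            hits = sp.search(q=query, type="track", limit=1)["tracks"]["items"]
            if hits:
                uris.append(hits[0]["uri"])
        sp.playlist_add_items(playlist["id"], uris)
        return playlist["id"]

    resp = llm.chat.completions.create(
        model="gpt-4-turbo",
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "create_playlist"}},
        messages=[{"role": "user", "content":
                   "Make me a 15-track playlist of melancholic 70s folk."}])
    args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
    print("created playlist", create_playlist(**args))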
Apparently Pieter Levels:
"
Interior AI now has >99% profit margins<p>- GPU bill is $200/month for 21,000 designs per month or about 1¢ per render (no character training like Photo AI helps costs)
- Hosted on a shared VPS with my other sites @ $500/mo, but % wise Interior AI is ~$50 of that<p>+= $250/month in costs<p>It makes about $45,000 in MRR and so $44,730 is pure profits! It is 100% ran by AI robots, no people<p>I lead the robots and do product dev but only when necessary"<p><a href="https://twitter.com/levelsio/status/1773443837320380759" rel="nofollow">https://twitter.com/levelsio/status/1773443837320380759</a>
I get a lot of value out of Copilot and GPT-4 for coding, but that's about it.<p>It's true that I have to wrestle a lot with them to get them to do what I want for more complex tasks... so they are great for certain tasks and terrible for others. But when I'm in Xcode, I dearly miss VS Code because of Copilot autocomplete, which I guess is an indication that it adds <i>some</i> value.<p>One unexpected synergy has been how good GPT-4 is at explaining why my Rust code is so bad, thanks to the very verbose compiler messages and the availability of high-quality training data (i.e. the great Rust code in the wild), despite GPT-4 not always being great at writing <i>new</i> Rust code from a blank file.<p>Part of me thinks in the future this loop is going to be a bit more automated, with an LLM in the mix... similar to how LSPs are "obvious" and ubiquitous these days.<p>On an unrelated note, I also wrote a small Python script for translating my Xcode project's localizable strings into ~10 different languages with some carefully constructed instructions and error checking (basically some simple JSON validation, before OpenAI offered JSON as a response type). I only speak ~2 of the target languages, and only 1 natively, but from a quick review the translations seemed mostly fine. Definitely a solid starting point.
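The translation script is essentially this pattern (a sketch, assuming the openai package; file layout, languages, and prompt wording are illustrative):

    import json
    from openai import OpenAI

    client = OpenAI()
    LANGUAGES = ["de", "fr", "es", "ja"]

    def translate_strings(strings, lang):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content":
                 f"Translate the JSON values into {lang}. Keep the keys and any "
                 "format specifiers like %@ or %d unchanged. Reply with JSON only."},
                {"role": "user", "content": json.dumps(strings)}])
        translated = json.loads(resp.choices[0].message.content)  # raises on bad JSON
        # Simple validation: same keys in, same keys out
        assert translated.keys() == strings.keys(), f"key mismatch for {lang}"
        return translated

    source = json.load(open("en.json"))
    for lang in LANGUAGES:
        with open(f"{lang}.json", "w") as f:
            json.dump(translate_strings(source, lang), f,
                      ensure_ascii=False, indent=2)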
I've been playing with AI agents for months, and most of them are pretty bad. They often get stuck in loops, which is frustrating. This happens in MultiOn, AutoGPT, and others.<p>I've used Devin a few times (see: <a href="https://x.com/varunshenoy_/status/1767591341289250961?s=20" rel="nofollow">https://x.com/varunshenoy_/status/1767591341289250961?s=20</a>), and while it's far from perfect, it's by far the best I've seen. It doesn't get stuck in loops, and it keeps trying new things until it succeeds. Devin feels like a fairly competent high school intern.<p>Interestingly, Devin seems better suited as an entry-level analyst than a software engineer. We've been using it internally to scrape and structure real estate listings. Their stack for web RPA and browser automation works _really_ well. And it makes sense why this is important: if you want to have a successful agent, you need to provide it with good tools. Again, it's not flawless, but it gives me hope for the future of AI agents.
Most of the applications right now are for purposes where quality isn't a high priority. (Also, plagiarism laundering.)<p><i>Don't</i> put it in charge of paying bills.<p><i>Do</i> put it in charge of making SEO content sites, conducting mass automated scam interactions, generating bulk code where the company tolerates incompetence, making stock art for blog posts that don't need to look professional, handling customer service for accounts you don't care about, etc.
Aren’t agents bottlenecked by the underlying models? I’ve read that the number of “chain of thought” steps needed is proportional to task complexity. And if each step has the same probability p of success, the overall probability of success is p^n, where n is the number of steps needed (potentially high). At a 99% success rate per step and 5 steps, that’s a 95% overall success rate. 90% per step drops to roughly 59%. Not sure what the real numbers are, but this seems like it could be a problem without significantly more intelligent ML models.
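The compounding arithmetic, for concreteness:

    # Overall success is p**n for n independent steps at per-step rate p
    for p in (0.99, 0.95, 0.90):
        for n in (5, 10, 20):
            print(f"p={p:.2f}  n={n:2d}  overall={p**n:.0%}")
    # p=0.99 -> 95%, 90%, 82%; p=0.90 -> 59%, 35%, 12%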
50 comments so far, with 4 about non-agent codegen and the rest confirming OP's observations.<p>I'm also seeing an explosion in the number of comments advertising their AI tool on anything remotely related to AI topics. Makes me think we are headed for a major correction.
What's an agent-based workflow? :)<p>I use LLMs as a glorified search engine. That was better than web search at some point; I'm not sure the publicly available LLMs are that good any more. Gemini lately seems extremely worried about not offending anyone instead of giving me results.<p>At least it's still useful for 'give me the template code for starting an XXX' ...
I'm working on an agent-based tool for software development. I'm getting quite a lot of value out of it. The intention is to minimize copy-pasting and work on complex, multi-file features that are too large for ChatGPT, Copilot, and other AI development tools I've tried.<p><a href="https://github.com/plandex-ai/plandex">https://github.com/plandex-ai/plandex</a><p>It's working quite well, though I am still ironing out some kinks (PRs welcome btw).<p>I think the key to agents that really work is understanding the limitations of the models and working around them, rather than trying to do <i>everything</i> with the LLM.<p>In the context of software development, imo we are currently at the stage of developer-AI symbiosis and probably will be for some time. We aren't yet at the stage where it makes sense to try to get an agent to code and debug complex tasks end-to-end. Trying to do this is a recipe for burning lots of tokens and spending more time than it would take to build the thing yourself. But if you follow the 80/20 rule and get the AI to do the bulk of the work, intervening frequently to keep it on track and then polishing up the final product manually at the end, huge productivity gains are definitely in reach.
When I hear AI agents, I hear RL (reinforcement learning), not LLMs. RL may not be having the moment that LLMs are, but the progress in recent years is incredible, and it is absolutely solving real-world problems. I was just listening to a podcast about using an RL algorithm to enhance the plasma-containment system in a fusion reactor, and the results were incredible. It quickly learned a policy that was competitive with the existing system, which had been hand-built over many years at a cost of millions. It even provided some new insights and surprises. RL is SOTA in robotics control, and some new algorithms like Dreamer V3 can generalize in real time without millions of samples. It has already grown way beyond solving Atari games and is in many cases being used to train LLMs and other generative AI.<p>There is a good amount of research going into combining LLMs with RL for decision making; it is a powerful combination. LLMs help with high-level reasoning and goal setting, and of course provide a smooth interface for interacting with humans and with other agents. LLMs also contain much of the collective knowledge of humanity, which is very useful for training agents to do things. If you want to train a robot to make a sandwich, it's helpful to know things, like what a sandwich is, and that it is necessary to move, get bread, etc.<p>These feedback-loop LLM agent projects are kind of misguided IMO. AI agents are real and useful and progressing fast, but we need to combine more tools than just LLMs to build effective systems.<p>Personally, I am using LLMs quite effectively for ecommerce: classifying messages, drafting responses, handling simple requests like order cancellation. All kinds of glue stuff that used to be painful is now automated and easy. I could go on.
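To make the ecommerce glue concrete, a minimal sketch of the classify-then-route pattern, assuming the openai package; the intent labels and the cancel_order stub are illustrative:

    from openai import OpenAI

    client = OpenAI()

    def cancel_order(msg):
        return "ok, cancelled"  # stand-in for the real order-API call

    def handle_message(msg):
        intent = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "system", "content":
                       "Classify as one word: cancel_order, order_status, other."},
                      {"role": "user", "content": msg}],
        ).choices[0].message.content.strip()
        if intent == "cancel_order":
            return cancel_order(msg)  # simple requests get handled automatically
        # Everything else gets a draft for a human to review
        return client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "system", "content":
                       "Draft a polite support reply for a human to review."},
                      {"role": "user", "content": msg}],
        ).choices[0].message.content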
The best quote I’ve heard from our clients is “don’t trust AI with anything you wouldn’t trust a high schooler to do.”<p>That line of reasoning has held true across basically every project we’ve touched that tried to incorporate LLMs into a core workflow.
The term "AI agents" might be a bit overhyped. We're using AI agents for the orchestration of our fully automated web scrapers. But instead of trying to have one large general purpose agent that is hard to control and test, we use many smaller agents that basically just pick the right strategy for a specific sub-task in our workflows. In our case, an agent is a medium-sized LLM prompt that has a) context and b) a set of functions available to call.
For example we use it for:<p>- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.<p>- Network Analysis: Identify desired data within network calls.<p>- Data transformation: Clean and map the data into the desired format. Finetuned small and performant LLMs are great at this task with a high reliability.<p>The main challenge:<p>We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.<p>The integration of tightly constrained agents with traditional engineering methods effectively solved this issue for us.
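As a concrete illustration of the shape of one such small agent (a sketch, assuming the openai package; the function name, enum values, and prompt are illustrative, not our production code):

    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{"type": "function", "function": {
        "name": "choose_navigation",
        "description": "Pick how to load more results on a listing page",
        "parameters": {
            "type": "object",
            "properties": {
                "strategy": {"type": "string",
                             "enum": ["click_next", "infinite_scroll", "none"]},
                "selector": {"type": "string",
                             "description": "CSS selector, for click_next only"}},
            "required": ["strategy"]}}}]

    def navigation_agent(page_snippet):
        # One medium-sized prompt, one narrow set of callable functions
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            tools=tools,
            tool_choice={"type": "function",
                         "function": {"name": "choose_navigation"}},
            messages=[
                {"role": "system", "content":
                 "You detect how a listing page paginates."},
                {"role": "user", "content": page_snippet}])
        return json.loads(resp.choices[0].message.tool_calls[0].function.arguments)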
I get the same feeling. AI agents sound very cool, but reliability is a huge issue right now.<p>The fact that you can get vastly different outcomes for similar runs (even while using Claude 3 Opus with tool/function calling) can drive you insane. I read somewhere down in this thread that one way to mitigate these problems is by implementing a robust state machine. I reckon this can help, but I also believe that somehow leveraging memory from previous runs could be useful too. It's not fully clear in my mind how to go about doing this.<p>I'm still very excited about the space though. It's a great place to be, and I love the energy but also the measured enthusiasm from everyone who is trying to push the boundaries of what is possible with agents.<p>I'm currently also tinkering with my own Python AI agent library to further my understanding of how they work: <a href="https://github.com/kenshiro-o/nagato-ai">https://github.com/kenshiro-o/nagato-ai</a> . I don't expect it to become the standard, but it's good fun and a great learning opportunity for me :).
To summarize, agents are (essentially) LLMs in a loop: take actions, think, plan, etc., then repeat.<p>From what I've seen, current LLMs "diverge" when put into a loop. They seem to reason acceptably in small chunks, but when you string the chunks together, they go off the rails and don't recover.<p>Can you slap another layer of LLM on top to explicitly recover? People have tried this; it seems like nobody has figured out the error correction needed to get it to converge well.<p>My personal opinion is that this is the measure of whether we have AGI or not. When LLM-in-a-loop converges, self-corrects, etc., then we're there.<p>It's likely all current agent code out there is just fine, and when you plug in a smart enough LLM it'll just work.
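For reference, the skeleton of that loop (a deliberately minimal sketch; the tool protocol and DONE convention are illustrative):

    def agent_loop(goal, llm, tools, max_steps=20):
        # llm: callable prompt -> text; tools: name -> callable(arg) -> observation
        history = [f"Goal: {goal}"]
        for _ in range(max_steps):
            # Ask the model for the next action given everything so far
            decision = llm("\n".join(history) + "\nNext action (or DONE):")
            if decision.strip() == "DONE":
                return history
            name, _, arg = decision.strip().partition(" ")
            obs = tools[name](arg) if name in tools else f"unknown tool: {name}"
            history.append(f"Action: {decision}\nObservation: {obs}")
        return history  # hit the step cap: this is the "off the rails" case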
Whenever I read about how AI is going to automate art or some other creative job I think about a quote I read (maybe here) that went something like “that which is made without effort is enjoyed without pleasure”.
Not sure if this would qualify as an "agent", but I developed my own AI personal assistant that runs as a Telegram bot. I can use it from anywhere easily; it handles my events and reminders, sends me a daily agenda, and memorizes useful things for me. I even integrated it with Whisper so that I can send a Telegram voice message and don't need to write.
From a product/selling perspective, no value at all since I haven't even considered that (I'm building it for myself and my needs). But daily usefulness value? heck yeah!
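For anyone curious, the voice-message part is roughly this shape (a sketch, assuming python-telegram-bot v20+ and OpenAI's Whisper endpoint; the hand-off to the assistant itself is elided):

    from telegram import Update
    from telegram.ext import Application, ContextTypes, MessageHandler, filters
    from openai import OpenAI

    llm = OpenAI()

    async def on_voice(update: Update, context: ContextTypes.DEFAULT_TYPE):
        # Download the voice note, transcribe it with Whisper, then treat the
        # transcript exactly like a typed message to the assistant
        tg_file = await update.message.voice.get_file()
        await tg_file.download_to_drive("note.ogg")
        with open("note.ogg", "rb") as f:
            text = llm.audio.transcriptions.create(model="whisper-1", file=f).text
        await update.message.reply_text(f"Heard: {text}")  # hand off to assistant

    app = Application.builder().token("TELEGRAM_BOT_TOKEN").build()
    app.add_handler(MessageHandler(filters.VOICE, on_voice))
    app.run_polling()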
I don't think AI agents are good enough to replace every job today, but they're starting to nip at the more junior/menial knowledge jobs.<p>I've seen a lot of success come from AI sales agents just doing basic SDR-style work.<p>We're having some success automating manual workflows for companies at Skyvern, but we've only begun to scratch the surface.<p>I suspect that this will play out a lot like the iPhone era -- the first few years will be a lot of discovery and iteration, then things will kick into overdrive and you'll see major shifts in user behavior.
Devin seems like the first that could be commercialized. In my opinion, the only way to do it well right now is to build your own system; the out-of-the-box open-source projects are just foundational work.<p>I actually don't think we will need agents in the future. I think one model will be able to morph itself, or delegate copies of itself, MoE-style, for actions.<p>It just seems extremely unlikely to me that foundation models won't get exponentially smarter over the next few years and won't be able to do this.
Lots more good answers in "Ask HN: What have you built with LLMs?" (<a href="https://news.ycombinator.com/item?id=39263664">https://news.ycombinator.com/item?id=39263664</a>) too.<p>If most people's only experience with AI is the chat.openai.com interface, then yeah, I can see why it seems like too much hassle to most people. The trick is to figure out your long prompts ahead of time, and hardcode each one into an HTTP request in something else (Tasker, BetterTouchTool, Alfred, Apple Shortcuts, etc). For me, I have dozens of long prompts to do exactly what I want, assigned to wake words, hotkeys, and trigger words on my mac/watch/phone. Another key thing is I use FAST models, i.e. Groq, not GPT-4. Latency makes AI too much hassle. For example:<p>1. Instant (<1 second end-to-end) answers, in just a few words, to voice questions spoken into my watch any time I hold the side button<p>2. Summarize long articles and youtube videos before I decide to spend time on them<p>3. Add quick code snippets in plain english with hotkeys or voice<p>4. Get the main arguments for and against something just to frame it<p>... stuff like that. If it would make your life easier for an AI to save you 1 second per task, why not.
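A hardcoded-prompt HTTP call against Groq's OpenAI-compatible endpoint looks roughly like this (a sketch; the model name and prompt wording are illustrative, and this is the kind of call you'd wire to a hotkey in Tasker/Alfred/Shortcuts):

    import os, requests

    SUMMARIZE_PROMPT = ("Summarize the following article in 5 bullet points, "
                        "then give one sentence on who should read it.")

    def quick_llm(task_prompt, text):
        resp = requests.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
            json={"model": "mixtral-8x7b-32768",
                  "messages": [{"role": "system", "content": task_prompt},
                               {"role": "user", "content": text}]})
        return resp.json()["choices"][0]["message"]["content"]

    print(quick_llm(SUMMARIZE_PROMPT, open("article.txt").read()))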
I've found that while agents cannot replace anyone, they can sure help with the use of various things.<p>First, we know these AIs are trained with data from the general Internet, and that data is vast.<p>Second, the general Internet contains owner's manuals and support forums for practically every active product there is, globally. These are every possible kind of product too: physical products, virtual products like software or music, and experience products like travel or education. Between the owner's manuals and the support forums for these products there is extremely deep knowledge about the purpose, use, and troubleshooting of these products.<p>Third, one cannot just ask an LLM direct, deep questions about some random product and expect a deep-knowledge answer. One has to first create the context within the LLM that activates the area(s) of deep knowledge you want your answers to arise from. This requires long-form prompts that create the expert you want; once that expert is active in the LLM's context, you ask it questions and receive the deep-knowledge answers desired.<p>Fourth, one can create an LLM agent that helps a person create the LLM agent they want, the LLM agent can help generate new agents, and dependency chains between different agents are not difficult at all, including information exchange between groups of agents collaborating on shared replies to requests.<p>And last, all that deep information about using pretty much every piece of software there is can be tapped with careful prompting to create the context of an expert user of that software, and experts such as these can become plugins and drivers for that software. It's at our fingertips...!
It's a bit like the chatbot revolution from 8-10 years ago (not LLMs, just make a choice and maybe parse a few keywords to navigate a chatbot state machine).<p>Sure, we can do that, but do users want that?<p>I don't want to chat, talk, or interact with people; I want the most efficient UI possible for the task at hand.
When I do chat with someone, it's because some businesses are crap at automating and I need a human to fix something.
Even then, I don't want a robot that can't do anything.<p>The only exception I can think of is tutoring, but then I'd really question the validity of the answers. RAG is pretty cool in that regard because it can point at the original paragraph being used to answer the question.<p>That might be useful to someone, but that's not my favourite way of learning.<p>Give me a summary of the content, give me the content, Ctrl+F, and I'm good to go.<p>For low-stakes things like gaming, where the agent messing up would just be a fun bug, I think it can be great.<p>Looking forward to automatically generated side quests based on my actions, and NPCs that get pissed if I put a box on their head and hire mercenaries if I murder their families.
I’ve found the OpenAI assistants API not really up to snuff in terms of predictable behavior yet.<p>That said, I’m very bullish on agents overall though and expect that once they get their assistants behaving a bit more predictably we will see some cool stuff.<p>It’s really quite magical to see one of these think through how to solve a problem, use custom tools that you implement to help solve it, and come to a solution.
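For concreteness, that flow with one custom tool looks roughly like this (a sketch against the beta Assistants API as of this writing; get_weather is a hypothetical tool, its output is faked here, and the polling is simplified):

    import json, time
    from openai import OpenAI

    client = OpenAI()

    assistant = client.beta.assistants.create(
        model="gpt-4-turbo",
        instructions="You are a travel planner. Use tools when helpful.",
        tools=[{"type": "function", "function": {
            "name": "get_weather",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]}}}])
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": "Should I bike in Amsterdam today?"}])
    run = client.beta.threads.runs.create(thread_id=thread.id,
                                          assistant_id=assistant.id)

    while run.status not in ("completed", "failed", "expired"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
        if run.status == "requires_action":
            # The assistant decided to call our tool; return a (faked) result
            calls = run.required_action.submit_tool_outputs.tool_calls
            outputs = [{"tool_call_id": c.id,
                        "output": json.dumps({"forecast": "rain"})} for c in calls]
            run = client.beta.threads.runs.submit_tool_outputs(
                thread_id=thread.id, run_id=run.id, tool_outputs=outputs)

    msg = client.beta.threads.messages.list(thread_id=thread.id).data[0]
    print(msg.content[0].text.value)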
Agents are largely an attempt to take the human-LLM pair and use the LLM to replace all the work the human finds trivial but which the LLM is terrible at.<p>Trying to get more inference value per prompt is a good thing. Starting by trying to get it to do long-chain tasks per prompt makes no sense.<p>I'm a huge fan of LLMs for productivity, but even small tasks often require multiple prompts of build-up or fix-up. We should work toward getting those done in a single prompt more often, then work toward slightly larger tasks, etc.<p>Plugins and GPTs are both attempts at getting more/better inference per prompt. There is some progress there, but it's pretty limited. There are also plenty of people building task-specific tools that get better results than someone using the chat interface, due to a lot of prompt work.<p>So there <i>is</i> incremental progress happening, but it's been fairly slow. The fact that it's this much work to get incrementally more inference value per prompt makes it very hard to imagine anyone closing the whole loop immediately with an agent.
Just like many here have said already, GPT-4 is proving useful for coding for me. It is an amazing parser, especially, and saves me precious time. Of course it's not able to do anything on its own or without supervision, but it has been better than looking up examples on Google.<p>I have also been experimenting with it to replace the intent-classifier part of Google's Dialogflow, which we use at work for our chatbot. Earlier we used Watson, and it was amazing but became very expensive. Dialogflow is cheap, but it is as inaccurate with complex natural language as it is cheap.<p>Mixtral (8x7B) has proved extremely accurate at identifying intents with a consistent JSON output, given a short context, so I assume a simple 7B model would do the job. I still don't know if it is financially worth it, but it's something I'm going to try if I can't fix Dialogflow's intents. But in no way would the model's output directly interface with a client. That's asking for trouble.
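The classifier is basically this shape (a sketch, pointing the openai client at a local/self-hosted Mixtral server; the intent list, endpoint, and model name are illustrative):

    import json
    from openai import OpenAI

    # Any OpenAI-compatible server works here (vLLM, Ollama, etc.)
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    INTENTS = ["order_status", "refund_request", "product_question", "other"]

    def classify(message):
        resp = client.chat.completions.create(
            model="mixtral-8x7b-instruct",
            temperature=0,
            messages=[
                {"role": "system", "content":
                 "Classify the user message. Reply with JSON only, exactly "
                 f'{{"intent": <one of {INTENTS}>}}'},
                {"role": "user", "content": message}])
        out = json.loads(resp.choices[0].message.content)
        assert out["intent"] in INTENTS  # never let a bad label reach a client
        return out

    print(classify("Where is my package? I ordered two weeks ago."))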
I made a specific design decision to avoid and minimize agentic behavior in aider. My biggest concerns are that agentic loops are slow and expensive. Even worse, they often "go off the rails" and spend a lot of time and money diligently doing the wrong thing, which you ultimately have to undo/redo.<p>Instead, I've found the "pair programming chat" UX to be the most effective for AI coding. With aider, you collaborate with the AI on coding, asking for a sequence of bite sized changes to your code base. At each step, you can offer direction or corrections if you're unhappy with how the AI interpreted your request. If you need to, you can also just jump in and make certain code changes yourself within your normal IDE/editor. Aider will notice your edits and make sure the AI is aware as your chat continues.<p><a href="https://github.com/paul-gauthier/aider">https://github.com/paul-gauthier/aider</a>
The better question is: is the "value" worth the enormous environmental cost from the massive amounts of power used in training datasets, storing them, and running the GPUs/TPUs?<p>NVIDIA estimates they'll ship up to 2M H100 GPUs. They have a TDP of about 300-400W each. Assume that because of their high cost, their utilization is very high. Assume another 2/3rds of that is used for cooling, which would be another 200W. Be generous and throw out all the overhead from the host computers, storage, power distribution, UPS systems, and networking.<p>2M * 600W = 1.2GW.<p>Let's say you only operated them during the daytime and wanted to do so from solar power. You'd need between ten and twenty square miles worth of solar panels to do so.
Have not experimented with it personally, but here is Andrew Ng's talk on the subject: <a href="https://www.youtube.com/watch?v=sal78ACtGTc" rel="nofollow">https://www.youtube.com/watch?v=sal78ACtGTc</a>
I was hyped on them initially, but after a few months of using them, I find that they are only useful for simple questions, and for coding help with syntax for basic things I have forgotten.<p>Anything more complex just turns into an irritating back-and-forth game, and when I finally arrive at the solution, I feel like I wasted my time gaming a magic 8-ball into giving me what I wanted instead of getting practical experience.<p>It just doesn't feel satisfying to me to use them anymore. I don't deny that they improve my productivity, but it's at the cost of enjoying what I do. I was never able to enter that feeling of zen flow while using LLMs regularly.
Opinions mine, based on learning this space from scratch in the last couple of months only.<p>I feel like these architectures built on top of last-gen LLMs are mostly useless now.<p>The current-gen jump was significant enough that a complex chain-of-thought pipeline with RAG on a last-gen model is usually surpassed by zero-shot prompting on a current-gen model.<p>So instead of spending time and money building one, it's better to focus on zero-shot and update your models to the latest version.<p>Feeding LLM outputs into other LLM inputs IMHO just increases the bias. Initially I expected mixing and matching different models to avoid this, but that didn't work as well as I expected.<p>It depends a lot on your application, honestly.
We are doing that internally, so I think it is for now more a craft than a "product". For example, we look at a lot of specific codebase repositories (e.g. on GitHub) and run LLMs over the diff just before and after a security code audit was done.<p>Another one is listening to many social media posts (e.g. on Twitter) to sense if there is a business opportunity. SDRs scan the results in a Slack channel manually, but based on these signals.<p>Finally, this is not a workflow, but we did build this [1], which is a piece of our work.<p>[1] <a href="https://news.ycombinator.com/item?id=39280358">https://news.ycombinator.com/item?id=39280358</a>
I haven’t found any useful agent workflows, and I’ve not found a tool that’s more productive for me (doing arch/design/implementation of systems) than just copy and pasting from the Playground.
Seems like we are still in the sheepdog phase, where AI agents are extremely capable but not really autonomous helpers: they still need overall coordination and control. The logical extension is layers upon layers, so the next stage is a shepherd AI, then a farm AI, then a meta AI that can discern and separate the layers, implement each, and combine them. It may develop like a film studio, with area experts combined for a particular project, rather than a static one-structure-fulfils-all approach.
Devin is the only one that I've been able to use. Set up some projects for me, added some features. Could improve, obviously, but net positive for me in terms of time
I build a lot of side projects, and I have gotten a lot of value from <a href="https://www.goagentic.com/" rel="nofollow">https://www.goagentic.com/</a> for sending personalised cold emails at scale. I no longer need to spend time researching prospects: the tool researches every prospect, crafts a personalised message based on what I am selling, and sends the emails. So far with a 2-5% positive reply rate.
We use it for manual research. Think of the times when you visit a certain prospect's website or company website to find a certain piece of information or a hook to talk to them about.<p>We use agents in a workflow to be able to do this in bulk. The problem is it does take a long time, but it saves time at the end of the day and saves you from manually visiting a list of 100 different domains to find one piece of information.
Agents are still very new, and nothing works in production yet.<p>Specifically on using AI for coding: I wrote about different levels of AI coding, from L1 to L5. We are still at the L2/L3 stage for mature, production-ready tech; agents are L4/L5:<p><a href="https://prompt.16x.engineer/blog/ai-coding-l1-l5" rel="nofollow">https://prompt.16x.engineer/blog/ai-coding-l1-l5</a>
I just (as in, five minutes ago) hooked GPT-4 up to my 3D printer and it's fantastic. I use an ESP32 Box, and I can ask it what files I have on my printer and ask it to print a file; I even added calendar integration so it can read me my events and add new ones. I love it.<p>All that's left is for someone to bundle it all up into a nice package, and we'll be in the future.
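The printer side can be as small as two functions exposed as tools (a sketch assuming an OctoPrint-style REST API; the endpoints and key handling are illustrative, and my actual setup differs):

    import requests

    PRINTER = "http://printer.local"
    HEADERS = {"X-Api-Key": "OCTOPRINT_API_KEY"}

    def list_files():
        files = requests.get(f"{PRINTER}/api/files", headers=HEADERS).json()["files"]
        return [f["name"] for f in files]

    def print_file(name):
        # Select the file and start printing in one call
        requests.post(f"{PRINTER}/api/files/local/{name}", headers=HEADERS,
                      json={"command": "select", "print": True})

    # Exposed as `tools` in a chat-completions call (same pattern as the Spotify
    # example upthread), these let the model answer "what's on my printer?" and
    # act on "print the benchy".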
We've built a low-code AI agent platform whose primary use case is e-commerce (replacing the first line of human support for basic things like product search, Q&A, etc). It works fairly well if you assemble the script correctly. And if it fails, it just falls back to humans, so customers don't see much difference in their experience.
I do. I use assistants as containers for different conversations for my GTM work:
An assistant for marketing and copywriting
An assistant for customer support
An assistant for sales conversations.<p>These agents aren't super smart: just a few PDFs for context plus a few sentences of system prompt.<p>I do get what I want in 80% of use cases (not measured, just a feeling).
Agents and tooling around LLMs can probably make some small number of applications viable, but for the most part we need better foundation models to deliver on the hype.<p>We're definitely in the "wait" phase of the wait calculation. Everyone is expecting GPT-5/Q* to change things, but really we won't know until we see it.
I think agents are not a fad; they're here to stay, since putting an LLM in an agent system is the only way to let it act on real-world tasks, which they will end up doing once they're good and reliable enough.<p>That said, I believe the current best models are still not good enough - but let's wait a few months.
I personally use an AI FAQ bot to automate answering FAQ questions in some of my Discord servers.
It doesn't always work as well as a human answering, but it does help in most cases.
In other words, AI Agents can be very helpful but can only be trusted/useful to a limited extent compared to humans.
NVIDIA and one step down the chain OpenAI are making lots of money, mostly by convincing people that LLMs solve all problems.<p>LLMs are perfect for this, super flashy, with a ton of hype. In reality, LLMs are really bad at most applications, they are a solution in search of a problem.
Some. We are building some new processes from the ground up and will use agents as a first-draft contributor. This is typically where we find the most slowdown. And we will consistently search for the word "delve", flagging it as a misspelling :)
One of the hardest lessons I learned in tech, maybe in life, was that if people don't want it, you're done.<p>It doesn't matter that you think it's the coolest and most amazing technology in history. It may be. So what?<p>It doesn't matter that experts from every part of industry are yelling that "this is the future", that the march of this tech is "inevitable". They need to believe that, for their own reasons.<p>It doesn't matter that academics from Yale, Harvard and MIT are publishing a dozen new papers on it every week. For the most part their horizon ends at the campus gate.<p>It doesn't matter that investors are clamouring to give you money and inviting you to soirees to woo you because your project has the latest buzzwords in the name. Investors have to invest in something.<p>And it doesn't matter if market research people are telling you that the latent demand and growth opportunity is huge. People tell them what they want to hear.<p>The real test - and I wish I had known this when I was twenty - is: do ordinary people on the London Omnibus want it? Not my inner ego projection. Not my wishful thinking. Not what "the numbers" say. Go and ask them.<p>My experience right now - from asking people (for a show I make) - is that people are shit scared of AI, and if they don't hold a visceral distaste for it, they've an ambivalence that's about as stable as nitro-glycerine on a hot day. I know that may be a difficult thing to hear as a business person.<p>If you are harbouring in your heart any remnant of the idea that you can create demand, that they will "see the light" and once they have a taste will be back for more, or that by will and power they can be <i>made</i>, regulated and peer pressured into accepting your "vision", then you'd be wise to gently let go of those thoughts.
I've been a bit disappointed by the AI. I'll admit I went in with low expectations (I know about the whole AI summer/winter cycle), and I was blown away that ChatGPT could play Jeopardy! with just a prompt, since I remember being blown away by Watson and AlphaGo. But then I had it help me write a letter, and by the time I got it to do anything useful, I had basically written an outline for it, and then I realized I had already done the hard part. I asked it to write some boilerplate code for an interface to the Slack API in Python, but it used a deprecated API, and it assumed I had a valid token. Turns out Slack has lots of different kinds of tokens, I was using the wrong one, and the AI couldn't help me figure that out. After that, I remembered the story about the real pain point for radiologists: they don't need help diagnosing cancer, they need help with their internet connectivity.
When I was technical blogging on how to learn from open-source code [1], I used it quite frequently to get unstuck and/or to figure out how to tease apart a large question into multiple smaller functions. For example, I had no idea how to break up this long `sed` command [2] into its constituent parts, so I plugged it into ChatGPT and asked it to break down the code for me. I then Googled the different parts to confirm that ChatGPT wasn't leading me astray.<p>If I had asked StackOverflow the same question, it would have been quickly closed as being not broadly applicable enough (since this `sed` command is quite specific to its use case). After ChatGPT broke the code apart for me, I was able to ask StackOverflow a series of more discrete, more broadly-applicable questions and get a human answer.<p>TL;DR- I quite like ChatGPT as a search engine when "you don't know what you don't know", and getting unblocked means being pointed in the right direction.<p>1. <a href="https://www.richie.codes/shell" rel="nofollow">https://www.richie.codes/shell</a><p>2. <a href="https://github.com/rbenv/rbenv/blob/e8b7a27ee67a5751b899215b4d35fd86ab552dae/libexec/rbenv-versions#L60">https://github.com/rbenv/rbenv/blob/e8b7a27ee67a5751b899215b...</a>
The use case where I "feel" these are useful is studying any given topic. Having one single page helps avoid many Google searches, tangential questions, etc., but I'm always looking out for inaccuracies.
Full disclaimer up top: I have been working on agents for about a year now, building what would eventually become HDR [1][2].<p>The first issue is that agents have extremely high failure rates. Agents really don't have the capacity to learn from either success or failure, since their internal state is fixed after training. If you ask an agent to repeatedly do some task, it has a chance of failing every single time. We have been able to largely mitigate this by modeling agentic software as a state machine. At every step we have the model choose the inputs to the state machine and then we record them. We then 'compile' the resulting state-transition table down into a program that we can execute deterministically. This isn't totally foolproof, since the world state can change between program runs, so we have methods that allow the LLM to make slight modifications to the program as needed. The idea here is that agents should never have to solve the same problem twice. The cool thing about this approach is that smarter models make the entire system work better. If you have a particularly complex task, you can call out to gpt-4-turbo or claude-3-opus to map out the correct action sequence and then fall back to less complex models like Mistral 7B.<p>The second issue is that almost all software is designed for people, not LLMs. What is intuitive for human users may not be intuitive for non-human users. We're focused on making agents reliably interact with the internet, so I'll use web pages as an example. Web pages contain tons of visually encoded information in things like the layout hierarchy, images, etc. But most LLMs rely on purely text inputs. You can try exposing the underlying HTML or the DOM to the model, but this doesn't work so well in practice. We get around this by treating LLMs as if they were visually impaired users. We give them a purely text interface by using ARIA trees. This interface is much more compact than either the DOM or HTML, so responses come back faster and cost way less.<p>The third issue I see with people building agents is they go after the wrong class of problem. I meet a lot of people who want to use agents for big-ticket items such as planning an entire trip plus doing all the booking. The cost of a trip can run into the thousands of dollars and be a nightmare to undo if something goes wrong. You really don't want to throw agents at this kind of problem, at least not yet, because the downside to failure is so high. Users generally want expensive things to be done well, and agents can't do that yet.<p>However, there are a ton of things I would like someone to do for me that would cost less than five dollars of someone's time, where the stakes for things going wrong are low. My go-to example is making reservations. I really don't want to spend the time sorting through the hundreds of nearby restaurants. I just want to give something the general parameters of what I'm looking for and have reservations show up in my inbox. These are the kinds of tasks that agents are going to accelerate.<p>[1] <a href="https://github.com/hdresearch/hdr-browser">https://github.com/hdresearch/hdr-browser</a>
[2] <a href="https://hdr.is" rel="nofollow">https://hdr.is</a>
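A stripped-down sketch of the record-then-replay idea (illustrative names, not our actual code): the LLM chooses each transition on the first run, the recorded table is replayed deterministically afterwards, and the model is consulted again only when the recorded step no longer matches the current state.

    import json

    def run_task(task, actions, llm_choose, trace_path):
        # actions: name -> callable(**args) returning the next state name
        # llm_choose(task, state) -> {"action": name, "args": {...}}  (model call)
        try:
            trace = json.load(open(trace_path))      # previously "compiled" program
        except FileNotFoundError:
            trace = []
        state, step, new_trace = "start", 0, []
        while state != "done":
            if step < len(trace) and trace[step]["state"] == state:
                choice = trace[step]                 # deterministic replay, no LLM
            else:
                choice = {"state": state, **llm_choose(task, state)}
            state = actions[choice["action"]](**choice.get("args", {}))
            new_trace.append(choice)
            step += 1
        json.dump(new_trace, open(trace_path, "w"))  # never solve the same task twice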
I've been disappointed by my few experiments with Langchain's agent tooling. Things I have experienced:<p>- The PythonREPL or llm-math tool not being used when it should be, and the agent returning a wrong or approximate answer.<p>- The wikipedia and web-browser tools doing spurious research in an attempt to answer a question I did not ask (hallucinating a question, essentially).<p>- Agents getting stuck in a loop of asking the same question over and over until they time out.<p>- The model not believing an answer it gets from a tool (e.g. using a Python function to get today's date and not believing the answer because "the date is in the future").<p>When you layer all this on top of the usual challenges of writing prompts (plus, with a Python function, writing the docstring so the agent knows when to call it), wrong answers, hallucination, etc., etc., I'm unconvinced. But maybe I'm doing it wrong!
I am!<p>In my experience, you need to keep a human in the loop. This implies that you can't get the technology to scale, but I'm optimistic because LLMs have rapidly gotten better at following directions over the six months I've been using them.<p>Summarization is probably the clearest strength of LLMs over a human. With ever-growing context windows, summarizing books in one shot becomes feasible. Most books can be summarized in one sentence, though the most useful, information-dense ones cannot.<p>I had Gemini 1.5 Pro summarize an old book titled Natural Hormonal Enhancement yesterday. Having just read the book, I found the result acceptable.<p><a href="https://hawleypeters.com/summary-of-natural-hormonal-enhancement/" rel="nofollow">https://hawleypeters.com/summary-of-natural-hormonal-enhance...</a><p>For information-dense books, it seems clear to me that chatting with the book is the way to go. I think there's promise in building a competent agent for this kind of use case. Imagine gathering 15 papers and then chatting about their contents with an agent, with queries like:<p>What's the consensus?
Where do these papers diverge in their conclusions?
Please translate this passage into plain English.<p>I haven't done this myself, but I have a hard time imagining such an agent being useless. Perhaps this is a failure of imagination on my part.<p>The brightest spot in my experimentation is [Cursor](<a href="https://cursor.sh" rel="nofollow">https://cursor.sh</a>). It's good for little dev tasks like refactoring a small block of code and chatting about how to use vim. I imagine it'd be able to talk about how to set up various configs, particularly if you @ the documentation, a feature that it supports, including [adding documentation](<a href="https://docs.cursor.sh/features/custom-docs" rel="nofollow">https://docs.cursor.sh/features/custom-docs</a>).<p>Edit: I think a lot of disappointment comes from these kinds of tools not being AGI, or a replacement for a human that does some repetitive task. They magnify the power of somebody that's already curious and driven. They still empower lazy, disengaged users, but with goals like doing the bare minimum, and avoiding work altogether, these tools cannot help one accomplish much of use.
I use one that reads my RSS feeds, writes a radio-DJ voice-over, and uses an ElevenLabs API call to generate the voice (their Santa voice from last year works really well). It combines that with one of my Spotify playlists and gives me a 45-minute radio show for my commute... pretty much changed how I consume news and content like HN.
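The whole pipeline is surprisingly little code. A sketch, assuming the feedparser and openai packages and the ElevenLabs text-to-speech endpoint; the voice ID and prompt are illustrative:

    import feedparser, requests
    from openai import OpenAI

    client = OpenAI()

    # 1. Pull the latest headlines
    entries = feedparser.parse("https://news.ycombinator.com/rss").entries[:10]
    headlines = "\n".join(e.title for e in entries)

    # 2. Have the LLM write the DJ voice-over
    script = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content":
                   "You are an upbeat radio DJ. Write a two-minute voice-over "
                   "linking these headlines for a morning-commute show."},
                  {"role": "user", "content": headlines}],
    ).choices[0].message.content

    # 3. Synthesize it with ElevenLabs text-to-speech
    audio = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID",
        headers={"xi-api-key": "ELEVENLABS_API_KEY"},
        json={"text": script})
    open("show.mp3", "wb").write(audio.content)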
At my work, I have colleagues who speak English as a second language. Many of them are using LLMs to improve their documents and other writing.<p>It's actually quite awful. It's obvious the text is LLM-generated because of the verbose, generic writing style. It communicates clearly but without substance. Not gonna lie, I secretly judge these people.
We built a conversion-rate-optimizing AI agent and saw about a 45% click-through-rate lift on our own homepage. Across other companies beta-testing it, we saw a similar average, with a range of +15% to +175%. The agent (AB3.ai) can be tried here: <a href="https://AB3.ai" rel="nofollow">https://AB3.ai</a>