It looks like the key insight here is to have the LLM generate its own tools (as in GPT/Claude tool calling) via Python code generation, and to use cosine-similarity RAG over the tool descriptions and the current problem/step to select which tools are available at each step, with recent history used to error-correct.<p>The agent starts with some human-created tooling, like a tool to read the file system or a tool to create another tool via Python code, then accumulates custom Python functions it wrote itself, complete with tool-calling metadata like descriptions and input/output types. At each step, if it doesn't find a relevant tool, it creates a new one. Apparently this improves performance on complex tasks (via the GAIA benchmark), with diminishing returns on simpler tasks.
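In rough Python, the retrieve-or-create loop described above might look something like this. None of these names come from the paper; embed, generate_tool and call_with_tools stand in for whatever embedding model and LLM calls you'd plug in.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class ToolLibrary:
        def __init__(self, embed):
            self.embed = embed          # any sentence-embedding function (assumed)
            self.tools = []             # each entry: {"name", "description", "fn", "vec"}

        def add(self, name, description, code):
            ns = {}
            exec(code, ns)              # assume the generated Python defines a function called `name`
            self.tools.append({"name": name, "description": description,
                               "fn": ns[name], "vec": self.embed(description)})

        def retrieve(self, step, k=5, min_sim=0.3):
            qv = self.embed(step)
            scored = sorted(((cosine(qv, t["vec"]), t) for t in self.tools),
                            key=lambda x: x[0], reverse=True)
            return [t for sim, t in scored[:k] if sim >= min_sim]

    def run_step(step, library, history, generate_tool, call_with_tools):
        tools = library.retrieve(step)
        if not tools:
            # No relevant tool in the library: ask the model to write one and keep it for later steps.
            name, description, code = generate_tool(step, history[-3:])
            library.add(name, description, code)
            tools = library.retrieve(step) or [library.tools[-1]]
        result = call_with_tools(step, tools)   # ordinary tool calling over the retrieved subset
        history.append(result)
        return result

The only DynaSaur-specific bit, as far as I can tell, is that the library grows over time and retrieval keeps the per-step tool menu small.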
Putting this idea out there, since I haven't seen anyone implement it:<p>Use vector embeddings to represent each task as a story, an abstraction of 1. the past, 2. the present, 3. the future, on a kind of global "story map".<p>Each embedding would be generated from all available sense inputs at a point in time. The most useful embedding algorithm would be able to combine sight, hearing, internal monologue, visual imagination, etc. into one point on a high-dimensional map.<p>At each time step, find the closest successful "memory" (based on the embedding of 1+2+3) and do some LLM exploration to adapt that memory to the new, novel situation.<p>Attempt the new "story", and do something like A* to get closer to the desired "future", tweaking the story each time and plotting failed attempts on the embedding map.<p>The theory is that over time the map becomes populated with successful attempts, and the embedding can abstract across similar situations based on 1+2+3.<p>I'm not the guy to implement it, and I imagine new models training with a "reasoning step" are doing a similar thing at training time.
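If anyone wants to play with it, here is a very rough sketch of that loop (a greedy retry rather than true A*; embed, adapt and attempt are placeholders for the models and the environment, nothing here is from the paper):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    class StoryMap:
        def __init__(self, embed):
            self.embed = embed           # assumed multimodal embedding function
            self.memories = []           # each entry: (vector, story, succeeded)

        def _vec(self, past, present, future):
            return np.concatenate([self.embed(past), self.embed(present), self.embed(future)])

        def record(self, past, present, future, story, succeeded):
            self.memories.append((self._vec(past, present, future), story, succeeded))

        def closest_success(self, past, present, future):
            query = self._vec(past, present, future)
            hits = [(cosine(query, v), s) for v, s, ok in self.memories if ok]
            return max(hits, key=lambda h: h[0])[1] if hits else None

    def solve(past, present, desired_future, story_map, adapt, attempt, max_tries=10):
        story = story_map.closest_success(past, present, desired_future)
        for _ in range(max_tries):
            story = adapt(story, past, present, desired_future)   # LLM tweaks the memory to the novel situation
            success = attempt(story)                              # act in the environment
            story_map.record(past, present, desired_future, story, success)
            if success:
                return story
        return None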
The paper evaluates itself on the GAIA benchmark and it was my first time hearing about it, so I tried to evaluate myself as a human.<p>Here's a level 3 question from the GAIA paper (level 3 = hardest):<p>>In NASA’s Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.<p>I timed myself solving the problem. It took me 9 minutes, 5 Google searches, 14 web pages, multiple Ctrl+F in these pages and 1 calculator use to figure out the answer.<p>DynaSaur seems to have a 10% to 20% success rate at this level.<p>Try for yourself. This is one of the few empirically grounded reference levels for how far we are from AGI.
I don't like the way LLM papers are written. LLMs receive inputs and produce outputs that are best represented as plain text with some special characters. Simply showing a few examples of the agent's core LLM text-continuation job would explain the architecture much better than figures do. I can't help but feel that the authors who do this are intentionally obfuscating things.
The authors basically said in this paper: let's save some of the working code snippets generated by the LLM and hope they will also be needed in the future, while at the same time concluding that the saved code is sparse. So at this stage the research paper is just useless.
This is super big news if it’s real.<p>Basically, given an agent with an initial set of predefined actions and a goal, they’re saying “decompose this into steps and pick an action to achieve each step”. Pretty standard stuff.<p><i>Then</i> they say: hey, if you can’t solve the problem with those actions (i.e. you failed repeatedly when attempting to solve it), write some arbitrary generic Python code and use that as your action for the next step.<p>Then save that as a new generic action, and slowly build up a library of actions to augment the initial set.<p>The thing is, there’s no meaningful difference between the task “write code to solve this task” and “write code to solve this action”; if you can reliably generate code that performs arbitrary tasks without error, you’ve basically solved programming.<p>So… that would be quite a big deal.<p>That would be a real “Devin” that would actually be able to write arbitrary code to solve arbitrary problems.<p>…which makes me a bit sceptical.<p>Still, this seems to have at least worked reasonably well (as shown by being a leader on the GAIA leaderboard), so they seem to have done <i>something</i> that works, but I’m left wondering…<p>If you’ve figured out how to get an agent to write error-free deterministic code to perform arbitrary actions in a chain-of-thought process, why are you pissing around with accumulating a library of agent actions?<p>That’s all entirely irrelevant and unnecessary.<p>Just generate code for each step.<p>So… something seems a bit strange around this.<p>I’d love to see a log of the actual problem / action / code sequences.
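To be concrete, the “just generate code for each step” version I mean is roughly this. decompose and write_code are whatever LLM calls you like, and run() is an assumed convention; this is my hypothetical, not the paper's method.

    def solve_task(task, decompose, write_code):
        results = []
        for step in decompose(task):                      # LLM splits the task into steps
            code = write_code(step, context=results)      # LLM writes fresh code for this one step
            ns = {}
            exec(code, ns)                                # assume the generated code defines run()
            results.append(ns["run"]())
        return results

If that loop works reliably, the accumulated action library is just a cache on top of it, which is why I find the framing odd.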
This is a great application of dynamic tooling. But Figure 5 is kind of flawed: it’s not a fair comparison when the tool call you provide doesn’t work. Obviously the LLM with code-execution capabilities will do better.
Generating code to do stuff was the idea of OpenAI Codex in 2021.<p>This paper basically just adds a cache? Not really novel as we already have Codex, Code Interpreter, etc.