Short summary of the paper:<p>Take Gemma-2B. Take your API. Use GPT-3.5 to generate 1,000 "correct" API function-call responses by placing only your API's functions in the pre-prompt and then prompting it. I imagine they use GPT-3.5 to create the request language as well. Then generate 1,000 "incorrect" API-call responses by filling the pre-prompt with functions that aren't from your API.<p>Finetune.<p>Note that they use "functional tokens" in training - they convert each function to a particular, previously unused token and refer to it that way. They claim this speeds up inference (I'm sure it does). They don't make any claims as to whether it changes their accuracy (I bet it does). It definitely makes the system more fragile / harder to train for large and very large APIs.<p>Outcome: a highly capable <i>single-API</i> function-call LLM. They say you could do it with as few as 100 training inputs if you really wanted.<p>I think this is interesting, but not world-shattering. I could imagine building a nice little service company on it, basically just "send us a git repo and you'll get a helpful function-call API for this version of your code, which you can hook up to an API endpoint / chatbot".<p>Limitations are going to be largely around Gemma-2B's skills -- a 2B model isn't super sophisticated. And you can see they specify "<30 tokens" for the prompt. But I imagine this could be trained quickly enough that it could be part of a release CI process. There are a number of libraries I use for which I would like to have such a model.<p>I'd be interested in something that has general knowledge of a large set of packages for a language, and could pull in / finetune / MoE little models for the specific repositories I'm coding on. Right now I would either rely on a very large model and hope its knowledge cutoff is right (Claude/GPT-4), or use a lot of a large context window. There might be some Goldilocks version in the middle that would be helpful in a larger codebase but faster and more accurate than the cloud monopoly providers.
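For concreteness, here is a minimal sketch of the kind of data-generation loop described above, assuming the OpenAI Python client; the prompt wording, function signatures, and variable names are all made up for illustration, not taken from the paper:

<pre><code>from openai import OpenAI

client = OpenAI()

# Placeholders: signatures from your own API vs. from unrelated APIs.
MY_API_FUNCTIONS = [
    "take_photo(camera: str, flash: bool)",
    "set_alarm(time: str, label: str)",
]
UNRELATED_FUNCTIONS = [
    "get_weather(city: str)",
    "book_flight(origin: str, destination: str)",
]

def generate_example(functions):
    prompt = (
        "You may only call these functions:\n"
        + "\n".join(functions)
        + "\nInvent a plausible user request, then answer it with a single "
          "function call written as name(arg=value, ...)."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# "Correct" examples use your API; "incorrect" ones use functions it doesn't have.
positives = [generate_example(MY_API_FUNCTIONS) for _ in range(1000)]
negatives = [generate_example(UNRELATED_FUNCTIONS) for _ in range(1000)]
</code></pre>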
> To mitigate such errors, we propose designating functions as unique functional tokens.<p>I just skimmed the paper, but this seems to be the crux of it. They map each function to a single token and can then fine-tune models to emit the token instead of the function name. This increases the accuracy of smaller LLMs and reduces the total number of tokens required for prompts and generations, which is where they get their speed gains.<p>The paper is worth a look just to see Figure 2.
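A rough sketch of what that token mapping could look like with a Hugging Face Gemma checkpoint; the token names and function list are illustrative, not the paper's:

<pre><code>from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical API surface; each function gets exactly one new vocab entry.
functions = ["take_photo", "send_email", "set_alarm"]
functional_tokens = [f"<func_{i}>" for i in range(len(functions))]

tok = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Register the new tokens and grow the embedding matrix to match.
tok.add_special_tokens({"additional_special_tokens": functional_tokens})
model.resize_token_embeddings(len(tok))

# Fine-tuning targets then look like "<func_0>(camera='front', flash=False)"
# instead of spelling out "take_photo(...)" token by token.
</code></pre>

Each function call then costs a single decoding step to name, which is where the token savings come from.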
I'm going to start commenting on arXiv paper links with the same request.<p>1. Show me the data<p>2. Show me the code<p>3. Show me the model<p>If we can't play with it and modify it easily, it doesn't belong on HN.
They might even get higher accuracy with a dedicated classification layer. By using the existing vocabulary they are spreading the probability mass across a <i>much</i> larger space. If they stuck to N options, where N is the total number of functions available to the model, I suspect they could get to 100% accuracy.<p>It's also not clear whether there is sufficient ambiguity in the test data for this to be a generalizable model. The difficulty with "intent recognition" (which they don't mention, but that is what this problem is called for agents like Siri) is that human-generated inputs vary widely and are often badly formed. If they haven't done extensive evaluation with human users, and/or they've constrained the functions to be quite distinct, then they aren't yet tackling a hard problem; they've just got a complex setting.
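For what it's worth, a rough sketch of that dedicated-classifier idea: an N-way linear head over the base model's final hidden state, so probability mass is spread over only the N functions rather than the whole vocabulary. The model name, head shape, and function count are assumptions, and the head would of course need to be trained before it predicts anything useful:

<pre><code>import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

N_FUNCTIONS = 20  # assumed size of the API's function set

tok = AutoTokenizer.from_pretrained("google/gemma-2b")
encoder = AutoModel.from_pretrained("google/gemma-2b")
head = nn.Linear(encoder.config.hidden_size, N_FUNCTIONS)  # train jointly or on top of the frozen encoder

def classify(prompt: str) -> int:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden)
    logits = head(hidden[:, -1, :])                    # score only the N functions
    return int(logits.argmax(dim=-1))                  # index of the predicted function

print(classify("Take a selfie with the front camera"))
</code></pre>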
This is the frontier: tiny, specialized models like this and ReALM [0], coupled to the application logic and able to run on-device.<p>Eventually devices will be powerful enough to run more general-purpose models locally, but for high-frequency user tasks with a low tolerance for error, smaller specialized models may always win.<p>[0]: <a href="https://arxiv.org/abs/2403.20329" rel="nofollow">https://arxiv.org/abs/2403.20329</a>
"What is better than one recipe for Octopus?"<p>I can't be the only person who heard that line in their head instantly when reading that headline.
So, I guess it's a LoRA for function calls. Makes sense that this would work well, and it bodes well for creating really cheap request routers in more advanced cloud-based setups.
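If you wanted to try exactly that framing, here is a minimal sketch with the peft library; the rank, target modules, and other hyperparameters are guesses for illustration, not values from the paper:

<pre><code>from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # guessed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()

# Fine-tune on (user request, functional-token call) pairs, then swap in a
# different adapter per API to get a cheap per-service request router.
</code></pre>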