Octopus v2: On-device language model for super agent

91 points, by lawrencechen, about 1 year ago

8 comments

vessenes, about 1 year ago

Short summary of the paper:

Take Gemma-2B. Take your API. Use ChatGPT-3.5 to generate 1,000 "correct" API function call responses by dint of placing only your API calls in the pre-prompt, then prompting it. I imagine they use ChatGPT to create the request language as well. Then make 1,000 "incorrect" API call responses by filling the pre-prompt with functions not from your API.

Finetune.

Note that they use "functional tokens" in training: they convert a function to a particular, previously unused token and refer to it that way. They claim this speeds up inference (I'm sure it does). They don't make any claims as to whether or not it changes their accuracy (I bet that it does). It definitely makes the system more fragile and harder to train for large and very large APIs.

Outcome: a highly capable *single API* function call LLM. They say you could do it with as little as 100 training inputs if you really wanted.

I think this is interesting, but not world-shattering. I could imagine building a nice little service company on it, basically just "send us a git repo and you'll get a helpful function call API for this version of your code, which you can hook up to an API endpoint / chatbot".

Limitations are going to be largely around Gemma-2B's skills: a 2B model isn't super sophisticated. And you can see they specify "<30 tokens" for the prompt. But I imagine this could be trained quickly enough that it could be part of a release CI process. There are a number of libraries I use for which I'd like access to such a model.

I'd be interested in something that has general knowledge of a large set of packages for a language and could pull in / finetune / MoE little models for specific repositories I'm coding on. Right now I would rely either on a very large model and hope its knowledge cutoff is right (Claude/GPT-4), or on using a lot of a large context window. There might be some Goldilocks version in the middle here which would be helpful in a larger codebase but faster and more accurate than the cloud monopoly providers.
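
(For illustration only: a minimal sketch of the data recipe described above. The function names, token names, and record format here are assumptions, not the paper's actual pipeline.)

    # Minimal sketch of the training-data recipe described above (not the
    # paper's actual pipeline): map each API function to a fresh "functional
    # token", then build (query -> token + arguments) examples for fine-tuning.
    # Function names, token names, and the record format are illustrative.

    API_FUNCTIONS = {
        "take_photo": "<fn_0>",        # each function gets one unused token
        "set_alarm": "<fn_1>",
        "send_message": "<fn_2>",
    }
    IRRELEVANT_TOKEN = "<fn_none>"     # target for queries the API can't serve

    def make_example(query: str, token: str, args: dict) -> dict:
        # One supervised pair: the model should emit the functional token
        # followed by the call arguments, instead of the function name.
        arg_str = ", ".join(f"{k}={v!r}" for k, v in args.items())
        return {"prompt": query, "completion": f"{token}({arg_str})"}

    positives = [
        make_example("Wake me up at 7am tomorrow", API_FUNCTIONS["set_alarm"],
                     {"time": "07:00", "date": "tomorrow"}),
        make_example("Text Alice that I'm running late", API_FUNCTIONS["send_message"],
                     {"recipient": "Alice", "body": "Running late"}),
    ]

    # "Incorrect"/negative examples: queries matching none of your functions,
    # so the model learns to fall back instead of hallucinating a call.
    negatives = [
        make_example("What's the capital of France?", IRRELEVANT_TOKEN, {}),
    ]

    dataset = positives + negatives  # ~1,000 of each in the setup described above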

gardnr, about 1 year ago

> To mitigate such errors, we propose designating functions as unique functional tokens.

I just skimmed the paper, but this seems to be the crux of it. They map each function to a single token and can then fine-tune models to use the token instead of the function name. This increases the accuracy of smaller LLMs and reduces the total number of tokens required for prompts and generations, which is where they get their speed gains from.

The paper is worth a look just to see "Figure (2)".
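
(A rough sketch of what that function-to-token mapping could look like with Hugging Face transformers; the base checkpoint and token names are assumptions, not the authors' released code.)

    # Rough sketch of the "one token per function" idea using Hugging Face
    # transformers. The base model and token names are illustrative assumptions.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    functions = ["take_photo", "set_alarm", "send_message"]
    functional_tokens = [f"<fn_{i}>" for i in range(len(functions))]

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

    # Register each functional token as a single, never-split token...
    tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})
    # ...and grow the embedding matrix so the new tokens have rows to train.
    model.resize_token_embeddings(len(tokenizer))

    # A name like "send_message" otherwise costs several subword tokens per
    # call; after fine-tuning, the model emits exactly one token, which is
    # where the prompt/generation token savings come from.
    print(tokenizer.tokenize("<fn_2>"))        # -> ['<fn_2>']
    print(tokenizer.tokenize("send_message"))  # -> likely several subword pieces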

wanderingmind, about 1 year ago

I'm going to start commenting on arXiv paper links with the same request.

1. Show me the data
2. Show me the code
3. Show me the model

If we can't play with it and modify it easily, it doesn't belong on HN.

iandanforth, about 1 year ago

They might even get higher accuracies with a dedicated classification layer. By using the existing vocabulary they are spreading the probability mass across a *much* larger space. If they stuck to N options, where N is the total number of functions available to the model, I suspect they could get to 100% accuracy.

It's also not clear whether there is sufficient ambiguity in the test data for this to be a generalizable model. The difficulty with "intent recognition" (which they don't mention, but that is what this problem is called for agents like Siri) is that human-generated inputs vary widely and are often badly formed. If they haven't done extensive evaluation with human users, and/or they've constrained the functions to be quite distinct, then they aren't yet tackling a hard problem; they've just got a complex setting.
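
(A toy sketch of the dedicated-classification-layer idea: pool the LM's hidden states and project to N function logits instead of the full vocabulary. The dimensions and pooling choice are illustrative assumptions, not anything from the paper.)

    # Toy sketch of routing over N functions with a dedicated classification
    # head, rather than sampling a functional token from the full vocabulary.
    import torch
    import torch.nn as nn

    class FunctionRouter(nn.Module):
        def __init__(self, hidden_size: int, num_functions: int):
            super().__init__()
            # Probability mass is spread over num_functions classes only,
            # instead of a vocabulary with hundreds of thousands of entries.
            self.head = nn.Linear(hidden_size, num_functions)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq_len, hidden) from the base LM.
            pooled = hidden_states[:, -1, :]   # last-token pooling
            return self.head(pooled)           # (batch, num_functions) logits

    # Example with made-up sizes: 2B-class hidden width, 20 candidate functions.
    router = FunctionRouter(hidden_size=2048, num_functions=20)
    fake_hidden = torch.randn(4, 16, 2048)
    print(router(fake_hidden).argmax(dim=-1))  # predicted function index per query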

turnsout, about 1 year ago

This is the frontier: tiny, specialized models like this and ReALM [0], coupled to the application logic and able to run on-device.

Eventually devices will be powerful enough to run more general-purpose models locally, but for high-frequency user tasks with a low tolerance for error, smaller specialized models may always win.

[0]: https://arxiv.org/abs/2403.20329

mikece, about 1 year ago

"What is better than one recipe for Octopus?"

I can't be the only person who heard that line in their head instantly when reading that headline.

zhiyuan8, about 1 year ago

Hi all,

Thanks for discussing our work; feel free to contact us for follow-up demos and collaborations!

alexchen@nexa4ai.com
zack@nexa4ai.com

CGamesPlay, about 1 year ago

So, I guess it's a LoRA for function calls. Makes sense that this would work well, and it bodes well for creating really cheap request routers in more advanced cloud-based situations.
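
(If you wanted to try that angle yourself, a minimal LoRA setup with the peft library might look like the sketch below; the base model, target modules, and hyperparameters are assumptions, not settings from the paper.)

    # Minimal sketch of a LoRA setup for function-call fine-tuning with peft.
    # Base model, target modules, and hyperparameters are illustrative guesses.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

    lora_config = LoraConfig(
        r=16,                                   # low-rank adapter dimension
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],    # attention projections (assumed names)
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # only the small adapter is trained,
                                        # which is why per-API "routers" stay cheap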