Fuyu-8B: A multimodal architecture for AI agents

205 points by averylamp, over 1 year ago

16 comments

tasdfqwer0897 · over 1 year ago
Hey I work at Adept and helped make this! Happy to answer questions. The thing I think is especially neat/notable is how simple you can make the model architecture while still getting good performance. I expect we'll continue to see bits of these models get deleted in the next few years.

Note that you can get the model weights on HuggingFace here: https://huggingface.co/adept/fuyu-8b
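For readers who want to try those weights, here is a minimal sketch, assuming the Fuyu integration (FuyuProcessor / FuyuForCausalLM) that ships in recent Hugging Face transformers releases, a CUDA GPU, and a placeholder image path:

```python
# Minimal sketch of loading adept/fuyu-8b for image captioning.
# Assumes a recent `transformers` with Fuyu support and a CUDA device;
# "example.png" is a placeholder for your own image file.
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

prompt = "Generate a coco-style caption.\n"
image = Image.open("example.png")

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)

# Strip the prompt tokens before decoding the generated caption.
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```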
fpgaminer · over 1 year ago
The architecture is quite compelling. I would not have expected it to work as well as it does. Glancing at the benchmarks it's basically on par with other VLMs in its class, despite having no separate image encoder.

Is there an associated paper? Or more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder if they trained off strictly VLM corpora, or also used plain image-text datasets like CLIP. It would be interesting if only the former.

Also makes me wonder if it could be trained on something like CommonCrawl where all the images are retained and interspersed correctly throughout the text. This model could theoretically train just fine off that, and it would unlock a whole new dataset effectively.

And has there been an inspection of what the model is outputting for predicted image "tokens"? Is it correctly predicting projected image patches to any degree of accuracy? And could therefore also generate images inline with text if another de-projection layer was trained?
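To make the "no separate image encoder" point concrete, here is an illustrative PyTorch sketch of the idea described in the model card, not Adept's actual code: raw image patches are flattened, projected with a single linear layer into the decoder's embedding space, and concatenated with ordinary text token embeddings. All sizes below are placeholders.

```python
# Illustrative sketch of Fuyu-style patch embedding (not the released code):
# image patches go through one linear projection straight into the decoder's
# embedding space; there is no separate pretrained image encoder.
import torch
import torch.nn as nn

d_model, patch, channels = 4096, 30, 3              # placeholder sizes
patch_proj = nn.Linear(patch * patch * channels, d_model)
text_embed = nn.Embedding(262144, d_model)          # placeholder vocab size

def embed_image(image: torch.Tensor) -> torch.Tensor:
    """image: (C, H, W), H and W divisible by `patch` -> (num_patches, d_model)."""
    c, _, _ = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return patch_proj(patches)                      # one linear layer, that's it

def embed_sequence(image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Prepend projected image patches to the text token embeddings."""
    return torch.cat([embed_image(image), text_embed(token_ids)], dim=0)

seq = embed_sequence(torch.randn(3, 300, 300), torch.tensor([1, 2, 3]))
print(seq.shape)  # torch.Size([103, 4096]) -- 100 patches + 3 text tokens
```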
joanfihu · over 1 year ago
I’ve done a review for UI navigation:

https://joanfihu.wordpress.com/2023/10/19/evaluating-adepts-fuyu-model-for-ui-navigation/
abrichr · over 1 year ago
Thank you to the amazing team at Adept.ai for making this available!

For anyone interested in contributing to a fully open source alternative, join us at https://github.com/OpenAdaptAI/OpenAdapt

Lots of interesting work to be done, including integrating with Fuyu-8B!
thatcherc · over 1 year ago
Really cool that the image patches are converted to tokens with just a linear projection instead of a big embedding model! I wonder if that trick will prove viable for other multimodal media like audio.
mark_l_watson · over 1 year ago
This looks so cool, and from reading the Hugging Face model card it should be easy enough to run. I do almost all of my work with text, NLP, IR, etc., and I have wanted to try multi-modal models. I just bookmarked the model card page.

I am also getting even more excited by the explosion of work on open models. I still haven’t adjusted to how good mistral-7B is, and it runs on my Mac without breaking a sweat.
yeldarb · over 1 year ago
This looks epic. Definitely going to explore adding it to Autodistill[1] this weekend. Any chance you'll be publicly releasing the internal OCR finetune?

[1] https://github.com/autodistill/autodistill
devinprater · over 1 year ago
Awesome! I can't wait to see how we can make local models for, say, describing images offline, or even getting a few screenshots of, say, a video game and describing what's going on.
stavros · over 1 year ago
This looks great! Is there any software that supports these? Llama.cpp, Ollama, LM Studio, etc. are really convenient, but I don't think they have image support yet?
paulkon · over 1 year ago
Can this be used to click around in the browser with text prompts? Maybe after some fine-tuning on screen recordings of specific workflows in browsers.
WanderPanda · over 1 year ago
Why don't these benchmarks judge the likelihood of the example answer? Just taking the MAP predictions seems like a waste of information.
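For context on what that would look like in practice, here is a generic sketch (placeholder model, tokenizer, and prompt; not tied to any particular benchmark harness) of scoring a reference answer's log-likelihood under a causal LM instead of only checking its greedy/MAP output:

```python
# Generic sketch: score a reference answer's log-likelihood under a causal LM
# rather than only comparing the greedy (MAP) output against it.
# The model name and prompt below are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_log_likelihood(prompt: str, answer: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities the model assigned to each next token.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    # Keep only the positions that predict the answer tokens.
    answer_preds = log_probs[:, prompt_ids.shape[1] - 1:]
    token_lp = answer_preds.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

print(answer_log_likelihood("Q: What color is the sky? A:", " blue"))
```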
thefcpk · over 1 year ago
One thing that puzzles me is the lack of multilingual models... it is a bit sad to see everything through the English language.
StephenAshmore · over 1 year ago
Fascinating! I love seeing more multimodal ML. Thanks for sharing!
og_kalu · over 1 year ago
Oh wow. This seems to be the best released VLM. The chart/UI understanding displayed in particular is superb.
lxe · over 1 year ago
Comparable with llava13b in benchmarks! Great work!
ronsor · over 1 year ago
Before someone else does, I'm going to point out that CC-BY-NC is technically not an open source license.