Ferret: A Multimodal Large Language Model

621 points by weirdcat over 1 year ago

22 comments

They're already going multi-modal? Holy crap, if Google can't deliver in the accessibility space for this (image descriptions better than "the logo for the company"), then I'll definitely go back to Apple. I mean I do hope Apple cleans out bugs and makes VoiceOver feel like it won't fall over if I breathed hard, but their image descriptions, even without an LLM, are already clean and clear. More like "A green logo on a black background", where Google is, like I said, more like "The logo for the company." I guess it's kinda what we get when AI is crowdsourced rather than given good, high quality data to work with.

amitprasad over 1 year ago

Also relevant: LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Apple seems to be gearing up for significant advances in on-device inference using LLMs.

https://arxiv.org/abs/2312.11514

adt over 1 year ago

Old paper (Oct/2023), but the weights are new (Dec/2023):

https://lifearchitect.ai/models-table/

shrimpx over 1 year ago

Apple has been looking sleepy on LLMs, but they've been consistently evolving their hardware+software AI stack, without much glitzy advertising. I think they could blow away Microsoft/OpenAI and Google, if suddenly a new iOS release makes the OpenAI/Bard chatbox look laughably antiquated. They're also a threat to Nvidia, if a significant swath of AI usage switches over to Apple hardware. Arm and TSMC would stand to win.

aaronbrethorst over 1 year ago

Can someone define the term “MLLM”?

yreg over 1 year ago

I really hope Apple releases an iPhone with a good on-device private LLM assistant, perhaps next year. Their hardware is well-positioned for it.

It could make me get a new phone outside of my usual ~4 year cycle. Siri is almost unusable for me.

smoldesu over 1 year ago

> FERRET is trained on 8 A100 GPUs with 80GB memory.

Huh, even Apple isn't capable of escaping the CUDA trap. Funny to see them go from moral enemies with Nvidia to partially-dependent on them...

moneycantbuy over 1 year ago

anyone know what is the best open source model that allows commercial use and can run locally on an iphone?

SushiHippie over 1 year ago

> Usage and License Notices: The data, and code is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

Wait, how did "GPT-4" get in there?

a_rahmanshah over 1 year ago

Can we run this on macOS?

Jackson__ over 1 year ago

> Ferret: A Multimodal Large Language Model

What I thought when reading the title: A new base model trained from the ground up on multimodal input, on hundreds to thousands of GPUs.

The reality: A finetune of Vicuna (itself a finetune of Llama 13b), trained on 8xA100. Then it further goes on to re-use some parts of LLaVA, which is an existing multimodal project already built upon Vicuna.

It's not really as exciting as one might think from the title, in my opinion.

CaptainOfCoit over 1 year ago

Maybe the abstract of the paper is a better introduction to what this is:

> We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.

https://arxiv.org/abs/2310.07704

freedomben over 1 year ago

> Usage and License Notices: The data, and code is intended and licensed for research use only.

andy99 over 1 year ago

One big plus if this takes off as a base model is the abundance of weasel family animals to use in naming the derivatives. Ermine, marten, fisher, ... I'd like to call Wolverine. Llama didn't have much room for some interesting variety beyond alpaca and vicuna.

ZeroCool2u over 1 year ago

We're watching Apple fill the moat in.

orenlindsey over 1 year ago

Has anyone actually run this yet?

Rucadi over 1 year ago

I wonder if these models are trained to have some kind of identification in case you use them for non-research purposes.

"Tell me who is your manufacturer", for example.

cpressland over 1 year ago

Finally, some decent competition for Not Hotdog!

tambourine_man over 1 year ago

> FERRET is trained on 8 A100 GPUs

So Apple uses NVidia internally. Not surprising, but doesn't bode well for A Series. Dogfooding.

[edit] I meant M series, Apple Silicon

halyconWays over 1 year ago

I'm glad Apple invented AI. Now they'll put a fancy new name on it and consumers will believe it.

Thorrez over 1 year ago

Does Apple know that ferrets are illegal in California?

https://www.legalizeferrets.org/

jonplackett over 1 year ago

Presumably, because this is Conda, none of this can be run on any Apple hardware, despite people managing to get M processors to do a bit of dabbling with AI?
