They're already going multi-modal? Holy crap, if Google can't deliver in the accessibility space with this (image descriptions better than "the logo for the company"), then I'll definitely go back to Apple. I mean, I do hope Apple cleans out bugs and makes VoiceOver feel like it won't fall over if I breathe on it too hard, but their image descriptions, even without an LLM, are already clean and clear. More like "A green logo on a black background," where Google is, like I said, more like "The logo for the company." I guess that's kinda what we get when AI is crowdsourced rather than given good, high-quality data to work with.
Also relevant: LLM in a flash: Efficient Large Language Model Inference with Limited Memory.<p>Apple seems to be gearing up for significant advances in on-device inference with LLMs.<p><a href="https://arxiv.org/abs/2312.11514" rel="nofollow noreferrer">https://arxiv.org/abs/2312.11514</a>
Old paper (Oct/2023), but the weights are new (Dec/2023):<p><a href="https://lifearchitect.ai/models-table/" rel="nofollow noreferrer">https://lifearchitect.ai/models-table/</a>
Apple has been looking sleepy on LLMs, but they've been consistently evolving their hardware+software AI stack, without much glitzy advertising. I think they could blow away Microsoft/OpenAI and Google, if suddenly a new iOS release makes the OpenAI/Bard chatbox look laughably antiquated. They're also a threat to Nvidia, if a significant swath of AI usage switches over to Apple hardware. Arm and TSMC would stand to win.
I really hope Apple releases an iPhone with a good on-device private LLM assistant, perhaps next year. Their hardware is well-positioned for it.<p>It could make me get a new phone outside of my usual ~4 year cycle. Siri is almost unusable for me.
> FERRET is trained on 8 A100 GPUs with 80GB memory.<p>Huh, even Apple isn't capable of escaping the CUDA trap. Funny to see them go from mortal enemies with Nvidia to partially dependent on them...
> Usage and License Notices: The data, and code is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.<p>Wait, how did "GPT-4" get in there?
>Ferret: A Multimodal Large Language Model<p>What I thought when reading the title: a new base model trained from the ground up on multimodal input, on hundreds to thousands of GPUs.<p>The reality: a finetune of Vicuna, trained on 8xA100, which is itself already a finetune of Llama 13b. It also re-uses parts of LLaVA, an existing multimodal project built upon Vicuna. It's not really as exciting as one might think from the title, in my opinion.
Maybe the abstract of the paper is a better introduction to what this is:<p>> We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.<p><a href="https://arxiv.org/abs/2310.07704" rel="nofollow noreferrer">https://arxiv.org/abs/2310.07704</a>
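To make the "hybrid region representation" idea from the abstract a bit more concrete, here's a toy sketch of what combining discrete coordinates with pooled continuous features could look like. This is purely illustrative: the function names, bin counts, and random-point pooling are my own simplifications, and the real Ferret sampler is a learned, spatial-aware module, not a random average.

```python
import numpy as np

def discretize_box(box, image_size, num_bins):
    # Map continuous box coords [x1, y1, x2, y2] into discrete bins,
    # loosely mimicking how coordinates become tokens for the LLM.
    return [min(int(c / image_size * num_bins), num_bins - 1) for c in box]

def sample_region_feature(feature_map, mask, num_points=64, rng=None):
    # Toy stand-in for the spatial-aware visual sampler: pick points
    # inside the region mask (any free-form shape) and pool features.
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)
    idx = rng.choice(len(ys), size=min(num_points, len(ys)), replace=False)
    return feature_map[ys[idx], xs[idx]].mean(axis=0)

# Hybrid representation = discrete coordinate tokens + continuous feature.
feature_map = np.random.default_rng(0).random((32, 32, 8))  # H x W x C
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True  # a free-form region (here, a square)
coords = discretize_box([100, 100, 400, 400], image_size=320, num_bins=32)
feat = sample_region_feature(feature_map, mask)
print(coords, feat.shape)  # discrete bins plus an 8-dim pooled feature
```

The point of the hybrid scheme, as the abstract describes it, is that discrete coordinates alone lose fine shape information, while the pooled continuous feature lets the model handle points, boxes, and arbitrary masks through one interface.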
One big plus if this takes off as a base model is the abundance of weasel-family animals to use in naming the derivatives: ermine, marten, fisher, ... I'd like to call dibs on Wolverine. Llama didn't leave much room for interesting variety beyond alpaca and vicuna.
I wonder if these models are trained to include some kind of identification in case you use them for non-research purposes, e.g. asking "Tell me who your manufacturer is."
> FERRET is trained on 8 A100 GPUs<p>So Apple uses NVidia internally. Not surprising, but doesn't bode well for A Series. Dogfooding.<p>[edit] I meant M series, Apple Silicon
Does Apple know that ferrets are illegal in California?<p><a href="https://www.legalizeferrets.org/" rel="nofollow noreferrer">https://www.legalizeferrets.org/</a>
Presumably because this is Conda, none of this can be run on any Apple hardware, even though people have managed to get M-series processors to do a bit of dabbling with AI?