MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training

179 points by lord_sudo, about 1 year ago

10 comments

nl, about 1 year ago
This is an awesome paper, and the somewhat negative sentiment in the discussion here is surprising.

The ablation studies are well done, comprehensive, and expensive to do. People will be using the conclusions from this for years, and that is much more impactful than whether an upcoming Siri product outperforms the GPT model at that same point in time.

A few really interesting points:

- Synthetic datasets substantially (1%+) increase performance for image encoder pre-training.
- The architecture of the visual<->language model connector doesn't seem to matter.
- Interleaving text and image data improves few-shot performance, but image captioning data improves zero-shot numbers.
- The ideal mix of data types is 5:5:1 for interleaved : captions : plain text (!)
- Synthetic captioning data helps substantially at this point too (up to a 4% gain).

The appendices are amazing: lots of details about learning rates tried and batch sizes.

The "explain these figures" examples are really, really good. See page 37.
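The 5:5:1 mix described above can be sketched as a weighted sampler. This is a hypothetical illustration: the source names and the sampling scheme are mine, not the paper's actual data loader.

```python
import random

# Illustrative 5:5:1 interleaved:captions:plain-text data mix
# (weights from the comment above; everything else is made up).
MIX = {"interleaved": 5, "captions": 5, "plain_text": 1}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mix weight."""
    sources = list(MIX)
    weights = [MIX[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIX}
for _ in range(11_000):
    counts[sample_source(rng)] += 1
# Over 11,000 draws, expect roughly 5000 / 5000 / 1000
```

In a real training pipeline the weights would apply at the batch level rather than per-example, but the proportions work out the same.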
brookst, about 1 year ago
The paper explores different design choices for various parts of the model and draws conclusions about the relative importance of optimizing each area (the image encoder is very important, the vision-language connector less so).

The actual set of models produced (up to 30B parameters) seems secondary to the intent of the paper, and is more a validation of the best design choices in each area.
reaperman, about 1 year ago
This looks competitive against CLIP, and surprisingly great at VQA-style prompts, but it doesn't seem like the paper supports comparing it to GPT-4. We don't see any tests for coding performance, math homework, legal document review, or any of the myriad other things people use GPT-4 for on a daily basis.
lolinder, about 1 year ago
MM1 is a research paper, not the release of a competing product. I'm sure the paper is interesting, and I'm looking forward to reading an analysis of it by someone who understands these things better than I do, but this is not that analysis; it's an extremely low-effort puff piece more interested in getting attention than in accurately describing a research paper.

I don't usually say this, but TFA frankly feels like it was written by AI:

> The release of MM1 by Apple contributes significantly to the artificial intelligence domain, offering a detailed roadmap for the development of future MLLMs. By sharing the insights and design principles gleaned from MM1, Apple not only challenges the current capabilities of models like ChatGPT but also invites the broader AI community to build upon their findings, potentially leading to more sophisticated and capable AI systems.
refibrillator, about 1 year ago
The biggest model is a 30B MoE trained on 100B tokens, with a max sequence length of 4096. A bit underwhelming compared to recent announcements like the open-source Large World Model [1].

Absolutely no benchmarks against GPT-4 are present in the paper.

Notably, they used instruction-response pairs generated from GPT-4 for supervised fine-tuning. That has always felt like an experimental hack to me, but it's how many folks are bootstrapping smaller models these days, and the effectiveness is hard to argue with.

Apple's axlearn framework was used, which leverages JAX and XLA [2].

[1] https://news.ycombinator.com/item?id=39367141

[2] https://github.com/apple/axlearn
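To put "30B parameters on 100B tokens" in perspective, here is a back-of-envelope comparison against the oft-cited Chinchilla heuristic of roughly 20 training tokens per parameter. The heuristic was derived for dense text-only LLMs, so it applies only loosely to a multimodal MoE; the numbers are illustrative, not from the MM1 paper.

```python
# Rough scaling arithmetic for the figures quoted in the comment above.
params_b = 30    # billions of parameters (largest MM1 model, an MoE)
tokens_b = 100   # billions of training tokens, per the comment

tokens_per_param = tokens_b / params_b          # ~3.3 tokens per parameter
chinchilla_tokens_b = 20 * params_b             # heuristic budget: 600B tokens

# By this crude yardstick, the model saw only a few tokens per parameter,
# well short of the ~20 the heuristic suggests for compute-optimal training.
```

Note the caveat: for an MoE, the active parameter count per token is smaller than the total, so the effective ratio is somewhat less unfavorable than this naive calculation suggests.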
pushedx, about 1 year ago
This has an unfortunate naming collision with the M/M/1 queue, a common stochastic model in the study of queueing theory.

https://en.wikipedia.org/wiki/M/M/1_queue
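For anyone unfamiliar with the other MM1: the M/M/1 queue's steady-state behavior comes down to a few closed-form expressions (Poisson arrivals at rate lam, exponential service at rate mu, stable only when lam < mu). A minimal sketch, unrelated to Apple's model:

```python
# Standard closed-form M/M/1 queue metrics (queueing theory, not the LLM).
def mm1_metrics(lam: float, mu: float) -> dict:
    """Steady-state metrics for an M/M/1 queue with arrival rate lam
    and service rate mu; requires lam < mu for stability."""
    if lam >= mu:
        raise ValueError("unstable queue: need lam < mu")
    rho = lam / mu           # server utilization
    L = rho / (1 - rho)      # mean number of customers in the system
    W = 1 / (mu - lam)       # mean time a customer spends in the system
    return {"rho": rho, "L": L, "W": W}

# e.g. 2 arrivals/s against a 4/s server: rho=0.5, L=1.0, W=0.5
```

Little's law (L = lam * W) ties the last two together, which is a handy sanity check on the formulas.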
smokel, about 1 year ago
The paper lists "first authors", "core authors", and "senior authors".

My dream is to one day be listed on a seminal paper as "secondary forum reply author".
a_vanderbilt, about 1 year ago
I wonder if this has anything to do with their acquisition of DarwinAI. After a decade of mediocrity, I'd love to see Siri get smarter. Any improvement would be welcome at this point.
erulabs, about 1 year ago
If it's going to take artificial general intelligence to get a voice assistant that can remember not one but two entirely separate cooking timers, then so be it. Imagine the GPUs required!

I'm still baffled by Siri and Google Assistant. Virtually zero innovation in a decade. I just want to be able to turn on BBC Radio while my hands are wet; is that really so hard?!
BryanLegend, about 1 year ago
Training