The post is down on Twitter now, so here's an archive: https://archive.is/Y72Gu

The reason companies/researchers haven't generally touched MoE for LLMs, despite how good it sounds on paper, is that sparse models have typically underperformed their dense counterparts.

Assuming this is all true, did OpenAI do anything differently here, or is it just scale?

I know this very recent paper shows MoE models benefit far more from instruction tuning: https://arxiv.org/abs/2305.14705

FLAN-MoE-32B comfortably surpasses FLAN-PaLM-62B with a third of the compute. It goes from 25.5% to 65.4% on MMLU, compared with 55.1% to 59.6% for FLAN-PaLM-62B. That low starting point is exactly the underperformance you'd expect from sparse models.

But from OpenAI's technical report, it doesn't seem like they needed that.

The vision component seems to be just scale. Well, all of it seems to be just scale. And it seems like there's plenty of scale left as far as performance gains go.
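For context on what a sparse MoE layer actually is, here is a minimal sketch of top-2 expert routing in PyTorch. This is my own illustration, not OpenAI's or FLAN-MoE's code; the class names, expert count, and sizes are made up. The point is just that each token is sent through only k of the expert feed-forward networks by a learned gate, so parameter count grows without a proportional increase in compute per token.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExpertFFN(nn.Module):
        # One "expert": an ordinary transformer feed-forward block.
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )

        def forward(self, x):
            return self.net(x)

    class Top2MoE(nn.Module):
        # Sparse MoE layer: a router picks k experts per token and mixes
        # their outputs with softmax weights. (Illustrative only; names
        # and hyperparameters are assumptions, not from the paper/report.)
        def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)  # learned gate
            self.experts = nn.ModuleList(
                [ExpertFFN(d_model, d_hidden) for _ in range(num_experts)]
            )
            self.k = k

        def forward(self, x):  # x: (tokens, d_model)
            gate_logits = self.router(x)                      # (tokens, num_experts)
            weights, idx = gate_logits.topk(self.k, dim=-1)   # k experts per token
            weights = F.softmax(weights, dim=-1)              # normalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e                  # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
            return out

For example, Top2MoE()(torch.randn(16, 512)) pushes 16 token vectors through only 2 of the 8 experts each, which is where the compute savings (and, historically, the training instability and quality gap versus dense models) come from.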
A long intro with no real content, just tech bro "here's the thing" stuff trying to bait you into subscribing. Doesn't actually explain the architecture if you don't subscribe. Don't bother reading.