> Many critical details regarding this scaling process were only disclosed with the recent release of DeepSeek V3

And so they decide not to disclose their own training information, right after telling everyone how useful it was to get DeepSeek's? Honestly can't say I care about "nearly as good as o1" when it's a closed API with no additional info.
I thought there were three DeepSeek items on the HN front page, but this turned out to be a fourth one, because it's the Qwen team saying they have a secret version of Qwen that's actually better than DeepSeek-V3.

I don't remember the last time 20% of the HN front page was about the same thing. Then again, *nobody* remembers the last time a company's market cap fell by 569 billion dollars like NVIDIA's did yesterday.
A Chinese company announcing this on Spring Festival eve is very surprising. The DeepSeek announcement must have lit a fire under them. I'm surprised anything is getting done at these Chinese tech companies right now.
I just ran my NYT Connections benchmark on it: 18.6, up from 14.8 for Qwen 2.5 72B. I'll run my other benchmarks later.

https://github.com/lechmazur/nyt-connections/
Kinda ambivalent about MoE in the cloud. Where it could really shine, though, is on desktop-class gear. Memory is getting fast enough that large-ish MoEs might soon not be painfully slow, since only the active experts have to be streamed per token.
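To make that concrete, here's a rough, memory-bandwidth-bound back-of-envelope sketch in Python. Every number in it (active parameter counts, quantization width, bandwidth figures) is an illustrative assumption, not a measurement, and the full expert set still has to fit in RAM; MoE only shrinks the per-token reads.

```python
# Back-of-envelope decode-throughput estimate for memory-bandwidth-bound inference.
# All figures below are illustrative assumptions, not benchmarks.

def decode_tokens_per_sec(active_params_b: float,
                          bytes_per_param: float,
                          mem_bandwidth_gb_s: float) -> float:
    """Estimate tokens/sec when decoding is bound by streaming the active
    parameters from memory once per generated token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / bytes_per_token

# Hypothetical dense 70B model, 4-bit quantized (~0.5 bytes/param),
# on dual-channel DDR5 (~90 GB/s effective).
print(f"dense 70B @ 90 GB/s:        {decode_tokens_per_sec(70, 0.5, 90):.1f} tok/s")

# Hypothetical MoE with ~37B parameters active per token (DeepSeek-V3-like),
# same quantization and bandwidth -- only the routed experts are read per token,
# but all experts must still be resident in RAM.
print(f"MoE ~37B active @ 90 GB/s:  {decode_tokens_per_sec(37, 0.5, 90):.1f} tok/s")

# Faster unified memory (~400 GB/s, high-end desktop / Apple-silicon-class gear).
print(f"MoE ~37B active @ 400 GB/s: {decode_tokens_per_sec(37, 0.5, 400):.1f} tok/s")
```

The point of the arithmetic: at the same total quality tier, fewer active parameters per token buys you proportionally more tokens per second on bandwidth-starved desktop memory, which is why MoE looks more interesting locally than in the cloud, where you pay for all that idle expert RAM.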
The significance of _all_ of these releases at once is not lost on me. But the reason for it is lost on me. Is there some convention? Is this political? Business strategy?
> We evaluate Qwen2.5-Max alongside leading models

> [...] we are unable to access the proprietary models such as GPT-4o and Claude-3.5-Sonnet. Therefore, we evaluate Qwen2.5-Max against DeepSeek V3

"We'll compare our proprietary model to other proprietary models. Except when we don't. Then we'll compare to non-proprietary models."