I'm testing it for system programming brainstorming, code reviews, and Python unit test writing, and my impression is that it's a Sonnet 3.5-level model for most tasks. I said a few things here: <a href="https://www.youtube.com/watch?v=xjCqi9JK440" rel="nofollow">https://www.youtube.com/watch?v=xjCqi9JK440</a> but in general this is really an open-weights frontier model, the first one we've got (IMHO Llama 3.1 405B does not fit the definition, and the actual model quality is far from the benchmarks). Also the extreme inference speed, due to MoE and other design choices, improves the user experience a lot. I also tested asking questions with very large contexts (PDFs, large C files) at play, and it performs very well.<p>Also don't just focus on this model but check out what DeepSeek's mission is, and the CEO's words in the recently released interview. They want to be the DJI / Bambulab of AI, basically: leaders and not followers, and after V3 it's hard to say they don't have the right brains to do that.
If you understand how LLMs work, you should disregard tests such as:<p>- How many 'r's are in Strawberry?<p>- Finding the fourth word of the response<p>These tests are at odds with how the tokenizer and next-token prediction work (see the sketch below).
They do not accurately represent an LLM's capabilities.
It's akin to asking a blind person to identify colors.
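A minimal sketch of the mismatch, assuming the tiktoken library and its cl100k_base encoding (the real split depends on whichever tokenizer the model uses): the model receives token IDs for multi-character chunks, not individual letters, so "count the r's" asks about information it never directly sees.

    import tiktoken

    # Assumption for illustration: cl100k_base; other models use other encodings.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("Strawberry")
    chunks = [enc.decode_single_token_bytes(i) for i in ids]
    print(ids)     # a handful of token IDs, not ten separate characters
    print(chunks)  # multi-character byte chunks; the exact split depends on the encoding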
I know future GPU development is addressing the constrained-RAM problem, but it is nonetheless a massive problem for local inference. MoE seems to solve a compute problem at the expense of compounding the RAM problem. So I have a question... My understanding is that the typical MoE model starts <i>each output token</i> with a decision as to which expert model(s) to send inference tasks to. How often do the vast majority of predictions end up being sent to the same expert(s)? Wouldn't it be more practical, from both a training and inference perspective, to do the same mixture-of-experts model but choose experts at a much coarser granularity? Maybe at the level of the whole response, or clause, or sentence? At least then you could load an expert into RAM and expect to use it without having to do massive IO loading/unloading constantly.
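For intuition, here's a per-token routing sketch (pure NumPy, toy weights, not DeepSeek's actual router): routing is decided per token, and in real models per MoE layer, so consecutive tokens routinely land on different experts, which is what makes the RAM/IO problem hard for local inference.

    import numpy as np

    # Toy router: per-token top-k gating, the standard MoE pattern.
    # Assumptions for illustration: 8 experts, top-2 routing, random weights.
    rng = np.random.default_rng(0)
    n_experts, top_k, d_model = 8, 2, 16
    W_gate = rng.normal(size=(d_model, n_experts))

    def route(token_hidden):
        logits = token_hidden @ W_gate          # one score per expert
        chosen = np.argsort(logits)[-top_k:]    # pick the top-k experts for THIS token
        return sorted(chosen.tolist())

    tokens = rng.normal(size=(6, d_model))      # 6 consecutive tokens
    for t, h in enumerate(tokens):
        print(f"token {t} -> experts {route(h)}")
    # Each token (and each MoE layer) makes its own choice, so consecutive
    # tokens can hit different experts -- which is why all experts must stay
    # resident in memory (or be swapped in and out) during decoding.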
From the article:<p>> They probably trained the model on a synthetic dataset generated by GPT-4o.<p>This seems to be the case. To speculate further: they may also have trained on copyrighted material that OpenAI did not.
A lot of talk about how much cheaper it is than all other models.<p>It remains to be seen what the pricing will be when it's run by non-DeepSeek providers. DeepSeek might be loss-leading.<p>The comparison for cheap models should also be Gemini 2.0 Flash Exp. I could see it being even cheaper when it stops being free - if it ever does. There's definitely a scenario where Google just keeps it free-ish for a long time with relatively high limits.
Open weights are nice, but they're just the end product of a black-box process (training data, alignment methods, filtering choices, etc.).<p>Like with all of these models, we don't know what's in them.