Mixtral of experts

639 points by georgehill, over 1 year ago

26 comments

BryanLegend, over 1 year ago
Andrej Karpathy's take:

Official post on Mixtral 8x7B: https://mistral.ai/news/mixtral-of-experts/

Official PR into vLLM shows the inference code: https://github.com/vllm-project/vllm/commit/b5f882cc98e2c9c6dde7357dbac2ec0c2c57d8cd

New HuggingFace explainer on MoE, very nice: https://huggingface.co/blog/moe

In naive decoding, performance a bit above 70B (Llama 2), at the inference speed of a ~12.9B dense model (out of 46.7B total params).

Notes:
- Glad they refer to it as an "open weights" release instead of "open source", which would imo require the training code, dataset and docs.
- The "8x7B" name is a bit misleading because it is not all 7B params that are being 8x'd; only the FeedForward blocks in the Transformer are 8x'd, everything else stays the same. Hence also why the total number of params is not 56B but only 46.7B.
- More confusion I see is around expert choice: note that each token, and also each layer, selects 2 different experts (out of 8).
- Mistral-medium

Source: https://twitter.com/karpathy/status/1734251375163511203

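To make the point about only the FeedForward blocks being 8x'd concrete, here is a minimal PyTorch sketch of a sparse-MoE feed-forward layer with top-2 routing. It is illustrative only: the default sizes follow Mixtral's published configuration (hidden size 4096, FFN size 14336, 8 experts, 2 active per token), but the expert here is a plain two-layer MLP rather than Mixtral's gated SwiGLU FFN, and the per-expert loop is written for clarity, not speed.

```python
# Illustrative sparse-MoE feed-forward layer: a tiny router picks 2 of 8 expert
# FFNs per token and blends their outputs with softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model=4096, d_ff=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # tiny gating layer
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                         # x: (n_tokens, d_model)
        logits = self.router(x)                   # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):            # each token uses exactly top_k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Smoke test with small dimensions (the real sizes need tens of GB of memory):
layer = SparseMoEFeedForward(d_model=32, d_ff=64, n_experts=8, top_k=2)
print(layer(torch.randn(5, 32)).shape)            # torch.Size([5, 32])
```

Attention, norms and embeddings sit outside this block and are shared exactly as in a dense model, which is why the total comes to 46.7B rather than 8 x 7B = 56B.
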
korvalds, over 1 year ago
More models are available on Hugging Face now: https://huggingface.co/search/full-text?q=mixtral

Already available from both mistralai and TheBloke: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 and https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF

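For anyone wanting to try the mistralai checkpoint directly, a minimal sketch with Hugging Face transformers follows. It assumes a transformers release recent enough to include Mixtral support and enough combined GPU/CPU memory for ~46.7B parameters in half precision (roughly 90+ GB); the prompt and generation settings are placeholders.

```python
# Minimal sketch: load the base checkpoint and generate a short completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; 8-bit/4-bit quantization cuts this further
    device_map="auto",          # spread layers across available GPUs and CPU RAM
)

inputs = tokenizer("Mixtral is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
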
pugio, over 1 year ago
> We're currently using Mixtral 8x7B behind our endpoint mistral-small...

So 45 billion parameters is what they consider their "small" model? I'm excited to see what (and if) their larger models will be.

seydor, over 1 year ago
> A proper preference tuning can also serve this purpose. Bear in mind that without such a prompt, the model will just follow whatever instructions are given.

Mistral does not censor its models and is committed to a hands-off approach, according to their CEO: https://www.youtube.com/watch?v=EMOFRDOMIiU

> Mixtral 8x7B masters French, German, Spanish, Italian, and English.

EU budget cut by half.

trash_cat, over 1 year ago
This is very exciting, and I think this is the future of AI until we get another, next-gen architecture beyond transformers. I don't think models will get much better until that happens, so the effort will go into making them much cheaper to run without sacrificing too much accuracy. MoE is a viable solution for that.

reqo, over 1 year ago
Can someone explain why MoE works? Is there any downside to MoE compared to a regular model?
mijoharas, over 1 year ago
> It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs.

Is there any link to the model and weights? If so, I don't see it.

chandureddyvari, over 1 year ago
Sorry if this is a dumb question. Can someone explain why it's called 8x7B (56B) when it has only 46.7B params? And why does it use 12.9B params per generated token when there are 2 experts (2x7B) chosen by a 2B model? I'm finding it difficult to wrap my head around this.

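A back-of-the-envelope count answers both questions. The sketch below uses the configuration values of the released checkpoint (hidden size 4096, FFN size 14336, 32 layers, 8 experts, top-2 routing, 32k vocabulary, grouped-query attention with 8 KV heads); the attention and embedding terms are approximations, and the router that picks the experts is actually a tiny linear layer per block, not a 2B model.

```python
# Approximate parameter count for Mixtral 8x7B (RMSNorm and other small terms ignored).
d_model, d_ff, n_layers, n_experts, top_k, vocab = 4096, 14336, 32, 8, 2, 32000

ffn_per_expert   = 3 * d_model * d_ff                  # gate, up and down projections
attn_per_layer   = (2 * d_model * d_model
                    + 2 * d_model * (d_model // 4))    # q/o plus grouped-query k/v
router_per_layer = d_model * n_experts                 # the expert "picker" is tiny
embeddings       = 2 * vocab * d_model                 # input embedding + LM head

total  = n_layers * (n_experts * ffn_per_expert + attn_per_layer + router_per_layer) + embeddings
active = n_layers * (top_k     * ffn_per_expert + attn_per_layer + router_per_layer) + embeddings

print(f"total params ~ {total / 1e9:.1f}B")   # ~46.7B: only the FFN is duplicated 8x
print(f"active/token ~ {active / 1e9:.1f}B")  # ~12.9B: each token runs 2 of the 8 expert FFNs
```

Only the FFN weights are multiplied by 8; attention, norms and embeddings are shared, which is why the total is 46.7B rather than 56B, and a token only touches two of the eight expert FFNs, giving the ~12.9B active figure.
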
dizzydes, over 1 year ago
Honest question: if they're only beating GPT-3.5 with their latest model (not GPT-4), and OpenAI/Google have infrastructure on tap plus a huge distribution advantage via existing products, what chance do they stand?

How do people see things going in the future?

jstummbillig, over 1 year ago
Is there a non-obvious reason that models keep getting compared to GPT-3.5 instead of 4?
ensocode, over 1 year ago
Is anyone using these models self-hosted in production? Which cloud hosting provider/plan do you use, and how does it perform?

maelito, over 1 year ago
Just tried to register; I haven't received the confirmation email.

mercat, over 1 year ago
Can someone please explain how this works to a software engineer who is used to working with heuristically observable functions and algorithms? I'm having a hard time comprehending how a mixture of experts can work.

In SE terms, to me it would look like this (sorting example):

- Have 8 functions that do some stuff in parallel.

- One function picks the output of whichever function (let's say) did the fastest sorting and carries that result forward.

But how does that work in ML? How can you mix and match what seem like simple matrix transformations in a way that resembles if/else flowchart logic?

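In ML terms the "picker" is not an if/else over finished outputs but a learned linear layer whose softmax scores say how much each expert's output should count; "selection" just means keeping the top-2 scores and zeroing the rest, so the branching stays soft and differentiable. A tiny numpy sketch (illustrative only, not Mixtral's actual code):

```python
# Toy MoE layer for one token: a learned router scores 8 expert matrices,
# the two best are run, and their outputs are blended with softmax weights.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

W_router = rng.normal(size=(d, n_experts))                     # learned alongside the experts
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # stand-ins for expert FFNs

def moe_layer(x):                              # x: one token's hidden vector, shape (d,)
    scores = x @ W_router                      # one score per expert
    top = np.argsort(scores)[-top_k:]          # indices of the 2 highest-scoring experts
    w = np.exp(scores[top]); w /= w.sum()      # softmax over just those 2
    return sum(w_i * (x @ experts[i]) for w_i, i in zip(w, top))

print(moe_layer(rng.normal(size=d)).shape)     # (16,): same shape as a dense FFN output
```

Unlike the sorting analogy, the router does not inspect the experts' outputs after the fact: it decides up front, from the token's hidden vector, which two experts to run, and the result is their weighted blend.
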
HarHarVeryFunny, over 1 year ago
Interesting to see Mistral in the news raising EUR 450M at a EUR 2B valuation. Tbh, I'd not even heard of them before this Mixtral release. Amazing how fast this field is developing!
legel, over 1 year ago
The Sparse Mixture of Experts neural network architecture is actually an absolutely brilliant move here.

It scales fantastically when you consider that (1) GPU RAM is way too expensive in dollar terms, (2) SSD / CPU RAM is relatively cheap, and (3) you can have "experts" running on their own computers, i.e. it's a natural distributed-computing partitioning strategy for neural networks.

I did my M.S. thesis on large-scale distributed deep neural networks in 2013 and can say that I'm delighted to point out where this came from.

In 2017, it emerged from a Geoffrey Hinton / Jeff Dean / Quoc Le publication called "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".

Here is the abstract: "The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost."

So here's a big A.I. idea for you: what if we all get one of these sparse Mixtures of Experts (MoEs) that is 100 GB on our SSDs, contains all of the "outrageously large" neural network insights that would otherwise take specialized computers, and is designed to run effectively on a normal GPU or something even smaller (e.g. a smartphone)?

Source: https://arxiv.org/abs/1701.06538

i8comments, over 1 year ago
The claim that it is better than GPT-3.5 in practice should be taken with a grain of salt, since the benchmarks themselves aren't... ideal.

Despite the questionable marketing claim, it is a great LLM for other reasons.

inChargeOfIT, over 1 year ago
It sounds like the same requirements as a 70B+ model, but if someone manages to get inference running locally on a single RTX 4090 (AMD 7950X3D with 64GB DDR5) reasonably well, please let me know.

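One plausible route, sketched below with llama-cpp-python under stated assumptions: a 4-bit GGUF quantization of the base model is roughly 25 GB on disk, so it cannot sit entirely in the 4090's 24 GB of VRAM, but a llama.cpp build recent enough to support Mixtral can offload some layers to the GPU and keep the rest in the 64 GB of system RAM. The file name and n_gpu_layers value are placeholders to tune, not verified settings.

```python
# Sketch: partial GPU offload of a quantized Mixtral GGUF via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # e.g. a 4-bit file from TheBloke's GGUF repo
    n_gpu_layers=20,   # offload as many layers as fit in 24 GB VRAM; the rest run on CPU
    n_ctx=4096,        # context window to allocate
)

print(llm("Mixtral is", max_tokens=64)["choices"][0]["text"])
```

Throughput will be bounded by the CPU-resident layers, so expect something usable rather than fast.
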
gardenhedge, over 1 year ago
They're only paying €80,000 for a full-stack dev, and they want the candidate to have a "portfolio of successful projects".

bilsbie, over 1 year ago
I wonder if the brain uses a mixture of experts?
DeathArrow, over 1 year ago
What kind of training data did they use? Are the training data and replies censored in any way?
jurmous, over 1 year ago
Are there any scores for Dutch support? Is it not supported at all, or just not benchmarked?

mkesper, over 1 year ago
Here, too, the graphs are confusing because they don't show the full y-axis (please don't do that!): https://towardsdatascience.com/misleading-graphs-e86c8df8c5de

_giorgio_, over 1 year ago
Can Mixtral be fine-tuned in any way?

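One common route, offered here as an assumption rather than anything Mistral documents: parameter-efficient fine-tuning (QLoRA) with Hugging Face transformers, peft and bitsandbytes, loading the weights in 4-bit and training small LoRA adapters on the attention projections while leaving the router and expert weights frozen. The hyperparameters and target modules below are illustrative, not recommendations.

```python
# Sketch: prepare Mixtral for QLoRA fine-tuning (training loop/dataset omitted).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # router/experts stay frozen
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a small fraction of the 46.7B weights is trained
```
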
matrix2596, over 1 year ago
The model is called 8x7B, so is it a 56B model? What are the GPU memory requirements to run it at a 512-token context size? Are there any feasible quantized versions of it available? I want to know whether my 16GB-VRAM GPU can run this model. Thanks.

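A rough sketch of the weight-memory arithmetic (ignoring the KV cache and runtime overhead, which are comparatively small at a 512-token context): the total is about 46.7B parameters rather than 56B, and all of them have to be resident even though only ~12.9B are active per token.

```python
# Memory needed just to hold the ~46.7B weights at different precisions.
params = 46.7e9
for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{params * bits / 8 / 2**30:.0f} GiB of weights")
# prints roughly: fp16 ~87 GiB, 8-bit ~43 GiB, 4-bit ~22 GiB
```

So even a 4-bit quantization exceeds 16 GB of VRAM; quantized GGUF builds do exist, but on a 16 GB card part of the model has to be offloaded to CPU RAM.
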
ComputerGuru, over 1 year ago
I'm surprised no one has commented on the context-size limitations of these offerings when comparing them to the other models. The sliding-window technique really does effectively cripple recall to approximately 8k tokens, which is plain insufficient for a lot of tasks.

All these Llama 2 derivatives are only effective if you fine-tune them, not just because of the parameter count, as people keep harping on, but perhaps even more so because of the tiny context available.

A lot of my GPT-3.5/4 usage involves "one-offs" where it would be faster to do the thing by hand than to train/fine-tune first, made possible by the generous context window and some modest context stuffing (which drives up input-token costs but is still a big win).

epups, over 1 year ago
I think comparisons to base LLaMA are not that interesting, as almost no one uses those models. The most informative comparison is between Mistral 7B and Mixtral 8x7B, shown in this picture: https://mistral.ai/images/news/mixtral-of-experts/open_models.png

The key takeaway for me is that there is a decent improvement in all categories, about 10% on average with a few outliers. However, this model's footprint is much larger, so the performance bump ends up being underwhelming in my opinion. I would expect about the same improvement if they released a 13B version without the MoE. It may be too early to say definitively that MoE is not the whole secret sauce behind GPT-4, but at least with this implementation it does not seem to lift performance dramatically.