Not sure I understand the excerpt from the referenced paper.

Is it saying that part of its more-efficient inference relies on batching tokens from completely separate inputs – e.g., from other users? And that, depending on which other inputs happen to land in the same batch, the relative assignment to 'experts' varies, and thus the eventual completions?

If so, I'd see that as not just introducing non-determinism, but also potentially making the *quality* of your responses depend on how many concurrent requests are fighting for the same expert allocations.

(For example, maybe the parts of the system best at translating/interpreting Hindi give worse results during peak usage hours in India, when the most concurrent inputs are competing for that same competence.)

Perhaps this is also another possible explanation for perceived quality degradation over time. When certain tests were reliably succeeding earlier, there was less congestion for the relevant 'experts'. Now, with more concurrent use, those same tests no longer win as large a share of the relevant experts' capacity.

This may also suggest a bit of a quagmire: whatever domains some sub-experts seem impressively good at initially will attract disproportionately more use. But that new congestion means the copycat use no longer gets the same expert allocations – and thus the initially impressive performance degrades.

(And if the effect is strong, and known-but-undisclosed by OpenAI, does it amount to a bait-and-switch? Attract users with unrepresentative excellence on an initially uncongested Mixture-of-Experts system, then offer them the lower-quality results of a more congested one.)
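If I'm reading the mechanism right, it would look something like the toy sketch below: a top-1 router with a hard per-expert capacity per batch, where a token that loses the race for its preferred expert overflows to its next choice. This is purely illustrative (not OpenAI's actual implementation), and all the names here (route_batch, NUM_EXPERTS, CAPACITY, the random "router weights") are made up for the example.

    # Toy sketch only: a top-1 MoE router with a hard per-expert capacity
    # per batch. Whether "your" token keeps its preferred expert depends on
    # which other tokens happen to share the batch.
    import numpy as np

    NUM_EXPERTS = 4
    CAPACITY = 2                      # tokens each expert may take per batch
    rng = np.random.default_rng(0)
    W_router = rng.normal(size=(8, NUM_EXPERTS))   # fixed routing weights

    def route_batch(tokens):
        """Give each token its best-scoring expert, overflowing to the next
        choice once an expert's per-batch capacity is used up."""
        ranked = np.argsort(-(tokens @ W_router), axis=1)  # experts ranked per token
        load = np.zeros(NUM_EXPERTS, dtype=int)
        picks = []
        for prefs in ranked:
            for expert in prefs:
                if load[expert] < CAPACITY:
                    load[expert] += 1
                    picks.append(int(expert))
                    break
        return picks

    my_token = rng.normal(size=(1, 8))

    # Quiet batch: almost no competition, my token gets its favourite expert.
    quiet = np.vstack([my_token, rng.normal(size=(1, 8))])

    # Busy batch: five similar requests (think "peak Hindi hours") that want
    # the same expert get routed ahead of my token, so it overflows elsewhere.
    busy = np.vstack([my_token + 0.01 * rng.normal(size=(5, 8)), my_token])

    print("expert in quiet batch:", route_batch(quiet)[0])
    print("expert in busy batch: ", route_batch(busy)[-1])

Real systems might drop overflowed tokens (passing them through the residual connection) rather than rerouting them, but either way the computation applied to a given token depends on its batch-mates, which is all the non-determinism argument needs.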