Not sure I understand the excerpt from the referenced paper.

Is it saying that part of its more-efficient inference relies on batching tokens from completely separate inputs – e.g., from other users? And that, depending on which other inputs happen to land in the same batch, the relative assignment to 'experts' varies, and thus the eventual completions?

If so, I'd see that as not just introducing non-determinism, but also potentially making the *quality* of your responses depend on how many concurrent requests are fighting for the same expert allocations.

(For example, maybe the parts of the system best at translating/interpreting Hindi give worse results during peak usage hours in India, when the most concurrent inputs are competing for that same competence.)

Perhaps this is also another possible explanation for perceived quality degradation over time. When certain tests were reliably succeeding earlier, there was less congestion for the relevant 'experts'. Now, with more concurrent use, those same tests no longer win as large a share of the relevant experts' capacity.

This may also suggest a bit of a quagmire: whatever domains some sub-experts seem impressively good at initially will attract disproportionately more use. But that new congestion means the copycat use no longer gets the same expert allocations – and thus the initially impressive performance degrades.

(And if the effect is strong, and known-but-undisclosed by OpenAI, does it amount to a bait-and-switch? Attract users with unrepresentative excellence on an initially uncongested Mixture-of-Experts system, then offer them the lower-quality results of a more congested one.)
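If I'm reading the mechanism right, it would look something like the toy sketch below: a top-1 router with a hard per-expert capacity per batch, where a token that loses the race for its preferred expert overflows to its next choice. This is purely illustrative (not OpenAI's actual implementation), and all the names here (route_batch, NUM_EXPERTS, CAPACITY, the random "router weights") are made up for the example.

    # Toy sketch only: a top-1 MoE router with a hard per-expert capacity
    # per batch. Whether "your" token keeps its preferred expert depends on
    # which other tokens happen to share the batch.
    import numpy as np

    NUM_EXPERTS = 4
    CAPACITY = 2                      # tokens each expert may take per batch
    rng = np.random.default_rng(0)
    W_router = rng.normal(size=(8, NUM_EXPERTS))   # fixed routing weights

    def route_batch(tokens):
        """Give each token its best-scoring expert, overflowing to the next
        choice once an expert's per-batch capacity is used up."""
        ranked = np.argsort(-(tokens @ W_router), axis=1)  # experts ranked per token
        load = np.zeros(NUM_EXPERTS, dtype=int)
        picks = []
        for prefs in ranked:
            for expert in prefs:
                if load[expert] < CAPACITY:
                    load[expert] += 1
                    picks.append(int(expert))
                    break
        return picks

    my_token = rng.normal(size=(1, 8))

    # Quiet batch: almost no competition, my token gets its favourite expert.
    quiet = np.vstack([my_token, rng.normal(size=(1, 8))])

    # Busy batch: five similar requests (think "peak Hindi hours") that want
    # the same expert get routed ahead of my token, so it overflows elsewhere.
    busy = np.vstack([my_token + 0.01 * rng.normal(size=(5, 8)), my_token])

    print("expert in quiet batch:", route_batch(quiet)[0])
    print("expert in busy batch: ", route_batch(busy)[-1])

Real systems might drop overflowed tokens (passing them through the residual connection) rather than rerouting them, but either way the computation applied to a given token depends on its batch-mates, which is all the non-determinism argument needs.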