Fidelity on the output isn't great, but the coherence (assuming the examples weren't massively cherry-picked) seems very good. Given the number of parameters, this should be able to run on end-user machines, and in theory it could be fine-tuned to produce better-looking output than Stable Diffusion, etc.

What this model does more than anything else is demonstrate that we're still in the early stages of generative models, and we can expect a lot of progress from architectural improvements over the next decade (in addition to the progress in compute and data that we're already counting on).
Here is an available implementation:

https://github.com/lucidrains/muse-maskgit-pytorch
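For anyone wanting to try it, here is a rough usage sketch in the spirit of that repo's README; the class names and keyword arguments are recalled from the README and may have changed, so treat the README itself as authoritative:

```python
import torch
from muse_maskgit_pytorch import VQGanVAE, MaskGitTransformer, MaskGit

# VQ-GAN tokenizer that maps images to a grid of discrete codes
# (in practice this is trained first, or loaded from a checkpoint)
vae = VQGanVAE(dim = 256, codebook_size = 512)

# transformer that predicts masked image tokens, conditioned on T5 text embeddings
transformer = MaskGitTransformer(
    num_tokens = 512,     # should match the VAE codebook size
    seq_len = 256,        # number of image tokens (flattened token grid)
    dim = 512,
    depth = 8,
    heads = 8,
    t5_name = 't5-small'
)

maskgit = MaskGit(
    vae = vae,
    transformer = transformer,
    image_size = 256,
    cond_drop_prob = 0.25   # conditional dropout for classifier-free guidance
)

# training step: masked-token prediction loss on (image, caption) pairs
images = torch.randn(4, 3, 256, 256)
loss = maskgit(images, texts = ['a cute corgi'] * 4)
loss.backward()

# after training, sample with parallel decoding + classifier-free guidance
samples = maskgit.generate(texts = ['a cute corgi'], cond_scale = 3.)
```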
It'd be interesting to see some results where the training set has higher artistic quality (and how this model influences the "house style"). The output does not look great compared to what other (trained) models deliver.

But the promise of a big efficiency gain will be an incentive for companies like Midjourney to give it a go with their data.
More amazement. I wonder where this field will end up. Cute animal and nature images are nice but have limited real-life use (I mean, we have to accept that visual media ends after everyone can be an artist). I wonder when we'll start interfacing language models with robotics to do some real-life work.
> Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations;

Am I wrong, or is that the same architecture as DALL-E 1?
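It's similar on the tokenizer side: like DALL-E 1, Muse works on discrete VQ image tokens rather than pixels. The difference is in the decoder: DALL-E 1 predicts those tokens autoregressively, one per forward pass, whereas Muse does MaskGIT-style parallel decoding, filling in many masked tokens per step over a small fixed number of iterations. A toy sketch of the two decoding loops (plain PyTorch; `model` is a hypothetical stand-in for a network producing per-position codebook logits, not Muse's actual implementation):

```python
import torch

# Toy illustration of why MaskGIT-style parallel decoding needs far fewer
# forward passes than DALL-E 1-style autoregressive decoding over the same
# discrete token grid.

CODEBOOK_SIZE, SEQ_LEN, MASK_ID = 1024, 256, 1024   # 16x16 token grid + [MASK] id

# dummy network for illustration: token sequence -> per-position codebook logits
model = lambda t: torch.randn(t.shape[0], t.shape[1], CODEBOOK_SIZE)

def autoregressive_decode(model):
    # DALL-E 1 style: one new token per forward pass -> SEQ_LEN passes.
    tokens = torch.zeros(1, 1, dtype=torch.long)     # start-of-sequence token
    for _ in range(SEQ_LEN):
        logits = model(tokens)                       # (1, t, CODEBOOK_SIZE)
        nxt = logits[:, -1].softmax(-1).multinomial(1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:]

def parallel_masked_decode(model, steps=12):
    # Muse/MaskGIT style: start fully masked, predict every position at once,
    # keep the most confident predictions, repeat -> only `steps` passes.
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)                       # (1, SEQ_LEN, CODEBOOK_SIZE)
        conf, pred = logits.softmax(-1).max(-1)      # confidence + best code per position
        conf = conf.masked_fill(tokens != MASK_ID, -1.0)  # leave fixed positions alone
        target = int(SEQ_LEN * (step + 1) / steps)   # linear schedule (the paper uses cosine)
        n_new = target - int((tokens != MASK_ID).sum())
        if n_new > 0:
            idx = conf.topk(n_new, dim=-1).indices
            tokens = tokens.scatter(1, idx, pred.gather(1, idx))
    return tokens

print(parallel_masked_decode(model).shape)           # full 256-token grid in ~12 passes
```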
Would stuff like DreamBooth and textual inversion be usable with transformer models like this one?

https://dreambooth.github.io/
<a href="https://textual-inversion.github.io/" rel="nofollow">https://textual-inversion.github.io/</a>