This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.

There was one thing I was curious about: I skimmed the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume they would try to improve that part of the architecture to improve adherence to text and image prompts.
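For context, here is roughly how the earlier SD models pull prompt embeddings from OpenAI's CLIP text encoder, as a sketch via Hugging Face transformers. The checkpoint and shapes are the SD 1.x ones, not necessarily what SD3 does:

    # Sketch: per-token prompt embeddings from OpenAI's CLIP text encoder,
    # the way SD 1.x consumes them. Illustrative only; SD3's encoder setup
    # may differ.
    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "a photo of an astronaut riding a horse"
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        # (batch, 77, 768) token embeddings the denoiser cross-attends to
        prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state
    print(prompt_embeds.shape)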
It's impressive that it spells words correctly and lays them out, but the issue I have is that the text always has this distinctively over-fried look. The color of the text is always ramped up to a single flat value, which in an otherwise high-fidelity image gives the impression of text slapped on top in Photoshop afterwards in a rather amateurish way, instead of text properly integrated into the image.
The question is, will SD3 be downloadable? I downloaded and ran the early SD releases locally, and it is really great.

Or did we lose Stable Diffusion to SaaS too, like we did with many of the LLMs that started off so promising as far as self-hosting goes?
It's very exciting to see that image generators are finally figuring out spelling. When DALL-E 3 (?) came out they hyped up its spelling capabilities, but when I tried it through Bing it was incredibly inconsistent.

I'd love to read a less technical writeup explaining the challenges involved and why it took so long to figure out spelling. Scrolling through the paper is a bit overwhelming, and it goes beyond my current understanding of the topic.

Does anyone know if it would eventually be possible to take older generated images with garbled text, plus their prompts, and have SD3 clean them up or fix the text?
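In principle, img2img with the old image plus its original prompt seems like one way to do this, assuming SD3 eventually gets an img2img pipeline like current models have in diffusers. A rough sketch with the existing API (model ID and settings are just illustrative):

    # Sketch: "repair" an old generation by running it back through img2img
    # with the same prompt. Uses the existing SD 1.5 pipeline; an SD3 img2img
    # pipeline would presumably look similar once released.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    old_image = Image.open("old_generation_with_garbled_text.png").convert("RGB")
    prompt = 'a neon sign that says "OPEN"'  # the original prompt

    # strength trades off keeping the old image vs. re-denoising it:
    # too high and the composition changes, too low and the garbled text stays
    fixed = pipe(prompt=prompt, image=old_image, strength=0.55,
                 guidance_scale=7.5).images[0]
    fixed.save("fixed.png")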
Nice improvements in text rendering, but it seems generating hands and fingers is still difficult for SD3. None of the example pictures contain human hands, except for the pixelated wizard, and the monkey's hands look a bit odd.
This looks great, very exciting. The paper is not much more detailed than the blog post. The main thing in the paper is that they have an architecture that can incorporate more expressive text encoders (T5-XXL here), they show this helps with complex scenes, and it seems clear they haven't maxed out this stack in terms of training. So expect SD 3.1 to be better than this, and expect 4 to be able to work with video by adding even more front-end encoding. Exciting!
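For anyone who wants to poke at the "more expressive text encoder" part: T5-XXL is an off-the-shelf encoder you can pull token-level embeddings from today. A sketch, not the exact checkpoint, sequence length, or preprocessing SD3 uses:

    # Sketch: token-level embeddings from a T5 encoder, the kind of richer
    # text representation fed in alongside CLIP. Checkpoint and max length
    # are illustrative, not SD3's exact setup.
    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")
    encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    prompt = 'a street sign that says "Main St" at sunset'
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        text_embeds = encoder(tokens.input_ids).last_hidden_state  # (1, 77, 4096)
    print(text_embeds.shape)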
This architecture seems flexible enough to extend to video easily. Hopefully what we have here becomes another "foundation" block, like the transformer blocks in LLaMA.

Why: it looks generic enough to incorporate text encoding / timestep conditioning into the block in all the imaginable ways (rather than the limited ways of SDXL / SD v1 or Stable Cascade). I don't think there is much left to be done there other than playing with positional encoding (2D RoPE?).

Great job! Now let's just scale up the transformers and focus on quantization / optimizations to run this stack properly everywhere :)
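To make "incorporate text encoding / timestep conditioning into the block" concrete, here's a toy PyTorch block in that spirit: concatenate text and image tokens for joint self-attention, and let the timestep embedding scale/shift the norms (adaLN-style). Purely a sketch of the idea, not the actual SD3 block:

    # Toy joint transformer block: image and text tokens attend to each other
    # in one self-attention call, and the timestep embedding modulates the
    # norms (adaLN-style). Dimensions and structure are illustrative only.
    import torch
    import torch.nn as nn

    class ToyJointBlock(nn.Module):
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.norm = nn.LayerNorm(dim, elementwise_affine=False)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
            # timestep embedding -> per-block scale/shift (the adaLN idea)
            self.ada = nn.Linear(dim, 2 * dim)

        def forward(self, img_tokens, txt_tokens, t_emb):
            x = torch.cat([txt_tokens, img_tokens], dim=1)      # joint sequence
            scale, shift = self.ada(t_emb).unsqueeze(1).chunk(2, dim=-1)
            h = self.norm(x) * (1 + scale) + shift              # condition on t
            x = x + self.attn(h, h, h, need_weights=False)[0]   # joint attention
            x = x + self.mlp(self.norm(x) * (1 + scale) + shift)
            n_txt = txt_tokens.shape[1]
            return x[:, n_txt:], x[:, :n_txt]                   # split back out

    block = ToyJointBlock()
    img = torch.randn(1, 64, 512)   # e.g. 8x8 latent patches
    txt = torch.randn(1, 77, 512)   # projected text embeddings
    t = torch.randn(1, 512)         # timestep embedding
    new_img, new_txt = block(img, txt, t)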
More and more companies that were once devoted to being 'open', or were previously open, are becoming increasingly closed. I appreciate that Stability AI releases these research papers.
Ha! In contrast to Stability AI, OpenAI is the least open AI lab. Even DeepMind publishes more papers.

I wonder if anyone at OpenAI openly says, "We're in it for the money!"

The recent letter by SamA regarding Elon's lawsuit had about as much truth as Putin saying they are invading Ukraine for de-nazification.