The problem, from the paper:

> Several methods alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity.

Looking at the diagram provided, they use GPT-4 to suggest text positions from the prompt.

I can see this being very useful for getting text into the right spot without manually hunting for a good position. I'm not an expert, but doesn't this method add extra cost and latency to every Text-to-Image call?
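Roughly, the flow seems to be "LLM proposes a layout, then the image model renders text into it." A minimal sketch of that two-stage idea is below; the function names (query_layout_llm, render_with_layout) and the box format are my own placeholders, not the paper's actual API, and the LLM call is mocked with a canned reply.

```python
# Hypothetical sketch of the "LLM plans layout, then render" pipeline.
# query_layout_llm / render_with_layout are stand-ins, not the paper's API.
import json

def query_layout_llm(prompt: str) -> list[dict]:
    """Stand-in for a GPT-4 call that proposes where each text span should go.
    A real implementation would send `prompt` to an LLM and parse its reply."""
    # Canned reply showing the assumed shape: normalized [x0, y0, x1, y1] boxes.
    canned_reply = json.dumps([
        {"text": "GRAND OPENING", "box": [0.15, 0.10, 0.85, 0.25]},
        {"text": "Saturday 10am", "box": [0.30, 0.80, 0.70, 0.90]},
    ])
    return json.loads(canned_reply)

def render_with_layout(prompt: str, layout: list[dict]) -> None:
    """Stand-in for the text-to-image model that takes the layout as extra conditioning."""
    for item in layout:
        print(f"render '{item['text']}' inside box {item['box']} for prompt: {prompt!r}")

prompt = "A storefront poster announcing a grand opening"
layout = query_layout_llm(prompt)   # the extra LLM round trip is the added cost/latency
render_with_layout(prompt, layout)
```

The overhead question comes from that first step: every generation now pays for an LLM round trip before the diffusion model even starts.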
Does Midjourney v6 use something similar to this? Both have a weird look to the text, an amateurishly photoshopped look where the text almost has different aliasing from the rest of the image and doesn't feel truly integrated.

Impressive that it's legible, but some work is needed to get it to normal production quality.
Recent comparison of what's out there:
https://www.reddit.com/r/StableDiffusion/comments/18o1ole/apparently_not_even_midjourney_v6_launched_today/
It’s very smart, though using bounding boxes will most likely limit it to 2D contexts (and some head-on 3D contexts), since the text won’t follow the bounding box when perspective is involved. I’m sure it could be extended to support bounds that have 3D transforms, though.
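To make the perspective point concrete, here is a small sketch (my own illustration, not anything from the paper) of what happens when you push an axis-aligned box through a 3x3 homography: it becomes a general quadrilateral, which is roughly what a "3D-transformed bound" would have to describe. The matrix values are arbitrary.

```python
# Push an axis-aligned box through a homography to show it stops being a rectangle.
import numpy as np

def warp_box(box, H):
    """Map an axis-aligned box (x0, y0, x1, y1) through homography H,
    returning the four warped corners as a (4, 2) array."""
    x0, y0, x1, y1 = box
    corners = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], dtype=float)
    homog = np.hstack([corners, np.ones((4, 1))])   # to homogeneous coordinates
    warped = (H @ homog.T).T
    return warped[:, :2] / warped[:, 2:3]           # perspective divide

# A homography with a small perspective term: parallel box edges stop being parallel.
H = np.array([[1.0,   0.0,    0.0],
              [0.0,   1.0,    0.0],
              [0.001, 0.0005, 1.0]])

print(warp_box((100, 100, 300, 200), H))
```

An axis-aligned box only needs two corners; once perspective is involved you need all four, plus some way for the model to warp the glyphs to match.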
I’m assuming the type foundries’ legal departments are getting ready to come for the image generators once they find out their typefaces have been vacuumed up and are now being used to generate new content without a license for the typeface?