
TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

Expressive text-to-image generation with rich text

89 points by plurby over 1 year ago

10 comments

simbolit over 1 year ago

I looked at this, thought about it, waited an hour, and then looked at it again, and I can't help but think this is useless.

We can already weight parts of prompts, and we can already specify colors or styles for parts of the image. And even if we could not, none of this needs rich text.

I even think their comparisons are dishonest. They compare "plaintext" prompts with "rich text" prompts, but the rich text prompts contain more information. What? Seriously, who is surprised that the following two prompts give different images?

(1) "A girl with long hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose."

(2) "A girl with long [Richtext:orange] hair sitting in a cafe, by a table with coffee on it, best quality, ultra detailed, dynamic pose. [Footnote: The ceramic coffee cup with intricate design, a dance of earthy browns and delicate gold accents. The dark, velvety latte is in it.]"

The worst part is "Font style indicates the styles of local regions". In the comparison-with-other-methods section they actually have to specify in parentheses what each font means style-wise, because nobody knows and (let's be frank) nobody wants to learn. So why not just put those plaintext parentheses in the prompt?

I stopped myself from immediately posting my (rather negative) opinion, but after over an hour it hasn't changed. As far as I can see, this isn't useful; rich text prompts are a gimmick.
[Comment #37773780 not loaded]
[Comment #37772557 not loaded]
[Comment #37772377 not loaded]
[Comment #37775603 not loaded]
[Comment #37773632 not loaded]
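The flattening simbolit describes (turning rich-text attributes into equivalent plain-text prompt fragments) can be sketched in a few lines. This is an illustrative sketch only, not code from the paper; the span format and attribute names are invented for demonstration:

```python
# Hypothetical sketch: flatten "rich text" spans into a plain-text prompt.
# Each span is (text, attrs); attrs may carry 'color', 'style', or 'footnote'.

def flatten_rich_prompt(spans):
    """Return a single plain-text prompt equivalent to the rich-text spans."""
    parts = []
    footnotes = []
    for text, attrs in spans:
        if "color" in attrs:
            text = f"{attrs['color']} {text}"       # color becomes an adjective
        if "style" in attrs:
            text = f"{text} (in {attrs['style']} style)"  # font style -> parenthetical
        if "footnote" in attrs:
            footnotes.append(attrs["footnote"])     # footnotes appended at the end
        parts.append(text)
    prompt = " ".join(parts)
    if footnotes:
        prompt += " " + " ".join(footnotes)
    return prompt

spans = [
    ("A girl with long", {}),
    ("hair", {"color": "orange"}),
    ("sitting in a cafe, by a table with coffee on it,", {}),
    ("best quality, ultra detailed, dynamic pose.",
     {"footnote": "The ceramic coffee cup with intricate design."}),
]
print(flatten_rich_prompt(spans))
```

Run on the example spans above, this produces the same "plaintext parentheses" prompt the comment argues for.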
Der_Einzige over 1 year ago

I LOVE this.

All of the techniques they are showing have existed for a while in places like Automatic1111/ComfyUI or their extensions (i.e. regional prompting, attention weights). Having it connect so seamlessly with rich text is awesome, and it's a cool UI trick that might make normies notice it.

Also, related: NLP is extremely undertooled on the prompt engineering side. Most of the techniques here would work just fine on any LLM. If you don't believe me, read this: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865ca4bb328eb58faf
[Comment #37774000 not loaded]
littlestymaar over 1 year ago

While I don't think the rich text thing is particularly useful, I'm very impressed by the approach, especially how it changes the resulting image in a way you can control (that is, without regenerating the whole thing and ending up with random undesirable changes).

The stability of the overall image during local changes makes me think this could be a key to video generation, because the biggest problem with existing diffusion-based approaches to video is their instability from frame to frame.
minimaxir over 1 year ago

A functionally similar approach is prompt term weighting with libraries such as compel: https://github.com/damian0815/compel

Prompt weighting alone can fix undesired aspects of an output, especially with SDXL and its dual text encoders.
[Comment #37773981 not loaded]
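As background on minimaxir's point: compel-style weighting attaches a multiplier to parts of a prompt (e.g. `word++` or `(blue hair)1.4`). Below is a minimal, self-contained sketch of parsing such syntax into (fragment, weight) pairs; it illustrates the idea and is not compel's actual implementation, whose syntax and internals differ.

```python
import re

def parse_weighted_prompt(prompt, step=1.1):
    """Split a prompt into (fragment, weight) pairs.
    Toy syntax inspired by compel / A1111 extensions:
      word++ / word--   -> weight multiplied by `step` per +/-
      (some words)1.4   -> explicit numeric weight
    Everything else gets weight 1.0."""
    pattern = re.compile(
        r"\(([^)]+)\)(\d+(?:\.\d+)?)"   # (fragment)1.4
        r"|(\S+?)(\++|-+)(?=\s|$)"      # word++ / word--
    )
    result, pos = [], 0
    for m in pattern.finditer(prompt):
        plain = prompt[pos:m.start()].strip()
        if plain:
            result.append((plain, 1.0))
        if m.group(1) is not None:                  # explicit weight form
            result.append((m.group(1), float(m.group(2))))
        else:                                       # +/- suffix form
            signs = m.group(4)
            w = step ** len(signs) if signs[0] == "+" else step ** -len(signs)
            result.append((m.group(3), round(w, 4)))
        pos = m.end()
    tail = prompt[pos:].strip()
    if tail:
        result.append((tail, 1.0))
    return result

print(parse_weighted_prompt("a girl with (orange hair)1.4 in a cafe++"))
# → [('a girl with', 1.0), ('orange hair', 1.4), ('in a', 1.0), ('cafe', 1.21)]
```

In a real pipeline these weights would scale the corresponding token embeddings before they reach the diffusion model's cross-attention, which is roughly what compel does for you.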
pugworthy over 1 year ago

I would love to experiment with the idea of font interpretation. People can and do anthropomorphize fonts, but fonts also have names with meanings which may or may not be useful.

For example, I'm wondering whether a prompt written in Comic Sans should be turned into a comic-style illustration, or whether it comes out as a simplistic, childish drawing. Is a gothic font meant to imply a style of architecture, old Germanic peoples, or goth music and style?

See also https://design.tutsplus.com/articles/the-psychology-of-fonts--cms-34943
[Comment #37772945 not loaded]
atleastoptimal over 1 year ago

This is very cool, but it's gimmicky. All of the rich text could simply be a modifier before or after the word (such as an adjective or phrase). Given that most LLM work is plain text, this benefit doesn't transfer as neatly as plain prompt engineering does.
[Comment #37776161 not loaded]
[Comment #37774682 not loaded]
90-00-09 over 1 year ago

I like this idea. It could be handy to be able to focus on individual descriptions in complex prompts. Is this then mostly a "UI" feature that is translated to a traditional prompt?

(As a side note: using decorative typefaces was an unconvincing example.)
[Comment #37799051 not loaded]
LASR over 1 year ago

How well does this work with LLMs? Has anyone tried it? I'm most curious about the references-and-footnotes approach.
[Comment #37772954 not loaded]
PixelForg over 1 year ago

I'm impressed by the pixel art generation; I will definitely try it.
[Comment #37799052 not loaded]
gorenb over 1 year ago

My god, I think Midjourney and DALL·E should do this now.