Longform creative writing is possibly the least valid or interesting use case for LLMs.<p>Why should I be bothered to read what nobody could be bothered to write?<p>The point of writing is <i>communication</i> and creative writing is specifically about the human experience… something a LLM can mimic but never speak to authoritatively.
All of these benchmarks have gotten out of hand, which is highly suggestive. Benchmarks exist as an indicator of quality and proliferate when other indicators of quality fail. Their very prominence implies that observers are having a difficult time assessing LLM performance in context, which hints at a limited utility or more precisely a non-closed feedback loop at the level of utility. (You know a burger tastes really good when you eat it, no benchmarks required.)<p>Perhaps LLM development really does exist at this rarefied abstract level whereby the development team cannot be immersed in application context, but I doubt that notion. More likely the performance observed in context is either so dispiriting or difficult (or nonexistent) that teams return again and again to the more generously validating benchmarks.
I’ve noticed the same contrast - technical writing from LLMs often needs trimming for clarity, but creative writing can lean too far into either bland or overly flowery language.<p>Most LLM benchmarks lean heavily on fluency, but things like internal logic, tone consistency, and narrative pacing are harder to quantify. I think using a second model to extract logical or structural assertions could be a smart direction. It’s not perfect, but it shifts focus from just “how it sounds” to “does it actually make sense over time.” Creative writing benchmarks still feel very early-stage.
When trying to use LLMs for creative writing, I find they really suck at sequencing and theory-of-mind: they often reference events that have yet to occur (according to the prompt), or have characters know things that are true but that they have no way of knowing. They are also terrible at writing scenes with mind-games or deception going on.<p>From my experience, this occurs with all LLMs, and with a high enough frequency that editing their outputs is much more tedious than writing the damn thing myself.
Is there a score for internal consistency? I dunno, maybe have another LLM extract structure into some kind of a logic language and score how many assertions are incompatible? Or to a probabilistic language and do some Bayesian analysis?
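Roughly what I have in mind, purely as a sketch (the model name, prompts, and JSON shape are made up for illustration, not anything this benchmark does):<p><pre><code>
# Sketch of an internal-consistency score: extract the factual claims from
# each chapter with one LLM call, then ask a judge model how many pairs of
# claims contradict each other. Model name, prompts and JSON shape are all
# illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint + API key
JUDGE = "gpt-4o-mini"  # placeholder judge/extractor model

def extract_claims(chapter: str) -> list[str]:
    resp = client.chat.completions.create(
        model=JUDGE,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'Return JSON {"claims": [...]}: every concrete factual claim in '
            'this chapter (who knows what, what has happened, where, when), '
            'each as a short string.\n\n' + chapter}],
    )
    return json.loads(resp.choices[0].message.content)["claims"]

def consistency_score(chapters: list[str]) -> float:
    claims = [c for ch in chapters for c in extract_claims(ch)]
    resp = client.chat.completions.create(
        model=JUDGE,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'These claims come from one story. Return JSON '
            '{"contradictions": n}, n = number of mutually incompatible '
            'pairs:\n' + json.dumps(claims)}],
    )
    n = json.loads(resp.choices[0].message.content)["contradictions"]
    pairs = max(len(claims) * (len(claims) - 1) // 2, 1)
    return 1.0 - n / pairs  # 1.0 means no detected contradictions
</code></pre><p>Pairwise contradiction counting won't catch everything, but it at least turns "does this hold together over time?" into a number.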
We did a similar (less rigorous) evaluation internally last year, and one of the big things we identified was "purple prose":<p><pre><code> In literary criticism, purple prose is overly ornate prose text that may disrupt a narrative flow by drawing undesirable attention to its own extravagant style of writing, thereby diminishing the appreciation of the prose overall.[1] Purple prose is characterized by the excessive use of adjectives, adverbs, and metaphors. When it is limited to certain passages, they may be termed purple patches or purple passages, standing out from the rest of the work. (Wikipedia)
</code></pre>
The Slop Score they have gets at this, which is good, but I wonder how completely it captures it.<p>Also curious about how realistic this benchmark is against real "creative writing" tools. The more the writing is left up to the LLM, the better the benchmark likely reflects real performance. For tools that have a human in the loop or a more structured approach, it's hard to know how well the benchmarks match real output, beyond just knowing that better models will do better, e.g. Claude 3.7 would beat Llama 2.
Some of the Gemini stuff is almost at airport-novel level. I'm surprised. Everything is going so fast.<p>The odd thing is that with technical stuff, I'm continually rewriting the LLM's output to be clearer and less verbose, while the fiction is almost the opposite--not literary enough.
Would be interesting if they'd add another one on non-fiction creative writing. For example, turning a set of investigative notes and findings into a Pulitzer-prize winning article that wouldn't be out of place in a renowned, high-quality newspaper.<p>IME, for LLMs (just like humans) this skill doesn't necessarily correlate with fiction writing prowess.<p>This is probably harder to judge automatically (i.e. using LLMs) though, maybe that's why they haven't done it.
I'm not sure how I feel about the details of the benchmark, but I think this is an important direction in which it would be nice if LLMs were improved.<p>At present they don't really understand either stories or even short character interactions. Microsoft Copilot can generate dialogue where characters who have never met before are suddenly addressing each other by name, so there's great room for improvement.
These benchmarks are asinine for determining the quality of "creative writing", all chosen by engineers without a single artistic bone.<p>Length<p>Slop Score<p>Repetition Metric<p>Degradation<p>Moby-Dick has chapters as short as a page and as long as 20. According to this benchmark, the book would score lower because of the average length of its chapters.<p>These aren't benchmarks of "quality". A chapter's length is not indicative of a work's quality. That measurement, on its own, is enough to discredit the rest of the benchmark. So so so so so so so misguided.
And how do these compare to GPT-4.5?<p>(From my experience, the best model for creative writing, <a href="https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-flytrap" rel="nofollow">https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-fl...</a>)
I've noticed people will routinely prompt for a specific style when generating visual art, but not when generating text. Wouldn't it be better for consistency to add something like "in the style of Hemingway" to an experiment like this?
For the cosy sci-fi (1) example, I found it introduced plot points too quickly, making the short passage really dense. The model eval said:<p>> The chapter also introduces several elements quickly (the stranger, the Syndicate, the experiments) which, while creating intrigue, risks feeling slightly rushed.<p>But there is no score for pacing.<p><a href="https://eqbench.com/results/creative-writing-v3/deepseek-ai__DeepSeek-R1.html" rel="nofollow">https://eqbench.com/results/creative-writing-v3/deepseek-ai_...</a>
Slightly related, though I've yet to try anything at this level:<p>Turn Cursor Into a Novel-Writing Beast - <a href="https://www.reddit.com/r/cursor/comments/1jl0rqu/turn_cursor_into_a_novelwriting_beast/" rel="nofollow">https://www.reddit.com/r/cursor/comments/1jl0rqu/turn_cursor...</a><p>I have tried to have ChatGPT brainstorm story elements for something else, and its suggestions so far have been very lame. Even its responses to direction/criticism are off-putting and fawning.
I read Claude's "Sci-Fi First Contact — First Contact" entry. It's pretty good (and with some editing could be great - some of the ending seems slightly unearned). It has a Ted Chiang/Arrival vibe to it and is a very good first contact story.<p>Most folks here are commenting without engaging with the content. We need a Turing test for creative writing. I'd definitely not have guessed this was LLM-written - it reads like an experienced hand wrote it.
haha, Mr. Thorne shows up again in the Gemini 2.5 samples.<p>I have played around with creating long-form fictional content with Gemini 2.5 over the last week, and I started adding "no one named 'Thorne'" to my prompts, because otherwise it always creates a character named Mr. Thorne. I thought it was something in my prompts triggering this, but it seems to be a general problem.<p>However, despite the cliches and slop, Gemini 2.5 can actually write and edit long-form fiction pretty well; you can get almost-coherent 10-20 chapter books by first letting it create an outline and then iteratively writing and editing the chapters.<p>I also used Gemini 2.5 to help me code a tool to interactively and iteratively create longform content: <a href="https://github.com/pulpgen-dev/pulpgen">https://github.com/pulpgen-dev/pulpgen</a>
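The basic loop is something like this (a simplified sketch only, not how pulpgen actually works; the model name and prompts are placeholders):<p><pre><code>
# Generic outline-then-draft loop for long-form generation. A simplified
# sketch, not pulpgen's actual implementation; model name and prompts are
# placeholders.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint
MODEL = "gemini-2.5-pro"  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def write_book(premise: str, n_chapters: int = 12) -> list[str]:
    outline = ask(f"Write a {n_chapters}-chapter outline for: {premise}. "
                  "No one named 'Thorne'.")
    chapters, story_so_far = [], ""
    for i in range(1, n_chapters + 1):
        draft = ask(f"Outline:\n{outline}\n\nStory so far:\n{story_so_far}\n\n"
                    f"Write chapter {i} in full.")
        edited = ask("Edit this chapter for pacing and continuity:\n\n" + draft)
        chapters.append(edited)
        story_so_far = ask("Summarize the story so far in 200 words:\n\n"
                           + "\n\n".join(chapters))
    return chapters
</code></pre>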
>Outputs are evaluated with a scoring rubric by Claude Sonnet 3.7<p>I feel it might be beneficial to evaluate with an ensemble of models, picking the SOTA models because of the subjectivity of the task at hand.
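Something like this would be cheap to try (a sketch only; the judge models and rubric are placeholders, not what the benchmark uses):<p><pre><code>
# Sketch of an ensemble judge: several models score the same piece against
# a rubric and we average. Judge names and rubric are placeholders.
import json
import statistics
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

JUDGES = ["gpt-4o", "gpt-4o-mini"]  # swap in whichever SOTA judges you trust
RUBRIC = "Rate the piece from 0 to 10 for coherence, pacing and prose quality."

def ensemble_score(piece: str) -> float:
    scores = []
    for judge in JUDGES:
        resp = client.chat.completions.create(
            model=judge,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content":
                f'{RUBRIC} Return JSON {{"score": number}}.\n\n{piece}'}],
        )
        scores.append(float(json.loads(resp.choices[0].message.content)["score"]))
    return statistics.mean(scores)  # median would blunt a single outlier judge
</code></pre><p>Taking the median instead of the mean would also keep any one judge's house taste from dominating.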
"Crypto’s dead," Jay muttered. "Sneakers are forever."<p>Dang deepseek is actually pretty good. Compared to Gemini's version that sounded like a Schizophrenic on LSD.
The same author created the anti-slop sampler, which is proof that LLMs can be trivially made extremely creative.<p>Samplers are being slept on by the community. Sam Paech is secretly one of the biggest geniuses in all of LLMs. It’s time for the community to recognize this.
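As I understand it, the real anti-slop sampler matches banned phrases mid-generation and backtracks to resample; the toy sketch below only down-weights the first token of each overused phrase, which is far cruder, but it shows how much leverage sits at the sampler level (the model and phrase list are arbitrary):<p><pre><code>
# Toy slop suppression at sampling time: down-weight the first token of each
# overused phrase. The real anti-slop sampler is smarter (it matches full
# phrases mid-generation and backtracks); this only illustrates the idea.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

MODEL = "gpt2"  # placeholder; any causal LM works
SLOP_PHRASES = ["a testament to", "tapestry", "Elara", "Thorne"]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

class SlopPenalty(LogitsProcessor):
    def __init__(self, banned_first_tokens, penalty=5.0):
        self.banned = banned_first_tokens
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        scores[:, self.banned] -= self.penalty  # make slop openers less likely
        return scores

banned_ids = [tok.encode(" " + p)[0] for p in SLOP_PHRASES]
out = model.generate(
    **tok("The old lighthouse keeper", return_tensors="pt"),
    max_new_tokens=80,
    do_sample=True,
    logits_processor=LogitsProcessorList([SlopPenalty(banned_ids)]),
)
print(tok.decode(out[0], skip_special_tokens=True))
</code></pre>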