Longform creative writing is possibly the least valid or interesting use case for LLMs.<p>Why should I be bothered to read what nobody could be bothered to write?<p>The point of writing is <i>communication</i> and creative writing is specifically about the human experience… something a LLM can mimic but never speak to authoritatively.
All of these benchmarks have gotten out of hand, which is highly suggestive. Benchmarks exist as an indicator of quality and proliferate when other indicators of quality fail. Their very prominence implies that observers are having a difficult time assessing LLM performance in context, which hints at a limited utility or more precisely a non-closed feedback loop at the level of utility. (You know a burger tastes really good when you eat it, no benchmarks required.)<p>Perhaps LLM development really does exist at this rarefied abstract level whereby the development team cannot be immersed in application context, but I doubt that notion. More likely the performance observed in context is either so dispiriting or difficult (or nonexistent) that teams return again and again to the more generously validating benchmarks.
I’ve noticed the same contrast - technical writing from LLMs often needs trimming for clarity, but creative writing can lean too far into either bland or overly flowery language.<p>Most LLM benchmarks lean heavily on fluency, but things like internal logic, tone consistency, and narrative pacing are harder to quantify. I think using a second model to extract logical or structural assertions could be a smart direction. It’s not perfect, but it shifts focus from just “how it sounds” to “does it actually make sense over time.” Creative writing benchmarks still feel very early-stage.
When trying to use LLMs for creative writing, I find they really suck at sequencing and theory-of-mind: they often reference events that have yet to occur (according to the prompt), or have characters know things that are true but that they have no way of knowing. They are also terrible at writing scenes with mind-games or deception going on.<p>From my experience, this occurs with all LLMs, and with a high enough frequency that editing their outputs is much more tedious than writing the damn thing myself.
Is there a score for internal consistency? I dunno, maybe have another LLM extract structure into some kind of a logic language and score how many assertions are incompatible? Or to a probabilistic language and do some Bayesian analysis?
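Roughly what I have in mind, purely as a sketch (the model name, prompts, and JSON shape are made up for illustration, not anything this benchmark does):<p><pre><code>
# Sketch of an internal-consistency score: extract the factual claims from
# each chapter with one LLM call, then ask a judge model how many pairs of
# claims contradict each other. Model name, prompts and JSON shape are all
# illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint + API key
JUDGE = "gpt-4o-mini"  # placeholder judge/extractor model

def extract_claims(chapter: str) -> list[str]:
    resp = client.chat.completions.create(
        model=JUDGE,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'Return JSON {"claims": [...]}: every concrete factual claim in '
            'this chapter (who knows what, what has happened, where, when), '
            'each as a short string.\n\n' + chapter}],
    )
    return json.loads(resp.choices[0].message.content)["claims"]

def consistency_score(chapters: list[str]) -> float:
    claims = [c for ch in chapters for c in extract_claims(ch)]
    resp = client.chat.completions.create(
        model=JUDGE,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'These claims come from one story. Return JSON '
            '{"contradictions": n}, n = number of mutually incompatible '
            'pairs:\n' + json.dumps(claims)}],
    )
    n = json.loads(resp.choices[0].message.content)["contradictions"]
    pairs = max(len(claims) * (len(claims) - 1) // 2, 1)
    return 1.0 - n / pairs  # 1.0 means no detected contradictions
</code></pre><p>Pairwise contradiction counting won't catch everything, but it at least turns "does this hold together over time?" into a number.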
We did a similar (less rigorous) evaluation internally last year, and one of the big things we identified was "purple prose":<p><pre><code> In literary criticism, purple prose is overly ornate prose text that may disrupt a narrative flow by drawing undesirable attention to its own extravagant style of writing, thereby diminishing the appreciation of the prose overall.[1] Purple prose is characterized by the excessive use of adjectives, adverbs, and metaphors. When it is limited to certain passages, they may be termed purple patches or purple passages, standing out from the rest of the work. (Wikipedia)
</code></pre>
The Slop Score they have gets at this, which is good, but I wonder how completely it captures it.<p>Also curious about how realistic this benchmark is against real "creative writing" tools. The more the writing is left up to the LLM, the better the benchmark likely reflects real performance. For tools that have a human in the loop or a more structured approach, it's hard to know how well the benchmarks match real output, beyond just knowing that better models will do better, e.g. Claude 3.7 would beat Llama 2.
Some of the Gemini stuff is almost at airport-novel level. I'm surprised. Everything is going so fast.<p>The odd thing is that with technical stuff, I'm continually rewriting the LLM's output to be clearer and less verbose, while the fiction is almost the opposite--not literary enough.
Would be interesting if they'd add another one on non-fiction creative writing. For example, turning a set of investigative notes and findings into a Pulitzer-prize winning article that wouldn't be out of place in a renowned, high-quality newspaper.<p>IME, for LLMs (just like humans) this skill doesn't necessarily correlate with fiction writing prowess.<p>This is probably harder to judge automatically (i.e. using LLMs) though, maybe that's why they haven't done it.
I'm not sure how I feel about the details of the benchmark, but I think this is an important direction in which it would be nice if LLMs were improved.<p>At present they don't really understand either stories or even short character interactions. Microsoft Copilot can generate dialogue where characters who have never met before are suddenly addressing each other by name, so there's great room for improvement.
These benchmarks are asinine for determining the quality of "creative writing", all chosen by engineers without a single artistic bone.<p>Length<p>Slop Score<p>Repetition Metric<p>Degradation<p>Moby-Dick has chapters as short as a page and as long as 20. According to this benchmark, the book would score lower because of the average length of its chapters.<p>These aren't benchmarks of "quality". A chapter's length is not indicative of a work's quality. That measurement, on its own, is enough to discredit the rest of the benchmark. So so so so so so so misguided.
And how do these compare to GPT-4.5?<p>(From my experience, the best model for creative writing, <a href="https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-flytrap" rel="nofollow">https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-fl...</a>)
I've noticed people will routinely prompt for a specific style when generating visual art, but not when generating text. Wouldn't it be better for consistency to add something like "in the style of Hemingway" to an experiment like this?
For the cosy sci-fi (1) example, I found it introduced plot points too quickly, making the short passage really dense. The model eval said:<p>> The chapter also introduces several elements quickly (the stranger, the Syndicate, the experiments) which, while creating intrigue, risks feeling slightly rushed.<p>But there is no score for pacing.<p><a href="https://eqbench.com/results/creative-writing-v3/deepseek-ai__DeepSeek-R1.html" rel="nofollow">https://eqbench.com/results/creative-writing-v3/deepseek-ai_...</a>
Slightly related, though I've yet to try anything at this level:<p>Turn Cursor Into a Novel-Writing Beast - <a href="https://www.reddit.com/r/cursor/comments/1jl0rqu/turn_cursor_into_a_novelwriting_beast/" rel="nofollow">https://www.reddit.com/r/cursor/comments/1jl0rqu/turn_cursor...</a><p>I have tried to have ChatGPT brainstorm story elements for something else, and its suggestions so far have been very lame. Even its responses to direction/criticism are off-putting and fawning.
I read Claude's "Sci-Fi First Contact — First Contact" entry. It's pretty good (and with some editing could be great - some of the ending seems slightly unearned). It has a Ted Chiang/Arrival vibe to it and is a very good first contact story.<p>Most folks here are commenting without engaging with the content. We need a Turing test for creative writing. I'd definitely not have guessed this was LLM-written - it reads like an experienced hand wrote it.
haha, Mr. Thorne shows up again in the Gemini 2.5 samples.<p>I have played around with creating long-form fictional content with Gemini 2.5 over the last week, and I started adding "no one named 'Thorne'" to my prompts, because otherwise it always creates a character named Mr. Thorne. I thought it was something in my prompts triggering this, but it seems to be a general problem.<p>However, despite the cliches and slop, Gemini 2.5 can actually write and edit long-form fiction pretty well; you can get almost-coherent 10-20 chapter books by first letting it create an outline and then iteratively writing and editing the chapters.<p>I also used Gemini 2.5 to help me code a tool to interactively and iteratively create longform content: <a href="https://github.com/pulpgen-dev/pulpgen">https://github.com/pulpgen-dev/pulpgen</a>
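The basic loop is something like this (a simplified sketch only, not how pulpgen actually works; the model name and prompts are placeholders):<p><pre><code>
# Generic outline-then-draft loop for long-form generation. A simplified
# sketch, not pulpgen's actual implementation; model name and prompts are
# placeholders.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint
MODEL = "gemini-2.5-pro"  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def write_book(premise: str, n_chapters: int = 12) -> list[str]:
    outline = ask(f"Write a {n_chapters}-chapter outline for: {premise}. "
                  "No one named 'Thorne'.")
    chapters, story_so_far = [], ""
    for i in range(1, n_chapters + 1):
        draft = ask(f"Outline:\n{outline}\n\nStory so far:\n{story_so_far}\n\n"
                    f"Write chapter {i} in full.")
        edited = ask("Edit this chapter for pacing and continuity:\n\n" + draft)
        chapters.append(edited)
        story_so_far = ask("Summarize the story so far in 200 words:\n\n"
                           + "\n\n".join(chapters))
    return chapters
</code></pre>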
>Outputs are evaluated with a scoring rubric by Claude Sonnet 3.7<p>I feel it might be beneficial to evaluate with an ensemble of models, picking the SOTA models because of the subjectivity of the task at hand.
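Something like this would be cheap to try (a sketch only; the judge models and rubric are placeholders, not what the benchmark uses):<p><pre><code>
# Sketch of an ensemble judge: several models score the same piece against
# a rubric and we average. Judge names and rubric are placeholders.
import json
import statistics
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint

JUDGES = ["gpt-4o", "gpt-4o-mini"]  # swap in whichever SOTA judges you trust
RUBRIC = "Rate the piece from 0 to 10 for coherence, pacing and prose quality."

def ensemble_score(piece: str) -> float:
    scores = []
    for judge in JUDGES:
        resp = client.chat.completions.create(
            model=judge,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content":
                f'{RUBRIC} Return JSON {{"score": number}}.\n\n{piece}'}],
        )
        scores.append(float(json.loads(resp.choices[0].message.content)["score"]))
    return statistics.mean(scores)  # median would blunt a single outlier judge
</code></pre><p>Taking the median instead of the mean would also keep any one judge's house taste from dominating.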
"Crypto’s dead," Jay muttered. "Sneakers are forever."<p>Dang deepseek is actually pretty good. Compared to Gemini's version that sounded like a Schizophrenic on LSD.
The same author created the anti-slop sampler, which is proof that LLMs can be trivially made extremely creative.<p>Samplers are being slept on by the community. Sam Paech is secretly one of the biggest geniuses in all of LLMs. It’s time for the community to recognize this.
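As I understand it, the real anti-slop sampler matches banned phrases mid-generation and backtracks to resample; the toy sketch below only down-weights the first token of each overused phrase, which is far cruder, but it shows how much leverage sits at the sampler level (the model and phrase list are arbitrary):<p><pre><code>
# Toy slop suppression at sampling time: down-weight the first token of each
# overused phrase. The real anti-slop sampler is smarter (it matches full
# phrases mid-generation and backtracks); this only illustrates the idea.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

MODEL = "gpt2"  # placeholder; any causal LM works
SLOP_PHRASES = ["a testament to", "tapestry", "Elara", "Thorne"]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

class SlopPenalty(LogitsProcessor):
    def __init__(self, banned_first_tokens, penalty=5.0):
        self.banned = banned_first_tokens
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        scores[:, self.banned] -= self.penalty  # make slop openers less likely
        return scores

banned_ids = [tok.encode(" " + p)[0] for p in SLOP_PHRASES]
out = model.generate(
    **tok("The old lighthouse keeper", return_tensors="pt"),
    max_new_tokens=80,
    do_sample=True,
    logits_processor=LogitsProcessorList([SlopPenalty(banned_ids)]),
)
print(tok.decode(out[0], skip_special_tokens=True))
</code></pre>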