"New Coke" is one of the most notable failed product launches in the American food and beverage industry, and I feel like some of its core lessons are becoming increasingly relevant to modern AI developers.<p>For those <40: In the 1980s senior executives at Coca Cola had a problem: Pepsi was gaining ground, partly thanks to the "Pepsi Challenge" - blind sip tests where consumers often preferred Pepsi's sweeter taste. Coke R&D developed a new, sweeter formula that <i>also</i> beat both Pepsi and original Coke in these single-sip taste tests involving many thousands of consumers. Based on this data, they launched "New Coke" in 1985.<p>The result was a legendary disaster. Outrage, protests, hoarding of the original formula. The problem was people didn't just <i>sip</i> Coke; they drank whole cans. They also valued the brand, the history and the familiarity - factors the narrow taste tests completely missed. Within months, "Coca-Cola Classic" was back. New Coke production was quietly scaled back in the early 90s, but stuck around in a few markets until the early 00s.<p>I think AI practitioners are starting to learn the same lesson. We're tuning our models with RLHF/DPO/other preference methods based on similar one-step blind taste tests. Raters pick the "better" response between two options, often optimizing for immediate helpfulness, agreeableness, or perceived safety in that isolated interaction. I think some of the more extreme recent LLM tuning may also be fueled by taste-test-style benchmarks like LMSYS and the Artificial Analysis image leaderboard.<p>Examples: ChatGPT's most recent update turned it into an overenthusiastic sycophant. Image models (Apple's Image Playground model is a particularly egregious example you can try right now) are frequently preference tuned until every generation looks like something out of a Pixar movie. 
Certain music models are incapable of generating music that doesn't sound like a 2020s top-40 song.<p>In all of these cases, it might taste/sound/look good once, but ultimately people get sick of it. I work on generative models, and I think (at least for our modality, music) the most enduring enjoyment comes from the element of surprise and delight, which preference tuning increasingly ruins by collapsing the distribution of possible outputs.<p>Are we optimizing away the very qualities that make these models interesting, creative, and truthful in the long run, just to win the immediate "preference" taste test and rank higher on benchmarks? IMO we're witnessing the New Coke of AI.
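<p>For anyone curious about the mechanics: the core of DPO-style preference tuning is just a pairwise loss over rater choices. Here's a minimal sketch (toy log-probabilities, not a real training loop; the numbers are made up) showing how the loss keeps rewarding a model for shifting probability mass toward the "chosen" response and away from the "rejected" one - apply that gradient pressure across millions of single-sip comparisons and the output distribution narrows:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO pairwise loss:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))),
    where ref_* are log-probs under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the tuned model mildly prefers the "chosen" sample...
loss_mild = dpo_loss(-2.0, -2.5, -2.2, -2.4)
# ...and sharpening that preference further keeps lowering the loss,
# which is exactly the incentive that collapses output diversity.
loss_sharp = dpo_loss(-1.0, -4.0, -2.2, -2.4)
assert loss_sharp < loss_mild
```

Nothing in that objective rewards keeping the rejected-but-interesting tail of the distribution alive.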