Hello HN! I am happy to share the monitoring reports we have been running for the past few months to identify regressions in popular LLMs like GPT-4-turbo, Claude-2, etc.

There have been numerous informal observations about prompt drift in large language models (LLMs), the most notable case being GPT-4 showing signs of "laziness", especially on coding tasks, toward the end of last year. Discussions on Twitter over the past few days also hint at a decline in Claude Sonnet's effectiveness. Given the closed-source nature of these models, it's impossible to know what happens behind the scenes, and these drifts often go unnoticed until the community flags them.

In November, OpenAI introduced a seed parameter, which aims to make responses reproducible for a fixed seed (assuming temperature is set to 0 and the system fingerprint remains unchanged). In practice, however, the models still show significant variation over time, even when identical or nearly identical prompts are used with the same seed (a minimal reproducibility check is sketched at the end of this post).

Today, there is no structured way of tracking these shifts in model performance, so we decided to undertake this as a community initiative to monitor prompt drift and identify regressions systematically.

Our methodology:

1. We compiled a dataset of 25 samples, each consisting of a question and a context containing the information needed to answer it. The dataset covers a broad range of topics, including finance, technology, health, sports, and academic areas such as calculus and geology.

2. Each day, we generate responses from three models: GPT-4-turbo, GPT-3.5-turbo, and Claude-2.1, and evaluate them on three criteria, with GPT-3.5 serving as our evaluator (a simplified sketch of this grading step is at the end of this post):

a) Response Conciseness: We examine the extent to which the response includes unnecessary information that does not contribute to answering the question.

b) Response Completeness: We assess whether the model addresses every aspect of the question.

c) Factual Accuracy: We verify the correctness of the model's response against the provided context.

Some of the challenges we faced:

1. Choice of evaluator model: We experimented with different evaluators (e.g., GPT-4-turbo, GPT-3.5, Claude-1.2). A notable observation was that all of them align well on objective evals (e.g., factual accuracy) but differ in absolute scores on subjective evals (e.g., response completeness and conciseness). Since we are tracking regressions over time rather than absolute scores, we settled on GPT-3.5 for its low cost and high stability.

2. Stability of eval scores: Using an LLM as a judge is inherently noisy; the score for the same data point fluctuates across runs. We have made several pipeline improvements to bring the standard deviation of scores for the same data point across multiple runs below 2%.

Looking ahead, we plan to enlarge our benchmarking dataset and include additional models (e.g., Claude 3).
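To make the seed point concrete, here is a minimal reproducibility check using the OpenAI Python SDK (v1.x). The model name, prompt, and seed below are placeholders for illustration, not the ones used in our reports:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def sample_once(prompt, seed=42):
        resp = client.chat.completions.create(
            model="gpt-4-turbo-preview",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=seed,
        )
        return resp.choices[0].message.content, resp.system_fingerprint

    # Two calls with the same seed *should* match while system_fingerprint is unchanged.
    text_a, fp_a = sample_once("Summarize the water cycle in two sentences.")
    text_b, fp_b = sample_once("Summarize the water cycle in two sentences.")
    print("same fingerprint:", fp_a == fp_b)
    print("same response:   ", text_a == text_b)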
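And a simplified sketch of the grading step. This is not our exact evaluation prompt or pipeline, just an illustration of the LLM-as-a-judge pattern on a made-up sample, with a naive score parser, plus the kind of repeated-run standard-deviation check we use to track judge noise:

    import statistics
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    # Made-up sample in the shape we use: a question plus a context that contains the answer.
    sample = {
        "question": "What is the derivative of x^2?",
        "context": "For any real exponent n, the derivative of x^n is n*x^(n-1).",
    }
    candidate_response = "The derivative of x^2 is 2x."

    def judge(sample, response):
        # Minimal grading prompt; the real rubric is more detailed.
        prompt = (
            f"Question: {sample['question']}\n"
            f"Context: {sample['context']}\n"
            f"Response: {response}\n"
            "Rate the factual accuracy of the response against the context "
            "on a scale from 0 to 1. Reply with only the number."
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return float(resp.choices[0].message.content.strip())  # naive parse; assumes a bare number

    # Repeating the grading on the same data point gives a handle on judge noise.
    scores = [judge(sample, candidate_response) for _ in range(5)]
    print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))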
We would love to hear your feedback.