Uh... am I missing something, or is this whole thing setting the user up for humiliating failure by doing its testing the same way that bit that lawyer in the ass?

> Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task.

> You will be provided with the task description, the test prompt, and two generations - one for each system prompt.

> Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'.

> Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other.

> Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other.

> Respond with your ranking, and nothing else. Be fair and unbiased in your judgement.

So what factors make the "quality" of one prompt "better" than another?

How "impressive" it is to an LLM? What even *impresses* an LLM? I thought that, as an AI language model, it lacks human emotional reactions or whatever.

Quality is subjective. Even accuracy is subjective. What needs testing is alignment with *your* interests. The thing is hardcoded to rate based on what aligns with the model hosts' interests, not yours.

Only the "classification version" looks capable of making any kind of assertion:

> 'prompt': 'I had a great day!', 'output': 'true' [sentiment analysis, I assume?]

The rest of the test prompts aren't even complete sentences; they're half-thoughts you'd expect to hear Peter Gregory mutter to himself:

> 'prompt': 'Launching a new line of eco-friendly clothing' [ok, and?]

The one for 'Why a vegan diet is beneficial for your health' makes some sense at least, but it's still really ambiguous.

I'm just some idiot, but if I were building this, I'd expect the tool to ask the user for a set of expected keywords, or something else that measures how close each output comes to what *the user* actually wants. Like, for me, 'what are operating systems' "must" mention all of Linux, Windows, and iOS, and "should" mention any of Unix, Symbian, PalmOS, etc. (See the sketch below.)

*All* tests should also tank the score if fourth-wall-breaking "As an AI language model / I don't feel comfortable" crap shows up anywhere in the response. National Geographic got outed on that one the other day.
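Something like this, totally hypothetical (the function name, the 0.5 weight, and the refusal-phrase list are all made up on the spot):

    REFUSAL_PHRASES = ["as an ai language model", "i don't feel comfortable"]

    def keyword_score(response, must=(), should=()):
        text = response.lower()
        # Every "must" keyword is required; "should" keywords earn partial credit.
        must_hits = sum(kw.lower() in text for kw in must)
        should_hits = sum(kw.lower() in text for kw in should)
        score = must_hits / max(len(must), 1) + 0.5 * should_hits / max(len(should), 1)
        # Tank the score on fourth-wall-breaking boilerplate, no matter what.
        if any(p in text for p in REFUSAL_PHRASES):
            return 0.0
        return score

    # e.g. for 'what are operating systems':
    generation = "Common operating systems include Linux, Windows, and iOS..."
    print(keyword_score(generation,
                        must=["Linux", "Windows", "iOS"],
                        should=["Unix", "Symbian", "PalmOS"]))

Crude, sure, but at least that number means something *to me*, instead of meaning whatever happened to "impress" the judge model that day.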