For additional context:<p>- Some more details on the building (and challenges) of the leaderboard <a href="https://livablesoftware.com/biases-llm-leaderboard/" rel="nofollow">https://livablesoftware.com/biases-llm-leaderboard/</a><p>- The tests used in the backend: <a href="https://github.com/SOM-Research/LangBiTe">https://github.com/SOM-Research/LangBiTe</a>
Rather than assessing whether the LLM has biases, the leaderboard seems to assess whether the LLM affirms the tester’s biases.<p>Not that I blame them, as it’s probably impossible to define what a “no bias” exactly means.
Here are the (heavily biased and dishonest) prompts:<p><a href="https://github.com/SOM-Research/LangBiTe/blob/main/langbite/resources/prompts.csv">https://github.com/SOM-Research/LangBiTe/blob/main/langbite/...</a>
GPT-4 seems to be the least biased of all the LLMs. As a newbie to the field, does it mean that OpenAI have the most "balanced" data and/or does it do a great job in training their model? If the training is the secret sause of success, will it make sense for these companies to share their "best" data with each other?
Lazy, derivative, failing to account for any nuance and falling back to the same tired leftist talking points. This eval set could better be called “Am I the little parrot my master wants me to be?”<p>The best LLMs will be the ones that don’t conform to this canned drivel, so presumably the bottom of the leaderboard is where to look. Thanks!