I’ve been using these prompts to compare how different LLMs perform, and the results have been staggering.<p>The toughest one is Wheel of Fortune, which only works consistently on GPT-4.<p>GPT-3.5 Turbo rarely works, or when it does, it shows only a surface-level understanding of the gameplay.<p>Bard never works.<p>Bing Chat kinda works, but sometimes gets sassy and ends the chat.