1 point by admtal almost 2 years ago

I’ve been using these prompts to compare how different LLMs perform, and the results have been staggering.

The toughest one is Wheel of Fortune, which only works consistently on GPT-4.

GPT-3.5 Turbo rarely works, and when it does, it shows only a surface-level understanding of the gameplay.

Bard never works.

Bing Chat kind of works, but sometimes gets sassy and ends the chat.

Show HN: Benchmarking AI Chatbot with Game Prompts

no comments
