(They patched Gemini an hour after I finished writing this, so my complaints about excessive refusals and model deception may no longer apply.)<p>Witness:
- A chess match between Ultra and GPT4 (the first one ever, as far as I'm aware)
- A Gemini vs GPT4 rap battle
- Tests of general knowledge, recall, abstract reasoning, and code generation
- Head-to-head contests of poetry and prose, plus style imitations of famous authors/bloggers<p>I also investigate VERY IMPORTANT things such as:<p>- which model can create a more realistic ASCII cat?
- which model is better at stacking eggs?
- which model plays Wordle better?
- which model SIMULATES Wordle better (with me playing)?<p>Obviously, a lot of my tests are a bit silly. We already know Ultra's benchmarks; I'm trying to probe the gaps BETWEEN benchmarks and figure out what the models are like "on the ground".<p>Conventional wisdom holds that Ultra is another GPT4, but this was not my experience. Switching from GPT4 to Ultra feels like switching character classes in an RPG: the two are quite different, with distinct strengths and weaknesses.