For other (non-code) benchmarks, people are having the opposite experience:<p>"I benchmarked on SAT reading, which is a nice human reference for reasoning ability. Took 3 sections (67 questions) from an official 2008-2009 test (2400 scale) and got the following results, here a SAT-like test:<p>- GPT3.5 - 690 (10 wrong)
- GPT4 - 770 (3 wrong)
- GPT4-turbo (one section at [a] time) - 740 (5 wrong)
- GPT4-turbo (3 sections at once, 9K tokens) - 730 (6 wrong)"<p>Source: <a href="https://twitter.com/wangzjeff/status/1721934560919994823?t=PcAm8yVbU_odyqK9e53MAA&s=19" rel="nofollow noreferrer">https://twitter.com/wangzjeff/status/1721934560919994823?t=P...</a>