LMSYS is not a measure of intelligence; it's a measure of human preference. People prefer correct answers (assuming they're qualified to identify them), but they also prefer, for example, answers formatted nicely for reading, which has nothing to do with "intelligence". That is why "reasoning" models, which often do better on benchmarks, do not necessarily do correspondingly well on LMSYS.