Some results are published here: <a href="https://llm-eval.github.io/pages/leaderboard/advprompt.html" rel="nofollow noreferrer">https://llm-eval.github.io/pages/leaderboard/advprompt.html</a><p>There is also a technical report on arXiv: <a href="https://arxiv.org/abs/2312.07910" rel="nofollow noreferrer">https://arxiv.org/abs/2312.07910</a><p>I think the results would be easier to read if the models that perform best on each specific test were highlighted (e.g. via color coding) and the tests themselves were explained with some hover info.<p>Also, with some fine-tuned models being trained specifically to score higher on these tests, I'm not sure how valuable the tests are compared to Chatbot Arena's Elo ranking, which is based on pairwise human preferences: <a href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard" rel="nofollow noreferrer">https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard</a>
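<p>For context on how an Elo-style ranking from pairwise votes differs from a fixed test score, here is a minimal sketch of a standard Elo update; the function name, K-factor, and example matchups are illustrative assumptions, not the leaderboard's or Chatbot Arena's actual code.<p><pre><code>
# Minimal sketch of Elo updates from pairwise preference votes.
# Function name, K-factor, and example matchups are illustrative
# assumptions, not the actual Chatbot Arena implementation.

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if model A was preferred, 0.0 if model B was
    preferred, and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


if __name__ == "__main__":
    ratings = {"model_x": 1000.0, "model_y": 1000.0}
    # Hypothetical votes: model_x preferred twice, model_y once.
    votes = [("model_x", "model_y", 1.0),
             ("model_x", "model_y", 1.0),
             ("model_x", "model_y", 0.0)]
    for a, b, score_a in votes:
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
    print(ratings)
</code></pre>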