Want to thank the OP for sharing this; I threw it together over the last couple of days in the run-up to the "steamroll". We think one of the key problems with LLMs in general, but especially with voice, is evals, and we wanted a good place to evaluate voice-to-voice systems. These systems can be end-to-end (like OpenAI's), or (ASR+LLM)->TTS, or ASR->(LLM+TTS), or ASR->LLM->TTS.

We built an Elo benchmark very much in the style of LMSYS and will be releasing results every two weeks.

Source code here: https://github.com/thevoicecompany/bench.audio

We'll be adding a proper contributing guide soon.
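For anyone unfamiliar with how LMSYS-style arenas turn pairwise battles into a leaderboard: the benchmark's exact rating math isn't shown in this comment, but a standard Elo update (the function name, starting rating, and k-factor below are illustrative assumptions, not the repo's actual code) looks roughly like:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update for one pairwise battle.

    r_a, r_b  -- current ratings of systems A and B
    score_a   -- 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    k         -- step size (k-factor); 32 is a common choice
    """
    # Expected score of A under the Elo logistic model
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    # Ratings move in opposite directions by the same amount
    return r_a + delta, r_b - delta

# Two systems starting at 1000; A wins the battle -> (1016.0, 984.0)
a, b = elo_update(1000, 1000, 1.0)
```

Upsets (a low-rated system beating a high-rated one) move the ratings more than expected wins, which is why results stabilize after enough battles.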