Unlike text generation with LLMs, text-to-video generation brings unique challenges: balancing realism, prompt alignment, and artistic vision is far more nuanced and subjective than evaluating generated code.<p>But how do we measure the quality of the outputs?
Is the choice of color more important than realism, or is it the composition of the scene?<p>We’ve launched a Text-to-Video Model Leaderboard to explore these questions, inspired by the LLM Leaderboard (lmarena.ai). Our idea: many models exist, but only unbiased side-by-side comparison can reveal what users of text-to-video models actually find most important.<p>Right now, the leaderboard includes five open-source models:
* HunyuanVideo
* Mochi1
* CogVideoX-5b
* Open-Sora 1.2
* PyramidFlow<p>We plan to expand it to include proprietary models from Kling AI, LumaLabs.ai, and Pika.art.
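<p>For reference, arena-style leaderboards like lmarena.ai typically aggregate pairwise human votes into an Elo-style rating. Below is a minimal Python sketch of that aggregation; the K-factor, starting ratings, and votes are made up for illustration and are an assumption about how such a ranking can work, not a description of our leaderboard's actual implementation.

    # Minimal Elo-style aggregation of pairwise votes (illustrative only).
    def expected_score(r_a, r_b):
        # Probability that model A beats model B under the Elo model
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update_elo(ratings, winner, loser, k=32.0):
        # Update ratings in place after one head-to-head vote
        e_w = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - e_w)
        ratings[loser] -= k * (1.0 - e_w)

    models = ["HunyuanVideo", "Mochi1", "CogVideoX-5b", "Open-Sora 1.2", "PyramidFlow"]
    ratings = {m: 1000.0 for m in models}
    # Hypothetical side-by-side votes: (winner, loser) pairs
    votes = [("HunyuanVideo", "Mochi1"), ("CogVideoX-5b", "PyramidFlow")]
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    print(sorted(ratings.items(), key=lambda kv: -kv[1]))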
You can check out the current leaderboard here: <a href="https://t2vleaderboard.lambdalabs.com/leaderboard/" rel="nofollow">https://t2vleaderboard.lambdalabs.com/leaderboard/</a><p>We’re looking for feedback from the HN community:
* How should text-to-video models be evaluated?
* What criteria or benchmarks would you find meaningful?
* Are there other models we should include?<p>We’d love to hear your thoughts and suggestions!