I have seen some amazing benchmarks used to rank LLMs' abilities, and it got me thinking: are there similar benchmarks for propensity modelling, churn prediction, or other types of models?<p>Are there best practices for comparing model performance beyond benchmark data when the models may have different underlying datasets?
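To make the second question more concrete, here is a minimal sketch of the kind of comparison I mean, assuming a shared tabular dataset with a binary churn label (the synthetic data and the two model choices are just placeholders):<p><pre><code># Minimal sketch: compare two churn-style classifiers on the same
# metric (ROC AUC) with stratified cross-validation.
# The data is synthetic and stands in for a real propensity/churn table.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for a real churn dataset: 20 features, imbalanced binary label.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("gbm", GradientBoostingClassifier())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
</code></pre><p>That works when the candidates share a dataset; what I don't know is how to rank models trained on different underlying datasets, beyond reporting the same metric on each model's own held-out split.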
On PapersWithCode, different datasets have benchmarks: <a href="https://paperswithcode.com/datasets" rel="nofollow">https://paperswithcode.com/datasets</a><p>You can also break down by task here: <a href="https://paperswithcode.com/sota" rel="nofollow">https://paperswithcode.com/sota</a><p>For churn, you might go to time series forecasting first:
<a href="https://paperswithcode.com/task/time-series-forecasting" rel="nofollow">https://paperswithcode.com/task/time-series-forecasting</a><p>They have this subtask, which is a bit different because it's about novel products rather than continued sales, for example:<p><a href="https://paperswithcode.com/task/new-product-sales-forecasting" rel="nofollow">https://paperswithcode.com/task/new-product-sales-forecastin...</a><p>But you get the idea of how they organise by task.
I'm curious about other benchmarks and interfaces too and would like to see what else people use.<p>I think HuggingFace and Kaggle have some overlap, with different tasks that have benchmarks.
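For example, HuggingFace's evaluate library ships standard metrics such as ROC AUC, so at least the scoring can be kept consistent across models even when the underlying data differs; a tiny sketch with made-up labels and scores:<p><pre><code># Sketch: score a model's held-out predictions with a standard metric
# via HuggingFace's evaluate library. Labels and scores are made up.
import evaluate

roc_auc = evaluate.load("roc_auc")

# Model A, evaluated on its own held-out set.
result_a = roc_auc.compute(references=[0, 1, 1, 0, 1],
                           prediction_scores=[0.2, 0.8, 0.6, 0.3, 0.9])
print("model A ROC AUC:", result_a["roc_auc"])
</code></pre><p>Kaggle competition leaderboards play a similar role, but only within the fixed dataset of each competition.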