Elo sucks – better multiplayer rating systems for smaller games (2019)

161 points by brownbat almost 5 years ago

25 comments

The author didn't benchmark to see if this system is actually any better at predicting outcomes than vanilla Elo. That's how you determine whether your implied win probabilities are being derived accurately from rating differences. The author seems to be under the impression that there's something fixed and concrete about an 1800 rating, but when you change the system, you also change what an 1800 rating means in the first place.

Some of these complaints are solved by existing systems, namely Glicko. For example, rating deviation (RD) helps with experienced players (low RD) losing points to newer players (high RD). It also has a built-in way to discourage inactivity. A player's RD increases over periods of inactivity, so they can be excluded from the leaderboard once it passes a certain point. That allows us to maintain their rating without decreasing it. After all, that's our best guess of the player's skill. It's just a less reliable guess over time.
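As a rough illustration of the inactivity mechanism described above, here is a minimal sketch of the Glicko-1 RD inflation step. The value of `c`, the `rd_cutoff` threshold, and the `on_leaderboard` helper are assumptions chosen for illustration; they are not taken from the article or from any particular game.

```python
import math

MAX_RD = 350.0  # RD that Glicko-1 assigns to an unrated player


def inflate_rd(rd: float, periods_inactive: int, c: float) -> float:
    """Widen RD for the rating periods since the player last competed.

    The rating itself is left untouched; only its uncertainty grows.
    """
    return min(math.sqrt(rd * rd + c * c * periods_inactive), MAX_RD)


def on_leaderboard(rd: float, rd_cutoff: float = 110.0) -> bool:
    """Hypothetical policy: hide players whose rating is too uncertain."""
    return rd <= rd_cutoff


# Assumed constant: c chosen so an RD of 50 grows back to 350 after 100 idle periods.
c = math.sqrt((350.0 ** 2 - 50.0 ** 2) / 100)
active_rd = 50.0
idle_rd = inflate_rd(active_rd, periods_inactive=100, c=c)
print(on_leaderboard(active_rd), on_leaderboard(idle_rd))
```

The design point is that the leaderboard filter reads RD, not the rating, so an inactive player drops off the board without losing any points.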

dvt almost 5 years ago

Elo is great for what it was built for: ranking chess players. Chess (1) is extremely low-variance, (2) has an extremely high skill ceiling, and (3) is 1-on-1. Elo works great for chess, but it would never work for something like poker. Let's briefly go over these three points.

Most games aren't chess -- where the only variance is picking who's black and who's white -- in fact, they might include dozens of RNG mechanics (from critical strikes to ability rolls to spawn points). These mechanics (while fun and well-designed) might pollute your "idealized" model. There's also the problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics, which will also heavily skew win rates. For instance, given a slow combo Magic deck, you will most likely auto-concede to mono red aggro (regardless of skill level). If you're using Elo, this will pollute your model. (Hint: you shouldn't be using Elo.)

Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.

TrueSkill attempts to fix (3) by using clever Bayesian updating on a player-by-player basis[1], but in reality, it's a shit-show. Using Elo (or variants thereof) for team-based games where the team isn't really a team (more like 3-5 random people plopped together for one match) is incredibly misguided, but continues to be implemented in just about every modern multiplayer game (to the players' frustration). Of course, mixing and matching pre-made groups with non-pre-made groups creates as many issues as you might imagine.

In short, why so many game devs are enamored with Elo when it comes to ranking is a bit bizarre.

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/NIPS2006_0688.pdf

CodesInChaos almost 5 years ago

1. The sigmoid function is the closest thing to linear that makes sense on probabilities⁺. A purely linear function would cross 0%/100%, while the sigmoid flattens exponentially as it approaches the extreme values.

2. The fit isn't as bad as the author claims. It looks like the biggest difference between the graphs is that the point differences are scaled differently (400 pts for 90% in Elo vs 800 pts in the second graph). A quick and dirty overlay of the two graphs shows a reasonable fit: https://ibb.co/0YwYH9z

3. I like the observations about player psychology. Satisfying the players is more important than having the mathematically best ranking system.

4. Personally I like Whole History Rating (https://www.remi-coulom.fr/WHR/), but it's unlikely to be popular with players (the psychological criticisms the article makes apply to it as well, with some additional problems, like rank drifting without playing). KGS, which uses a ranking system similar to WHR (but more primitive), certainly draws a lot of criticism for its ranking system.

If I had to design a mathematically optimal ranking system, I'd start with WHR and make parts of it trainable/fittable.

----

⁺ Bayes' theorem turns into addition when applied to logarithmic probabilities, and the sigmoid function converts from logarithmic probabilities to normal probabilities. This property is why it (or its multi-category equivalent, softmax) is used when predicting probabilities using logistic regression or neural networks.
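The footnote's point about Bayes' theorem becoming addition can be checked with a small sketch. The function names and the example numbers are illustrative assumptions, and "logarithmic probabilities" is read here as log-odds.

```python
import math


def logit(p: float) -> float:
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Posterior P(H | E), computed by adding log-odds and mapping back."""
    log_likelihood_ratio = math.log(p_evidence_if_true / p_evidence_if_false)
    return sigmoid(logit(prior) + log_likelihood_ratio)


# Compare against Bayes' theorem written out directly.
prior, p_true, p_false = 0.2, 0.9, 0.3
direct = prior * p_true / (prior * p_true + (1.0 - prior) * p_false)
print(abs(bayes_update(prior, p_true, p_false) - direct) < 1e-12)
```

Each independent piece of evidence contributes one additive term in log-odds space, which is the same shape as a rating difference being fed through the sigmoid.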

IanGabes almost 5 years ago

Creating a custom system to suit your situation's needs sounds great and the thought process was fun to read, but some of the claims lobbed here are pretty questionable.

Specifically, the claim that Dota's matchmaking system is "probably wrong" because the model chosen doesn't match your own findings feels like a reach. Sibling commenters have pointed out how skill variance is important to allow the Elo system to function in games like chess. Additionally, someone else pointed out that the sigmoid function is similar to a linear function close to zero.

It seems at least as likely that Acolytefight doesn't have a high enough level of skill expression present in the game to see top players "curve out" weaker players, rather than exponential functions for mapping player skill being useless or wrong.

Does Elo suck? Maybe, but this hasn't convinced me.

jrek almost 5 years ago

Elo might or mightn't suck (imo it's a great ranking system). But the article sucks. Vanilla Elo is built around chess, and some adjustments to the scale and/or K-factor might be necessary to fit the circumstances. A quick change of scale to E_a = 1 / (1 + 10 ^ ((Rb - Ra) / 800)) and all of a sudden Elo very accurately reflects the game's actual results: https://imgur.com/a/rFP5U0g

Meaning just that skill is a weaker factor in this game than in chess...

Edit: The 'actual' curve includes a correction for the obvious anomaly of ~55% win expectation at 0 point delta.
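To make the rescaling idea concrete, here is a minimal sketch of the Elo expected-score curve with the divisor exposed as a parameter. The function name and the example ratings are assumptions for illustration; 400 is the conventional chess value and 800 is the value suggested in the comment above.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under a logistic Elo curve.

    A larger `scale` flattens the curve: the same rating gap implies
    a win probability closer to 50%.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# The same 200-point gap under the chess scale and the flatter scale.
chess_like = expected_score(1700, 1500, scale=400.0)
flatter = expected_score(1700, 1500, scale=800.0)
print(round(chess_like, 3), round(flatter, 3))  # 0.76 0.64
```

With `scale=800`, a player needs twice the rating gap to reach the same expected score, which is one way of saying skill matters less per point than it does in chess.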

runarberg almost 5 years ago

I remember a while back the Go server where I play most of my Go these days, OGS (https://online-go.com), changed its ratings from Elo to Glicko-2.

You can read their rationale for it in this forum thread: https://forums.online-go.com/t/ogs-has-a-new-glicko-2-based-rating-system/13058

The key takeaway is this:

> Most of the shortcomings [of Elo] can be traced back to the fact that the system is too slow to find a player's correct rank, and too slow to adapt when jumps in strength occur.

> The problem of slow moving ratings is a well-known problem with Elo implementations. In response to this, Prof. Mark Glickman developed the Glicko, and later Glicko-2, rating systems which address this problem very well and are fairly widely used.

A few weeks ago they made an update to their implementation of Glicko-2, and in the announcement they summarized many interesting statistics on how the system has panned out for them: https://forums.online-go.com/t/2020-rating-and-rank-tweaks-and-analysis/28649

BSTRhino almost 5 years ago

Wow, I wrote this article ages ago, didn't expect to see it posted here today.

I just want to clarify the point of the article:

Why would you fit a curve to the data when you can just use the actual data?

That's the point of the article.

We're in the age of big data, and we should use it to make better win rate predictions. Elo's exponential curve is fine, it's approximately right, it's just that now we can have databases of millions of games and we can do better. Elo was invented before the big data age and it is limited by that.

That's all I'm saying.

I shouldn't have included all the other stuff in the article, it just distracts from the point.
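One way to read "just use the actual data" is a lookup table of observed win rates per rating-difference bucket. The sketch below is an assumption about what that could look like, not the article's actual method; the bucket width, the function names, and the sample games are all hypothetical.

```python
import math
from collections import defaultdict


def build_win_rate_table(games, bucket_width=100.0):
    """Map each rating-difference bucket to its observed win rate.

    `games` is an iterable of (rating_diff, won) pairs, where rating_diff
    is the player's rating minus the opponent's and `won` is a bool.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rating_diff, won in games:
        bucket = math.floor(rating_diff / bucket_width)
        totals[bucket] += 1
        if won:
            wins[bucket] += 1
    return {bucket: wins[bucket] / totals[bucket] for bucket in totals}


def predict(table, rating_diff, bucket_width=100.0, default=0.5):
    """Look up the empirical win rate; fall back to `default` for unseen buckets."""
    return table.get(math.floor(rating_diff / bucket_width), default)


# Tiny made-up sample, only to show the shape of the data.
sample = [(50, True), (60, False), (150, True), (170, True), (-30, False)]
table = build_win_rate_table(sample)
print(predict(table, 120), predict(table, 999))
```

A table like this needs many games per bucket before it beats a fitted curve, which is why the approach only becomes attractive with large match databases.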

dcl almost 5 years ago

If you're interested in evaluating and rating/ranking agents, it might be worthwhile checking out DeepMind's multidimensional Elo rating system (https://arxiv.org/abs/1806.02643), which attempts to solve some of the issues with Elo and Glicko. Most notably, it can handle non-transitive interactions (like rock, paper, scissors) and the presence of redundant duplications of matches that might erroneously inflate ratings.

Shameless plug: I've created an R implementation of it here: https://dclaz.github.io/mELO/

noctilux almost 5 years ago

I'm curious about whether the author tried to optimize Elo's K factor. It's often left at 32, which is not reasonable for all contests. It's essentially related to the standard deviation of player skills: if there is a large range of skills, it should be large, and if there is a small range, it should be small. It's easy to tune by optimisation, and it has a huge effect on predictive ability.
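A minimal sketch of tuning K by optimisation: replay the match history with each candidate K, score the pre-game predictions with log loss, and keep the best. The candidate grid, the starting rating of 1500, and the function names are assumptions for illustration, and the sample history is made up.

```python
import math


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def log_loss_for_k(games, k: float, initial: float = 1500.0) -> float:
    """Replay games in order with the given K; return mean log loss.

    `games` is a list of (player_a, player_b, score_a) with score_a = 1.0
    if A won and 0.0 if B won. Predictions are made before each update.
    """
    ratings = {}
    total = 0.0
    for a, b, score_a in games:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        e_a = expected_score(r_a, r_b)
        p = min(max(e_a, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= score_a * math.log(p) + (1.0 - score_a) * math.log(1.0 - p)
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b - k * (score_a - e_a)
    return total / len(games)


def best_k(games, candidates=(8, 16, 24, 32, 48, 64)):
    """Grid search: the candidate K with the lowest replayed log loss."""
    return min(candidates, key=lambda k: log_loss_for_k(games, k))


# Made-up history in which one player always wins, so a large K fits best.
history = [("alice", "bob", 1.0)] * 20
print(best_k(history))
```

On real data the loss should be measured on held-out games rather than the games used to pick K, otherwise the chosen value will be biased toward overfitting.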

HideousKojima almost 5 years ago

The more obvious solution is to bring back custom lobbies and private servers and forget about ranking players at all. It gets rid of a lot of bad behavior too, because servers can police their own communities and players won't get frustrated when a crappy teammate is dragging their ranking down.

im3w1l almost 5 years ago

> If we take a top-level player, and make them fight a high-level, mid-level and low-level player repeatedly until we can become statistically confident of their win rates against each, there is no reason why their win rates would fit an exponential curve.

When I first read this, I thought to myself "well we get to pick the scores, so it's exponential by definition". The problem becomes more clear when you express it without any reference to the scores.

If Player A wins 80% of the time against Player B, and Player B wins 80% of the time against Player C, how often does Player A win against Player C? This is a question purely in terms of observables. Elo makes a prediction here (94.1% of the time) and it can be either right or wrong. If it's wrong, then there is no valid assignment of scores.
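The 94.1% figure can be reproduced by converting each 80% win rate into a rating gap, adding the gaps, and converting back. This sketch assumes the standard 400-point logistic Elo curve; the function names are illustrative.

```python
import math


def rating_gap(p_win: float, scale: float = 400.0) -> float:
    """Rating difference implied by a win probability under the Elo curve."""
    return scale * math.log10(p_win / (1.0 - p_win))


def win_probability(gap: float, scale: float = 400.0) -> float:
    """Win probability implied by a rating difference under the Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


gap_ab = rating_gap(0.8)  # A over B
gap_bc = rating_gap(0.8)  # B over C
p_ac = win_probability(gap_ab + gap_bc)
print(round(p_ac, 3))  # 0.941
```

Equivalently, the odds multiply: 4:1 against B and 4:1 for B against C give 16:1, or 16/17, for A against C. The result does not depend on `scale`, which is why the prediction is testable purely from observed win rates.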

gverrilla almost 5 years ago

Isn't a qualitative system possible? It would be really complex to create for a game such as Dota 2 or CS:GO, but maybe not for a simpler game. I will give CS:GO as an example only because I know it very well. It would be possible, I believe, in theory, to measure player knowledge of specific in-game skills. New CS players, for instance, wouldn't know how to control recoil effectively. And 100% of global elite/pro players would be above a certain threshold regarding recoil control. On the other hand, you could say with a lot of confidence that a player who tries to reach high ground by pressing only +jump multiple times with no success, when he would need a crouch jump instead because of the height, is a noob. Elo or something similar could then be used to measure ranks within specific clusters only. And some measure of mood/form on top of this, to allow for a better experience (even though I have played CS for 20 years now, it could happen that I abandon the game for a few months, or that my focus is really bad because of external events).

I'm not sure if this makes sense, but what I know for sure is that as an experienced player, I can watch a player play a single game (sometimes a few rounds) and assess his average rank/skill level with high confidence, with no need for information from his prior games whatsoever, or detailed statistics of his gameplay.

There's something else to remember for high skill-ceiling games: win rate is not what really matters. A lot of times I will play a very good, balanced and fun game and lose. Sometimes it will even happen with very uneven scores like 16-5 or something...

closed almost 5 years ago

I am pretty sure the author is describing a well-understood limitation of Elo; they just need a tiny bit of connecting to models.

Elo can be thought of as an approximation to item response theory (IRT) models [1]. These describe skill as normally distributed, and model whether one person will win using a logistic function (not exponential).

I think what the author has keyed in on is that, afaik, in simple Elo there is no slope coefficient for the logistic, but in general IRT models there is (called item discrimination). So in Elo you can't learn that flatter curve they show.

[1]: http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/Publications_files/papers/klinkenberg.pdf

edaemon almost 5 years ago

The "newbie suppression" mechanic doesn't make much sense to me. If you play against someone substantially lower in rating than you and lose, shouldn't you lose a significant amount of points? After all, you lost to someone you should have easily beaten.

duaoebg almost 5 years ago

Repeated Bernoulli trials give rise to Gaussian distributions, which is where the e exponential comes from.

This is an assumption and an approximation, and is not necessarily a good fit. Pulling from actual probabilities would generally perform better.

The rest is massaging to better fit the different objectives.

Godel_unicode almost 5 years ago

If your curve is linear, it's because your game isn't that hard (or, more formally, because winning and skill are less strongly correlated in it). This is tough for people to hear if their game is "designed to be a high-skill game".

The curve being linear means essentially that skill in the game confers less of a relative advantage. Chess is a good counterexample here, as is Rocket League. Both are games where difference in MMR is very strongly correlated with outcome, and both are games where skill is easily measured and highly correlated with ranking.

sytelus almost 5 years ago

Take a look at TrueSkill, a much better, mathematically grounded system created at Microsoft Research and used at scale on Xbox: https://en.m.wikipedia.org/wiki/TrueSkill

IshKebab almost 5 years ago

TrueSkill definitely has a time decay term and I'm fairly sure it lets you fit the model to previous games. I wonder if the author actually tried it. (Though to be fair I'm not sure if there are open source versions of the latest version of TrueSkill.)

neolefty almost 5 years ago

How about coop games — what would you use to rate players where the goal is to win together?

EGreg almost 5 years ago

Wait why don’t we use a deep learning thingy on this dataset and just back out a formula that predicts the wins based on just the relative numbers of the people?

musicale almost 5 years ago

Nonsense - they're in the Rock and Roll Hall of Fame after all! Jeff Lynne is a musical genius.

philliphaydon almost 5 years ago

Elo was in Age of Empires back when zone.com was a thing. It worked and worked well. Points were calculated for each person. However, Dota 2 and LoL don't implement Elo the same way; points are calculated for the team. So if you have a low score and you win against high-rated players, in Dota and LoL you won't gain many points.

I believe this is done to avoid being carried, but it doesn't work because it just results in you being stuck in a low tier for ages.

TLDR: Elo works and it's great. No one implements it right.

Edit: In Age of Empires / Zone, if you had a 4v4, it used all 8 players to calculate the Elo change for each individual player. So if your team had a 1750 Elo player, a 1550 Elo player, and others anywhere in between, the 1750 might gain only 1 point, while the 1550 might gain 16 points (the maximum gain got lower the more people played). On the losing side, the lowest Elo player would lose the fewest points and the highest would lose the most.

Dota / LoL don't do this; everyone on the winning/losing team gains/loses the same amount of points. This is wrong.

This means a high Elo player has the potential to farm points from low Elo players with little risk, while low Elo players get stuck not playing people in their own range.

dang almost 5 years ago

I recall at least one large previous thread about Elo but can't find it. Anyone?

afwaller almost 5 years ago

This is useful to increase plays by reducing "ladder anxiety".

letmeinhere almost 5 years ago

Isn't that a logarithmic curve?
