Ratings: A Mathematical Study


by Douglas Zare and Adam Stocks

This article originally appeared in GammonVillage in 2001.
Thank you to Douglas Zare for his kind permission to reproduce it here.

"If I am given a formula, and I am ignorant of its meaning, it cannot teach me anything, but if I already know it what does the formula teach me?"
—St. Augustine De Magistro ch X, 23.

Introduction

Although ratings systems are not directly part of backgammon, they are common and interesting though frequently frustrating. In this article we report some facts about the rating systems on three backgammon servers that may increase the amount of information and enjoyment you get from backgammon ratings. The questions we will address are as follows:

If one is misrated, how long does that last?

What is the half-life of a ratings difference? That is, how quickly do ratings differences decay?
How long does it take to bounce back to normal?

How large are the normal swings of one's rating?
How high should one's maximum rating ever be?
Can one estimate someone's true rating from their experience level and maximum rating?

Finally, why do the answers given here differ from other answers in the rec.games.backgammon archive?

First, how do ratings work? There are many methods, but all of the ones we will discuss have the property that there is a formula estimating the probability of winning a match based on (1) the difference between the ratings of the players and (2) the length of the match. One gains rating points if one wins and loses rating points if one loses so that if the formula's estimates are accurate, then on average, one's rating stays the same. The number of rating points at stake also varies.
For example, suppose that on FIBS one plays a 25-point match with a player so much weaker that one has an 80% chance of winning. If this is accurately reflected in the ratings, then when one wins, one will gain 4 rating points. If one loses, one loses 16 points. Another way of thinking about this is that one gives up 6 points before the match, and then plays for a stake of 10 points.
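The arithmetic in this example can be sketched in a few lines. The sketch assumes the commonly quoted FIBS formula, which this article does not spell out: the player ahead by D rating points wins an n-point match with estimated probability 1/(1 + 10^(-D sqrt(n)/2000)), and the total stake is 4 sqrt(n). The function names are ours, and the initial ramp is ignored.

```python
import math

def win_probability(rating_diff, match_length):
    """Estimated chance that the player ahead by rating_diff wins (assumed FIBS formula)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff * math.sqrt(match_length) / 2000.0))

def rating_changes(rating_diff, match_length, stake_factor=4.0):
    """Return (points gained on a win, points lost on a loss) for that player."""
    stake = stake_factor * math.sqrt(match_length)
    p = win_probability(rating_diff, match_length)
    return stake * (1.0 - p), stake * p

# Rating difference at which a 25-point match is rated an 80% proposition.
diff_80 = 2000.0 * math.log10(4.0) / math.sqrt(25)
gain, loss = rating_changes(diff_80, 25)
print(round(diff_80, 1), round(gain, 2), round(loss, 2))
```

Under these assumptions an 80% favorite in a 25-point match is about 241 points ahead, and the stake of 20 points splits into the +4 and −16 described above.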
If your rating is very far from what it should be, then it will tend to move closer to your playing strength. On the other hand, from the luck of the dice your rating will fluctuate. In this article we will separate those effects, determine the stable distribution of ratings, and show that the maximum rating is often a better indicator of playing strength than one's current rating.

1. The Half-Life

Suppose your rating suddenly drops by 100 points due to letting your goldfish use your account or just a streak of bad luck. How long will this change last? In some sense, the residue of the change will always be there, but the effect will diminish with time, much like a radioactive substance will decay.

Another way of thinking about the half-life is as follows: Your current rating is a function of whether you won or lost in past games. More recent games are more important than games played many experience points ago. The half-life is how far back one must go to find matches that count only half as much as the most recent ones.

What is the restoring force that pushes your rating toward its natural level? Suppose that in the example above, you will only win 75% of the 25-point matches, so the fair payoffs are +5 and −15 rather than +4 and −16. Rather than paying 6 points to play for a stake of 10 points, the fair amount is to pay 5 points, so you lose 1 point, and then play a fair game. It doesn't matter much whether you win or lose that game.
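In symbols (the notation is ours, not part of the rating formula): let S be the total stake, q the winning chance implied by the ratings, and p the true winning chance. The expected rating change per match is then

```latex
\[
E[\Delta] \;=\; p\,S\,(1-q) \;-\; (1-p)\,S\,q \;=\; S\,(p-q)
\;=\; 20\,(0.75-0.80) \;=\; -1 .
\]
```

So the restoring force is simply the stake times the error in the estimated winning chance, and it does not depend on the result of the match.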

Example:

Suppose players A and B play 5-point matches on FIBS against players rated 1500, 1700, and 1600 in that order, and both of them win all 3 matches. Player A starts with a rating of 1500. Player B starts with a rating of 1600.

                 Player A    Player B
After match 1    1504.472    1603.900
After match 2    1510.047    1608.922
After match 3    1515.034    1613.343

So, after 3 matches, the rating of player A is less than that of player B by 98.309 rating points.
Suppose instead they lose all 3 matches.

                 Player A    Player B
After match 1    1495.528    1594.955
After match 2    1492.206    1591.084
After match 3    1488.351    1586.663

After 3 matches, the rating of player A is less than that of player B by 98.313 points, almost exactly the same amount as if both had won all 3 matches! After another 100 5-point matches with the same opponents (with ratings near 1500) and the same results, player A and player B will probably have ratings that differ by between 55 and 60 rating points.
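The two small tables in this example can be reproduced with a short sketch. It assumes the commonly quoted FIBS update (stake 4 sqrt(n), estimated winning chance 1/(1 + 10^(-D sqrt(n)/2000)) for the player ahead by D) and ignores the ramp; the helper names are ours.

```python
import math

def update(rating, opponent, won, match_length=5, stake_factor=4.0):
    """One rating update under the assumed FIBS formula, ignoring the ramp."""
    root = math.sqrt(match_length)
    expected = 1.0 / (1.0 + 10.0 ** (-(rating - opponent) * root / 2000.0))
    return rating + stake_factor * root * ((1.0 if won else 0.0) - expected)

def play(start, opponents, won):
    """Ratings after each match of a sequence that is all wins or all losses."""
    history, rating = [], start
    for opponent in opponents:
        rating = update(rating, opponent, won)
        history.append(round(rating, 3))
    return history

opponents = [1500, 1700, 1600]
for won in (True, False):
    a = play(1500, opponents, won)
    b = play(1600, opponents, won)
    print("wins" if won else "losses", a, b, round(b[-1] - a[-1], 3))
```

The gap between the two accounts shrinks by the same amount whether both win or both lose, because the result term cancels when the two updates are subtracted.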

The following graph illustrates the convergence of ratings in a simulation. It shows the ratings on FIBS versus experience of accounts that start with a range of rating, but have exactly the same pattern of opponents and wins and losses. All opponents were assumed to be correctly rated at 1500. To generate the sequence of wins and losses, it was assumed that the actual playing strength was 1500.

The restoring force is approximately proportional to the length of the match and to the difference between the ratings of player A and player B, or between player A's rating and playing strength. This means that it makes sense to talk about a half-life of a ratings difference in units of experience points, the sum of the lengths of the matches that one plays. The half-life varies slightly with the rating of one's opponent, the difference between the two players' ratings, and the length of the matches. The following table shows some empirically derived (from simulations) half-lives for various values of each parameter:

server   difference   matchlength   halflife
FIBS     300          5             605-655
FIBS     100          5             605-610
FIBS     10           5             605
FIBS     1            5             605
FIBS     100          1             602-604
FIBS     100          25            600-650
FIBS     100          5*            705-880
GGrid    100          1             641
GGrid    100          5             585-590
GGrid    100          25            575-625
GGrid    100          5*            650-760
GS2000   100          1             482-484
GS2000   100          5             485-495
GS2000   100          25            500-525
GS2000   100          5*            555-680

The lines with matchlength 5* give the half-lives when the opponent is 400 points weaker or stronger, rather than approximately equal to the player. The opponent's rating and strength were chosen to be between the upper and lower ratings that fit this description.

For example, this means that if two players on FIBS whose ratings differ by 100 points play only 5-point matches with players of about the same rating, then after 121-122 matches (605-610 experience points) the difference between the ratings will be just under 50 points if they have the same pattern of wins and losses. Of course, if your true rating is 1500 and your actual rating is 1400, it is quite possible that after 600 exp, your rating will still be 1400. However, for that to happen, you would have to be unlucky to the extent that if your rating started at 1500 it would now be about 1450. The 100 point gap from 1400 to 1500 has diminished to a 50 point gap between 1400 and 1450, and you have another 50 points of bad luck to work off.
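A half-life of this kind can be measured with a small simulation. The sketch below is ours and rests on assumptions: the commonly quoted FIBS update without the ramp, a single opponent rating of 1550 midway between the two accounts, and a strictly alternating win/loss pattern in place of random results.

```python
import math

def update(rating, opponent, won, match_length, stake_factor=4.0):
    """One rating update under the assumed FIBS formula, ignoring the ramp."""
    root = math.sqrt(match_length)
    expected = 1.0 / (1.0 + 10.0 ** (-(rating - opponent) * root / 2000.0))
    return rating + stake_factor * root * ((1.0 if won else 0.0) - expected)

def half_life(low, high, opponent, match_length=5):
    """Experience needed for the gap between two accounts to halve."""
    target = (high - low) / 2.0
    experience, won = 0, True
    while high - low > target:
        # Both accounts face the same opponent and get the same result.
        low = update(low, opponent, won, match_length)
        high = update(high, opponent, won, match_length)
        won = not won              # assumed pattern: alternating results
        experience += match_length
    return experience

print(half_life(1500.0, 1600.0, opponent=1550.0))
```

With these choices the result should land in the neighbourhood of the 605-610 experience points quoted above; other opponent ratings or win/loss patterns shift it slightly, as the table suggests.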

The values in the table don't differ from each other much. On FIBS, the half-life varies from 600 to 650 for normal play. On GamesGrid, the half-life varies from 575 to 641. On GameSite 2000, the half-life varies from 482 to 525.

This can be varied slightly to give an approximate expected rating of a player who goes through the "ramp" or initial adjustment period in which the games are worth more. The ramped period of 400 experience points is approximately equivalent to 1200 unramped experience points, or about 2 half-lives on FIBS. So if a player's strength is 1600, then after the ramped 400 experience points one expects that the rating has risen from 1500 to about 1575. On GamesGrid, the ramp lasts 500 experience points and is about equivalent to 1500 experience points, or 2.5 half-lives, so a player of strength 1800 can expect to be rated about 1747 after the ramp.
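The ramp estimates amount to letting the initial gap decay for the equivalent number of unramped experience points. A minimal sketch using only the figures quoted in this paragraph (the function name is ours, and a half-life of 600 is taken as a round value for both servers):

```python
def expected_rating(start, strength, effective_experience, half_life):
    """Expected rating once the initial gap has decayed for the given experience."""
    gap = strength - start
    return strength - gap * 0.5 ** (effective_experience / half_life)

# FIBS: the ramp is worth about 1200 unramped points, i.e. 2 half-lives.
print(round(expected_rating(1500, 1600, 1200, 600)))   # 1575
# GamesGrid: the ramp is worth about 1500 unramped points, i.e. 2.5 half-lives.
print(round(expected_rating(1500, 1800, 1500, 600)))   # 1747
```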

The following graph shows the expected rating on GamesGrid versus experience, taking into account the ramp, and assuming that one plays 5-point matches against those whose (correct) ratings equal the new player's nominal rating.

Incidentally, the restoring force is greatest when one plays people rated half-way between one's rating and playing strength. Someone of playing strength 1700 who is rated 1500 should play those correctly rated 1600 to increase their rating as rapidly as possible.
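This can be checked numerically by scanning over opponent ratings. The sketch assumes the commonly quoted FIBS formula for 5-point matches without the ramp, and opponents whose ratings are correct; the function names are ours.

```python
import math

def win_probability(diff, match_length=5):
    """Estimated chance of winning when ahead by diff rating points."""
    return 1.0 / (1.0 + 10.0 ** (-diff * math.sqrt(match_length) / 2000.0))

def expected_gain(rating, strength, opponent, match_length=5, stake_factor=4.0):
    """Expected rating change per match against a correctly rated opponent."""
    stake = stake_factor * math.sqrt(match_length)
    true_p = win_probability(strength - opponent, match_length)
    rated_p = win_probability(rating - opponent, match_length)
    return stake * (true_p - rated_p)

# Player of strength 1700 who is currently rated 1500.
best = max(range(1300, 1901), key=lambda opp: expected_gain(1500, 1700, opp))
print(best, round(expected_gain(1500, 1700, best), 3))
```

The maximum falls at 1600 because the gap between the true and the rated winning chance is widest where the probability curve is steepest, which is midway between the two values.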

2. How long does it take to bounce back from a fall?

A player, call her Double7 (not her real id), recently started playing on GamesGrid with a rating of 1800 (and no ramp). Double7's play on another server indicates that she actually should have been granted a rating of at least 1900, and her rating quickly passed 1900. If we assume that Double7's playing strength is 1900, how long should she wait before her rating is first at least 1900?

Her expected rating will never be 1900, though it will be about 1850 after 600 exp, about 1875 after 1200 exp, etc. because of the half-life phenomenon. She might encounter good luck and get a rating of 1900 after only a few matches, or have enough bad luck that she does not reach 1850 even after 5000 exp. Although there are advanced mathematical tools that can allow one to describe the distribution of the times before one reaches 1900, the simplest method is just to simulate the process.

The following is a table of the 1st percentile, 10th percentile, 50th percentile, 90th percentile, and 99th percentile times to bounce back from various ratings differences assuming that one plays only 5-point matches against players with one's target rating using the GamesGrid formula. The sample sizes were 999, so the numbers may be off by a few percent.

delta   1st    10th   50th   90th   99th
20      25     45     215    1360   3185
50      95     200    675    1960   4110
100     305    500    1145   2490   4455
200     660    965    1690   3040   5240
500     1375   1775   2540   4040   6155

So, for example, it seems that there was a 10% chance that Double7 would reach 1900 by 500 exp. The median time to reach 1900 is about 1145 exp, and there was about a 10% chance that Double7 would not have reached 1900 even after 2490 exp. If Double7 should be rated 1920, then the times would be 265, 445, 930, 2005, and 3310. On the other hand, she would get to 1900 slightly more slowly if she played only players of strength 1700: 330, 545, 1195, 2585, 4630.
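A simulation of this kind is short to write. The sketch below is not the code behind the table: since the GamesGrid formula is not given here, it substitutes the commonly quoted FIBS update (stake 4 sqrt(n), no ramp), so its percentiles should be close to, but not the same as, those above. The seed and helper names are arbitrary choices of ours.

```python
import math
import random

def bounce_back_time(deficit, strength=1900.0, match_length=5,
                     stake_factor=4.0, max_experience=20000, rng=random):
    """Experience until a player rated `deficit` below strength first reaches it."""
    root = math.sqrt(match_length)
    rating, experience = strength - deficit, 0
    while rating < strength and experience < max_experience:
        # The opponent is correctly rated at the player's true strength, so
        # the player really wins half the time whatever the ratings say.
        expected = 1.0 / (1.0 + 10.0 ** (-(rating - strength) * root / 2000.0))
        won = rng.random() < 0.5
        rating += stake_factor * root * ((1.0 if won else 0.0) - expected)
        experience += match_length
    return experience

rng = random.Random(2001)
times = sorted(bounce_back_time(100, rng=rng) for _ in range(999))
for pct in (1, 10, 50, 90, 99):
    print(pct, times[round(pct / 100 * (len(times) - 1))])
```

Changing `deficit`, or replacing the opponent's rating with something other than the target, gives the other scenarios discussed in this section.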

By symmetry, one can also remain above one's true strength for as long a time.

3. Ratings Variations

Just from the luck of the dice, our ratings vary a lot from our playing strength. Some of that comes from differences in playing style, say, when one is tired, or if one plays overrated or underrated opponents. If one ignores those effects, how much should one's rating vary?

The following table gives experimental evidence (from simulations) of how much one's rating varies on GamesGrid assuming that one plays only 5-point matches against players of one's true strength. Note that we assume there is a ramped period from 0 exp to 500 exp in which ratings change more. These values were taken from samples of size 999, so they are not exact. The values in the outer columns are less precise (±10 points) than those in the middle (±3 points).

rating   exp     1st    10th   50th   90th   99th   stddev   average
1500     500     1354   1415   1597   1584   1658   65.1     1499
1500     1000    1374   1436   1499   1565   1605   50.4     1499
1500     2500    1405   1446   1500   1554   1598   41.6     1500
1500     10000   1400   1445   1500   1554   1590   41.5     1500
1600     500     1440   1501   1587   1669   1731   63.5     1585
1700     500     1525   1580   1669   1752   1815   66.2     1668
1800     500     1595   1665   1747   1833   1900   65.8     1748
1900     500     1693   1743   1828   1922   1984   66.7     1831
2000     500     1758   1832   1914   1994   2055   62.2     1913

Interestingly, the ramped period increases the variance dramatically, but it then decreases. The distribution after 10000 experience is more tightly distributed about the true value than after 1000, and that is more tightly distributed than after 500.

However, we can make a mathematical model for the limiting distribution that we can solve. The probability density function of the stable distribution satisfies the following differential equation, which comes from the fact that as many players should increase past rating x as decrease past rating x if the distribution is stable.

\[ \frac{\text{variance}}{2}\,\text{pdf}'(x) = \text{drift}(x)\,\text{pdf}(x) \]

pdf(x) represents the stable probability density function.
variance is the variance per experience, 16 p(1 − p) or 25 p(1 − p), where p is the probability of winning the match.
drift(x) is the restoring force per experience.
pdf'(x) is the derivative of pdf(x) with respect to x.

This is satisfied when

\[ \text{pdf}(x) = c \exp\!\left( \frac{2}{\text{variance}} \int_{x_0}^{x} \text{drift}(t)\,dt \right) \]

where x_0 is any fixed reference rating and the constant c is chosen so that the total probability is 1. The result is a distribution which looks similar to the Gaussian (normal) distribution, the classic bell curve, but the tails are fatter.

With this solution, one can compute the percentiles of the stable distribution on each of the servers considered. The following assumes that one plays only players correctly rated one's true strength.

                          1st       10th     50th   90th     99th      stddev
FIBS 1-point matches:     -97.00    -53.43   0      +53.43   +97.00    41.69
FIBS 5-point matches:     -97.16    -53.47   0      +53.47   +97.16    41.73
FIBS 25-point matches:    -97.95    -53.71   0      +53.71   +97.95    41.98
GamesGrid 5-point:        -95.51    -52.58   0      +52.58   +95.51    41.04
GameSite 2000 5-point:    -108.68   -59.80   0      +59.80   +108.68   46.68

These are the limiting distributions. Over the very long run, a player on FIBS playing only 5-point matches with correctly rated players of equal strength should spend 1% of the time with a rating more than 97.16 points below the correct value. One is 200 points overrated just under a millionth of the time—if you hear of someone honestly falling 200 rating points they were probably overrated at the start and underrated afterwards.
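The FIBS rows can be approximated numerically. The sketch below is ours and rests on assumptions: the stable density is taken to be proportional to exp((2/variance) times the integral of drift), the commonly quoted FIBS formula supplies the drift, opponents are rated at one's true strength so that p = 1/2, and the integral is done by the trapezoid rule on a grid of rating errors.

```python
import math

def stable_distribution(match_length, stake_factor=4.0, step=0.1, limit=600.0):
    """Grid of rating errors x and the stable probability of each grid cell."""
    root = math.sqrt(match_length)
    variance = stake_factor ** 2 * 0.25      # per experience, with p = 1/2

    def drift(x):
        # Restoring force per experience when rated x above true strength.
        expected = 1.0 / (1.0 + 10.0 ** (-x * root / 2000.0))
        return (stake_factor / root) * (0.5 - expected)

    count = int(round(2 * limit / step))
    xs = [-limit + i * step for i in range(count + 1)]
    # log pdf(x) = (2 / variance) * integral of drift, by the trapezoid rule.
    log_pdf = [0.0]
    for left, right in zip(xs, xs[1:]):
        log_pdf.append(log_pdf[-1] + (drift(left) + drift(right)) * step / variance)
    peak = max(log_pdf)
    weights = [math.exp(value - peak) for value in log_pdf]
    total = sum(weights)
    return xs, [w / total for w in weights]

def summarize(xs, probs):
    """Standard deviation and 1st percentile of the gridded distribution."""
    mean = sum(x * p for x, p in zip(xs, probs))
    stddev = math.sqrt(sum((x - mean) ** 2 * p for x, p in zip(xs, probs)))
    cumulative = 0.0
    for x, p in zip(xs, probs):
        cumulative += p
        if cumulative >= 0.01:
            return stddev, x
    return stddev, xs[-1]

for n in (1, 5, 25):
    sd, pct1 = summarize(*stable_distribution(n))
    print(n, round(sd, 2), round(pct1, 1))
```

Passing `stake_factor=5.0` models a server whose stakes are 5 sqrt(n); the GamesGrid row would need that server's own formula, which is not reproduced here.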

The formulas for GamesGrid and for FIBS look different, but produce almost exactly the same stable distributions. On the other hand, the formula for GameSite 2000 only differs in that the total stakes for a match are 5 sqrt(n) rather than 4 sqrt(n). This means that the half-life is smaller, but a string of victories has a greater effect on one's rating, so the deviation from one's true rating is slightly larger.

Is the expected rating of someone of playing strength 1500 equal to 1500? Oddly, no. What the rating system ensures is that the mode is 1500, i.e., it is more likely that the player is rated between 1499 and 1501 than between 1549 and 1551. However, the median (50th percentile) and the average value both depend on the strength of one's opponents, even with the assumption that one's opponents are properly rated.

5-point matches on FIBS
against opponents        1st      10th     50th    90th     99th     average
of equal strength:       -97.16   -53.47   0       +53.47   +97.16   0
100 points stronger:     -97.87   -53.82   -0.19   +53.13   +96.45   -0.29
300 points stronger:     -99.21   -54.48   -0.55   +52.48   +95.13   -0.82

While it is true that in this model one can increase one's average rating by playing only weaker players, the effect is very small.

Another effect not captured by this model is that one's opponents are not necessarily correctly rated. This differs from chess, where almost all tournament players say they are 150 points stronger than their ratings; in backgammon there really are ratings swings. The direct effects are very small. If players rated 300 points above one's strength are really either 200 points above or 400 points above with equal likelihood, then for the purposes of playing 5-point matches it is as though they are actually only 295 points stronger. So if one plays only people rated 300 points above one's own rating, then rather than averaging 1 point lower, one would average 4 points higher than one's true rating.
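The figure of 295 can be checked by averaging winning chances and inverting the probability formula. The sketch assumes the commonly quoted FIBS formula for 5-point matches; the function names are ours.

```python
import math

def win_probability(diff, match_length=5):
    """Estimated chance of winning when ahead by diff rating points."""
    return 1.0 / (1.0 + 10.0 ** (-diff * math.sqrt(match_length) / 2000.0))

def effective_gap(gaps, match_length=5):
    """Single gap giving the same average winning chance as the listed gaps.

    Each gap is how many points stronger the opponent really is.
    """
    average = sum(win_probability(-g, match_length) for g in gaps) / len(gaps)
    # Invert p = 1 / (1 + 10 ** (gap * sqrt(n) / 2000)) for the gap.
    return (2000.0 / math.sqrt(match_length)) * math.log10(1.0 / average - 1.0)

print(round(effective_gap([200, 400]), 1))
```

The effective gap is below 300 because the winning chance is a convex function of the gap in this range, so the easier opponents help more than the harder ones hurt.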

The assumption that players at a given rating are overrated or underrated with equal probabilities is unreasonable, however. Someone rated 1850 might have strength 1900 and a string of bad luck, or 1800 with good luck, but there are more 1800 players than 1900 players, so someone rated 1850 is more likely to be overrated than underrated. To what extent this is true is outside the realm of mathematics, though conceivably a more detailed analysis could model this, but by playing weaker players only when they are overrated one can boost one's rating to the point that it is only a valid indicator of how well one performs against overrated players, not most of the population.

4. One's Maximum Rating

GamesGrid keeps track of each player's highest rating ever and makes this information available in the player information. On other servers, we suspect that players keep track of this data by themselves. How high should this rating be? That depends on your experience level. If you play long enough, you will get an astoundingly lucky streak.

The following tabulates the 1st, 10th, 50th, 90th, and 99th percentiles of the maximum rating of simulated players who start out with 0 experience points on GamesGrid and play only 5-point matches. Note that the mathematical model may break down since people play differently when they have had lucky streaks from when they have had unlucky streaks, and their opponents may change, too.

rating   exp     1st    10th   50th   90th   99th   stddev   average
1500     500     1500   1522   1588   1664   1724   53.9     1592
1500     1000    1500   1522   1590   1663   1742   54.2     1593
1500     2500    1509   1542   1592   1661   1712   46.2     1597
1500     10000   1548   1574   1609   1667   1728   37.5     1616
1600     500     1511   1565   1642   1726   1781   61.8     1645
1600     1000    1531   1580   1648   1722   1788   55.4     1650
1600     2500    1580   1617   1670   1732   1801   45.5     1673
1600     10000   1643   1664   1698   1741   1804   30.7     1701
1700     500     1565   1622   1710   1792   1865   64.2     1709
1700     1000    1609   1657   1728   1804   1873   56.4     1729
1700     2500    1671   1707   1755   1810   1871   41.0     1757
1700     10000   1741   1764   1794   1827   1875   25.7     1795
1800     500     1650   1704   1784   1874   1941   65.1     1787
1800     1000    1687   1737   1808   1881   1946   56.1     1808
1800     2500    1759   1798   1847   1899   1951   40.0     1849
1800     10000   1834   1858   1889   1923   1958   25.2     1890
1900     500     1709   1778   1860   1945   2016   66.9     1860
1900     1000    1769   1821   1893   1962   2033   55.8     1893
1900     2500    1852   1890   1938   1986   2032   37.2     1939
1900     10000   1930   1957   1990   2022   2055   26.0     1990
2000     500     1777   1848   1933   2026   2106   68.4     1936
2000     1000    1859   1913   1976   2045   2095   52.0     1977
2000     2500    1949   1983   2035   2081   2115   36.7     2033
2000     10000   2034   2059   2089   2124   2148   25.1     2090

There are many interesting phenomena which can be observed in the above table.

The top 90th and 99th percentiles don't change much for rating 1500 as one increases the experience level. That is because the high ratings were almost always achieved in the first 500 experience, when a winning streak has a large effect. For example, in these simulations, only 1 of the top 100 out of 999 maximum ratings for the experience levels of 1000 and 2500 were achieved after 500 experience, and only 5 out of the top 100 for the experience level of 10,000 were achieved after 500 experience. Eventually, the swings after the ramped period will exceed the initial streaks, but this will take a very, very long time if the streaks took one to over 150 points above one's true rating. To a lesser extent, this occurs with stronger players, but there is less time to achieve a very high rating before the ramp expires, since part of the time is spent rising to the true rating.

Particularly for the stronger players and large experience levels, the range of maximum ratings is relatively small compared to the range of ratings. At 2500 experience, the gap between the 10th and the 90th percentiles of the maximum rating is about as large as the gap between those percentiles of the current ratings, 100 points. However, at 10000 experience, the gap between the 10th percentile and 90th percentile of the maximum ratings is 60-65 points. For sufficiently large experience levels, the gap between the 1st and 99th percentiles will be arbitrarily small, so the maximum rating will be an extremely accurate indicator of the person's playing strength.

5. Estimating Playing Strength from Max Rating and Experience

Some players with ratings of 1750 on GamesGrid have highest ever ratings of 1760, and some have highest ever ratings of 1920. With which would you prefer to play 100 games for $100 a point? One would expect that the second player is stronger, and has had an unlucky streak, and the first has had a lucky streak, but how can one estimate the true rating of each, assuming that they have played consistently?

This is a subtle statistical issue, and the assumptions are not quite accurate. However, we can use the table constructed above to try to answer it within the model.

If after an experience of 10,000 the maximum rating of a player is 1860 and they have been unlucky as they would be 1 time in 10, then their actual playing strength is about 1800. If they have been lucky as they would be 1 time in 10, their actual playing strength is about 1740. That gap is smaller than the difference between the 10th percentile and the 90th percentile of the rating distributions.
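These estimates come from reading the section 4 table backwards. A minimal sketch, using the 10th and 90th percentile columns at 10,000 experience and straight-line interpolation between tabulated strengths (the interpolation and the helper name are our choices):

```python
# Maximum-rating percentiles at 10,000 experience, copied from the table in
# section 4: true strength -> (10th percentile, 90th percentile).
MAX_RATING_AT_10000 = {
    1500: (1574, 1667), 1600: (1664, 1741), 1700: (1764, 1827),
    1800: (1858, 1923), 1900: (1957, 2022), 2000: (2059, 2124),
}

def strength_for(max_rating, column):
    """Interpolate the strength whose tabulated percentile equals max_rating."""
    rows = sorted(MAX_RATING_AT_10000.items())
    for (s_lo, v_lo), (s_hi, v_hi) in zip(rows, rows[1:]):
        lo, hi = v_lo[column], v_hi[column]
        if lo <= max_rating <= hi:
            return s_lo + (s_hi - s_lo) * (max_rating - lo) / (hi - lo)
    raise ValueError("maximum rating outside the tabulated range")

unlucky = strength_for(1860, column=0)   # maximum is a 10th-percentile outcome
lucky = strength_for(1860, column=1)     # maximum is a 90th-percentile outcome
print(round(unlucky), round(lucky))      # 1802 1734
```

The interpolated values are close to the rounded figures of 1800 and 1740 given above; the table itself is only accurate to a few points.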

Of course, to estimate one's playing strength, one should use both the current rating and the maximum rating. However, for large experience levels, one gets more information from the latter than the former. Even for players of true strength 1500, after 10000 experience the standard deviation of the maximum rating is smaller than the standard deviation of the rating.

Finally, what good does this do you? Perhaps nothing, but it might soothe the irritation of a streak of bad luck, and remind you not to quit your day job if you find yourself rated higher than Snowie.

Appendix: We are Right and They are Wrong

Ok, maybe they are right, too. Other people have written articles on the ratings system, sometimes asking the same questions, and sometimes getting different answers. We'd like to give pointers to some of these articles contained in the rec.games.backgammon archive and explain the differences. If you are interested in the ratings system, it is worth looking at these other perspectives and analyses.

Gary Wong's "Effect of droppers on ratings."

This article introduces the idea of the half-life in one particular case and applies it to the issue in the title. However, an arithmetic error and a modelling error cause the half-life to be underestimated by a factor of 3.

Kevin Bastian's "Statistics."

This article does an empirical analysis of ratings variations from one real-life history. The variations there are much smaller than those one would expect from considering the stable distribution determined in section 3. The reason is that the history is very short. Two samples from the ratings distribution that are one's rating before and after a 5-point match are highly correlated, and there can't be a large difference between them. Even if the ratings are far from the player's true average rating, they will be very close to their own average. This effect biases the analysis, and his observed standard deviation (24) is much smaller than the standard deviation of the stable distribution (42).

David desJardins' "No limit to ratings." and "Unbounded rating theorem."

These show that with probability one, your rating will exceed any given level if you play long enough. That's true enough. How long will it take a player whose strength is 1500 to exceed 2000? At a rate of one 5-point match per second, the player should be over 1900 one second every 4 trillion years, but achieving 2000 happens 6 billion times less frequently. The expected time one waits before achieving 1900 is longer than 4 trillion years by a factor of approximately the number of seconds one expects to stay above 1900 once it is achieved, which is between 3.3 and 4.0.

Ed Rybak's "Ratings swing."

He reports the results of some simulations, including one similar to that of section 2: How long does it take to bounce back from a 100 point rating drop? Note that for a distribution of this type, there is a "fat tail." Our reported median is smaller than his average value due to the small fraction of times that the time to bounce back is very large; the average values in our simulations agree with his.

Jim Williams' "Different length matches."

From the actual results of tens of thousands of 1-point, 3-point, and 5-point matches he finds that the lengths entered for these matches in the ratings formula should have been 1.6, 1.6, and 2.1, respectively, presumably by a least-squares fit. Two conclusions: There is no more skill in a 3-point match than a 1-point match, and longer matches have less skill than the ratings formula indicates.

The problem with this analysis is that these come from real opponents, rather than ideal opponents who always have their correct ratings. Because of the natural variation, people will often be rated higher than stronger opponents, or will have a large ratings difference against an opponent who is only slightly stronger. These effects do not cancel out. The result is that the effective match length will be underestimated by this method of analysis even if the ratings formula is completely correct if one enters the true ratings of the players.

Suppose player A is 60 points stronger than player B. A standard deviation of about 40 says that the ratings are as varied as though half the time one is 40 points overrated, and half the time one is 40 points underrated. If this is the case with both players, then out of 4 matches between players A and B, player A will be rated 140 points greater in one, 60 points greater in two, and 20 points lower in one. That is a lot of variation. The following table indicates the best (least-squares) fit to the effective match length if the two play n-point matches and player A wins exactly the amount suggested by the FIBS ratings formula.

matchlength   A's advantage   effective length
7             60              1.97
5             60              1.40
3             60              0.84
1             60              0.28
7             100             4.08
5             100             2.90
3             100             1.73
1             100             0.58
7             200             6.21
5             200             4.39
3             200             2.61
1             200             0.86
7             400             6.99
5             400             4.94
3             400             2.93
1             400             0.96

Since the effective match length found by this method varies with the difference in ratings, it is unclear what information one can get about the relative skill levels needed to win matches of different lengths. However, since it always underestimates the effective match length, this suggests that 1-point matches may have slightly more skill than his reported effective match length of 1.6 indicates.

William Hill's "Opponent's Strength."

He points out that two identical bots achieved ratings which differed by 150 points. The higher rated bot only played those with ratings under 1750, and the lower-rated bot played only players with ratings over 1750. How can this be reconciled with the statements at the end of section 3 which predict the opposite effect?

I think the crucial factor here is that the player is a bot, and therefore makes predictable errors that stronger human players can exploit disproportionately well. That effect is not easy to model mathematically, but if it is important then it does not invalidate the rating system for human players. It might be that the ratings formula gives a good estimate for the probability each player wins the first match that they play, but not the 10th match in a row. Also, weaker players might be more likely to be grossly overrated or to throw games to a bot. Still, it is an interesting phenomenon.

Christopher Yep's "Possible Adjustments."

Among other interesting possibilities, he suggests adjusting the FIBS formula so that the stakes for an n-point match are 2 sqrt(n) rather than 4 sqrt(n). The only difference between the FIBS formula and that of GameSite 2000 is that FIBS uses 4 sqrt(n) and GS uses 5 sqrt(n). It has been suggested that the bots should have a smaller multiplier, perhaps as though the stakes were 0.5 sqrt(n) for them. What effect would these have?

Larger stakes decrease the half-life linearly, but increase the standard deviation by a factor of approximately the square root. The standard deviation of the stable distribution on GS 2000 is about sqrt(5/4) times the standard deviation of the stable distribution on FIBS. Decreasing the stakes would shrink the ratings variations, but lengthen the half-life; one would be misrated by less, but it would take longer to recover from a drop, or for any improvements in skill to be reflected. If a bot were changed to move only 1/10 as much as the other players on GS 2000, then the standard deviation of its stable distribution would be 14.74 rather than 46.68.
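A rough derivation of these scalings, under simplifying assumptions of our own: stakes of c sqrt(n), the FIBS-style probability formula linearised near one's true rating, opponents rated at one's true strength (so p = 1/2), and a Gaussian approximation to the stable distribution. Here x is the rating error and k the decay rate per experience point.

```latex
\[
\begin{aligned}
\text{drift}(x) &\approx -k\,x, \qquad k = \frac{c \ln 10}{8000}, \qquad
\text{variance} = \frac{c^{2}}{4},\\[4pt]
\text{half-life} &= \frac{\ln 2}{k} = \frac{8000 \ln 2}{c \ln 10} \approx \frac{2408}{c},\\[4pt]
\text{stddev} &\approx \sqrt{\frac{\text{variance}}{2k}}
 = \sqrt{\frac{1000\,c}{\ln 10}} \approx 20.84\,\sqrt{c}.
\end{aligned}
\]
```

With c = 4 this gives a half-life of about 602 and a standard deviation of about 41.7; with c = 5, about 482 and 46.6; and with c = 0.5, a standard deviation of about 14.7. These are in line with the figures quoted in this article, which come from the exact stable distribution with its slightly fatter tails.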

FLMaster39's "Ratings Variation."

This contains the information from experiments on the maximum ratings similar to the ones tabulated in section 4. The main differences are that he used longer histories, collected more data, and did not include a ramped period but rather assumed that the ratings started at the correct values. He also reports average values rather than several percentiles.

Here is a Mathematica 4 notebook which shows how some of the computations in the article were done, and allows one to vary the parameters.
