Hi, two years ago there was some discussion here about the FIBS rating
system: how does it work with different match lengths, etc. This article
is very long and might also be a bit complicated, sorry for that.
Anyway, I think that it explains quite nicely the anomalous rating data
collected from FIBS. The article answers the question: how much
advantage does the better player get from the cube? It also explains how
the rating system should be modified so that it works better for
different match lengths.
Best regards,
Matti Rinta-Nikkola
1. The ELO system
-----------------
The basic assumption in the ELO rating system is that the ratings of
the players follow a Gaussian distribution. This assumption leads to the
match winning probability formula:

    P(D) = 1 / (10**(-D*sqrt(S)/W) + 1),                        (1)

where D      is the ELO difference,
      S=S(N) is the opportunity for skill in an N point match,
      W      is the class width.
In the ELO system the winner of the match gains

    (1-P)*M*sqrt(S)                                             (2)

points and the loser loses the same amount.
In a rating system the class width W and the mean of the rating
distribution <ELO> can be set arbitrarily. The backgammon servers that I
know (Netgammon, FIBS and Gamesgrid) have chosen class width W=2000,
<ELO>=1500 and constant M=5 (eq 2). Note the relationship between
constant M and class width W: if you lower (raise) the value of W you
should also lower (raise) the value of M. The Skill function S(N) has
been solved assuming that the game winner always gets 1 point (ref 1).
That assumption leads to the function

    S(N) = N.                                                   (3)
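
As a rough illustration, equations (1) and (2) can be written out in
Python. This is only a sketch: the function names are mine, the defaults
W=2000 and M=5 are the server settings quoted above, and the 100 point
rating difference is an arbitrary example value.

```python
import math

def win_probability(d, s, w=2000.0):
    """Equation (1): chance that the player rated d points higher wins.

    d is the rating difference, s the opportunity for skill S(N),
    w the class width W.
    """
    return 1.0 / (10.0 ** (-d * math.sqrt(s) / w) + 1.0)

def rating_change(d, s, w=2000.0, m=5.0):
    """Equation (2): points gained by the winner and lost by the loser.

    d is the winner's rating minus the loser's rating.
    """
    return (1.0 - win_probability(d, s, w)) * m * math.sqrt(s)

# A 5 point match with S(N) = N (equation 3) and a 100 point difference.
p_favourite = win_probability(100, 5)
gain_if_favourite_wins = rating_change(100, 5)
gain_if_underdog_wins = rating_change(-100, 5)
print(p_favourite, gain_if_favourite_wins, gain_if_underdog_wins)
```

The underdog gains more for a win than the favourite does, and the two
possible gains always add up to M*sqrt(S).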
If we take gammons and the doubling cube into account, it can be shown
that the Skill function has the form

    S(N) = 1 + C*(N-1)/2 ;  N=1,3,5,7,... ,                     (4)

where the constant C should be solved using the match statistics of the
server (ref 2). We know that the value of C lies in the range
.8 < C < 2 (ref 3,4).
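
A minimal sketch of equation (4) follows. The function name is my own
and C=1.2 is only an example value from the quoted range. The point of
the loop is that C=2 reproduces S(N)=N of equation (3), while a smaller
C gives less opportunity for skill in long matches.

```python
def skill(n, c):
    """Equation (4): opportunity for skill in an n point match (n odd)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("equation (4) is stated for odd n only")
    return 1.0 + c * (n - 1) / 2.0

# Compare C=2 (equivalent to S(N)=N) with an example value C=1.2.
for n in (1, 3, 5, 7, 11):
    print(n, skill(n, 2.0), skill(n, 1.2))
```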
2. Determining Skill function constant
--------------------------------------
2.1 Continuous zero volatility game
-----------------------------------
The constant C in eq(4) can be divided into two parts, C=cp+ch, where cp
represents checker playing skill and ch cube handling skill. It has been
shown that cp=.84 (ref 3,4), while the share of cube handling skill is
still more or less an open question. Opportunity for skill can equally
be understood as possibility for error. One way to estimate ch is to
determine the maximum "reasonable" cube handling error that a player
could make and evaluate its effect on the game result. In order to
estimate ch we make the following assumptions:
   1) maximum cube handling error = cube is never used
   2) backgammon game = continuous and zero volatility game
   3) equity change is directly proportional to the skill
Two players with equal checker play skill but maximally different cube
handling skill play a money game. We assume that the perfect doubling
point is p=.75, where p is the cubeless game winning probability, and
that doubles are always dropped (from the zero volatility assumption it
follows that perfect doubles are equally correct to take or drop). The
money game equity can be written as a function of the cubeless game
winning probability

    Eq(p) = 2.67p - 1,                                          (5)

see figure 1. At the beginning of the game Eq(.5)=.33, i.e. the maximum
advantage that a player can obtain by using the cube. In order to win a
game, an equity change DEq=.67 has to be gained by checker play. Using
assumption 3) we can write

    C = cp/.67 = 1.3 ;  assuming N >> 3.
Note that figure 1 and equation (5) contain a small error: the average
win of the player who does not use the cube is bigger than one. The
player who uses the cube does not get an additional cubing advantage
from gammons, because this advantage is compensated by the gammon wins
of the player who plays without the cube. Using different assumptions
for the gammonless doubling point, .75 < p < .8, the Skill function
constant varies in the range 1.1 < C < 1.3. Doubling point .8
corresponds to a live cube and .75 to a dead cube. The average doubling
point (omitting gammons) is p=.78 (ref 5,6), which gives C=1.2.

[Figure 1: plot of equity Eq(p) against p from 0 to 1. The line rises
from -1 at p=0 through .33 at p=.5 to 1 at the doubling point p=.75.]

Figure 1. Money game equity as a function of cubeless game winning
probability.
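
The estimate above can be repeated for other doubling points with a few
lines of Python. This is a sketch under the same three assumptions; the
function name is mine, and the equity line is taken to run from -1 at
p=0 to +1 at the doubling point, as in equation (5).

```python
def skill_constant(doubling_point, cp=0.84):
    """Estimate C = cp / DEq for a given cubeless doubling point."""
    # Equity line through (0, -1) and (doubling_point, +1).
    slope = 2.0 / doubling_point
    start_equity = slope * 0.5 - 1.0   # Eq(.5), head start of the cube user
    d_eq = 1.0 - start_equity          # equity left to be won by checker play
    return cp / d_eq

for p in (0.75, 0.78, 0.80):
    print(p, round(skill_constant(p), 2))
```

For p=.75, .78 and .80 this gives roughly 1.26, 1.17 and 1.12, which the
text rounds to 1.3, 1.2 and 1.1.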
2.2 Average number of games
---------------------------
Note that in the Skill equation (4) opportunity for skill is expressed
with "1 point match skill" as the unit. The complication in determining
the constant C arises partly from the fact that an N point match (N>1)
contains elements which are missing from a one point match, i.e. the
cube and the gammon factor. The problem can be simplified if we consider
only the odd point matches longer than one point. We assume that every
game involves the same amount of skill, i.e. that a game of a three
point match offers the same opportunity for skill as a game of an 11
point match (for example). This is a reasonable approximation: from
Jellyfish money game statistics it can be seen that the cube is turned
only 1.2 times/game (ref 5). I think that this cubing number is also
valid for pre-Crawford games in match play. Although the match score
affects cube decisions, I think that a single cube turn in a three point
match offers on average the same opportunity for error, or equivalently
for skill, as a single cube turn in an 11 point match (for example). If
the average number of games/match is known we can simply write down the
Skill function; notice the analogy with the rolls method (ref 3).
Luckily that data can be retrieved from the big_brother match archive
(ref 7), see Table 1. The Skill function for odd point matches (N>1) can
be written as

    S(N) = 1 + C'*(N-3)/2 ;  N=3,5,7,9,...                      (6)

Here skill is expressed with "3 point match skill" as the unit.

Table 1. Big_brother match archive: average number of games/Match and
         Skill functions.

  Match    # of matches   average # of     average # of games
  length   in archive     games            (2.35 as unit)
    1          350        1.00 (1.00)           -
    3          634        2.35 (2.35)      1.00 (1.00)
    5         1184        3.83 (3.70)      1.63 (1.60)
    7          492        5.02 (5.05)      2.14 (2.20)
    9           42        7.24 (6.40)      3.08 (2.80)
   11           31        7.81 (7.75)      3.32 (3.40)
                               eq. 4            eq. 6
                               C=1.35           C'=.60
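
The values in parentheses in Table 1 can be recomputed from equations
(4) and (6). The sketch below copies the observed averages from the
table; the function and variable names are my own.

```python
# Observed average number of games per match, copied from Table 1.
observed = {1: 1.00, 3: 2.35, 5: 3.83, 7: 5.02, 9: 7.24, 11: 7.81}

def games_eq4(n, c=1.35):
    """Equation (4) with the fitted constant C=1.35."""
    return 1.0 + c * (n - 1) / 2.0

def games_eq6(n, c_prime=0.60):
    """Equation (6) with the fitted constant C'=.60 (n >= 3)."""
    return 1.0 + c_prime * (n - 3) / 2.0

for n, games in observed.items():
    line = f"{n:2d}  {games:5.2f} ({games_eq4(n):.2f})"
    if n >= 3:
        # Same average, expressed in units of the 3 point match (2.35).
        line += f"  {games / observed[3]:5.2f} ({games_eq6(n):.2f})"
    print(line)
```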
Note that without additional assumptions the table data cannot tell us
how much more opportunity for skill there is in a 3 point match than in
a 1 point match. In the middle column of Table 1 we have simply assumed
that a game of a 1 point match is equivalent to a game of an N point
match, N>1.
2.3 Data of match results
-------------------------
Determining the constant C from players' ELO ratings is a somewhat
tricky business. It is tricky because we are using ELO data obtained
with a faulty rating system to correct that same rating system. If the
rating system has a wrong Skill function constant, then we do not have a
rating system which consistently predicts the winning chances of players
at all match lengths. In fact every match length will have its own
rating system with a different class width value W'(N). If a player
plays mainly 1 point matches his rating will follow the 1 point rating
system, if he plays mainly 3 point matches his rating will follow the 3
point rating system, and so on. Note that a player's rating depends not
only on the match length he usually plays but also on the ratings of his
opponents and the match lengths they mainly play. Of course the ratings
still describe the players' backgammon skills despite the erroneous
Skill function.
In order to understand where the error in the FIBS rating system comes
from and how big it is, let's take a closer look at the heart of the
rating system, i.e. the match winning probability function eq(1). It can
be shown that the exponent in the formula satisfies

    D*sqrt(S)/W = constant ;  N=constant.                       (7)

If we change the value of S or W, the rating differences of the players
will change so that the above expression stays constant. In other words,
the rating system does not change players' match winning probabilities!
Assume that we have a bg-server equipped with two independent rating
systems which have the same class width value W but different Skill
functions S and S'. Both Skill functions have the form given in equation
(4) but they have different constants C. Let's assume that the rating
system with Skill function S is perfect, i.e. the measured class width
is constant and equal to W, the value used in the rating system. Using
equation (7) we define the function

    Err(N) = (W*D')/(W'*D) = sqrt(S/S') ;  N=1,3,5,7,...        (8)

which will be used later to examine the quality of an erroneous rating
system and to determine the constant C. The class width W' in the above
equation can be measured using the match data of the server and eq(1)
(ref 8). Note that W' is a function of match length. Here D and D'
should be understood as class widths of the players' ELO point
distributions rather than as the ELO difference of two individual
players.
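
To see equations (7) and (8) at work, here is a small numerical check.
It is only a sketch: C=1.2 stands in for the "perfect" Skill function,
the 150 point difference is arbitrary, W'=W is assumed, and all names
are mine.

```python
import math

def win_probability(d, s, w=2000.0):
    """Equation (1)."""
    return 1.0 / (10.0 ** (-d * math.sqrt(s) / w) + 1.0)

def skill(n, c):
    """Equation (4)."""
    return 1.0 + c * (n - 1) / 2.0

n = 5
d_true = 150.0                     # difference under the "perfect" system
s_true = skill(n, 1.2)             # assumed correct Skill function
s_used = skill(n, 2.0)             # S'(N)=N, as used by the server
err = math.sqrt(s_true / s_used)   # equation (8) with W' = W
d_used = d_true * err              # difference the faulty system settles on

p_true = win_probability(d_true, s_true)
p_used = win_probability(d_used, s_used)
print(err, p_true, p_used)
```

Both systems predict the same winning probability; only the rating
difference is rescaled by Err(N).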
Assume first that there is no mixing of ratings, i.e. every match length
has its own ELO rating system and the ratings of different match lengths
are not mixed. In that case W/W'=1 and the Skill function constant could
be retrieved from

    Err(N) = D'(N)/D ;  N=1,3,5,7,...                           (9)

where D and D'(N) are measured class widths of match rating
distributions: D is the class width of the 1 point and D'(N) that of the
N point match rating system. Unfortunately there is no practical use for
equation (9) because the class widths D'(N) cannot be measured easily.
In the realistic case the ratings of different match lengths are well
mixed and, despite the faulty rating system, the players' ELO
distribution has a well defined class width Dm' (ref 9). Now equation
(8) can be written

    Err(N) = (W/W'(N)) * (Dm'/D(N)) ;  N=1,3,5,7,...            (10)

Note that here Dm', which results from the rating system, is constant
while the correct class width D=D(N) is a function of match length. This
equation has its difficulties too. We know W, Dm' is easily obtained
from the server and W'(N) can be measured. The problem is that we do not
know how to get D(N).
o Example 1: FIBS ratings formula.
In the case of the FIBS rating system S'(N)=N, i.e. C=2 in eq(4).
Equation (8) can be written as

    Err(N) = sqrt((1 + C*(N-1)/2)/N) ;  N=1,3,5,7,...           (11)

The function is tabulated for various N and C values in Table 2 below.
We see from Table 2 that the FIBS rating system works reasonably well
for matches longer than one point, because the function Err(N>1) is
nearly constant.
Assume that the FIBS rating system is used to rate players who mainly
play N>1 matches, and that one point matches are played only
occasionally, so that the players' ELO rating distribution is not
affected and follows the N>1 match rating system perfectly. Note that
due to the error in the Skill function the FIBS rating system follows
the N>1 match rating distribution more aggressively than the one point
match rating distribution, see equation (2). Even if players play a
notable number of one point matches, the rating distribution will
probably follow the N>1 rating system. In that case the expected class
width for the one point match rating system would be about

    W'(1)/W = <Err(N>1)>,                                       (12)

where <Err(N>1)> is the average of the Err function values for N>1.
Because the rating distribution follows the N>1 rating system,
D/Dm'=Err and equation (10) can be written

    W/W'(N) = Err(N)**2 ,  N>1.                                 (13)

After simple algebra the Skill function constant is in our hands:

    C = ((W/W')*N - 1) / ((N - 1)/2) ,  N>1.                    (14)

The only unknown in equation (14) is W'(N), which can be measured by
following the method explained in reference 8.

Table 2. Function Err(N) with expected class width values W'(1)
         (eq 12) and W'(N>1) (eq 13).

  C \ N    1     3     5    11    21     W'(1)  W'(3)  W'(5)
   .8     1.0   .77   .72   .67   .65    1440   3373   3858
  1.0     1.0   .82   .77   .73   .72    1540   2974   3373
  1.2     1.0   .85   .82   .80   .79    1640   2768   2974
  1.4     1.0   .89   .87   .85   .85    1740   2524   2642
  1.6     1.0   .93   .92   .90   .90    1840   2312   2363
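
Equations (11), (13) and (14) are easy to evaluate directly. The
following sketch uses my own function names and W=2000. The widths it
prints differ slightly from Table 2, seemingly because the table was
computed from Err values already rounded to two decimals.

```python
import math

W = 2000.0  # class width used by the server

def err(n, c):
    """Equation (11): Err(N) when the server uses S'(N) = N."""
    return math.sqrt((1.0 + c * (n - 1) / 2.0) / n)

def expected_width(n, c, w=W):
    """Equation (13): expected measured class width W'(N) for N > 1."""
    return w / err(n, c) ** 2

def skill_constant(n, w_measured, w=W):
    """Equation (14): C recovered from a measured W'(N), N > 1."""
    return ((w / w_measured) * n - 1.0) / (0.5 * (n - 1))

for c in (0.8, 1.0, 1.2, 1.4, 1.6):
    row = "  ".join(f"{err(n, c):.2f}" for n in (1, 3, 5, 11, 21))
    print(f"{c:.1f}  {row}  "
          f"{expected_width(3, c):.0f}  {expected_width(5, c):.0f}")
```

Equation (14) is simply the inverse of equations (11) and (13):
skill_constant(3, expected_width(3, 1.2)) returns 1.2 again, up to
rounding error.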
o Experiment 1 by Gary Wong (ref 8)
The computer player Abbot on FIBS was set to record its one point
matches, and the recorded data was used to test the FIBS rating system.
The best fit between the collected data and the match winning
probability formula was obtained with class width value W'(1)=1634.
Assuming that Abbot's opponents mainly play N>1 matches, the Skill
function constant would be C=1.2, see Table 2. Note that the ELO rating
of Abbot is around 1500. It is unlikely that Abbot is incorrectly rated,
so the error in the ELO difference D (eq 1) comes entirely from the
error in the ELO ratings of Abbot's opponents.
o Experiment 2 by Jim Williams (ref 10)
This experiment, like many others, was done on FIBS. Results of 1-5
point matches were collected and the data was used to test empirically
the validity of the match winning probability function eq(1). Here the
Skill function was chosen so that eq(1) gave the best fit to the
observed data, i.e. S(N)=Neff, where Neff is the "effective match
length", see ref 10. For our analysis the class width W' is a more
suitable fitting variable than the Skill function. The class width W'
can be calculated from W'(N)=2000*sqrt(N/Neff). The results of the
experiment are summarized in Table 3. If ratings are well mixed it is
enough to measure W'(N) for one match length in order to determine the
constant C. Here W'(N) has been measured for three match lengths and
every measurement leads to the same value of C within the accuracy of
the measurement. This is an empirical proof that the Skill function has
the form given in equation (4) and that one rating system can be
designed for all match lengths.

Table 3. Observations. The row "ave" gives the weighted average of each
         column, weighted by the number of observed match results.

   N    # of matches   Neff    W'      C      Dm'/D
        (x10**3)                              (C=1.1)
   1       20.0        1.6    1581    1.1      .79
   3       12.0        1.6    2739    1.19    1.15
   5        8.6        2.1    3086    1.13    1.23
  ave                         2242    1.13     .99
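
The derived columns of Table 3 can be recomputed from N, Neff and the
match counts. This sketch uses my own names and copies its inputs from
the table. Since Neff is quoted to one decimal only, a recomputed value
can differ from the table in the last digit.

```python
import math

W = 2000.0  # class width used by the server

# N: (observed matches in thousands, Neff), copied from Table 3.
DATA = {1: (20.0, 1.6), 3: (12.0, 1.6), 5: (8.6, 2.1)}

def measured_width(n, n_eff, w=W):
    """W'(N) = W*sqrt(N/Neff)."""
    return w * math.sqrt(n / n_eff)

def skill_constant(n, w_measured, w=W):
    """Equation (14), valid for N > 1 only."""
    return ((w / w_measured) * n - 1.0) / (0.5 * (n - 1))

def dm_over_d(n, w_measured, c=1.1, w=W):
    """Equation (10) solved for Dm'/D, with Err(N) from equation (11)."""
    err = math.sqrt((1.0 + c * (n - 1) / 2.0) / n)
    return err * w_measured / w

def weighted_average(values):
    """Average over match lengths, weighted by the number of matches."""
    total = sum(count for count, _ in DATA.values())
    return sum(DATA[n][0] * v for n, v in values.items()) / total

widths = {n: measured_width(n, n_eff) for n, (_, n_eff) in DATA.items()}
ratios = {n: dm_over_d(n, w) for n, w in widths.items()}

for n in DATA:
    c = skill_constant(n, widths[n]) if n > 1 else float("nan")
    print(f"{n}  {widths[n]:6.0f}  {c:5.2f}  {ratios[n]:5.2f}")
print("ave", round(weighted_average(widths)),
      round(weighted_average(ratios), 2))
```

Equation (14) does not apply to N=1, so no C is computed for that row;
the table value there is read off Table 2 instead.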
o Note 1
The weighted average of Dm'/D over all match lengths is about one. If
we use C<1.1 the average is smaller than one and if C>1.1 the average
is bigger than one.
o Note 2.
The ratio Dm'/D can be used to estimate players' true ratings. Assume
that we have two players with ELO=1900 playing on the system described
in experiment 2. One of these players plays only one point matches. His
true rating can be estimated as (1900 - 1500)*.79 + 1500 = 1816. The
other player plays only 5 point matches and his true rating would be
(1900 - 1500)*1.23 + 1500 = 1992 (C=1.1). So if you want to be the top
rated player it is not sufficient to be the best player; you also need
to know how the rating system works.
o Note 3.
I have estimated my FIBS ELO rating differences against JellyFish to be
as follows: N=1 -> 210, N=3 -> 176 and N=5 -> 165 (ref 11). These
ratings too can be corrected using the measured ratio Dm'/D! The rating
for one point matches is correct by definition, as there is no ratings
mixing here. Using C=1.1 the true rating difference for 3 point matches
would be 176*1.15 = 202 (209) and for 5 point matches 165*1.23 = 203
(208). The values in parentheses are obtained by using the match equity
table JF-mrn (ref 11) and equation (1) with Skill function eq(4) and
constant C=1.1.
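
The corrections in Notes 2 and 3 are plain rescalings by the ratio
Dm'/D. A small sketch, with hypothetical function names and the ratios
taken from Table 3:

```python
def true_rating(rating, ratio, mean=1500.0):
    """Rescale the distance from the mean rating by the ratio Dm'/D."""
    return (rating - mean) * ratio + mean

def true_difference(difference, ratio):
    """Rescale a rating difference by the ratio Dm'/D."""
    return difference * ratio

# Note 2: two players rated 1900.
print(round(true_rating(1900, 0.79)))    # plays only 1 point matches
print(round(true_rating(1900, 1.23)))    # plays only 5 point matches

# Note 3: rating differences against JellyFish.
print(round(true_difference(176, 1.15)))  # 3 point matches
print(round(true_difference(165, 1.23)))  # 5 point matches
```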
3. Summary
----------
Three completely different approaches have been used to determine the
Skill function constant C (eq 4). All approaches lead to a constant C in
the interval 1.1-1.4, see Table 4, while the FIBS formula uses C=2.
Experimental data has been used in two different ways to fix the
constant C: method 1) average number of games/match and method 2) data
of match results and ELO ratings. The first method gives C=1.35 and the
latter 1.1<C<1.2. The difference between these two values can be
explained by the fact that method 1) does not take into account the
obvious differences between one point and N>1 point matches (cube and
gammon factor), while method 2) covers them. The drawback of method 2)
is that it relies on data obtained with a faulty rating system. This
makes the analysis more complicated; moreover, it may even be that the
constant C cannot be solved accurately from the available data. I think
that after the first correction there will still be a need for some
fine tuning to reach the "correct" C value. Anyway, I think that
whatever value is picked from the range 1.1-1.4, the resulting rating
system would be superior to the one currently in use, compare Tables 2
and 4.

Table 4. Expected accuracy of the corrected rating system, i.e. the
function Err(N). C=1.2 has been chosen as the correct Skill function
constant.
On the right are shown the C values obtained by different
methods: 1) Continuous zero volatility game (ch 2.1)
         2) Average number of games         (ch 2.2)
         3) Data of match results           (ch 2.3)
         4) Game statistic JF-mrn           (ref 11)
C \ N 1 3 5 11 21 C \ Method
1 2 3 4
1.0 1.00 1.05 1.06 1.08 1.09 1.0
1.1 1.00 1.02 1.03 1.04 1.04 1.1 x x x
1.2 1.00 1.00 1.00 1.00 1.00 1.2 x x
1.3 1.00 .98 .97 .97 .96 1.3 x
1.4 1.00 .96 .96 .94 .93 1.4 x
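
The Err(N) columns of Table 4 follow from equation (8) with the
"correct" Skill function fixed at C=1.2. This is a sketch with my own
names; the recomputed values may differ from the table in the last
digit.

```python
import math

def skill(n, c):
    """Equation (4)."""
    return 1.0 + c * (n - 1) / 2.0

def err(n, c_used, c_true=1.2):
    """Equation (8): sqrt(S/S'), S from c_true and S' from c_used."""
    return math.sqrt(skill(n, c_true) / skill(n, c_used))

for c in (1.0, 1.1, 1.2, 1.3, 1.4):
    row = "  ".join(f"{err(n, c):.2f}" for n in (1, 3, 5, 11, 21))
    print(f"{c:.1f}  {row}")
```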
If the FIBS rating system is corrected by choosing a new Skill function
with C=1.2, it might be a good idea to also change the class width value
W of the rating system so that the change in the players' ELO
distribution is minimized. The class width W could be chosen, for
example, so that the current "11 point rating system" remains intact,
i.e. W=2500. Experiment 2 suggests a somewhat lower value, W=<W'>=2250.
The constant M I would leave as it is. Note that the class width of the
ELO distribution should remain equal to the one in the old rating system
if we change W correctly. The new rating system could be implemented
side by side with the old one, so that we would have a direct comparison
between the rating systems.
References
----------
1) ELO ranking
http://www.netgammon.com/us/facts/elo2.htm
2) "Derivation of backgammon Skill function" by M.Rinta-Nikkola
http://www.deja.com/[ST_rn=ap]/getdoc.xp?AN=419254506&fmt=text
http://www.deja.com/[ST_rn=ap]/getdoc.xp?AN=419293370&fmt=text
3) FIBS--Rating Formula Different length matches by Tom Keith
http://www.bkgm.com/rgb/rgb.cgi?view+523
4) "Constructing a ratings system" by M.Rinta-Nikkola
http://www.bkgm.com/rgb/rgb.cgi?view+621
5) Cubeful distribution by Roland Sutter
http://www.deja.com/getdoc.xp?AN=491955947&fmt=text
6) "Doubling in money game: drop, take or beaver" by M.Rinta-Nikkola
http://www.deja.com/[ST_rn=qs]/getdoc.xp?AN=464753854&fmt=text
7) Big_Brother match archive
http://www.bkgm.com/rgb/rgb.cgi?menu+matcharchives
8) FIBS--Rating Formula: Empirical analysis by Gary Wong
http://www.bkgm.com/rgb/rgb.cgi?view+601
9) Rating distributions of bg-servers by Daniel Murphy
http://www.deja.com/getdoc.xp?AN=480105756&fmt=text
10) FIBS--Rating Formula: Different length matches by Jim Williams
http://www.bkgm.com/rgb/rgb.cgi?view+603
11) JF-mrn game statistics by M.Rinta-Nikkola
http://www.deja.com/getdoc.xp?AN=503022535&fmt=text