Ratings

Ratings and rankings

From:   Chuck Bower
Address:   bower@bigbang.astro.indiana.edu
Date:   23 December 1997
Subject:   rankings and ratings...
Forum:   rec.games.backgammon
Google:   67pd8h$ph8$1@dismay.ucs.indiana.edu

(NOTE:  don't confuse this Subject with "rantings and ravings..."  That
was a post I made about a week ago.)

      From time to time there are questions pertaining to how to decide
who is better than whom.  There was a rash of them recently relating
to the BOTS and their performances on FIBS.  I'm writing this as an
overview/summary of the topic.  Some ideas (including speculations) are
my own and others have already been expressed (in this newsgroup).  I'll
try to differentiate, but pardon me if I plagiarize.  And I especially
ask forgiveness for NOT giving credit to the authors of any ideas that
I rehash.

      One other disclaimer:  I will be discussing some "methods" and
their "keepers".  DO NOT FOR ONE SECOND conclude that these people
contend that their methods are anything more than just another piece of
data which can be input into the unanswerable question "who is better
than whom?".  They provide a valuable service (for no remuneration) and
if someone were to attempt to lecture, scold, or otherwise chastise any
of them, then such a person has completely misunderstood both their
efforts and this article!

      The BEST way (that I can think of) to determine which of TWO players
is better is to have them play a LONG session against each other.  Up
until recently (last 10 years or so) this was about the ONLY way to
answer the question "who is better...".  Normally this is done for money
(to "reward" the better player as well as to attempt to ensure that each
is playing at his/her best)!  Unfortunately the number of games/matches
required to determine the answer with statistical confidence is so large
that it just takes too much time to reach a reliable answer.  As the skill
difference between the players gets small, the number of trials required
becomes HUGE.  As an example, after last summer's JF challenge, Fredrik
pointed out (with statistics) that a difference of 58 points in 300 games
isn't nearly enough to draw a conclusion because of the large fluctuations
(from the dice).  (Try DejaNews for more specifics.)
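
     (To get a rough feel for why even 300 games settles very little,
here is my own back-of-the-envelope sketch -- NOT Fredrik's actual
calculation, and the per-game standard deviation is an assumption on
my part:)

```python
import math

# Back-of-the-envelope significance check for a money session.
# ASSUMPTION: the net point swing of a single money game has a standard
# deviation of roughly 2.5 points (cube turns and gammons make it well
# over 1 point).
SIGMA_PER_GAME = 2.5        # assumed, not measured
games = 300
margin = 58                 # net points one player finished ahead

sigma_total = SIGMA_PER_GAME * math.sqrt(games)   # spread of the total margin
z = margin / sigma_total                          # margin in standard deviations

print(f"std dev of total margin: {sigma_total:.1f} points")
print(f"a {margin}-point margin is {z:.2f} standard deviations")
```

With those assumed numbers, 58 points comes out around 1.3 standard
deviations -- nowhere near the roughly two standard deviations you
would normally want before declaring a real skill difference.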

      It could be that people who play against each other A LOT (one or
more sessions per week over years, for example) could actually collect
enough data to reach a statistically significant answer.  However, you can
always surmise:  "Maybe one of the players improved more relative to the
other over the time the data were taken.  Who is the better player NOW?"
Currently we have dedicated 24 hour players (like commercial Jellyfish),
and the question can be answered knowing that at least one of the players
isn't improving!  I haven't seen much on the newsgroup (that is,
substantiated with numbers) comparing human vs. JF.  I keep such tallies
for my own play (and have posted results in the past) but haven't been
getting in enough play-vs-JF time lately to collect sufficient statistics
on the current version.  I'm SURE that JFv2.0 level-7 was a better player
than I.  Anyone want to take the other side of that argument?  Gee, thanks.

      What about global measures?  I know of three which are currently
available, but none of them is perfect, either.  They are surveys,
performance points, and ratings-based methods.  All have their pluses
and minuses.

     One example of a survey is Yamin Yamin's "Giant 32 of Backgammon".
This is a biennial survey with on the order of 100 respondents.  The
results are published in the Flint Area Backgammon News.
Actually, this survey was just completed within the last couple of weeks
and I expect to see the results in either the next Flint newsletter or
the issue after that.

     The problem with surveys is that they are inherently subjective.
For example, some Europeans have complained that Yamin's survey is biased
toward North American players (and I agree with them).  I, for
one, am NOT complaining about this survey.  BG (as it is played today)
is regional in nature.  IMHO that is an irrefutable fact.
Most of Yamin's survey respondents are North Americans, and even if they
aren't sociologically biased (let's hope that's the case!) their
experience is in this hemisphere and players from other parts of
the world play within their own travel zones.  Not very many (in fact,
NO) events give a truly geographically unbiased sampling.  Actually,
the online Internet tournaments are probably the least biased from
that standpoint.

      Performance rankings are another measure.  Bill Davis has been
coordinating such a point system--The American Backgammon Tour.  How
good is this at determining the best players in "America"?  Well, it
is probably a decent (though still statistically insufficient) way of
deciding who is "best" among those who play in a LOT of ABT events!
Problem is, for whatever reason, a lot of strong Western Hemisphere
players don't participate frequently on this tour.  It's a fun way of
recognizing players who are doing well, but it's just another piece
in the puzzle.

      The third method is one which has really caught on recently
thanks to Internet backgammon.  That is ratings systems.  Copied from
chess ratings systems, this is an objective method of ranking players
who share a common playground.  Kent Goulding (and colleagues) had
been keeping a ratings system for large tournament results over the
past several years.  Unfortunately, due in part to lost-data problems,
I believe his effort has been inactive since the summer of 1996.
Still, to my mind, KG deserves much of the credit for the current
popularity of the online ratings systems.
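
     For readers who haven't seen how such a system works, here is a
simplified sketch of the Elo-style update that FIBS has been described
as using (see the "FIBS rating formula" articles listed below).  Treat
the constants as approximate; the extra multiplier FIBS applies to
low-experience players is left out:

```python
import math

def fibs_win_prob(rating, opp_rating, match_length):
    """Estimated probability this player wins, per the FIBS-style formula."""
    diff = opp_rating - rating
    return 1.0 / (1.0 + 10.0 ** (diff * math.sqrt(match_length) / 2000.0))

def rating_change(winner_rating, loser_rating, match_length):
    """Points the winner gains (and the loser drops) under the basic formula.
    FIBS also scales this up for players with low experience; that ramp
    is omitted here."""
    p_win = fibs_win_prob(winner_rating, loser_rating, match_length)
    return 4.0 * math.sqrt(match_length) * (1.0 - p_win)

# A 1650 player beating an 1850 player in a 7-point match gains a lot;
# the reverse result barely moves the ratings.
print(round(rating_change(1650, 1850, 7), 2))   # about 6.9
print(round(rating_change(1850, 1650, 7), 2))   # about 3.7
```

The design goal is that if the ratings already reflect the players'
true winning chances, each player's expected rating change is zero:
upsets pay out more, in exactly the right proportion.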

      One obvious weakness of any ratings system is that it really
only applies "locally".  Only FIBS players get FIBS ratings/rankings.
Only GAMESGRID players get GAMESGRID ratings/rankings.  Etc.  At best
you only get a reliable ranking among the participants and conditions
of that rating system.  Maybe the highest ranked player is just "a small
fish in a small pond", so to speak.  And it's really worse than that,
because sometimes the players within a rating system don't intermingle
much.  For example, some FIBS players only play within a small cluster of
"friends", so even though such a player has a FIBS rating, it's not as universal
as it appears.  The two ideas I've mentioned in this paragraph have
been discussed previously (multiple times) in this newsgroup.  In
addition, they are covered in greater detail, with some nice examples,
(or counterexamples...) in the Jacobs-Trice book "Can a Fish Taste
Twice as Good?".  This is recommended reading for anyone wanting to
delve more deeply into the subject.

      Online ratings systems can be tricked as well.  (This is no secret.)
By carrying the "don't intermingle" idea to an extreme, a person can
play against him/herself (using two or more different ID's) and
artificially inflate his/her rating.  This is almost always easily
detected.  A very high rating with very low experience is certainly
suspicious (though it's apparently theoretically possible to do this
honestly).  There are other low-integrity tactics which have been pointed
out in this newsgroup as well, like preferential dropping, and "fishing"
(searching out weak players whose ratings are higher than deserved, for
one reason or another).  I believe these problems are inherent.  There
will always be "clever" cheaters who find a way to work around attempts
to prevent such tactics.
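
     As a toy illustration of why the two-ID trick works at all (made-up
numbers, using the same simplified update sketched above -- and note
that experience is exactly what gives the cheater away):

```python
import math

def fibs_win_prob(rating, opp_rating, match_length):
    diff = opp_rating - rating
    return 1.0 / (1.0 + 10.0 ** (diff * math.sqrt(match_length) / 2000.0))

# The "main" ID always beats the "sacrifice" ID in 1-point matches.
main, sacrifice = 1500.0, 1500.0
for _ in range(200):
    p = fibs_win_prob(main, sacrifice, 1)
    change = 4.0 * (1.0 - p)        # 4 * sqrt(1) * P(main would lose)
    main += change
    sacrifice -= change

print(round(main, 1), round(sacrifice, 1))
# The main ID climbs steadily while the sacrifice ID sinks, yet each
# shows only 200 experience -- the telltale "very high rating with very
# low experience" pattern mentioned above.
```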

     One other thing worth mentioning (and covered previously in the
newsgroup) is the observation that the common ratings systems may have
the weakness of overrating players who only compete in 1-point matches.
It seems like a difficult thing to prove, but there does appear to be
circumstantial evidence.  Maybe these special players should have their
own (segregated) ratings system.

     Now I am going to attempt to break new ground and start speculating.
(Oh, you thought that's what I'd already been doing!)  In particular I'm
going to focus on the robots' ratings on FIBS--another hot topic recently.
Do the robots' high ratings really give them the title of "best players on
FIBS"?  Maybe, but not necessarily.  I am going to list (in no particular
order) some reasons why their high ratings could be brought under
suspicion.  In case you are new to the newsgroup, note that I have no
hidden malice towards them.  I have a very high respect for these players.

1) Selection effects.  (Basically I'm repeating the above problems with
intermingling.)  Do the bots take on all comers?  Should they?  Do the
best humans take on all comers?  Should they?  Are weak humans more
likely to challenge a highly ranked bot than a highly ranked human?  My
guess is "yes".  Computers are incapable of sneering when they turn you
down.  Even if you argue that human experts don't do this (and my
experience is that they are among the best mannered experts of any kind
in the world!), that doesn't keep the inexperienced player from suspecting
such a thing could happen.

2) Exhaustion.  Computers don't get tired.  Humans do.

3) Emotion.  Computers don't feel emotion.  They don't notice bad dice.
I suspect even the best human experts, as hard as they try, still feel
the pain of unlucky dice.  I'm sure it doesn't have the same magnitude
of adverse effects as it does for the typical player, but it has to have
a small impact in any case.  How about elation?  Is that an advantage
for a human player to have?  What about embarrassment?  Do strong human
players make errors based on ego?  (I can't lose to THIS person!)  Again,
they know it's a detriment to good play, and thus work on eliminating
such concentration killers, but it still must affect things sometimes.

4) Distractions.  Does a computer's spouse interrupt and call it to dinner?
Does it have to break its concentration when the modem rings?  Is it
watching a ballgame at the same time it's playing?  Does it play better
or worse after having a couple alcoholic drinks?

5) "Giant killing".  (I am of the belief that this is potentially a HUGE
advantage for the bots.)  Let me start with a (true) story.  I had heard
through r.g.bg and also from conversations with other players that there
was a "new kid on the block"--SnowWhite.  (This was a while back.)  I
was on FIBS and decided to watch this maiden take on one of the seven
dwarves.  I watched for all of about two dice rolls.  Why?  I was
annoyed (make that disgusted).  SnowWhite's opponent was in some kind
of SUPER BACKGAME.  Three or four points in SnowWhite's board, only one
or two checkers in his/her home board--you get the picture.  Gee.  This
looked like the typical backgammon game that I play...

     So, why do I believe that such tactics are a big advantage to the
bots?  Simple.  We're not (necessarily) talking about a highly rated
FIBS player trying to outsmart a bot by playing a backgame.  We're talking
about Joe-typical-player.  Even if it's true that an expert backgammon
player can "make money" using backgame tactics, it is usually done by
getting the bot to indiscriminately elevate the cube in a few games.
The bot wins most of the games (many of which are gammons) with the cube
at a low level.  The human expert wins a few games WITH THE CUBE AT SOME
ASTRONOMICAL VALUES.  This isn't likely to work in match play due to the
finite match length.
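
     Some invented numbers make the money-vs-match distinction concrete
(purely for illustration, not measured against any bot):

```python
# Hypothetical backgame specialist: loses 9 games out of 10 with the
# cube on 2, averaging 3 points per loss (some losses are gammons), but
# wins the tenth game with the cube wound up to 32.
money_per_10_games = 9 * (-3) + 1 * (+32)
print("net money result per 10 games:", money_per_10_games)   # +5 points
```

A (slightly) winning proposition for money.  In a match the same plan
collapses: a cube of 32 is worth no more than the points still needed
to win the match, while the frequent gammon losses at 2 and 4 count
in full every time.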

     Secondly, backgames are quite tricky.  Your Joe-typical-player
is going to be giving away equity by playing sub-optimally, so even if
some experts can outplay a bot by seeking backgames, my guess is that
most FIBS players are going to screw up badly enough to end up becoming
cannon fodder for the bots.  Now.  Suppose this same Joe-typical-player
is in a match with a human expert.  Do you think he is going to steer
into a backgame?  I can tell you that there is at least one Chuck-typical-
player who won't!

     I realize that there are likely to be some biases that work
against the bots.  For example, I wouldn't be surprised if the bots
have a higher percentage of their matches dropped.  "Hey, bots have
no feelings, so why should I feel guilty pulling the plug when I'm
losing a match to one of them?"  And even if the biases favor the
bots, that certainly doesn't mean they aren't better anyway.  My main
point is to read the ratings systems with a skeptical eye, whether
comparing bot vs. bot, bot vs. human, or human vs. human.

Chuck
bower@bigbang.astro.indiana.edu
c_ray on FIBS
 
Ratings

Constructing a ratings system  (Matti Rinta-Nikkola, Dec 1998) 
Converting to points-per-game  (David Montgomery, Aug 1998)  [Recommended reading]
Cube error rates  (Joe Russell+, July 2009)  [Long message]
Different length matches  (Jim Williams+, Oct 1998) 
Different length matches  (Tom Keith, May 1998)  [Recommended reading]
ELO system  (seeker, Nov 1995) 
Effect of droppers on ratings  (Gary Wong+, Feb 1998) 
Empirical analysis  (Gary Wong, Oct 1998) 
Error rates  (David Levy, July 2009) 
Experience required for accurate rating  (Jon Brown+, Nov 2002) 
FIBS rating distribution  (Gary Wong, Nov 2000) 
FIBS rating formula  (Patti Beadles, Dec 2003) 
FIBS vs. GamesGrid ratings  (Raccoon+, Mar 2006)  [GammOnLine forum]
Fastest way to improve your rating  (Backgammon Man+, May 2004) 
Field size and ratings spread  (Daniel Murphy+, June 2000)  [Long message]
Improving the rating system  (Matti Rinta-Nikkola, Nov 2000)  [Long message]
KG rating list  (Daniel Murphy, Feb 2006)  [GammOnLine forum]
KG rating list  (Tapio Palmroth, Oct 2002) 
MSN Zone ratings flaw  (Hank Youngerman, May 2004) 
No limit to ratings  (David desJardins+, Dec 1998) 
On different sites  (Bob Newell+, Apr 2004) 
Opponent's strength  (William Hill+, Apr 1998) 
Possible adjustments  (Christopher Yep+, Oct 1998) 
Rating versus error rate  (Douglas Zare, July 2006)  [GammOnLine forum]
Ratings and rankings  (Chuck Bower, Dec 1997)  [Long message]
Ratings and rankings  (Jim Wallace, Nov 1997) 
Ratings on Gamesgrid  (Gregg Cattanach, Dec 2001) 
Ratings variation  (Kevin Bastian+, Feb 1999) 
Ratings variation  (FLMaster39+, Aug 1997) 
Ratings variation  (Ed Rybak+, Sept 1994) 
Strange behavior with large rating difference  (Ron Karr, May 1996) 
Table of ratings changes  (Patti Beadles, Aug 1994) 
Table of win rates  (William C. Bitting, Aug 1995) 
Unbounded rating theorem  (David desJardins+, Dec 1998) 
What are rating points?  (Lou Poppler, Apr 1995) 
Why high ratings for one-point matches?  (David Montgomery, Sept 1995) 
