This article originally appeared in the January 2001 issue of GammOnLine.
Thank you to Kit Woolsey for his kind permission to reproduce it here.
I. Introduction
One positive result of the recent US presidential election was the thought provocation it initiated. Many questions (and hopefully answers) emerged: "What is the electoral college?" "Should the president be decided by popular vote?" "Do the mechanics of the election process need reform?" Even backgammon may have profited. By studying the ramifications of voting idiosyncrasies, one could potentially improve understanding of the statistical principles which permeate backgammon results.
The official, certified result in Florida was a difference of 537 votes out of 5.825 million. The binomial distribution says that the standard deviation is sqrt(Bush × Gore/total) = 1200 votes. But what does this mean? Statistically, the standard deviation is related to confidence. How confident are we that a repeat of the Florida election would give the same result? The actual result (537 vote win for Bush) divided by the standard deviation (1200) converts to a relative standard deviation of 0.45. We say that the Florida election was significant to 0.45 standard deviations (sometimes quoted as 0.45 sigma since the Greek lower-case letter sigma is the conventional algebraic symbol for standard deviation). But we're still not being straightforward. What confidence does 0.45 standard deviations correspond to?
There is a mathematical relationship between standard deviation and statistical confidence. You may have heard it referred to as The Standard Normal Distribution or The Gaussian Distribution or, more popularly, the bell-shaped curve. (This is the origin of the term "grading on the curve" used in classroom test results.) Virtually every elementary statistics text has a table of values for this distribution. I look up the conversion and find that 0.45 standard deviations corresponds to 0.67, meaning that there is a 67% chance (about 2::1 odds) that a repeat of the election in Florida would result in a Bush victory and a 33% chance that Gore would win.
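The Florida arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are invented for illustration, the vote total is the rounded 5.825 million quoted above (treated here as split between just the two candidates), and the table lookup is replaced by the standard library's normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def split_sd(a, b):
    """Binomial standard deviation, in votes, for a two-way split a vs. b."""
    return sqrt(a * b / (a + b))

def confidence(sigmas):
    """Standard normal table lookup: chance that the leader really is ahead."""
    return NormalDist().cdf(sigmas)

TOTAL = 5_825_000   # rounded total from the text (assumption: two candidates only)
MARGIN = 537        # certified margin for Bush

bush = (TOTAL + MARGIN) / 2
gore = (TOTAL - MARGIN) / 2

sd = split_sd(bush, gore)
significance = MARGIN / sd
florida_confidence = confidence(significance)

print(f"standard deviation: {sd:.0f} votes")
print(f"significance:       {significance:.3f} sigma")
print(f"confidence:         {florida_confidence:.0%}")
```

Because the script does not round the standard deviation down to 1200 votes, its significance comes out slightly below the 0.45 quoted above; the confidence still rounds to 67%.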
II. Experimental Elections
Does this make sense? If you were to repeat the election exactly, wouldn't the results be identical, with Bush winning by a net of 537 votes? Yes, but you typically can't repeat a situation exactly. Let's do an experiment and use the Florida electorate as our subjects! OK, we can't really do such an experiment, but we can perform a thought experiment: we pretend that we are executing an experiment and make predictions of the results.
The election occurred on Tuesday, 7 November. On Wednesday, as Florida is awakening, we first erase their minds of any knowledge of the previous day (if they voted, whom they voted for, who won the other states, that the election was close). We tell the Floridians "yesterday's election results were invalid. But you get another chance. Today, 8 November, is election day in Florida." At the end of the day, (most of) the votes are tallied. Will anything change? Can we predict the results? With what confidence?
A lot of things have changed, and those changes affect the results. The weather is different. Some don't go out to vote in bad weather. Others who voted on a rainy Tuesday might prefer to play golf Wednesday. People feel differently. Some sick people got well. Some who were healthy on Tuesday are now bedridden. Some voters have died. (In Florida, this may even be a significant number.) People have work schedules: those who had the day off on Tuesday may have to work Wednesday, or vice versa. All of these little factors can not only influence which (and how many) people go to the polls, but even whom they vote for. An important point, though, is that these "random" factors don't explicitly favor one candidate or another. One thing is for sure: Bush is extremely unlikely to win by exactly 537 votes. In fact, as we saw above, there is only a 2/3 chance that he actually will win. In Las Vegas you could probably place a bet, something like lay 3 to win 1 on Bush or put up 5 to win 6 for Gore. (Bookmakers never give true odds; they build in a commission for themselves.)
Now let's perform a second thought experiment. Again, on Wednesday the Florida voters will recast, but this time we won't erase their memories of the previous day. They've seen the election results from the other states, and they know that yesterday's Florida result was very close. Some will have read about the concession/retraction phone calls and be peeved at one candidate or the other. Knowing that Florida is the swing state, some who didn't bother voting the previous day ("my vote won't count") now have a fresh enthusiasm. Still others who voted nonchalantly will now see a new importance of their decisions and run out and buy Time or Newsweek and actually cast votes based upon issues! Almost certainly there will be a huge turnout with millions more votes cast. What can we predict about this result? Could you still bet Bush 1::3 or get Gore 6::5 in LV?
In our first experiment, the factors which changed the outcome were random. A change in the results should follow statistical theory. Besides these random results, though, our second experiment is also affected by systematic factors. Influences of TV, magazines, etc. are not likely to be random. Overall these new conditions are going to significantly favor one candidate or the other. I don't know which one, but hopefully some political scientists would know the answer, and the Las Vegas bookmakers are going to be aware of this and either refuse to give odds at all, or bias them even more in their favor (taking larger commissions). In our first experiment, anyone with a little statistical knowledge could make a good prognostication. In the latter case, it takes a much deeper understanding of demographics, and considerably more homework. Probably very few would have sufficient knowledge to quickly make an accurate prediction of this result.
III. Backgammon Rollouts
What does any of this have to do with backgammon? There are quite a few analogies which can be made. A single game of backgammon is like one Florida vote. Saying that a player's decision on a particular play clearly led to a win or loss (decision was correct or incorrect) is about as reliable as asking a single Florida voter who s/he chose and then projecting the entire state's outcome based solely upon that exit poll. (In retrospect, the TV networks would have done as well with this naive technique.)
Much can be learned from repeated trials. In the old days, serious players would roll out a position by hand many times and record the results. The idea was that, over time, the luck (randomness) of the dice would cancel out and a true result would emerge. Either the best checker play or proper cube decision would be found by such a Monte Carlo technique. For example, in a straight race only two cubeless outcomes can occur: blue wins or blue loses. It is well known that cube ownership allows a player to take in a long race with as few as 23% cubeless game winning chances. Suppose a 100 trial hand rollout results in a 70-30 split. How confident should we be that this position is a take? As with the certified Florida result, the standard deviation is sqrt(70 × 30/100) = 4.6 games. The position is a pass only if the doubler's true share is above 77 games in 100 (100 − 23), so the significance is (77 − 70)/4.6 = 1.5 standard deviations. My table says this corresponds to a little over 0.93, so we conclude that there is a 93% chance that this position is a take, and a 7% chance that it is a pass.
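The take/pass calculation for a hand rollout can be sketched the same way. The function name and its arguments are made up for this illustration; the 23% take point is the figure quoted above, and the normal distribution from Python's standard library stands in for the printed table.

```python
from math import sqrt
from statistics import NormalDist

def take_confidence(doubler_wins, taker_wins, take_point=0.23):
    """Return (sigmas, confidence) that a race position is a take.

    take_point is the minimum cubeless winning chance needed to take.
    """
    n = doubler_wins + taker_wins
    sd_games = sqrt(doubler_wins * taker_wins / n)   # binomial s.d. in games
    pass_threshold = (1 - take_point) * n            # doubler wins needed for a pass
    sigmas = (pass_threshold - doubler_wins) / sd_games
    return sigmas, NormalDist().cdf(sigmas)

sigmas, conf = take_confidence(70, 30)
print(f"70-30 split: {sigmas:.2f} sigma, {conf:.1%} confidence that it is a take")
```

Calling `take_confidence` with other splits shows how quickly the confidence drops as the result approaches the 77-23 borderline.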
The problem with hand rollouts is that they take a long time, and if a decision is close (which is often the case, since obvious decisions wouldn't be rolled out at all), 100 trials isn't enough to confidently determine the proper action. If the above result had been 75-25, such a result would only be about 67% confidence for a take, the same as Bush's 537 vote Florida win, not very convincing. (The situation is worse when multiplication factors for gammons, backgammons, and cubes enter.) You could conduct more rollouts, but how much time does one person have? Note that the significance depends on the square root of the number of trials, so to improve the significance by a factor of 2, you need to increase the number of trials by a factor of 2 squared, or 4. To increase significance by a factor of 10, you need to run 10 × 10 = 100 times as many trials. No human I know has time for that.
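The square-root rule translates directly into a trial-count estimate. The helper below is a hypothetical illustration, and it assumes the observed split stays the same as more games are rolled, which a real rollout will not guarantee.

```python
from math import ceil

def trials_needed(trials, sigmas, target_sigmas):
    """Trials required to reach target_sigmas if the observed split holds.

    Significance grows with the square root of the number of trials,
    so the trial count scales with the square of the desired improvement.
    """
    return ceil(trials * (target_sigmas / sigmas) ** 2)

print(trials_needed(100, 1.0, 2.0))    # doubling the significance
print(trials_needed(100, 1.0, 10.0))   # ten times the significance
print(trials_needed(100, 0.46, 2.0))   # the 75-25 rollout pushed to 2 sigma
```

The last line shows why close decisions are so expensive by hand: a 100-game rollout sitting at roughly 0.46 sigma needs on the order of a couple of thousand games before it reaches 2 sigma.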
Sounds like we need a robot. Even Expert Backgammon for the PC (the forerunner of today's neural net players) could grind through a long rollout in short order. So, just hand the position to a robot, go about your business, and come back later to find the correct result. Run enough trials to get statistical significance and you've got your answer. Simple! Well ..., not quite. Recall in our second Florida experiment that being able to do statistics wasn't enough to predict the outcome. As with elections, there is systematic uncertainty associated with rollouts (human or robot). Those systematics come about from mistakes in play. For example, if your robot is rolling out the simple position of white having a single checker on his acepoint versus blue (on roll) with a checker on its 4-point and another on its acepoint, your result won't be so reliable if the robot plays a 51 roll 4/3, 3/off!
Analogously with our second Florida experiment, it is almost always difficult (and maybe impossible) to get a solid handle on the impact of systematic uncertainties. You don't know when they occur, and even if you did, it's hard to quantify their effect. So, what do we do, just not bother with rollouts? There are better options. Experience with the robots (especially Jellyfish and Snowie) has shown us that they are very strong players. The positions they have trouble with seem to be rare, and appear mostly to fall into a narrow class of late-game priming situations. For an arbitrary position, the chance of the game developing into a robot's Achilles' heel is small. Still, there is a limit as to how much confidence you want to squeeze out of a rollout.
IV. Sample Positions
Enough with theory and generalizations; let's look at some concrete examples. Position 1 was constructed as a simple example: a non-contact race checker play problem. Blue has two reasonable plays: 9/6 and 8/5. For illustrative purposes we'll throw in the clearly inferior 9/7, 8/7. These non-contact races are among the very few backgammon positions where exact equities are known. I've used Backgammon Position Analyzer (BPA), which said that 8/5 wins 35.22% cubeless, 9/6 wins 34.48%, and 9/7, 8/7 (abbreviated 7(2) below) only 30.42%.
I performed several experiments with both Jellyfish and Snowie. One important point here is that although these programs have internal bearoff databases, I made sure they were turned off during the rollouts. Ten rollouts of 36 trials each and one long rollout with 552,380 trials were performed using Snowie 3.2 1-ply neural net, untruncated. (Untruncated means all games were rolled to completion.) For the long rollout, the cubeless game winning chances for blue were 35.3%, 34.6%, and 30.4%, in good agreement with the exact values indicated by the BPA software package. This tells us that Snowie 1-ply is playing the checkers pretty well (and thus the systematic uncertainty for this position is close to zero). Now we can be confident that the results of the 36 trial rollouts are dominated by random uncertainties.
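To get a feel for how much scatter pure randomness produces in a 36 trial rollout, one can simulate many of them using the exact BPA winning chances. This is an illustrative sketch, not a model of Snowie: it assumes every game is an independent coin flip and that each candidate play gets its own independent dice, whereas real rollout software may share dice sequences between candidates. All names in the code are invented for the example.

```python
import random

# Exact cubeless winning chances from BPA, as quoted in the text.
EXACT = {"8/5": 0.3522, "9/6": 0.3448, "9/7, 8/7": 0.3042}

def rollout(p, trials, rng):
    """Number of wins in `trials` independent games, each won with probability p."""
    return sum(rng.random() < p for _ in range(trials))

def first_place_share(trials, repeats, seed=0):
    """Fraction of simulated rollouts in which each play is the sole winner."""
    rng = random.Random(seed)
    firsts = {play: 0 for play in EXACT}
    for _ in range(repeats):
        wins = {play: rollout(p, trials, rng) for play, p in EXACT.items()}
        best = max(wins.values())
        leaders = [play for play, w in wins.items() if w == best]
        if len(leaders) == 1:          # ties for first are counted for nobody
            firsts[leaders[0]] += 1
    return {play: count / repeats for play, count in firsts.items()}

shares = first_place_share(trials=36, repeats=5000)
for play, share in shares.items():
    print(f"{play}: sole first place in {share:.0%} of simulated rollouts")
```

Under these assumptions the two good plays finish first about equally often, which is the simulation's way of saying that 36 trials cannot separate plays less than one percentage point apart.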
For the ten experiments (rollouts) of 36 trials each, 8/5 came in alone in first only twice, and three times tied with 9/6 for the top spot. 9/6 was the winner in five of the rollouts. 9/7, 8/7 came in last for all ten rollouts. For the composite (360 trials), 9/6 actually came in first with 36.4% winning chances, 8/5 (the true best play) had 35.6% winning chances, and 7(2) was appropriately last with 30.9% winning chances. So 360 trials gave us the wrong answer. Why? Because a 360 trial rollout is not long enough to separate two plays which differ by only 0.7% game winning chances. But how would we know that if we hadn't had BPA's exact probabilities to compare to?
The answer is, we wouldn't have known, but we would have had a good indication that the results weren't trustworthy. The standard deviations would have warned us, if we had only bothered to look at them. For results expressed in percentages, the formula for standard deviation which we introduced in our election analogy is replaced by: std. dev. = sqrt[p × (1 − p)/n] where p is the game winning chances for a play and n is the total number of trials in the rollout. This works for each individual rollout as well as for the combined results. (Unfortunately it doesn't work for general equities, that is, for results which are a combination of simple wins, gammons, and backgammons. But more on that later.) For p = 0.356 (the composite rollout result for the 8/5 play) and n = 360, the standard deviation comes out to be 0.025. This is also the standard deviation of the composite rollout result for the 9/6 play. For the inferior 9/7, 8/7, the s.d. is 0.024, just a bit smaller. Now we can use this to determine the confidence of the rollouts.
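The percentage form of the formula is a one-liner. The sketch below (function name invented for illustration) applies it to the three composite results of the 360 trial rollout.

```python
from math import sqrt

def win_rate_sd(p, n):
    """Standard deviation of a winning fraction p measured over n trials."""
    return sqrt(p * (1 - p) / n)

# Composite 360-trial results for Position 1, as quoted in the text.
composite = {"8/5": 0.356, "9/6": 0.364, "9/7, 8/7": 0.309}

for play, p in composite.items():
    print(f"{play}: p = {p:.3f}, s.d. = {win_rate_sd(p, 360):.4f}")
```

The two leading plays come out at essentially the same standard deviation, with 9/7, 8/7 slightly smaller, because p × (1 − p) changes very slowly near p = 0.35.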
When comparing the results of two plays, each of which has an associated random uncertainty (standard deviation), a joint standard deviation needs to be calculated. If s1 and s2 are the respective uncertainties, then the joint standard deviation is given by sj = sqrt(s1 × s1 + s2 × s2). If s1 and s2 are close (which is almost always the case in backgammon rollouts where the same number of trials were used for each play), this simplifies to s × sqrt(2) = 1.4 × s. Since exact numbers aren't usually necessary, this can even be approximated by sj = 1.5 × s. For our position 1 and its composite 360 trial rollout, s = 0.025 so sj = 0.035. Now to find the significance that 9/6 is better than 8/5, take the difference in equities for the two plays from the rollout result and divide by the joint standard deviation: significance = (equity1 − equity2)/sj. Doing that we find a significance that 9/6 is better than 8/5 of (0.364 − 0.356)/0.035 = 0.23 sigma. Now we just have to convert significance to confidence. Again, our trusty table says a significance of 0.23 corresponds to 59% confidence that 9/6 is better than 8/5 for position 1 (and, equivalently, there is a 41% chance that 8/5 is better than 9/6). The odds are only about 3::2 that 9/6 is the better play, based solely on the 360 trial composite rollout. That's not very convincing, and thank goodness because we know (from BPA) that 9/6 isn't better at all. The rollout result warned us not to trust it!
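Putting the pieces together, the comparison of two plays can be sketched as a single function. The names are hypothetical, the two rollouts are assumed independent, and the normal distribution from Python's standard library replaces the table lookup. Because the code keeps the unrounded standard deviations, its significance differs from the hand calculation in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist

def win_rate_sd(p, n):
    """Standard deviation of a winning fraction p measured over n trials."""
    return sqrt(p * (1 - p) / n)

def joint_sd(s1, s2):
    """Standard deviation of the difference of two independent results."""
    return sqrt(s1 * s1 + s2 * s2)

def compare(p1, p2, n):
    """Return (sigmas, confidence) that play 1 is truly better than play 2."""
    sj = joint_sd(win_rate_sd(p1, n), win_rate_sd(p2, n))
    sigmas = (p1 - p2) / sj
    return sigmas, NormalDist().cdf(sigmas)

sigmas, conf = compare(0.364, 0.356, 360)   # 9/6 versus 8/5, composite rollout
print(f"9/6 over 8/5: {sigmas:.2f} sigma, {conf:.0%} confidence")
```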
We can further simplify things by memorizing a couple of key numbers which relate significance to confidence. You can never be 100% confident using repeated trials, but you can get arbitrarily close. All you have to do is perform enough rollouts. But how close to 100% do you want to get? Well, that depends on the importance of the result. If someone's life depends upon it, you probably want to be 99.9999% sure, but if it's just a backgammon position, 95% confidence is usually taken as sufficient. A significance of 1 sigma corresponds to a confidence of 84%. A significance of 2 sigma is 97.7%. 3 sigma corresponds to 99.87%. So a ballpark number to remember is that 2 sigma is better than 95%, and you're pretty sure the result is correct. If you want to be pretty damn sure, go for 3 sigma. Jellyfish reports the standard deviation whereas Snowie displays a 95% confidence interval. In either case I'm hoping you'll now be able to determine confidence and/or significance from a rollout result. Just remember that 2 standard deviations (or 2 sigma) is approximately 95% confidence, and vice-versa. Also, multiply the standard deviation (or confidence interval) by 1.4 to get the joint standard deviation (or joint confidence interval) in order to compare candidate checker plays.
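For readers without a statistics table at hand, the key numbers can be regenerated from Python's standard library. This is a small sketch with invented function names; "one-sided" is the sense used above (confidence that the apparently better play really is better), while "two-sided" is the chance that the true value lies within plus or minus that many standard deviations, which is the sense in which 2 sigma is roughly a 95% interval.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def one_sided(sigmas):
    """Confidence that the apparently better play really is better."""
    return STANDARD_NORMAL.cdf(sigmas)

def two_sided(sigmas):
    """Chance that the true value lies within +/- sigmas of the result."""
    return 2 * STANDARD_NORMAL.cdf(sigmas) - 1

def sigmas_for(confidence):
    """Inverse lookup: one-sided significance needed for a given confidence."""
    return STANDARD_NORMAL.inv_cdf(confidence)

for s in (1, 2, 3):
    print(f"{s} sigma: one-sided {one_sided(s):.2%}, two-sided {two_sided(s):.2%}")
print(f"95% one-sided confidence needs {sigmas_for(0.95):.2f} sigma")
```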
Finally let's put all of this together and look at a typical contact position. On 15 December 2000 Marty Storer (in a GoL post titled "Gammon-go problem") asked what was the best response play to an opening 63 (played 24/18, 13/10) with a reply roll of 32 for a matchscore of 2-away, 1-away, Crawford game. (The 63 opening was played by the match leader.) Marty's position is reproduced here as position 2. Can we answer Marty's question?
In a followup post that same day, Gregg Cattanach revealed rollout results from Snowie which, among other things, indicated that 24/21, 6/4 was the best response at this match score with an equity of 0.069, followed by 24/21, 13/11 (0.064) as second best. Unfortunately the way he came up with these numbers didn't include a confidence level or significance, but he did give us enough information to calculate one: 150,000 "equivalent" trials at Snowie 3-ply. The standard deviation is around 0.0035 in these units. Converting to joint standard deviations (0.005) indicates that 24/21, 6/4 is better at about one standard deviation statistical significance, or about 84% confidence. From a statistical point alone, this should cause us to be a bit wary. Can we also comment on the systematics?
Snowie 3-ply rollouts use variance reduction (see David Montgomery's article in the February 2000 issue of GoL). If implemented properly (and we have no reason to believe otherwise), variance reduction does not introduce bias, and therefore the statistical uncertainty should be correct. However, I see three sources of systematic uncertainty which can contribute: a) truncated rollouts, b) scoring goals of the robot, and c) missing candidate plays. Let's look at each of these in more detail.
I believe Gregg ran rollouts which truncated after 7 rolls. The way such a rollout is performed is that after seven (random) dice rolls, the robot's evaluation function is applied. Truncated rollouts result in substantial reduction in statistical uncertainty since only a small, fixed number of (random) dice rolls go into the result. However, the subsequent application of the evaluation function increases the systematic uncertainty, so overall a truncated rollout is less reliable than a complete (untruncated) rollout, even though the statistical uncertainty doesn't reflect this fact. (Truncated rollouts have a large speed advantage, which is why most people, including the author, resort to them.)
Gregg's rollouts were performed primarily for the purpose of determining the best replies at cubeless money play. Thus both sides were trying to win gammons at the standard 2::1 ratio of gammons to simple wins. Converting these cubeless results to match scores where gammons have considerably different values (as were the conditions for Marty's position) will introduce a bias (or systematic uncertainty) into those results. Snowie does have a "match play according to score" option which allows the rollout to proceed with checker plays reflecting a different gammon weighting. Use of this setting will reduce the systematic uncertainty.
Lastly, since Gregg set up his candidate plays to accommodate cubeless money conditions, he failed to include one or two moves which he realized would be poor at that form of scoring, but which in fact can be quite reasonable at late matchscores. An example is 6/1*, which garners more gammon wins at the expense of total wins.
I had Snowie roll out position 2 with the 2-away, 1-away, Crawford matchscore and "checker play according to score" activated. The rollouts were performed at 3-ply with a tiny search space and 20% "speed" setting. The rollouts were truncated after 10 rolls. (There is a large time penalty for running full rollouts and when I initially set things up I didn't expect to have sufficient CPU time to devote. In retrospect I should have reduced the number of total trials to compensate for the time factor and instead run full rollouts.) Based on Snowie's evaluations, I confined the rollouts to five candidate plays. The final results were a combination of six independent rollouts, and I chose to report the results in the units of "match winning chances (MWC)" instead of "equivalent money game" equity as Gregg did.
The final tally was more than 1 million equivalent trials for each candidate. The highest equity came from the more standard 24/21, 13/11 move which won 26.59% of matches with a standard deviation of 0.04% MWC. 2nd best was 6/1* at 26.45% MWC, trailing by 2.37 joint standard deviations. 24/21, 6/4 (the play Gregg's rollouts preferred) won 26.38% (3.60 j.s.d. worse than 'best') followed by 13/10, 6/4 (26.36% MWC and 3.66 j.s.d. behind) with 13/11, 8/5 taking up the tail (26.30% MWC and 4.83 j.s.d. behind the first choice). Sounds like a pretty convincing victory. Statistically speaking that may be so. But if you look closely at the numbers, you see that if 10,000 matches were played for each candidate play, "best" wins only 29 more matches than "worst"! This is 1 part in 345, and it's hard to believe that the systematic uncertainties of these rollouts are this small. Finally, any deviation of play by either player from Snowie's actual plays could also change the results. The conclusion is that if Snowie (3-ply, tiny, 20%) were playing itself it would get a slight advantage by playing 24/21, 13/11 instead of the other candidates. Otherwise the five plays are effectively identical. At least Bush only had one serious competitor!
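The contrast between statistical and practical significance in these results can be tabulated with a short script. It is a rough sketch only: the 0.04% standard deviation is quoted above for the top play alone and is assumed here to apply to all five candidates, so the joint standard deviations it prints are approximations and will not match the quoted figures exactly. The function name is invented for the example.

```python
from math import sqrt

# Match winning chances in percent, as reported in the text.
MWC = {
    "24/21, 13/11": 26.59,
    "6/1*": 26.45,
    "24/21, 6/4": 26.38,
    "13/10, 6/4": 26.36,
    "13/11, 8/5": 26.30,
}
SD = 0.04  # percent MWC; quoted for the top play, ASSUMED equal for all plays

def gaps(mwc, sd, matches=10_000):
    """For each play: (joint s.d. behind the best, matches lost per `matches`)."""
    best = max(mwc.values())
    joint = sd * sqrt(2)                 # same s.d. assumed for both plays
    rows = {}
    for play, value in mwc.items():
        diff = best - value              # percentage points behind the best play
        rows[play] = (diff / joint, round(diff / 100 * matches))
    return rows

rows = gaps(MWC, SD)
for play, (jsd, lost) in rows.items():
    print(f"{play:13s} {jsd:5.2f} j.s.d. behind, {lost:3d} matches per 10,000")
```

Even the last-place play trails by only a few dozen matches in ten thousand, which is the point of the paragraph above: a difference can be many standard deviations wide and still be too small to matter over the board.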