# Variance reduction of rollouts

From: Michael J. Zehr
Address: tada@mit.edu
Date: 24 August 1998
Subject: Re: Rollouts and variance reduction
Forum: rec.games.backgammon
Google: 6rrum2$l2d@senator-bedfellow.MIT.EDU

```
Midas wrote:
> Hi there.  Could one of you statistically minded readers
> let me know if there are many different ways to apply variance
> reduction to backgammon rollouts.

I can trivially answer "yes" to this but in a way that might not be
useful to you.  Jellyfish uses one method (the name of the algorithm is
in the manual) and Snowie uses a method that might or might not be the
same.  (I'll speculate that it isn't the same since Snowie generally
reports a higher "equivalent to X games" value for the same size
rollout, which suggests that it reduces variance faster.)

In addition there's at least one variance reduction method that isn't
used by any computer programs that I am aware of.  This is the (hereby
named, although reported on this newsgroup previously) Zehr
Cube-reduction algorithm:  when rolling out with a live cube and a
takeable cube is given, keep the cube at the same value and play two
games from the position where the cube was turned.  (Do this on all
subsequent cube turnings too.)  The purpose of this is to allow a live
cube on rollouts and yet keep a single game with a high cube value from
affecting the results too much.

> I.e. is there more than one formula and if so, is
> it possible they may result in differing results
> significant enough to matter when relating to
> backgammon checker play analysis.

I'm sure one can use different algorithms to get different results from
the same neural net and dice, however one would also get different
results from different neural nets when using the same algorithm.

The basic idea of any variance reduction method is to estimate the
amount of luck each side has and factor that out.  For example, if one
side has two on the bar against a 5-point board and rolls the necessary
doubles to come in that's a very lucky roll.  A variance reduction
algorithm would measure that luck and change the results of the game in
some way to take account of it.

In the example above, if the neural net's evaluation of rolling the
double to enter is +.243 equity, and the evaluation of the position
before rolling was -.324, that roll was worth +.567.  If the player who
rolled the lucky double goes on to win then instead of getting 1 point
for the win they might get 1 point less some fraction of the equity from
their lucky rolls.  So you might deduct half of .567 (as well as half of
other lucky rolls, but also adding in half of the unlucky rolls, and
doing the same thing for the opponents).

Some variations on this might include:

o vary the fraction of the adjustment (half in the example above)

o only make the adjustment if the roll was especially lucky or unlucky
and make no adjustments for rolls whose value is within some range of
the position's value before rolling

o change the adjustment based on whether the roll was close to the start
of the game or close to the end of the game

Any such algorithm is going to have results vary depending on the
evaluation used to decide what is a lucky roll.  In the example above a
different program rolling out the same position with the same dice might
think that rolling the double was worth only .425 and hence make a
smaller adjustment.

An example of another kind of variance reduction is one that Snowie uses
-- you can enable bearoff database truncation.  When using this a
rollout stops at the point when Snowie can look it up in its bearoff
database.  This database is extremely accurate and can reduce all of the
variance from luck during the bearoff.  For example, Snowie might
determine that one side has a 63% chance of winning.  At that point it
can stop that game and use a result of .26 equity.

> Also with regards to checker play comparison,
> should we lean towards truncated rollouts with a set
> horizon with no variance reduction or are the results
> likely to better when applying
> variance reduction a full rollout?

First one should specify the context a bit more.  A full rollout will
almost always give a better evaluation if time were not a factor.  What
one is usually interested in is the fastest way to get a reliable
answer.

For some positions a truncated rollout will get a reliable answer
fastest and for some it won't.  If the position is one that the
evaluator understands well, or the position is one that will soon
simplify (for example a holding game) then a truncated rollout can get a
very reliable answer very quickly.  One can think of these as
evaluations with a higher ply.  (A 1296 game rollout with the initial
rolls varied, truncated after 2 rolls and played n-ply is equal to an
n+2-ply evaluation.)

If the evaluator doesn't understand the position well or nothing is
going to change much for a while (for example a prime vs. prime
position) then it is probably best to do the full rollout.

The best way to decide which one to do is to start with a full rollout,
compare the results to the evaluator, and if they are close then trust a
truncated rollout.  Unfortunately that doesn't save any time.  But with
experience one can judge which kind of rollout to do.

One final comment about computer rollouts -- in those positions in
which a truncated rollout is not accurate we might not be able to trust
the full rollout results anyway.  If the evaluator doesn't understand
the position very well, it might not even understand it well enough to
do a rollout.  It's easy to fall into the trap of doing a bazillion game
rollout, getting a standard deviation of .001 on the results, and
believing you've found the right answer.  In reality you've found the
best answer that particular tool can give you, which isn't necessarily
the right answer.

-Michael J. Zehr
```
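The luck-adjustment scheme Zehr describes (deduct a fraction of each roll's equity swing from the final result) can be sketched in a few lines. The function names and interface below are my own invention; the one-half deduction fraction follows the example in the text.

```python
def roll_luck(eq_before, eq_after):
    # Luck of one roll: the evaluator's equity after the roll minus its
    # equity for the position before the dice were thrown.
    return eq_after - eq_before

def adjusted_result(result, luck_swings, fraction=0.5):
    # Deduct a fraction of the player's accumulated luck from the raw
    # game result (lucky swings are positive, unlucky ones negative).
    return result - fraction * sum(luck_swings)

# The entering-double example: -.324 before the roll, +.243 after.
swing = roll_luck(-0.324, 0.243)       # +.567, a very lucky roll
score = adjusted_result(1.0, [swing])  # 1 point less half the luck
```

As Sheppard points out later on this page, summing these swings over a whole game telescopes to E0 - V, so the adjusted score is really a weighted average of the raw result and the initial evaluation.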

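The bearoff-truncation example above (63% wins stopping the game at +.26) uses the standard conversion from winning probability to equity when gammons and the cube are ignored, as in a bearoff:

```python
def cubeless_equity(win_prob):
    # Win +1 with probability p, lose -1 with probability 1 - p:
    # E = p - (1 - p) = 2p - 1.  (Gammonless, cubeless.)
    return 2 * win_prob - 1

eq = cubeless_equity(0.63)  # 0.26, matching the 63%-wins example
```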
Brian Sheppard writes:

```
One significant technique [of variance reduction] is called "stratified
sampling."  The goal is to choose your samples according to a
theoretical model of how samples ought to be distributed.  For example,
when you do a JF rollout with a multiple of 36 trials, JF allocates
trials equally over the 36 possible dice rolls.  This ensures that the
variations exactly match the results of the first ply of rolls.  And if
the number of trials is a multiple of 1296, then JF will do an
exhaustive enumeration out to 2 ply.

Another technique is to concentrate effort on those rolls that have the
highest variance.  For example, suppose that the side to move is on the
bar, and 25 numbers fan.  Then all 25 of those numbers transpose to the
same variation.  Moreover, suppose that we observe the results of
fanning to be consistent: the opponent always wins.  Accordingly, this
category has 0 variance.  Since other rolls have higher variance, we
should concentrate our efforts on other rolls.

To compensate for the fact that we distribute trials non-uniformly, we
scale the results for each roll.  To continue our example, suppose we do
50 trials of the 25 fanning rolls (losing a single point each time), and
1000 trials for each of the other 11 rolls, losing 0.5 points each time.
Our equity is computed as follows:

    average value in 50 trials is -1, times 25/36 = -25/36
    average value in 11000 trials is -0.5, times 11/36 = -5.5/36
    Total value: -30.5/36

Michael J Zehr wrote:
> In addition there's at least one variance reduction method that isn't
> used by any computer programs that I am aware of.  This is the (hereby
> named, although reported on this newsgroup previously) Zehr
> Cube-reduction algorithm:  when rolling out with a live cube and a
> takeable cube is given, keep the cube at the same value and play two
> games from the position where the cube was turned.  (Do this on all
> subsequent cube turnings too.)  The purpose of this is to allow a live
> cube on rollouts and yet keep a single game with a high cube value
> from affecting the results too much.

This is clever.  Doubling the number of trials in variations in which a
cube is accepted reduces the variance of our estimate in those
variations by 50%.  The impact on run time is reasonable.  Even though
we double the double/take trial count, we do not double the run-time,
since the variation is only split *after* the double.  Moves prior to
the double are common to the two trials.  The bottom line is that the
method halves the variance of double/take trials with considerably less
than a doubling of the run-time.

A few implementation details for implementation-minded readers.  First,
there are pathological situations where the rollout would explode.
(These situations are the same as the situations where the cube-using
equity value is undefined.  Check out the Deja News archive for related
threads.)  The number of "splits" of the variation tree must be limited
to prevent explosions.  Second, you have to weigh the double/take
variations on an equal footing with the double/drop variations.  This
means that the pair of double/take games must be counted as a single
game, not two games.

Michael J Zehr wrote:
> The basic idea of any variance reduction method is to estimate the
> amount of luck each side has and factor that out.  For example, if one
> side has two on the bar against a 5-point board and rolls the
> necessary doubles to come in that's a very lucky roll.  A variance
> reduction algorithm would measure that luck and change the results of
> the game in some way to take account of it.
>
> In the example above, if the neural net's evaluation of rolling the
> double to enter is +.243 equity, and the evaluation of the position
> before rolling was -.324, that roll was worth +.567.  If the player
> who rolled the lucky double goes on to win then instead of getting 1
> point for the win they might get 1 point less some fraction of the
> equity from their lucky rolls.  So you might deduct half of .567 (as
> well as half of other lucky rolls, but also adding in half of the
> unlucky rolls, and doing the same thing for the opponents).

Whoa!  This is not "variance reduction."  This is "bias introduction."
(Math-phobic readers may wish to avert their eyes at this juncture.)

The algorithm you just described computes the move-by-move change in the
neural net evaluation over the course of a single trial.  The final
outcome of the trial is offset against the sum of the changes.  This
reduces variance, to be sure.  But *how* does it reduce variance?
Let's take a look.

Let E0 be the evaluation of the root position of the rollout.  After 1
turn the evaluation changes to E1, then to E2, and so on.  In the final
position the outcome is V.  What is the sum of the changes from turn to
turn?  In the first turn the change is E0 - E1.  In the second turn the
change is E1 - E2.  In the third turn the change is E2 - E3, and so on.
The sum is

    (E0 - E1) + (E1 - E2) + (E2 - E3) + (E3 - E4) + ... + (En - V)

This type of sum is called a "telescoping series," because it resembles
a collapsible telescope.  To see why, let's rearrange the parentheses a
little:

    E0 + (-E1 + E1) + (-E2 + E2) + (-E3 + E3) + ... + (-En + En) - V

So the sum is E0 - V!  This amount will be offset against V.  If the
fraction we offset against V is f, then the final answer is

    f * E0 + (1 - f) * V

Well no wonder we reduce the variance!  All the algorithm accomplishes
is to compute a weighted average of the true statistical answer with our
initial estimate.  This is obviously the wrong answer; our goal is to
eliminate our initial estimate from the equation.  The dependency on E0
is a bias.  True variance reduction does not introduce bias into the
result.

I believe I have reconstructed JF's algorithm based upon the cryptic
description in JF's user manual.  Before I describe it, I want to
provide some motivation for it, so that you can understand how the
algorithm reduces variance.

Let E be the true cubeless equity function, and e be our static
evaluation function (e.g. a neural net).  Rollouts estimate E by playing
out a number of trial games and averaging the results.  A simple rollout
does not exploit the fact that e is a very good approximation to E.
Variance reduction reduces the number of trials we need to achieve a
specific level of accuracy by exploiting that fact.

How does it work?  Instead of regarding the rollout as establishing a
value for E, let's think of it as establishing a value for E - e.  In
words, the rollout will estimate the error in our static evaluation.

Let's begin by doing a 1-ply search, then estimate E as the average of
the 1-ply values of e.  This would be completely correct if e were
perfect, but since e isn't perfect, we need to estimate the error
involved.  How can we do that?  Simple.  Let's just pick a 1-ply
position at random, and estimate the error involved in its evaluation.
Note that I have just recreated the problem that I posed at the root
(i.e., estimating the difference between e and E) but one ply deeper
into the game tree.  I can continue this process until I reach a
terminal node.

To estimate E, we start at the end of the variation we have selected at
random and apply the formula

    E = Delta + Average(e)

where the average is taken over all next-ply nodes and Delta is the
recursively computed estimate of the inaccuracy of e for the
randomly-selected successor.

How does this procedure reduce variance?  Let's take a look.  The
variance of our estimate of E equals the variance of our estimate of
Delta plus the variance of our estimate of Average(e).  But Average(e)
is a constant, so its variance is zero.  So the variance of our estimate
of E is simply the variance of our estimate of Delta.  Now, Delta is the
estimate of E - e for a single successor position.  And since e is very,
very accurate, the variance of Delta is very, very low.

The standard deviation of E - e is 0.035 or so.  Compare this low error
rate with the typical standard deviation of 0.5 for a single game
rollout.  Reducing the standard deviation by a factor of 0.5 / 0.035 =
14.28 would normally require 14.28 * 14.28 = 203.9 times as many
trials.  We need vastly fewer trials to get a specific level of
accuracy.

The factor of 203.9 is not pure profit, alas, because we have to do
1-ply searches at every node when we use variance reduction.  By
contrast, a normal rollout does only static evaluation.  This typically
costs a factor of 20 or so, which is a shame.  But we still should be
very happy with a net increase of a factor of 10 in overall throughput.

As I stated before, I believe that this is the algorithm used in JF.
However, this belief is based upon deciphering the user's manual, which
refers only to certain theorems from the theory of Markov chains.  There
may be other formulations of this algorithm, or there may be other ways
to apply the theory to reduce variance.

Warm Regards,
Brian Sheppard
```
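Sheppard's non-uniform allocation arithmetic is easy to check with a short script. The interface below is hypothetical; the point is that each stratum is weighted by its true probability (25/36 and 11/36), not by how many trials it happened to receive.

```python
def stratified_equity(strata):
    # strata: list of (num_rolls, avg_result) pairs, where num_rolls is
    # how many of the 36 dice rolls fall in the stratum and avg_result
    # is the average outcome observed for it, whatever its trial count.
    total_rolls = sum(n for n, _ in strata)
    return sum(n * avg for n, avg in strata) / total_rolls

# 25 fanning rolls always lose 1 point (50 trials sufficed);
# the other 11 rolls average -0.5 (1000 trials each).
eq = stratified_equity([(25, -1.0), (11, -0.5)])  # -30.5/36
```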

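The recursion Sheppard reconstructs, E = Delta + Average(e), can be demonstrated on a toy game. Everything below is invented for illustration; the "game" is a fair random walk rather than backgammon. With a perfect evaluator, Delta is always zero and every trial returns the exact equity, which mirrors his argument that the variance of the estimate is just the variance of Delta.

```python
import random

def vr_estimate(pos, evaluate, successors, terminal_value):
    # One variance-reduced trial: Average(e) over all 1-ply successors,
    # plus Delta, the recursively estimated error of e at one randomly
    # chosen successor.
    value = terminal_value(pos)
    if value is not None:
        return value
    succ = successors(pos)
    avg_e = sum(evaluate(s) for s in succ) / len(succ)  # Average(e)
    pick = random.choice(succ)                          # random line of play
    delta = (vr_estimate(pick, evaluate, successors, terminal_value)
             - evaluate(pick))                          # Delta, est. of E - e
    return avg_e + delta

# Toy game: fair random walk on 0..4; win (+1) at 4, lose (-1) at 0.
# The true equity from point i is E(i) = i/2 - 1.
def terminal_value(i):
    return {0: -1.0, 4: 1.0}.get(i)

def successors(i):
    return [i - 1, i + 1]

def perfect_eval(i):
    return i / 2 - 1

random.seed(1)
est = vr_estimate(2, perfect_eval, successors, terminal_value)  # exactly 0.0
```

A plain rollout of this toy game from position 2 would return +1 or -1 on each trial (variance 1); the variance-reduced trial has zero variance here because e is exact.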
### Programming

Adjusting to a weaker opponent  (Brian Sheppard, July 1997)
Anticomputer positions  (Bill Taylor+, June 1998)
BKG 9.8 vs. Villa  (Raccoon+, Aug 2006)
BKG 9.8 vs. Villa  (Andreas Schneider, June 1992)
BKG beats world champion  (Marty Storer, Sept 1991)
Backgames  (David Montgomery+, June 1998)
Blockading feature  (Sam Pottle+, Feb 1999)
Board encoding for neural network  (Brian Sheppard, Feb 1997)
Bot weaknesses  (Douglas Zare, Mar 2003)
Building and training a neural-net player  (Brian Sheppard, Aug 1998)
How to count plies?  (Chuck Bower+, Jan 2004)
How to count plies?  (tanglebear+, Mar 2003)
Ideas for improving computer play  (David Montgomery, Feb 1994)
Ideas on computer players  (Brian Sheppard, Feb 1997)
Introduction  (Gareth McCaughan, Oct 1994)
Measuring Difficulty  (John Robson+, Feb 2005)
Methods of encoding positions  (Gary Wong, Jan 2001)
N-ply algorithm  (eXtreme Gammon, Jan 2011)
Neural net questions  (Brian Sheppard, Mar 1999)
Pruning the list of moves  (David Montgomery+, Feb 1994)
Search in Trees with Chance Nodes  (Thomas Hauk, Feb 2004)
Source code  (Gary Wong, Dec 1999)
TD-Gammon vs. Robertie  (David Escoffery, June 1992)
Training for different gammon values  (Gerry Tesauro, Feb 1996)
Training neural nets  (Walter Trice, Nov 2000)
Variance reduction in races  (David Montgomery+, Dec 1998)
Variance reduction of rollouts  (Michael J. Zehr+, Aug 1998)
Variance reduction of rollouts  (Jim Williams, June 1997)
What is a "neural net"?  (Gary Wong, Oct 1998)
Writing a backgammon program  (Gary Wong, Jan 1999)
