Rollouts
Today’s column is intended to help illustrate what information we can get from rollouts and what we can’t, and to help guide the process of choosing which rollouts to perform. It starts out simply, but I’ll introduce some complicated ideas in the examples.

What is a Rollout?

The most basic type of rollout of a position is for two people to sit down at a backgammon board and play many games from a particular starting position, recording the results. For example, we might get results of +1, −2, −4, +2, etc. It’s not much different from a prop.

There are two fundamental problems with this basic rollout. First, it takes too long. To get an accurate estimate of the value of a position, we might have to play hundreds or thousands of games, depending on how accurate we want the result to be. Second, if we misplay the positions that arise, the result might not be meaningful. On the other hand, we might learn a lot while playing the position out.

With strong backgammon programs (bots) and tireless and blindingly fast computers, the most common method of performing rollouts is to let a program play against itself. We still encounter the same fundamental problems: the program might take too long, and it might misplay the resulting positions. Bots allow us to try many tactics to reduce the first problem, and their play is consistent and usually good enough to ameliorate the second problem.

Rollout Options

There are many possible options for speeding up a rollout. I’ll list their advantages here, and their disadvantages in a later section.

Truncation: Instead of playing the game to its conclusion, the game may be interrupted, perhaps after a set number of moves, and settled according to the bot’s evaluation. This is faster because you don’t have to play the entire game. Further, there is less luck in the first part of the game than in the entire game, so you don’t need as many trials for the average luck to be small.
Variance reduction: While there are many possibilities that may be called variance reduction, the most effective is to subtract an unbiased estimate of the luck from the result. Although the bot’s evaluations are used to make the estimate, this should not affect the average result. If a bot understands the positions that arise in a rollout, variance reduction may reduce the number of trials needed to reach a given level of accuracy by a factor of 10 or more. If the bot does not understand the position well, variance reduction still reduces the number of trials, but perhaps by a factor of only 3. I recommend using variance reduction all of the time. See my article "Hedging Toward Skill" for additional information.

Stratification: The luck on the first roll can be cancelled out completely by rotating the first roll through the possibilities, that is, by ensuring that there is one 6-6, one 5-5, . . . , and two 2-1 rolls in each set of 36 trials. The number of trials must be a multiple of 36 for this to work perfectly, of course. Similarly, one can eliminate all of the luck in the first 2 rolls by rotating through all of the possibilities in every 1296 trials. Care can be taken to make sure that the intermediate results are still meaningful if you interrupt the rollout.

The players: To save time, the smartest level of the bot is usually not used. Lower levels might not consider all possible rolls over the next couple of plies, might not consider as many candidate moves, or might not look ahead as far.

The rules of the game: It can take time to evaluate the cube decisions and the effects of the match score on checker play. It is faster to perform a rollout without the doubling cube, collecting data on the initial volatility and the backgammons, gammons, and single wins for each side. These can then be combined to produce an estimate of what should happen with the cube in play and at a particular match score.
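To sketch how these pieces fit together, here is a toy Monte Carlo rollout with a stratified first roll. Everything in it is invented for illustration (toy_game is a stand-in for actually playing a backgammon game to completion); the point is only the structure: rotate the first roll through all 36 ordered die pairs, average the results, and report the standard error.

```python
import itertools
import math
import random

def stratified_rollout(play_game, trials, seed=0):
    """Average the results of `trials` games, rotating the first roll so
    that every block of 36 trials contains each ordered die pair exactly
    once: one 6-6, one 5-5, ..., and both orders of each non-double."""
    rng = random.Random(seed)
    first_rolls = list(itertools.product(range(1, 7), repeat=2))  # 36 ordered pairs
    results = []
    for i in range(trials):
        if i % 36 == 0:
            rng.shuffle(first_rolls)   # random order within each block of 36
        results.append(play_game(first_rolls[i % 36], rng))
    mean = sum(results) / trials
    var = sum((r - mean) ** 2 for r in results) / (trials - 1)
    return mean, math.sqrt(var / trials)   # estimate and its standard error

def toy_game(first_roll, rng):
    """Stand-in for playing a game out; returns a made-up cubeless result."""
    edge = 0.3 if first_roll[0] == first_roll[1] else 0.1  # pretend doubles help
    return rng.choice([+1, +1, -1, +2, -2]) + edge

mean, std_err = stratified_rollout(toy_game, 1440)  # 1440 is a multiple of 36
```

The trial count is kept a multiple of 36 so each block of first rolls is complete, matching the caveat above about interrupting a stratified rollout.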
So, to specify which rollout I performed, I might say that I used Snowie 3, 2-ply huge, truncated at the bearoff database, with a live cube (3-ply) and checker play not according to score. That’s quite a mouthful, and people understandably don’t always state all of the parameters. In addition to these, I should specify the number of trials, and perhaps the seed that would allow others to repeat the rollout if they want. Snowie 3 was the bot, and most of the rest of the information describes parameters specific to Snowie. 2-ply means that instead of static evaluations, Snowie looked ahead one step. Huge refers to the number of candidate plays considered at each turn. 3-ply cube means Snowie paid attention to the cube actions, which it evaluated on 3-ply. Snowie played each trial until it reached a near-bearoff position, at which point it truncated the trial and substituted the winning chances from the database, or until the trial was ended by a pass. The moves made in the trial were not necessarily the ones that Snowie would choose over the board, but were made to maximize the cubeless equity. Variance reduction would be used, as would stratification.

Does that make sense? Ok, let’s try this one: I believe a Snowie 3, 3-ply evaluation is a 1-ply cubeful rollout with a 1-ply cube, truncated both at a depth of 2 and at the bearoff database, with checker play not according to score. I think GnuBG 2-ply evaluations (analogous to what Snowie calls 3-ply) are 0-ply cubeful rollouts with a 0-ply cube (analogous to what Snowie calls 1-ply), truncated at a depth of 2 (I’m not sure about the database), with checker play according to score. There is a slight difference between the two: checker play is according to score for Gnu, but not for Snowie. Snowie will report the same distribution of backgammons, gammons, and single wins regardless of the match score, but Gnu’s distribution varies, as the checker plays made during the rollout depend on the match score.
However, the main idea is that you already use rollouts even if you think you only use evaluations.

Errors in Rollouts

Rollouts sometimes don’t give the right answer. Some rollouts tell you little about a position, and more about the bot performing the rollouts. To decide whether to trust a rollout, or to decide which rollout to perform, we need to consider the sources of errors in rollouts. Let’s consider what I call the fundamental equation of backgammon:
Final Outcome = Initial Equity + Net Luck + Net Skill
For truncated rollouts, we observe the final evaluations, which are not always equal to the final outcome. If we define the evaluation bias to be evaluation minus equity, we get Final Evaluation = Initial Equity + Net Luck + Net Skill + Evaluation Bias.

We’re trying to determine the value of the initial equity. We observe the results, either the final outcomes or the final evaluations, and we hope that they equal the initial equity, that is, that the net luck, net skill, and evaluation bias are all 0.

The net luck of the rollout is one source of errors, the statistical errors. Statistical errors can be reduced by performing more trials (and by using variance reduction and stratification to compensate for the luck). The other errors can all be called systematic errors. The only way to eliminate systematic errors is to use a different type of rollout.

Systematic errors come from many sources. They occur when the bot misplays one side of the position more than it misplays the other. They occur when the evaluations are not accurate at the end of a truncated rollout. They occur in cubeful rollouts when the cube actions are incorrect. They occur after a cubeless rollout when one converts the backgammon/gammon/win distribution to a cubeful equity.

Example 1: Statistical Errors

Let’s consider a concrete example. The following position results from:
I first rolled it out 5 times using Snowie 3, 2-ply huge, cubeless, truncated at the bearoff database, with 36 trials per rollout. The first rolls should be stratified. I used seeds 1, 2, 3, 4, and 5.

Rollout 1:
Close double, easy take.

Rollout 2:
Huge double, small take.

Rollout 3:
Not quite a double.

Rollout 4:
Clear double, clear take.

Rollout 5:
Small double, easy take.

For another trial, I used the above settings but seed 6 and 1440 trials.

Rollout 6:
Double, clear take.

Finally, I used seed 7, but a 3-ply rollout (huge, 33%), and 1440 trials.

Rollout 7:
Double, clear take.

For those not familiar with the notation, the bg/g/w line gives the backgammon wins, then the gammon-plus-backgammon wins, then the wins of all types, then the losses of all types, etc. The descriptions of the difficulty of the doubling decisions are my interpretations of the numerical results.

What do these rollout results mean? There are widely differing results among the first 5 rollouts. The differences between them are due to statistical noise from the small number of trials. The net luck is not close to 0.

The Central Limit Theorem was discussed in my last column, "What’s Normal?" It applies to rollouts, and says that the cubeless equity is approximately normally distributed. Snowie estimates the standard deviation, and reports a confidence interval of radius 1.96 standard deviations. Roughly 95% of the time, the rollout limit should be within the confidence interval, where by the rollout limit we mean what we would get from an infinitely long rollout, with net luck 0.

The rollout limit is what you get if Snowie (2-ply huge cubeless) plays Snowie under these terms. If you are not Snowie, or your opponent is not Snowie, another value might be right for you. That Snowie can demonstrate that a position is a take doesn’t mean that you can. On the other hand, if you can play a position better than Snowie, then when a Snowie rollout says to pass, you might be confident of the take. Different rollouts using different settings or different bots can produce widely varying answers.

The 6th and 7th rollouts are long enough that the confidence intervals are small. That means that the net luck is probably quite small, so both results are close to the rollout limits. They are not necessarily equal, though. The reason is that 2-ply plays differently from 3-ply, hence makes different mistakes, so the net skill may differ between the rollouts.
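The arithmetic behind these confidence intervals can be sketched with made-up trial results. The radius is 1.96 standard deviations of the mean, and since the statistical error shrinks like 1/sqrt(n), the same formula lets you guess how many trials a target radius would require. The trial results and per-trial standard deviation below are invented for illustration.

```python
import math

def confidence_interval(results, z=1.96):
    """Mean of the rollout results and the radius of its ~95% confidence
    interval, via the Central Limit Theorem."""
    n = len(results)
    mean = sum(results) / n
    var = sum((r - mean) ** 2 for r in results) / (n - 1)
    return mean, z * math.sqrt(var / n)

def trials_for_radius(per_trial_sd, target_radius, z=1.96):
    """Roughly how many trials shrink the CI radius to target_radius,
    since the standard error of the mean is sd / sqrt(n)."""
    return math.ceil((z * per_trial_sd / target_radius) ** 2)

# Made-up cubeless results from eight trials:
mean, radius = confidence_interval([+1, -1, +2, +1, -2, +1, +1, -1])

# With a per-trial SD near 1.4 (plausible without variance reduction),
# resolving the equity to within 0.01 needs on the order of 75,000 trials:
n = trials_for_radius(1.4, 0.01)
```

This is why the 36-trial rollouts above disagree so widely while the 1440-trial rollouts do not: the radius shrinks only with the square root of the number of trials.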
In the third example, below, there will be rollouts whose confidence intervals do not overlap, suggesting that their rollout limits are different.

Back to backgammon. The reason one might perform the rollout is to decide the correct cube action. First, we should consider whether it is a take. The equities of all of the rollouts indicated that it is correct to take, but let’s look at the range of values in the first 5: 0.723, 0.955, 0.625, 0.887, 0.772. There was wide disagreement, by up to 0.330! All of these were takes, and put together they certainly say it is a take, but some rollouts said it was a very easy take and one said it was a close decision. The rollout limit should be close to the value of double/take in rollout 6, 0.815, since that rollout was much longer and had the same parameters.

What about the doubling decision? Now we need to consider the value of not doubling immediately, to compare with the value of doubling now. There was wide disagreement in the first 5 rollouts about the value of not doubling, although it didn’t vary quite as much as for doubling: 0.700, 0.829, 0.637, 0.794, 0.730. The highest value for not doubling was 0.829, while the lowest value for doubling was 0.625. After performing the first 5 rollouts, you might try to guess the rollout limit, and it would not be unreasonable to guess that the right value for not doubling was the highest seen in the 5, 0.829. It would also not be unreasonable to guess that the right value for doubling was the lowest seen, 0.625. It might seem that these could not combine to produce strong evidence that doubling is right, but they do!

The key is that the estimates for doubling and for not doubling are linked. The important question is the difference between doubling and not doubling, and the differences were 0.023, 0.126, −0.012, 0.093, and 0.042. These don’t vary as much. It would be very surprising if the values for not doubling and doubling were 0.829 and 0.625, respectively. If you haven’t met Mr. and Mrs. Smith, then it might be plausible that Mr. Smith collects stray cats, and it might be plausible that Mrs. Smith detests and is allergic to cats, but it is hard to believe that both are true.

The variation of the difference between doubling and not doubling is much smaller than the variation of either value alone. In fact, the variation of the difference (which is what we want to understand) very closely resembles the variation of the cubeless equity. Thus, the radius of the confidence interval of the cubeless equity can also be taken as a rough estimate of the radius of the confidence interval of the difference between doubling and not doubling. Slightly different corrections are appropriate when deciding whether to redouble, and this shortcut is completely inappropriate when you might be too good to double. Anyway, this position is a solid take and a double.

Example 2: A Checker Play Decision
The rollout seems to favor the hit, 13/9*. Is it by a lot? Is it enough evidence to believe that it is the right play? The problem is that we have a confidence interval for each cubeless equity, and we want a confidence interval for the difference. In the previous example, we were taking the difference of closely related numbers: if the rollout showed that the player on roll would win a lot of games, then the equity would increase for both doubling and not doubling. Here, the values from the two rollouts are almost completely unrelated (there can be mild exceptions if one duplicates the dice from one rollout in the other). So, the confidence interval of the difference is larger, not smaller, than either confidence interval. The joint standard deviation is sqrt(a² + b²), where a is the standard deviation of the first rollout’s value, and b is the standard deviation of the second. The radius of the confidence interval follows the same formula, sqrt(a² + b²), where now a and b are the radii of the confidence intervals of the two rollouts. In this example, the radii of the confidence intervals are 0.006 and 0.008, so the difference between the plays is 0.017 ± 0.010. As a shortcut, I usually just multiply the larger radius by 1.4, which produces about the right answer most of the time.

Is 0.017 ± 0.010 a significant difference? It is statistically significant, since the confidence interval runs from 0.007 to 0.027, which does not include 0. That means it would be quite surprising if the rollout limit for 8/5, 6/5 were equal to or greater than that of the 13/9* favored by the rollout. Further, while 0.017 might not seem like much equity, it is magnified when you convert to cubeful equity. Usually the expansion factor is between 1.5 and 2, so this represents a cubeful difference of about 0.030. On the other hand, the plays are different enough that there might be significant systematic errors from the truncation.
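The contrast between linked estimates (Example 1) and independent estimates (this example) can be sketched numerically. The simulation below is a toy model of my own, not backgammon: the equities and noise levels are invented, and only the radii 0.006 and 0.008 come from the example above.

```python
import math
import random

def sd(xs):
    """Sample standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

rng = random.Random(0)
double_eq, nodouble_eq = [], []
for _ in range(10_000):
    game_luck = rng.gauss(0.0, 1.0)   # luck of the game, shared by both estimates
    double_eq.append(0.80 + game_luck + rng.gauss(0.0, 0.1))
    nodouble_eq.append(0.75 + game_luck + rng.gauss(0.0, 0.1))

# Linked estimates (same games): the shared luck cancels in the difference,
# so the difference varies far less than either value alone.
linked_diff_sd = sd([d - n for d, n in zip(double_eq, nodouble_eq)])

# Independent rollouts: the errors add in quadrature instead.
def independent_radius(a, b):
    """CI radius for the difference of two independent rollout results."""
    return math.sqrt(a ** 2 + b ** 2)

radius = independent_radius(0.006, 0.008)   # the radii from the text: 0.010
```

The same quadrature formula also justifies the 1.4 shortcut: when the two radii are equal, sqrt(a² + a²) is exactly sqrt(2) ≈ 1.4 times either one.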
I would still trust this truncated rollout much more than I would trust a 3-ply evaluation (in this case, they agree).

Example 3: One-Sided Errors

Let’s suppose a position is going to be easy to play from one side, with almost no decisions. The side with no decisions will make no mistakes. The side with decisions might or might not. In many positions the errors made on opposite sides will cancel, but not in positions of one-sided errors. For the rollout limit to be close to the theoretical value of the position, the bot must play the hard side well.

Note that while a position might allow no chance of a checker play error, it might still offer opportunities for cube action errors. Also, in a truncated rollout, there is a possibility that the evaluations at the truncation are incorrect and favor the decision-maker. A cubeless rollout without truncation should behave as described here.

It is hard to use rollouts to determine the value of positions of one-sided errors. On the other hand, these positions serve as a test of how strong the settings of the bot are. Ideally, the stronger the bot is, the higher the value of the rollout limit will be from the perspective of the side that can make errors.
Rollout 1: 1-ply, cubeless, truncated at bearoff database, 21600 trials (seed 11).
Rollout 2: 2-ply super-tiny, cubeless, truncated at bearoff database, 1440 trials (seed 12).
Rollout 3: 2-ply medium, cubeless, truncated at bearoff database, 1440 trials (seed 13).
Rollout 4: 2-ply huge, cubeless, truncated at bearoff database, 1440 trials (seed 14).
Rollout 5: 3-ply tiny 20%, cubeless, truncated at bearoff database, 360 trials (seed 15).
Rollout 6: 3-ply tiny 100%, cubeless, truncated at bearoff database, 360 trials (seed 16).
Rollout 7: 3-ply huge 20%, cubeless, truncated at bearoff database, 360 trials (seed 17).
Rollout 8: 3-ply huge 100%, cubeless, truncated at bearoff database, 360 trials (seed 18).
Rollout 9: Snowie 4 Full Cubeless 2-ply (fast), 324 trials (seed 1).
Rollout 10: Jellyfish Level 6, cubeless 2-ply, 9000 trials (seed 19).
In this test, it looks like the most important variable among the Snowie 3 rollouts was the ply. All of the 3-ply rollouts were higher than the 2-ply rollouts. To my surprise, the 1-ply rollout was not far behind. Among the 2-ply rollouts, the size of the search space mattered; huge was significantly better than super-tiny. There is not much information among the 3-ply rollouts, but it looks like using 100% speed made no real difference (those with 100% speed averaged a cubeless equity of 0.531 ± 0.012, less than the 20% speed average of 0.540 ± 0.012), while using a huge rather than a tiny search space might mean stronger play (0.540 ± 0.012 versus 0.532 ± 0.012).

Thanks to Michael Strato for running rollout 9 for me. The question is how Snowie 4, 2-ply, compares to the levels of Snowie 3 in a one-checker containment position. It appears that Snowie 4, 2-ply is weaker than Snowie 3, 3-ply tiny 20%. Snowie 4, 2-ply might be stronger than Snowie 3, 2-ply, but more data is needed.

In this position, Jellyfish’s play on level 6 is worse than Snowie 3’s play on 1-ply. Jellyfish is known to have trouble in containment positions, and this is an example.

In a position of one-sided errors, every rollout setting should underestimate the value of the side with real decisions. If any rollout says the side without decisions has a pass, then it should be a pass.

Example 4: Too Good to Double

After performing a cubeless rollout, Snowie and Gnu can convert the bg/g/w distribution to a cubeful equity. This is an educated guess based on Janowski’s formulas. Sometimes it works, and sometimes it doesn’t.
I’m not exactly sure what is going on in the rollouts, but clearly the live cube rollouts see something very different from the cube adjustments of the cubeless rollouts. Most notable is that the difference between playing on with a centered cube and with an owned cube is only 0.023 in the cube adjustment, but about 0.084 with the live cube. The cubeful equities are also lower than the cube adjustments.

What I think is happening is that Janowski’s formula implicitly assumes that a centered cube will be used almost exclusively by Black. In fact, it is very likely that Black will go from too good to double to not good enough to double in one exchange, by leaving a shot that White hits. This happens in almost all of White’s cubeless wins. In this scenario, White will greatly prefer to have cube access. In fact, the cubeless equity is about 1.182, greater than the live centered-cube equity, so White gets more value out of the doubling cube than Black! In many positions, as Black’s position deteriorates, Black would double White out, or double White in; under those conditions, the assumptions of Janowski’s formula would hold better, and it would better estimate the cubeful equity from the cubeless equities.

In this position, I’m convinced that the cube adjustment is a source of a substantial systematic error if you perform a cubeless rollout. In other situations, you should be more wary of the cube errors made in the course of a cubeful rollout. Most of the time I trust the cube adjustment, but not when there is a cube turn coming up soon, when the cube liveliness is abnormal, or when you may be too good to double.

Counter Example: A Bearoff
If you perform any cubeless rollout, you get the following rollout limit:
Untruncated 2-ply rollout, 3-ply live cube, 2592 trials:
The correct tool for checking the right decision for money play is an exact database, such as the Sconyers database sometimes available on GamesGrid (set up a prop, click on Bearoff Equity if you have a mutual bearoff position) or Walter Trice’s Bearoff Quizmaster. I used the latter to get the following:
Here’s the problem with Snowie’s rollouts of this position: Snowie knows the exact winning chances, but it grossly overestimates the recube potential in the late bearoff. In a long race with this winning chance, Snowie recommends not doubling, which is reasonable when there would be many opportunities to recube if the game turns around. Much of the time, a rollout with a live cube will fix the problem, and the equity for redouble/take seems to be about right. However, after no redouble, Snowie makes a big mistake from the other side! After both sides take two checkers off, Snowie would erroneously take, thinking that it would be able to recube. However, there is no recube value. Snowie cubeful rollouts say that if you are playing Snowie, it is better to delay redoubling to get a bad take next turn. If your opponent would pass correctly (even a third of the time), you should redouble now. This is a rare example in which Snowie errs on evaluations and all rollouts (that I know of)*, but where the exact equities are known. However, it is likely that there are many other positions in which all rollouts will fail because the checker play of all bots is flawed.
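To show in miniature what an exact database does, here is a sketch that computes exact win probabilities for a drastically simplified bearoff: each side has a single checker left on one of its home-board points, and there is no cube. This toy has none of the scope of the Sconyers database or Bearoff Quizmaster, and the function is my own invention; the point is that such positions can be solved exactly by recursion over the 36 rolls rather than estimated by rollout.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def win_prob(a, b):
    """Exact probability that the player on roll bears off first, when the
    player on roll has one checker on point a (1-6) and the opponent has
    one checker on point b. With a single checker, a turn bears it off
    exactly when the pips rolled (doubles count four moves) reach point a;
    otherwise the checker advances and the opponent rolls."""
    total = 0.0
    for d1 in range(1, 7):
        for d2 in range(1, 7):
            pips = 4 * d1 if d1 == d2 else d1 + d2
            if pips >= a:
                total += 1.0                         # borne off: immediate win
            else:
                total += 1.0 - win_prob(b, a - pips)  # opponent is now on roll
    return total / 36.0
```

For instance, win_prob(1, 6) is exactly 1, since any roll bears off a checker on the ace point, and win_prob(4, 3) is 34/36, since only 2-1 and 1-2 fail against an opponent who is then certain to come off.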
Deciding Which Rollout to Use

I’m often asked what the right rollout settings are. The answer is that it depends on the position and on what information you want. In some situations, many rollouts will give the required information quickly. In others, the fast rollouts can’t be trusted, and one must use stronger settings and longer rollouts to get accurate results. Sometimes all types of rollouts will introduce biases, and it’s not clear in which direction. Sometimes all rollouts produce the wrong answer. Nevertheless, here are some guidelines to follow.

First, you need the statistical error to be smaller than what you consider an acceptable error. All bots give some indication of the magnitude of the statistical error, though sometimes just for the cubeless equity. If the statistical error is too large, you don’t get enough information, and you need to increase the number of trials, perhaps by extending the rollout.

Second, you need to be confident that the bot understands how to play the resulting positions reasonably well. This includes the cube actions if it is a live cube rollout, and the evaluations at the point of truncation if it is a truncated rollout.

How Many Trials?

When comparing checker plays, there is a tendency to try to roll the plays out until the rollouts indicate that the rollout limit is higher for one play than for another. I recommend accepting a third possibility: that the plays are close. This is useful information, and it can save a lot of time when rolling out decisions that are really tossups. A longer rollout will not decrease the systematic error, which might exceed the theoretical difference between the plays.

Stratification tends to increase the estimated statistical error even though it actually decreases the statistical error. On the other hand, be very careful about stopping (or extending) a rollout that is stratifying its first few rolls unless the stratification is done more carefully than in current bots.
The rollout may have significantly biased initial rolls that would have been balanced out if the rollout had not been interrupted.

For cube actions, the number of trials necessary depends on the decision. For a take (or beaver) decision, you should use enough trials that the estimated statistical error is small compared with either the distance from the result to the take/pass point or the size of an error you find acceptable. For a decision about whether or not to double when not too good to double, the error estimates are often much larger than they ought to be, and you can usually assume that the confidence interval of the difference between doubling and not doubling is as wide as the confidence interval for the cubeless equity. Also see the discussion of whether to use a live cube. In general, for close decisions you need many trials to distinguish the choices. If the plays are far apart, this should become clear after fewer trials.

Should You Use Variance Reduction?

Yes.

Should You Use a Live Cube or Not?

It is more reasonable to use a live cube if there is going to be a double soon, if the position may be too good to double, or if you suspect that the cube liveliness (x in Janowski’s formula) is much different from normal, as in a bearoff. However, a live cube slows down the rollout, and there is a significant chance of introducing systematic errors due to bad cube actions. Most of the time I don’t use a live cube, but simply adjust a cubeless rollout.

There seems to be no accurate way to roll out a decision of whether to double or not, including whether you are too good to double. A live cube rollout might help. However, if you believe a position is too good to double but the bot doesn’t, the bot may try to cash next turn no matter what, since it might not evaluate that decision the way you want it to.
If you use a live cube with Snowie, I recommend using a 3-ply cube with 2-ply checker play, since the cube accuracy increases remarkably while the slowdown is small (compared to a 2-ply cube). A 3-ply cube sees market losers. Snowie 2 allows you to perform a rollout with checker play according to score without using a live cube, which speeds up rollouts in the Crawford game or with a dead cube, but Snowie 3 requires the use of a live cube to use this option. I don’t believe it is available in Snowie 4; however, checker play according to score is a useful option. Jellyfish 3.0 allows rollouts on level 5 with the cube and on level 6 without the cube. I usually use level 6, and try to make the cube adjustments myself.

What is the Right Depth, Search Space, and Speed?

Higher plies may be significantly better at handling primes, blitzes, and containment positions, and using a lower depth may introduce large systematic errors. For example, Snowie 3, 2-ply does not handle trap and squeeze plays properly, and while 3-ply is not perfect, it is so much better that you should use 3-ply for positions that often result in opportunities to trap your opponent off an anchor.

I have rarely found much of a difference between using Snowie 3 on 100% speed versus 33% speed; I recommend using 33%. However, I have often found significant differences between using a tiny search space and a huge search space. Often the play 3-ply would choose is not considered even on 2-ply when the search space is too small. Using a restricted search space means that for 3-ply to make the right play, 1-ply and 2-ply have to understand the position reasonably well, but the reason you use 3-ply may be exactly that 1-ply and 2-ply are not accurate enough.

Since Snowie 1-ply does not use variance reduction, a huge number of trials is needed. While each trial is faster, I recommend always using at least 2-ply when using Snowie.
In contrast, GnuBG does use variance reduction with its static-evaluation rollouts (called 0-ply, but analogous to Snowie 1-ply or Jellyfish Level 5). This slows them down, but they are still faster than 1-ply (analogous to Snowie 2-ply or Jellyfish Level 6). Gnu 1-ply does not play much better than Gnu 0-ply, which makes 0-ply rollouts a serious option.

Should You Truncate?

The reduction in statistical errors is tremendous, sometimes reducing the time needed by a factor of 50. However, truncation can introduce serious systematic errors if the program does not have excellent absolute evaluations of the positions at the time of truncation. While the perils of truncation jump out at me, and I rarely perform rollouts truncated at a fixed depth, it’s good to keep in mind that evaluations are already very weak truncated rollouts, so any truncated rollout without large statistical errors should be an improvement.

For cube actions, don’t truncate at a fixed ply. Truncation often introduces unacceptable biases, even in simple positions. You might accept settlements on large cubes (which introduces small systematic errors in order to reduce the statistical errors). You also might truncate at the bearoff database if you do not expect many doubles in the bearoff.

If you are rolling out similar positions, the systematic errors are probably highly correlated. Also, the difference between the plays may be small, requiring a very small statistical error. Here truncated rollouts are probably reasonable. I would not use them to decide whether or not to hit, but rather to decide, for example, which splitting play is better with an opening 4-3.

What is the Best Rollout?

My vote goes to the Jellyfish interactive rollout. In the course of an interactive rollout, I can get Jellyfish’s advice on how to play, which sometimes helps me to play better, and sometimes helps me to see when Jellyfish (on various levels) would misplay a position.
When I really want to understand a position, it goes onto a short list of positions I’m waiting to roll out interactively. After an interactive rollout, I feel confident that I better understand not just the initial position, but all of the positions that commonly arise afterwards. The downsides are that it takes my time, not just computer time, and that the doubling cube is not in play. Sometimes I play out the position afterwards with Snowie a few times to see some sample cube decisions.

Summary

Rollouts are a great tool, but they are complicated. They can’t always be trusted, but they allow us to study positions and probe the playing strengths of bots. Usually there is a tradeoff between the statistical and systematic errors. In some positions, many rollouts will give the right answers quickly. In others, no rollout will give the right answer.

© 2002 by Douglas Zare and GammonVillage.