Computers and Rollouts, by Kit Woolsey

This article originally appeared in the January 2000 issue of GammOnLine.
Thank you to Kit Woolsey for his kind permission to reproduce it here.

Due to the probabilistic nature of backgammon, it is often difficult to determine the proper play. One cannot prove that play A is better than play B the way one can in chess by analyzing the possible variations which follow from each play. Since there are 21 possible dice rolls at every turn, even looking ahead a couple of rolls is a difficult task. We have to depend on our intuition and experience.

The one way we can test plays is by rolling out the position a number of times. This can give us a lot of insight into the position, but it can be quite laborious. Rolling out a position 100 times takes several hours, and such a small sample is far from conclusive anyway. The luck element is just too great. Even if we put controls on the rollout which attempt to even out the luck element, the number of trials one can do in a short amount of time will not be sufficient to give us satisfactory results.

The logical thing is to give the position to the computer. Our high-powered computers can roll out positions very quickly, and what would take human beings days can be done by the computers in a couple of minutes. The problem is that the computers may not play well enough for the rollouts to have any meaning. For simple positions such as bearoffs it is easy to program a computer to play pretty well, and the results of rollouts are likely to be accurate. For more complex positions, programming the computer to play well is much more difficult, and rollout results will be less trustworthy.

Back in 1991, I wrote an article for Inside Backgammon about rollouts. This was before the neural nets of today were invented. In that article I made the prediction that within five years we would have programs which played well enough so that we could trust their rollout results, and that our knowledge of backgammon would improve immensely. This turned out to be one of the most accurate predictions I have ever made.

Shortly after this article, the program Expert Backgammon was released. This program was not a neural net—its rules for making evaluations were hand carved. Still it played at a decent intermediate level, and had the capability of doing rollouts. The program wasn't as efficient as the ones we have today and the computers weren't nearly as fast, so it might take over an hour to do a rollout of 1296 trials. Still the ability was there, and for many normal positions the results were believable. Our backgammon knowledge was starting to change.

It soon became apparent to me that we needed many more trials than I had realized before our sample size was large enough, even assuming we could trust the computer to play well enough. For example, the first thing I did when I got Expert Backgammon was to roll out all the opening plays 1296 times each. Imagine my surprise when for an opening 4-2 the play 13/9, 13/11 came out slightly better than making the four point! This wasn't due to the way Expert Backgammon was playing the positions; it was simply because the sample size wasn't large enough and the results for this particular rollout happened to be on the far end of the bell curve. A larger rollout of over 5000 trials showed that the result was a fluke and that making the four point was considerably better.

The play of the program was another matter. For more complex positions, back games in particular, it was immediately clear that the program had no idea what it was doing and that the results were way off base. For simpler positions, it looked like most of the results were reasonable. Still there were problems. As a test, I compared rollouts from Expert Backgammon with rollouts from my own backgammon playing program I had written. My program didn't play as well as Expert Backgammon, but it wasn't too far behind. Still, the results from simple positions such as holding games sometimes differed by as much as .100 in equity between the two programs. This meant that one of the programs was badly misplaying either the side playing the holding game or the side coming in against the holding game, and it wasn't particularly obvious where the misplays were coming from. What this illustrated was the importance of accurate play by the program if a rollout was to be trusted.

It should be noted that the main difficulties were with position evaluation. When it came to play vs. play decisions, the rollouts generally gave accurate results if the plays led to fairly similar positions. The reason is that if the program was making mistakes playing the position, it would make similar mistakes for each play rolled out. Consequently, the mistakes would tend to cancel out, and the real differences between two plays being rolled out would show up. However if we were looking at the proper equity of the position for a cube decision, or if the two plays being rolled out led to wildly different types of positions, there could be trouble.

A little later, a different programming approach was taken. Instead of letting clumsy humans guess what weights to assign to various parameters, why not let the program decide for itself? The idea is to have the program start from scratch with nothing or very little known about how to play backgammon except the rules, have the program play many thousands of games against itself, and from the results of these games determine what weights should be assigned to various parameters and how they should be intermixed. This is the neural network approach. Essentially the program "learns" the same way a human being learns, by seeing what is successful and modifying his behavior based on what he learns. However the program can do this much more objectively, and can play many more games than a human can in his lifetime and remember the results. Would this approach compensate for the human ability to think creatively?

The first neural network backgammon program was written by Gerald Tesauro. It was called TD-Gammon. Bill Robertie played it a series of games, which he later published. The program made several obvious errors. In technical positions such as bearoffs it made some clear mistakes, and in complex positions such as back games there were plenty of things it did wrong. For the most part, however, TD-Gammon played quite competently. In normal positions it consistently chose good moves, and demonstrated a reasonable understanding of concepts such as timing and flexibility. The best human players were still clearly better, but the program played well enough that it looked like its rollout results could be trusted. In fact Tesauro did have TD-Gammon roll out several interesting positions from the Robertie match, and the results looked quite reasonable. There was no question that TD-Gammon was considerably better than any previously written backgammon-playing program.

Tesauro did not stop there. Making use of ever-increasing computer speed and power, he upgraded TD-Gammon. He quadrupled its brain size, and had a much longer training session. Also, and very important, the computer the program ran on was now fast enough that the program could look ahead one roll when it played (we call this 2-ply). In other words, when the program analyzed a candidate play it would look at all 21 possible dice rolls for the opponent, find the opponent's best play for each (on its 1-ply analysis), and evaluate all these resulting positions. It would then average them out, weighting each roll by its probability, to determine its equity estimate for the play under consideration. This is something a human would find impossible to do in a few seconds. However our high-speed computers were able to perform this look-ahead and still play at a normal pace.
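
The look-ahead described above can be sketched in a few lines of Python. This is only an illustration, not TD-Gammon's actual code: `legal_replies` and `evaluate` are hypothetical stand-ins for a move generator and a 1-ply evaluator, and equities are taken from the point of view of the side that just moved, so the opponent's best reply is the one with the lowest value.

```python
from itertools import combinations_with_replacement

# The 21 distinct rolls: 6 doubles at 1/36 each, 15 non-doubles at 2/36 each.
ROLLS = [((a, b), (1 if a == b else 2) / 36)
         for a, b in combinations_with_replacement(range(1, 7), 2)]


def two_ply_equity(position, legal_replies, evaluate):
    """Average, over the opponent's 21 rolls, of his best reply's 1-ply value.

    legal_replies(position, roll) -> list of positions the opponent can reach
    evaluate(position)            -> 1-ply equity for the side that just moved
    Both callables are assumed helpers, not part of any real program.
    """
    total = 0.0
    for roll, probability in ROLLS:
        replies = legal_replies(position, roll) or [position]  # no move: stay put
        # The opponent's best reply is the one that is worst for us.
        total += probability * min(evaluate(reply) for reply in replies)
    return total


def best_play(candidates, legal_replies, evaluate):
    """Pick the candidate position with the highest 2-ply equity."""
    return max(candidates,
               key=lambda pos: two_ply_equity(pos, legal_replies, evaluate))
```

Each candidate play costs 21 move generations plus an evaluation of every reply, which is why this depth is affordable for a single move but expensive inside a rollout of thousands of games.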

What were the results of these improvements? Once again, Robertie played a series of games against TD-Gammon. The difference was like night and day. TD-Gammon was no longer a competent advanced player. It was a top-flight expert, clearly competitive against the best players in the world. It still made some clear errors, but these didn't cost much. However in the vague positional judgment areas where humans have the most trouble, TD-Gammon would consistently find the best play. It was clear that our top experts would have to learn from the computer in order to keep up with it.

Unfortunately, the 2-ply analysis takes some time. This is fine for normal play, but for rollouts it takes too long. The main gain from using computers is that we can roll out a position several thousand times in a few minutes, but to do this we have to live with the 1-ply rollout and the weaker play of the 1-ply. Still, even at the 1-ply level the program plays decently.

TD-Gammon was never sold commercially. However other programmers saw how successful it was and tried writing their own neural network programs. One of them was Jellyfish, written by Fredrik Dahl. This was commercialized, with rollout capabilities and tutorial analysis during the play. For the first time players had access to the powerful neural network analysis. Not surprisingly, the average skill level of the tournament player took a big jump. Also, some of the previously believed concepts about backgammon were overturned. The wild slotting style of the late 1970's and 1980's was, if the neural nets were to be believed, more costly than previously thought. The race was found to be very important, and many plays were based on racing potential. Purity was found to have been overrated, while ugly attacking plays proved to be stronger than expected. The style of the average good player drifted toward these new concepts. Of course, one does wonder if these results from the bots are somewhat self-fulfilling prophecies. Could it be that the bots prefer blitzes and races to priming games and back games because they play them better? The jury is still out on that topic, but we will be looking at the question.

How strong do the neural nets play? Improved computer speed allowed a new version of Jellyfish which looks ahead two rolls (that is, both the opponent's roll and the program's next roll), which as might be expected leads to significant improvements. Would anybody be willing to put his money where his mouth is about the strength of the neural net? Malcolm Davis was. He challenged Nack Ballard and Mike Senkiewicz (two players who in anybody's opinion are among the top players of the world) to play Jellyfish 300 games apiece for quite large stakes. The challenge was accepted, and in the summer of 1997 the match took place. I was there, helping to organize things. The players played on a regular board (so that they would be playing under conditions they were used to). Somebody sat opposite them playing the computer's pieces, making the moves the computer recommended. The dice were rolled by the players at the board (not by the program), and the dice rolls and the plays of Nack and Mike were input into the computer by a person operating the computer. The match went quite smoothly, and play moved along at a reasonably normal pace. At the end, Nack was +58 points and Mike was -58 points. This meant that over 600 games Jellyfish had broken even against two of the top players in the world playing at their best since serious money was involved. While there is always a lot of luck in backgammon this is quite a few games, and in my opinion demonstrated that Jellyfish is competitive against anybody. I have played through the games, and I don't think that Jellyfish had been particularly lucky overall—it played just as well as its opponents.

In the last couple of years another commercial neural net has come out: Snowie, written by Olivier Egger. Snowie plays every bit as well as Jellyfish, perhaps a bit better. Its main feature in my opinion is the ability to play or store a match and then have the program analyze the entire match. This is very valuable for learning.

Now that we have programs which play well, and computers which are very fast, can we trust the rollouts of these computers to answer all our questions about backgammon? Yes and no. There are always potential pitfalls, and if we are not careful we may find ourselves believing plays to be correct which are, in fact, wrong. Let's look at some of the dangers.

As we have seen, even 1296 trials may not be sufficient just due to the laws of chance. How many trials do we need? That is hard to say. Regardless of how many trials we run, there is always the possibility that we will get a freak result which gives us the wrong answer. In general, if we roll out a position 10,000 times or so, it is very unlikely that play A will come out ahead of play B just due to the luck of the rollout when in fact play B is better. As we will see, there are other, greater dangers involved in the rollout.
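
To get a rough feel for these numbers, one can use a normal approximation. The sketch below assumes each play is rolled out with independent dice, and the inputs (a true difference of 0.03 between the plays and a per-game standard deviation of 1.0) are illustrative assumptions of mine, not measured figures.

```python
from math import sqrt
from statistics import NormalDist


def chance_of_wrong_verdict(true_edge, per_game_sd, trials):
    """Chance that the worse play shows the better average (normal approximation).

    Each play is rolled out `trials` times with independent dice, so the
    difference of the two averages has standard error sd * sqrt(2 / trials).
    """
    standard_error = per_game_sd * sqrt(2 / trials)
    return NormalDist().cdf(-true_edge / standard_error)


for n in (100, 1296, 5000, 10000):
    # 0.03 and 1.0 are assumed values chosen only for illustration.
    print(n, round(chance_of_wrong_verdict(0.03, 1.0, n), 3))
```

The standard error shrinks only with the square root of the number of trials, so halving the noise takes four times as many games.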

There are ways of controlling the dice to cut down on the luck factor, and the bots make use of these. For play vs. play decisions, duplicate dice are used—i.e. for each trial of play A, play B is then rolled with the same dice. For many positions where certain rolls are likely to be critical, this is very helpful in cutting down the luck factor. For other more general positions, such as how to play the opening roll, duplicate dice make relatively little difference. The games diverge too quickly, and what is a good roll in one game may be a bad roll in another. Also if we are looking at two plays which lead to very different types of positions, duplicate dice are of dubious value.
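
The effect of duplicate dice can be seen in a toy experiment. The sketch below is not backgammon: it is a bare pip race with made-up pip counts (play A leaves 60 pips, play B leaves 62, the opponent has 61), and every name in it is hypothetical. It compares the spread of the per-trial difference between the two plays when both use the same dice and when each uses its own.

```python
import random
from statistics import pstdev


def pips_rolled(rng):
    a, b = rng.randint(1, 6), rng.randint(1, 6)
    return 4 * a if a == b else a + b  # doubles are played four times


def race(my_pips, opp_pips, rng):
    """Toy pip race with the opponent on roll: +1 if we win, -1 if we lose."""
    while True:
        opp_pips -= pips_rolled(rng)
        if opp_pips <= 0:
            return -1
        my_pips -= pips_rolled(rng)
        if my_pips <= 0:
            return 1


def play_a_minus_play_b(trials, duplicate_dice, seed=2000):
    """Per-trial result of play A (60 pips left) minus play B (62 pips left)."""
    master = random.Random(seed)
    differences = []
    for _ in range(trials):
        seed_a = master.getrandbits(32)
        seed_b = seed_a if duplicate_dice else master.getrandbits(32)
        result_a = race(60, 61, random.Random(seed_a))
        result_b = race(62, 61, random.Random(seed_b))
        differences.append(result_a - result_b)
    return differences


duplicated = play_a_minus_play_b(2000, duplicate_dice=True)
independent = play_a_minus_play_b(2000, duplicate_dice=False)
print("spread with duplicate dice:  ", round(pstdev(duplicated), 3))
print("spread with independent dice:", round(pstdev(independent), 3))
```

In this toy the two plays lead to almost identical positions, which is the case where duplicate dice help most: with the same dice, the two games differ only when the extra two pips actually decide the race.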

Another way of controlling the luck element of the dice is to make sure that each number comes up the same number of times. For example, if you are rolling out a position 1296 times, the bot will arrange things so that every one of the combinations of 36 opening rolls and 36 responses occurs exactly once. This is helpful, since the first couple of rolls are often the most critical in determining the outcome of a position. Jellyfish takes this a step further. When doing a rollout, it arranges things so that each roll (Black's 10th roll, for example) will have each of the 36 possibilities occur an equal number of times. This does not affect the randomness of the rollout, but does ensure that if a certain roll is generally good that roll will come up its fair share of the time. I don't know if Snowie does the same thing.
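
A minimal sketch of how such balanced dice could be generated follows. It is my own illustration and makes no claim about how Jellyfish arranges its dice internally; both function names are hypothetical.

```python
import random
from itertools import product

ALL_ROLLS = list(product(range(1, 7), repeat=2))  # 36 equally likely ordered rolls


def stratified_first_two_rolls(rng):
    """1296 (first roll, reply) pairs, every combination exactly once, shuffled."""
    pairs = list(product(ALL_ROLLS, ALL_ROLLS))
    rng.shuffle(pairs)
    return pairs


def balanced_rolls_for_turn(trials, rng):
    """Rolls for one later turn across all trials, each of the 36 equally often."""
    if trials % 36:
        raise ValueError("trials must be a multiple of 36")
    rolls = ALL_ROLLS * (trials // 36)
    rng.shuffle(rolls)
    return rolls
```

Trial number i would take its first two rolls from the i-th pair, and its roll on any later turn from the i-th entry of a freshly shuffled list for that turn.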

A further way to cut down on the luck element is to use truncated rollouts. This can be done when using hand rollouts also. The idea is to not roll the position out all the way, but just a certain number of rolls and then evaluate. Not only is this a huge time-saver, but it cuts down on the luck element which can affect things later in the game. Thus, a smaller sample size is needed to get accurate results, again cutting down on the time of the rollout. The catch, of course, is the evaluation of the position after the truncated rollout. If that evaluation is accurate, then the truncated rollout is likely to give very good results. On the other hand if the resulting position is one which the program has difficulty evaluating properly, then the results of a truncated rollout will be very suspect.
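
In outline, a truncated rollout looks like the sketch below. The four callables are hypothetical placeholders for whatever the program uses to play a turn, detect the end of the game, score a finished game, and evaluate an unfinished position.

```python
import random


def truncated_rollout(start, play_turn, is_over, result, evaluate,
                      max_turns, trials, rng):
    """Average equity from rollouts that stop after at most `max_turns` turns.

    play_turn(position, rng) -> next position (rolls the dice and moves)
    is_over(position)        -> True once the game has ended
    result(position)         -> actual outcome of a finished game
    evaluate(position)       -> the program's estimate for an unfinished game
    All four callables are hypothetical placeholders.
    """
    total = 0.0
    for _ in range(trials):
        position = start
        for _ in range(max_turns):
            if is_over(position):
                break
            position = play_turn(position, rng)
        # Games that finished count at face value; the rest are estimated.
        total += result(position) if is_over(position) else evaluate(position)
    return total / trials
```

Everything the rollout reports for unfinished games comes from `evaluate`, so a systematic error in that estimate shows up directly in the result no matter how many trials are run.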

The big danger with a rollout, of course, is that the bot is simply misplaying the position. As we have seen, in order to get top speed out of our rollout it is necessary to have the rollout be done at 1-ply, where the program is at its weakest since it doesn't get to use its lookahead capability—it has to make an assessment of the position as is. How bad are some of its plays at 1-ply? They can be pretty bad. Here are a couple of examples:

[Diagram: White vs. Blue, money game. Pip counts shown: 150 and 141.]

Obviously the correct play is 9/7(2). There is no need to make the anchor. White's board is almost sure to collapse since it will be very difficult for White to escape his back men, so the danger of being attacked is minimal. Yet, on its 1-ply assessment Jellyfish thinks that 23/22, 13/12, 8/7(2) is clearly the best play. Somehow Jellyfish doesn't appreciate the timing considerations involved in the position. Of course at its 3-ply Jellyfish "sees" what is going to happen and correctly determines that 9/7(2) is far superior. However a rollout is done at 1-ply, so if this position came up in a rollout Jellyfish would mangle it and so distort the results.

Snowie had no problem with the double-aces play even on its 1-ply—it thought that 9/7(2) was clearly superior. However Snowie is not immune from big accidents. Consider the following position:

[Diagram: White vs. Blue, money game. Pip counts shown: 145 and 99.]

As any experienced player knows, making the ace point is wrong. There is too much danger that Blue will not escape next turn and that his board will start to crack. The correct play is 22/17, 5/1*, combining attack with escape. However on the 1-ply Snowie gets this wrong, thinking that making the ace point is clearly right. On its 3-ply it sees the danger, and properly concludes that 22/17, 5/1* is far better.

Interestingly enough, Jellyfish gets this right on its 1-ply analysis. These two examples would appear to indicate that Jellyfish has a better understanding of attacking positions, while Snowie has a better understanding of priming positions. This is consistent with my observations from seeing both of the programs play many games. I believe the difference comes from the initial inputs which are put in by the programmer. Of course, this is the same way two different humans are strong or weak in different areas of the game, depending on their experience and schooling in the position types. Computers also have their own personal style of play.

The good news is that while errors such as the above will be made in the rollouts, most of the time it won't matter. A specific position and dice roll has to come up for it to make a difference, and that won't happen often. Also in play vs. play problems the type of error involved may come up for both plays, so the errors will cancel out. In general, problems such as this will only really matter on the first roll for each side in the rollout. After that, things will diversify enough so it won't matter much. If you are really concerned about a rollout, it is worth checking to see how the program plays the initial rolls for the position on its 1-ply. If the plays are decent, the rollout results are likely to be accurate.

Another problem occurs if the program is thematically misplaying a position. This can happen in the most unusual situations. For example, Kent Goulding in Inside Backgammon wrote an article which described how Jellyfish in its rollouts mangled several bearoff positions by not taking a checker off when it was supposed to (in fact, Expert Backgammon with its hand-carved parameters played these bearoffs much better). This problem with Jellyfish has been fixed in the latest versions; it now refers to an accurate database. However the problem could cause very misleading results with positions which are likely to lead to a close race where one side has a crunched position (so cannot misplay it), while the other side has a smoother bearoff.

Back games are another problem for the nets. While the latest versions play the initial structure surprisingly well, in the end-game after a checker has been hit the programs tend to flounder. This can cause misleading results in rollouts. For example, consider how to play an opening 2-1. It seems logical that the slotting play will lead to the slotter being involved in more back games, since if the blot is hit he already has three men back. However this is a long way down the road, and there are a lot of other variations possible. How much effect this will have on a rollout is anybody's guess. However if the slotting and splitting plays are close (and rollouts indicate that they are, with a slight edge to the split), it is quite possible that the weakness in back game play is sufficient to swing the result and make slotting superior. For other positions which are likely to result in back games, the problem is greater, particularly when trying to analyze a cube decision. As usual the problem isn't as great for play vs. play problems, since the positions will probably be somewhat similar.

What about incorporating the cube into rollouts? The main danger is that a couple of big cubes might distort the results. When rolling out a position by hand, the popular way to get the cube into play is as follows: When the person doing the rollouts judges that it is a close double but a clear take, he just rolls on. However when he judges that it is a clear double but a close take, he settles the position as a win for the doubling side. This isn't 100% accurate, of course, but is a good approximation and avoids the problem of big cubes.

The computer programs can settle positions in the same way. A settlement equity is determined. If the estimated equity of the position is below that equity the rollout is continued, but if it is above that equity for the person on roll and he has cube access, then that side is deemed the winner. The question is, what should the settlement number be? In the average type of position, the break-even equity for a pass/take decision is usually about .570 (it is higher than .500 because of the recube potential). If the position has a lot of gammon potential the break-even point is higher. The reason is that if the losing side is getting gammoned a lot he must have a higher percentage of wins to compensate, and the more wins he has the more he is likely to make use of the recube. After a lot of trial and error, most players have determined that .550 appears to be a reasonable number to use for the settlement figure.
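
The settlement rule amounts to a single test applied before each roll. The sketch below assumes the program has an equity estimate for the side on roll and knows whether that side may double. The function name and signature are hypothetical, and the only number taken from the discussion above is the .550 threshold.

```python
SETTLEMENT_EQUITY = 0.550  # the settlement figure quoted above


def settle(equity_on_roll, has_cube_access, cube_value,
           settlement=SETTLEMENT_EQUITY):
    """Points awarded to the side on roll if the game is settled, else None."""
    if has_cube_access and equity_on_roll > settlement:
        return cube_value  # treated as double and pass: side on roll wins the stake
    return None  # below the threshold, or no cube access: keep rolling
```

A rollout loop would call this before every roll and stop the game as soon as it returns a value.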

There are other ways to handle the cubeful rollouts. One, which is used by Snowie, is to let the cube get as high as 8, but then settle. This allows for the effect of proper doubles and takes, but avoids the swings of really high cubes. Another idea which I like (but has not yet been adopted by the programmers) is as follows: If during the rollout the program believes it is double and take, then split off into two rollouts (with the cube on 2), but give each rollout half value in the final total. If during one of these rollouts the program again thinks it is double and take, again split into two rollouts with the cube on 4, and give each of these rollouts 1/4 the overall value. This process can be continued as high as the cube gets. At the cost of the time spent on a few extra rollouts, this procedure takes in the full value of the cube without having to worry about any settlement, yet it avoids the distortions of high cubes (since if the cube is on 32, each rollout there will only count 1/32 of the total). I would like to see this approach implemented. This concept was originally thought of by Michael Zehr.
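
The splitting idea can be written as a short recursive function. This is my own sketch of the procedure described above, not code from any existing program. `play_segment` is an assumed helper that plays the game forward until it ends or until a double is offered and taken.

```python
import random


def split_rollout(state, play_segment, rng, cube=1, weight=1.0):
    """Weighted result of one trial, splitting in two at every double/take.

    play_segment(state, rng) is an assumed helper that plays until the game
    ends or a double is taken, returning ("end", basic_result) or
    ("double", state_after_take). basic_result is +1/-1 (2 or 3 for
    gammons and backgammons) from a fixed side's point of view.
    """
    kind, value = play_segment(state, rng)
    if kind == "end":
        return weight * cube * value
    # Double and take: two continuations, cube doubled, each at half weight.
    return sum(split_rollout(value, play_segment, rng, cube * 2, weight / 2)
               for _ in range(2))
```

At every finished game the weight times the cube equals 1, so a game played with the cube on 32 contributes only its basic result, while the trial as a whole still carries the full value of the cube because 32 such games are averaged.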

Whatever procedure is used for cubeful rollouts, the results can be very valuable. Each such rollout produces four equities: cubeless, center cube, side on roll owning cube, opponent owning cube. If the rollouts are meaningful, there is a lot of information to be learned about doubling theory from the results.

Is there a way to use the superior 2-ply and 3-ply playing ability of the programs in rollouts without taking too much time and still have sufficient sample size? The bots can use what is called variance reduction. I admit that I don't understand this very well—hopefully someone with better understanding can explain it in a future article. From what I know, the bots roll out the position making the 2-ply or 3-ply analysis for each play, but estimate how good or bad the dice roll is so as to incorporate the luck element into the equation. Doing this, the claim is that it takes far fewer trials to get meaningful results. It would seem as though this approach would depend a lot on the accuracy of the program's evaluation of the luck of the dice rolls. However from what I have seen these rollouts do tend to get better results in complex positions.
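
One common way to realize a luck adjustment of this kind is sketched below on a toy pip race. It is an illustration under my own assumptions, with no claim that Jellyfish or Snowie work this way, and it leaves out the 2-ply or 3-ply play entirely. For each roll, the luck is the program's estimate after the actual roll minus its average estimate over all 36 rolls, and the total luck is subtracted from the game's result. The evaluation function `estimate` is deliberately crude and entirely made up.

```python
import random
from itertools import product
from statistics import mean, pstdev

DICE = list(product(range(1, 7), repeat=2))  # 36 equally likely ordered rolls


def pips(roll):
    a, b = roll
    return 4 * a if a == b else a + b


def after_roll(state, roll):
    me, opp, my_turn = state
    if my_turn:
        return (me - pips(roll), opp, False)
    return (me, opp - pips(roll), True)


def estimate(state):
    """Crude equity guess for 'me'; exact (+1 or -1) once the race is over."""
    me, opp, my_turn = state
    if me <= 0:
        return 1.0
    if opp <= 0:
        return -1.0
    lead = opp - me + (4 if my_turn else -4)  # assumed: being on roll ~ 4 pips
    return max(-1.0, min(1.0, lead / 20))


def one_game(state, rng):
    """Return (raw result, luck-adjusted result) for one toy race."""
    luck = 0.0
    while state[0] > 0 and state[1] > 0:
        average = sum(estimate(after_roll(state, r)) for r in DICE) / len(DICE)
        state = after_roll(state, rng.choice(DICE))
        luck += estimate(state) - average  # how lucky was this roll?
    raw = estimate(state)  # the race is over, so this is exactly +1 or -1
    return raw, raw - luck


rng = random.Random(1)
games = [one_game((60, 61, True), rng) for _ in range(2000)]
raw, adjusted = zip(*games)
print("raw:     ", round(mean(raw), 3), "+/-", round(pstdev(raw), 3))
print("adjusted:", round(mean(adjusted), 3), "+/-", round(pstdev(adjusted), 3))
```

In this construction each luck term averages to zero whatever the evaluation function is, so a poor evaluation does not bias the adjusted result. It only reduces how much of the noise is removed.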

So, what is in store for the future? If computer speed continues to increase at the rate it has over the last 10 years, the next decade should produce computers available to the general public which are fast enough to let the bots do mini-rollouts and still play at normal speed. If this happens, I am confident that the best player in the world will be a computer program. Also, I expect to see improvements in the neural nets, mainly by using several different nets depending on the type of position. Both Jellyfish and Snowie do this to a limited extent (Snowie more so, I believe), but I think if this is done properly the overall evaluation function of the nets on their 1-ply can be improved considerably. With proper training, the nets might even learn to play end-games properly which they now have difficulty with. When this happens, computer rollouts will be very trustworthy, and backgammon knowledge will reach new peaks.
