Programming

 From: Sam Pottle
 Address: pottle@lunabase.org
 Date: 27 February 1999
 Subject: Neurality
 Forum: rec.games.backgammon
 Google: 920075834.435427@ylum.lunabase.org

```
I've been reading some of the papers on computer backgammon, and I have
some questions for the neuralistas:

-- I found the following description of Berliner's blockading feature in
his IJCAI '77 paper:

> For each combination of zero to seven blockading points at a distance
> of 1 to 12 spaces in front of a man, we computed the number of rolls
> that could legally be played by the blockade runner.

Is there a fuller description somewhere in the literature?

-- The trend in NN design for backgammon seems strongly toward nets with
lots of inputs, and the Neurogammon paper has some nice comparative
studies indicating that having various input features is lots better
than not having them.  But it seems to me that there is a tradeoff
available here.  If I have a network with 400 inputs and 150 neurons,
that's 60,000 weights.  I could use a really primitive board encoding to
design a network with 60 inputs and 1,000 neurons that would evaluate
just as fast.  Does this really lose?

-- Speaking of network architecture, it seems like everybody's using a
single hidden layer these days, although Tesauro found that two
layers worked a bit better for Neurogammon.  How come?

-- I recall some mention in this forum of some theoretical studies
suggesting that lambda should be roughly 1/branching-factor.  Does
anyone have references for these?

Sam
```

 Gary Wong  writes:
```
> Berliner's blockading feature

My copies of the papers are at work so I can't read them until Monday,
but I once implemented what I thought was a reasonable interpretation at
the time.  The interpretation was this:

    int Escapes[ 0x1000 ], i, c, n0, n1;

    for( i = 0; i < 0x1000; i++ ) {
        c = 0;
        for( n0 = 0; n0 <= 5; n0++ )
            for( n1 = 0; n1 <= n0; n1++ )
                if( !( i & ( 1 << ( n0 + n1 + 1 ) ) ) &&
                    !( ( i & ( 1 << n0 ) ) && ( i & ( 1 << n1 ) ) ) )
                    c += ( n0 == n1 ) ? 1 : 2;

        Escapes[ i ] = c;
    }

i.e. for each of the 2^12 point configurations (i) in front of a chequer
(I don't worry about limiting them to 7 points occupied -- illegal
configurations will never be referenced), generate the 21 dice
combinations (n0 and n1) and count (c) the number of combinations that
can be played (twice for non-doublets); to qualify, the sum (n0 + n1)
and either n0 or n1 must be playable.  ("Playable" is indicated by a 0
bit the appropriate number of bits into i.)  Perhaps this interpretation
isn't strictly correct for doublets, but I suspect it's close enough.

The rest of the code is available if you're interested, at:

    ftp://ftp.cs.arizona.edu/people/gary/gnubg-0.0.tar.gz

(Look in the file eval.c.)

> I could use a really primitive board encoding to design a network with
> 60 inputs and 1,000 neurons that would evaluate just as fast.
> Does this really lose?

Personally I would prefer to add input nodes rather than hidden units
(for an equivalent number of weights), as long as the new inputs provide
information that isn't easily derived from the others.  I'm certain that
60 inputs is too small (that's barely enough to cover one input per
player per point); if some information is useful in evaluating the
strength of a position, then it is more efficient to provide that
information at the inputs if you can, rather than supplying extra hidden
units to derive it.

> It seems like everybody's using a single hidden layer these days.
> How come?

My impression is that you take a big performance hit in adding more
layers; in particular, the training cost is considerably greater.  I
haven't yet tried more than one layer, so I can't provide a justified
comparison.

Cheers,
Gary.
```

 Brian Sheppard  writes:
```
I got the following description from Gerry Tesauro.

You precompute a table of blockade strengths.  A "blockade" is
considered to be the pattern of blocked points in a set of 12
consecutive points.  I use a table of 4096 bytes, where the strength of
a blockade is encoded in a single byte.

The strength of a blockade equals the number of rolls that cannot be
played fully.  For example, if you just hold one point 10 pips away,
then 6-4 and 5-5 cannot be played fully, so the blockade strength is 3.
If you added the point 6 pips away, then 2-2, 3-3, 6-6, 4-2, and 5-1
could not be played fully either, so the blockade strength is
3 + 7 = 10.

Quick suggestion: normalize these numbers by dividing by 36.0.  Keeping
all your neural network inputs in a roughly comparable range improves
learning rates.

Once you have a table of blockade strengths, you can use it to compute
two important measures.  One is the Actual blockade, which is the
strongest blockade that actually has an enemy man behind it.  Another is
the Potential blockade, which is the strongest blockade of all.  Maybe
you can come up with other uses of this table.

> I could use a really primitive board encoding to design a network with
> 60 inputs and 1,000 neurons that would evaluate just as fast.
> Does this really lose?

Well, the specific tradeoff you mentioned would lose because you cannot
provide a good state description for backgammon using only 60 inputs,
but you have a good question.

First off, the key insight of the Neurogammon project is that the
accuracy of the neural network increases when the inputs match crucial
domain concepts.  For example, in backgammon it is logically sufficient
simply to provide an array of 26 integers as inputs.  Unfortunately this
does not provide good discrimination, because the quality of a position
is a highly non-linear function of the input.

Gerry Tesauro's (and his coauthor's (Terry Sejnowski??)) insight is that
when you use boolean inputs for small numbers of men (e.g. one boolean
to represent having a White blot, one for 2 White men, etc.) and only
use an integer to represent "excess men", then the neural network learns
much faster and performs at a higher level.  Per point per color you
have Blot, Point, Builder, and Excess inputs (some representations even
have an input for having exactly 4 men on a point, but I haven't found
this useful except for the 6, 12, 13, and 19 points), so you have a
total of 25 * 8 = 200 inputs.  (You needn't encode the number of men off
the board, since that is a linear function of the others.)

Now to your question: a neural network with 200 inputs and 80 neurons
has 16,000 weights.  If you used only 26 inputs, you could have 615
neurons and still have the same number of weights.  How can this lose?

The key observation is that the 200 inputs are almost always zero, and
when an input is zero it contributes nothing to the calculation expense,
because you do not have to propagate its weights.  It follows that the
200-input network is very much faster than the 26-input network.

It turns out that the correct equivalent number of neurons for the
26-input network is 80.  Notice that in the group of 8 inputs that
encode the contents of a single point, we have at most one non-zero
input.  And if a point is empty, then we have no non-zero inputs.  It
follows that the number of non-zero inputs in the 200-input encoding is
exactly the same as the number of non-zero inputs in the 26-input
encoding.  So if you wished to be equivalent in speed, the 26-input
network must have 80 neurons.  (Actually a little less, because there
are also benefits to having inputs whose value is 1.0, since that avoids
a scalar multiplication in the hidden-node computation.)

Of course, actually building such a 26-input network would be a big step
backwards, because the 200-input network has a much richer input
encoding.
Since boolean inputs are usually zero, it costs relatively little to add
an input, and relatively more to add a neuron.  This suggests that
adding additional "feature recognizers" is pretty cheap.  Having said
the foregoing, I must report that for quite some time I haven't been
able to increase the strength of my network by adding new inputs,
whereas I have been able to increase its strength by increasing the
number of neurons.

> It seems like everybody's using a single hidden layer these days.
> How come?

There are a lot of practical problems with multi-layer networks.  The
backprop training algorithm that works so well in theory does not work
in practice, because the gradient for layers beyond the first is nearly
zero.  Having a gradient that small means that there is almost no
learning going on in the early layers.  It follows that you have to
"hack" your learning algorithm, and that will suck up a lot of your
time.

The common justification for multi-layer networks is that they
"represent richer features."  This isn't true in theory, of course,
since a single-layer network is a universal function approximator, but
in practice there may be functions that are easy to approximate using
sigmoids of sigmoids that are hard to approximate using sigmoids alone.
But in backgammon you aren't likely to find such functions.  The reason
is that the ability of a feature to discriminate between winning and
losing positions is a logistic function (i.e. A / (B + exp(C * x))).
It follows that a single-layer network contains all the necessary
richness.

From a computational perspective, instead of using deeper networks, why
not double the number of neurons in a single layer?  It will train
faster and probably play better.  And if you need extra input features,
why not code them by hand?

> I recall some mention in this forum of some theoretical studies
> suggesting that lambda should be roughly 1/branching-factor.  Does
> anyone have references for these?

This is in one of Sutton's papers.  Check out his papers online at
UMass's web site.

By the way, I do not recommend using non-zero lambdas.  Using
lambda == 0 results in a much faster, simpler program, and the training
quality is very nearly as good.  After all, what difference is there
between lambda == 0.0278 and lambda == 0?  Hardly any signal filters
through when lambda == 0.0278.  Why not just make lambda zero, and take
the benefits of simpler and faster code?

Warm Regards,
Brian Sheppard
```

### Programming

Adjusting to a weaker opponent  (Brian Sheppard, July 1997)
Anticomputer positions  (Bill Taylor+, June 1998)
BKG 9.8 vs. Villa  (Raccoon+, Aug 2006)
BKG 9.8 vs. Villa  (Andreas Schneider, June 1992)
BKG beats world champion  (Marty Storer, Sept 1991)
Backgames  (David Montgomery+, June 1998)
Blockading feature  (Sam Pottle+, Feb 1999)
Board encoding for neural network  (Brian Sheppard, Feb 1997)
Bot weaknesses  (Douglas Zare, Mar 2003)
Building and training a neural-net player  (Brian Sheppard, Aug 1998)
How to count plies?  (Chuck Bower+, Jan 2004)
How to count plies?  (tanglebear+, Mar 2003)
Ideas for improving computer play  (David Montgomery, Feb 1994)
Ideas on computer players  (Brian Sheppard, Feb 1997)
Introduction  (Gareth McCaughan, Oct 1994)
Measuring Difficulty  (John Robson+, Feb 2005)
Methods of encoding positions  (Gary Wong, Jan 2001)
N-ply algorithm  (eXtreme Gammon, Jan 2011)
Neural net questions  (Brian Sheppard, Mar 1999)
Pruning the list of moves  (David Montgomery+, Feb 1994)
Search in Trees with Chance Nodes  (Thomas Hauk, Feb 2004)
Source code  (Gary Wong, Dec 1999)
TD-Gammon vs. Robertie  (David Escoffery, June 1992)
Training for different gammon values  (Gerry Tesauro, Feb 1996)
Training neural nets  (Walter Trice, Nov 2000)
Variance reduction in races  (David Montgomery+, Dec 1998)
Variance reduction of rollouts  (Michael J. Zehr+, Aug 1998)
Variance reduction of rollouts  (Jim Williams, June 1997)
What is a "neural net"?  (Gary Wong, Oct 1998)
Writing a backgammon program  (Gary Wong, Jan 1999)