Forum Archive :
Programming
I've been reading some of the papers on computer backgammon, and I have
some questions for the neuralistas:
 I found the following description of Berliner's blockading feature in
his IJCAI '77 paper:
> For each combination of zero to seven blockading points at a distance
> of 1 to 12 spaces in front of a man, we computed the number of rolls
> that could legally be played by the blockade runner.
Is there a fuller description somewhere in the literature?
 The trend in NN design for backgammon seems strongly toward nets with
lots of inputs, and the Neurogammon paper has some nice comparative
studies indicating that having various input features is lots better
than not having them. But it seems to me that there is a tradeoff
available here. If I have a network with 400 inputs and 150 neurons,
that's 60,000 weights. I could use a really primitive board encoding to
design a network with 60 inputs and 1,000 neurons that would evaluate
just as fast. Does this really lose?
 Speaking of network architecture, it seems like everybody's using a
single hidden layer these days, although Tesauro found that two
layers worked a bit better for Neurogammon. How come?
 I recall some mention in this forum of some theoretical studies
suggesting that lambda should be roughly 1/branchingfactor. Does
anyone have references for these?
Sam


Gary Wong writes:
> Berliner's blockading feature
My copies of the papers are at work so I can't read them until Monday, but
I once implemented what I thought was a reasonable interpretation at the
time. The interpretation was this:
int Escapes[ 0x1000 ], i, c, n0, n1;
for( i = 0; i < 0x1000; i++ ) {
c = 0;
for( n0 = 0; n0 <= 5; n0++ )
for( n1 = 0; n1 <= n0; n1++ )
if( !( i & ( 1 << ( n0 + n1 + 1 ) ) ) &&
!( ( i & ( 1 << n0 ) ) && ( i & ( 1 << n1 ) ) ) )
c += ( n0 == n1 ) ? 1 : 2;
Escapes[ i ] = c;
}
ie. for each of the 2^12 points (i) in front of a chequer (I don't worry
about limiting them to 7 points occupied  illegal configurations will
never be referenced), generate the 21 dice combinations (n0 and n1) and
count (c) the number of combinations that can be played (twice for
nondoublets); to qualify, the sum (n0 + n1) and either n0 or n1 must be
playable. ("Playable" is indicated by a 0 bit the appropriate number of
bits into i.) Perhaps this interpretation isn't strictly correct for
doublets, but I suspect it's close enough.
The rest of the code is available if you're interested, at:
ftp://ftp.cs.arizona.edu/people/gary/gnubg0.0.tar.gz
(Look in the file eval.c.)
> I could use a really primitive board encoding to design a network with
> 60 inputs and 1,000 neurons that would evaluate just as fast.
> Does this really lose?
Personally I would prefer to add input nodes in preference to hidden units
(for an equivalent number of weights) as long as the new inputs are
providing information that isn't easily derived from the others. I'm
certain that 60 inputs is too small (that's barely enough to cover one
input per player per point); if some information is useful in evaluating
the strength of a position then it is more efficient to provide that
information at the inputs if you can, rather than supplying extra hidden
units to derive it.
> It seems like everybody's using a single hidden layer these days.
> How come?
My impression is that you take a big performance hit in adding more layers;
in particular the training cost is considerably greater. I haven't yet
tried more than one layer so I can't provide a justified comparison.
Cheers,
Gary.


Brian Sheppard writes:
I got the following description from Gerry Tesauro.
You precompute a table of blockade strengths. A "blockade" is considered to
be the pattern of blocked points in a set of 12 consecutive points. I use a
table of 4096 bytes, where the strength of a blockade is encoded in a
single byte.
The strength of a blockade equals the number of rolls that cannot be played
fully. For example, if you just hold one point 10 pips away then 64 and
55 cannot be played fully, so the blockade strength is 3. If you added the
point 6 pips away, then 22, 33, 66, 42, and 51 could not be played
fully, so the blockade strengh is 3 + 7 = 10.
Quick suggestion: normalize these numbers by dividing by 36.0. Keeping all
your neural network inputs in a roughly comparable range improves learning
rates.
Once you have a table of blockade strengths, you can use it to compute two
important measures. One is the Actual blocking, which is the strongest
blockade that actually has an enemy man behind it. Another is the Potential
blockade, which is the strongest blockade of all.
Maybe you can come up with other uses of this table.
> I could use a really primitive board encoding to design a network with
> 60 inputs and 1,000 neurons that would evaluate just as fast.
> Does this really lose?
Well, the specific tradeoff you mentioned would lose because you cannot
provide a good state description for backgammon using only 60 inputs, but
you have a good question.
First off, the key insight of the Neurogammon project is that the accuracy
of the neural network increases when the inputs match crucial domain
concepts. For example, in backgammon it is logically sufficient simply to
provide an array of 26 integers as inputs. Unfortunately this does not
provide good discrimination because the quality of a position is a highly
nonlinear function of the input.
Gerry Tesauro's (and his coauthor's (Terry Sejnowski??)) insight is that
when you use boolean inputs for small numbers of men(e.g. one boolean to
represent having a White blot, one for 2 White men, etc.) and only use an
integer to represent "excess men" then the neural network learns much
faster and performs at a higher level.
Per point per color you have Blot, Point, Builder, and Excess inputs (some
representations even have an input for having exactly 4 men on a point, but
I haven't found this useful except for the 6, 12, 13, and 19 points) so you
have a total of 25 * 8 = 200 inputs. (You needn't encode the number of men
off the board, since that is a linear function of the others.)
Now to your question: a neural network with 200 inputs and 80 neurons has
16000 weights. If you used only 26 inputs you could have 615 neurons and
still have the same number of weights. How can this lose?
The key observation is that the 200 inputs are almost always zero, and when
an input is zero it contributes nothing to the calculation expense because
you do not have to propagate its weights. It follows that the 200 input
network is very much faster than the 26 input network.
It turns out that the correct equivalent number of neurons for the 26input
network is 80. Notice that in the group of 8 inputs that encode the
contents of a single point we have at most one nonzero input. And if a
point is empty then we have no nonzero inputs. It follows that the number
of nonzero inputs in the 200input encoding is exactly the same as the
number of nonzero inputs in the 26input encoding. So if you wished to be
equivalent in speed, the 26input network must have 80 neurons. (Actually a
little less, because there are also benefits to having inputs whose value
is 1.0, since that avoids a scalar multiplication in the hiddennode
computation.)
Of course, actually buidling such a 26input network would be a big step
backwards because the 200input network has a much richer input encoding.
Since boolean inputs are usually zero, it costs relatively little to add an
input, and relatively more to add a neuron. This suggests that adding
additional "feature recognizers" is pretty cheap.
Having said the foregoing, I must report that for quite some time I haven't
been able to increase the strength of my network by adding new inputs,
whereas I have been able to increase its strength by increasing the number
of neurons.
> It seems like everybody's using a single hidden layer these days.
> How come?
There are a lot of practical problems to multilayer networks. The backprop
training algorithms that works so well in theory do not work in practice
because the gradient for layers beyond the first is nearly zero. Having a
gradient that is so small means that there is almost no learning going on
in early levels. It follows that you have to "hack" your learning
algorithm, and that will suck up a lot of your time.
The common justification for multilayer networks is that they "respresent
richer features." This isn't true in theory, of course, since a
singlelayer network is a universal function approximator, but in practice
there may be functions that are easy to approximate using sigmoids of
signmoids that are hard to approximate using sigmoids alone.
But in backgammon you aren't likely to find such functions. The reason is
that the ability of a feature to discriminate between winning and losing
positions is a logistic function (i.e. A/(B + exp(C * x)). It follows that
a singlelayer function contains all the necessary richness.
From a computational perspective, instead of using deeper networks why not
double the number of neurons in a single layer? It will train faster and
probably play better. And if you need extra input features, why not code
them by hand?
> I recall some mention in this forum of some theoretical studies
> suggesting that lambda should be roughly 1/branchingfactor. Does
> anyone have references for these?
This is in one of Sutton's papers. Check out his papers online at UMass's
web site.
By the way, I do not recommend using nonzero lambdas. Using lambda == 0
results in a much faster, simpler program and the training quality is very
nearly as good. After all, what difference is there between lambda ==
0.0278 and lambda == 0? Hardly any signal filters through when lambda ==
0.0278. Why not just make lambda zero, and take the benefits of simpler and
faster code?
Warm Regards,
Brian Sheppard




Programming
 Adjusting to a weaker opponent (Brian Sheppard, July 1997)
 Anticomputer positions (Bill Taylor+, June 1998)
 BKG 9.8 vs. Villa (Raccoon+, Aug 2006)
 BKG 9.8 vs. Villa (Andreas Schneider, June 1992)
 BKG beats world champion (Marty Storer, Sept 1991)
 Backgames (David Montgomery+, June 1998)
 Blockading feature (Sam Pottle+, Feb 1999)
 Board encoding for neural network (Brian Sheppard, Feb 1997)
 Bot weaknesses (Douglas Zare, Mar 2003)
 Building and training a neuralnet player (Brian Sheppard, Aug 1998)
 How to count plies? (Chuck Bower+, Jan 2004)
 How to count plies? (tanglebear+, Mar 2003)
 Ideas for improving computer play (David Montgomery, Feb 1994)
 Ideas on computer players (Brian Sheppard, Feb 1997)
 Introduction (Gareth McCaughan, Oct 1994)
 Measuring Difficulty (John Robson+, Feb 2005)
 Methods of encoding positions (Gary Wong, Jan 2001)
 Nply algorithm (eXtreme Gammon, Jan 2011)
 Neural net questions (Brian Sheppard, Mar 1999)
 Pruning the list of moves (David Montgomery+, Feb 1994)
 Search in Trees with Chance Nodes (Thomas Hauk, Feb 2004)
 Source code (Gary Wong, Dec 1999)
 TDGammon vs. Robertie (David Escoffery, June 1992)
 Training for different gammon values (Gerry Tesauro, Feb 1996)
 Training neural nets (Walter Trice, Nov 2000)
 Variance reduction in races (David Montgomery+, Dec 1998)
 Variance reduction of rollouts (Michael J. Zehr+, Aug 1998)
 Variance reduction of rollouts (Jim Williams, June 1997)
 What is a "neural net"? (Gary Wong, Oct 1998)
 Writing a backgammon program (Gary Wong, Jan 1999)
From GammOnLine
Long message
Recommended reading
Recent addition

 
