Warning: this and the next couple of posts are written from the point
of view of a mathematician. The overall aim is to point out a
technical problem with the way variance reduction is done to decide
which of two bots is stronger and by how much. This problem is
unlikely to be important in practice, but is a problem if you want a
rigorous statistical result. My explanation of the proposed solution
(and perhaps of the problem) may well be comprehensible by, and of
interest to, only `boffins'.
Recently, most of the posts about comparing bots have concentrated on
money play, running long sessions to estimate one bots expectation
against the other. But if you want a definite result, there is a
problem with this, due to very large cubes. As is fairly well known,
the expectation may not even be defined, and, even more likely, the
variance is probably infinite.
This may sound like a purely theoretical problem, but it could well be
a practical one. For example, (and I don't know the details about
this), supposedly an earlier version of Jellyfish misunderstood some
backgame positions so badly that it would double from behind. In any
case, it's easy to imagine that there are very rare positions where
one of the bots doubles from behind, in a situation which is more or
less stable for several moves. If the other bot redoubles in this kind
of position, very rare large cubes can result. (I've known a human to
double when closed out, with his other 14 men on the 1 and 2 points, I
suppose because he was `ahead in the race'; if he'd had the courage
of his convictions, a very large cube could have resulted
while I was bringing my spares round the board.)
For good bots, such a position is obviously unusual, but how unusual,
and how do we know? If you have in total 5000 games between the bots,
say, and this hasn't occurred, then you can say it probably doesn't
happen on average in more than 1 in 1500 games, say:
If in the very long run it occurred 1 in 1500 times, there would
be a 96.4% chance it happened in 5000 games (.964=1exp(5000/1500)).
When it does happen, how high does the cube get? It could easily be
high enough to dominate other equity differences between the bots, and
certainly to throw off any variance estimate by a significant amount.
Ok  you might say, we'll just limit the cube, to 16 say, if both bots
have settings to play with limited cubes. Or you could say, we'll let
them play, but estimate `expected result in games without very large
cubes', if you're happy with that instead. But this isn't very
satisfactory: a bot that mostly wins a bit, but very rarely lets you
take 1024 points off it is not a good money player in the long run!
But, when it comes to variance reduction, there is still a problem:
you can't really tell what the variance is, because the rare large
cubes may bump it up, and you don't know how often they happen. E.g.,
a 16 point result every 1500 games adds more than .1 to the variance,
which (I guess from a recent post of Zare) is comparable with the
apparent variance after reduction.
Match play has a big advantage: the results are bounded: calling a
win/loss +1/1, we know the variance is at most 1. This makes it much
easier (possible) to formulate rigorous statistical tests to show with
a certain confidence that one bot is better than the other.

Ok, so for rigorous comparisons between bots it seems a good idea to
look at match play. But variance reduction is still needed: I'm going
to work in units of %mwc (match winning chance), and to subtract 50,
so all results will be changes in %mwc from the initial position. In
these terms, the result of a match from the first player's point of
view is +/ 50. Thus, the unreduced variance of a single result is
close to 2500 and standard deviation (sd) to 50%mwc. (I'll keep
talking about variance estimates, but quoting sd=sqrt(variance) for
the numbers, as these are more informative.)
The difference between two good bots could well be around 1% in a
short match; to have a chance of detecting such a difference we need
to get the sd down to .6%mwc say (allowing 95% confidence as a
satisfactory result), which takes (50/.6)^2 > 6900 matches. To be
sure of detecting it we need quite a few more. (The confidence
interval with width .6% either side of the centre might happen to end
up with its center .6% away from the true value, so could still
include 0 if the true value is 1.2%.)
So let's use luck analysis to reduce the variance. This bit is
standard, but to briefly recap why this is fair, and because we'll
modify it later:
Let X be the result of a match, so X is a random variable which is +/
50. We want to estimate the average or expectation E(X), which is the
first player's edge in the long run. We can do this indirectly as
follows: let Y be any random variable known to have expectation
0. Then E(XY)=E(X)E(Y)=E(X), so estimate E(XY) instead. This helps
if XY has small variance. A good choice of Y is any reasonable
unbiased luck estimate, formed by, before each roll, deciding an
estimated equity for each possible roll, and taking the luck of the
actual roll as the equity of the actual roll minus the average.
Because we take real minus average over all rolls, Y is guaranteed to
have expectation 0, no matter which bot calculates it.
(Aside: the `luck adjusted results' reported by gnu when you analyse
your matches against it, say, are obtained in this way. If you get 32%
is doesn't really mean you had 32% chance of winning. The result is an
_unbiased estimator_ of your chance of winning: if you play many
matches (and play in the same way!), the average `luck adjusted
result' converges to your chance of winning. Of course the actual
result is also an unbiased estimator of your chance of winning, but
with larger variance; the average of the luck adjusted result will
converge faster.)
Here are some numbers calculated this way using gnubg (0.12, but it
doesn't really matter) for the variance reduction of the Jellyfish
light 3.5 vs Snowie 3.2 matches posted on Tony Lezard's web page. I've
used both 0ply and 1ply variance reduction.
Variance reduced results from Jellyfish's point of view.
#matches #moves 0ply: sd sd/sqrt(n1) interval 1ply: sd
sd/sqrt(n1) interval
1pt 1000 56593 10.77 0.34 [1.54,0.42] 9.09 0.29
[1.35,0.40]
3pt 100 9175 11.95 1.20 [3.18,0.77] 11.39 1.14
[3.37,0.40]
5pt 100 15770 15.08 1.52 [5.00,0.02] 14.20 1.43
[5.11,0.41]
7pt 200 41468 12.45 0.88 [3.65,0.74] 11.68 0.83
[3.58,0.85]
25pt 100 68159 12.95 1.30 [8.14,3.86] 12.32 1.24
[8.49,4.42]
All numbers in %mwc50, so, for example, the end of the first line
suggests that Jellyfish has at most 50+(0.40) = 49.6% chance against
Snowie 3 in a 1pt match. The intervals have 5% tails either side; this
was chosen because in this case it's sensible to look just at the
upper end, since it was already believed that Snowie is stronger.
(Note that 1ply luck analysis is better than 0ply, but not by very
much: it's probably still worth doing, as it's much quicker than
running the actual games, and does reduce the variance a bit.)
Note also that longer matches seem as good or maybe slightly better
for discriminating  the same number of moves seem to be needed to get
a (nonrigorously) significant result. Whether this is actually true
depends on where in the intervals the real results are. I find it
slightly surprising, as the games in long matches have different
importances  rising near the end of a close match  and I'd guess
games with equal importance give better results.
The confidence interval results are nonrigorous for the following
reason: the variances are estimates from the data, and could be
wrong. Also, for these numbers of matches, the distribution of the
mean does _not_ converge very well to a normal distribution.
(It is true that the centres of the intervals are unbiased estimators
of the true figures; we just don't know their variance.)
Just looking at the table, one can see that the standard deviations
are not very reliable: e.g., it's unlikely that the real number for
5pt matches is higher than for both 3pt and 7pt. Looking at the XY
numbers, with 0ply reduction there's an outlier of 74 (all others in
range [30,35]), which on it's own contributes 2 to the standard
deviation. There isn't enough data to tell whether such outliers have
frequency 1 in 100 or 1 in 400 or 1 in 200, say, so the standard
deviation is uncertain.
While these and similar results are not rigorously justifiable, I do
believe they are correct, at least after modest widening of the
confidence intervals.

So, now we come to the rigorous part: X is the distribution of the
result of a match (of a certain length) between the bots, so X is
+/50%mwc with certain probabilities, and the expectation E(X) is what
we wish to calculate. We also have a way of estimating the luck Y in a
match which is unbiased, and having done variance reduction, we have n
independent samples from the distribution Z=XY. How does one make a
valid confidence interval for the expectation E(Z)=E(X)?
Just using a central limit approximation with the estimated variance
of Z is not good enough: (a) the estimated variance could be out by a
fair bit, and (b) for smallish numbers of matches, the convergence to
a normal distribution is not good enough  at least we don't know it
is without some assumptions on the distribution.
Actually, things are much worse: although X is bounded, XY need not
be, and might have infinite variance! (Actually, it can't be infinite,
because of results about matches terminating, but it could be very
large.) It could be that the analysis program misunderstands some
position, and thinks one bot throws a joker when it throws an average
roll, so the luck estimate is way off the real luck. Or it could be
that one bot does throw a joker but doesn't realize it, and plays a
neutral move. Either of these could happen many times in a row. So
we'd like to modify Y so that it's bounded: here's how we do it.
Let's fix a bound on Y in advance, say +/70%mwc. (This bound should
be at least a bit more than 50%, because the real luck in a match
between good players is near 50%.) Before each roll, decide on luck
figures for each possible roll as before. If none of the possible
figures added to Y (or rather, the running total luck estimate which
at the end of the match will give us Y) takes Y out of range, add the
figure for the actual roll. Otherwise do nothing. (Or better, add a
multiple of the actual figure, chosen not depending on the actual
roll, so that for any possible roll, the result is in range.)
This way, Z=XY is guaranteed to be bounded, by +/120%mwc, say.
Since at each stage the increment we add to Y has mean zero (it is the
old mean zero increment multiplied by a factor not depending on the
actual roll), the modified Y still has mean zero, so we still have
E(Z)=E(X).
There's still a problem estimating the variance: large Z values may
seem unlikely, but the following definitely can't be ruled
out. Suppose your total sample is n matches. Maybe, every n/3 matches,
one bot, 'the fish' gets confused, and throws an almost won
position. In this case, even if the luck estimate is good, because the
fish gets a result of 50% when it was lucky enough to get a result of
50%, Z is around 100%. Thus, even though Z may seem to have a nice
concentrated distribution, we can't rule out that really it takes
values near the maximum/minimum once every n/3 times.
(Here n/3 comes in because exp(3)=4.98% which is a nonnegligible
probability, so even having observed no +/100s in n matches, this
isn't sufficient evidence that they don't occur one in n/3.)
So how do we get a confidence interval? I decided to construct a
statistical test valid for any distribution on a given interval [M,M]
(here M=120%mwc). (Douglas Zare/anyone else: do you know if anyone
else has done this for bg/ if there's a standard method?) Knowing
nothing about the distribution, we use the `order statistics'.
Say Z_1,...,Z_n are the n samples from Z=XY sorted in increasing order.
Let's choose increasing probabilities p_1,...,p_n. Suppose that
the following hypothesis H holds:
H: at least a fraction p_i of the (unknown) distribution of Z is <= Z_i.
This gives a step function lower bound on the distribution function of
Z, so the mean of Z is at most
B = p_1*Z_1+sum(Z_i*(p_ip_{i1}),i=2..n)+M*(1p_n)
 the highest mean distribution satisfying the constraint has weight
p_1 at Z_1, p_ip_{i1} at Z_i, and 1p_n at M.
How do we chose the p_i? Well, the probability that H holds
essentially doesn't depend on the unknown distribution: it's always at
least the probability that U_i>=p_i where U_1,...,U_n are sorted
independent samples from a uniform distribution on the interval
[0,1]. (It's exactly equal to this probability if Z has a continuous
distribution.) So we can choose any p_i for which the latter
probability is 95%, say. What's a good choice? Depends on the
distribution of Z: if we knew this, we could optimize the expectation,
say, of the bound B. So what we do in practice is take some of the
data (e.g., 100 1pt matches) and use that to guess (by linear
interpolation, say) a distribution function for Z. Then we construct a
test that works well for this distribution, and apply it to the
remaining data. The optimization process is simple in theory, but not
so easy in practice. Alex Selby and I spent some time programming this
and getting it to converge reasonably well. (The last part was mostly
Alex.)
The description above gives a method for calculating one end (the
upper one) of the confidence interval; obvious modifications give the
lower end.
Results  all match lengths together.
As the primary rigorous test, I decided to merge all the data, as it
seemed likely to be borderline whether there is enough data to produce
an interval excluding zero. A fussy person might notice that the
procedure requires independent samples from a single distribution, not
a mixture. This was taken into account (choosing a total number 1172
of matches, and for each match independently choosing the length to be
1,3,5,7,25 pts with probabilities 10,1,1,2,1 over 15, and hoping that
in the data set on Tony Lezard's page there were enough matches of
each length).
Final rigorous result: for a match length chosen at random as above, a
90% confidence interval for Jellyfish's edge is [3.67%mwc,0.55%mwc].
This used a bound on Y of 70%mwc, obtained by optimizing the
expectation of the upper end of the interval, using a distribution for
Z guessed from the data not used in the actual test.
In other words (assuming the nonrigorous results are correct, which I
believe) there isn't quite enough data to show rigorously by our
method that Snowie 3.2 is better than Jellyfish 3.5.
Results  individual match lengths
Here are the results for the individual match lengths. Note that the
self contained procedure of using 10%, say, of the data to estimate
the distribution to optimize for only makes much sense for the 1pt
matches. For the longer matches, the distribution for each
match length has been used to optimize for the next one.
length #matches #moves bound on Y used 90% confidence interval
1pt 1000 56593 60 [3.47,1.41]
3pt 100 9175 56 [7.83,7.63]
5pt 100 15770 53 [16.7,5.40]
7pt 200 41468 40 [11.0,9.22]
(7pt 200 41468 55 [9.47,4.58])
25pt 100 68159 60 [15.5,2.07]
For 7pt, 40 is a stupid value suggested by optimizing using the data
for 5pt matches. 55 is a guess at a better value.
(Here are the 1ply nonrigorous results again:
1pt 1000 56593 [1.35,0.40]
3pt 100 9175 [3.37,0.40]
5pt 100 15770 [5.11,0.41]
7pt 200 41468 [3.58,0.85]
25pt 100 68159 [8.49,4.42]
)
Note that for these rigorous tests, short matches seem to be better:
basically, as discussed above, there's an unavoidable 300/n
contribution to each end of the confidence interval, from the
possibility of maximum/minimum results that we happen not to see. This
compares with the sd/sqrt(n) contribution which occurs in both the
rigorous and simple (ordinary CLT) methods. For few long matches, the
first term dominates. Also (a much more minor point), for many short
matches we have a better guess at the distribution to start with. (The
variability and smallness of the bounds on Y for lengths other than 1
suggests to me that the guessed distribution I used was not very
reliable, and that some kind of a priori guess might have worked
better.)
Finally, a theoretical point:
in the large n limit, the method here does not work very well: using a
central limit theorem (CLT) with error terms, for very large n one can
first get an upper bound on the variance with high confidence
(applying the CLT to the squares of the samples, using the fourth
power of M as a bound on their variance), which will be close to the
real variance. Then one can apply the CLT to the samples
themselves. Thus, for n large, the confidence interval will have
asymptotically width (Phi^{1}(.95)Phi^{1}(.05))*sd/sqrt(n) that one
expects. Our method will give asymptotically larger width by some
constant factor.
Question: given an unknown distribution on [M,M], construct a
confidence interval for the mean based on n samples in a way that
(ideally) is optimal if one has a correct guess for the distribution,
or at least works well for moderate n and is asymptotically optimal.
Oliver Riordan
P.S. If you wish to Email my directly about this, my address is
initial.surname@dpmms.cam.ac.uk
