Nick Kravitz writes:
Brett,
I am not sure if you are still watching this thread, but I did not see a
response to your request to test randomness in your results. I am a
quantitative analyst (quant) and decent backgammon player (at least I like
to think so) and am giving my opinion on your request for randomness
analysis below:
There are a few statistical methods to test for randomness. I have used
Pearson's chi-squared test, which is probably the best known and tested
method for doing so. (see
http://en.wikipedia.org/wiki/Pearson's_chisquared_test)
Before I give the results, here are some general comments on statistical
testing for people without the requisite background. (You can also find this
in any text on statistical testing or here:
http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
In the same way that we cannot poll an entire voting population to predict
election results, we cannot roll dice an infinite number of times to conclude
definitively whether they are fair. As such, statistical tests are not
foolproof and are set up to acknowledge the possibility of an incorrect
conclusion. For example, if we are testing the randomness of a single die
within the dice tube by rolling it 500 times, we might get the same number
on all tosses and conclude the dice tube was loaded. However it is still
possible (but extremely unlikely) that the tube was indeed fair and we
simply happened to be unlucky. In the event we roll 500 of the same number,
we can only state there is strong evidence (but no proof) we were rolling
with an unfair (nonrandom) dice tube. Likewise, our dice tube could be
loaded to roll more 1's than it should: for example, 1's with probability
50% and 2, 3, 4, 5, and 6 each with probability 10%. However when testing
this loaded dice tube, we might roll approximately equal frequencies of
each number by chance, in which case we would incorrectly conclude the tube
was fair when in fact it was not. (side note: Dewey vs Truman was a famous
example of a statistical poll failing
http://en.wikipedia.org/wiki/Dewey_Defeats_Truman)
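To put a number on "extremely unlikely" in the 500-identical-rolls example, here is a minimal Python sketch. It assumes a fair six-sided die and independent rolls; the variable names are illustrative only. The probability is far too small to store as an ordinary floating-point number, so the sketch works with its base-10 logarithm instead.

```python
import math

ROLLS = 500
FACES = 6

# The first roll can be any face; each of the remaining 499 rolls must
# match it, so P(all identical) = (1/6) ** 499 for a fair die.
# That underflows a float, so compute log10 of the probability instead.
log10_prob_all_same = -(ROLLS - 1) * math.log10(FACES)

print(f"P(all {ROLLS} rolls identical | fair die) ~ 10^{log10_prob_all_same:.1f}")
```

The exponent comes out a little below -388: possible in principle, but not something anyone would ever expect to see from a fair tube.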
By convention, statistical tests are most commonly set up to reject the
null hypothesis when it is in fact true with probability 5%, although this
value is arbitrary and purely conventional. We simply need some threshold
to start being suspicious of nonrandomness. Sometimes a truly random
process produces results that look nonrandom; this is called a type I
error, or false positive, and is equivalent to the first case above
(concluding the dice tube is nonrandom when in fact it is fair). The
second case above (concluding a loaded tube is fair) is a type II error,
or false negative.
The output of running the test on a set of generated or observed
frequencies is a "p-value", which can be interpreted as the probability of
obtaining a test statistic at least as extreme as the one that was actually
observed, assuming that the null hypothesis is true (in our case, that the
underlying process is indeed random). http://en.wikipedia.org/wiki/Pvalue
By construction, we would expect that if we run many experiments of
throwing fair dice, the p-values would be uniform between 0 and 1, so that
given a truly random process, each experiment has a 5% chance of returning a
false positive. (Equivalently, our p-value would be between 0 and 0.05 with
5% probability.)
Before I ran the test on the numbers from the Meyer website, I first ran
the test on my own numbers, which I generated from a computer program which
I am certain can produce unbiased random numbers. I generated 1000
experiments of 500 die rolls each. In most of the experiments I got numbers
that looked random enough (for example, 94, 86, 89, 79, 77, 75), which
returned a p-value of 0.642. However, around 5% of the time (or around 50
times out of 1000) the numbers looked nonrandom enough to trip a false
positive (for example, 59, 84, 73, 98, 88, 98), which returned a p-value of
0.016. It is actually a good thing to get some false positives; this
indicates the test is working as expected. If all 1000 experiments produced
p-values above 0.05, I would be suspicious that the underlying random
process was not working correctly.
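The sanity check described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the program actually used for the post: the function names are hypothetical, Python's `random` module stands in for the unspecified generator, and the seed is arbitrary. To avoid any third-party dependency, the p-value uses the standard closed-form upper tail of the chi-squared distribution with 5 degrees of freedom (six faces minus one).

```python
import math
import random

def chi2_pvalue_fair_die(counts):
    """Pearson chi-squared p-value for six face counts against a fair die."""
    if len(counts) != 6:
        raise ValueError("expected exactly six face counts")
    expected = sum(counts) / 6
    stat = sum((c - expected) ** 2 / expected for c in counts)
    # Closed-form upper tail of chi-squared with 5 degrees of freedom.
    return (math.erfc(math.sqrt(stat / 2))
            + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2) * (1 + stat / 3))

def simulate_false_positives(experiments=1000, rolls=500, alpha=0.05, seed=1):
    """Count simulated fair-die experiments whose p-value falls below alpha."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    hits = 0
    for _ in range(experiments):
        counts = [0] * 6
        for _ in range(rolls):
            counts[rng.randrange(6)] += 1
        if chi2_pvalue_fair_die(counts) < alpha:
            hits += 1
    return hits

# The two sets of counts quoted in the text.
p_typical = chi2_pvalue_fair_die([94, 86, 89, 79, 77, 75])
p_suspicious = chi2_pvalue_fair_die([59, 84, 73, 98, 88, 98])
print(f"typical counts: p = {p_typical:.3f}")
print(f"suspicious counts: p = {p_suspicious:.3f}")
print(simulate_false_positives(), "of 1000 fair experiments fell below 0.05")
```

The first set of counts gives a p-value of about 0.642 and the second about 0.017, in line with the figures quoted above (0.016 if truncated rather than rounded). The false-positive count from the simulation should land somewhere near 50, though the exact number depends on the seed.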
Next, I applied the test to the numbers on the Meyer website, which
provided results for a total of 12 experiments, one for each starting
number for each die. If the rolls were truly random, we would expect the
results to look similar to the process described above that we know to be
random; i.e. p-values approximately uniform between 0 and 1 (in particular,
about half the p-values above 0.5 and half below, with maybe 1
observation close to or below the 0.05 threshold of nonrandom suspicion).
The p-values I calculated ranged from 0.55 (least random, Blue 5) to 0.997
(most random, Red 1). These results look too good to be true. In fact, if
we rolled dice from a process we know in advance to be purely random (for
example, rolling a precision die, or having a computer generate random
numbers for us) the probability we would get results at least this good by
pure chance would be 0.0000143 (equivalently, about 1 in 70,000). To put this
into backgammon perspective, there would be a better chance of your first 3
rolls all coming out double sixes (a mere 1 in 47,000).
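The backgammon comparison is easy to check, and a rough cross-check of the "too good to be true" claim is possible too. The post does not say how the 0.0000143 figure was derived, so the second calculation below is only a cruder bound under an assumption made for this sketch: 12 independent experiments with uniform p-values, using nothing but the smallest reported p-value (0.55). Variable names are illustrative.

```python
from fractions import Fraction

# Three consecutive double sixes: each roll is double six with probability 1/36.
p_three_double_sixes = Fraction(1, 36) ** 3
print(f"three double sixes: 1 in {p_three_double_sixes.denominator:,}")

# Crude cross-check (assumption, not the post's method): if the 12 p-values
# are independent and uniform on [0, 1], the chance that every one of them
# is at least 0.55 is (1 - 0.55) ** 12.
p_all_at_least_055 = (1 - 0.55) ** 12
print(f"all 12 p-values >= 0.55: about 1 in {1 / p_all_at_least_055:,.0f}")
```

This cruder bound comes out less extreme than the quoted figure, since it ignores how far above 0.55 most of the values sit, but it still points the same way: results this uniformly "good" would be very unusual for a genuinely random process.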
I do not know how the experiment was run. There is nothing to indicate
that the results were doctored (or that the most random-looking results
were "selected" from a larger set of experiments), but because the results
look suspect, I would recommend having them resampled independently by
someone without an interest in the outcome of the test.
