On or about July 18, Jean Burns sent a message to jcs-online, reading in
part:
I think one of the most exciting findings in parapsychology has been the
development of an experimental test by Ed May and co-workers (May et al.,
1995) which can determine whether psi results have been produced by
psychokinesis (PK) or extrasensory perception (ESP).
[...]
The above analysis has been applied to a large number of experiments using
random number generators, in which correlations with operator intention
demonstrated the presence of psi (May et al., 1995). However, the
analysis of May et al. showed that PK was *not* present.
[...]
May, E.C., Utts, J.M., and Spottiswoode, S.J.P. (1995). Decision
Augmentation Theory, Journal of Scientific Exploration, 9(4), 453-488.
Presumably the strong statement that May et al. "showed that PK was
*not* present" is driven primarily by the 8.6-sigma refutation quoted in
the abstract. May, responding to a post of mine, wrote:
York suggested that I quote an 8-sigma effect based upon a meta analysis
of all the RNG data.
Wrong. In the abstract of our JP paper we attribute an 8.6 sigma favor
of the influence model to an analysis of a large number of individual
button presses of PEAR data.
I apologize for my misinterpretation of the statements made by May
et al.; it is very clear from the relevant paragraph (p. 467 of the JSE
reference given above) that the 8.6-sigma result comes from examining
the data generated by one specific operator at PEAR, at two different
sequence lengths.
However, I feel compelled to point out that the figures given in that
very same paragraph of the JSE article also show that the 8.6-sigma
figure is, to say the least, suspect, and the use of that value without
qualification or caveat in the abstract of the article, and in
subsequent discussions, is distinctly questionable.
In that paragraph, the authors demonstrate the calculation of the 8.6
figure from the observed Z^2 data in the two subsets used. It is derived
by using the observed effect size in the short-sequence data as a
prediction for the effect in the long-sequence data. However, they
also point out that if one performs the calculation in the other
direction, using the long-sequence observation to predict an effect for
the short-sequence data, the t-score is only 2.398, rather than 8.643.
Since there is no obvious quality that makes one dataset the
"prediction" and the other the "test", it is decidedly not clear why
one should prefer one value over the other, and the fact that they are
in such stark contrast (an overwhelmingly powerful refutation vs. a
moderately convincing one) suggests that there is something worrisome
about such a strong dependence on an essentially arbitrary choice of
analysis method.
I will therefore spend a couple of paragraphs showing the application of
a standard, symmetrical test to the data used by May et al., restricting
myself to the data values actually published in the paper lest I be
accused of generating a red herring by throwing some extra data into the
pot. Fortunately, the raw data used by May et al. to derive the T-scores
above are also given on page 467 of JSE 9/4. The short-sequence
data involve 5918 trials at 200 bits per trial; the Stouffer Z for
the presence of an anomalous effect is 3.37, the observed Z^2 value
is 1.063 +/- 0.019. The long-sequence data comprises 597 trials of 10^5
bits each; the Stouffer Z is 2.45, the observed Z^2 is 1.002 +/- 0.050.
Given these figures, the most natural evaluation for consistency with a
hypothesis would seem to be to treat both observations as empirical
measurements of a model parameter, and construct a T- or Z- test against
the null hypothesis that the parameter has the same value. The first
line of May, Utts, and Spottiswoode's Table 1 (p. 461) gives the
expected value of Z^2 in terms of model parameters. For the PK or
"micro-AP" model, E[Z^2] = 1 + epsilon^2 *n; for the DAT model, it is
simply the sum mu(z)^2 + sigma(z)^2. The fact that the latter is a sum
of two unknown parameters of the model is essentially irrelevant, since
the same sum is measured in both cases.
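Written out explicitly, the two model predictions from that table line
look as follows. This is merely a restatement in code form of the
formulas just quoted; the function and parameter names are my own:

  # Expected mean Z^2 under each model (first line of Table 1, p. 461).
  # epsilon_sq, mu_z, sigma_z are model parameters; n is bits per trial.
  def expected_zsq_microap(epsilon_sq, n):
      return 1 + epsilon_sq * n

  def expected_zsq_dat(mu_z, sigma_z):
      return mu_z**2 + sigma_z**2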
Calculating from the observed figures listed for Z^2 (we're back on p.
467 again), one finds that for n=200 data epsilon^2*n = 0.063+/-0.019;
since n=200, epsilon^2 = (3.15+/-0.95)x10^(-4). (I am keeping more
significant digits than I am entitled to in the intermediate figures, to
try to avoid too much accumulation of roundoff error.) For the
long-sequence data, one finds epsilon^2*n = 0.002+/-0.050; since n=10^5,
this observation gives epsilon^2 = (2+/-50)x10^(-8). Any hand calculator
with a square root key will allow the reader to verify that this gives a
T score of 3.3 against the micro-AP hypothesis. Both observations are
treated symmetrically in this calculation, rather than one being used
to predict the other.
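For anyone who wants to check that arithmetic, the following short
Python sketch reproduces the calculation from the published figures. The
variable names and the simple error propagation are mine, not anything
taken from the May et al. paper:

  from math import sqrt

  # Published figures from p. 467: observed Z^2 +/- error, bits per trial.
  short_zsq, short_err, short_n = 1.063, 0.019, 200    # short sequences
  long_zsq,  long_err,  long_n  = 1.002, 0.050, 1e5    # long sequences

  # Under micro-AP, E[Z^2] = 1 + epsilon^2 * n, so each dataset yields an
  # estimate of epsilon^2 with a propagated uncertainty.
  eps2_short, eps2_short_err = (short_zsq - 1) / short_n, short_err / short_n
  eps2_long,  eps2_long_err  = (long_zsq - 1) / long_n,  long_err / long_n

  # Two-sample T against the null that epsilon^2 is the same in both.
  t_num = eps2_short - eps2_long
  t_den = sqrt(eps2_short_err**2 + eps2_long_err**2)
  print(t_num / t_den)   # about 3.3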
For the DAT model, the unknown parameter is simply equal to the
observed Z^2, and so the T test is simply the comparison of
1.063+/-0.019 against 1.002 +/- 0.050, yielding T=1.14.
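The corresponding check for the DAT comparison is even shorter, since
the observed Z^2 values are compared directly (again, a sketch using
only the figures published on p. 467):

  from math import sqrt

  # Direct comparison of the two observed Z^2 values under DAT.
  t_dat = (1.063 - 1.002) / sqrt(0.019**2 + 0.050**2)
  print(t_dat)   # about 1.14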
I am not presenting degrees of freedom for any of the T scores, as they
have so many that they are for all practical purposes equivalent to Zs.
The two-sample T-test is a straightforward technique that I have found
in several standard references on statistics. The predict-and-compare
process used by May et al. to get T=8.6 (or T=2.4, depending on which
direction you choose), on the other hand, is not one that I have ever
before seen applied to such a simple evaluation. I therefore contend
that the data examined on p. 467 of the JSE reference above amount, by
standard and well-understood statistical tests, to a T=3.3 result, not
to a T=8.6 result, and that people should stop talking about an
eight-sigma refutation of PK models for observed anomalies.
York Dobyns
ydobyns@phoenix.princeton.edu