Response To Critique of Dream Investigation Results
Minecraft Speedrunning Team, December 2020
4 Including all 11 streams

Dream's response paper notes that:

    However, as is discussed throughout this document, choosing to put a break point between the streams after seeing the probabilities would require including a correction for the bias of knowing this result.

This implies that we did not correct for this bias, but we did, as per section 8.2 in our initial paper. Dream's response paper concludes that when including all 11 streams in the analysis, there is "no statistically significant evidence that Dream was modifying the probabilities". This result is expected and meaningless, as Dream is only accused of using a modified game for the last 6 streams; including all streams dilutes the data, yielding inconsistent results.
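To illustrate the dilution numerically, the Julia sketch below compares the binomial tail probability for 6 elevated streams alone against all 11 streams pooled. The per-stream counts are hypothetical and chosen only to show the direction of the effect; the unmodified pearl barter rate of 20/423 is taken from our original paper.

    using Distributions

    p0     = 20/423   # unmodified pearl barter rate from our original paper
    trades = 40       # hypothetical barters per stream
    k_mod  = 12       # hypothetical pearl barters per modified stream
    k_fair = 2        # hypothetical pearl barters per unmodified stream

    # Tail probability for the 6 allegedly modified streams alone:
    p6  = ccdf(Binomial(6 * trades, p0), 6 * k_mod - 1)

    # Tail probability when 5 unmodified streams are pooled in:
    p11 = ccdf(Binomial(11 * trades, p0), (6 * k_mod + 5 * k_fair) - 1)

    println("last 6 streams: ", p6)
    println("all 11 pooled:  ", p11)   # orders of magnitude less extreme

Both tail probabilities remain small here, but pooling the unmodified streams moves the result many orders of magnitude toward insignificance, which is the dilution described above.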
5 Correction Across Runners

The rebuttal paper states:

    In Section 8.3, they claim that their calculation of p is for a runner within their entire speedrunning career. This is presumably based on the argument from Section 8.2 that they have already corrected for every possible subset of streams... Further, that correction was based on choosing 6 of 11 livestream events from Dream, suggesting that their definition of "career" is 11 multi-hour livestream events comprising about 50 runs.

This is incorrect. The p-value this process generates is the probability that results as extreme as Dream's are obtained if one chooses the most extreme sequence of streams from a runner's entire streaming career. The choice of 11 is due only to the fact that this happens to be the number of times Dream has streamed speedrun attempts; to calculate that value for a different runner, you would use the number of times they had streamed instead of 11.
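The Julia sketch below (not our paper's exact procedure; the stream sizes are hypothetical) illustrates why such a correction must be calibrated against a whole career: scanning every contiguous window of an honest runner's streams and keeping the most extreme tail probability produces significant-looking minima far more often than the nominal threshold suggests.

    using Random, Statistics, Distributions

    # One simulated honest career: `nstreams` streams with `trades` barters
    # each at the unmodified rate p0. Scan every contiguous window of streams
    # and keep the smallest binomial tail probability P(X >= observed).
    function min_window_pvalue(nstreams, trades, p0)
        k = [rand(Binomial(trades, p0)) for _ in 1:nstreams]
        best = 1.0
        for lo in 1:nstreams, hi in lo:nstreams
            n = (hi - lo + 1) * trades
            s = sum(k[lo:hi])
            best = min(best, ccdf(Binomial(n, p0), s - 1))
        end
        best
    end

    Random.seed!(1)
    mins = [min_window_pvalue(11, 40, 20/423) for _ in 1:10_000]

    # Fraction of honest careers whose most extreme window of streams
    # already clears a nominal 1% threshold; this is well above 0.01,
    # which is why the "most extreme sequence" p-value must be corrected.
    println(mean(mins .<= 0.01))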
The response paper suggests correcting across livestreams instead of individuals. This is redundant, as the p-value outputted, after correcting for the number of streams, is already the p-value for Dream's entire livestream history; were it applied to someone else, it would likewise be applied to their entire livestream history. Moreover, the response paper's estimate of 300 livestreamed runs per day over the past year is highly implausible: many runs are not livestreamed, and the estimate is based on current numbers, even though Minecraft speedrunning has grown massively in recent months.

At the time of Dream's run, there were 487 runners who had times in 1.16 – far under 1000 – and the vast majority of these were unpopular or did not stream. Selection bias can only be induced by observed runners, so speedrunners who had no significant viewership watching their attempts should not be included. Frankly, there were probably fewer than 50 runners in any version who might have been examined like this, but we used 1000 as an upper bound.

Note that treating whether or not someone is "observed" as a binary value is a simplification: the less likely someone's extreme luck is to be noticed, the less they contribute to sampling bias. We included people who have only a handful of viewers in the calculation, even though the amount of sampling bias they introduce is likely negligible.

Additionally, note that this is one of the most important factors shifting the number upwards in the response paper: severely overestimating the number of livestreamed attempts artificially inflates the final figure to a massive degree.
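As a sketch of the scaling involved, here is a Šidák-form correction (for illustration only; our paper's actual correction differs in detail) applied to the same illustrative p-value under three assumptions about the number of observations corrected across:

    # Illustrative uncorrected p-value; not a figure from either paper.
    p = 1.0e-14

    # 50: realistically observable runners; 1000: the upper bound we used;
    # 110_000: roughly the response paper's 300 livestreamed runs per day
    # sustained for a year.
    for N in (50, 1_000, 110_000)
        corrected = Float64(1 - (1 - big(p))^N)   # approx. N * p for tiny p
        println(N, " => ", corrected)
    end

Because the corrected value grows essentially linearly in N for small p, inflating N by two orders of magnitude inflates the final figure by the same two orders of magnitude.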
6 The number of RNG types

Dream's response paper corrects across 37 different random factors. It is worth noting that, even using this increased number of factors, the final p-value only changes by a factor of 15. If we accepted this list, it would not change our conclusion, but we still hold that the list is seriously flawed.

Dream suggests that eye breaking odds, various mob spawn rates, dragon perch time, triangulation ability, and various seed-based factors should be counted. However, these are more difficult to cheat than blaze rods and piglin bartering rates, and in some cases are entirely implausible for us to examine.

The dominant theory is that Dream cheated by modifying the internal configuration files in his launcher jar file directly. Other methods are possible as well, but this is likely the most straightforward. Using this method, only entity drops and piglin barters can be modified.

Dream offers frequency of triangulation into the stronghold as one factor. However, this is not random at all, and is instead a skill-based factor³. Additionally, many of the factors proposed are seed-based. An extensive amount of time would be required to seedfind enough plausibly random world seeds for a livestream, making this an implausible method for long-term cheating. Further, it is in principle possible to detect set seeds from non-seed random factors: as a simplified example, if we know the LCG state at a fixed number of steps after seed generation, we can backstep to seed generation and recover the seed that should have been generated. Frankly, this would be rather difficult to do, but it would be attempted first instead of statistical analysis.

³ How well a player can triangulate based on eye throws.
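To make the backstep idea concrete, here is a small Julia sketch for the 48-bit LCG behind java.util.Random, which Minecraft's seed generation builds on. The inverse multiplier is the standard constant for this generator; the snippet demonstrates only that a known state can be stepped backwards, not a full seed-recovery tool.

    # 48-bit LCG parameters of java.util.Random.
    const M    = UInt64(0x5DEECE66D)     # multiplier
    const A    = UInt64(0xB)             # increment
    const MASK = UInt64(2)^48 - 1        # state is 48 bits
    const MINV = UInt64(0xDFE05BCB1365)  # modular inverse of M mod 2^48

    @assert (M * MINV) & MASK == 1       # MINV really is the inverse

    next_state(s) = (s * M + A) & MASK
    prev_state(s) = ((s - A) * MINV) & MASK

    s = UInt64(0x123456789AB)                # some known internal state
    @assert prev_state(next_state(s)) == s   # a forward step is exactly undone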
Some suggested factors rely on strategies that were either defunct or nonexistent at the time of Dream's runs. Monuments, and string from barters, are only important for so-called "hypermodern" strategies, which often skip villages and explore the ocean. These strategies did not exist at the time of Dream's runs. Similarly, ender pearl trades are practically never used in 1.16 runs because it is more difficult and slower to get pearls via trades than via barters. As a result, no top runs in 1.16 utilize villager trading.

Finally, some factors occur too rarely to obtain a large enough sample for analysis. For instance, one only reaches the end portal on nearly completed runs, so there would be very few events to check.

Clearly, the number 37 is entirely unrealistic. It relies on the use of strategies that Dream could not have used, and on the investigation of factors that we could not investigate. Again, though, even if we accept the full 37, it only changes our result by a factor of 15 – not enough to change our conclusion.

7 Paradigm Inconsistency

In section 4.2 of Dream's response paper, the author explains that they use the Bayesian statistics paradigm instead of the hypothesis testing paradigm used in our report. That is, Dream's response paper attempts to calculate the probability that Dream cheated given the bartering and blaze data; in contrast, our paper calculates the probability of obtaining bartering and blaze results at least as extreme as Dream's under the assumption that the game is unmodified. These are entirely different probabilities, but Dream's response paper confuses the two paradigms throughout, producing an uninterpretable result.

7.1 Unclear Corrections

Dream's response paper mimics many of the bias corrections in our original paper, but because the starting value is the posterior probability of an unmodified game and not a p-value, some of these corrections are unjustified. Indeed, it is not trivially obvious that frequentist p-value corrections can be applied to such a probability.

Dream's response paper attempts to correct for the stopping rule. This is perfectly fine under a frequentist paradigm like the one we used. However, it is inconsistent with the Bayesian paradigm used in the response paper. Bayesians follow the likelihood principle: changes to the likelihood by a factor that does not depend on the parameter of interest do not change the results. A well-known consequence of the likelihood principle is that stopping rules are irrelevant to analyses that follow it. Hence, the author should not have accounted for stopping rules at all, including by dropping the last data point. Indeed, the response paper itself stated that one of the reasons a Bayesian approach was used was to avoid having to model the stopping rule of each run. Despite this statement, the author goes on to drop the last data point in an attempt to address the stopping rule.

Similarly, the response paper attempts to correct for selection bias across runners. This is rather odd, as the goal of these corrections is to control error rates,
a goal that is not shared by Bayesian methods⁴. The likelihoods across individuals are independent of one another, and therefore comparisons across other individuals are irrelevant to a Bayesian analysis.

⁴ With the exception of matching priors, although such can…
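This consequence of the likelihood principle is easy to check directly: for the same observed counts, the fixed-number-of-trials binomial likelihood and the sample-until-k-successes negative binomial likelihood differ only by a constant factor that does not involve the drop rate p. The counts below are the blaze rod figures from our original paper; any counts would show the same constancy.

    using Distributions

    n, k = 305, 211   # blaze kills and rod drops from our original paper

    for p in (0.3, 0.5, 0.7)
        binom = pdf(Binomial(n, p), k)              # fixed number of trials
        negb  = pdf(NegativeBinomial(k, p), n - k)  # sample until k-th success
        println("p = ", p, "   binomial / negative binomial = ", binom / negb)
    end

    # The ratio is the constant n/k (about 1.445) for every p, so Bayesian
    # inferences about p are identical under either stopping rule.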
7.2 Invalid Comparison

The final conclusion of Dream's response paper conflates the posterior probability with the p-value once more:

    In any case, the conclusion of the MST Report that there is, at best, a 1 in 7.5 trillion chance that Dream did not cheat is too extreme for multiple reasons that have been discussed in this document.

Again, the 1 in 7.5 trillion chance does not represent the probability that Dream did not cheat; it represents the probability of any Minecraft speedrunner getting results at least as extreme as Dream's using an unmodified game while streaming. Widening the scope to any streaming speedrunner already artificially enlarges the p-value in Dream's favor, and was done only to preempt accusations of p-hacking and the like.

Even if Dream's response calculation were done correctly, the 1 in 10 million posterior probability would not be directly comparable to the 1 in 7.5 trillion figure, and would still imply a 99.99999% chance that Dream cheated.
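The difference between the two quantities is visible in a single toy computation. The counts below are the pearl barter figures from our original paper; the modified rate and the prior are hypothetical, which is precisely the problem: the posterior depends on both of these choices, while the p-value does not.

    using Distributions

    n, k = 262, 42    # pearl barters from our original paper
    p0   = 20/423     # unmodified barter rate

    # Frequentist quantity: P(at least k successes | unmodified game).
    pval = ccdf(Binomial(n, p0), k - 1)

    # Bayesian quantity: P(modified | data), which additionally needs a
    # hypothetical modified rate and a hypothetical prior.
    p1          = 0.15    # hypothetical modified rate
    prior_cheat = 0.01    # hypothetical prior probability of cheating
    like0 = pdf(Binomial(n, p0), k)
    like1 = pdf(Binomial(n, p1), k)
    post  = like1 * prior_cheat /
            (like1 * prior_cheat + like0 * (1 - prior_cheat))

    println("p-value:               ", pval)
    println("posterior P(modified): ", post)   # a different question entirely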
8 Conclusion

The author of Dream's response paper appears to mix frequentist and Bayesian methods, resulting in an uninterpretable final result. Further, these methods are applied incorrectly, preventing valid conclusions from being drawn. Despite these problems being in Dream's favor, the author presents a probability that still suggests that Dream was using a modified game. Hence, our conclusion remains unchanged.

Relevant Links:

By Moderators or Dream

1. Dream Investigation Results, the original moderator paper.
2. Critique of Dream Investigation Results, Dream's response paper by Photoexcitation.
3. Did Dream Fake His Speedruns - Official Moderator Analysis, the moderators' YouTube investigation report.
4. Did Dream Fake His Speedrun - RESPONSE, Dream's response video.

By Others

5. Reddit r/statistics comment by mfb, a particle physicist with a PhD in physics.
6. The chances of "lucky streaks", a Reddit post by particle physicist mfb.
7. Dream's cheating scandal - explaining ALL the math simply, a YouTube video by Mathemaniac.
8. Blog post by Professor Andrew Gelman.
A Julia Simulation Code
A.1 Stopping Rule Simulations
using Random
using Distributions

# Chunked stopping rule: for each of 1000 trials, repeat "draw
# Bernoulli(0.1) until 2 successes" 100 times, recording the total
# number of draws needed.
Random.seed!(1234)
nbsplit = Int[]
for i ∈ 1:1000
    n = 0
    nseq = 0
    while nseq != 100
        x = 0
        while x != 2
            x += rand(Bernoulli(0.1))
            n += 1
        end
        nseq += 1
    end
    push!(nbsplit, n)
end

# Direct stopping rule: draw Bernoulli(0.1) until 200 successes,
# recording the total number of draws needed.
Random.seed!(1234)
nb = Int[]
for i ∈ 1:1000
    x = 0
    n = 0
    while x != 200
        x += rand(Bernoulli(0.1))
        n += 1
    end
    push!(nb, n)
end

# nb:      direct negative binomial result
# nbsplit: chunked negative binomial result
# With the same seed, both stopping rules consume the random stream
# identically, so the results match exactly.
println(nb == nbsplit)

A.3 1% Event Simulation

using Random
using Distributed
using Distributions

# Count, across 500 million simulated sequences of 100 independent
# 1%-probability events, how many sequences contain at least 3
# consecutive successes.
numruns = @distributed (+) for i ∈ 1:500000000
    x = rand(Bernoulli(0.01), 100)

    res = false
    count = 0
    for j ∈ 1:length(x)
        if x[j]
            count += 1
        else
            count = 0
        end

        if count == 3
            res = true
            break
        end
    end
    res
end

# probability is numruns / 500000000
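For reference, the Monte Carlo estimate in A.3 admits a quick analytic cross-check with a small dynamic program over trailing run lengths (an addition for verification, not part of the original listing):

# Exact probability of at least one run of `runlen` successes in `ntrials`
# independent Bernoulli(p) trials, tracking P(no full run yet, trailing run
# of exactly j successes).
function prob_run(ntrials, p, runlen)
    state = zeros(runlen)   # state[j + 1] for trailing runs j = 0 .. runlen-1
    state[1] = 1.0
    hit = 0.0               # probability mass absorbed by a completed run
    for _ ∈ 1:ntrials
        next = zeros(runlen)
        for j ∈ 0:runlen-1
            next[1] += state[j + 1] * (1 - p)   # a failure resets the run
            if j + 1 == runlen
                hit += state[j + 1] * p         # run of `runlen` completed
            else
                next[j + 2] += state[j + 1] * p
            end
        end
        state = next
    end
    hit
end

# Compare with numruns / 500000000 from the simulation above.
println(prob_run(100, 0.01, 3))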