1 Assignment 2: Hypothesis Testing
Richard Pei
Due: Wed, Nov 13, 2019 in class
Submission: Complete this notebook and print out the output or submit it electronically.
Everything you need to complete is marked with a TODO. For textual questions, create a new
cell under the question and respond there.
1.1 Motivation
In a standard randomized controlled trial, our null hypothesis is often trivial: nothing happens, no
difference in the means, no difference in the relative ranking. In this assignment, we generalize this
idea to compare observed data against an assumed statistical model. That is, could the observed
data plausibly have been generated by the known model?
An air shower is a cascade of ionized particles and electromagnetic radiation produced when a
primary cosmic ray (i.e., one of extraterrestrial origin) enters the atmosphere. When a particle,
which could be a proton, a nucleus, an electron, a photon, or (rarely) a positron, strikes an atom's
nucleus in the air, it produces many energetic hadrons. We have a detector that observes particles
that reach a ground station and measures each particle's energy and arrival time.
We have the following theoretical model of particle behavior: the energy of each particle is drawn,
independently of its arrival time, from a Gamma distribution, and the particles arrive as a Poisson
process. We have the following simulator:
[282]: import numpy as np
       import pandas as pd
       import csv

       def simulate_burst(total):
           """Simulates a trial of `total` particles and returns a dataframe
           with two columns: observed arrival time (in microseconds) and
           energy (in kilojoules).
           """
           t = 0
           data = []
           for trial in range(total):
               # Inter-arrival gaps are exponential(1), i.e., a Poisson process.
               t += np.random.exponential(scale=1.0)
               # Energies are drawn i.i.d. from Gamma(shape=2.15, scale=1.96).
               obs = np.random.gamma(2.15, 1.96, 1)[0]
               data.append({'otime_us': t, 'energy_kj': obs})
           return pd.DataFrame(data)
In addition to the simulator, you are given a dataset of real observations (download). You will
write a function to load this dataset into a pandas dataframe. The dataset contains some missing
values; the function should drop all rows with any missing values (i.e., NaN).
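A minimal sketch of such a loader (the original cell is not shown; this assumes the CSV has a
header row naming the otime_us and energy_kj columns):

[283]: def load_data(path):
           """Loads the observation CSV into a dataframe and drops
           all rows containing any missing (NaN) values."""
           df = pd.read_csv(path)
           return df.dropna()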
1.3 Pre-Processing
Before we begin testing, we should analyze the data for potential problems.
Q1. Compare particle energies generated from the simulator and the real data. If they do differ,
explain how.
The standard deviation of the energy_kj observations in the simulation is much lower than that
of the real data. The real data also has a much larger range of values, with a minimum of -246
and a maximum of 286, whereas the simulated data ranges from values close to 0 up to the low
20s. Otherwise, from a cursory glance, the observed time data looks similar for both datasets.
[284]: real_data = load_data("part.csv")
real_data.describe()
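The summary table below is the output of a cell whose input is not shown; judging from the
non-negative minimum and the maximum near 21, it summarizes the simulated data rather than
the real data. A sketch of what that cell presumably contained:

[285]: simulated_data = simulate_burst(1000)
       simulated_data.describe()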
[285]: otime_us energy_kj
count 1000.000000 1000.000000
mean 486.054345 4.136582
std 277.162276 2.792321
min 1.006010 0.119621
25% 249.919173 2.080854
50% 482.904447 3.465478
75% 734.233356 5.582866
max 941.752822 21.250186
Q2. Your engineers tell you that all energy readings should be positive. Are there any negative
values in either of the datasets? If so, is there any unexpected pattern to those values, in terms
of the times at which they occur or the values they take on?
There are 50 negative values in the real dataset, but none in the simulated dataset. There doesn't
seem to be a pattern in the times at which they occur, but one unusual feature is that all of the
negative energy values are whole numbers, whereas the rest of the data consists of values with
many decimal digits.
[286]: negative_real = real_data[real_data['energy_kj'] < 0]
display(negative_real)
negative_real.count()
otime_us energy_kj
1 1.559228 -128.0
60 60.057136 -139.0
78 78.842368 -104.0
90 90.386515 -24.0
91 91.418192 -68.0
97 97.516804 -89.0
99 99.581593 -85.0
109 109.091903 -32.0
139 139.651419 -3.0
140 140.264270 -96.0
145 145.371315 -246.0
147 147.747298 -57.0
149 149.362157 -116.0
176 176.692820 -49.0
246 246.619152 -54.0
250 250.519167 -6.0
256 256.876512 -54.0
259 259.419507 -64.0
345 345.558844 -1.0
367 367.561908 -8.0
375 375.054020 -63.0
385 385.442051 -26.0
401 401.327724 -10.0
411 411.325126 -96.0
426 426.192545 -40.0
444 444.886387 -99.0
448 448.646410 -176.0
456 456.825609 -42.0
458 458.552481 -149.0
473 473.629957 -60.0
490 490.731601 -33.0
496 496.752992 -104.0
552 552.966821 -1.0
579 579.681490 -59.0
601 601.755655 -43.0
608 608.223338 -154.0
621 621.345263 -146.0
626 626.532020 -114.0
633 633.202466 -115.0
636 636.976924 -145.0
642 642.655734 -12.0
647 647.641942 -91.0
659 659.982006 -31.0
735 735.695685 -61.0
813 813.923310 -170.0
824 824.701075 -111.0
832 832.205138 -92.0
855 855.948694 -29.0
920 920.261943 -153.0
953 953.112440 -113.0
[286]: otime_us 50
energy_kj 50
dtype: int64
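The empty dataframe below is the output of a cell whose input is not shown; it presumably ran
the same check for negative energies on the simulated data, along these lines (a sketch, reusing
simulated_data from above):

[287]: display(simulated_data[simulated_data['energy_kj'] < 0])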
Empty DataFrame
Columns: [otime_us, energy_kj]
Index: []
Q3. Are there any other energy readings that are suspect in the real dataset? Roughly what
fraction of values are suspect?
There are 59 positive outliers (not counting the negative values already identified in the previous
question), based on the commonly used 1.5 × IQR rule. Out of 996 values, that is about 5.9% of
data points. Adding the 50 negative values, 109 values in total are "suspect," or about 10.9%.
[288]: real_energy_25 = np.percentile(real_data['energy_kj'], 25)
real_energy_75 = np.percentile(real_data['energy_kj'], 75)
real_energy_iqr = real_energy_75 - real_energy_25
outlier_upper = real_energy_75 + 1.5*real_energy_iqr
outlier_lower = real_energy_25 - 1.5*real_energy_iqr
outliers_upper = real_data[(real_data['energy_kj'] > outlier_upper)]
outliers_lower = real_data[(real_data['energy_kj'] < outlier_lower)]
display(outliers_upper)
display(outliers_lower)
print(outliers_upper.count())
otime_us energy_kj
10 10.328715 179.000000
29 29.357244 286.000000
40 40.450206 77.000000
44 44.062026 17.262445
66 66.414355 95.000000
83 83.225272 12.132064
134 134.122454 24.000000
138 138.411715 32.000000
193 193.517617 202.000000
198 198.709251 13.444559
225 225.527135 47.000000
226 226.495737 13.748464
230 230.786301 108.000000
303 303.465199 12.000000
325 325.822698 22.000000
357 357.300472 12.513608
373 373.338382 46.000000
386 386.898723 93.000000
389 389.455461 17.053003
390 390.707437 59.000000
424 424.691913 129.000000
461 461.802427 111.000000
475 475.674377 87.000000
505 505.208910 12.008687
523 523.111136 71.000000
532 532.583502 234.000000
543 543.609067 47.000000
554 554.085722 44.000000
587 587.948751 13.061088
606 606.028233 14.000000
617 617.752082 170.000000
618 618.984450 212.000000
628 628.372317 32.000000
641 641.523170 215.000000
645 645.383789 12.496337
668 668.442729 55.000000
707 707.823960 237.000000
729 729.514045 15.810724
748 748.053431 48.000000
750 750.114660 64.000000
751 751.737856 70.000000
762 762.923478 15.387726
770 770.461953 252.000000
782 782.413487 182.000000
796 796.871935 64.000000
804 804.785334 107.000000
814 814.325378 52.000000
836 836.177740 120.000000
848 848.248260 21.000000
865 865.109645 159.000000
870 870.471135 83.000000
875 875.870579 28.000000
888 888.478847 12.542099
890 890.446386 81.000000
908 908.795865 36.000000
921 921.007503 94.000000
930 930.771796 41.000000
958 958.856752 47.000000
967 967.457753 107.000000
otime_us energy_kj
1 1.559228 -128.0
60 60.057136 -139.0
78 78.842368 -104.0
90 90.386515 -24.0
91 91.418192 -68.0
97 97.516804 -89.0
99 99.581593 -85.0
109 109.091903 -32.0
140 140.264270 -96.0
145 145.371315 -246.0
147 147.747298 -57.0
149 149.362157 -116.0
176 176.692820 -49.0
246 246.619152 -54.0
250 250.519167 -6.0
256 256.876512 -54.0
259 259.419507 -64.0
367 367.561908 -8.0
375 375.054020 -63.0
385 385.442051 -26.0
401 401.327724 -10.0
411 411.325126 -96.0
426 426.192545 -40.0
444 444.886387 -99.0
448 448.646410 -176.0
456 456.825609 -42.0
458 458.552481 -149.0
473 473.629957 -60.0
490 490.731601 -33.0
496 496.752992 -104.0
579 579.681490 -59.0
601 601.755655 -43.0
608 608.223338 -154.0
621 621.345263 -146.0
626 626.532020 -114.0
633 633.202466 -115.0
636 636.976924 -145.0
642 642.655734 -12.0
647 647.641942 -91.0
659 659.982006 -31.0
735 735.695685 -61.0
813 813.923310 -170.0
824 824.701075 -111.0
832 832.205138 -92.0
855 855.948694 -29.0
920 920.261943 -153.0
953 953.112440 -113.0
otime_us 59
energy_kj 59
dtype: int64
Based on your answers to Q1, Q2, and Q3, write a function that cleans the real data by removing
all problematic observations.
Now, we will compare the particle energies from the simulated data and the real data. Fill in
the following hypothesis tests. Be reasonable about this. You may not import methods from
statistics packages that perform the test for you.
[289]: def clean(df):
           """Removes problematic observations: negative energies and
           values outside the 1.5*IQR outlier fences."""
           df = df.copy()
           quartile_1, quartile_3 = np.percentile(df['energy_kj'], 25), np.percentile(df['energy_kj'], 75)
           iqr = quartile_3 - quartile_1
           upper, lower = quartile_3 + 1.5 * iqr, quartile_1 - 1.5 * iqr
           df = df[(df['energy_kj'] < upper) & (df['energy_kj'] > lower) & (df['energy_kj'] >= 0)]
           return df

       cleaned_data = clean(real_data)
       cleaned_data
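1.4 Comparing the Energies
The answer to Q1 in the next section refers to "the two-sample z-test performed earlier with the
energy values"; that cell is not shown above, so here is a sketch of such a test (the function name
test_energy_mean and the cell number are assumptions; only the normal CDF comes from a
statistics package, as in the original cells):

[290]: import scipy.stats as st

       def test_energy_mean(simulated, real):
           """Two-sample z-test for a difference in mean particle energy."""
           s, r = simulated['energy_kj'], real['energy_kj']
           # Standard error of the difference between the sample means.
           se = np.sqrt(s.var() / s.count() + r.var() / r.count())
           z = (r.mean() - s.mean()) / se
           # Two-sided p-value under the standard normal approximation.
           return 2 * (1 - st.norm.cdf(abs(z)))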
[291]: def test_energy_ranksum(simulated, real):
           # Pool both samples and rank all energies together.
           pooled = pd.concat([simulated.assign(source='sim'), real.assign(source='real')])
           pooled['rank'] = pooled['energy_kj'].rank()
           simulated = pooled[pooled['source'] == 'sim']
           real = pooled[pooled['source'] == 'real']
           s_count = simulated['energy_kj'].count()
           r_count = real['energy_kj'].count()
           # Wilcoxon rank-sum: normal approximation to the real sample's rank sum.
           se = np.sqrt(r_count * s_count * (r_count + s_count + 1) / 12)
           rank_sum = real['rank'].sum()
           z = (rank_sum - r_count * (r_count + s_count + 1) / 2) / se
           return 2 * (1 - st.norm.cdf(abs(z)))
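Running both tests on the cleaned real data (a sketch; simulate_burst(10000) is chosen so the
simulated sample matches the 9999 gaps mentioned in Q2 of the next section):

[292]: simulated_data = simulate_burst(10000)
       print(test_energy_mean(simulated_data, cleaned_data))
       print(test_energy_ranksum(simulated_data, cleaned_data))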
1.5 Comparing the Arrival Times
So far, we have only tested the particle energies. Another important aspect of our model is the
arrival process (i.e., the times).
Q1. Describe a hypothesis test that evaluates whether the arrival process in the simulator differs
significantly from that of the observed data.
To determine whether the arrival process differs between the simulator and the observed data, I
will test whether the average gap between arrival times is the same in both datasets. To do this,
I will add a column to each dataset containing the difference between each observation's arrival
time and the previous one. Then I will compute the mean and standard deviation of that column
and perform a z-test for a difference in the average gap (similar to the two-sample z-test performed
earlier with the energy values).
We were told the simulated data comes from a Poisson process, and from the simulate_burst
function we can see that the average time between arrivals is 1.0 microsecond, even though each
individual gap is random. We will compute the average gap between arrival times for the simulation
and for the real data and check whether they differ, using a two-sample z-test.
Q2. Do your pre-processing choices above change? Why or why not?
The pre-processing choices change. We don't need to clean the real dataset, as it also appears
to come from a Poisson process, with an average gap between arrivals of around 1. Our only
"pre-processing" step is to add a third column holding the difference between each observation's
arrival time and the previous one. Because it is computed from consecutive differences, this column
has one fewer entry than the dataset has observations (i.e., 995 for the real dataset and 9999 for
the simulated dataset).
Furthermore, we don't need to remove the observations with outlying or negative energy values,
since we were told the arrival times are independent of the energies; the timing is therefore
unaffected by bad energy readings. In fact, removing those rows would bias the test: it would
leave gaps in the arrival times, so the real data would necessarily show a larger average gap
between arrivals.
[293]: def test_arrival_process(simulated, real):
           # Gap between consecutive arrival times (the first entry is NaN).
           simulated['distance'] = simulated['otime_us'].diff(1)
           real['distance'] = real['otime_us'].diff(1)
           s_gaps = simulated['distance'].dropna()
           r_gaps = real['distance'].dropna()
           # Two-sample z-test on the mean gap, as described in Q1.
           se = np.sqrt(s_gaps.var() / s_gaps.count() + r_gaps.var() / r_gaps.count())
           z = (r_gaps.mean() - s_gaps.mean()) / se
           return 2 * (1 - st.norm.cdf(abs(z)))
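Running the arrival-time test (a sketch; per Q2, the uncleaned real data is used, giving 995 real
and 9999 simulated gaps):

[294]: print(test_arrival_process(simulated_data, real_data))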