Week-3 NK
Question 5.1
Using crime data from http://www.statsci.org/data/general/uscrime.txt (description
at http://www.statsci.org/data/general/uscrime.html), test to see whether there is an
outlier in the last column (number of crimes per 100,000 people). Is the lowest-crime
city an outlier? Is the highest-crime city an outlier? Are there others? Use the
grubbs.test function in the outliers package in R.
Answer:
Sub-question: See whether there is an outlier in the last column (number of crimes
per 100,000 people)
Based on chart 1.(a), I don’t think there are any outliers, but chart 1.(b)
suggests that there are. So I would not draw a conclusion from these
charts alone, and I investigate the data further.
Step-2.
I conduct Grubbs' test to check for outliers in the data. The results suggest that
there are no outliers in the data at a 5% level of significance.
1. Test if the highest value is an outlier
G = 2.81290, U = 0.82426, p-value = 0.07887
alternative hypothesis: highest value 1993 is an outlier
Because the p-value is > 0.05, we cannot reject the null hypothesis. So the highest
value is not an outlier at a 5% level of significance.
2. Test if the highest and lowest values are outliers
G = 4.26880, U = 0.78103, p-value = 1
alternative hypothesis: 342 and 1993 are outliers
Because the p-value is > 0.05, we cannot reject the null hypothesis. So there are no
outliers in the data at a 5% level of significance.
Step-3: I also run a loop to test whether any of the values is an outlier (see the R code
in the ‘R-code’ section). The output confirms that there are no outliers in the data.
Sub-question: Is the lowest-crime city an outlier?
The answer is no. To answer this, I conducted Grubbs' test for two opposite outliers.
The results confirm that the lowest-crime city is not an outlier: the p-value is > 0.05,
so we cannot reject the null hypothesis at a 5% level of significance.
data: cr
G = 4.26880, U = 0.78103, p-value = 1
alternative hypothesis: 342 and 1993 are outliers
Sub-question: Is the highest-crime city an outlier?
To answer this, I conducted Grubbs' test. The results confirm that the highest-crime
city is not an outlier at a 5% level of significance. However, at a 10% level of
significance, we can conclude that the highest-crime city is an outlier.
data: cr
G = 2.81290, U = 0.82426, p-value = 0.07887
alternative hypothesis: highest value 1993 is an outlier
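As a cross-check on the output above, the one-sided Grubbs statistic and its p-value can be sketched in base R. This is only a minimal sketch, not the code of the outliers package: the helper names grubbs_stat and grubbs_p_value are hypothetical, the p-value uses the usual t-distribution relation for a single outlier, and n = 47 is an assumption about the number of rows in the data set.

```r
# Grubbs statistic for one extreme of x: "max" (highest) or "min" (lowest).
# G = |extreme - mean| / sd, with sd being the sample standard deviation.
grubbs_stat <- function(x, side = c("max", "min")) {
  side <- match.arg(side)
  extreme <- if (side == "max") max(x) else min(x)
  abs(extreme - mean(x)) / sd(x)
}

# One-sided p-value for a single suspected outlier, given G and n.
# Uses the t-distribution relation with n - 2 degrees of freedom,
# multiplied by n and capped at 1.
grubbs_p_value <- function(G, n) {
  t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
  min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))
}

# G as reported above for the highest value; n = 47 is an assumption.
p_high <- grubbs_p_value(G = 2.81290, n = 47)
print(round(p_high, 3))
```

With the reported G and the assumed n, the result should land close to the p-value of 0.07887 shown above. Calling grubbs_stat(x, side = "min") gives the statistic for the lowest value on its own, rather than as part of a pair of opposite outliers.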
Question 6.1
Describe a situation or problem from your job, everyday life, current events, etc., for
which a Change Detection model would be appropriate. Applying the CUSUM
technique, how would you choose the critical value and the threshold?
Answer:
The critical value would change from country to country, because it depends on the
volatility of inflation in that country. I would estimate the standard deviation of
inflation during a period of relatively stable inflation, and then use one
standard deviation as the critical value.
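A minimal sketch of this choice in R, under stated assumptions: the inflation figures are made up for illustration, the first eight months are assumed to be the stable period, the baseline µ is taken as the mean of that period, and the CUSUM is assumed to watch for increases in inflation.

```r
# Hypothetical monthly inflation rates (percent); illustrative values only
inflation <- c(2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.1, 2.3, 2.9, 3.4, 3.8, 4.1)

# Assumption: the first eight months are the relatively stable period
stable <- inflation[1:8]

mu <- mean(stable)  # baseline level (assumed to be the stable-period mean)
C  <- sd(stable)    # critical value: one standard deviation of stable inflation

# Upward CUSUM: S_t = max{0, S_{t-1} + (X_t - mu - C)}
S <- numeric(length(inflation))
for (t in seq_along(inflation)) {
  prev <- if (t == 1) 0 else S[t - 1]
  S[t] <- max(0, prev + (inflation[t] - mu - C))
}

print(round(C, 4))
print(round(S, 3))
```

A more volatile country would yield a larger C from the same procedure, so the same rise in inflation would accumulate more slowly in S.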
Question 6.2
1. Using July through October daily-high-temperature data for Atlanta for 1996
through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e.,
when the weather starts cooling off) each year.
You can use R if you’d like, but it’s straightforward enough that an Excel
spreadsheet can easily do the job too.
Answer:
For the formulae, I follow the notation used in the lecture.
I use an Excel spreadsheet to conduct this analysis.
I build a CUSUM model for each year. The procedure used to build each model is
as follows.
Before calculating S_t, we need to choose a value for ‘C’. Intuitively, ‘C’ is the value
below which a change in temperature is attributed to random factors. So ‘S_t’
should not be affected by changes smaller than ‘C’.
I estimate the value of C as follows. For the calculations, see the attached Excel
workbook (sheet-Data).
1. Calculate the daily change in temperature (∆T = X_t - X_{t-1}). This is because we
are interested in knowing which changes in temperature are caused by random
factors and which are not.
2. Calculate the standard deviation of ∆T.
3. Because we are interested only in decreases (i.e. changes in one
direction), C = 0.5 * average standard deviation.
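The three steps above can be sketched in R as follows. This is a sketch under assumptions: the temperatures are made-up values for two years, and "average standard deviation" is read as the mean of the per-year standard deviations of ∆T.

```r
# Hypothetical daily high temperatures (°F), a few days per year, for illustration
temps_by_year <- list(
  "1996" = c(90, 88, 91, 89, 92, 90, 87, 85),
  "1997" = c(88, 89, 87, 90, 86, 84, 85, 81)
)

# Steps 1 and 2: daily changes (diff gives X_t - X_{t-1}) and their
# standard deviation, computed separately for each year
sd_by_year <- sapply(temps_by_year, function(x) sd(diff(x)))

# Step 3: C is half of the average standard deviation across years
# (assumption: "average" means the mean over the yearly values)
C <- 0.5 * mean(sd_by_year)

print(round(sd_by_year, 3))
print(round(C, 3))
```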
I then calculate S_t = max{0, S_{t-1} + (µ - X_t - C)}. Please see the Excel workbook
(sheet-Analysis-1).
The threshold ‘T’ is chosen as 5*C. This choice is based on the observation that
any T value below this number would produce false detections during that year.
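The recursion and the threshold rule can be sketched in R as below. The function name cusum_decrease is hypothetical, the temperatures are made up, C is set to 1 purely for illustration, µ is assumed to be the mean of the first six (warm) days, and a change is assumed to be flagged when S_t reaches or exceeds T.

```r
# CUSUM for detecting a decrease: S_t = max{0, S_{t-1} + (mu - X_t - C)}
# Returns the S_t series and the first index where S_t >= threshold
# (NA if the threshold is never reached).
cusum_decrease <- function(x, mu, C, threshold) {
  S <- numeric(length(x))
  for (t in seq_along(x)) {
    prev <- if (t == 1) 0 else S[t - 1]
    S[t] <- max(0, prev + (mu - x[t] - C))
  }
  list(S = S, change_index = which(S >= threshold)[1])
}

# Hypothetical daily highs (°F): a warm stretch followed by cooling
temps <- c(90, 91, 89, 90, 92, 90, 88, 89, 85, 83, 82, 80, 79, 78)

mu <- mean(temps[1:6])  # assumption: baseline from the early, warm days
C <- 1                  # illustrative value only
threshold <- 5 * C      # T = 5*C, as in the text

result <- cusum_decrease(temps, mu, C, threshold)
print(round(result$S, 2))
print(result$change_index)
```

Lowering the threshold in this sketch makes small dips accumulate into a detection sooner, which is the false-detection behaviour the choice of T is meant to avoid.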
Based on this approach, the unofficial end of summer for each year is as follows.
Answer:
To answer this question, we need to look at the time series of yearly mean (or median)
temperature and conduct a change-point analysis on it. This is done in the attached
Excel workbook (sheet-Analysis-2).
For each year, I compared the mean and the median and found a
substantial difference between them. So I preferred to use the median as the measure
of central tendency.
By observation, there seems to be an upward shift in the median temperature
after 2009. Nevertheless, I conduct the change-point analysis as below.
I calculate ‘C’ in the same way as I did previously. However, because of the significant
increase in median temperature during 2010, the standard deviation estimate
would be inflated. So I prefer to estimate the standard deviation after
excluding the 2010 point (of ∆T).
I determine the value of ‘T’ by observing S_t over time. We can observe that
before 2010, the changes in S_t were small. Indeed, the maximum S_t prior to 2010 was
0.71. If T were less than 0.71, I might get a false detection. So I set T at 1.
This is more of a subjective assessment based on the data.
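A minimal R sketch of this yearly analysis, under stated assumptions: the medians are made-up values (positions 1 to 10 stand in for years), the baseline µ is assumed to be the mean of the earlier years, the excluded ∆T point is taken to be the single largest jump, and the upward form S_t = max{0, S_{t-1} + (X_t - µ - C)} is assumed because the interest here is in warming.

```r
# Made-up yearly median temperatures (°F); positions 1..10 stand in for years
medians <- c(87.0, 87.5, 87.0, 87.5, 87.0, 87.5, 87.0, 90.0, 90.5, 90.0)

# Year-to-year changes, with the single largest jump dropped before
# estimating the spread (mirrors excluding the one unusual point)
delta <- diff(medians)
delta_trim <- delta[-which.max(delta)]

C <- 0.5 * sd(delta_trim)  # same rule for C as in the daily analysis
threshold <- 1             # T = 1, as chosen in the text

mu <- mean(medians[1:7])   # assumption: baseline from the earlier years

# Upward CUSUM on the yearly medians
S <- numeric(length(medians))
for (t in seq_along(medians)) {
  prev <- if (t == 1) 0 else S[t - 1]
  S[t] <- max(0, prev + (medians[t] - mu - C))
}

first_alarm <- which(S >= threshold)[1]
print(round(S, 3))
print(first_alarm)
```

In this made-up series, S_t stays well below the threshold in the early positions and crosses it only at the jump, which is the pattern the choice of T relies on.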
The results of the calculations are presented in the chart below. Based on this chart,
the answer to the question is ‘Yes’: Atlanta’s summer climate has
gotten warmer, specifically since 2009.
R-code
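# Import the data and create a vector storing the crime rate (i.e. the last column)
library(outliers)
uscrime <- read.table("http://www.statsci.org/data/general/uscrime.txt", header = TRUE)
cr <- as.numeric(uscrime$Crime)
plot(cr)
# Grubbs test for two opposite outliers
grubbs.test(cr, type = 11)
# Loop: re-run the Grubbs test, flagging the extreme value while the p-value is below 0.05
grubbs.flag <- function(x) {
  outliers <- NULL
  test <- x
  grubbs.result <- grubbs.test(test)
  pv <- grubbs.result$p.value
  while (pv < 0.05) {
    outliers <- c(outliers, as.numeric(strsplit(grubbs.result$alternative, " ")[[1]][3]))
    test <- x[!x %in% outliers]
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
  }
  data.frame(X = x, Outlier = (x %in% outliers))
}
grubbs.flag(cr)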