Week-3 NK

Uploaded by

Nagaraj Kulkarni
Week-3 Homework submission

Question 5.1
Using crime data from http://www.statsci.org/data/general/uscrime.txt (description
at http://www.statsci.org/data/general/uscrime.html), test to see whether there is an
outlier in the last column (number of crimes per 100,000 people). Is the lowest-crime
city an outlier? Is the highest-crime city an outlier? Are there others? Use the
grubbs.test function in the outliers package in R.

Answer:

Sub-question: See whether there is an outlier in the last column (number of crimes
per 100,000 people)

Step-1. Observation of plots

Based on chart 1.(a), I don’t think there are any outliers, but chart 1.(b)
suggests that there are. So I would not draw a conclusion from these charts
alone, and I investigate the data further.

Figure-1: Charts to observe outliers. 1.(a) Simple dot plot of crime rate observations; 1.(b) Box and whiskers plot.

Step-2. Grubbs test
I conduct Grubbs test to check for outliers in the data. The results suggest that
there are no outliers in the data at a 5% level of significance.

Figure-2: Grubbs test for outliers

No.  Test results                                        Conclusion
1    Test if highest value is an outlier                 Because the p-value is > 0.05, we cannot reject the
     G = 2.81290, U = 0.82426, p-value = 0.07887         null hypothesis. So the highest value is not an
     alternative hypothesis: highest value 1993 is       outlier at a 5% level of significance.
     an outlier
2    Test if highest and lowest values are outliers      Because the p-value is > 0.05, we cannot reject the
     G = 4.26880, U = 0.78103, p-value = 1               null hypothesis. So there are no outliers in the
     alternative hypothesis: 342 and 1993 are outliers   data at a 5% level of significance.
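
As a cross-check on the one-outlier result, the Grubbs statistic can be recomputed in base R without the outliers package. The sketch below is only illustrative: grubbs_one_sided is a hypothetical helper, not a function from the outliers package, and its p-value uses the usual t-distribution conversion with a Bonferroni-style bound capped at 1, which is an assumption and may differ slightly from what grubbs.test reports. The data vector is copied from the Figure-3 table.

```r
# Crime rates per 100,000 people, copied from the Figure-3 table.
cr <- c(791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
        511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
        523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
        831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849)

# Hypothetical helper (not from the outliers package): one-sided Grubbs
# test for a single extreme value in the chosen tail.
grubbs_one_sided <- function(x, tail = c("highest", "lowest")) {
  tail <- match.arg(tail)
  n <- length(x)
  suspect <- if (tail == "highest") max(x) else min(x)
  # Grubbs statistic: distance of the suspect value from the mean, in SDs.
  G <- abs(suspect - mean(x)) / sd(x)
  # Convert G to a t statistic with n - 2 degrees of freedom.
  t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
  # Assumed p-value: n times the upper tail probability, capped at 1.
  p <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))
  list(suspect = suspect, G = G, p.value = p)
}

high <- grubbs_one_sided(cr, "highest")
low  <- grubbs_one_sided(cr, "lowest")
print(high)
print(low)
```

Testing each tail separately in this way addresses the lowest-crime and highest-crime questions one at a time, rather than relying only on the joint test for two opposite outliers.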

Step-3: I also run a loop to test whether any of the values is an outlier (see the R code in
the ‘R-code’ section). The output below confirms that there are no outliers in the data.

Figure-3: Output of looped test of outliers at a 5% level of significance

Data point  Value  Is Outlier?    Data point  Value  Is Outlier?    Data point  Value  Is Outlier?
1           791    FALSE          17          539    FALSE          33          1072   FALSE
2           1635   FALSE          18          929    FALSE          34          923    FALSE
3           578    FALSE          19          750    FALSE          35          653    FALSE
4           1969   FALSE          20          1225   FALSE          36          1272   FALSE
5           1234   FALSE          21          742    FALSE          37          831    FALSE
6           682    FALSE          22          439    FALSE          38          566    FALSE
7           963    FALSE          23          1216   FALSE          39          826    FALSE
8           1555   FALSE          24          968    FALSE          40          1151   FALSE
9           856    FALSE          25          523    FALSE          41          880    FALSE
10          705    FALSE          26          1993   FALSE          42          542    FALSE
11          1674   FALSE          27          342    FALSE          43          823    FALSE
12          849    FALSE          28          1216   FALSE          44          1030   FALSE
13          511    FALSE          29          1043   FALSE          45          455    FALSE
14          664    FALSE          30          696    FALSE          46          508    FALSE
15          798    FALSE          31          373    FALSE          47          849    FALSE
16          946    FALSE          32          754    FALSE

Sub-question: Is the lowest-crime city an outlier?

The answer is No. To answer this, I conducted Grubbs test for two opposite outliers. The
result of the test indicates that the lowest-crime city is not an outlier (the p-value > 0.05
means that we cannot reject the null hypothesis at a 5% level of significance).

Grubbs test for two opposite outliers

data: cr
G = 4.26880, U = 0.78103, p-value = 1
alternative hypothesis: 342 and 1993 are outliers

Sub-question: Is the highest-crime city an outlier?

To answer this, I conducted Grubbs test for one outlier. The result of the test indicates
that the highest-crime city is not an outlier at a 5% level of significance. However, at a
10% level of significance, we would conclude that the highest-crime city is an outlier.

> grubbs.test(cr, type = 10, opposite = FALSE, two.sided = FALSE)

Grubbs test for one outlier

data: cr
G = 2.81290, U = 0.82426, p-value = 0.07887
alternative hypothesis: highest value 1993 is an outlier

Sub-question: Are there others?

Based on the output presented in Figure-3, there are no other outliers.

Question 6.1
Describe a situation or problem from your job, everyday life, current events, etc., for
which a Change Detection model would be appropriate. Applying the CUSUM
technique, how would you choose the critical value and the threshold?

Answer:

A change detection model can be applied in macroeconomics, specifically in
detecting a change in the inflation of a country. This is important because a
significant change in inflation is usually followed by changes to monetary policy
(i.e. interest rates). Early detection of a change in inflation means one can anticipate
the central bank’s monetary policy changes, and hence movements in the interest rate
markets (like bond prices).

The critical value would change from country to country, depending on the
volatility of inflation in that country. I would estimate the standard deviation of
inflation during a period of relatively stable inflation, and then use one standard
deviation as the critical value.

Determining the threshold will be an iterative process. I would use historical
observations of the level of my CUSUM (i.e. S_t) over the last 10 years at the times
when the central bank recognized that there was a change in the inflation of the
economy.
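
The choice of critical value and threshold described above can be sketched in R. This is a minimal illustration under stated assumptions: the inflation readings are made-up numbers, the baseline mean mu is taken from the stable period, and the threshold of 1.0 is a hypothetical placeholder for the value that would in practice be calibrated against historical S_t levels.

```r
# Hypothetical inflation readings (percent); not real data.
stable <- c(2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0)  # stable period
recent <- c(2.1, 2.0, 2.6, 2.9, 3.1, 3.3)                      # period to monitor

mu <- mean(stable)  # assumed baseline level of inflation
C  <- sd(stable)    # critical value: one standard deviation of the stable period

# CUSUM for detecting an increase: S_t = max(0, S_{t-1} + (x_t - mu - C)).
cusum_up <- function(x, mu, C) {
  s <- numeric(length(x))
  prev <- 0
  for (i in seq_along(x)) {
    s[i] <- max(0, prev + (x[i] - mu - C))
    prev <- s[i]
  }
  s
}

s <- cusum_up(recent, mu, C)
threshold <- 1.0  # hypothetical; to be calibrated on historical episodes
first_alarm <- which(s >= threshold)[1]  # index of first detection, NA if none
print(round(s, 3))
print(first_alarm)
```

Raising the threshold delays detection but reduces false alarms, which is why it would be tuned iteratively against past episodes.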

Question 6.2
1. Using July through October daily-high-temperature data for Atlanta for 1996
through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e.,
when the weather starts cooling off) each year.
You can use R if you’d like, but it’s straightforward enough that an Excel
spreadsheet can easily do the job too.

Answer:
For formulae, I follow the notation used in the lecture.
I use an Excel spreadsheet to conduct this analysis.

I build a CUSUM model for each year. The procedure used to build each model is
as follows.

Before calculating S_t, we need to choose a value for C. Intuitively, C is the value
below which a change in temperature is attributed to random factors, so S_t
should not be affected by changes smaller than C.
I estimate the value of C as follows. For calculations, see the attached Excel
workbook (sheet-Data).
1. Calculate the daily change in temperature (∆T = X_t - X_{t-1}). This is because we
are interested in knowing which changes in temperature are caused by random
factors and which are not.
2. Calculate the standard deviation of ∆T.
3. Because we are interested only in decreases (i.e. changes in one
direction), C = 0.5 * average standard deviation.

I then calculate S_t = max{0, S_{t-1} + (µ - X_t - C)}. Please see the Excel workbook
(sheet-Analysis-1).

The threshold T is chosen as 5*C. This choice is based on the observation that
any value of T below this number would produce false detections during that year.
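
The same procedure can be sketched in R for a single year. This is only an illustration: the temperatures are hypothetical, the standard deviation is taken from that one series rather than averaged, and µ is assumed to be the mean of an early window of days, since the text above does not state how µ was chosen.

```r
# Hypothetical daily high temperatures (deg F) for one year, in date order.
temps <- c(90, 91, 89, 90, 92, 90, 88, 89, 85, 83, 80, 78, 76)

# Assumption: mu is the mean of an early "still summer" window (first 6 days).
mu <- mean(temps[1:6])

# C is half the standard deviation of the daily changes; T is 5 * C.
C <- 0.5 * sd(diff(temps))
threshold <- 5 * C

# CUSUM for detecting a decrease: S_t = max(0, S_{t-1} + (mu - x_t - C)).
s <- numeric(length(temps))
prev <- 0
for (i in seq_along(temps)) {
  s[i] <- max(0, prev + (mu - temps[i] - C))
  prev <- s[i]
}

# First day on which S_t reaches the threshold (NA if it never does).
alarm_day <- which(s >= threshold)[1]
print(round(s, 2))
print(alarm_day)
```

With real data, the index returned would be mapped back to a calendar date to give the end of unofficial summer for that year.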

Based on this approach, the dates on which unofficial summer ends each year are as follows:

29 Sep 1996    7 Sep 2003     27 Sep 2010
25 Sep 1997    16 Sep 2004    6 Sep 2011
30 Sep 1998    6 Oct 2005     1 Oct 2012
20 Sep 1999    21 Sep 2006    16 Aug 2013
6 Sep 2000     16 Sep 2007    25 Sep 2014
25 Sep 2001    17 Sep 2008    14 Sep 2015
24 Sep 2002    1 Oct 2009

The control chart is presented below.

2. Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate
has gotten warmer in that time (and if so, when).

Answer:

To answer this question, we need to look at the time series of yearly mean (or median)
temperature and conduct a change-point analysis on it. This is done in the attached
Excel workbook (sheet- Analysis-2).

For each year, the difference between the mean and the median looked substantial,
so I preferred to use the median as the measure of central tendency.

By inspection, there seems to be an upward shift in the median temperature after
2009. Nevertheless, I conduct the change-point analysis as below.

I calculate C in the same way as before. However, because of the significant
increase in median temperature during 2010, the standard deviation estimate
would be inflated. So I prefer to estimate the standard deviation of ∆T
excluding the 2010 point.

Standard deviation using all data=2.5489

Standard deviation excluding 2010=1.9704 (I use this)

So value of C = 1.9704/2= 0.9851

S_t is calculated using the formula: S_t = max{0, S_{t-1} + (X_t - µ - C)}

I determine the value of T by observing S_t over time. Before 2010, S_t changed
very little; indeed, the maximum S_t prior to 2010 was 0.71. If T were less than
0.71, I might get a false detection, so I set T to 1. This is a subjective
assessment based on the data.
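
The steps for the yearly series can be sketched in R as follows. The yearly medians here are hypothetical values chosen only to show the mechanics, so the numbers will not match the workbook. Taking µ as the mean of the years before 2010 is an assumption, and the threshold of 1 is reused from the text purely for illustration.

```r
years <- 1996:2015
# Hypothetical yearly median temperatures (deg F); not the workbook values.
med <- c(88, 87, 89, 88, 87, 88, 89, 88, 87, 88,
         89, 88, 87, 88, 92, 91, 90, 91, 90, 91)

d <- diff(med)              # year-over-year change
sd_all <- sd(d)             # standard deviation using all changes
jump <- which.max(abs(d))   # position of the largest single change
sd_excl <- sd(d[-jump])     # standard deviation excluding that change
C <- sd_excl / 2

mu <- mean(med[years < 2010])  # assumption: baseline from the earlier years

# CUSUM for detecting an increase: S_t = max(0, S_{t-1} + (x_t - mu - C)).
s <- numeric(length(med))
prev <- 0
for (i in seq_along(med)) {
  s[i] <- max(0, prev + (med[i] - mu - C))
  prev <- s[i]
}

# The threshold should sit above the largest S_t seen before the shift.
max_before <- max(s[years < 2010])
threshold <- 1
first_alarm_year <- years[which(s >= threshold)[1]]
print(c(sd_all = sd_all, sd_excl = sd_excl, max_before = max_before))
print(first_alarm_year)
```

Comparing max_before with the threshold makes the subjective choice explicit: any T below the pre-shift maximum would have raised a false alarm.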

The results of the calculations are presented in the chart below. Based on this chart,
the answer to the question is ‘Yes’: Atlanta’s summer climate has
gotten warmer, specifically since 2009.

R-code
#load the outliers package, which provides grubbs.test

library(outliers)

#import data and create a new vector storing the values of crime rate (i.e. last column)

uscrime <- read.delim("N:/ISyE 6501/W-3/uscrime.txt")

View(uscrime)

cr <- as.numeric(uscrime$Crime)

#Simple dot plot of crime rate observations

plot(cr)

# Box and whiskers plot of crime rate

boxplot(cr,main="Box and whiskers plot of crime rate")

#Grubbs test for one outlier

grubbs.test(cr, type = 10, opposite = FALSE, two.sided = FALSE)

#Grubbs test for two opposite outliers

grubbs.test(cr, type = 11, opposite = FALSE, two.sided = FALSE)

# Testing if each data point is an outlier: while the most extreme value is
# significant at 5%, record it as an outlier, drop it, and re-test the rest

grubbs.flag <- function(x) {

  outliers <- NULL

  test <- x

  grubbs.result <- grubbs.test(test)

  pv <- grubbs.result$p.value

  while(pv < 0.05) {

    # the third word of the alternative hypothesis text is the suspect value
    outliers <- c(outliers,as.numeric(strsplit(grubbs.result$alternative," ")[[1]][3]))

    test <- x[!x %in% outliers]

    grubbs.result <- grubbs.test(test)

    pv <- grubbs.result$p.value

  }

  return(data.frame(X=x,Outlier=(x %in% outliers)))

}

grubbs.flag(cr)
