
ISYE6501 HW2

Q4.1

Describe a situation or problem from your job, everyday life, current events,
etc., for which a clustering model would be appropriate. List some (up to 5)
predictors that you might use.

I think clustering could be used at the self-weighing stations in supermarket chains, to group which type of grocery item is being placed on the weighing machine. Possible predictors are weight, volume, colour, and density (an interaction term between weight and volume).
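
As a rough sketch of how this could work (the column names and feature values below are made up purely for illustration), the predictors would form a small feature matrix that kmeans can cluster directly:

# Hypothetical grocery features measured at the weighing station (made-up values)
groceries <- data.frame(weight = c(0.20, 1.10, 0.25, 1.00, 0.50),   # kg
                        volume = c(0.30, 1.50, 0.35, 1.40, 0.60),   # litres
                        hue    = c(10, 120, 15, 115, 60))           # dominant colour hue
groceries$density <- groceries$weight / groceries$volume            # interaction-style term

# Cluster the scaled features into (say) two produce groups
km <- kmeans(scale(groceries), centers = 2)
km$cluster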

Q4.2
The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical
response. The predictors are the width and length of the sepal and petal of flowers and the response is the
type of flower. The data is available from the R library datasets and can be accessed with iris once the
library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a specific method performed and
should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors,
your suggested value of k, and how well your best clustering predicts flower type.

library(tidyverse)   # dplyr, tidyr, ggplot2, tibble
library(lubridate)   # date handling used later

iris <- read.table("iris.txt")

# Scale the predictors and keep the response separately
train_x <- iris %>%
  select(-Species) %>%
  scale
test_y <- iris %>%
  select(Species)

# Elbow method: total within-cluster sum of squares for k = 1 to 10
totss <- as.numeric()
for (k in 1:10) {
  model <- kmeans(train_x,
                  centers = k,
                  iter.max = 1000,
                  algorithm = 'Lloyd')
  totss <- c(totss, model$tot.withinss)
}

plot(totss)

[Elbow plot: total within-cluster sum of squares (totss) against k = 1 to 10]

The elbow plot seems to indicate that 3 clusters are ideal, which aligns with our intuition since the data contains three species.

set.seed(6501)
model <- kmeans(train_x,
                centers = 3,
                iter.max = 1000,
                algorithm = 'Lloyd')

# Cross-tabulate cluster assignments against the true species,
# ordering rows so that matching cluster/species pairs fall on the diagonal
conf_matrix <- table(model$cluster, pull(test_y)) %>%
  as.data.frame.matrix() %>%
  arrange(desc(across(everything()))) %>%
  as.matrix()

conf_matrix

##   setosa versicolor virginica
## 3     49          0         0
## 2      1         37         8
## 1      0         13        42

Of the 3 clusters that our model created, cluster 1 corresponds to virginica, cluster 2 to versicolor, and cluster 3 to setosa.

We visualise the clusters based on Sepal Width and Petal Length, with point shape indicating the assigned cluster.

ggplot(iris, aes(x = Sepal.Width, y = Petal.Length, col = Species)) +
  geom_point(shape = model$cluster)

[Scatter plot of Petal.Length against Sepal.Width, coloured by Species, with point shape showing the assigned cluster]

# Accuracy: correctly clustered points (the diagonal) divided by total observations
(conf_matrix %>% diag %>% sum()) / nrow(train_x)

## [1] 0.8533333

We then calculate the accuracy as the number of correctly clustered observations divided by the total number of observations, which comes out to about 85%.
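
As a cross-check on the manual row ordering above, the cluster-to-species mapping can also be derived programmatically. This is a small sketch of my own (not part of the original solution) that labels each cluster with its majority species and recomputes the accuracy:

# Label each cluster with the species it most often contains
tab <- table(model$cluster, pull(test_y))
cluster_labels <- colnames(tab)[apply(tab, 1, which.max)]

# Accuracy: points whose cluster's majority species matches the true species
predicted_species <- cluster_labels[model$cluster]
mean(predicted_species == pull(test_y))   # should reproduce ~0.853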

Q5.1
Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last
column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

uscrime <- read.table("uscrime.txt",
                      header = TRUE)

ggplot(uscrime, aes(y = 0, x = Crime)) +
  geom_boxplot() +
  labs(title = "Boxplot of Crime Rate across US States in 1960",
       y = "") +
  geom_jitter(alpha = 0.1) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

[Boxplot of Crime Rate across US States in 1960, with jittered data points; x-axis: Crime]
The boxplot seems to indicate that there are 3 outliers on the high end.

library(outliers)
# grubbs.test(uscrime$Crime,
#             type = 20   # tests for two outliers on the same tail, the closest option to our 3 suspected outliers
#             )

Error message: Error in qgrubbs(q, n, type, rev = TRUE) : n must be in range 3-30
The two-outlier version of grubbs.test (type = 20) only supports sample sizes between 3 and 30, and we have 47 observations, so it cannot be used here.
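
Since the two-outlier form of the test is unavailable for our sample size, one possible workaround (my own addition, not part of the assignment requirements) is to apply the one-outlier test repeatedly, dropping the most extreme value after each round, to probe the few points suggested by the boxplot:

# Sketch: run the one-outlier Grubbs test up to 3 times, removing the most
# extreme value (relative to the mean) after each test
crime <- uscrime$Crime
for (j in 1:3) {
  print(grubbs.test(crime, type = 10)$p.value)
  crime <- crime[-which.max(abs(crime - mean(crime)))]
}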

grubbs.test(uscrime$Crime,
type = 10)

## 
##  Grubbs test for one outlier
## 
## data:  uscrime$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier

For a one-tailed Grubbs' test for a single outlier, we fail to reject the null hypothesis at the 5% level (p = 0.079 > 0.05), so the highest value (1993) is not flagged as an outlier.

Q6.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a Change
Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical
value and the threshold?
In the food manufacturing industry, we could use change detection to identify deviations in the weights of our packages. Since this application is not highly critical, we can afford slower detection in exchange for a lower likelihood of false detections, so we could set a relatively large critical value C (for example, around 2 standard deviations of the package weight) and a correspondingly large threshold T.
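
To make this concrete, below is a minimal CUSUM sketch on hypothetical package weights; the target weight, standard deviation, and observations are invented for illustration, with the critical value set to 2 standard deviations and the threshold to 5:

# Hypothetical target fill weight of 500 g with historical sd of 5 g
mu    <- 500
sigma <- 5
C     <- 2 * sigma   # critical value: ignore drifts smaller than ~2 sd
Tthr  <- 5 * sigma   # threshold: alarm only on a sustained shift

weights <- c(501, 499, 502, 498, 487, 485, 484, 482, 483, 481)  # made-up observations

# One-sided CUSUM for under-weight packages: S_t = max(0, S_{t-1} + (mu - x_t - C))
# (a mirrored statistic would catch over-weight packages)
S <- 0
for (x in weights) {
  S <- max(0, S + (mu - x - C))
  if (S >= Tthr) { cat("Change detected at observed weight", x, "g\n"); break }
}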

Q6.2.1
1. Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a
CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling
off) each year. You can get the data that you need from the file temps.txt or online, for example
at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html. You can use R if you'd like, but it's straightforward
enough that an Excel spreadsheet can easily do the job too.

# Note: parsing "%d-%b" without a year makes as.Date default to the current year,
# which is why the date filters below use 2023 dates
temps <- read.table("temps.txt", header = TRUE) %>%
  mutate(DAY = as.Date(DAY, format = "%d-%b"))

ggplot(temps %>%
         pivot_longer(cols = -DAY,
                      names_to = "Year",
                      values_to = "temp"),
       aes(x = DAY, y = temp, col = Year)) +
  geom_line()

[Line plot of daily high temperatures (July-October) for each year, 1996-2015]
We see that summer generally ends around September. We shall explore the July-August period to better understand what typical summer temperatures look like.

ggplot(temps %>%
         filter(between(DAY, ymd('20230701'), ymd('20230901'))) %>%
         pivot_longer(cols = -DAY,
                      names_to = "Year",
                      values_to = "temp"),
       aes(y = Year, x = temp)) +
  geom_boxplot()

6
X2015
X2014
X2013
X2012
X2011
X2010
X2009
X2008
X2007
Year

X2006
X2005
X2004
X2003
X2002
X2001
X2000
X1999
X1998
X1997
X1996
70 80 90 100
temp
There seems to be quite a lot of variation in temperature, and there are a number of low-temperature outliers. We would therefore want our CUSUM detection to be less sensitive.

# Expected average summer temperature: median of July-August daily highs, per year
exp_temp <- temps %>%
  filter(between(DAY, ymd('20230701'), ymd('20230901'))) %>%
  summarise(across(where(is.numeric), ~median(.x)))

# Lowest July-August temperature per year
lowest_temp <- temps %>%
  filter(between(DAY, ymd('20230701'), ymd('20230901'))) %>%
  summarise(across(where(is.numeric), ~min(.x)))

exp_temp - lowest_temp

##   X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008
## 1    11    16    10    17    16     7    13    14    12    11    11    10    10
##   X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1    17     9    13    10    20    12    16

The largest difference between the median and the lowest temperature is 20, in 2013. We will set our threshold to a similar magnitude (T = 25, with C = 10).

C = 10
thres = 25

# One-sided CUSUM for detecting a drop in temperature:
# S_t = max(0, S_{t-1} + (expected temp - observed temp - C))
cusum <- list()
for (year in colnames(exp_temp)) {
  st1 <- 0
  for (i in 1:nrow(temps)) {
    cusum[[year]][["value"]][i] <- max(0, st1 + exp_temp[1, year] - temps[[year]][i] - C)
    st1 <- cusum[[year]][["value"]][i]
  }
  # First day on which the CUSUM statistic crosses the threshold
  cusum[[year]][["day"]][1] <- paste0(substr(year, 2, 5),
                                      temps$DAY[match(TRUE, cusum[[year]][["value"]] >= thres)] %>%
                                        format("%m%d"))
}

end_of_summer <- sapply(cusum, "[[", 2)

end_of_summer

##      X1996      X1997      X1998      X1999      X2000      X2001      X2002 
## "19960930" "19970927" "19981010" "19990923" "20000907" "20010927" "20020927" 
##      X2003      X2004      X2005      X2006      X2007      X2008      X2009 
## "20031001" "20041013" "20051024" "20060929" "20071014" "20081019" "20091005" 
##      X2010      X2011      X2012      X2013      X2014      X2015 
## "20101001" "20110907" "20121003" "20131019" "20141019" "20150925"

After fine-tuning C and T, these are the unofficial end-of-summer dates for each year.
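
The values C = 10 and T = 25 above were picked by trial and error; as a rough sensitivity check (my own addition, with the grid values chosen arbitrarily), one could wrap the detection step in a helper and see how the detected date for a single year moves across a small grid of C and T:

# Sketch: detected end-of-summer date for 1996 over a small grid of C and T
detect_day <- function(x, mu, C, thres) {
  s <- 0
  for (i in seq_along(x)) {
    s <- max(0, s + mu - x[i] - C)
    if (s >= thres) return(format(temps$DAY[i], "%m-%d"))  # look up the date in temps
  }
  NA
}

for (C_try in c(5, 10, 15)) {
  for (T_try in c(15, 25, 35)) {
    cat("C =", C_try, ", T =", T_try, "->",
        detect_day(temps$X1996, exp_temp[1, "X1996"], C_try, T_try), "\n")
  }
}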

Q6.2.2
2. Use a CUSUM approach to make a judgment of whether Atlanta's summer climate has gotten warmer in that time (and if so, when).

# Keep only the days before each year's detected end of summer
summers <- temps %>%
  pivot_longer(cols = -DAY,
               names_to = "Year",
               values_to = "temp") %>%
  mutate(DAY = paste0(substr(Year, 2, 5), substr(DAY, 6, 7), substr(DAY, 9, 10))) %>%
  left_join(end_of_summer %>%
              as.data.frame() %>%
              rownames_to_column(),
            by = c("Year" = "rowname")) %>%
  rename("end_of_summer_year" = ".") %>%
  filter(DAY < end_of_summer_year)

# Yearly average of the summer temperatures (note: the mean, despite the object name)
med_summer <- summers %>%
  mutate(Year = as.numeric(substr(Year, 2, 5))) %>%
  group_by(Year) %>%
  summarise(temp = mean(temp)) %>%
  ungroup()

ggplot(med_summer, aes(x = Year, y = temp, group = "1")) +
  geom_line() +
  labs(title = "Mean Summer Temperature")

[Line plot of mean summer temperature by year, 1996-2015]

The mean summer temperature fluctuates quite a lot from year to year. I will use the median of the 1996 to 2003 values as the expected temperature. Since small changes are critical for climate data, we will use a relatively small C and threshold.

C = 0.5
thres = 5

# Expected summer temperature: median of the 1996-2003 yearly values
exp_summer_temp <- med_summer %>%
  filter(between(Year, 1996, 2003)) %>%
  summarise(temp = median(temp)) %>%
  pull()

# CUSUM over years to detect an increase in summer temperature
cusum_year <- data.frame(Year = integer(),
                         value = double())
st1 <- 0
for (i in 1:nrow(med_summer)) {
  cusum_year[i, 1] <- med_summer[[i, 1]]
  cusum_year[i, 2] <- max(0, st1 + med_summer[[i, 2]] - exp_summer_temp - C)
  st1 <- cusum_year[i, 2]
}

# First year in which the CUSUM statistic crosses the threshold
cusum_year[match(TRUE, cusum_year$value >= thres), 1]

## [1] 2011

Based on this CUSUM analysis, Atlanta's summer climate appears to have gotten warmer, with the change detected in 2011.
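
As a visual sanity check (my own addition, not part of the required answer), plotting the accumulated CUSUM statistic by year against the threshold shows where the evidence first crosses 5:

# Plot the yearly CUSUM statistic against the detection threshold
ggplot(cusum_year, aes(x = Year, y = value)) +
  geom_line() +
  geom_hline(yintercept = thres, linetype = "dashed") +
  labs(title = "CUSUM of mean summer temperature vs. 1996-2003 baseline",
       y = "CUSUM statistic")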
