07exercise Solution
(a) Explore the data and describe it with plots, tables and words. Do you think the collected data of the
different countries are comparable?
Solution:
rm(list = ls())
ec <- read.csv("data/ec.txt")
rownames(ec) <- ec[, 1]
ec <- ec[, -1]
head(ec)
str(ec)
summary(ec)
##       Work           Price            Salary             SalaryCat
##  Min.   :1583   Min.   : 30.30   Min.   :  2.70   (2.6,35.1] :21
##  1st Qu.:1745   1st Qu.: 49.65   1st Qu.: 14.38   (35.1,67.6]:21
##  Median :1849   Median : 70.50   Median : 43.65   (67.6,100] : 4
##  Mean   :1880   Mean   : 68.86   Mean   : 39.55   NA's       : 2
##  3rd Qu.:1976   3rd Qu.: 81.70   3rd Qu.: 59.70
##  Max.   :2375   Max.   :115.50   Max.   :100.00
##  NA's   :2                       NA's   :  2
head(ec[order(ec$Price), ])
tail(ec[order(ec$Price), ])
ec[is.na(ec$Work), ]
pairs(ec)
[Figure: pairs(ec) scatterplot matrix of Work, Price, Salary and SalaryCat, followed by histograms of Work (1600-2400), Price (40-120) and Salary (0-100) with Frequency on the y-axis]
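The histograms of the three numeric variables have no corresponding code in the excerpt; a plausible sketch (panel layout and break settings are assumptions):

```r
# Histograms of the three numeric variables side by side,
# as shown in the figure above; hist() skips NA values automatically.
par(mfrow = c(1, 3))
hist(ec$Work, main = "Work")
hist(ec$Price, main = "Price")
hist(ec$Salary, main = "Salary")
par(mfrow = c(1, 1))
```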
## SalaryCat
## (2.6,35.1] (35.1,67.6] (67.6,100]
## (1.58e+03,1.85e+03] 6 16 1
## (1.85e+03,2.11e+03] 10 5 3
## (2.11e+03,2.38e+03] 5 0 0
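The cross-table above relates binned working hours to the salary categories; a sketch of how it can be produced (the split of Work into three equal-width intervals is inferred from the interval labels in the output):

```r
# Cross-tabulate Work, cut into three equal-width bins, against SalaryCat;
# the bin labels match the row labels of the table above.
table(cut(ec$Work, breaks = 3), ec$SalaryCat)
```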
(b) Which cities are similar? Conduct a cluster analysis. Decide on a “reasonable” size of the grouping. Plot
and interpret the results.
Solution:
[Figure: dendrograms of the hierarchical clusterings of the cities (Height on the y-axis), with city names as leaf labels; Hong Kong, Tokyo, Taipei, Oslo and Stockholm split off high up in the tree]
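The objects hc1n (used below) and dat (used in the k-means call) are not defined in the excerpt; the following is a plausible construction, where the variable selection, scaling and default complete linkage are assumptions:

```r
# Complete cases of the three numeric variables, standardized so that
# no single variable dominates the Euclidean distances.
dat <- scale(na.omit(ec[, c("Work", "Price", "Salary")]))

# Hierarchical clustering on the Euclidean distance matrix.
hc1n <- hclust(dist(dat))
plot(hc1n)  # dendrogram as in the figure above
```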
## Hierarchical clustering
grp <- cutree(hc1n, k = 3)
list(first = names(grp[grp == 1]), second = names(grp[grp == 2]), third = names(grp[grp ==
    3]))
## $first
## [1] "Amsterdam" "Brussels" "Chicago" "Copenhagen" "Dublin"
## [6] "Dusseldorf" "Frankfurt" "Geneva" "Helsinki" "Houston"
## [11] "London" "Los Angeles" "Luxembourg" "Madrid" "Milan"
## [16] "Montreal" "New York" "Oslo" "Paris" "Stockholm"
## [21] "Sydney" "Tokyo" "Toronto" "Vienna" "Zurich"
##
## $second
##  [1] "Athens"         "Buenos Aires"   "Johannesburg"   "Lagos"
##  [5] "Lisbon"         "Mexico City"    "Nairobi"        "Nicosia"
##  [9] "Rio de Janeiro" "Sao Paulo"      "Seoul"
##
## $third
## [1] "Bogota" "Bombay" "Caracas" "Hong Kong" "Kuala Lumpur"
## [6] "Manila" "Panama" "Singpore" "Taipei" "Tel Aviv"
## K-means
km <- kmeans(dat, center = 3, nstart = 20)
grp2 <- km$cluster
list(first = names(grp2[grp2 == 1]), second = names(grp2[grp2 == 2]), third = names(grp2[grp2 ==
    3]))
## $first
## [1] "Amsterdam" "Brussels" "Chicago" "Copenhagen" "Dublin"
## [6] "Dusseldorf" "Frankfurt" "Geneva" "Helsinki" "Houston"
## [11] "London" "Los Angeles" "Luxembourg" "Madrid" "Milan"
## [16] "Montreal" "New York" "Oslo" "Paris" "Stockholm"
## [21] "Sydney" "Tokyo" "Toronto" "Vienna" "Zurich"
##
## $second
##  [1] "Bogota"       "Bombay"       "Caracas"      "Hong Kong"    "Kuala Lumpur"
##  [6] "Manila"       "Panama"       "Singpore"     "Taipei"       "Tel Aviv"
##
## $third
## [1] "Athens" "Buenos Aires" "Johannesburg" "Lagos"
## [5] "Lisbon" "Mexico City" "Nairobi" "Nicosia"
## [9] "Rio de Janeiro" "Sao Paulo" "Seoul"
(c) Find a “good” linear regression model with Salary as response variable. Interpret the results and discuss
the model fit.
Solution:
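The two summaries below belong to the models later compared with anova(mod2, mod1); the fitting code is missing from the excerpt, so here is a sketch reconstructed from the Call lines (model names taken from the anova call):

```r
# Full model with both predictors, and the reduced model with Price only;
# lm() drops the two rows with missing values automatically.
mod1 <- lm(Salary ~ Work + Price, data = ec)
mod2 <- lm(Salary ~ Price, data = ec)
summary(mod1)
summary(mod2)
```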
##
## Call:
## lm(formula = Salary ~ Work + Price, data = ec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.392 -10.615 1.819 8.081 34.279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.09421 31.32573 0.322 0.749
## Work -0.01673 0.01422 -1.177 0.246
## Price 0.86876 0.11587 7.497 2.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.83 on 43 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6572, Adjusted R-squared: 0.6412
## F-statistic: 41.21 on 2 and 43 DF, p-value: 1.009e-10
##
## Call:
## lm(formula = Salary ~ Price, data = ec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.679 -10.279 -0.861 11.280 32.635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.6766 7.6007 -3.378 0.00154 **
## Price 0.9304 0.1038 8.963 1.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.89 on 44 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6461, Adjusted R-squared: 0.6381
## F-statistic: 80.34 on 1 and 44 DF, p-value: 1.746e-11
anova(mod2, mod1, test = "F")
require(MASS)
step <- stepAIC(mod1, direction = "both")
## Start: AIC=250.99
## Salary ~ Work + Price
##
## Df Sum of Sq RSS AIC
## - Work 1 304.5 9760.5 250.44
## <none> 9456.0 250.99
## - Price 1 12361.2 21817.2 287.44
##
## Step: AIC=250.44
## Salary ~ Price
##
## Df Sum of Sq RSS AIC
## <none> 9760.5 250.44
## + Work 1 304.5 9456.0 250.99
## - Price 1 17822.0 27582.5 296.23
summary(step)
##
## Call:
## lm(formula = Salary ~ Price, data = ec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.679 -10.279 -0.861 11.280 32.635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.6766 7.6007 -3.378 0.00154 **
## Price 0.9304 0.1038 8.963 1.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.89 on 44 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6461, Adjusted R-squared: 0.6381
## F-statistic: 80.34 on 1 and 44 DF, p-value: 1.746e-11
[Figure: diagnostic plots of the selected model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage with Cook's distance); Stockholm and Oslo are flagged as unusual observations]
(d) Use PCA and display the results in a biplot. Interpret the results.
Solution:
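The biplot call below uses an object pca that is not defined in the excerpt; a plausible construction (the restriction to complete cases and the scaling to unit variance are assumptions):

```r
# PCA on the complete cases of the numeric variables,
# scaled to unit variance so all three contribute comparably.
pca <- prcomp(na.omit(ec[, c("Work", "Price", "Salary")]), scale. = TRUE)
summary(pca)  # proportion of variance explained per component
```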
biplot(pca)
[Figure: PCA biplot (PC1 vs PC2) with the cities as points and arrows for the variables Work, Price and Salary; Hong Kong, Taipei and Manila lie toward the Work arrow, the high-salary European and North American cities toward Price and Salary]
(e) Use linear and quadratic discriminant analysis to discriminate the three salary levels of SalaryCat based on
the variables Work and Price. Display the results graphically. Compare the results of the linear and the
quadratic approach.
Solution:
require(MASS)
ec.noNA <- na.omit(ec)  # drop the rows with missing values
lda1 <- lda(SalaryCat ~ Work + Price, data = ec.noNA)
Work <- with(ec.noNA, seq(min(Work), max(Work), length = 100))
Price <- with(ec.noNA, seq(min(Price), max(Price), length = 100))
grid <- expand.grid(Work = Work, Price = Price)
pred.lda1 <- predict(lda1, grid)$class
qda1 <- qda(SalaryCat ~ Work + Price, data = ec.noNA)
pred.qda1 <- predict(qda1, grid)$class
par(mfrow = c(1, 2))
cols <- c("white", rgb(0.9, 0.9, 0.8), rgb(0.8, 0.9, 0.9))
image(Work, Price, array(as.numeric(pred.lda1), c(100, 100)), col = cols)
with(ec, points(Work, Price, col = SalaryCat))
image(Work, Price, array(as.numeric(pred.qda1), c(100, 100)), col = cols)
with(ec, points(Work, Price, col = SalaryCat))
[Figure: classification regions of the LDA fit (left) and the QDA fit (right) in the Work-Price plane, with the observed cities plotted as points coloured by SalaryCat]
(f) Compare the findings from the different approaches above. Make five distinct statements and refer to all
points (a) to (e) at least once.
Solution:
Keywords: PCA and clustering make no assumption about the data's distribution; they are unsupervised
algorithms that give no prediction, just description. PCA supplies the 'why' behind the clustering's result.
Linear regression and discriminant analysis are predictive methods with a Gaussian assumption.
Problem 20 (Do NOT use R for the following tasks)
(a) Consider the simple linear regression Ỹ_i = β0 + β1 X̃_i + U_i with X̃_i ∼ N(µ, σ_x̃²) and U_i ∼ N(0, σ_u²), where X̃_i and
U_i are independent and i = 1, . . . , n.
Solution:
(a)
(i)
Ỹ_i ∼ N(β0 + β1 µ, σ_u² + β1² σ_x̃²)
(ii)
ε_i ∼ N(0, σ_u² + σ_v²)
(iii)
ε_i ∼ N(0, σ_u² + β1² σ_w²)
(b) • Consider a real-valued statistic Tn = h(X1:n ), based on a random sample X1:n from a distribution with
probability mass or density function f (x; θ) where θ is an unknown scalar parameter. If the random
variable Tn is used to make inference about θ, then it is called an estimator. We may simply
write T rather than Tn if the sample size n is not important. The particular value t = h(x1:n ) that an
estimator takes for a realization x1:n of the random sample X1:n is called an estimate.
(c) Keywords: bootstrapping; resample the data with replacement B times (B large), each resample of the original
size n; derive B estimates and construct the confidence interval from the empirical quantiles of those B estimates.
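Although this problem is to be solved without R, the percentile-bootstrap recipe in the keywords can be sketched in a few lines for illustration (synthetic data; the sample, the estimator mean() and the resample count B are all assumptions of the sketch):

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)  # synthetic sample
B <- 2000                          # number of bootstrap resamples

# Resample the data with replacement B times and recompute the estimate each time.
boot.means <- replicate(B, mean(sample(x, replace = TRUE)))

# 95% percentile bootstrap confidence interval for the mean.
ci <- quantile(boot.means, c(0.025, 0.975))
```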
(d) Keywords: estimation of the true MSE by leave-one-out cross-validation; leave one observation yᵢ out, predict ŷᵢ
from the model fitted to the remaining data, repeat over all observations, and calculate the estimated MSE
from the prediction errors.
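The leave-one-out procedure described above can likewise be sketched, again outside the scope of the R-free exam task (synthetic data loosely mirroring the Salary ~ Price regression from part (c); all names and coefficients here are illustrative):

```r
set.seed(1)
n <- 40
price  <- runif(n, 30, 115)
salary <- -25 + 0.93 * price + rnorm(n, sd = 15)
d <- data.frame(price, salary)

# Leave-one-out cross-validation: refit without observation i, predict y_i.
errs <- sapply(seq_len(n), function(i) {
  fit <- lm(salary ~ price, data = d[-i, ])
  d$salary[i] - predict(fit, newdata = d[i, , drop = FALSE])
})
mse.loocv <- mean(errs^2)  # leave-one-out estimate of the true MSE
```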
http://www.math.uzh.ch/sta121.1|07exercise.pdf 2014-10-13