
07exercise Solution

The document appears to be an exam problem from a statistics course exploring work and living conditions in cities around the world in 1991 based on data from the Economic Research Department of the Union Bank of Switzerland. The data includes variables on working hours, cost of living indexes, wage indexes, and categorized wage levels for various cities. The student is asked to explore and describe the data through plots, tables and analysis to assess if the data from different countries are comparable. R code is provided to import, manipulate and examine the data.

Uploaded by

ayanabi8753

University of Zurich STA121

Institute of Mathematics Reinhard Furrer

Exercises 7 Gilles Kratzer, Simone Tiberi

Test Exam, 120 minutes


Problem 19 (Use R for the following tasks)
The dataset ec.txt was collected by the Economic Research Department of the Union Bank of Switzerland in
1991. We want to investigate the following question: What were the work and living conditions in different cities
around the globe in 1991? (The data are available on the course webpage.)
The variables are:

City: City name


Work: Weighted average of the number of working hours in 12 occupations
Price: Index of the cost of 112 goods and services excluding rent (Zurich = 100)
Salary: Index of hourly earnings in 12 occupations after deductions (Zurich = 100)
SalaryCat: Was generated from Salary with the R code: cut(Salary, breaks = 3)
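
As a small illustration of how SalaryCat arises, the sketch below applies cut() with breaks = 3 to a few Salary values taken from the listing further down. With a single number as breaks, cut() splits the observed range into that many intervals of equal width, so the category limits depend only on the smallest and largest value. The vector salary_demo is a hypothetical subset, not the full data.

```r
# Hypothetical subset of Salary values (minimum 2.7 and maximum 100 as in the data)
salary_demo <- c(2.7, 30.4, 49.0, 100.0)

# breaks = 3 asks for three equal-width intervals over the observed range
salary_cat_demo <- cut(salary_demo, breaks = 3)

# Inspect the interval labels and the category of each value
print(levels(salary_cat_demo))
print(as.integer(salary_cat_demo))
```

Because the intervals have equal width rather than equal counts, the resulting groups can be very unbalanced.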

(a) Explore the data and describe them with plots, tables and words. Do you think the collected data of the
different countries are comparable?
Solution:

ec <- read.csv("data/worldEconomy.txt", sep = "\t")


ec <- within(ec, SalaryCat <- cut(Salary, breaks = 3))
write.csv(ec, "data/ec.txt", row.names = FALSE)

rm(list = ls())
ec <- read.csv("data/ec.txt")
rownames(ec) <- ec[, 1]
ec <- ec[, -1]
head(ec)

##              Work Price Salary   SalaryCat
## Amsterdam    1714  65.6   49.0 (35.1,67.6]
## Athens       1792  53.8   30.4  (2.6,35.1]
## Bogota       2152  37.9   11.5  (2.6,35.1]
## Bombay       2052  30.3    5.3  (2.6,35.1]
## Brussels     1708  73.8   50.5 (35.1,67.6]
## Buenos Aires 1971  56.1   12.5  (2.6,35.1]

str(ec)

## 'data.frame': 48 obs. of 4 variables:
##  $ Work     : int 1714 1792 2152 2052 1708 1971 NA 2041 1924 1717 ...
##  $ Price    : num 65.6 53.8 37.9 30.3 73.8 56.1 37.1 61 73.9 91.3 ...
##  $ Salary   : num 49 30.4 11.5 5.3 50.5 12.5 NA 10.9 61.9 62.9 ...
##  $ SalaryCat: Factor w/ 3 levels "(2.6,35.1]","(35.1,67.6]",..: 2 1 1 1 2 1 NA 1 2 2 ...

summary(ec)
##       Work          Price            Salary            SalaryCat
##  Min.   :1583   Min.   : 30.30   Min.   :  2.70   (2.6,35.1] :21
##  1st Qu.:1745   1st Qu.: 49.65   1st Qu.: 14.38   (35.1,67.6]:21
##  Median :1849   Median : 70.50   Median : 43.65   (67.6,100] : 4
##  Mean   :1880   Mean   : 68.86   Mean   : 39.55   NA's       : 2
##  3rd Qu.:1976   3rd Qu.: 81.70   3rd Qu.: 59.70
##  Max.   :2375   Max.   :115.50   Max.   :100.00
##  NA's   :2                       NA's   :2

head(ec[order(ec$Price), ])

##              Work Price Salary  SalaryCat
## Bombay       2052  30.3    5.3 (2.6,35.1]
## Cairo          NA  37.1     NA       <NA>
## Bogota       2152  37.9   11.5 (2.6,35.1]
## Manila       2268  40.0    4.0 (2.6,35.1]
## Kuala Lumpur 2167  43.5    9.9 (2.6,35.1]
## Jakarta        NA  43.6     NA       <NA>

tail(ec[order(ec$Price), ])

##           Work Price Salary   SalaryCat
## Geneva    1880  95.9   90.3  (67.6,100]
## Zurich    1868 100.0  100.0  (67.6,100]
## Stockholm 1805 111.3   39.2 (35.1,67.6]
## Helsinki  1667 113.6   66.6 (35.1,67.6]
## Tokyo     1880 115.0   68.0  (67.6,100]
## Oslo      1583 115.5   63.7 (35.1,67.6]

ec[is.na(ec$Work), ]

##         Work Price Salary SalaryCat
## Cairo     NA  37.1     NA      <NA>
## Jakarta   NA  43.6     NA      <NA>
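
The missing-value check above can be made a little more systematic. The following is a minimal sketch on a hypothetical four-row excerpt (ec_small, rebuilt by hand from rows printed above); it counts missing values per column, lists the incomplete cities and drops them, which is what na.omit() does for the full data later on.

```r
# Hypothetical excerpt of the data, typed in from the listings above
ec_small <- data.frame(
  Work   = c(1714, 1792, NA, NA),
  Price  = c(65.6, 53.8, 37.1, 43.6),
  Salary = c(49.0, 30.4, NA, NA),
  row.names = c("Amsterdam", "Athens", "Cairo", "Jakarta")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(ec_small))
print(na_per_column)

# Names of the cities with at least one missing value
incomplete <- rownames(ec_small)[!complete.cases(ec_small)]
print(incomplete)

# Keep only the complete rows
ec_small_noNA <- na.omit(ec_small)
```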

pairs(ec)

[Figure: scatterplot matrix produced by pairs(ec), with panels Work, Price, Salary and SalaryCat.]



par(mfrow = c(1, 3))
with(ec, {
hist(Work)
hist(Price)
hist(Salary)
})

[Figure: three histograms side by side, "Histogram of Work", "Histogram of Price" and "Histogram of Salary"; y-axis: Frequency.]

with(ec, table(cut(Work, 3), SalaryCat))

## SalaryCat
## (2.6,35.1] (35.1,67.6] (67.6,100]
## (1.58e+03,1.85e+03] 6 16 1
## (1.85e+03,2.11e+03] 10 5 3
## (2.11e+03,2.38e+03] 5 0 0

ec.noNA <- na.omit(ec)

(b) Which cities are similar? Conduct a cluster analysis. Decide on a “reasonable” size of the grouping. Plot
and interpret the results.
Solution:

dat <- scale(ec.noNA[, -4])


par(mfrow = c(1, 3))
hc1n <- hclust(dist(dat))
plot(hc1n)
hc2n <- hclust(dist(dat), method = "single")
plot(hc2n)
hc3n <- hclust(dist(dat), method = "ward.D")
plot(hc3n)
[Figure: three cluster dendrograms of dist(dat), y-axis: Height. Panels from left to right: hclust (*, "complete"), hclust (*, "single"), hclust (*, "ward.D").]

## Hierarchical clustering
grp <- cutree(hc1n, k = 3)
list(first = names(grp[grp == 1]), second = names(grp[grp == 2]),
     third = names(grp[grp == 3]))

## $first
## [1] "Amsterdam" "Brussels" "Chicago" "Copenhagen" "Dublin"
## [6] "Dusseldorf" "Frankfurt" "Geneva" "Helsinki" "Houston"
## [11] "London" "Los Angeles" "Luxembourg" "Madrid" "Milan"
## [16] "Montreal" "New York" "Oslo" "Paris" "Stockholm"
## [21] "Sydney" "Tokyo" "Toronto" "Vienna" "Zurich"
##
## $second
## [1] "Athens" "Buenos Aires" "Johannesburg" "Lagos"
## [5] "Lisbon" "Mexico City" "Nairobi" "Nicosia"
## [9] "Rio de Janeiro" "Sao Paulo" "Seoul"
##
## $third
## [1] "Bogota" "Bombay" "Caracas" "Hong Kong" "Kuala Lumpur"
## [6] "Manila" "Panama" "Singpore" "Taipei" "Tel Aviv"

## K-means
km <- kmeans(dat, centers = 3, nstart = 20)
grp2 <- km$cluster
list(first = names(grp2[grp2 == 1]), second = names(grp2[grp2 == 2]),
     third = names(grp2[grp2 == 3]))

## $first
## [1] "Amsterdam" "Brussels" "Chicago" "Copenhagen" "Dublin"
## [6] "Dusseldorf" "Frankfurt" "Geneva" "Helsinki" "Houston"
## [11] "London" "Los Angeles" "Luxembourg" "Madrid" "Milan"
## [16] "Montreal" "New York" "Oslo" "Paris" "Stockholm"
## [21] "Sydney" "Tokyo" "Toronto" "Vienna" "Zurich"
##
## $second
## [1] "Bogota" "Bombay" "Caracas" "Hong Kong" "Kuala Lumpur"
## [6] "Manila" "Panama" "Singpore" "Taipei" "Tel Aviv"
##
## $third
## [1] "Athens" "Buenos Aires" "Johannesburg" "Lagos"
## [5] "Lisbon" "Mexico City" "Nairobi" "Nicosia"
## [9] "Rio de Janeiro" "Sao Paulo" "Seoul"
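
To compare a hierarchical and a k-means partition without listing the groups by hand, a cross-tabulation of the two label vectors is often enough. The sketch below is an illustration only: it uses the built-in USArrests data as a stand-in for the city data (an assumption, since ec.txt is not bundled with R), and all object names ending in _demo are hypothetical.

```r
# Stand-in data: standardize the built-in USArrests data set
dat_demo <- scale(USArrests)

# Complete-linkage hierarchical clustering, cut into three groups
hc_demo <- hclust(dist(dat_demo), method = "complete")
grp_hc  <- cutree(hc_demo, k = 3)

# k-means with three centers; the seed only makes the run reproducible
set.seed(1)
km_demo <- kmeans(dat_demo, centers = 3, nstart = 20)
grp_km  <- km_demo$cluster

# Rows: hierarchical groups, columns: k-means groups.
# Cluster labels are arbitrary, so look for one dominant cell per row.
agreement <- table(hclust = grp_hc, kmeans = grp_km)
print(agreement)

# split() lists the members of every group in one call,
# which avoids copying the group index by hand
groups_hc <- split(names(grp_hc), grp_hc)
str(groups_hc)
```

If both methods find the same structure, every row of the table has a single large entry; the numbering of the groups need not agree.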

(c) Find a “good” linear regression model with Salary as response variable. Interpret the results and discuss
the model fit.
Solution:

mod1 <- lm(Salary ~ Work + Price, data = ec)


summary(mod1)

##
## Call:
## lm(formula = Salary ~ Work + Price, data = ec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.392 -10.615 1.819 8.081 34.279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.09421 31.32573 0.322 0.749
## Work -0.01673 0.01422 -1.177 0.246
## Price 0.86876 0.11587 7.497 2.47e-09 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 14.83 on 43 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6572, Adjusted R-squared: 0.6412
## F-statistic: 41.21 on 2 and 43 DF, p-value: 1.009e-10

mod2 <- update(mod1, Salary ~ Price)


summary(mod2)

##
## Call:
## lm(formula = Salary ~ Price, data = ec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.679 -10.279 -0.861 11.280 32.635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.6766 7.6007 -3.378 0.00154 **
## Price 0.9304 0.1038 8.963 1.75e-11 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 14.89 on 44 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6461, Adjusted R-squared: 0.6381
## F-statistic: 80.34 on 1 and 44 DF, p-value: 1.746e-11
anova(mod2, mod1, test = "F")

## Analysis of Variance Table


##
## Model 1: Salary ~ Price
## Model 2: Salary ~ Work + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 9760.5
## 2 43 9456.0 1 304.51 1.3847 0.2458
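
The F statistic in this table can be recomputed by hand from the two residual sums of squares, which makes explicit what anova() compares. The sketch below only uses the numbers printed above; the variable names are illustrative.

```r
# Residual sums of squares and residual degrees of freedom from the table above
rss_reduced <- 9760.5; df_reduced <- 44   # Salary ~ Price
rss_full    <- 9456.0; df_full    <- 43   # Salary ~ Work + Price

# F = (drop in RSS per dropped parameter) / (residual variance of the larger model)
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)

# Upper tail probability of the F distribution with (1, 43) degrees of freedom
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

Since only one coefficient is tested, this F statistic is the square of the t value of Work in summary(mod1), and both tests give the same p-value.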

require(MASS)
step <- stepAIC(mod1, direction = "both")

## Start: AIC=250.99
## Salary ~ Work + Price
##
## Df Sum of Sq RSS AIC
## - Work 1 304.5 9760.5 250.44
## <none> 9456.0 250.99
## - Price 1 12361.2 21817.2 287.44
##
## Step: AIC=250.44
## Salary ~ Price
##
## Df Sum of Sq RSS AIC
## <none> 9760.5 250.44
## + Work 1 304.5 9456.0 250.99
## - Price 1 17822.0 27582.5 296.23
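
The AIC values in this trace can be reproduced from the residual sums of squares. For linear models, stepAIC() relies on extractAIC(), which reports n * log(RSS / n) + 2 * p with p the number of regression coefficients; this differs from AIC() by a constant that does not affect the comparison. The sketch below assumes n = 46, i.e. the 48 cities minus the two with missing values; aic_lm is a hypothetical helper.

```r
# Hypothetical helper mirroring the formula used by extractAIC() for lm fits
aic_lm <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 46  # 48 observations minus 2 deleted due to missingness

aic_full    <- aic_lm(9456.0, n, p = 3)  # intercept, Work, Price
aic_reduced <- aic_lm(9760.5, n, p = 2)  # intercept, Price

print(round(c(full = aic_full, reduced = aic_reduced), 2))
```

The reduced model has the smaller value, which is why the procedure drops Work and then stops.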

summary(step)

##
## Call:
## lm(formula = Salary ~ Price, data = ec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.679 -10.279 -0.861 11.280 32.635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.6766 7.6007 -3.378 0.00154 **
## Price 0.9304 0.1038 8.963 1.75e-11 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 14.89 on 44 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6461, Adjusted R-squared: 0.6381
## F-statistic: 80.34 on 1 and 44 DF, p-value: 1.746e-11

par(mfrow = c(2, 2))


plot(step)
[Figure: the four diagnostic plots from plot(step): "Residuals vs Fitted", "Normal Q-Q", "Scale-Location" and "Residuals vs Leverage" (with Cook's distance contours). Labelled points include Luxembourg, Zurich, Stockholm and Oslo.]

(d) Use PCA and display the results in a biplot. Interpret the results.
Solution:

pca <- prcomp(ec.noNA[, -4], scale = TRUE)


round(pca$sdev, 2)

## [1] 1.47 0.80 0.44
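
The standard deviations above translate into the share of variance explained by each principal component. A minimal sketch using only the rounded values printed above (so the shares are approximate):

```r
# Rounded standard deviations of the three principal components
sdev <- c(1.47, 0.80, 0.44)

# Each component's variance as a share of the total variance
var_explained <- sdev^2 / sum(sdev^2)

print(round(var_explained, 2))
print(round(cumsum(var_explained), 2))
```

The first component carries roughly 72% of the variance and the first two together about 94%, so the two-dimensional biplot loses little information.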

biplot(pca)

[Figure: biplot of the first two principal components (x-axis: PC1, y-axis: PC2) showing the cities and the loading arrows for Salary, Price and Work.]

(e) Use linear and quadratic discriminant analysis to discriminate the three salary levels of SalaryCat based on
the variables Work and Price. Display the results graphically. Compare the results of the linear and the
quadratic approach.
Solution:

require(MASS)
lda1 <- lda(SalaryCat ~ Work + Price, data = ec.noNA)
Work <- with(ec.noNA, seq(min(Work), max(Work), length = 100))
Price <- with(ec.noNA, seq(min(Price), max(Price), length = 100))
grid <- expand.grid(Work = Work, Price = Price)
pred.lda1 <- predict(lda1, grid)$class
qda1 <- qda(SalaryCat ~ Work + Price, data = ec.noNA)
pred.qda1 <- predict(qda1, grid)$class
par(mfrow = c(1, 2))
cols <- c("white", rgb(0.9, 0.9, 0.8), rgb(0.8, 0.9, 0.9))
image(Work, Price, array(as.numeric(pred.lda1), c(100, 100)), col = cols)
with(ec, points(Work, Price, col = SalaryCat))
image(Work, Price, array(as.numeric(pred.qda1), c(100, 100)), col = cols)
with(ec, points(Work, Price, col = SalaryCat))

[Figure: two panels with the classification regions of LDA (left) and QDA (right); x-axis: Work, y-axis: Price; the observed cities are overlaid, colored by SalaryCat.]
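
Besides the graphical comparison, the two classifiers can be compared numerically through their confusion tables and training error rates. The sketch below shows the idea on the built-in iris data with two predictors, used here as a stand-in (an assumption, since ec.txt is not bundled with R); for the exercise one would replace the formula by SalaryCat ~ Work + Price and the data by ec.noNA.

```r
library(MASS)

# Stand-in problem: three classes, two predictors
fit_lda <- lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
fit_qda <- qda(Species ~ Sepal.Length + Sepal.Width, data = iris)

# Predicted classes on the training data
pred_lda <- predict(fit_lda, iris)$class
pred_qda <- predict(fit_qda, iris)$class

# Confusion tables: rows are observed classes, columns predicted classes
conf_lda <- table(observed = iris$Species, predicted = pred_lda)
conf_qda <- table(observed = iris$Species, predicted = pred_qda)
print(conf_lda)
print(conf_qda)

# Training misclassification rates (optimistic, since the same data are reused)
err_lda <- mean(pred_lda != iris$Species)
err_qda <- mean(pred_qda != iris$Species)
print(c(lda = err_lda, qda = err_qda))
```

QDA estimates a separate covariance matrix per class and is therefore more flexible, but with few observations per class (only four cities in the highest salary category here) these estimates are unstable, which speaks for the linear approach or for a cross-validated comparison.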

(f) Compare the findings from the different approaches above. Make five distinct statements and refer to all
points (a) to (e) at least once.
Solution:
Keywords: PCA and clustering: no assumption on the data’s distribution, unsupervised algorithm, no
prediction, just description. PCA gives the ’why’ to the clustering’s result. Linear regression, discriminant
analysis: predictive methods, Gaussian assumption.

Turn page.
Problem 20 (Do NOT use R for the following tasks)

(a) Consider the simple linear regression $\tilde{Y}_i = \beta_0 + \beta_1 \tilde{X}_i + U_i$ with $\tilde{X}_i \sim N(\mu, \sigma_{\tilde{x}}^2)$ and $U_i \sim N(0, \sigma_u^2)$. $\tilde{X}_i$ and
$U_i$ are independent and $i = 1, \dots, n$.

(i) What is the distribution of $\tilde{Y}_i$?

(ii) Suppose that we can only observe $\tilde{Y}_i$ subject to measurement errors, i.e., we observe $Y_i = \tilde{Y}_i + V_i$ with
$V_i \sim N(0, \sigma_v^2)$, where $V_i$ is independent of $\tilde{X}_i$ and $U_i$. Then we are trying to fit the regression line
using the equation $Y_i = \beta_0 + \beta_1 \tilde{X}_i + \varepsilon_i$. What is the distribution of $\varepsilon_i$?

(iii) Suppose that we can only observe $\tilde{X}_i$ subject to measurement errors, i.e., we observe $X_i = \tilde{X}_i + W_i$
with $W_i \sim N(0, \sigma_w^2)$, where $W_i$ is independent of $\tilde{X}_i$ and $U_i$. Then we are trying to fit the regression
line using the equation $\tilde{Y}_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. What is the distribution of $\varepsilon_i$?

(b) Describe in a few lines:

• What is an estimator? What is an estimate?


• How can resampling techniques be used to derive confidence intervals?
• What is leave-one-out cross-validation?

Solution:

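(a)

(i)
$\tilde{Y}_i \sim N(\beta_0 + \beta_1 \mu, \; \sigma_u^2 + \beta_1^2 \sigma_{\tilde{x}}^2)$

(ii)
$\varepsilon_i \sim N(0, \; \sigma_u^2 + \sigma_v^2)$

(iii)
$\varepsilon_i \sim N(0, \; \sigma_u^2 + \beta_1^2 \sigma_w^2)$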
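
Although this problem is to be solved without R, the three results can be checked afterwards by simulation. In (ii) the error is $U_i + V_i$, and in (iii) substituting $\tilde{X}_i = X_i - W_i$ gives an error of $U_i - \beta_1 W_i$. The sketch below draws a large sample and compares empirical with theoretical variances; all parameter values are arbitrary choices made for illustration.

```r
set.seed(121)
n <- 1e5

# Arbitrary illustrative parameter values
beta0 <- 1; beta1 <- 2; mu <- 3
sigma_x <- 1.5; sigma_u <- 1; sigma_v <- 0.8; sigma_w <- 0.5

# (i) the error-free model
x_true <- rnorm(n, mean = mu, sd = sigma_x)
u      <- rnorm(n, mean = 0,  sd = sigma_u)
y_true <- beta0 + beta1 * x_true + u

# (ii) measurement error in the response: error term is U + V
y_obs  <- y_true + rnorm(n, mean = 0, sd = sigma_v)
eps_ii <- y_obs - beta0 - beta1 * x_true

# (iii) measurement error in the predictor: error term is U - beta1 * W
x_obs   <- x_true + rnorm(n, mean = 0, sd = sigma_w)
eps_iii <- y_true - beta0 - beta1 * x_obs

# Empirical versus theoretical values
check <- rbind(
  mean_y   = c(empirical = mean(y_true),  theory = beta0 + beta1 * mu),
  var_y    = c(empirical = var(y_true),   theory = sigma_u^2 + beta1^2 * sigma_x^2),
  var_eps2 = c(empirical = var(eps_ii),   theory = sigma_u^2 + sigma_v^2),
  var_eps3 = c(empirical = var(eps_iii),  theory = sigma_u^2 + beta1^2 * sigma_w^2)
)
print(round(check, 3))
```

Note that in (iii) the error term is correlated with the observed predictor $X_i$, because both contain $W_i$; this is the source of the well-known attenuation of the least squares slope.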

(b) • Consider a real-valued statistic $T_n = h(X_{1:n})$, based on a random sample $X_{1:n}$ from a distribution with
probability mass or density function $f(x; \theta)$, where $\theta$ is an unknown scalar parameter. If the random
variable $T_n$ is computed to make inference about $\theta$, then it is called an estimator. We may simply
write $T$ rather than $T_n$ if the sample size $n$ is not important. The particular value $t = h(x_{1:n})$ that an
estimator takes for a realization $x_{1:n}$ of the random sample $X_{1:n}$ is called an estimate.

• Keywords: bootstrapping, draw B samples of the original size from the same data with replacement (with B large),
derive B estimates, construct the confidence interval based on those B estimates (e.g., from their empirical quantiles).

• Keywords: estimation of the true MSE, leave one observation $y_i$ out, predict $\hat{y}_i$ from the remaining data,
repeat over all the data, calculate $\widehat{\mathrm{MSE}}$.
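
The two resampling ideas in these keywords can be sketched in a few lines of R. The example below uses the built-in cars data and a simple linear regression as a stand-in (an assumption made for illustration); it computes a percentile bootstrap confidence interval for the slope and a leave-one-out cross-validation estimate of the prediction MSE.

```r
set.seed(121)
n <- nrow(cars)

## Bootstrap: resample rows with replacement, refit, keep the slope
B <- 2000
boot_slopes <- replicate(B, {
  idx <- sample.int(n, size = n, replace = TRUE)
  coef(lm(dist ~ speed, data = cars[idx, ]))[["speed"]]
})

# Percentile interval: empirical 2.5% and 97.5% quantiles of the B estimates
ci_boot <- quantile(boot_slopes, probs = c(0.025, 0.975))
print(ci_boot)

## Leave-one-out cross-validation: predict each observation from all others
loo_errors <- sapply(seq_len(n), function(i) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])
  cars$dist[i] - predict(fit_i, newdata = cars[i, , drop = FALSE])
})

# Estimated prediction MSE
mse_loocv <- mean(loo_errors^2)
print(mse_loocv)
```

For linear models the explicit loop is not strictly needed: the leave-one-out errors equal the ordinary residuals divided by one minus the leverage, so mean((residuals(fit) / (1 - hatvalues(fit)))^2) gives the same value from a single fit.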

http://www.math.uzh.ch/sta121 | 07exercise.pdf | 2014-10-13
