
UNIVERSITY OF CAPE TOWN

DEPARTMENT OF STATISTICAL SCIENCES


STA3022F
JUNE 2013 EXAMINATION

INTERNAL EXAMINERS: A/Prof S Lubbe, Dr Ş Er TOTAL MARKS: 100


INTERNAL ASSESSOR: A/Prof F Little
EXTERNAL EXAMINER: Dr T Berning TIME ALLOWED: 3 hours
PAGES: 14
INSTRUCTIONS: ANSWER EACH SECTION IN A SEPARATE BOOK.
ALL QUESTIONS MAY BE ATTEMPTED.
MARKS ARE ALLOCATED FOR INTERMEDIATE CALCULATIONS.

SECTION A: EXPLORATORY METHODS [Available marks: 51]

Question 1 [3 marks]
Consider a data set of the monthly inflation rate over several years for all African countries.
(a) Would you consider inflation rate to be a numerical, an ordinal categorical or a nominal variable? (1)
(b) If the inflation figure for Zimbabwe for June 2012 is missing, name one method of overcoming this
problem. (1)
(c) You want to perform an analysis that requires normally distributed data. You plotted histograms of the
data, some of which are shown below. Since the data are clearly skewed, what can be done before embarking
on the analysis? (1)
[Figure: six histograms of the inflation-rate data (y-axis: Frequency; x-axis: x, ranging roughly from 0 to 0.8), all showing strongly skewed distributions.]
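For part (c), a common remedy is to transform the skewed variable before analysing it. A minimal R sketch of one such transformation (the variable name x is an assumption, not from the paper):

# x holds the inflation rates plotted in the histograms above
x.t <- log(x + 0.01)     # log transform; the small offset guards against zero rates
hist(x.t, xlab = "transformed inflation rate", main = "After transformation")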

Question 2 [15 marks]
Consider the following contingency table and associated R output. In the table below, 300 people
were asked which bank they used most often and to identify the most important reason why they
chose that bank.

          Helpful  Enjoy        Branch close  Keeps me  Many  Competitive     TOTAL
          staff    advertising  to home       informed  ATMs  interest rates
FNB         14       14            2            23        7       12            72
ABSA        13        6           10            15        5       21            70
NEDBANK     11       16            8             4       10        8            57
STDBANK     12       14           30             8       28        9           101
TOTAL       50       50           50            50       50       50           300

> ca(bank.data)
Principal inertias (eigenvalues):
1 2 3
Value 0.174676 0.046073 0.021778
Percentage 72.02% 19% 8.98%

Rows:
FNB ABSA NEDBANK STDBANK
Mass 0.240000 0.233333 0.190000 0.336667
ChiDist 0.542201 0.468606 0.383164 0.525126
Inertia 0.070556 0.051238 0.027895 0.092838
Dim. 1 -1.172400 -0.794016 0.278962 1.228645
Dim. 2 -0.781987 1.457919 -1.315956 0.289685

Columns:
          Helpful    Enjoy        Branch close  Keeps me   Many       Competitive
          staff      advertising  to home       informed   ATMs       interest rates
Mass 0.166667 0.166667 0.166667 0.166667 0.166667 0.166667
ChiDist 0.205443 0.400249 0.618174 0.614089 0.516271 0.476418
Inertia 0.007034 0.026700 0.063690 0.062851 0.044423 0.037829
Dim. 1 -0.427021 0.023291 1.378472 -1.336570 1.197046 -0.835218
Dim. 2 -0.278986 -1.788994 1.041542 0.087283 -0.301211 1.240367

(a) In order to test the hypothesis 𝐻0 : no significant association between bank and reason for using that
bank, give the test statistic and its associated distribution. Also define each of the symbols you use in the
definition of the test statistic and indicate how to calculate or obtain them. (4)

(b) Give an expression for the Pearson residuals calculated for a contingency table. (1)

(c) Once the matrix of Pearson residuals is obtained, each entry is divided by the grand total from the
contingency table. Name and give a mathematical expression for the method of obtaining the CA map
coordinates from this matrix. (2)

(d) Construct a CA map for the contingency table above. (5)

(e) What proportion of variance in the data are you displaying in your map in (d)? (1)

(f) Which bank provides the most competitive interest rates? (1)

(g) Why do customers tend to choose Nedbank? (1)
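A minimal R sketch of how the Pearson residuals, their singular value decomposition and the CA map in the output above could be reproduced (the ca package matches the call in the output; the construction of bank.data below is an assumption):

# Enter the contingency table (rows: banks, columns: reasons)
bank.data <- matrix(c(14, 14,  2, 23,  7, 12,
                      13,  6, 10, 15,  5, 21,
                      11, 16,  8,  4, 10,  8,
                      12, 14, 30,  8, 28,  9),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("FNB", "ABSA", "NEDBANK", "STDBANK"),
                                    c("HelpfulStaff", "EnjoyAdv", "BranchClose",
                                      "KeepsInformed", "ManyATMs", "CompInterest")))

n <- sum(bank.data)                     # grand total (300)
P <- bank.data / n                      # correspondence matrix
r.mass <- rowSums(P); c.mass <- colSums(P)
E <- outer(r.mass, c.mass)              # expected proportions under independence
S <- (P - E) / sqrt(E)                  # standardised (Pearson-type) residual matrix
sv <- svd(S)                            # sv$d^2 are the principal inertias above

library(ca)
fit <- ca(bank.data)
plot(fit)                               # symmetric CA map, as asked for in (d)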

Question 3 [18 marks]

In an analysis of Olympic decathlon scores, 160 complete starts were made by 139 athletes. The
scores for each of the 10 decathlon events were standardized and the signs of the timed events
changed so that large scores are good for all events. The factor analysis output is shown below.

(a) Give the general model for factor analysis for factors 𝐹1 , … , 𝐹𝑞 based on manifest variables
𝑋1 , … , 𝑋𝑝 . (2)

(b) Give the general principal component analysis model for principal components 𝑌1 , … , 𝑌𝑚 based
on manifest variables 𝑋1 , … , 𝑋𝑝 . (2)

(c) Name two methods for estimating the factor loadings in the factor analysis model. (2)

(d) Why did the researcher decide to use four factors? (1)

(e) What is the advantage of the varimax rotation? (1)

(f) Interpret and name the four factors. (8)

(g) Give an expression for the communality of the 𝑖-th manifest variable and explain what its
meaning is. (2)
[Figure: scree plot of the 10 eigenvalues listed below (y-axis: eigenvalue; x-axis: Index, 1 to 10).]

> eigenvalues
[1] 4.21290 2.88958 2.24310 1.05940 0.91770 0.66980 0.57570 0.42110 0.32190 0.25250

> loadings
[,1] [,2] [,3] [,4]
[100m run] -0.6961444 0.02209774 -0.468329773 0.41636799
[long jump] -0.7925464 0.07517162 -0.254696029 0.11462221
[shot put] -0.7710515 -0.43442586 0.197341218 0.11216995
[high jump] -0.7107713 0.18069329 0.004571862 -0.36745024
[400m run] -0.6048010 0.54866998 -0.045077267 0.39698116
[110m hurdles] -0.5126311 -0.08271661 -0.371744685 -0.56108748
[discus] -0.6897880 -0.45643672 0.288597881 0.07771845
[pole vault] -0.7609729 0.16239082 0.018258009 -0.30422480
[javelin] -0.5184056 -0.25162810 0.518908343 0.07356806
[1500m run] -0.2197133 0.74576659 0.493070762 -0.08518221

> varimax(loadings)
[100m run] -0.885 -0.139 0.182 -0.205
[long jump] -0.664 0.201 -0.693
[shot put] -0.023 0.819 -0.152
[high jump] -0.121 0.293 0.237 -0.683
[400m run] -0.746 0.750
[110m hurdles] -0.108 -0.161 -0.826
[discus] -0.185 0.832 -0.204
[pole vault] -0.207 0.193 0.124 -0.656
[javelin] 0.188 0.754
[1500m run] 0.921

                 [,1]  [,2]  [,3]  [,4]
SS loadings     2.045 1.376 2.234 1.924
Proportion Var  0.205 0.138 0.223 0.192
Cumulative Var  0.205 0.342 0.566 0.758
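A hedged R sketch of one way output of this form could be produced (principal-component extraction of four factors followed by varimax rotation; the data object decathlon and the extraction method are assumptions, not stated in the paper):

# decathlon: 160 x 10 matrix of standardised scores, timed events sign-reversed
R <- cor(decathlon)
e <- eigen(R)
eigenvalues <- e$values                                     # scree plot / Kaiser criterion
loadings <- e$vectors[, 1:4] %*% diag(sqrt(e$values[1:4]))  # unrotated loadings
rownames(loadings) <- colnames(decathlon)

rot <- varimax(loadings)                                    # orthogonal rotation
print(rot$loadings, cutoff = 0.10)                          # small loadings suppressed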

Question 4 [5 marks]

In order to evaluate the quality of health care at a local clinic, researchers designed a questionnaire
consisting of several statements. The statements are rated on a five point scale and each statement is
designed such that a response of “Totally disagree”, scored 1, indicates poor quality and “Totally
agree”, scored 5, indicates excellent quality. Several aspects of health care quality need to be
evaluated. Questions 2, 5, 8, 15 and 23 are designed to deal with waiting time at the clinic. In order to
assess whether the questions were adequately designed, the researchers conducted a pilot study, asking 20
randomly selected clinic visitors to complete the questionnaire. The data below were captured from the pilot
study.

(a) Use Cronbach’s alpha to assess the internal consistency of Questions 2, 5, 8, 15, 23. (3)

(b) Based on your calculation in (a), can you confirm that these questions were well designed to
assess waiting time at the clinic? Motivate your answer. (2)

4
𝑄2 𝑄5 𝑄8 𝑄15 𝑄23 𝑄2 + 𝑄5 + 𝑄8 + 𝑄15 + 𝑄23
1 1 1 1 1 5
5 5 5 1 5 21
1 4 1 3 1 10
4 5 2 3 2 16
2 3 1 1 2 9
5 5 5 3 5 23
3 4 1 1 1 10
2 3 2 3 2 12
5 5 5 3 5 23
1 1 1 1 1 5
5 5 5 3 5 23
2 3 2 2 2 11
5 5 5 5 5 25
5 5 3 5 3 21
1 1 1 1 1 5
5 5 3 3 3 19
5 5 5 5 5 25
2 3 1 3 1 10
2 3 1 1 1 8
1 1 1 1 2 6
Variance 3.042 2.463 3.103 2.050 2.871 54.661
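For part (a), Cronbach's alpha can be computed directly from the item variances and the variance of the total score in the table, using alpha = k/(k - 1) * (1 - sum of item variances / variance of total). A minimal R sketch (object names are assumptions):

item.var  <- c(3.042, 2.463, 3.103, 2.050, 2.871)   # variances of Q2, Q5, Q8, Q15, Q23
total.var <- 54.661                                  # variance of Q2+Q5+Q8+Q15+Q23
k <- length(item.var)
alpha <- (k / (k - 1)) * (1 - sum(item.var) / total.var)
alpha                                                # roughly 0.94 with these figures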

Question 5 [8 marks]

The marketing department of a margarine company would like to segment all eight products in the
market, both its own and its competitors’, in order to find segments with ‘gaps’ where a new product can be
launched. They asked a panel of 15 evaluators to assess the pairwise differences between the
products. Based on all 15 evaluations, the following dissimilarity matrix was computed.

         P1   P2   P3   P4   P5   P6   P7   P8
    P1   0    2.5  1.0  2.8  2.9  3.5  2.7  0.8
    P2   2.5  0    2.0  1.9  1.9  2.4  0.6  3.7
    P3   1.0  2.0  0    1.8  2.1  2.8  3.2  1.5
D = P4   2.8  1.9  1.8  0    0.3  1.2  2.1  2.3
    P5   2.9  1.9  2.1  0.3  0    0.4  2.0  2.5
    P6   3.5  2.4  2.8  1.2  0.4  0    3.8  4.4
    P7   2.7  0.6  3.2  2.1  2.0  3.8  0    4.2
    P8   0.8  3.7  1.5  2.3  2.5  4.4  4.2  0

A hierarchical cluster analysis was performed on the data to segment the products. In the process
the following three clusters were identified. In the next step, two of the three clusters need to be
merged.
Cluster A : Products 𝑃1, 𝑃3, 𝑃8
Cluster B : Products 𝑃2, 𝑃7
Cluster C : Products 𝑃4, 𝑃5, 𝑃6
(a) Determine the distance between the clusters and suggest which two clusters need to be merged
next based on the single linkage method. (4)

(b) Determine the distance between the clusters and suggest which two clusters need to be merged
next based on the complete linkage method. (4)

SECTION B: PREDICTIVE METHODS [Available marks: 50]

ANSWER EACH SECTION IN A SEPARATE BOOK

Question 6 [15 marks]

The following data was randomly collected from an estate agency website on 98 houses that are for
sale. The data includes the level of the house price (priceLevel), the view that the house has (sea
view, mountain view or no view), the number of rooms in total (totalroom) and the size of the house
(size). The researcher wants to know which of the house attributes can be used to discriminate
between houses according to their price level. The first few rows of the data and the
observed price levels are given in Tables 1 and 2 below. Use the attached output to answer the
questions below.

Price levels and size, total room number and view attributes of the first 20 houses in the data
set.
> ExamQ1All[1:20,18:22]
priceLevel size totalroom viewSea viewMount noView
1 med 81 2 1 0 0
2 med 86 3 1 0 0
3 med 70 2 0 0 1
4 med 62 2 0 0 1
5 low 53 2 0 0 1
6 med 79 2 0 0 1
7 med 81 2 0 0 1
8 med 86 2 1 0 0
9 low 61 2 0 1 0
10 med 52 2 0 1 0
11 med 74 2 0 1 0
12 med 92 2 1 0 0
13 med 84 2 0 1 0
14 med 70 2 0 1 0
15 med 61 2 0 1 0
16 med 79 2 0 0 1
17 high 144 2 0 0 1
18 high 144 2 0 0 1
19 med 95 2 0 0 1
20 med 95 2 0 0 1

(a) Evaluate the hit-rate. (3)

(b) Indicate to which price level group the discriminant model has assigned house 6 and house 9.
Were these correct assignments? (2)

(c) Write down the discriminant functions. (2)

(d) How useful is the second discriminant function as a predictor of group membership in this
problem situation? Explain. (2)

(e) Calculate the group centroids for the medium priced houses. (2)

(f) Is the discriminant model able to statistically discriminate between houses belonging to each of
the three price levels? State appropriate null and alternative hypotheses and justify any
conclusions with supporting statistical evidence. Between which of the groups can the model
discriminate significantly? (4)

Discriminant Analysis Results


> fit1 <- lda(priceLevel~size+totalroom+viewSea+viewMount,
+ data=houseprices, method="moment")

> fit1
Call:
lda(priceLevel ~ size + totalroom + viewSea + viewMount, data = houseprices, method =
"moment")

Prior probabilities of groups:


high low med
0.1428571 0.2857143 0.5714286

Group means:
size totalroom viewSea viewMount
high 138.85714 4.500000 0.4285714 0.2142857
low 60.82143 2.803571 0.0000000 0.2857143
med 78.07143 2.642857 0.2321429 0.3392857

Coefficients of linear discriminants:


LD1 LD2
Constant -4.10345000 -0.82147200
size 0.04945093 -0.01282709
totalroom -0.02052425 0.86211733
viewSea 0.90284141 -1.64665812
viewMount -0.18736819 -1.16439112

Proportion of trace:
LD1 LD2
0.9249 0.0751

Sample Sizes of Each Category


> table(priceLevel)
priceLevel
high low med
14 28 56

Classification Table

> classificationTable=table(ExamQ1All$priceLevel,fitPredict$class)
> classificationTable

       high low med
high     12   0   2
low       1  17  10
med       1   6  49

Mahalanobis Distances Between the Groups


> d2
Low and Med Low and High Med and High
[1,] 1.762034 17.87178 10.45382

Observed price levels for all of the 98 houses in the data set.

> priceLevel
[1] med med med med low med med med low med med med med med med
[16] med high high med med med med med low med med med med med med
[31] med med med med med med med med med low low med med low med
[46] med med med low high med med med med med med med med med med
[61] high high med high low high high med high high med high high high high
[76] med med med low low low low low low low low low low low low
[91] low low low low low low low low
Levels: high low med

Predicted price levels for all of the 98 houses in the data set.

> fitPredict$class
[1] med med med med low med med med med med med med med med med
[16] med high high med med med med med med med med med med med med
[31] low low med med med med med med med low low med low low med
[46] low med med low high med med high med low med med med med med
[61] high high med high low high high med high high med high high med med
[76] low med med med med low med low low med low low low med low
[91] low low high low low med med med
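A minimal, hedged R sketch of how the fit, the predicted classes, the classification table and the hit-rate in (a) could be obtained (MASS::lda matches the call in the output; using the single data-frame name houseprices throughout is an assumption, since the output mixes houseprices and ExamQ1All):

library(MASS)                                        # lda() lives in MASS

fit1 <- lda(priceLevel ~ size + totalroom + viewSea + viewMount,
            data = houseprices, method = "moment")
fitPredict <- predict(fit1)                          # class, posterior, LD scores

classificationTable <- table(houseprices$priceLevel, fitPredict$class)
classificationTable

# hit-rate = proportion of houses classified into their observed group;
# with the table above this is (12 + 17 + 49) / 98
sum(diag(classificationTable)) / sum(classificationTable)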

Question 7 [12 marks]

A financial investor wants to propose a model to support investment decisions on stock returns,
based on an assessment of the following attributes:

Criteria               Variable code   Levels
Sales profitability    profit          0 = negative profitability ratio, 1 = positive profitability ratio
Market-to-Book value   mbv             0 = below average, 1 = above average
Beta as risk           beta            0 = less risk, 1 = high risk
Profit per share       ppershare       0 = negative profit per share, 1 = positive profit per share
Debt ratio             debt            0 = low debt ratio, 1 = high debt ratio
National index         Indice          1 = share is in first 30, 2 = share is in first 50,
                                       3 = share is in first 100, 4 = share is not classified in the first 100

An analyst used Classification Trees to identify an appropriate decision rule to classify future stock
returns as positive or negative return.

In the data set, 310 firms were observed.

Relevant results from the classification tree module of R, for the input data with the 6 attributes
considered appropriate, are given below.

> rpartfit <- rpart(formula, data=datatowork)


> rpartfit
n =310

node), split, n, loss, yval, (yprob)


* denotes terminal node

1) root 310 94 positive return (0.3032258 0.6967742)
2) ppershare=negative ppershare 75 35 positive return (0.4666667 0.5333333)
4) mbv=>2 30 13 negative return (0.5666667 0.4333333)
8) debt=high 12 3 negative return (0.7500000 0.2500000) *
9) debt=low 18 8 positive return (0.4444444 0.5555556) *
5) mbv=<=2 45 18 positive return (0.4000000 0.6000000) *
3) ppershare=positive ppershare 235 59 positive return (0.2510638 0.7489362) *

(a) (7)

(b) Interpret the Classification Tree and define an appropriate decision rule for selecting a positive
return. (2)

(c) What percentage of firms is correctly identified as having a positive return by the chosen
criteria? Justify. Use this finding to comment on the reliability of the derived decision rule. (3)
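A hedged R sketch of how such a tree could be fitted and inspected (the data-frame name follows the output above; the response name stockreturn and the model formula are assumptions, since the formula is not spelled out in the paper):

library(rpart)

# stockreturn: positive/negative return factor; predictors are the six coded
# attributes described in the table above (names assumed)
rpartfit <- rpart(stockreturn ~ profit + mbv + beta + ppershare + debt + Indice,
                  data = datatowork, method = "class")
print(rpartfit)                    # the node listing shown above
plot(rpartfit); text(rpartfit)     # quick drawing of the tree

# For (c): in each terminal node the printed (yprob) pair gives the proportions
# of negative and positive returns, so the share of firms correctly labelled
# "positive return" can be read off the terminal nodes predicting that class.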

Question 8 [9 marks]

The model below was fitted to data on 32 insurance companies, where


Y  = PE ratio (price-earnings ratio)
X1 = size of the insurance company, in billions of Rands
X2 = dummy variable taking the value 1 for regional companies and 0 for national companies

(a) . (1)

(b) Assess the overall quality of the model at 5% significance level. (2)

(c) Which of the independent variables are significant? (2)

(d) Which type of analysis of variance method would be appropriate if you wish to include both of
the independent variables? What would be the difference between this method and regression
analysis? (1)

(e) State the coefficient of determination and correlation coefficient and explain the difference
between them. (2)

(f) Estimate the price-earnings ratio of a national insurance company with a size of 3 billion
Rands. (1)
> summary(lm(pe_ratio~size+regional,data=QuestionR))
lm(formula = pe_ratio ~ size + regional, data = QuestionR)
Coefficients:
Estimate Std.Error t value Pr(>|t|)
Intercept 7.62
size -0.16 0.008
regional 1.23 0.496
---

Residual standard error: 1.303 on 10 degrees of freedom


Multiple R-squared: 0.9157, Adjusted R-squared: 0.8905
F-statistic: 36.22 on 3 and 10 DF, p-value: 1.109e-05

t(29, 0.05/2) = 2.045
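A hedged sketch of the fitted equation and the prediction asked for in (f), using the coefficients reported above (the vector b is an illustrative construction, not part of the original output):

# Fitted model: pe_ratio = 7.62 - 0.16 * size + 1.23 * regional
b <- c(intercept = 7.62, size = -0.16, regional = 1.23)

# National company (regional = 0) with size = 3 billion Rands
pe.hat <- b["intercept"] + b["size"] * 3 + b["regional"] * 0
pe.hat                                   # 7.62 - 0.48 = 7.14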

Question 9 [14 marks]

This question refers to a conceptual model that predicts reading (READ) and mathematics (MATH)
ability from observed scores of two intelligence scales, verbal comprehension (VC) and perceptual
organization (PO). The READ variable is indicated by basic word reading (BW) and reading
comprehension (RC) scores. The MATH variable is indicated by calculation (CL) and reasoning
(RE) scores. It is known that READ has an impact on the MATH variable. The following R output
gives the unstandardized estimates of the model.

(a) . (4)

(b) (2)

(c) Write down the set of …. equations for the model. Also indicate which of the equations are
from the measurement and structural part of the model and which part of the model is
significant at a 5% significance level. (5)

(d) Give a description of the measurement model part of the full SEM. That is, how are the latent
constructs being measured? (1)

(e) …. (2)

> summary(fit, fit.measures=TRUE)

Number of observations 200

χ2 = 8.63, p = 0.12, RMSEA = 0.057, SRMR = 0.017

Parameter estimates:
Estimate Std.err Z-value P(>|z|)
Latent variables:
READ =~
BW 1.000
RC 1.350 0.100
MATH =~
CL 1.000
RE 1.050 0.070
Regressions:
READ ~
VC 0.480 0.070
PO 0.040 0.050
MATH ~
VC 0.550 0.060
PO 0.160 0.050
READ 0.786 0.021

Variances:
BW 79.550 10.300

RC 5.280 11.960
CL 64.220 8.980
RE 36.770 7.880
READ 69.100 10.440
MATH 56.980 9.210
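A minimal, hedged lavaan sketch of one way output of this form could be produced (the package choice and the data-frame name scores are assumptions; the model structure follows the description of the conceptual model above):

library(lavaan)

model <- '
  # measurement part: latent constructs and their indicators
  READ =~ BW + RC
  MATH =~ CL + RE

  # structural part: regressions on the intelligence scales, with READ affecting MATH
  READ ~ VC + PO
  MATH ~ VC + PO + READ
'

fit <- sem(model, data = scores)
summary(fit, fit.measures = TRUE)      # gives output of the form shown above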
