STA3022F Exam June 2013
Question 1 [3 marks]
Consider a data set of the monthly inflation rate over several years for all African countries.
(a) Would you consider inflation rate to be a numerical, ordinal categorical or nominal variable? (1)
(b) If the inflation figure for Zimbabwe for June 2012 is missing, name one method of overcoming this
problem. (1)
(c) You want to perform an analysis that requires normally distributed data. You plotted histograms of the
data, some of which are shown below. Since the data are strongly skewed, what can be done before
embarking on the analysis? (1)
[Figure: six histograms of the data, in two rows of three panels; horizontal axis x, vertical axis Frequency.]
Question 2 [15 marks]
Consider the following contingency table and associated R output. In the table below, 300 people
were asked what bank they used most often and to identify the most important reason why they
chose that bank.
> ca(bank.data)
Principal inertias (eigenvalues):
1 2 3
Value 0.174676 0.046073 0.021778
Percentage 72.02% 19% 8.98%
Rows:
FNB ABSA NEDBANK STDBANK
Mass 0.240000 0.233333 0.190000 0.336667
ChiDist 0.542201 0.468606 0.383164 0.525126
Inertia 0.070556 0.051238 0.027895 0.092838
Dim. 1 -1.172400 -0.794016 0.278962 1.228645
Dim. 2 -0.781987 1.457919 -1.315956 0.289685
Columns:
          Helpful    Enjoy        Branch close  Keeps me   Many       Competitive
          staff      advertising  to home       informed   ATMs       interest rates
Mass      0.166667   0.166667     0.166667      0.166667   0.166667   0.166667
ChiDist   0.205443   0.400249     0.618174      0.614089   0.516271   0.476418
Inertia   0.007034   0.026700     0.063690      0.062851   0.044423   0.037829
Dim. 1   -0.427021   0.023291     1.378472     -1.336570   1.197046  -0.835218
Dim. 2   -0.278986  -1.788994     1.041542      0.087283  -0.301211   1.240367
(a) In order to test the hypothesis 𝐻0 : no significant association between bank and reason for using that
bank, give the test statistic and its associated distribution. Also define each of the symbols you use in the
definition of the test statistic and indicate how to calculate or obtain them. (4)
(b) Give an expression for the Pearson residuals calculated for a contingency table. (1)
(c) Once the matrix of Pearson residuals is obtained, each entry is divided by the grand total from the
contingency table. Name and give a mathematical expression for the method of obtaining the CA map
coordinates from this matrix. (2)
(e) What proportion of variance in the data are you displaying in your map in (d)? (1)
(f) Which bank provides the most competitive interest rates? (1)
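As an illustration of how the printed output relates to the test in (a), the sketch below relies on the standard identity that the total inertia of a correspondence analysis equals the Pearson chi-square statistic divided by the grand total. The variable names are hypothetical, and the 4 x 6 table dimensions are read off the Rows and Columns sections of the output. The two-dimensional share is computed on the assumption that the map in (e) uses the first two dimensions.

```r
# Principal inertias (eigenvalues) copied from the ca() output above
inertias <- c(0.174676, 0.046073, 0.021778)
n      <- 300   # respondents in the contingency table
n_rows <- 4     # banks
n_cols <- 6     # reasons

total_inertia <- sum(inertias)
chi_sq  <- n * total_inertia                 # total inertia = chi-square / n
df      <- (n_rows - 1) * (n_cols - 1)       # degrees of freedom of the test
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Share of total inertia displayed by a two-dimensional map (assumed map size)
prop_2d <- sum(inertias[1:2]) / total_inertia

cat(sprintf("chi-square = %.2f on %d df, p = %.3g\n", chi_sq, df, p_value))
cat(sprintf("inertia displayed in 2 dimensions = %.1f%%\n", 100 * prop_2d))
```

The same total can be cross-checked by summing the Inertia row of either the Rows or the Columns section.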
Question 3 [18 marks]
In an analysis of Olympic decathlon scores, 160 complete starts were made by 139 athletes. The
scores for each of the 10 decathlon events were standardized and the signs of the timed events
changed so that large scores are good for all events. The factor analysis output is shown below.
(a) Give the general model for factor analysis for factors 𝐹1 , … , 𝐹𝑞 based on manifest variables
𝑋1 , … , 𝑋𝑝 . (2)
(b) Give the general principal component analysis model for principal components 𝑌1 , … , 𝑌𝑚 based
on manifest variables 𝑋1 , … , 𝑋𝑝 . (2)
(c) Name two methods for estimating the factor loadings in the Factor analysis model. (2)
(d) Why did the researcher decide to use four factors? (1)
(g) Give an expression for the communality of the 𝑖-th manifest variable and explain what its
meaning is. (2)
[Figure: scree plot of eigenvalue (vertical axis) against Index 1 to 10 (horizontal axis).]
> eigenvalues
[1] 4.21290 2.88958 2.24310 1.05940 0.91770 0.66980 0.57570 0.42110 0.32190 0.25250
> loadings
[,1] [,2] [,3] [,4]
[100m run] -0.6961444 0.02209774 -0.468329773 0.41636799
[long jump] -0.7925464 0.07517162 -0.254696029 0.11462221
[shot put] -0.7710515 -0.43442586 0.197341218 0.11216995
[high jump] -0.7107713 0.18069329 0.004571862 -0.36745024
[400m run] -0.6048010 0.54866998 -0.045077267 0.39698116
[110m hurdles] -0.5126311 -0.08271661 -0.371744685 -0.56108748
[discus] -0.6897880 -0.45643672 0.288597881 0.07771845
[pole vault] -0.7609729 0.16239082 0.018258009 -0.30422480
[javelin] -0.5184056 -0.25162810 0.518908343 0.07356806
[1500m run] -0.2197133 0.74576659 0.493070762 -0.08518221
> varimax(loadings)
[100m run] -0.885 -0.139 0.182 -0.205
[long jump] -0.664 0.201 -0.693
[shot put] -0.023 0.819 -0.152
[high jump] -0.121 0.293 0.237 -0.683
[400m run] -0.746 0.750
[110m hurdles] -0.108 -0.161 -0.826
[discus] -0.185 0.832 -0.204
[pole vault] -0.207 0.193 0.124 -0.656
[javelin] 0.188 0.754
[1500m run] 0.921
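The output above can be explored with a short sketch. It applies the eigenvalue-greater-than-one rule (Kaiser criterion), which is one common way to choose the number of factors and is only assumed here to be what the researcher used, and it computes the communality of one variable as the sum of its squared loadings on the retained factors. The variable names are hypothetical; the numbers are copied from the eigenvalues and the unrotated loadings.

```r
# Eigenvalues copied from the output above
eigenvalues <- c(4.21290, 2.88958, 2.24310, 1.05940, 0.91770,
                 0.66980, 0.57570, 0.42110, 0.32190, 0.25250)

# Kaiser criterion (assumed rule): retain factors with eigenvalue > 1
n_factors <- sum(eigenvalues > 1)

# Unrotated loadings of the 100m run on the four retained factors
l_100m <- c(-0.6961444, 0.02209774, -0.468329773, 0.41636799)

# Communality: share of the variable's variance explained by the common factors
communality_100m <- sum(l_100m^2)
# Specific variance, valid because the scores were standardized (variance 1)
uniqueness_100m <- 1 - communality_100m

cat(sprintf("factors retained: %d\n", n_factors))
cat(sprintf("communality of 100m run: %.3f\n", communality_100m))
```

An orthogonal rotation such as varimax leaves each communality unchanged, so the unrotated loadings suffice for this calculation.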
Question 4 [5 marks]
In order to evaluate the quality of health care at a local clinic, researchers designed a questionnaire
consisting of several statements. The statements are rated on a five point scale and each statement is
designed such that a response of “Totally disagree”, scored 1, indicates poor quality and “Totally
agree”, scored 5, indicates excellent quality. Several aspects of health care quality need to be
evaluated. Questions 2, 5, 8, 15 and 23 are designed to deal with waiting time at the clinic. In order to
assess whether the questions were adequately designed, the researchers did a pilot study asking 20
random clinic visitors to complete the questionnaire. The data below was captured from the pilot
study.
(a) Use Cronbach’s alpha to assess the internal consistency of Questions 2, 5, 8, 15, 23. (3)
(b) Based on your calculation in (a), can you confirm that these questions were well designed to
assess waiting time at the clinic? Motivate your answer. (2)
𝑄2 𝑄5 𝑄8 𝑄15 𝑄23 𝑄2 + 𝑄5 + 𝑄8 + 𝑄15 + 𝑄23
1 1 1 1 1 5
5 5 5 1 5 21
1 4 1 3 1 10
4 5 2 3 2 16
2 3 1 1 2 9
5 5 5 3 5 23
3 4 1 1 1 10
2 3 2 3 2 12
5 5 5 3 5 23
1 1 1 1 1 5
5 5 5 3 5 23
2 3 2 2 2 11
5 5 5 5 5 25
5 5 3 5 3 21
1 1 1 1 1 5
5 5 3 3 3 19
5 5 5 5 5 25
2 3 1 3 1 10
2 3 1 1 1 8
1 1 1 1 2 6
Variance 3.042 2.463 3.103 2.050 2.871 54.661
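A minimal sketch of the calculation in (a), using only the Variance row of the table: Cronbach's alpha is k/(k - 1) times one minus the ratio of the summed item variances to the variance of the total score. The object names are hypothetical.

```r
# Item variances and variance of the summed score, copied from the Variance row
item_var  <- c(Q2 = 3.042, Q5 = 2.463, Q8 = 3.103, Q15 = 2.050, Q23 = 2.871)
total_var <- 54.661

k <- length(item_var)   # number of items in the scale

# Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of total)
alpha <- (k / (k - 1)) * (1 - sum(item_var) / total_var)

cat(sprintf("Cronbach's alpha = %.3f\n", alpha))
```

Values of roughly 0.7 or higher are conventionally read as acceptable internal consistency, although the exact cut-off is a matter of convention.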
Question 5 [8 marks]
The marketing department of a margarine company would like to segment all eight products in the
market, own and competition, in order to find segments with ‘gaps’ where a new product can be
launched. They asked a panel of 15 evaluators to assess the pairwise differences between the
products. Based on all 15 evaluations, the following dissimilarity matrix was computed.
𝐷 =
      𝑃1   𝑃2   𝑃3   𝑃4   𝑃5   𝑃6   𝑃7   𝑃8
𝑃1    0    2.5  1.0  2.8  2.9  3.5  2.7  0.8
𝑃2    2.5  0    2.0  1.9  1.9  2.4  0.6  3.7
𝑃3    1.0  2.0  0    1.8  2.1  2.8  3.2  1.5
𝑃4    2.8  1.9  1.8  0    0.3  1.2  2.1  2.3
𝑃5    2.9  1.9  2.1  0.3  0    0.4  2.0  2.5
𝑃6    3.5  2.4  2.8  1.2  0.4  0    3.8  4.4
𝑃7    2.7  0.6  3.2  2.1  2.0  3.8  0    4.2
𝑃8    0.8  3.7  1.5  2.3  2.5  4.4  4.2  0
A hierarchical cluster analysis was performed on the data to segment the products. In the process
the following three clusters were identified. In the next step, two of the three clusters need to be
merged.
Cluster A : Products 𝑃1, 𝑃3, 𝑃8
Cluster B : Products 𝑃2, 𝑃7
Cluster C : Products 𝑃4, 𝑃5, 𝑃6
(a) Determine the distance between the clusters and suggest which two clusters need to be merged
next based on the single linkage method. (4)
(b) Determine the distance between the clusters and suggest which two clusters need to be merged
next based on the complete linkage method. (4)
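The two linkage rules can be sketched directly on the dissimilarity matrix: single linkage takes the smallest pairwise dissimilarity between two clusters, complete linkage the largest, and in both cases the pair of clusters with the smallest resulting distance is merged. The helper `linkage` and the other object names are hypothetical.

```r
# Dissimilarity matrix D copied from the question
D <- matrix(c(
  0.0, 2.5, 1.0, 2.8, 2.9, 3.5, 2.7, 0.8,
  2.5, 0.0, 2.0, 1.9, 1.9, 2.4, 0.6, 3.7,
  1.0, 2.0, 0.0, 1.8, 2.1, 2.8, 3.2, 1.5,
  2.8, 1.9, 1.8, 0.0, 0.3, 1.2, 2.1, 2.3,
  2.9, 1.9, 2.1, 0.3, 0.0, 0.4, 2.0, 2.5,
  3.5, 2.4, 2.8, 1.2, 0.4, 0.0, 3.8, 4.4,
  2.7, 0.6, 3.2, 2.1, 2.0, 3.8, 0.0, 4.2,
  0.8, 3.7, 1.5, 2.3, 2.5, 4.4, 4.2, 0.0
), nrow = 8, byrow = TRUE,
dimnames = list(paste0("P", 1:8), paste0("P", 1:8)))

clusters <- list(A = c("P1", "P3", "P8"),
                 B = c("P2", "P7"),
                 C = c("P4", "P5", "P6"))

# Distance between clusters a and b: fun (min or max) over all cross pairs
linkage <- function(a, b, fun) fun(D[clusters[[a]], clusters[[b]]])

cluster_pairs <- list(c("A", "B"), c("A", "C"), c("B", "C"))
pair_names    <- sapply(cluster_pairs, paste, collapse = "-")

single   <- sapply(cluster_pairs, function(p) linkage(p[1], p[2], min))
complete <- sapply(cluster_pairs, function(p) linkage(p[1], p[2], max))
names(single)   <- pair_names
names(complete) <- pair_names

print(rbind(single, complete))
cat("single linkage merges:  ", names(which.min(single)), "\n")
cat("complete linkage merges:", names(which.min(complete)), "\n")
```

Because the two rules summarise the same cross-cluster block of D differently, they need not select the same pair of clusters.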
SECTION B: PREDICTIVE METHODS [Available marks: 50]
The following data was randomly collected from an estate agency website on 98 houses that are for
sale. The data includes the level of the house price (priceLevel), the view that the house has (sea
view, mountain view or no view), the number of rooms in total (totalroom) and the size of the house
(size). The researcher wants to know which of the house attributes can be used in order to
discriminate a house according to the level of its price. The first few rows of the data and the
observed price levels are given in Tables 1 and 2 below. Use the attached output to answer the
questions below.
Price levels and size, total room number and view attributes of the first 20 houses in the data
set.
> ExamQ1All[1:20,18:22]
priceLevel size totalroom viewSea viewMount noView
1 med 81 2 1 0 0
2 med 86 3 1 0 0
3 med 70 2 0 0 1
4 med 62 2 0 0 1
5 low 53 2 0 0 1
6 med 79 2 0 0 1
7 med 81 2 0 0 1
8 med 86 2 1 0 0
9 low 61 2 0 1 0
10 med 52 2 0 1 0
11 med 74 2 0 1 0
12 med 92 2 1 0 0
13 med 84 2 0 1 0
14 med 70 2 0 1 0
15 med 61 2 0 1 0
16 med 79 2 0 0 1
17 high 144 2 0 0 1
18 high 144 2 0 0 1
19 med 95 2 0 0 1
20 med 95 2 0 0 1
(b) Indicate to which price level group the discriminant model has assigned house 6 and house 9.
Were these correct assignments? (2)
(d) How useful is the second discriminant function as a predictor of group membership in this
problem situation? Explain. (2)
(e) Calculate the group centroids for the medium priced houses. (2)
(f) Is the discriminant model able to statistically discriminate between houses belonging to each of
the three price levels? State appropriate null and alternative hypotheses and justify any
conclusions with supporting statistical evidence. Between which of the groups can the model
discriminate significantly? (4)
> fit1
Call:
lda(priceLevel ~ size + totalroom + viewSea + viewMount, data = houseprices, method =
"moment")
Group means:
size totalroom viewSea viewMount
high 138.85714 4.500000 0.4285714 0.2142857
low 60.82143 2.803571 0.0000000 0.2857143
med 78.07143 2.642857 0.2321429 0.3392857
Proportion of trace:
LD1 LD2
0.9249 0.0751
Classification Table
> classificationTable=table(ExamQ1All$priceLevel,fitPredict$class)
> classificationTable
Observed price levels for all of the 98 houses in the data set.
> priceLevel
[1] med med med med low med med med low med med med med med med
[16] med high high med med med med med low med med med med med med
[31] med med med med med med med med med low low med med low med
[46] med med med low high med med med med med med med med med med
[61] high high med high low high high med high high med high high high high
[76] med med med low low low low low low low low low low low low
[91] low low low low low low low low
Levels: high low med
Predicted price levels for all of the 98 houses in the data set.
> fitPredict$class
[1] med med med med low med med med med med med med med med med
[16] med high high med med med med med med med med med med med med
[31] low low med med med med med med med low low med low low med
[46] low med med low high med med high med low med med med med med
[61] high high med high low high high med high high med high high med med
[76] low med med med med low med low low med low low low med low
[91] low low high low low med med med
A financial investor wants to propose a model to support investment decisions on stock returns by
assessing the following attributes:
Criteria              Variable code  Levels
Sales profitability   profit         0=negative profitability ratio, 1=positive profitability ratio
Market-to-Book value  mbv            0=below average, 1=above average
Beta as risk          beta           0=less risk, 1=high risk
Profit per share      ppershare      0=negative profit per share, 1=positive profit per share
Debt ratio            debt           0=low debt ratio, 1=high debt ratio
National index        Indice         1=share is in first 30, 2=share is in first 50, 3=share is in first 100,
                                     4=share is not classified in the first 100
An analyst used Classification Trees to identify an appropriate decision rule to classify future stock
returns as positive or negative return.
Relevant results from the Classification Tree module of R for the input data of the 6 attributes
considered appropriate are given below.
1) root 310 94 positive return (0.3032258 0.6967742)
  2) ppershare=negative ppershare 75 35 positive return (0.4666667 0.5333333)
    4) mbv=>2 30 13 negative return (0.5666667 0.4333333)
      8) debt=high 12 3 negative return (0.7500000 0.2500000) *
      9) debt=low 18 8 positive return (0.4444444 0.5555556) *
    5) mbv=<=2 45 18 positive return (0.4000000 0.6000000) *
  3) ppershare=positive ppershare 235 59 positive return (0.2510638 0.7489362) *
(a) (7)
(b) Interpret the Classification Tree and define an appropriate decision rule for selecting a positive
return. (2)
(c) What percentage of firms is correctly identified as having a positive return by the chosen
criteria? Justify. Use this finding to comment on the reliability of the derived decision rule. (3)
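One way to read the tree output is to tabulate its terminal nodes (marked *), where each line gives the node size, the number of misclassified firms (loss) and the predicted class. The sketch below assumes this standard reading of the printed columns; the data frame and variable names are hypothetical, and "correctly identified" is interpreted in two possible ways because the question does not fix one.

```r
# Terminal nodes copied from the tree output: size, loss, predicted class
leaves <- data.frame(
  node      = c(8, 9, 5, 3),
  n         = c(12, 18, 45, 235),
  loss      = c(3, 8, 18, 59),
  predicted = c("negative", "positive", "positive", "positive")
)

correct  <- leaves$n - leaves$loss          # correctly classified per leaf
accuracy <- sum(correct) / sum(leaves$n)    # overall proportion correct

# Root node: 310 firms, 94 of which do not have a positive return
actual_pos <- 310 - 94

is_pos      <- leaves$predicted == "positive"
pos_correct <- sum(correct[is_pos])         # positives that land in a positive leaf

sensitivity <- pos_correct / actual_pos          # share of true positives found
precision   <- pos_correct / sum(leaves$n[is_pos])  # share of predicted positives that are right

cat(sprintf("overall accuracy = %.1f%%\n", 100 * accuracy))
cat(sprintf("positives recovered = %.1f%%\n", 100 * sensitivity))
cat(sprintf("predicted positives that are correct = %.1f%%\n", 100 * precision))
```

Comparing the overall accuracy with the share of positive returns at the root shows how much the tree improves on always predicting a positive return.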
Question 8 [9 marks]
(a) . (1)
(b) Assess the overall quality of the model at 5% significance level. (2)
(d) Which type of analysis of variance method would be appropriate if you wish to include both of
the independent variables? What would be the difference between this method and regression
analysis? (1)
(e) State the coefficient of determination and correlation coefficient and explain the difference
between them. (2)
(f) Estimate the price-earnings ratio of a national insurance company with a size of 3 billion
Rands. (1)
> summary(lm(pe_ratio~size+regional,data=QuestionR))
lm(formula = pe_ratio ~ size + regional, data = QuestionR)
Coefficients:
Estimate Std.Error t value Pr(>|t|)
Intercept 7.62
size -0.16 0.008
regional 1.23 0.496
---
$t_{29,\,0.05/2} = 2.045$
This question refers to a conceptual model that predicts reading (READ) and mathematics (MATH)
ability from observed scores of two intelligence scales, verbal comprehension (VC) and perceptual
organization (PO). The READ variable is indicated by basic word reading (BW) and reading
comprehension (RC) scores. The MATH variable is indicated by calculation (CL) and reasoning
(RE) scores. It is known that READ has an impact on the MATH variable. The following R output
gives the unstandardized estimates of the model.
(a) . (4)
(b) (2)
(c) Write down the set of …. equations for the model. Also indicate which of the equations are
from the measurement and structural part of the model and which part of the model is
significant at a 5% significance level. (5)
(d) Give a description of the measurement model part of the full SEM. That is, how are the latent
constructs being measured? (1)
(e) …. (2)
Parameter estimates:
Estimate Std.err Z-value P(>|z|)
Latent variables:
READ =~
BW 1.000
RC 1.350 0.100
MATH =~
CL 1.000
RE 1.050 0.070
Regressions:
READ ~
VC 0.480 0.070
PO 0.040 0.050
MATH ~
VC 0.550 0.060
PO 0.160 0.050
READ 0.786 0.021
Variances:
BW 79.550 10.300
RC 5.280 11.960
CL 64.220 8.980
RE 36.770 7.880
READ 69.100 10.440
MATH 56.980 9.210