Aspects of Multivariate Analysis
• The need to understand the relationships between many variables makes
multivariate analysis an inherently difficult subject.
The objectives of scientific investigations to which multivariate methods
most naturally lend themselves include the following:
• prediction
If the results disagree with informed opinion, do not admit a simple logical
interpretation, and do not show up clearly in a graphical presentation, they are
probably wrong. There is no magic about numerical methods, and many ways
in which they can break down. They are a valuable aid to the interpretation of
data, not sausage machines automatically transforming bodies of numbers into
packets of scientific fact.
1.2 Application of Multivariate Techniques
Data reduction or simplification
• Data related to responses to visual stimuli were used to develop a rule for
separating people suffering from a multiple-sclerosis-caused visual pathology
from those not suffering from the disease.
• The U.S. Internal Revenue Service uses data collected from tax returns to
sort taxpayers into two groups: those that will be audited and those that will not.
Investigation of the dependence among variables
• Data on several variables were used to identify factors that were responsible
for client success in hiring external consultants.
Prediction
• The associations between test scores, several high school performance
variables, and several college performance variables were used to develop
predictors of success in college.
Hypotheses testing
• Experimental data on several variables were used to see whether the nature
of the instruction makes any difference in perceived risk, as quantified by test
scores.
1.3 The Organization of Data
Array
Descriptive Statistics
A large data set is bulky, and its very mass poses a serious obstacle to
any attempt to visually extract pertinent information. Much of the information
contained in the data can be assessed by calculating certain summary numbers,
known as descriptive statistics.
• The average of the squares of the distances of all of the numbers from the mean
provides a measure of the spread, or variation, in the numbers.
• Sample mean
$$\bar{x}_1 = \frac{1}{n}\sum_{j=1}^{n} x_{j1} \quad\text{or, in general,}\quad \bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \qquad k = 1, 2, \ldots, p$$
• Sample variance
$$s_k^2 = s_{kk} = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p.$$
• Sample standard deviation: $\sqrt{s_{kk}}$.
• Sample covariance
$$s_{ik} = \frac{1}{n}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)$$
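• Sample correlation coefficient
$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\,\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,\sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}}$$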
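As a rough illustration of the descriptive statistics above, the following Python sketch computes the sample mean, variance, covariance, and correlation with the divisor n used in these slides. The function names are illustrative only, and the two lists are the first seven pairs of values listed under Graphical Techniques.

```python
import math

def sample_mean(x):
    """Sample mean: (1/n) * sum of x_j."""
    return sum(x) / len(x)

def sample_cov(x, y):
    """Sample covariance s_ik with divisor n (not n - 1), as in the slides."""
    xbar, ybar = sample_mean(x), sample_mean(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / len(x)

def sample_var(x):
    """Sample variance s_kk, i.e. the covariance of a variable with itself."""
    return sample_cov(x, x)

def sample_corr(x, y):
    """Sample correlation r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk))."""
    return sample_cov(x, y) / (math.sqrt(sample_var(x)) * math.sqrt(sample_var(y)))

# First seven pairs of values listed under "Graphical Techniques".
x1 = [3, 4, 2, 6, 8, 2, 5]
x2 = [5, 5.5, 4, 7, 10, 5, 7.5]

print("means:", sample_mean(x1), sample_mean(x2))
print("variances:", sample_var(x1), sample_var(x2))
print("covariance:", sample_cov(x1, x2))
print("correlation:", sample_corr(x1, x2))
```

Many software defaults use the divisor n − 1 instead of n. In Python's standard library, for instance, `statistics.pvariance` divides by n while `statistics.variance` divides by n − 1, so results can differ slightly from the formulas given here.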
1. The value of r must be between −1 and +1 inclusive.
3. The value of rik remains unchanged if the measurements of the ith variable are
changed to yji = axji + b, j = 1, 2, . . . , n, and the values of the kth variable
are changed to yjk = cxjk + d, j = 1, 2, . . . , n, provided that the constants a
and c have the same sign.
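Property 3 can be spot-checked numerically. The sketch below uses an illustrative helper `corr` and made-up constants a, b, c, d (none of them from the slides) to rescale both variables and compare the correlations. It is a check on one data set, not a proof.

```python
import math

def corr(x, y):
    """Sample correlation coefficient (the divisor n cancels out)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y))
    sxx = sum((u - xbar) ** 2 for u in x)
    syy = sum((v - ybar) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

# First seven pairs of values listed under "Graphical Techniques".
xi = [3, 4, 2, 6, 8, 2, 5]
xk = [5, 5.5, 4, 7, 10, 5, 7.5]

# Hypothetical constants; a and c have the same sign.
a, b, c, d = 2.0, -1.0, 0.5, 10.0
yi = [a * v + b for v in xi]
yk = [c * v + d for v in xk]

r_before = corr(xi, xk)
r_after = corr(yi, yk)
print(math.isclose(r_before, r_after))

# If a and c have opposite signs, the magnitude is kept but the sign flips.
yk_flipped = [-c * v + d for v in xk]
r_flipped = corr(yi, yk_flipped)
print(math.isclose(r_flipped, -r_before))
```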
Example 1.2 (The arrays x̄, Sn and R for bivariate data) Consider the
data introduced in Example 1.1. Each receipt yields a pair of measurements:
total dollar sales and number of books sold. Find the arrays x̄, Sn and R.
Graphical Techniques
Variable 1 (x1): 3 4 2 6 8 2 5
Variable 2 (x2): 5 5.5 4 7 10 5 7.5
Variable 1 (x1): 5 4 6 2 2 8 3
Variable 2 (x2): 5 5.5 4 7 10 5 7.5
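The two small data sets above contain exactly the same values of x1 and of x2; only the pairing differs. A quick check, using an illustrative helper `corr` that is not part of the slides, suggests how much the pairing matters: the marginal values coincide, yet the sample correlation changes sign.

```python
import math

def corr(x, y):
    """Sample correlation coefficient (the divisor n cancels out)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y))
    sxx = sum((u - xbar) ** 2 for u in x)
    syy = sum((v - ybar) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

x2 = [5, 5.5, 4, 7, 10, 5, 7.5]
x1_first = [3, 4, 2, 6, 8, 2, 5]    # first arrangement of variable 1
x1_second = [5, 4, 6, 2, 2, 8, 3]   # same values, paired differently

# The one-variable summaries (and dot diagrams) cannot tell the two apart.
same_marginal = sorted(x1_first) == sorted(x1_second)

r_first = corr(x1_first, x2)
r_second = corr(x1_second, x2)
print(same_marginal, round(r_first, 2), round(r_second, 2))
```

This is why a scatter plot carries information that the separate dot diagrams of each variable do not.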
Example 1.3 (The effect of unusual observations on sample
correlations) Some financial data representing jobs and productivity for the
16 largest publishing firms appeared in an article in Forbes magazine on April
30, 1990. The data for the pair of variables x1 = employees (jobs) and x2 = profits
per employee (productivity) are graphed in Figure 1.3. We have labeled two
“unusual” observations. Dun & Bradstreet is the largest firm in terms of number
of employees, but is “typical” in terms of profits per employee. Time Warner
has a “typical” number of employees, but comparatively small (negative) profits
per employee.
The sample correlation coefficient computed from the values of x1 and x2 is
$$r_{12} = \begin{cases} -0.39 & \text{for all 16 firms} \\ -0.56 & \text{for all firms but Dun \& Bradstreet} \\ -0.39 & \text{for all firms but Time Warner} \\ -0.50 & \text{for all firms but Dun \& Bradstreet and Time Warner} \end{cases}$$
Example 1.4 (A scatter plot for baseball data) In a July 17, 1978, article
on money in sports, Sports Illustrated magazine provided data on x1 = player
payroll for National League East baseball teams.
We have added data on x2 = won-lost percentage for 1977. The results are
given in Table 1.1.
[Figure: scatter plot of x2 (won-lost percentage) against x1 (player payroll in millions of dollars)]
Example 1.5 (Multiple scatter plot for paper strength measurement)
Paper is manufactured in continuous sheets several feet wide. Because of the
orientation of fibers within the paper, it has a different strength when measured
in the direction produced by the machine than when measured across, or at
right angles to, the machine direction. The measured values include
Example 1.6 (Looking for lower-dimensional structure) A zoologist
obtained measurements on n = 25 lizards known scientifically as Cophosaurus
texanus. The weight, or mass, is given in grams, while the snout-vent length
(SVL) and hind limb span (HLS) are given in millimeters. The data are displayed
in Table 1.3.
Example 1.7 (Looking for group structure in three dimensions) Referring
to Example 1.6, it is interesting to see if male and female lizards occupy different
parts of the three-dimensional space containing the size data. The genders, by row,
for the lizard data in Table 1.3 are
f m f f m f m f m f m f m m m m f m m m f f m f f
Data Display and Pictorial Representations
Linking Multiple Two-Dimensional Scatter Plots
Example 1.8 (Linked scatter plots and brushing) To illustrate linked two-
dimensional scatter plots, we refer to the paper-quality data in Example 1.5.
These data represent measurements on the variables x1 = density, x2 = strength
in the machine direction, and x3 = strength in the cross direction.
Example 1.9 (Rotated plots in three dimensions) Four different
measurements of lumber stiffness are given. Specimen (board) 16 and possibly
specimen (board) 9 are identified as unusual observations.
Graphs of Growth Curves
Example 1.10 (Array of growth curves) The Alaska Fish and Game
Department monitors grizzly bears with the goal of maintaining a healthy
population. Bears are shot with a dart to induce sleep and weighed on a
scale hanging from a tripod. Measurements of length are taken with a steel
tape. The following table gives the weights (wt) in kilograms and lengths
(lngth) in centimeters of seven female bears at 2, 3, 4, and 5 years of age.
Stars
Example 1.11 (Utility data as stars) Stars representing the first 5 of the
22 public utility firms are shown in the following figure. There are eight
variables; consequently, the stars are distorted octagons.
Chernoff Faces
People react to faces. Chernoff suggested representing p-dimensional observations
as a two-dimensional face whose characteristics (face shape, mouth curvature,
nose length, eye size, pupil position, and so forth) are determined by the
measurements on the p variables.
Chernoff faces appear to be most useful for verifying (1) an initial grouping
suggested by subject-matter knowledge and intuition or (2) final groupings
produced by clustering algorithms.
Example 1.12 (Utility data as Chernoff faces) The data on the 22 public utility
companies were represented as Chernoff faces. We have the following correspondences:
Example 1.14 (Using Chernoff faces to show changes over time) The
following figure illustrates an additional use of Chernoff faces. In the figure,
the faces are used to track the financial well-being of a company over time.
As indicated, each facial feature represents a single financial indicator, and the
longitudinal changes in these indicators are thus evident at a glance.
1.5 Distance
• Statistical distance
• Standardize coordinates.
Suppose we have n pairs of measurements on two variables x1, x2, each having
mean zero. The standardized coordinates are
$$x_1^* = x_1/\sqrt{s_{11}} \quad\text{and}\quad x_2^* = x_2/\sqrt{s_{22}}.$$
Hence a statistical distance of the point P = (x1, x2) from the origin
O = (0, 0) can be defined as
$$d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}.$$
• All points which have coordinates (x1, x2) and are a constant squared distance
$c^2$ from the origin must satisfy
$$\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2.$$
• All points P that are a constant squared distance from Q lie on a
hyperellipsoid centered at Q whose major and minor axes are parallel to the
coordinate axes.
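The statistical distance for independent coordinates can be sketched in a few lines of Python. The function name `statistical_distance` and the variances s11 = 4, s22 = 1 are hypothetical choices made for illustration, and the function assumes the natural extension from the origin to a general point Q, namely dividing each squared coordinate difference by that coordinate's sample variance.

```python
import math

def statistical_distance(p, q, variances):
    """Distance between points p and q when coordinate i has sample
    variance variances[i] and the coordinates vary independently."""
    return math.sqrt(sum((x - y) ** 2 / s for x, y, s in zip(p, q, variances)))

# Hypothetical sample variances: x1 is four times as variable as x2.
s = [4.0, 1.0]
origin = [0.0, 0.0]

# Two points on the same ellipse x1^2/4 + x2^2/1 = 1.
d_a = statistical_distance([2.0, 0.0], origin, s)
d_b = statistical_distance([0.0, 1.0], origin, s)
print(d_a, d_b)  # prints 1.0 1.0
```

The point (2, 0) is twice as far from the origin as (0, 1) in straight-line terms, yet both are at statistical distance 1, because a deviation of 2 units in the more variable coordinate is no more surprising than a deviation of 1 unit in the less variable one.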
The distances defined above do not include most of the important cases
we shall encounter, because of the assumption of independent coordinates. See
the following scatter plot.
• Rotate x1 and x2 directions to directions x̃1 and x̃2.
• Define the distance from the point P = (x̃1, x̃2) to the origin O = (0, 0) as
$$d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}},$$
where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2
measurements.
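• The relation between the original coordinates (x1, x2) and the rotated
coordinates (x̃1, x̃2) is provided by
$$\tilde{x}_1 = x_1\cos\theta + x_2\sin\theta, \qquad \tilde{x}_2 = -x_1\sin\theta + x_2\cos\theta,$$
where θ is the angle through which the axes are rotated.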
• After some straightforward algebraic manipulations, the distance from P =
(x̃1, x̃2) to the origin O = (0, 0) can be written in terms of the original coordinates
x1 and x2 as
$$d(O, P) = \sqrt{a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2},$$
where the a's are numbers such that the distance is nonnegative for all
possible values of x1 and x2.
• In general, the statistical distance of the point P = (x1, x2) from the fixed
point Q = (y1, y2) for the situation in which the variables are correlated has
the general form
$$d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2}$$
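The step from the rotated form to the quadratic form in the original coordinates can be checked numerically. The sketch below assumes the standard rotation of the axes through an angle theta; the function names, the angle, the rotated variances, and the test point are all hypothetical values chosen for illustration. The coefficients come from expanding the rotated expression term by term.

```python
import math

def rotated_distance(x1, x2, theta, s11_t, s22_t):
    """Distance from the origin, computed in the rotated coordinates."""
    xt1 = x1 * math.cos(theta) + x2 * math.sin(theta)
    xt2 = -x1 * math.sin(theta) + x2 * math.cos(theta)
    return math.sqrt(xt1 ** 2 / s11_t + xt2 ** 2 / s22_t)

def quadratic_coefficients(theta, s11_t, s22_t):
    """Coefficients a11, a12, a22 obtained by expanding the rotated form."""
    c, s = math.cos(theta), math.sin(theta)
    a11 = c * c / s11_t + s * s / s22_t
    a12 = c * s * (1.0 / s11_t - 1.0 / s22_t)
    a22 = s * s / s11_t + c * c / s22_t
    return a11, a12, a22

def quadratic_distance(x1, x2, a11, a12, a22):
    """Distance from the origin, written in the original coordinates."""
    return math.sqrt(a11 * x1 ** 2 + 2 * a12 * x1 * x2 + a22 * x2 ** 2)

# Hypothetical rotation angle and variances along the rotated axes.
theta = math.radians(30)
s11_t, s22_t = 4.0, 1.0

a11, a12, a22 = quadratic_coefficients(theta, s11_t, s22_t)
d_rot = rotated_distance(1.5, -2.0, theta, s11_t, s22_t)
d_quad = quadratic_distance(1.5, -2.0, a11, a12, a22)
print(math.isclose(d_rot, d_quad))
```

Replacing x1 and x2 by the differences x1 − y1 and x2 − y2 gives the distance from P to a fixed point Q. With these coefficients, a11 > 0 and a11·a22 − a12² > 0, which is what keeps the quantity under the square root nonnegative for every point.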
(c) d(P, Q) = 0 if P = Q