Multivariate Data Analysis
Work Plan
Quantitative Methods
Multivariate Data Analysis
2016/2017
Pedro Campos / Paula Brito
DATA ARRAY
n "individuals" in rows
p variables (attributes) in columns
15/11/2016
X     Y1    Y2    ...   Yj    ...   Yp
I1    x11   x12   ...   x1j   ...   x1p
I2    x21   x22   ...   x2j   ...   x2p
...   ...   ...   ...   ...   ...   ...
Ii    xi1   xi2   ...   xij   ...   xip
...   ...   ...   ...   ...   ...   ...
In    xn1   xn2   ...   xnj   ...   xnp
VARIABLES
• Numerical (Quantitative)
When their values are real numbers
VARIABLES
• Categorical
When their values (categories, modalities) are not real numbers,
although numerical codes may be used to represent them
• Dependence methods
• Interdependence methods
Interdependence Methods
Quantitative Data
• Factor Analysis
• Principal Component Analysis
• Canonical Correlation
• Cluster Analysis
Qualitative Data
• Log-linear models
• Multiple Correspondence Analysis (HOMALS)
• CatPCA
Supervised Methods
Dependence methods
- Several dependent variables in one relation
  Type of dependent variables: quantitative
  Multivariate Regression; Multivariate Analysis of Variance (MANOVA)
- One dependent variable in one relation
  Type of dependent variable: quantitative or qualitative
Interdependence methods
- Variables: Factor Analysis, Principal Component Analysis, Canonical Correlation
- Cases: Cluster Analysis
Multiple Regression
Goal: explain the behaviour of one or more variables
according to other variables
Dependent variables: quantitative
Independent variables: quantitative, or qualitative recoded as
binary (dummy) variables
Example:
Create a model to predict profitability from equity, return
on equity and solvency.
Model:
Y = β0 + β1 X1+ β2 X2 + …+ βp Xp + ε
Hypotheses :
Only Y is affected by measurement errors
Residuals εi are random, independent and Normally distributed,
with zero mean and constant variance: εi ~ N(0, σ²)
Residuals εi are uncorrelated with the independent variables X1,…, Xp
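The model above can be illustrated with a minimal sketch that estimates β0, …, βp by ordinary least squares. Least squares is assumed here as the estimation method, since the slides do not name one. The function name fit_ols and the small noise-free dataset are hypothetical and only serve to show that the coefficients are recovered.

```python
def fit_ols(X, y):
    """Least-squares estimates [b0, b1, ..., bp] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for b0
    m = len(rows[0])
    # Build A = X'X and b = X'y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, m):
            factor = A[k][col] / A[col][col]
            for c in range(col, m):
                A[k][c] -= factor * A[col][c]
            b[k] -= factor * b[col]
    # Back substitution
    beta = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, m))
        beta[i] = (b[i] - tail) / A[i][i]
    return beta


# Hypothetical data generated exactly from Y = 1 + 2*X1 - 3*X2 (no error term)
X = [(1, 0), (0, 1), (1, 1), (2, 1), (3, 5)]
y = [3, -2, 0, 2, -8]
beta = fit_ols(X, y)
print([round(value, 6) for value in beta])
```

With real data the error term ε is not zero, so the estimates only approximate the underlying coefficients, and the hypotheses listed above should be checked on the residuals.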
Example :
To verify if sales of specific products depend on size of
store and location
Hypotheses: we assume that the numeric (dependent) variables follow
a Normal distribution in each population, and that the variances in
the different populations are equal.
Examples :
A Marketing department wants to find certain parameters of
customers, to distinguish buyers from non-buyers of some products,
and to use this information to predict the behaviour of new customers.
A bank needs to find parameters that distinguish successful firms from
those that fail, and to use this information to make decisions about loans.
Factorial Analysis
Applies to quantitative (numerical) variables
It is assumed that :
a) The observed variables Yj, the common factors and
the specific factors have zero mean;
b) The specific factors are uncorrelated with each
other and with the common factors;
Orthogonal model:
c) The common factors are uncorrelated with each
other and have unit variance
Factorial Analysis
Example (Sharma):
Correlation matrix between marks (given):
     M      P      C      E      H      F
M    1
P    0.62   1
C    0.54   0.51   1
E    0.32   0.38   0.36   1
H    0.284  0.351  0.336  0.686  1
F    0.37   0.43   0.405  0.73   0.735  1
Factorial Analysis
New variables:
Linear combinations of the original variables that are
uncorrelated and maximize variance
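As a rough illustration of a variance-maximizing linear combination, the sketch below extracts the first principal component of the correlation matrix between marks given earlier. Power iteration is an assumed numerical method chosen for brevity, not one named in the slides, and the function name first_component is hypothetical.

```python
import math

# Correlation matrix between the six marks (M, P, C, E, H, F),
# entered as the lower triangle shown in the example.
LABELS = ["M", "P", "C", "E", "H", "F"]
LOWER = [
    [1.0],
    [0.62, 1.0],
    [0.54, 0.51, 1.0],
    [0.32, 0.38, 0.36, 1.0],
    [0.284, 0.351, 0.336, 0.686, 1.0],
    [0.37, 0.43, 0.405, 0.73, 0.735, 1.0],
]
# Fill in the symmetric upper triangle
R = [[LOWER[max(i, j)][min(i, j)] for j in range(6)] for i in range(6)]


def first_component(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(w, v))  # Rayleigh quotient
    return eigenvalue, v


eigenvalue, weights = first_component(R)
explained = eigenvalue / len(R)  # share of the total variance (trace = 6)
for label, weight in zip(LABELS, weights):
    print(label, round(weight, 3))
print("share of variance:", round(explained, 3))
```

The weights define the first new variable as a linear combination of the standardized marks. Further components would be obtained from the remaining eigenvectors, each uncorrelated with the previous ones.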
CLUSTER ANALYSIS
Marketing:
Potential clients :
socio-economic characteristics, preferences
→ Identification of market segments
Finance:
Companies : financial indicators
→ Typology of companies ?
Applies to elements described by numerical or binary
variables (not both simultaneously)
Objective :
Given : n objects described by p variables
Potential clients: socio-economic characteristics, past expenses
Companies: financial indicators
Cities: social structure, facilities
...
Determine a CLUSTERING :
Structure the objects in classes
The objective is to group the objects into classes such that
objects in the same class are similar and objects in different
classes are dissimilar
Clustering Models
• Partition
• Hierarchical Models
Comparing elements
It is necessary to select a comparison measure between
pairs of elements of the set to be clustered
Examples of measures for numerical data:
- Euclidean distance
- Manhattan, or City-Block distance
- Mahalanobis distance
- …
Consider standardization
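The first two measures, and the effect of standardization, can be sketched as follows. The function names and the toy data are hypothetical. The Mahalanobis distance is left out because it additionally requires the inverse covariance matrix.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (City-Block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def standardize(rows):
    """Replace each column by its z-scores (population standard deviation)."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
        for c, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical data: the second variable is on a much larger scale
data = [[1, 1000], [2, 2000], [3, 3000]]
raw_distance = euclidean(data[0], data[1])  # dominated by the second variable
z = standardize(data)
std_distance = euclidean(z[0], z[1])        # both variables weigh equally
print(raw_distance, std_distance)
```

Without standardization the variable with the largest scale dominates the distance, which is why standardization should be considered when variables are measured in different units.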
Hierarchical Clustering
Hierarchical model:
Set of nested partitions
Dendrogram
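A minimal sketch of how a set of nested partitions can be built by agglomeration: start with one class per element and repeatedly merge the two closest classes. Single linkage (the distance between classes is the smallest distance between their members) is an assumption made here for simplicity; the slides do not specify an aggregation criterion. The function names and the points are hypothetical.

```python
import math


def single_linkage(points):
    """Merge history as a list of (class_a, class_b, merge_distance)."""
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def dist(a, b):
        # Single linkage: smallest distance between members of a and b
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = [
            (dist(a, b), a, b)
            for k, a in enumerate(clusters)
            for b in clusters[k + 1:]
        ]
        d, a, b = min(pairs)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((a, b, d))
    return merges


def cut(points, merges, height):
    """Partition obtained by cutting the dendrogram at the given height."""
    clusters = {(i,) for i in range(len(points))}
    for a, b, d in merges:
        if d <= height:
            clusters -= {a, b}
            clusters.add(tuple(sorted(a + b)))
    return sorted(clusters)


# Hypothetical points: two tight pairs and one isolated element
points = [(0, 0), (0, 1), (5, 0), (5, 1.5), (20, 0)]
merges = single_linkage(points)
print(cut(points, merges, 2))  # finer partition
print(cut(points, merges, 6))  # coarser partition containing the finer one
```

Each merge corresponds to one node of the dendrogram, and cutting at increasing heights yields coarser and coarser partitions, each nested in the next.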
Example
CITY            BUYING POWER   BASKET     RENT   TAXI  HOTEL NIGHT  AVERAGE INCOME
Amsterdam              78.00  1339.00   520.00  10.58       286.00        16486.00
Caracas                14.30   795.00   210.00   2.96       148.00         1910.00
Chicago                99.70  1474.00   900.00   5.00       218.00        25129.00
Helsinki               54.80  1597.00   570.00   7.56       194.00        13463.00
Houston                96.30  1314.00   430.00   6.00       149.00        21997.00
Jakarta                18.10  1035.00   980.00   1.42       245.00         3253.00
London                 59.90  1354.00   810.00   7.16       375.00        13348.00
Luxembourg            114.00  1371.00  1080.00   8.92       227.00        24564.00
Rio de Janeiro         22.20  1067.00   450.00   2.58       194.00         3900.00
Zurich                100.00  1946.00   740.00  14.36       287.00        32420.00
Class 1 : Amsterdam, Chicago, Helsinki, Houston,
Luxembourg, Zurich
Non-Hierarchical Clustering
Objective :
Determine (directly) partitions P = {C1,… , Ck} of the set W, i.e.,
families of k classes which do not intersect and that
jointly cover the whole set:
Ci ∩ Cj = ∅ for i ≠ j, and C1 ∪ … ∪ Ck = W
K-Means method
Fix the number of clusters – k
Starting from a set of k initial centers - elements of W -
assign each element to the class with nearest center.
After each assignment the cluster center is re-
computed.
After assigning all elements, the method may be
iterated.
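The steps above can be sketched as follows, with each cluster center recomputed right after every assignment and the pass over all elements repeated until no element changes class. The function name k_means, the choice of initial centers and the toy points are hypothetical; Euclidean distance is assumed as the comparison measure.

```python
import math


def k_means(points, initial_centers, passes=10):
    """Return (labels, centers) after iterating the assignment passes."""
    k = len(initial_centers)
    centers = [list(c) for c in initial_centers]
    labels = [None] * len(points)
    for _ in range(passes):
        changed = False
        for idx, p in enumerate(points):
            # Assign the element to the class with the nearest center
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            if labels[idx] != nearest:
                previous = labels[idx]
                labels[idx] = nearest
                changed = True
                # Recompute the centers of the classes that gained or lost it
                for c in (previous, nearest):
                    if c is None:
                        continue
                    members = [points[i] for i, l in enumerate(labels) if l == c]
                    if members:
                        centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break  # stable partition: a further pass would change nothing
    return labels, centers


# Hypothetical data: two well-separated groups, k = 2
points = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 9.5), (8.5, 9)]
labels, centers = k_means(points, initial_centers=[points[0], points[3]])
print(labels)
print(centers)
```

The result generally depends on the initial centers and on the order in which elements are processed, so in practice several starting configurations are often compared.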
Hierarchical vs. Non-hierarchical Clustering
Hierarchical                             Non-hierarchical
Series of “solutions”                    One single solution
No need to fix the number of clusters    Need to fix the number of clusters
Solution not improved                    Optimized solution
Computationally very heavy               Computationally “lighter”: fewer calculations and comparisons
Not suitable for large datasets          Suitable for large datasets