Multivariate Data Analysis

The document discusses quantitative methods for multivariate data analysis, including multiple regression, multivariate analysis of variance (MANOVA), and factor analysis. It provides examples of using these methods to predict variables like profitability based on other variables, and to analyze if numeric variables like sales depend on categorical variables like store size and location. The document also outlines the basic steps, assumptions, and models for multivariate data analysis techniques.

Uploaded by

Denise Maciel

15/11/2016

Work Plan
Quantitative Methods
Multivariate Data Analysis
2016/2017
Pedro Campos / Paula Brito

DATA ARRAY
n "individuals" in rows
p variables (attributes) in columns

      nb. children   weight   gender   education
I1    2              52       1        2
I2    1              55       1        3
I3    0              50       1        2
I4    3              60       2        1


Data array - Example

The table below records, for some Portuguese towns, the values
of % of workers in industry, nb. of ATM machines and nb. of
available sports facilities (old data...)

             Workers in industry (%)   ATM machines   Sports facilities
Aveiro       47,07                     36             81
Beja         10,35                     12             52
Braga        46,81                     50             125
Guimarães    79,36                     34             103
Portimão     6,07                      19             104

DATA ARRAY
X Y1 Y2 ... Yj ... Yp
I1 x11 x12 ... x1j ... x1p
I2 x21 x22 ... x2j ... x2p
... ... ... ... ... ... ...
Ii xi1 xi2 ... xij ... xip
... ... ... ... ... ... ...
In xn1 xn2 ... xnj ... xnp
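
As an illustration, the towns table can be stored as an n × p array, with one row per individual and one column per variable. This is only a sketch: the variable names are hypothetical labels chosen here, and decimal commas are written as decimal points.

```python
import numpy as np

# Rows = individuals (towns), columns = variables, as in the generic array X.
towns = ["Aveiro", "Beja", "Braga", "Guimarães", "Portimão"]
# Hypothetical column labels, not taken from the slides.
variables = ["workers_in_industry_pct", "atm_machines", "sports_facilities"]

X = np.array([
    [47.07, 36, 81],
    [10.35, 12, 52],
    [46.81, 50, 125],
    [79.36, 34, 103],
    [6.07, 19, 104],
])

n, p = X.shape   # n individuals, p variables
x_32 = X[2, 1]   # x_ij for i = 3 (Braga), j = 2 (ATM machines); indices are zero-based
print(n, p, x_32)
```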

VARIABLES
• Numerical (Quantitative)
When their values are real numbers

• Discrete : if the value set is finite, or infinite but countable
Ex. : nb. children
nb. times you use the cell phone each day

• Continuous : if the value set is infinite and non-countable
Ex. : height, weight, temperature

VARIABLES
• Numerical

• Interval scale : if they do not have an absolute zero
Ex: temperature

• Ratio scale : it is possible to define an exact relation
between variable values, since the scale has an absolute zero
Ex: weight


VARIABLES
• Categorical
If their values - categories, modalities - are not real numbers,
although numerical codes may be used

• Ordinal : if the values are naturally ordered
Ex.: education level

• Nominal : if the values are not ordered
Ex.: nationality, job

Multivariate Data Analysis

Multivariate data analysis comprises a set of statistical methods,
which are used to analyse together several variables, observed for
each individual or object.


Multivariate Data Analysis


Steps for a multivariate data analysis:
1. Establish the objectives of the analysis
2. Design the analysis (sample size, type of variables,
statistical methods,…);
3. Check the hypotheses/assumptions of the selected
methods;
4. Perform the analysis (estimation of the multivariate
method);
5. Interpret the obtained results (this often leads to a
reformulation of the model);
6. Validate the results.

Multivariate Data Analysis

There are 2 large groups of multivariate methods :

• Dependence methods
• Interdependence methods


Main Techniques of Multivariate Analysis

The dependence methods assume the division of the variables
into two groups, the dependent and the independent variables,
and the objective is to assess whether the independent
variables have some influence on the dependent ones, and how.

The interdependence methods make no distinction between
dependent and independent variables, and the objective is to
determine which are related, how they are related, and why.

Examples of Dependence Methods

Quantitative Dependent Variable
• Multiple Linear Regression
• Multivariate Analysis of Variance (MANOVA)

Qualitative Dependent Variable
• Discriminant Analysis
• Logistic Regression


Interdependence Methods

Quantitative Data
• Factor Analysis
• Principal Component Analysis
• Canonical Correlation
• Cluster Analysis

Qualitative Data
• Log-linear models
• Multiple Correspondence Analysis (HOMALS)
• CatPCA

One view of multivariate methods…

What type of relation?

Dependence (supervised methods)
How many variables to predict?
• Several dependent variables in one relation
  Type of dependent variables: quantitative or qualitative
  For quantitative dependent variables, what is the type of
  independent variables?
  - Qualitative: Multivariate Analysis of Variance (MANOVA)
  - Quantitative: Multivariate Regression
• One dependent variable in one relation
  Type of dependent variable:
  - Quantitative: Multiple Regression
  - Qualitative: Discriminant Analysis, Logistic Regression

Interdependence (nonsupervised methods)
Is a relation between:
• Variables: Factor Analysis, Principal Component Analysis,
  Canonical Correlation
• Cases: Cluster Analysis


Multiple Regression
Goal: explain the behaviour of one or more variables
according to other variables
Dependent variables : quantitative
Independent variables : quantitative, or qualitative
recoded as binary (dummy) variables

Example:
Create a model to predict profitability from equity, return
on equity and solvency.

Linear Regression: the relation between the variables may be
described through a linear function
(if there is only one independent variable → a line)

Multiple Regression
Model:
Y = β0 + β1 X1 + β2 X2 + … + βp Xp + ε

For each case:
Yi = β0 + β1 Xi1 + β2 Xi2 + … + βp Xip + εi (i = 1, …, n)
βj – regression coefficients; εi – residuals (error terms)

Hypotheses :
Only Y is affected by measurement errors
Residuals εi are random, independent, with Normal distribution
with zero mean and constant variance σ²: εi ~ N(0, σ²)
Residuals εi are not correlated with the independent variables X1, …, Xp
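
The model above can be estimated by ordinary least squares. The sketch below is one possible illustration, not the course's own procedure: the data are simulated, the three predictors merely stand in for equity, return on equity and solvency, and the "true" coefficients are arbitrary values chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors X1..X3 (hypothetical stand-ins for the example's variables).
X = rng.normal(size=(n, 3))
beta_true = np.array([1.0, 2.0, -0.5, 0.3])   # β0, β1, β2, β3 (arbitrary)
eps = rng.normal(scale=0.1, size=n)           # εi ~ N(0, σ²) with σ = 0.1
y = beta_true[0] + X @ beta_true[1:] + eps

# Design matrix with a leading column of ones for the intercept β0.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Estimated residuals; with an intercept they average to zero by construction.
residuals = y - A @ beta_hat
print(beta_hat)
```

Plotting the residuals against each Xj and against the fitted values is a common informal check of the hypotheses listed above.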


Analysis of Variance – (M)ANOVA


Goal: verify if the behaviour of one (or more) numerical
variables depends on qualitative variables - factors
Dependent variables: quantitative
Independent variables: qualitative

Example :
To verify if sales of specific products depend on size of
store and location
Hypotheses: we assume that numeric (dependent) variables follow
Normal distribution in each population, and that variances in
different populations are equal.
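
For the simplest case (one numerical variable, one factor) the analysis of variance reduces to the F ratio of between-group to within-group variability. The sketch below computes it by hand; the sales figures are made-up numbers used only for illustration, and the multivariate case (MANOVA) is not covered.

```python
import numpy as np

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of samples (one per group)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n_total = len(groups), len(all_values)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical sales observed in stores of three sizes.
sales_by_store_size = {
    "small": [1, 2, 3],
    "medium": [2, 3, 4],
    "large": [5, 6, 7],
}
f_stat = one_way_anova_f(sales_by_store_size.values())
print(f_stat)
```

A large F suggests that mean sales differ between store sizes; the p-value would come from the F distribution with (k - 1, N - k) degrees of freedom.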

Discriminant Analysis (Linear)


Dependent variable: qualitative, categories: groups to
discriminate
Independent variables : quantitative

Examples :
A Marketing department wants to find certain parameters of
customers, to distinguish buyers from non-buyers of some products,
and to use this information to predict the behaviour of new customers.
A bank needs to find parameters that identify successful firms (and
those that fail), and use this information to make decisions about loans.


Discriminant Analysis (Linear)

Goals:
Identify the variables that most distinguish groups
Use these variables to build an index to briefly represent
the differences between groups
Use the identified variables and the index to create a rule
that allows classifying future observations in one of the
groups.
Hypotheses :
Explanatory (independent) variables have multivariate
Normal distribution in each group
Variance-covariance matrices are equal in all groups
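
For two groups, the index and the classification rule can be sketched with Fisher's linear discriminant. This is a minimal illustration under the hypotheses above, with equal prior probabilities assumed; the data are simulated and the function names are chosen here.

```python
import numpy as np

def fit_lda_two_groups(X1, X2):
    """Return the discriminant weights w and the cut-off for two groups."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled variance-covariance matrix (assumed equal in both groups).
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False)
                + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S_pooled, m1 - m2)   # weights of the index w'x
    cutoff = w @ (m1 + m2) / 2               # midpoint rule (equal priors)
    return w, cutoff

def classify(x, w, cutoff):
    """Assign x to group 1 if its index is above the cut-off, else to group 2."""
    return 1 if np.asarray(x) @ w > cutoff else 2

rng = np.random.default_rng(0)
buyers = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))       # simulated group 1
non_buyers = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(50, 2))   # simulated group 2

w, cutoff = fit_lda_two_groups(buyers, non_buyers)
print(classify([0.0, 0.0], w, cutoff), classify([4.0, 4.0], w, cutoff))
```

The variables with the largest standardized weights in w are those that contribute most to separating the groups.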

Discriminant Analysis (Linear)

When assumptions are not met:

Variance-covariance matrix IS NOT equal in all the groups:
→ Quadratic Discriminant Analysis. Requires large samples.

Explanatory (independent) variables deviate a lot from
Normal Distribution:
→ Logistic regression


Factorial Analysis
Applies to quantitative (numerical) variables

Objective : to identify a small number of factors that
allow explaining the relations between variables.

Example : sales values of different products may be
explained by common factors such as quality, utility, etc.

Factorial Analysis allows identifying underlying factors,
which cannot be directly observed.

Factorial Analysis

The observed correlations between variables are then
due to the fact that they "share" these factors.

Analysis of the correlation matrix :
the factorial model only makes sense if the variables
are indeed correlated ;
if correlations are very low, it is unlikely that the
variables share common factors.


Factorial Analysis

In general, the model is written as :

Yj = aj1 F1 + aj2 F2 +... + ajk Fk + Uj

F1, F2 , ... , Fk - common factors


Uj - specific factor

aj1 , aj2 , ..., ajk : loadings

Factorial Analysis
It is assumed that :
a) The observed variables Yj, the common factors and
the specific factors have null mean ;
b) The specific factors are not correlated among
themselves, nor with the common factors ;

Orthogonal Model :
c) The common factors are not correlated among
themselves and have unit variance


Factorial Analysis
Example (Sharma) :

Consider students’ marks on 6 subjects :
mathematics, physics, chemistry, English, history and
French.
Each mark may be written as a function of
- the student’s intelligence/capacity - common factor
- the opposition between quantitative capacity and
verbal capacity - common factor
- aptitude for the subject – specific factor

Example - cont.
Correlation matrix between marks (given) :

M P C E H F
M 1
P 0,62 1
C 0,54 0,51 1
E 0,32 0,38 0,36 1
H 0,284 0,351 0,336 0,686 1
F 0,37 0,43 0,405 0,73 0,735 1


Factorial Analysis

M = 0,675 F1 + 0,557 F2 + ApM
P = 0,717 F1 + 0,447 F2 + ApP
C = 0,683 F1 + 0,418 F2 + ApC
E = 0,793 F1 - 0,410 F2 + ApE
H = 0,774 F1 - 0,461 F2 + ApH
Fr = 0,837 F1 - 0,359 F2 + ApFr

(Fr : French, labelled F in the correlation matrix ; Ap : aptitude, the specific factor)

Factorial Analysis

The correlations between the observed variables
and the common factors (standardized principal
components) are given by the pattern loadings.

→ Interpretation of the factors
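
Assuming principal-components extraction (suggested by the text above, but not stated for this example), loadings can be computed from the correlation matrix of the marks as eigenvectors scaled by the square roots of their eigenvalues. The sketch below uses the matrix given earlier, with decimal points instead of commas; the sign of each column is arbitrary, so a whole column may appear with opposite signs.

```python
import numpy as np

labels = ["M", "P", "C", "E", "H", "F"]

# Lower triangle of the correlation matrix between marks.
lower = [
    [1],
    [0.62, 1],
    [0.54, 0.51, 1],
    [0.32, 0.38, 0.36, 1],
    [0.284, 0.351, 0.336, 0.686, 1],
    [0.37, 0.43, 0.405, 0.73, 0.735, 1],
]
R = np.zeros((6, 6))
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

# Eigen-decomposition of the symmetric matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings on the first two components: eigenvector times sqrt(eigenvalue).
loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])
for name, (a1, a2) in zip(labels, loadings):
    print(f"{name}: {a1:+.3f} F1 {a2:+.3f} F2")
```

A first column with loadings of the same sign on all subjects, and a second column with opposite signs for the quantitative and the verbal subjects, would support the interpretation given in the example.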


Factorial Analysis

Methods for factor extraction :
Principal Components
Principal Axis
Unweighted least squares
Generalized least squares
Maximum Likelihood
Alpha Method
Image Factoring

Principal Component Analysis

Principal Components :

New variables
Linear combinations of the original variables, non-correlated,
and that maximize variance

They are obtained from the eigenvectors of the
correlation matrix, associated with the largest
eigenvalues
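
These properties can be checked numerically. The sketch below, a minimal illustration using the towns data from the data array example (decimal points instead of commas), computes the principal components from the correlation matrix and verifies that their scores are uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

# Towns data: % workers in industry, ATM machines, sports facilities.
X = np.array([
    [47.07, 36, 81],
    [10.35, 12, 52],
    [46.81, 50, 125],
    [79.36, 34, 103],
    [6.07, 19, 104],
])

# Standardize each variable (zero mean, unit sample variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, largest eigenvalue first.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components: linear combinations of the standardized variables.
scores = Z @ eigvecs

# Covariance matrix of the scores: diagonal, with the eigenvalues on the diagonal.
print(np.round(np.cov(scores, rowvar=False), 6))
print(eigvals / eigvals.sum())   # proportion of dispersion explained by each component
```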


Principal Component Analysis

If an important part of the dispersion is explained by a
small number of principal components, then we may
use just some of them for interpretation and future
analysis, instead of the original p.

How many components should be kept?
Which percentage of dispersion are we ready to
sacrifice ?
How much is just “noise” ?

Principal Component Analysis

1) Pearson’s criterion:
Keep a number q of components such that they explain at
least 80% of the total dispersion.
2) Observe the graphical representation of the eigenvalues
(in decreasing order) and keep those λα for which the drop to
the next one is still noticeable : λα - λα+1 > ε (ε relatively
small) - “elbow rule”.
3) Kaiser proposed to only keep the eigenvalues above 1 -
i.e., the principal components which are “more
informative” than the original variables, i.e., whose
variance is above the variance of the (standardized)
original variables.
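
The three criteria can be applied directly to a list of eigenvalues. The sketch below is only an illustration: the eigenvalues are made-up numbers summing to p = 6, and the function names are chosen here.

```python
import numpy as np

def n_components_explained(eigenvalues, threshold=0.80):
    """Smallest q whose components explain at least `threshold` of the dispersion."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    # Index of the first cumulative proportion reaching the threshold.
    return int(np.argmax(cumulative >= threshold - 1e-12)) + 1

def n_components_kaiser(eigenvalues):
    """Number of eigenvalues above 1 (Kaiser's rule, correlation-matrix PCA)."""
    return int(np.sum(np.asarray(eigenvalues, dtype=float) > 1.0))

def successive_drops(eigenvalues):
    """Differences between consecutive sorted eigenvalues, to look for the elbow."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return ev[:-1] - ev[1:]

# Hypothetical eigenvalues of a 6 x 6 correlation matrix.
eigenvalues = [3.0, 1.5, 0.8, 0.4, 0.2, 0.1]

q_80 = n_components_explained(eigenvalues)
q_kaiser = n_components_kaiser(eigenvalues)
drops = successive_drops(eigenvalues)
print(q_80, q_kaiser, drops)
```

The criteria need not agree: here the 80% rule keeps one more component than Kaiser's rule, so the final choice also depends on whether the extra component can be interpreted.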


Factorial Analysis with Qualitative Variables

Specific methods for qualitative variables

• Multiple Correspondence Analysis


• CatPCA

CLUSTER ANALYSIS
Marketing:
Potential clients :
socio-economic characteristics, preferences
→ Identification of market segments

Finance:
Companies : financial indicators
→ Typology of companies ?


CLUSTER ANALYSIS
Applies to elements described by numerical or binary
variables (not simultaneously)
Objective :
Given : n objects described by p variables
  Potential clients    socio-economic charac., past expenses
  Companies            financial indicators
  Cities               social structure, facilities
  ...
Determine a CLUSTERING :
Structure the objects in classes

CLUSTER ANALYSIS
The objective is grouping the objects in classes, such that

- elements of a given class are quite similar to each
other – homogeneous classes

- classes are "relatively distinct" from each other –
well separated classes


Clustering Models
Partition

Disjoint classes which together cover the whole set to
be clustered

Clustering Models
Hierarchical Models

Classes are organized in a nested structure


Comparing elements
It is necessary to select a comparison measure between
pairs of elements of the set to be clustered
Examples of measures for numerical data:
- Euclidean distance
- Manhattan, or City-Block distance
- Mahalanobis distance
- …
Consider standardization

Many measures for binary variables
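
The numerical measures listed above can be sketched as follows. The example uses two variables (buying power and average income) for three cities taken from the example table, with decimal points instead of commas; it illustrates why standardization matters when variables have very different scales.

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def manhattan(a, b):
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def mahalanobis(a, b, cov):
    """Distance that accounts for the variances and covariances in `cov`."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# Buying power and average income for Amsterdam, Caracas and Chicago.
X = np.array([
    [78.0, 16486.0],
    [14.3, 1910.0],
    [99.7, 25129.0],
])

# Raw distance: almost entirely driven by income, the variable with the largest scale.
raw = euclidean(X[0], X[1])

# Standardized data: each variable has zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
standardized = euclidean(Z[0], Z[1])

print(raw, standardized, manhattan(Z[0], Z[1]))
```

With the identity matrix as `cov`, the Mahalanobis distance coincides with the Euclidean distance.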

Hierarchical Clustering
Hierarchical model:
Set of nested partitions

Dendrogram


Example

CITY             BUYING POWER   BASKET    RENT      TAXI    HOTEL NIGHT   AVERAGE INCOME
Amsterdam        78,00          1339,00   520,00    10,58   286,00        16486,00
Caracas          14,30          795,00    210,00    2,96    148,00        1910,00
Chicago          99,70          1474,00   900,00    5,00    218,00        25129,00
Helsinki         54,80          1597,00   570,00    7,56    194,00        13463,00
Houston          96,30          1314,00   430,00    6,00    149,00        21997,00
Jakarta          18,10          1035,00   980,00    1,42    245,00        3253,00
London           59,90          1354,00   810,00    7,16    375,00        13348,00
Luxembourg       114,00         1371,00   1080,00   8,92    227,00        24564,00
Rio de Janeiro   22,20          1067,00   450,00    2,58    194,00        3900,00
Zurich           100,00         1946,00   740,00    14,36   287,00        32420,00


Example
Class 1 : Amsterdam, Chicago, Helsinki, Houston,
Luxembourg, Zurich

Class 2 : Caracas, Jakarta, London, Rio de Janeiro

CLASS    BUYING POWER   BASKET    RENT     TAXI   HOTEL NIGHT   AVERAGE INCOME
class1   90,47          1506,83   706,67   8,74   226,83        22343,17
class2   28,63          1062,75   612,50   3,53   240,50        5602,75
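
The class profiles above are the means of each variable within each class. As a check, the sketch below recomputes them from the cities table and the two classes given (decimal points instead of commas; the variable names are labels chosen here).

```python
import numpy as np

cities = ["Amsterdam", "Caracas", "Chicago", "Helsinki", "Houston",
          "Jakarta", "London", "Luxembourg", "Rio de Janeiro", "Zurich"]
variables = ["buying_power", "basket", "rent", "taxi", "hotel_night", "average_income"]

X = np.array([
    [78.00, 1339.00, 520.00, 10.58, 286.00, 16486.00],
    [14.30, 795.00, 210.00, 2.96, 148.00, 1910.00],
    [99.70, 1474.00, 900.00, 5.00, 218.00, 25129.00],
    [54.80, 1597.00, 570.00, 7.56, 194.00, 13463.00],
    [96.30, 1314.00, 430.00, 6.00, 149.00, 21997.00],
    [18.10, 1035.00, 980.00, 1.42, 245.00, 3253.00],
    [59.90, 1354.00, 810.00, 7.16, 375.00, 13348.00],
    [114.00, 1371.00, 1080.00, 8.92, 227.00, 24564.00],
    [22.20, 1067.00, 450.00, 2.58, 194.00, 3900.00],
    [100.00, 1946.00, 740.00, 14.36, 287.00, 32420.00],
])

classes = {
    "class1": ["Amsterdam", "Chicago", "Helsinki", "Houston", "Luxembourg", "Zurich"],
    "class2": ["Caracas", "Jakarta", "London", "Rio de Janeiro"],
}

# Mean of every variable over the cities of each class.
class_means = {
    name: X[[cities.index(c) for c in members]].mean(axis=0)
    for name, members in classes.items()
}
for name, means in class_means.items():
    print(name, np.round(means, 2))
```

The two classes differ mainly in buying power, taxi price and average income, while rent and hotel prices are much closer.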

Non-Hierarchical Clustering
Objective :
Determine (directly) partitions P = {C1,… , Ck}, i.e.,
families of k classes which do not intersect and that
jointly cover the whole set W to be clustered :
Ci ∩ Cj = ∅ for i ≠ j, and C1 ∪ … ∪ Ck = W


K-Means method
Fix the number of clusters – k
Starting from a set of k initial centers - elements of W -
assign each element to the class with nearest center.
After each assignment the cluster center is re-
computed.
After assigning all elements, the method may be
iterated.

Also known as : moving-centers method
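
A minimal sketch of the method follows. Note one difference from the description above: this is the batch variant, in which the centers are recomputed once all elements have been assigned, not after each single assignment. The data points and the choice of initial centers are made up for illustration.

```python
import numpy as np

def k_means(X, initial_centers, max_iter=100):
    """Batch k-means: assign all elements, recompute the centers, repeat."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(max_iter):
        # Euclidean distance from every element to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of the elements assigned to the class
        # (an empty class keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break   # centers no longer move: the partition is stable
        centers = new_centers
    return labels, centers

# Six hypothetical elements forming two well separated groups.
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])

# k = 2, with two elements of the set as initial centers.
labels, centers = k_means(X, initial_centers=X[[0, 3]])
print(labels, centers)
```

The result depends on the initial centers, so in practice the method is often run from several starting points and the best partition is kept.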

Hierarchical VS Non-hierarchical Clustering

Hierarchical                            Non-hierarchical
Series of “solutions”                   One single solution
No need to fix the number of clusters   Need to fix the number of clusters
Solution not improved                   Optimized solution
Computationally very heavy              Computationally “lighter” : fewer calculations and comparisons
Not indicated for large datasets        Indicated for large datasets


Combining Factorial Analysis and Clustering

Determine principal components

Select the relevant ones

Cluster the data with the values of the principal
components (or factor scores) instead of original
data


Clustering with qualitative data

Do not apply a clustering method directly!

Perform Multiple Correspondence Analysis
Select the relevant factors
Cluster the data with the values of the retained
factors (factor scores) instead of original
data
