5th Module SDS

MODULE 5 MULTIVARIATE ANALYSIS

WHAT IS MULTIVARIATE ANALYSIS? (WHERE IS IT POPULARLY USED?)
Multivariate means the involvement of multiple variables.
As an analyst, suppose you are asked to analyze the sales figures of a company. You cannot simply say that X is the one factor which will affect the sales; we know that there are multiple aspects or variables which will impact sales.
Multivariate analysis is a very important part of exploratory data analysis.

OBJECTIVES OF MULTIVARIATE ANALYSIS:

The important objectives are:

1) Data Reduction: Simplifying the data without sacrificing the valuable information.

2) Data Organization: Sorting and grouping of data depending on certain characteristics.

3) Data Interdependency: Understanding the relationships between variables.

4) Hypothesis Construction: Helping to validate assumptions or to reinforce prior convictions.

MULTIVARIATE ANALYSIS TECHNIQUES:

1) Multiple Regression Analysis:
1) Multivariate regression analysis is one of the most commonly used multivariate techniques.
2) It assesses the relationship between a single dependent variable and multiple independent variables.
3) With this, a linear relationship is determined to provide the final forecasting ability.
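As a minimal sketch of the idea (the variable names and numbers below are invented for illustration, not from the module), a multiple regression with two independent variables can be fitted by ordinary least squares using NumPy:

```python
import numpy as np

# Hypothetical data: sales modelled from ad spend and price.
# Generated so that sales = 2 + 3*ad_spend - 1*price exactly.
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price    = np.array([5.0, 4.0, 6.0, 3.0, 7.0])
sales    = 2 + 3 * ad_spend - 1 * price

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(ad_spend), ad_spend, price])

# Ordinary least squares: minimize ||X b - sales||^2.
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, b_ad, b_price = coef

# The fitted linear relationship then provides the forecasting ability:
forecast = intercept + b_ad * 6.0 + b_price * 5.0
print(intercept, b_ad, b_price, forecast)
```

Because the toy data were generated from the linear model exactly, the fit recovers the coefficients (about 2, 3 and -1) and forecasts 15 for ad spend 6 and price 5.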

2) Discriminant Analysis:
1) Discriminant analysis primarily works by classifying the observations into multiple groups.
2) A linear discriminant function is built to classify the observations.
3) It provides us the ability to understand which variables have more impact on the discriminant function.
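One possible from-scratch sketch of a two-class linear discriminant function (the data points and group labels are made up for illustration): the discriminant direction is the pooled within-class covariance inverse applied to the difference of the group means.

```python
import numpy as np

# Hypothetical two-class toy data.
class_a = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
class_b = np.array([[5.0, 6.0], [6.0, 5.0], [5.0, 5.0], [6.0, 6.0]])

mean_a, mean_b = class_a.mean(axis=0), class_b.mean(axis=0)

# Pooled within-class covariance (denominator n1 + n2 - 2).
def scatter(x, m):
    return (x - m).T @ (x - m)

s_pooled = (scatter(class_a, mean_a) + scatter(class_b, mean_b)) \
           / (len(class_a) + len(class_b) - 2)

# Linear discriminant direction: w = S^-1 (mean_b - mean_a).
w = np.linalg.solve(s_pooled, mean_b - mean_a)
midpoint = (mean_a + mean_b) / 2

def classify(x):
    # Assign to B if the projection lies on B's side of the midpoint.
    return "B" if w @ (np.asarray(x) - midpoint) > 0 else "A"

print(classify([2.0, 2.0]))  # "A"
print(classify([6.0, 5.0]))  # "B"
```

The magnitudes of the entries of `w` indicate which variables have more impact on the discriminant function.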

3) Multivariate Analysis of Variance (MANOVA):

Multivariate analysis of variance (MANOVA) is a popular technique used to assess the relationship between one or more categorical independent variables and multiple metric dependent variables (variables that are measured using quantitative metrics or scales).
It is very popularly used to assess market trends.

4) FACTOR ANALYSIS: Factor analysis is very popular when there are multiple variables that need to be assessed. The technique provides insight into the structure of the data rather than the relationships between variables; principal component analysis is a good example of such a technique.

5) CLUSTER ANALYSIS:
Cluster analysis helps us to reduce a very large data set into small groups or individual data elements.
This is done based on similarity and other characteristics.
It is very popular in market segmentation analysis.
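As a hedged sketch of the grouping-by-similarity idea (the customer numbers and the fixed starting centroids are hypothetical), a plain k-means loop can be written in pure Python:

```python
import math

def kmeans(points, centroids, iters=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(group) for c in zip(*group)) if group
            else centroids[i]
            for i, group in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customer data (annual spend, visits) with two obvious segments.
points = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # one centroid settles near each segment
```

In a real market segmentation the initial centroids are usually chosen randomly (or via k-means++) rather than fixed as here.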
ADVANTAGES OF MULTIVARIATE ANALYSIS:
There are numerous advantages, but here are the highlights:
1) Arriving at accurate data-driven conclusions and insights.
2) Checking for data anomalies and consistency (meaning unexpected or irregular patterns, inconsistencies, or errors within a dataset).
3) Feature engineering (the process of creating or selecting relevant and informative features from raw data to improve the performance of a machine learning model).
4) Data cleaning (preprocessing).
5) Handling underfitting and overfitting.

NATURE OF MULTIVARIATE ANALYSIS

Multivariate analysis is a statistical technique used to analyze data sets with multiple variables or factors simultaneously. It explores the relationships and interactions among multiple variables to uncover patterns, associations and dependencies that might not be apparent when considering each variable in isolation.
The nature of multivariate analysis can be summarized as follows:

1) Multiple variables: Multivariate analysis deals with data sets that contain two or more variables. These variables can be continuous (e.g. height, temperature) or categorical (e.g. gender, occupation), and they may have different measurement scales.

2) Interdependence: Multivariate analysis recognizes that the variables in a data set are interdependent. That is, they are not analyzed separately but are considered together to understand how they relate to each other and collectively influence the phenomenon under investigation.

3) Simultaneous analysis: Multivariate analysis allows for the simultaneous analysis of multiple variables. It examines the joint distribution of the variables to identify patterns and relationships that exist among them.

4) Complex relationships: Multivariate analysis can capture complex relationships between variables. It goes beyond simple correlations and can uncover non-linear associations, interactions and dependencies that may exist between variables.

5) Dimensionality reduction: Multivariate analysis often involves techniques for reducing the dimensionality of the data by summarizing the information contained in multiple variables into a smaller number of variables or components. This helps simplify the analysis and facilitates interpretation.

6) Statistical models: Multivariate analysis employs various statistical models and methods to analyze the data. These include techniques such as multivariate regression, factor analysis, cluster analysis, discriminant analysis, principal component analysis and structural equation modelling, among others.

7) Interpretation: Multivariate analysis aims to provide meaningful interpretation of the relationships and patterns found in data. It helps researchers to understand the underlying structure and factors that drive the observed data patterns, and enables them to make inferences and predictions based on the analysis results.

Overall, multivariate analysis is a powerful tool for exploring complex data sets with multiple variables. It allows researchers to gain insight into the interrelationships and dependencies among the variables, leading to a deeper understanding of the phenomena under investigation.

CLASSIFICATION OF MULTIVARIATE TECHNIQUES

Multivariate techniques refer to a broad range of statistical methods that are used to analyze datasets with multiple variables. These techniques allow researchers and analysts to explore relationships, patterns and structures within complex data sets. Multivariate techniques can be broadly classified into two categories: 1) descriptive techniques and 2) inferential techniques.

Descriptive Techniques: Descriptive techniques are used to summarize and visualize the characteristics of multivariate data. They aim to provide a comprehensive understanding of the dataset without making inferences or predictions. Some commonly used descriptive techniques include:

1) Principal Component Analysis (PCA): PCA is used to identify the underlying structure and patterns in a multivariate dataset. It reduces the dimensionality of the data by transforming it into a set of uncorrelated variables called principal components.

2) Factor Analysis: Factor analysis is used to identify the underlying factors that explain the correlations among a set of observed variables. It helps in reducing the dimensionality of the data and understanding the latent variables driving the observed variables.

3) Cluster Analysis: Cluster analysis is used to identify groups or clusters within a dataset. It groups similar observations together based on their characteristic similarities or dissimilarities. There are various algorithms for cluster analysis, including k-means clustering and hierarchical clustering.

4) Correspondence Analysis: Correspondence analysis is used to explore the association between categorical variables. It visualizes the relationships between the rows and columns of a contingency table by mapping them into a low-dimensional space.

Inferential Techniques: Inferential techniques are used to make statistical inferences and draw conclusions about populations based on sample data. These techniques allow researchers to test hypotheses, estimate parameters and make predictions. Some commonly used inferential techniques include:

1) Multiple Regression Analysis: Multiple regression analysis is used to examine the relationship between a dependent variable and multiple independent variables. It helps in understanding how the independent variables collectively influence the dependent variable.

2) Multivariate Analysis of Variance (MANOVA): MANOVA is an extension of univariate analysis of variance (ANOVA) and is used when there are multiple dependent variables. It tests whether the means of the dependent variables differ significantly across the groups.

3) Discriminant Analysis: Discriminant analysis is used to classify observations into predefined groups based on their characteristics. It identifies the linear combination of variables that best discriminates between the groups.

4) Structural Equation Modelling (SEM): SEM is used to test and estimate complex relationships among variables. It incorporates both observed and latent variables and allows researchers to test hypothesized models and assess the goodness-of-fit.

These are just a few examples of the many multivariate techniques available. The selection of a technique depends on the nature of the data, the research objectives and the specific questions being addressed.

Similarity measures used in cluster analysis: Cluster analysis is a popular technique used in data mining and machine learning to group similar data points into clusters. Several similarity measures are commonly used in cluster analysis to quantify the similarity or dissimilarity between data points.

1) Euclidean Distance: It is a widely used measure that calculates the straight-line distance between two data points in a multidimensional space. It is defined as the square root of the sum of the squared differences between corresponding attributes of the two points.

2) Manhattan Distance: Also known as city-block distance or L1 distance, it measures the sum of the absolute differences between corresponding attributes of two data points. It calculates the distance by summing the differences along each dimension.

3) Cosine Similarity: It measures the cosine of the angle between two vectors and is often used when analyzing text documents or high-dimensional data. Cosine similarity ranges from -1 to 1, where 1 indicates identical directions, 0 indicates orthogonal vectors and -1 indicates completely opposite directions.

4) Jaccard Similarity: It is commonly used for comparing binary or categorical data. It calculates the size of the intersection divided by the size of the union of two sets. Jaccard similarity ranges from 0 to 1, where 1 indicates complete similarity and 0 indicates no similarity.

5) Pearson Correlation Coefficient: It measures the linear correlation between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation and 0 indicates no correlation.

6) Hamming Distance: It is primarily used for comparing strings of equal length. It calculates the number of positions at which the corresponding elements of two strings differ.

7) Minkowski Distance: It is a generalization of the Euclidean and Manhattan distances and allows for tuning the distance calculation by changing a parameter value. The Minkowski distance with a parameter of 1 is equivalent to the Manhattan distance, and with a parameter of 2 it is equivalent to the Euclidean distance.

These are just a few examples of similarity measures commonly used in cluster analysis. The choice of similarity measure depends on the nature of the data and the specific requirements of the clustering problem at hand.
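The measures above translate directly into short functions; the sketch below uses only the standard library (the example vectors and strings are made-up illustrations):

```python
import math

def euclidean(p, q):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute attribute differences (city-block / L1).
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    # r = 1 gives Manhattan, r = 2 gives Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

def jaccard(s, t):
    # Intersection size over union size, for two sets.
    return len(s & t) / len(s | t)

def hamming(s, t):
    # Assumes equal-length sequences.
    return sum(a != b for a, b in zip(s, t))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal vectors)
print(jaccard({1, 2, 3}, {2, 3, 4}))      # 0.5
print(hamming("karolin", "kathrin"))      # 3
```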

Explain about Discriminant Analysis: Discriminant analysis is a statistical method for classifying objects into predefined groups based on input features. Linear Discriminant Analysis (LDA) assumes normality and finds a linear combination of features maximizing the between-group variance. Quadratic Discriminant Analysis (QDA) allows different covariance matrices for each group. LDA and QDA aim to create distinct separations between groups. LDA finds a transformation matrix based on eigenvectors, while QDA calculates individual covariance matrices. Both methods project data onto new subspaces for classification. They are used in biology, finance, marketing and more, for tasks like medical diagnosis and customer segmentation. The assumptions of normality and (for LDA) similar covariance matrices are essential for effective results. Regularized discriminant analysis helps with high-dimensional data.

MANOVA:
MANOVA, or multivariate analysis of variance, is a statistical technique used to analyze the relationships between multiple dependent variables and one or more independent variables. It extends the analysis of variance (ANOVA) to cases where there are two or more dependent variables.
In a MANOVA model, the dependent variables are typically continuous variables and the independent variables are categorical (or grouping) variables. The main objective of MANOVA is to determine whether there are significant differences between groups on the combination of dependent variables. It allows researchers to investigate whether the groups differ not only on each individual dependent variable but also on their joint relationship.
The MANOVA model assumes that the dependent variables are multivariate normally distributed within each group and have equal covariance matrices across groups. It also assumes that there is a linear relationship between the dependent variables and the independent variables.
The analysis begins with the testing of hypotheses.
Procedure for conducting MANOVA:

Step 1: Set up the hypotheses:
Null hypothesis (H0): There is no significant difference in the means of the dependent variables across the groups.
Alternative hypothesis (H1): There is a significant difference in the means of the dependent variables across the groups.

Step 2: Data collection and organization: Collect the data on the multiple dependent variables and categorize the data according to the groups defined by the independent variables.

Step 3: Check the assumptions:
i) Independence: Observations within each group and between groups should be independent.
ii) Multivariate normality: The dependent variables should be normally distributed within each group.
iii) Homogeneity of variance-covariance matrices: The variance-covariance matrices of the dependent variables should be equal across groups.
iv) Homogeneity of regression: The relationship between the independent variables and the dependent variables should be linear and homogeneous across groups.

Step 4: Select an appropriate test statistic: Wilks' Lambda (Λ) is the most common test statistic used in MANOVA. It measures the proportion of variance in the dependent variables that is not accounted for by the group differences.
Other test statistics like Pillai's trace, the Hotelling-Lawley trace and Roy's largest root can also be used, depending on the specific research question and assumptions.

Step 5: Conduct the MANOVA test: Calculate the test statistic (e.g. Wilks' Lambda) and its associated degrees of freedom.

Step 6: Determine the P-value using the appropriate distribution (e.g. the F-distribution for Wilks' Lambda). Compare the P-value to the chosen significance level (commonly 0.05) to assess statistical significance.

Step 7: Interpret the results:
If the P-value is less than the chosen significance level (e.g. p < 0.05), you can reject the null hypothesis and conclude that there are significant differences in the means of the dependent variables across the groups. If the P-value is greater than the significance level, you fail to reject the null hypothesis, indicating no significant differences.
It is essential to ensure that the assumptions of MANOVA are met before interpreting the results. Violation of the assumptions may affect the validity of the test, and alternative methods or data transformations may be needed in such cases.
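The Wilks' Lambda statistic from the procedure above can be computed by hand for a tiny example: Λ = |W| / |W + B|, where W is the within-group SSCP matrix and B the between-group SSCP matrix. The two groups and their measurements below are invented for illustration, and the follow-up significance test against the F-distribution is omitted.

```python
import numpy as np

# Hypothetical data: two groups measured on two dependent variables.
group1 = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
group2 = np.array([[6.0, 7.0], [7.0, 6.0], [8.0, 8.0]])
groups = [group1, group2]

grand_mean = np.vstack(groups).mean(axis=0)

# W: within-group sums of squares and cross-products (SSCP).
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)

# B: between-group SSCP, weighted by group sizes.
B = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean,
                          g.mean(axis=0) - grand_mean) for g in groups)

# Wilks' Lambda: the proportion of variance in the dependent
# variables NOT accounted for by group differences.
wilks_lambda = np.linalg.det(W) / np.linalg.det(W + B)
print(wilks_lambda)  # small value -> groups explain most of the variation
```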

Distinguish between ANOVA and MANOVA

1) ANOVA compares the means of a single continuous variable, whereas MANOVA analyzes patterns of means across groups for multiple continuous variables.
2) ANOVA is univariate, focusing on a single dimension, whereas MANOVA is multivariate, considering multiple dimensions.
3) ANOVA does not consider inter-correlations between variables, whereas MANOVA accounts for relationships among the dependent variables.
4) ANOVA answers questions about differences in means, whereas MANOVA examines the joint effects of variables on group differences.
5) ANOVA is suitable for single-variable research questions, whereas MANOVA is used for multiple-variable research questions.
In summary, ANOVA and MANOVA differ in the number of variables analyzed, the nature of the dependent variables and the research questions they address.

Principal Component Analysis: Principal component analysis is a statistical technique used to reduce the dimensionality of a data set while retaining the most important information. The first principal component (PC1) is the direction in the data space along which the data varies the most.
To derive the first principal component, follow these steps:

Step 1: Standardize the data:
If the variables in the dataset have different scales, it is essential to standardize them before performing PCA. Standardization involves subtracting the mean from each variable and dividing by the standard deviation. This step ensures that each variable has a comparable scale.

Step 2: Compute the covariance matrix:
Compute the covariance matrix of the standardized data. The covariance matrix measures the pairwise covariances between variables, and its elements reflect the relationships between the variables.

Step 3: Calculate the eigenvalues and eigenvectors:
Find the eigenvalues and eigenvectors of the covariance matrix. The eigenvalues represent the variance explained by each eigenvector. The eigenvector with the highest eigenvalue corresponds to the first principal component.

Step 4: Derive the first principal component:
The first principal component is obtained by taking a linear combination of the standardized variables, weighted by the components of the leading eigenvector.
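The four steps above can be sketched with NumPy. The dataset is hypothetical and deliberately made of two perfectly correlated variables, so PC1 should capture all of the variance:

```python
import numpy as np

# Hypothetical dataset: two perfectly correlated variables.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]])

# Step 1: standardize (sample standard deviation, ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues/eigenvectors (eigh returns them in ascending order).
eigvals, eigvecs = np.linalg.eigh(C)
pc1_direction = eigvecs[:, -1]         # eigenvector with the largest eigenvalue
explained = eigvals[-1] / eigvals.sum()

# Step 4: PC1 scores are a linear combination of the standardized
# variables, weighted by the leading eigenvector.
pc1_scores = Z @ pc1_direction
print(explained)  # 1.0 here, since the two variables are perfectly correlated
```

With real data the explained-variance ratio would be below 1, and the remaining eigenvectors would give the lower-order components.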
