
Assignment

On

Principal Component Analysis, Factor Analysis and
Multidimensional Scaling

Course Title: Statistical Methods for Applied Sciences

Course Code: STAT 502

Submitted by:
3103 - Sundaram Tiwari
3104 - Khayti Singh
3105 - Sachin Yadav
3106 - Vidhu Bhooshan Yadav
3107 - Nancy Sharma
3108 - Abhishek Chaudhari
3109 - Adarsh
3110 - Vikas Singh

Submitted to:
Dr. Gaurav Shukla
Assistant Professor
Principal Component Analysis
Introduction
Principal component analysis is the oldest and best known technique of multivariate data analysis.
First introduced by Karl Pearson in 1901, Principal Component Analysis (PCA) is the general
name for a technique that uses sophisticated underlying mathematical principles to transform a
number of possibly correlated variables into a smaller number of variables called principal
components. The origins of PCA lie in multivariate data analysis; however, it has a wide range of
other applications. PCA has been called one of the most important results from applied linear
algebra, and perhaps its most common use is as the first step in trying to analyze large data sets.
Other common applications include denoising signals, blind source separation, and
data compression. In general terms, PCA uses a vector space transform to reduce the
dimensionality of large data sets. Using a mathematical projection, the original data set, which may
have involved many variables, can often be interpreted in just a few variables (the principal
components). The central idea of principal component analysis is to reduce the dimensionality of
a data set in which there are a large number of interrelated variables, while retaining as much as
possible of the variation present in the data set. This reduction is achieved by transforming to a
new set of variables, the principal components, which are uncorrelated, and which are ordered so
that the first few retain most of the variation present in all of the original variables. Computation
of the principal components reduces to the solution of an eigenvalue-eigenvector problem for a
positive semi-definite symmetric matrix. Thus, the definition and computation of principal
components are straightforward but, as will be seen, this apparently simple technique has a wide
variety of applications, as well as a number of different derivations.

Definition
The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data
set consisting of a large number of interrelated variables, while retaining as much as possible of
the variation present in the data set. This is achieved by transforming to a new set of variables, the
principal components (PCs), which are uncorrelated, and which are ordered so that the first few
retain most of the variation present in all of the original variables.

Or, alternatively:

It is a way of identifying patterns in data, and expressing the data in such a way as to highlight
their similarities and differences. Since patterns can be hard to find in data of high
dimension, where the luxury of graphical representation is not available, PCA is a powerful tool
for analyzing data.

Goals of PCA
The goals of PCA are to:
1. extract the most important information from the data table;
2. compress the size of the data set by keeping only this important information;
3. simplify the description of the data set;
4. analyze the structure of the observations and the variables; and
5. compress the data, by reducing the number of dimensions, without much loss of
information; this property is the basis of its use in image compression.

Principal Component Analysis (PCA)

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of
correlated variables into a set of uncorrelated variables. PCA is the most widely used tool in
exploratory data analysis and in machine learning for predictive models. Moreover:
➢ Principal Component Analysis (PCA) is an unsupervised learning technique used to
examine the interrelations among a set of variables. It is also known as general factor
analysis, where regression determines a line of best fit.
➢ The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality
of a dataset while preserving the most important patterns or relationships between the
variables, without any prior knowledge of the target variables.

Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by
finding a new set of variables, smaller than the original set, that retains most of the
sample's information and is useful for the regression and classification of data.

Principal Component Analysis


1. Principal Component Analysis (PCA) is a technique for dimensionality reduction
that identifies a set of orthogonal axes, called principal components, that capture the
maximum variance in the data. The principal components are linear combinations of
the original variables in the dataset and are ordered in decreasing order of importance.
The total variance captured by all the principal components is equal to the total
variance in the original dataset.
2. The first principal component captures the most variation in the data; the second
principal component captures the maximum remaining variance that is orthogonal to the
first principal component, and so on.
3. Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression. In data visualization, PCA can
be used to plot high-dimensional data in two or three dimensions, making it easier to
interpret. In feature selection, PCA can be used to identify the most important
variables in a dataset. In data compression, PCA can be used to reduce the size of a
dataset without losing important information.
4. In Principal Component Analysis, it is assumed that the information is carried in the
variance of the features; that is, the higher the variation in a feature, the more
information that feature carries.

Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets,
making them easier to understand and work with.
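As a quick, hedged illustration of PCA in practice, the sketch below applies scikit-learn's PCA to a small synthetic dataset; the data, the correlation it injects, and the choice of two components are assumptions made purely for demonstration.

```python
# A minimal PCA sketch using scikit-learn; the synthetic data and the
# choice of two components are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 observations, 4 features
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]      # inject correlation between features 0 and 1

pca = PCA(n_components=2)              # keep the first two principal components
scores = pca.fit_transform(X)          # project the data onto the PCs

print(scores.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)   # share of total variance each PC retains
```

This high-level usage hides the standardization, covariance, and eigendecomposition stages, which the step-by-step explanation below works through explicitly.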

Step-By-Step Explanation of PCA (Principal Component Analysis)

Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a standard
deviation of 1:

Z = (X − μ) / σ

where:
• μ is the mean of the independent features, μ = {μ1, μ2, ⋯, μm}
• σ is the standard deviation of the independent features, σ = {σ1, σ2, ⋯, σm}
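A minimal sketch of this step in Python with NumPy; the small 5 × 2 data matrix is an illustrative assumption:

```python
import numpy as np

# Illustrative data: 5 observations (rows) of 2 features (columns).
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

mu = X.mean(axis=0)            # per-feature mean, μ
sigma = X.std(axis=0, ddof=1)  # per-feature sample standard deviation, σ
Z = (X - mu) / sigma           # standardized data: mean 0, std 1 per feature
```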


Step 2: Covariance Matrix Computation
Covariance measures the strength of joint variability between two or more variables, indicating
how much they change in relation to each other. To find the covariance we can use the formula:

cov(x1, x2) = (1 / (n − 1)) Σᵢ (x1ᵢ − x̄1)(x2ᵢ − x̄2)

where the sum runs over the n observations i = 1, …, n. The value of covariance can be positive,
negative, or zero.

➢ Positive: as x1 increases, x2 also increases.

➢ Negative: as x1 increases, x2 decreases.

➢ Zero: no direct relation.
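Continuing the sketch from Step 1, the covariance matrix of the standardized data `Z` can be computed with NumPy's built-in `np.cov`, or directly from the formula above:

```python
# Built-in: rowvar=False because columns of Z are variables, rows are observations.
C = np.cov(Z, rowvar=False)

# Equivalent manual computation from the formula, with the (n - 1) divisor.
n = Z.shape[0]
centered = Z - Z.mean(axis=0)
C_manual = centered.T @ centered / (n - 1)
```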

Step 3: Compute Eigenvalues and Eigenvectors of the Covariance Matrix
to Identify Principal Components
Let A be a square n×n matrix and X be a non-zero vector for which

AX = λX

for some scalar value λ. Then λ is known as an eigenvalue of matrix A, and X is known as
the eigenvector of matrix A for the corresponding eigenvalue.

It can also be written as:

AX − λX = 0
(A − λI)X = 0

where I is the identity matrix of the same shape as matrix A. The above condition holds for a
non-zero X only if (A − λI) is non-invertible (i.e. a singular matrix).
That means

|A − λI| = 0

From this equation we can find the eigenvalues λ, and the
corresponding eigenvectors can then be found using the equation AX = λX.
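Continuing the sketch, the eigendecomposition of the covariance matrix `C` from Step 2 yields the principal components; `np.linalg.eigh` is the appropriate routine because a covariance matrix is symmetric:

```python
# Eigenvalues and eigenvectors of the symmetric covariance matrix C.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# eigh returns eigenvalues in ascending order; reorder them descending so
# the first component captures the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the standardized data onto the first k principal components.
k = 1
scores = Z @ eigenvectors[:, :k]

# Proportion of total variance retained by the first k components.
retained = eigenvalues[:k].sum() / eigenvalues.sum()
```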
Advantages of Principal Component Analysis

1. Dimensionality Reduction: Principal Component Analysis is a popular
technique used for dimensionality reduction, which is the process of reducing the
number of variables in a dataset. By reducing the number of variables, PCA simplifies
data analysis, improves performance, and makes it easier to visualize data.
2. Feature Selection: Principal Component Analysis can be used for feature
selection, which is the process of selecting the most important variables in a dataset.
This is useful in machine learning, where the number of variables can be very large,
and it is difficult to identify the most important variables.
3. Data Visualization: Principal Component Analysis can be used for data
visualization. By reducing the number of variables, PCA can plot high-dimensional
data in two or three dimensions, making it easier to interpret.
4. Multicollinearity: Principal Component Analysis can be used to deal
with multicollinearity, which is a common problem in a regression analysis where
two or more independent variables are highly correlated. PCA can help identify the
underlying structure in the data and create new, uncorrelated variables that can be
used in the regression model.
5. Noise Reduction: Principal Component Analysis can be used to reduce the noise
in data. By removing the principal components with low variance, which are assumed
to represent noise, Principal Component Analysis can improve the signal-to-noise
ratio and make it easier to identify the underlying structure in the data.
6. Data Compression: Principal Component Analysis can be used for data
compression. By representing the data using a smaller number of principal
components, which capture most of the variation in the data, PCA can reduce the
storage requirements and speed up processing.
7. Outlier Detection: Principal Component Analysis can be used for outlier
detection. Outliers are data points that are significantly different from the other data
points in the dataset. Principal Component Analysis can identify these outliers by
looking for data points that are far from the other points in the principal component
space.

Disadvantages of Principal Component Analysis

1. Interpretation of Principal Components: The principal components
created by Principal Component Analysis are linear combinations of the original
variables, and it is often difficult to interpret them in terms of the original variables.
This can make it difficult to explain the results of PCA to others.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data.
If the data is not properly scaled, then PCA may not work well. Therefore, it is
important to scale the data before applying Principal Component Analysis.
3. Information Loss: Principal Component Analysis can result in information loss.
While Principal Component Analysis reduces the number of variables, it can also lead
to loss of information. The degree of information loss depends on the number of
principal components selected. Therefore, it is important to carefully select the
number of principal components to retain (see the sketch after this list).
4. Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work well.
5. Computational Complexity: Computing Principal Component Analysis can
be computationally expensive for large datasets. This is especially true if the number
of variables in the dataset is large.
6. Overfitting: Principal Component Analysis can sometimes result in overfitting,
which is when the model fits the training data too well and performs poorly on new
data. This can happen if too many principal components are used or if the model is
trained on a small dataset.
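One common heuristic for the component-selection problem mentioned above is to retain the smallest number of components that covers a target share of the variance; the sketch below uses scikit-learn's cumulative explained variance ratio, with the 95% threshold and the data matrix `X` from the earlier sketch as illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components, assuming a data matrix X as in earlier sketches.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose components jointly explain at least 95% of the variance.
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"Retain {k} components to keep 95% of the variance")
```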
FACTOR ANALYSIS
Factor analysis is a statistical technique that reduces a set of variables by extracting
all their commonalities into a smaller number of factors. It can also be called data
reduction.
When observing vast numbers of variables, some common patterns emerge, which
are known as factors. These serve as an index of all the variables involved and can be
utilized for later analysis.

Factor analysis uses several assumptions:

➢ The variables’ linear relationships


➢ Absence of multicollinearity
➢ Relevance of the variables
➢ The existence of a true correlation between factors and variables

Factor analysis is a statistical tool used to examine the interrelationships among various
variables. It investigates several variables simultaneously and tries to locate them in
a small number of dimensions that are referred to as factors.
Factor analysis is a very useful and popular multivariate technique, mostly
used in the social and behavioral sciences. The technique is applicable when there is a
systematic interdependence among a set of observed manifest variables, and the
researcher is interested in finding out something more fundamental or latent that
creates this commonality. For example, we may have data on farmers'
education, occupation, land, house, farm power, material possessions, social
participation, etc.

Features of factor analysis

While studying customer satisfaction related to a product, a researcher will usually
pose several questions about the product through a survey. These questions will
consist of variables regarding the product's features, ease of purchase, usability,
pricing, visual appeal, and so forth. These are typically quantified on a numeric scale.
But, what a researcher looks for is the underlying dimensions or “factors” regarding
customer satisfaction. These are mostly psychological or emotional factors toward
the product that cannot be directly measured. Factor analysis uses the variables from
the survey to determine them indirectly.

Types of factor analysis


There are essentially two types of factor analysis:
1. Exploratory Factor Analysis
2. Confirmatory Factor Analysis
Exploratory Factor Analysis: In exploratory factor analysis, the
researcher does not make any assumptions about prior relationships between factors.
In this method, any variable can be related to any factor. This helps identify complex
relationships among variables and group them based on common factors.
The main objectives of the Exploratory Factor Analysis are:
1. To identify the underlying dimensions or factors that explain the variation (or
correlations) among the set of variables.
2. To obtain a new smaller set of uncorrelated variables to replace the original set of
correlated variables in subsequent analysis.
3. To obtain a smaller set of salient variables from a large set for use in subsequent
analysis.
Steps in Exploratory Factor Analysis:
1. Collect data: choose relevant variables.
2. Extract initial factors (via principal components).
3. Choose the number of factors to retain.
4. Choose an estimation method and estimate the model.
5. Rotate and interpret.
6. (a) Decide on changes that need to be made (e.g. drop items, include items);
(b) repeat steps (4) and (5).
7. Construct scales and use them in further analysis.
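As a hedged sketch of steps 2–5, scikit-learn's `FactorAnalysis` estimator can extract and rotate factors; the two-factor model, the synthetic data, and the varimax rotation (available in scikit-learn 0.24+) are all illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate 6 observed variables driven by 2 hypothetical latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.3 * rng.normal(size=(200, 6))

# Extract two factors and apply a varimax rotation to aid interpretation.
fa = FactorAnalysis(n_components=2, rotation="varimax")
factor_scores = fa.fit_transform(X)

print(fa.components_.round(2))  # estimated factor loadings (factors x variables)
```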

Confirmatory Factor Analysis: Confirmatory factor analysis, on the
other hand, assumes that variables are related to specific factors and uses
pre-established theory to confirm its expectations of the model.
Confirmatory Factor Analysis (CFA) assesses the fit of the hypothesized model to the
actual data, examining how well the observed variables align with the proposed factor
structure.
This method allows for the evaluation of relationships between observed variables
and unobserved factors, and it can accommodate measurement error.
Researchers hypothesize the relationships between variables and factors before
conducting the analysis, and the model is tested against empirical data to determine
its validity.

Assumptions of factor analysis:-

Factor analysis makes use of several assumptions in order to produce the outcomes:
➢ There will not be any outliers in the data.
➢ The sample size will be greater than the size of the factor.
➢ Since the method is interdependent, there will be no perfect multicollinearity between
any of the variables.
➢ When, in a sequence of random variables, all the variables have the same finite variance,
they are said to be homoscedastic. Since factor analysis works as a linear function, it
does not require homoscedasticity between variables.
➢ There is the assumption of linearity. This means that even non-linear variables can be
used, but once transformed, they become linear variables.
➢ There is also the assumption of interval data.

How factor analysis is used

➢ Business marketing
o In a business model, factor analysis is used to explain complex variables or data
using the matrix of association.
➢ Automotive industry
o The use of factor analysis in the automotive industry was mentioned as far back
as 1997 in an article by Professor Emeritus Richard B. Darlington of Cornell
University. He explained how a study could be used to identify all the variables
that apply to the decision-making of purchasing a car—size, pricing, options,
accessories, and more.
➢ Human resources
o There are many factors that go into a company’s hiring process. With statistics,
human resource professionals will be able to create a comfortable and productive
working environment.
➢ Education
o When hiring teachers and deciding on a curriculum for the school year, factor
analysis plays a huge role. It is used to determine classroom sizes, staffing limits,
salary distribution, and a wide range of other requirements necessary for the
school year to run smoothly.
Multidimensional Scaling (MDS)
Multidimensional Scaling (MDS) is a statistical tool that helps discover the
connections among objects by representing them in a lower-dimensional space, based
on similarity or dissimilarity data. This section covers the
fundamentals of multidimensional scaling.

Multidimensional Scaling (MDS) is a class of procedures for representing the perceptions
and preferences of respondents spatially by means of a visual display. Perceived or
psychological relationships among stimuli are represented as geometric relationships
among points in a multidimensional space. These
geometric representations are often called spatial maps. The axes of the spatial map
are assumed to denote the psychological bases or underlying dimensions respondents
use to form perceptions and preferences for stimuli. MDS has been used in marketing
to identify:

1. The number and nature of dimensions consumers use to perceive different brands
in the marketplace
2. The positioning of current brands on these dimensions
3. The positioning of consumers' ideal brands on these dimensions

Basic Concepts and Principles of MDS

1. MDS simplifies complex high-dimensional data into a lower-dimensional representation,
making it easier to visualize and interpret. The primary goal is to create a spatial
representation where the distances between points accurately reflect their original
similarities or differences.
2. The technique strives to maintain the original proximities between datasets; objects that
are similar are positioned closer together, while dissimilar objects are placed further apart
in the reduced space.
3. MDS utilizes advanced optimization algorithms to minimize the discrepancy between the
original high-dimensional distances and the distances in the reduced space. This involves
adjusting the positions of points so that the distances in the lower-dimensional
representation are as close as possible to the actual dissimilarities measured in the
original high-dimensional space.
4. By revealing patterns and relationships in data through a visual framework, MDS assists
researchers and analysts in uncovering meaningful insights about data structure. These
insights are instrumental in crafting strategies across various domains, from cognitive
studies and geographic information analysis to market trend analysis and brand
positioning.
Types of multidimensional scaling
1. Classical Multidimensional Scaling:-
Classical Multidimensional Scaling is a technique that takes an input matrix
representing dissimilarities between pairs of items and produces a coordinate matrix
that minimizes the strain.

2. Metric multidimensional scaling:-
Metric Multidimensional Scaling generalizes the optimization procedure to various
loss functions and input matrices with known distances and weights. It minimizes a
cost function called "stress," often using a procedure called stress majorization.

3. Non-metric multidimensional scaling:-
Non-metric Multidimensional Scaling finds a non-parametric monotonic
relationship between the dissimilarities and the Euclidean distances between items,
along with the location of each item in the low-dimensional space.
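A sketch contrasting the metric and non-metric variants with scikit-learn's `manifold.MDS` on a precomputed dissimilarity matrix; the random data are an illustrative assumption:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))      # 30 objects described by 5 features
D = pairwise_distances(X)         # pairwise dissimilarity (distance) matrix

# Metric MDS: tries to reproduce the dissimilarity values themselves.
metric_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_metric = metric_mds.fit_transform(D)

# Non-metric MDS: only the rank order of the dissimilarities matters.
nonmetric_mds = MDS(n_components=2, metric=False,
                    dissimilarity="precomputed", random_state=0)
coords_nonmetric = nonmetric_mds.fit_transform(D)

print(metric_mds.stress_, nonmetric_mds.stress_)  # final stress values
```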

Applications of multidimensional scaling

1. Psychology and Cognitive Science: MDS is used to study perception and decision
making. It helps psychologists understand how people perceive the similarities or
differences between stimuli, for example words, images, or sounds.

2. Market Research and Marketing: Market research applies MDS to brand
positioning, product positioning, and market segmentation. Marketers employ
MDS to visualize and interpret consumer perceptions of brands, products, or
services, which helps them make strategic decisions and design marketing
campaigns.

3. Geography and Cartography: MDS is employed in geography and cartography to
visualize and understand the spatial relationships between places, areas, or
geographical features. It enables cartographers to make maps that are true to the
actual nature of geographical entities and their proximity to each other.

4. Biology and Bioinformatics: In biology, MDS is mostly applied in phylogenetic
analysis, protein structure prediction, and comparative genomics. Bioinformaticians
employ MDS to represent and compare genetic sequences, protein structures, or
evolutionary relationships among different species.
5. Social Sciences and Sociology: MDS is utilized in sociology and the social sciences
to analyze social networks, intergroup relationships, and cultural differences.
Sociologists apply MDS to survey data, questionnaire responses, or relational data
to understand social structures and dynamics.

Advantages of multidimensional scaling


➢ Reduces the dimensionality of the original relationships between objects while
preserving the original information, helping to understand the objects better
without the loss of crucial information.
➢ Its adaptable nature makes it suitable for various disciplines and data types,
allowing it to fit into almost any research category.
➢ It assists in discovering hidden structures in the data, revealing the underlying
patterns and relationships that may not be easily noticed.
➢ It supports hypothesis testing and clustering analysis, and thus data-driven
decision-making.

Limitations of multidimensional scaling


➢ Sensitivity to outliers: MDS results can be distorted by outliers, which in turn
can affect the visualization or the interpretation of the connections.
➢ Computational complexity: MDS can demand substantial computational resources
and time, especially for large datasets.
➢ Subjectivity in interpretation: Interpreting MDS outcomes can involve subjective
judgments about the meaning of the spatial arrangements, which can
introduce bias.
➢ Difficulty in determining the optimal number of dimensions: Identifying the right
number of dimensions for the reduced space can be difficult and may require
experimentation, as in the sketch below.
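One pragmatic response to the dimension-selection problem is to fit MDS for several candidate dimensionalities and look for an "elbow" where the stress stops improving sharply; a sketch continuing the setup from the MDS example above:

```python
# Compare stress across candidate dimensionalities; lower is better, and an
# elbow in the sequence suggests a reasonable number of dimensions.
for d in range(1, 6):
    mds = MDS(n_components=d, dissimilarity="precomputed", random_state=0)
    mds.fit(D)
    print(d, round(mds.stress_, 3))
```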
