Data Analytics
ADVANTAGES OF PRINCIPAL
COMPONENT ANALYSIS (PCA):
1.Dimensionality Reduction: Reduces the number of variables while retaining most of the variation in the data,
simplifying subsequent modeling.
2.Data Visualization: Can help visualize high-dimensional data in a lower-dimensional space, making it easier to
identify patterns and relationships between variables.
3.Noise Reduction: Can help filter out noise and focus on capturing the most significant sources of variation in the
data, leading to more robust models.
4.Feature Selection: Used for feature selection, as it identifies the most important features (principal components)
that contribute to the variance in the data.
5.Uncorrelated Components: PCA produces orthogonal (uncorrelated) components, which can simplify the
interpretation of the data and the relationships between variables.
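The noise-reduction and variance-capturing properties above can be illustrated with a minimal, dependency-free sketch. For a toy two-feature dataset (values invented for illustration) whose features are strongly related, almost all of the variance concentrates in the first principal component:

```python
import math

# Toy 2-feature dataset: the second feature is roughly 2x the first
# plus a little noise, so nearly all variance lies along one direction.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.9)]

n = len(data)
mean_x = sum(p[0] for p in data) / n
mean_y = sum(p[1] for p in data) / n

# Sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]]
var_x = sum((p[0] - mean_x) ** 2 for p in data) / (n - 1)
var_y = sum((p[1] - mean_y) ** 2 for p in data) / (n - 1)
cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in data) / (n - 1)

# Eigenvalues of a 2x2 symmetric matrix via the quadratic formula:
# lambda = (trace +/- sqrt(trace^2 - 4*det)) / 2
trace = var_x + var_y
det = var_x * var_y - cov_xy ** 2
disc = math.sqrt(trace ** 2 - 4 * det)
lam1 = (trace + disc) / 2  # variance captured by PC1
lam2 = (trace - disc) / 2  # variance captured by PC2 (mostly noise here)

explained = lam1 / (lam1 + lam2)
print(f"PC1 explains {explained:.1%} of the variance")
```

Discarding the second component would keep almost all of the signal while halving the dimensionality, which is exactly the noise-reduction behavior described above.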
DISADVANTAGES OF PRINCIPAL
COMPONENT ANALYSIS (PCA):
1.Loss of Interpretability: PCA creates linear combinations of the original variables, making the resulting components
less interpretable than the original variables. This can make it challenging to understand the meaning of the
components in the context of the original data.
2.Assumption of Linearity: PCA assumes that the underlying relationships in the data are linear. If the relationships
are non-linear, PCA may not capture the true underlying structure of the data effectively.
3.Sensitive to Outliers: PCA is sensitive to outliers, as outliers can have a disproportionate influence on the calculation
of the principal components. This can lead to components that do not accurately represent the majority of the data.
4.Difficulty in Determining the Number of Components: Determining the optimal number of components to retain in
PCA can be subjective and may require additional criteria or validation methods, such as the scree plot or
cross-validation.
5.Loss of Information: While PCA reduces the dimensionality of the data, it does so by discarding information in the
less important components. This loss of information may be undesirable in some cases where retaining all the
information is important.
6.Not Suitable for Categorical Data: PCA is primarily designed for continuous data and may not be suitable for
categorical or binary data without appropriate transformations.
ASSUMPTIONS OF PCA
1.Linearity: PCA assumes that the relationships between variables are linear. If the relationships are
non-linear, PCA may not be appropriate.
2.Large variances indicate important features: PCA assumes that variables with larger variances
are more important than those with smaller variances. It focuses on capturing the variance in the
data.
3.Variables are standardized: PCA assumes that variables are standardized (i.e., have a mean of 0
and a standard deviation of 1) or that the scale of the variables is meaningful and consistent.
4.Variables are continuous: PCA is most appropriate for continuous variables. It may not work well
with categorical or ordinal variables without appropriate transformations.
5.No outliers: PCA is sensitive to outliers, which can distort the principal components and affect the
results. Outliers should be identified and addressed before performing PCA if possible.
6.Variables are uncorrelated or weakly correlated: While not a strict assumption, PCA works best
when variables are uncorrelated or weakly correlated. Strong correlations among variables can lead
to unstable solutions or make interpretation more challenging.
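The standardization assumption matters because PCA is driven by variance. A short sketch with hypothetical, invented numbers shows how a feature on a large scale (income in dollars) swamps the total variance of an unstandardized dataset, which would pull the first principal component almost entirely toward that one feature:

```python
# Two features of a hypothetical dataset on very different scales:
# income in dollars and age in years (values invented for illustration).
income = [30000.0, 45000.0, 52000.0, 61000.0, 75000.0]
age = [23.0, 31.0, 38.0, 44.0, 57.0]

def sample_variance(xs):
    """Sample variance: sum of squared deviations over (n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

v_income = sample_variance(income)
v_age = sample_variance(age)

# Without standardization, income's variance dominates, so an
# unstandardized PCA would align PC1 almost entirely with income.
share = v_income / (v_income + v_age)
print(f"income carries {share:.4%} of the total unscaled variance")
```

Standardizing both features to mean 0 and standard deviation 1 first puts them on an equal footing, which is why Step 1 of PCA below is standardization.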
STEPS OF PRINCIPAL COMPONENT
ANALYSIS
STEP 1: STANDARDIZATION
z = (X − μ) / σ
where:
X = value in the data set
μ = mean of the values in the data set
σ = standard deviation of the values in the data set
n = number of values in the data set
Example:
Calculate the mean and standard deviation for each feature, then transform each value to its z-score.
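The standardization step can be sketched as a z-score transform; `standardize` is a hypothetical helper name and the feature values are arbitrary. Population standard deviation (divide by n) is used so the result has exactly mean 0 and standard deviation 1:

```python
import math

def standardize(values):
    """Z-score each value: z = (x - mean) / std, using population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std for x in values]

feature = [5.0, 2.0, 4.0, 4.0, 5.0]  # arbitrary example feature
z = standardize(feature)
print(z)
```

In a real pipeline this transform is applied to every feature column before computing the covariance matrix.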
STEP 2: COVARIANCE MATRIX COMPUTATION
Note: COV(X, X) = Var(X) and COV(X, Y) = COV(Y, X)
Example:
COV(F2, F1) = [(5−4)(1−3) + (2−4)(4−3) + (4−4)(1−3) + (4−4)(4−3) + (5−4)(5−3)] / (5−1) = −2 / 4 = −0.5
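The worked covariance can be checked with a short pure-Python sketch; the feature values are inferred from the terms of the example (F1 has mean 3, F2 has mean 4):

```python
def sample_cov(xs, ys):
    """Sample covariance: sum((x - mean_x)(y - mean_y)) / (n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

F1 = [1.0, 4.0, 1.0, 4.0, 5.0]  # mean 3, inferred from the example
F2 = [5.0, 2.0, 4.0, 4.0, 5.0]  # mean 4, inferred from the example

print(sample_cov(F2, F1))  # -0.5, matching the worked example
print(sample_cov(F1, F1))  # COV(X, X) = Var(X)
```

The two notes on the slide correspond to the symmetry check `sample_cov(F1, F2) == sample_cov(F2, F1)` and to the diagonal of the covariance matrix holding the variances.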
STEP 3: COMPUTE THE EIGENVALUES AND EIGENVECTORS OF THE COVARIANCE MATRIX
Solve det(A − λI) = 0 for the eigenvalues λ.
Then substitute each eigenvalue into (A − λI)ν = 0 and solve for the corresponding eigenvectors ν1, ν2, ν3, and ν4, each of which satisfies Aν = λν.
STEP 4: FEATURE VECTOR
Form the feature vector from the retained eigenvectors, e.g. one with components:
v1 = 0.515514
v2 = −0.616625
v3 = 0.399314
v4 = 0.441098
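For a 2×2 symmetric covariance matrix, det(A − λI) = 0 can be solved in closed form with the quadratic formula. The matrix below is a hypothetical example; the printed pairs can be checked against Aν = λν:

```python
import math

# Hypothetical 2x2 symmetric covariance matrix, chosen for illustration.
A = [[3.5, -0.5],
     [-0.5, 1.7]]

# det(A - lambda*I) = lambda^2 - trace*lambda + det = 0
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(trace ** 2 - 4 * det)
eigvals = [(trace + disc) / 2, (trace - disc) / 2]  # largest first

eigvecs = []
for lam in eigvals:
    # (A - lambda*I) v = 0 has the non-trivial solution
    # v = (A01, lambda - A00) when A01 != 0; normalize to unit length.
    vx, vy = A[0][1], lam - A[0][0]
    norm = math.hypot(vx, vy)
    eigvecs.append((vx / norm, vy / norm))

print("eigenvalues:", eigvals)
print("eigenvectors:", eigvecs)
```

The eigenvector belonging to the largest eigenvalue is the first principal component; stacking the retained eigenvectors column-wise gives the feature vector of Step 4.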
APPLICATIONS OF CFA
• It is often used in the fields of psychology and the social sciences.
• CFA is often used in research to test theories and hypotheses.
• It can help researchers uncover the underlying structure of complex datasets and provide insights into the nature of
the relationships among variables.
THE IMPORTANCE OF FACTOR ANALYSIS
• Finding Hidden Patterns And Identifying Extra-Dimensionality
• Simplifying Data And Selecting Variables
Benefits Of Factor Analysis
• Data reduction and enhanced interpretability. By reducing the dimensionality of data, you can more
easily analyze and interpret complex datasets.
• Multivariable selection and analysis. Factor analysis aids in variable selection by identifying the most
important variables that contribute to the factors. This is especially valuable when working with large
datasets. Crucially, factor analysis is a form of multivariate analysis, which is essential in use cases that
require examining relationships between multiple variables simultaneously.
Factor Analysis Examples
Market Research
Market researchers often use factor analysis to identify the key factors that influence consumer preferences. For example, a survey may collect data on
various product attributes like price, brand reputation, quality, and customer service. Factor analysis can help determine which factors have the most
significant impact on consumers’ product choices. By identifying underlying factors, businesses can tailor their product development and marketing
strategies to meet consumer needs more effectively.
Financial Risk Analysis
Factor analysis is commonly used in finance to analyze and manage financial risk. By examining various economic indicators, asset returns, and market
conditions, factor analysis helps investors and portfolio managers understand how different factors contribute to the overall risk and return of an investment
portfolio.
Customer Segmentation
Businesses often use factor analysis to identify customer segments based on their purchasing behavior, preferences, and demographic information. By
analyzing these factors, companies can create better targeted marketing strategies and product offerings.
Employee Engagement
Factor analysis can be used to identify the underlying factors that contribute to employee engagement and job satisfaction. This information helps
businesses improve workplace conditions and increase employee retention.
Brand Perception
Companies may employ factor analysis to understand how customers perceive their brand. By analyzing factors like brand image, trust, and quality,
businesses can make informed decisions to strengthen their brand and reputation.
Product Quality Control
In manufacturing, factor analysis can help identify the key factors affecting product quality. This analysis can lead to process improvements and quality
control measures, ultimately reducing defects and enhancing customer satisfaction.
ADVANTAGES OF COMMON FACTOR
ANALYSIS (CFA):
1.Construct Validation: It helps confirm whether the observed variables are good indicators of the
underlying constructs they are intended to measure.
2.Hypothesis Testing: CFA can be used to test hypotheses about the underlying structure of the
data. By comparing different models, researchers can determine which model best fits the data and
provides the most meaningful interpretation.
3.Reliability and Validity Assessment: CFA can help assess the reliability and validity of a
measurement instrument, such as a survey or questionnaire, by examining the relationships
between observed variables and underlying factors.
4.Interpretability: CFA provides more interpretable factors than PCA, as the factors are designed to
represent specific underlying constructs rather than just capturing variance in the data.
5.Accounting for Measurement Error: CFA can help account for measurement error in observed
variables, providing more accurate estimates of the relationships between variables and underlying
constructs.
DISADVANTAGES OF COMMON FACTOR
ANALYSIS (CFA):
1.Assumption of Common Factors: CFA assumes that the observed variables are influenced by a smaller number of common
factors. If this assumption is not met, CFA may not accurately capture the underlying structure of the data.
2.Model Complexity: CFA models can be complex, especially when the number of observed variables and factors is large. This
complexity can make it challenging to interpret the results and may require more sophisticated statistical techniques for model
estimation and evaluation.
3.Difficulty in Model Specification: If the specified model does not adequately represent the data, the results of the analysis may
be unreliable.
4.Sensitive to Model Misspecification: CFA results can be sensitive to model misspecification, such as incorrect assumptions
about the number of factors or the pattern of factor loadings. This can lead to biased parameter estimates and inaccurate
conclusions.
5.Complex Interpretation: Interpreting the results of CFA can be challenging, especially when there are correlated factors or
cross-loadings (variables loading on multiple factors). This complexity can make it difficult to extract meaningful insights from the
analysis.
6.Sample Size Requirements: CFA requires a relatively large sample size to produce reliable estimates of the model parameters.
Small sample sizes can lead to unstable parameter estimates and unreliable results.
7.Limited to Linear Relationships: Like PCA, CFA assumes linear relationships between the observed variables and the latent
factors. If the relationships are non-linear, CFA may not be the most appropriate technique.
ASSUMPTIONS OF COMMON FACTOR
ANALYSIS
1.Linearity: The relationships between observed variables and latent factors are
assumed to be linear.
2.Sufficient variability: There should be sufficient variability in the observed
variables to allow for the extraction of common factors.
3.Large sample size: CFA tends to perform better with larger sample sizes to
provide more reliable estimates.
4.Normality: The variables are assumed to be normally distributed.
5.No outliers: The presence of outliers can affect the results of CFA, so it is
advisable to check for and address outliers in the data.
STEPS OF COMMON FACTOR
ANALYSIS
1.Formulate the Research Question: Clearly define the research question and determine the variables that will be included in the analysis.
2.Data Collection: Collect data on the selected variables from a sample that is representative of the population of interest.
3.Data Screening: Check the data for missing values, outliers, and normality. Address any issues identified.
4.Select the Number of Factors: Determine the number of factors to extract based on theoretical considerations, scree plot, eigenvalues, or other
criteria (e.g., Kaiser's criterion).
5.Factor Extraction: Use a factor extraction method (e.g., principal component analysis, maximum likelihood) to extract the factors from the data.
6.Factor Rotation: Rotate the extracted factors to achieve a simpler and more interpretable factor structure. Common rotation methods include
varimax, promax, and oblimin.
7.Factor Interpretation: Interpret the rotated factor matrix to understand the relationships between the factors and the observed variables. Name the
factors based on the variables that load most strongly on each factor.
8.Assess Factorial Validity: Evaluate the validity of the factor solution using goodness-of-fit indices such as the chi-square test, Comparative Fit
Index (CFI), Tucker-Lewis Index (TLI), and Root Mean Square Error of Approximation (RMSEA).
9.Interpret the Results: Interpret the results in the context of the research question and theoretical framework. Discuss the implications of the factor
structure for theory and practice.
10.Report Findings: Present the findings in a clear and concise manner, including the factor loadings, factor correlations, and any other relevant
results.
NUMBER OF FACTORS
1.Kaiser's Criterion: Kaiser's criterion suggests retaining only factors with eigenvalues greater than 1.0.
Eigenvalues represent the amount of variance explained by each factor. Factors with eigenvalues less than
1.0 are considered to explain less variance than a single original variable, and thus are typically not retained.
2.Scree Plot: A scree plot is a plot of the eigenvalues against the factor number. The point where the slope of
the plot levels off (the "elbow" of the plot) indicates the number of factors to retain. Factors before the elbow
are retained, while those after are typically discarded.
3.Parallel Analysis: Parallel analysis compares the eigenvalues obtained from the actual data with the
eigenvalues obtained from randomly generated data (using Monte Carlo simulation).
4.Percentage of Variance Explained: Another approach is to examine the cumulative percentage of variance
explained by the factors. A common rule of thumb is to retain enough factors to explain at least 70-80% of the
total variance in the data.
5.Theory and Interpretability: Finally, theory and interpretability should also be considered when deciding the
number of factors to retain. Factors should make sense in the context of the research question and be
interpretable as meaningful constructs.
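The first and fourth criteria above can be sketched directly in code; the eigenvalues below are invented for illustration, and the helper names are hypothetical:

```python
def kaiser_criterion(eigenvalues):
    """Kaiser's criterion: retain factors whose eigenvalue exceeds 1.0."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def factors_for_variance(eigenvalues, target=0.80):
    """Smallest number of factors whose cumulative share of the
    total variance reaches the target (e.g. 80%)."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        cumulative += ev
        if cumulative / total >= target:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted in decreasing order
eigs = [3.2, 1.8, 0.9, 0.6, 0.3, 0.2]

print(kaiser_criterion(eigs))      # prints 2: two eigenvalues exceed 1.0
print(factors_for_variance(eigs))  # prints 3: three factors reach 80%
```

Note that the two rules can disagree, as here, which is why the scree plot, parallel analysis, and interpretability are usually consulted alongside them.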
FACTOR EXTRACTION
1.PCA:
1. Objective: PCA aims to extract a set of orthogonal (uncorrelated) components that explain the maximum amount of
variance in the original data.
2. Extraction Method: In PCA, factors are extracted based on the eigenvectors (principal components) of the
correlation or covariance matrix of the original variables. Each eigenvector represents a principal component, and
the corresponding eigenvalue indicates the amount of variance explained by that component.
3. Number of Factors: The number of factors extracted in PCA is typically equal to the number of original variables.
However, to reduce dimensionality, only a subset of the components (often those with the largest eigenvalues) are
retained.
2.CFA:
1. Objective: CFA aims to extract a smaller number of factors that account for the correlations among the observed
variables, reflecting the underlying latent constructs.
2. Extraction Method: In CFA, factors are extracted based on the common variance shared among the observed
variables. This is done using methods like Maximum Likelihood (ML), Principal Axis Factoring (PAF), or Principal
Component Analysis (PCA) with a different rotation method.
3. Number of Factors: The number of factors in CFA is determined based on statistical criteria (e.g., Kaiser's criterion,
scree plot) or theoretical considerations. CFA focuses on retaining factors that are meaningful and interpretable in
the context of the research.
ORTHOGONAL AND OBLIQUE ROTATION
1.Orthogonal Rotation:
1. Definition: Orthogonal rotation results in factors or components that are uncorrelated (orthogonal) to each
other.
2. Use in PCA: In PCA, orthogonal rotation is the default, as PCA aims to extract uncorrelated components that
explain the maximum amount of variance in the data. Varimax is a commonly used orthogonal rotation
method in PCA.
3. Use in CFA: Although CFA typically allows the factors to be correlated, orthogonal rotation can still be used in
CFA to simplify the interpretation of the factors, especially when the factors are expected to be largely
uncorrelated. However, oblique rotation is more commonly used in CFA.
2.Oblique Rotation:
1. Definition: Oblique rotation allows factors or components to be correlated with each other.
2. Use in PCA: Oblique rotation is not typically used in PCA, as PCA aims to extract uncorrelated components.
However, in some cases where the factors are expected to be correlated, oblique rotation could be applied in
PCA.
3. Use in CFA: In CFA, oblique rotation is commonly used, as it allows for factors that are correlated, reflecting
the underlying structure of the data more accurately. Methods like Promax or Oblimin are commonly used for
oblique rotation in CFA.
FACTOR INTERPRETATION
1.PCA:
1. Interpretation: Interpreting PCA components involves understanding which original variables contribute most to
each component.
2. Interpretation Example: For example, if a PCA component has high loadings (coefficients) for variables
related to customer satisfaction (e.g., product quality, customer service), it could be interpreted as a
"customer satisfaction" component.
2.CFA:
1. Interpretation: Factor interpretation in CFA involves understanding the underlying constructs that the
factors represent, based on the pattern of loadings of the observed variables on each factor.
2. Interpretation Example: For example, if a CFA factor has high loadings for variables related to depression
symptoms (e.g., sadness, loss of interest), it could be interpreted as a "depression" factor representing the
underlying construct of depression.
FACTOR MATRIX
1.PCA:
1. Factor Matrix: In PCA, the factor matrix (also known as the loading matrix) contains the coefficients (loadings) that
represent the relationship between each original variable and each principal component.
2. Interpretation: Each column of the factor matrix corresponds to a principal component, and each row corresponds to
an original variable. The value in each cell represents the strength and direction of the relationship between the
variable and the component.
3. Interpretation Example: A high loading (positive or negative) for a variable on a component indicates that the
variable contributes significantly to that component and is important for explaining the variance in the data captured
by that component.
2.CFA:
1. Factor Matrix: In CFA, the factor matrix contains the coefficients (loadings) that represent the relationship between
each original variable and each underlying factor.
2. Interpretation: Each column of the factor matrix corresponds to a factor, and each row corresponds to an original
variable. The value in each cell represents the strength and direction of the relationship between the variable and the
factor.
3. Interpretation Example: A high loading (positive or negative) for a variable on a factor indicates that the variable is
strongly related to that factor and contributes to the underlying construct represented by that factor.
FACTOR LOADINGS
1.PCA:
1. Definition: In PCA, factor loadings represent the correlation between each original variable and each
principal component.
2. Interpretation: High factor loadings (either positive or negative) indicate that the variable has a strong
relationship with the corresponding principal component. Variables with high loadings contribute more to
that component and are considered important for explaining the variance in the data captured by that
component.
3. Usage: PCA uses factor loadings to determine the contribution of each variable to the principal components
and to interpret the meaning of the components in terms of the original variables.
2.CFA:
1. Definition: In CFA, factor loadings represent the correlation between each original variable and the
underlying latent factor.
2. Interpretation: High factor loadings (either positive or negative) indicate that the variable is strongly related
to the underlying construct represented by the factor. Variables with high loadings are considered to be
good indicators of that latent construct.
3. Usage: CFA uses factor loadings to assess the relationship between observed variables and latent factors,
to evaluate the model fit, and to interpret the underlying constructs represented by the factors.
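A common way to read a loading matrix is to flag variables whose absolute loading on a factor exceeds a salience threshold (0.4 is a frequently used rule of thumb, though conventions vary). The loadings and variable names below are hypothetical, echoing the market-research example:

```python
# Hypothetical rotated loading matrix: rows are observed variables,
# columns are two extracted factors.
loadings = {
    "product_quality":  (0.82, 0.10),
    "customer_service": (0.76, 0.05),
    "price_fairness":   (0.68, 0.21),
    "brand_trust":      (0.15, 0.79),
    "brand_image":      (0.08, 0.85),
}

def salient_variables(loadings, factor, threshold=0.4):
    """Variables loading strongly (|loading| >= threshold) on a factor."""
    return [name for name, row in loadings.items()
            if abs(row[factor]) >= threshold]

print(salient_variables(loadings, 0))  # plausibly a "satisfaction" factor
print(salient_variables(loadings, 1))  # plausibly a "brand perception" factor
```

Naming each factor after the variables that load most strongly on it is exactly the interpretation step described in the CFA procedure above.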
PROBLEMS THAT CAN ARISE WHEN USING
PRINCIPAL COMPONENT ANALYSIS (PCA) AND
COMMON FACTOR ANALYSIS (CFA):
1. Overfitting: PCA and CFA models can overfit the data if the number of components or factors retained is too high.
2. Incorrect Model Specification: Specifying the wrong number of components or factors, or choosing an inappropriate rotation method, can result in
biased parameter estimates and inaccurate conclusions.
3. Non-normal Data: PCA and CFA assume that the data are normally distributed. If the data are highly skewed or have outliers, the results of the
analysis may be unreliable.
4. Multicollinearity: High multicollinearity among the original variables can make it difficult to identify meaningful components or factors in PCA and CFA.
5. Interpretability Issues: The components or factors extracted in PCA and CFA may not always be easy to interpret, especially when there are
cross-loadings or complex relationships between variables.
6. Sample Size: PCA and CFA require a relatively large sample size to produce reliable estimates of the model parameters. Small sample sizes can lead
to unstable results.
7. Model Selection: Choosing the appropriate number of components or factors to retain can be challenging and may require additional criteria or
validation methods.
8. Non-linear Relationships: PCA and CFA assume linear relationships between variables and components or factors. If the relationships are non-linear,
these methods may not be appropriate.
THANK YOU