
MODULE 3: DIMENSION REDUCTION TECHNIQUES - PRINCIPAL COMPONENTS AND COMMON FACTOR ANALYSIS


Population and sample principal components, their uses and applications, large sample inferences, graphical
representation of principal components, Biplots, the orthogonal factor model, dimension reduction, estimation
of factor loading and factor scores, interpretation of factor analysis.

Dimension Reduction Techniques

The number of input features, variables, or columns present in a given dataset is known as dimensionality, and
the process to reduce these features is called dimensionality reduction.

In many cases, a dataset contains a huge number of input features, which makes the predictive modeling task
more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a
high number of features, dimensionality reduction techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset
into a lower-dimensional dataset while ensuring that it still provides similar information." These techniques are
widely used in machine learning for obtaining a better-fitting predictive model when solving classification and
regression problems. They are commonly used in fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. They can also be used for data visualization, noise
reduction, cluster analysis, etc.

Dimension reduction techniques are used in data science and machine learning to reduce the number of
variables or features in a dataset while retaining the most important information. There are two main types of
dimensionality reduction techniques:

1. Feature selection: In feature selection, a subset of the original features is selected and used for
modeling. This is typically done by ranking the features based on their relevance or importance to the
outcome variable.

2. Feature extraction: In feature extraction, a new set of features is created that combines the original
features in a meaningful way. This is typically done using linear algebra techniques such as principal
component analysis (PCA) or singular value decomposition (SVD).

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given dataset are given below:

▪ By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
▪ Less computation time is required to train a model on the reduced set of features.

▪ Reduced dimensions of features of the dataset help in visualizing the data quickly.

▪ It removes the redundant features (if present) by taking care of multicollinearity.

Disadvantages of Dimensionality Reduction

There are also some disadvantages of applying the dimensionality reduction, which are given below:

• Some data may be lost due to dimensionality reduction.

• In the PCA dimensionality reduction technique, the number of principal components that need to be
retained is sometimes not known in advance.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection

Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the
optimal features from the input dataset. Three methods are used for the feature selection:

1. Filter Methods

In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some
common techniques of the filter method are:

• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.

2. Wrapper Methods

The wrapper method has the same goal as the filter method, but it uses a machine learning model for its
evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The
performance decides whether to add or remove those features in order to increase the accuracy of the model.
This method is more accurate than the filter method but is computationally more expensive. Some common
techniques of wrapper methods are:

• Forward Selection
• Backward Selection
• Bi-directional Elimination

3. Embedded Methods: Embedded methods evaluate the importance of each feature during the different
training iterations of the machine learning model itself. Some common techniques of embedded methods are:

• LASSO
• Elastic Net
• Ridge Regression, etc.
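
As a rough illustration of the filter and embedded approaches above, the following Python sketch (using scikit-learn; the ANOVA F-test, the L1-penalized logistic regression, and all parameter values are illustrative choices, not prescribed by these notes) selects features in both ways on a synthetic dataset:

```python
# A minimal sketch of filter and embedded feature selection with scikit-learn
# (synthetic data; thresholds and parameter values are illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Filter method: rank features by the ANOVA F-statistic and keep the top 5.
filter_selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = filter_selector.fit_transform(X, y)
print("Filter method kept features:", filter_selector.get_support(indices=True))

# Embedded method: an L1 (LASSO-style) penalty drives weak coefficients to zero,
# so feature importance is learned during model training itself.
lasso_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
embedded_selector = SelectFromModel(lasso_model).fit(X, y)
print("Embedded method kept features:", embedded_selector.get_support(indices=True))
```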

Feature Extraction:

Feature extraction is the process of transforming a space containing many dimensions into a space with fewer
dimensions. This approach is useful when we want to retain the complete information but use fewer resources
while processing it.

Some common feature extraction techniques are:

• Principal Component Analysis

• Linear Discriminant Analysis

• Kernel PCA

• Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction

a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
Principal Component Analysis (PCA)

Principal Component Analysis is a statistical process that converts the observations of correlated features
into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new
transformed features are called the Principal Components. It is one of the popular tools used for exploratory
data analysis and predictive modeling.

PCA works by considering the variance of each attribute, because an attribute with high variance indicates a
good split between the classes, and hence PCA reduces the dimensionality. Some real-world applications of
PCA are image processing, movie recommendation systems, and optimizing the power allocation in various
communication channels.
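
A minimal Python sketch of PCA with scikit-learn is given below; the Iris data and the choice of two components are only illustrative:

```python
# A minimal PCA sketch with scikit-learn on the Iris data
# (the choice of 2 components is illustrative).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # 150 observations, 4 correlated features
X_std = StandardScaler().fit_transform(X)  # standardize before PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)          # principal component scores

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Shape after reduction:", scores.shape)   # (150, 2)
```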

Backward Feature Elimination

The backward feature elimination technique is mainly used while developing a Linear
Regression or Logistic Regression model. The following steps are performed in this technique to
reduce the dimensionality or for feature selection:

• In this technique, firstly, all the n variables of the given dataset are taken to
train the model.
• The performance of the model is checked.
• Now we will remove one feature at a time and train the model on the remaining n-1
features (n times in total), and will compute the performance of the model each time.
• We will identify the variable whose removal makes the smallest or no change in the
performance of the model and drop that variable; after that, we will be left with
n-1 features.
• Repeat the complete process until no feature can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum
tolerable error rate, we can determine the optimal number of features required for the machine
learning algorithm.
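
The procedure above can be automated, for example, with scikit-learn's SequentialFeatureSelector using direction="backward"; the dataset, estimator, and parameter values in this sketch are illustrative:

```python
# Backward feature elimination, roughly as described above, using scikit-learn's
# SequentialFeatureSelector (parameter values are illustrative).
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from all features and repeatedly drop the one whose removal
# hurts cross-validated performance the least.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=5,
                                     direction="backward",
                                     cv=5)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```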

Forward Feature Selection

Forward feature selection is the inverse of the backward elimination process. It means
that in this technique we do not eliminate features; instead, we find the best features that
produce the highest increase in the performance of the model. The following steps are
performed in this technique:

• We start with a single feature and progressively add one feature at a time.

• Here we train the model on each feature separately.

• The feature with the best performance is selected.

• The process is repeated until adding a new feature no longer gives a significant
increase in the performance of the model.
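
A small hand-written sketch of this greedy forward search is shown below (the stopping threshold of 0.01 and the dataset are illustrative assumptions):

```python
# A hand-rolled sketch of the forward selection loop described above
# (greedy search; the 0.01 improvement threshold is an illustrative choice).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
remaining = list(range(X.shape[1]))
selected, best_score = [], -float("inf")

while remaining:
    # Score each candidate feature when added to the current subset.
    trials = {f: cross_val_score(LinearRegression(), X[:, selected + [f]], y,
                                 cv=5).mean() for f in remaining}
    f_best = max(trials, key=trials.get)
    if trials[f_best] - best_score < 0.01:   # no significant improvement: stop
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = trials[f_best]

print("Selected feature indices:", selected)
```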

Missing Value Ratio

If a variable in the dataset has too many missing values, we drop it because it does not
carry much useful information. To perform this, we can set a threshold level, and if a
variable has a proportion of missing values greater than that threshold, we drop that
variable. The choice of threshold controls how aggressive the reduction is.
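
For example, a short pandas sketch of this idea (the 0.5 threshold and the toy columns are illustrative):

```python
# Dropping variables whose missing-value ratio exceeds a threshold
# (pandas sketch; the 0.5 threshold and column data are illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan, 4, np.nan],
                   "b": [np.nan, np.nan, np.nan, 4, 5],
                   "c": [1, 2, 3, 4, 5]})

threshold = 0.5
missing_ratio = df.isna().mean()              # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(missing_ratio)
print("Columns kept:", list(df_reduced.columns))   # drops "b" (60% missing)
```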

Low Variance Filter

Similar to the missing value ratio technique, data columns with very little change in their
values carry little information. Therefore, we calculate the variance of each variable, and
all data columns with variance lower than a given threshold are dropped, because low-
variance features will have little effect on the target variable.
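
A minimal sketch using scikit-learn's VarianceThreshold (the threshold value and toy data are illustrative; the features should be on comparable scales for a raw variance cut-off to make sense):

```python
# Low variance filter using scikit-learn's VarianceThreshold
# (the 0.01 threshold is an illustrative choice).
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.0, 10.0],
              [1.0, 0.1, 12.0],
              [1.0, 0.0, 14.0],
              [1.0, 0.1, 16.0]])   # first column is constant, second is nearly so

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print("Variances:", selector.variances_)
print("Columns kept:", selector.get_support(indices=True))   # only the third column
```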

High Correlation Filter

High correlation refers to the case when two variables carry approximately the same
information. Because of this, the performance of the model can be degraded. The
correlation between independent numerical variables is measured by the correlation
coefficient. If this value is higher than the threshold value, we can remove one of the two
variables from the dataset, preferring to keep the variable that shows the higher
correlation with the target variable.
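
A short pandas sketch of this filter (the 0.9 correlation threshold and the synthetic columns are illustrative):

```python
# High correlation filter: drop one variable from each highly correlated pair
# (the 0.9 threshold and synthetic columns are illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.05, size=100),  # nearly duplicates x1
                   "x3": rng.normal(size=100)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Dropping:", to_drop)          # "x2" is dropped; "x1" and "x3" remain
df_reduced = df.drop(columns=to_drop)
```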
Random Forest

Random Forest is a popular and very useful feature selection algorithm in machine
learning. This algorithm provides an in-built feature importance measure, so we do not
need to program it separately. In this technique, we generate a large set of trees against
the target variable, and with the help of the usage statistics of each attribute, we find the
subset of the most important features.

The random forest algorithm takes only numerical variables, so we need to convert any
categorical input data into numeric data using one-hot encoding.
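
A minimal sketch of random-forest-based feature selection with scikit-learn (the number of trees and the 0.05 importance cut-off are illustrative choices):

```python
# Feature selection with random forest importances
# (parameter values and the 0.05 importance cut-off are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_          # built-in importance scores

keep = np.where(importances > 0.05)[0]             # retain reasonably important features
print("Importances:", np.round(importances, 3))
print("Selected feature indices:", keep)
```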

Factor Analysis

Factor analysis is a technique in which each variable is kept within a group according to
its correlation with the other variables; that is, variables within a group can have a high
correlation among themselves, but they have a low correlation with variables of other
groups.

We can understand this with an example: suppose we have two variables, Income and
Spend. These two variables have a high correlation, which means that people with high
income spend more, and vice versa. Such variables are put into a group, and that group is
known as a factor. The number of these factors will be small compared to the original
dimension of the dataset.

Auto-encoders

One of the popular methods of dimensionality reduction is the auto-encoder, which is a type
of artificial neural network (ANN) whose main aim is to copy its inputs to its outputs. In
this, the input is compressed into a latent-space representation, and the output is produced
from this representation. It has two main parts:

• Encoder: The function of the encoder is to compress the input to form the latent-
space representation.

• Decoder: The function of the decoder is to recreate the output from the latent-
space representation.
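
A minimal autoencoder sketch with Keras is given below; the layer sizes, the 3-dimensional latent space, and the training settings are all illustrative assumptions:

```python
# A minimal autoencoder sketch with Keras (layer sizes, the 3-dimensional
# latent space, and training settings are illustrative choices).
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")      # toy data: 1000 samples, 20 features

inputs = tf.keras.Input(shape=(20,))
encoded = tf.keras.layers.Dense(3, activation="relu")(inputs)     # encoder -> latent space
decoded = tf.keras.layers.Dense(20, activation="linear")(encoded) # decoder -> reconstruction

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")    # trained to copy inputs to outputs
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

encoder = tf.keras.Model(inputs, encoded)            # reuse the trained encoder alone
X_reduced = encoder.predict(X, verbose=0)
print("Reduced shape:", X_reduced.shape)             # (1000, 3)
```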
Population and sample principal components, their uses and applications

Principal component analysis (PCA) is a commonly used dimensionality reduction
technique in statistics and machine learning. PCA can be applied to both populations and
samples of data.

A population in statistics refers to the entire set of individuals, objects, or events that we
are interested in studying. A sample is a smaller subset of the population that is used to
make inferences about the population.

In PCA, the principal components are computed using the covariance matrix of the data.
The first principal component captures the direction of greatest variance in the data, the
second principal component captures the direction of second greatest variance that is
orthogonal to the first component, and so on.

Population principal components are the principal components computed using the entire
population data. They can be used to understand the structure of the population data and
can be used for prediction or inference about new data points that come from the same
population.

Sample principal components, on the other hand, are the principal components computed
using a sample of the population data. They are used to reduce the dimensionality of the
sample data and can be used for exploratory data analysis, visualization, or as input to
other models.
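
The following NumPy sketch shows how sample principal components are obtained from the eigendecomposition of the sample covariance matrix (the simulated data are purely illustrative):

```python
# Sample principal components from the eigendecomposition of the sample
# covariance matrix (NumPy sketch on simulated data; purely illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 0], [0, 0, 1]], size=500)

S = np.cov(X, rowvar=False)                  # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = (X - X.mean(axis=0)) @ eigvecs      # sample principal component scores
print("Sample eigenvalues:", np.round(eigvals, 2))
print("Proportion of variance:", np.round(eigvals / eigvals.sum(), 3))
```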

Some uses and applications of population and sample principal components are:

1. Data exploration and visualization: PCA can be used to visualize high-dimensional
data in two or three dimensions by plotting the data points in the space defined by
the first two or three principal components.

2. Feature selection: PCA can be used to identify the most important features or
variables that explain the most variance in the data. This can be useful for
reducing the number of features in a dataset for further analysis.

3. Data compression: PCA can be used to compress the data by retaining only the
first few principal components, which capture most of the variation in the data.
This can be useful for reducing the storage requirements of the data.

4. Clustering and classification: PCA can be used as a preprocessing step for
clustering or classification algorithms to reduce the dimensionality of the data and
improve the performance of the algorithms.

In summary, both population and sample principal components have various uses and
applications in data analysis, machine learning, and statistical modeling.
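
As a brief illustration of the data compression use listed above, the following sketch keeps a few components and then approximately reconstructs the data (the dataset and the number of retained components are illustrative):

```python
# Illustrating PCA as data compression: keep a few components, then
# approximately reconstruct the original data (component count is illustrative).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # 1797 images, 64 pixel features each

pca = PCA(n_components=16)                  # keep 16 of 64 dimensions
X_compressed = pca.fit_transform(X)
X_restored = pca.inverse_transform(X_compressed)

print("Variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
print("Mean reconstruction error:", round(np.mean((X - X_restored) ** 2), 3))
```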

Large Sample Inferences

In principal component analysis (PCA), large sample inference can be used to make
statistical inferences about population parameters based on a large sample of data.
Specifically, large sample inference can be used to test hypotheses about the principal
components and to construct confidence intervals around the principal component scores.

One common application of large sample inference in PCA is to test whether a particular
principal component is statistically significant. This can be done using a large sample
test, such as a t-test or a z-test, to compare the sample mean of the principal component to
its expected value under the null hypothesis. If the test statistic is sufficiently large, the
null hypothesis can be rejected, indicating that the principal component is statistically
significant.

Another application of large sample inference in PCA is to construct confidence intervals
around the principal component scores. Confidence intervals provide a range of plausible
values for the population parameter based on the sample data. In PCA, confidence
intervals can be constructed around the principal component scores using large sample
methods, such as the t-distribution or the normal distribution, depending on the sample
size and the distributional properties of the data.

It is important to note that large sample inference in PCA relies on the assumption that
the sample size is sufficiently large for the central limit theorem to apply. In general, a
sample size of at least 30 is recommended for large sample inference to be valid.
Additionally, it is important to carefully consider the assumptions underlying the
statistical tests and to verify that the data satisfies these assumptions, such as normality
and independence.
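
As one concrete sketch, assuming multivariate normal data and a large sample, the classical large-sample result that a sample eigenvalue is approximately normally distributed around the population eigenvalue with variance 2λ²/n can be used to build a confidence interval for the variance explained by a principal component:

```python
# A sketch of a large-sample confidence interval for a population eigenvalue,
# using the asymptotic result that a sample eigenvalue lambda_hat is roughly
# normal with variance 2*lambda^2/n (assumes multivariate normal data, large n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=400)
n = X.shape[0]

lam_hat = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]  # sample eigenvalues
z = stats.norm.ppf(0.975)                                             # 95% two-sided

for i, lam in enumerate(lam_hat, start=1):
    lower = lam / (1 + z * np.sqrt(2 / n))
    upper = lam / (1 - z * np.sqrt(2 / n))
    print(f"lambda_{i}: estimate {lam:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```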

Graphical Representation of Principal Components

Principal component analysis (PCA) can be represented graphically in several ways,
which can help in understanding the structure and relationships among variables in a
dataset. Here are some common graphical representations of principal components:

1. Scree plot: A scree plot is a graphical representation of the eigenvalues of the
principal components. It is a plot of the eigenvalues against the number of
principal components. The scree plot can help in determining the number of
principal components to retain for analysis. Typically, we look for the point on
the plot where the eigenvalues start to level off, indicating that the remaining
principal components explain little additional variation.

2. Biplot: A biplot is a two-dimensional plot that shows the relationship between
variables and principal components. It can help in visualizing how variables
contribute to the principal components and how variables are related to each
other. In a biplot, each variable is represented as a vector, and the length and
direction of the vector show the contribution of the variable to the principal
components.

Biplots are a type of data visualization that allows us to simultaneously display
the patterns in two sets of variables. In other words, biplots show the relationships
between two different types of variables in a single plot.

In a biplot, each observation is represented by a point, and each variable is
represented by a vector. The length and direction of the vector represent the
magnitude and direction of the variable's contribution to the overall pattern of the
data. The position of the point relative to the vectors indicates the relationship
between the observation and the variables.
Biplots can be used to explore the relationships between variables and
observations in a variety of different fields, such as ecology, genetics, and
marketing research. They can also be used in multivariate data analysis
techniques, such as principal component analysis (PCA) and correspondence
analysis (CA), to visualize and interpret the results of these analyses.

Overall, biplots are a useful tool for understanding and communicating complex
patterns in data, and can help to identify important relationships between different
variables (a small plotting sketch is given after this list).

3. Score plot: A score plot is a two-dimensional plot that shows the scores of
observations on the first two principal components. It can help in visualizing the
clustering and separation of observations based on their scores on the principal
components. In a score plot, each observation is represented as a point, and the
location of the point shows its scores on the first two principal components.

4. Loading plot: A loading plot is a graphical representation of the loadings of the
variables on the principal components. It can help in visualizing which variables
are most strongly associated with each principal component. In a loading plot,
each variable is represented as a vector, and the length and direction of the vector
show the magnitude and direction of the loading.

These graphical representations of principal components can help in interpreting the
results of a PCA and in communicating the findings to others. They can also provide
insights into the relationships among variables and can help in identifying patterns or
outliers in the data.
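
The following matplotlib sketch draws a scree plot and a simple biplot for a PCA of the Iris data (the dataset, the colouring by class, and the scaling of the loading arrows are illustrative choices):

```python
# A matplotlib sketch of a scree plot and a simple biplot for a PCA of the
# Iris data (arrow scaling is purely for readability).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X_std)
scores = pca.transform(X_std)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scree plot: eigenvalues against component number; look for the "elbow".
ax1.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, "o-")
ax1.set_xlabel("Principal component")
ax1.set_ylabel("Eigenvalue")
ax1.set_title("Scree plot")

# Biplot: observation scores as points, variable loadings as arrows.
ax2.scatter(scores[:, 0], scores[:, 1], c=iris.target, alpha=0.5)
for j, name in enumerate(iris.feature_names):
    ax2.arrow(0, 0, 3 * pca.components_[0, j], 3 * pca.components_[1, j],
              color="red", head_width=0.05)
    ax2.text(3.2 * pca.components_[0, j], 3.2 * pca.components_[1, j], name)
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC2")
ax2.set_title("Biplot")
plt.tight_layout()
plt.show()
```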

The Orthogonal Factor Model

The orthogonal factor model is a statistical method used in multivariate analysis to
explore the relationships between variables. It assumes that the variables are related to
each other through a set of underlying, unobserved factors, and that these factors are
orthogonal, meaning they are uncorrelated with each other.

In this model, each variable is represented as a linear combination of the underlying
factors. The goal is to identify the underlying factors that explain the most variance in the
data, and to use these factors to understand the relationships between the variables.
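
In standard matrix notation, the orthogonal factor model for a p-dimensional observed vector X with mean μ is written as X − μ = L F + ε, where L is the p × m matrix of factor loadings, F is the vector of m common factors with Cov(F) = I, and ε is the vector of specific errors with a diagonal covariance matrix Ψ; the factors and errors are assumed to be uncorrelated. This implies the covariance structure Σ = L L′ + Ψ, which is the relationship that factor analysis attempts to reproduce from the data.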

The orthogonal factor model is often used in the field of psychology to study personality
traits. For example, researchers might use this model to identify the underlying factors
that contribute to a person's extroversion, conscientiousness, and openness to experience.
The factors identified in this way can then be used to better understand how these
personality traits are related to other aspects of a person's life, such as their career choices
or social behavior.

Overall, the orthogonal factor model is a powerful tool for exploring the relationships
between variables and identifying the underlying factors that drive those relationships.

An example of the orthogonal factor model is in the analysis of the stock market. Let's
say we have data on the daily closing prices of several stocks over a period of time. The
prices of these stocks are likely to be correlated with each other, meaning that if one
stock goes up, the others are likely to follow suit. However, the exact nature of these
correlations is not immediately apparent.

To apply the orthogonal factor model, we first calculate the correlation matrix of the
stock prices. This gives us a measure of the linear relationship between each pair of
stocks. We then use a statistical method called principal component analysis (PCA) to
identify the underlying factors that explain the most variance in the data.

In this case, the factors might represent things like market trends, industry-specific
factors, or macroeconomic variables that affect all stocks. By identifying these factors,
we can better understand the relationships between the stocks and potentially make more
informed investment decisions.

Overall, the orthogonal factor model is a useful tool for identifying hidden patterns in
complex data sets and uncovering the underlying factors that drive those patterns. It has
applications in many fields, from finance to psychology to biology.

Estimation of Factor Loading and Factor Scores


Factor analysis is a statistical technique that is used to identify underlying factors in a set
of observed variables. The factor loading and factor score are two key components of
factor analysis that are used to estimate the relationships between the observed variables
and the underlying factors.

Factor loadings represent the strength of the relationship between each observed variable
and the underlying factor. They indicate how much of the variation in the observed
variable can be explained by the factor. When the analysis is based on standardized
variables, factor loadings range from -1 to 1, with values closer to ±1 indicating a
stronger relationship between the variable and the factor.

To estimate the factor loadings, factor analysis typically uses maximum likelihood
estimation or principal component analysis. The factor loading estimates can be
interpreted to identify which observed variables are most strongly associated with each
factor.

Factor scores, on the other hand, represent the values of the underlying factors for each
observation in the dataset. They are calculated by multiplying the observed variables by
their corresponding factor loadings and summing over all variables. Factor scores are
useful because they provide a way to summarize the information contained in the
observed variables into a smaller number of variables that capture the essential
information.

To estimate the factor scores, several methods can be used, such as regression-based
methods, Bartlett's method, Anderson-Rubin method, and others. The estimated factor
scores can be used for subsequent analyses, such as regression or cluster analysis.
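
A minimal sketch of loading and score estimation with scikit-learn's FactorAnalysis (the dataset and the choice of two factors are illustrative; scikit-learn fits the model by maximum likelihood, and its transform method returns the posterior mean of the factors, which corresponds to regression-type factor scores):

```python
# Estimating factor loadings and factor scores with scikit-learn's FactorAnalysis
# (the choice of 2 factors and the Iris data are illustrative).
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_.T          # rows = observed variables, columns = factors
scores = fa.transform(X)             # factor scores for each observation

print("Factor loadings:\n", loadings.round(2))
print("First few factor scores:\n", scores[:3].round(2))
```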

Overall, factor loading and factor score estimation are key components of factor analysis
that provide insights into the relationships between observed variables and the underlying
factors.
