ASM-BDM - Module 3 - Notes
The number of input features, variables, or columns present in a given dataset is known as dimensionality, and
the process to reduce these features is called dimensionality reduction.
In many cases a dataset contains a very large number of input features, which makes the predictive modeling task
more complicated. Because it is very difficult to visualize or make predictions from a training dataset with a
high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction can be defined as a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it conveys similar information. These techniques are
widely used in machine learning to obtain a better-fitting predictive model when solving classification and
regression problems. They are commonly used in fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. They can also be used for data visualization, noise
reduction, cluster analysis, etc.
Dimension reduction techniques are used in data science and machine learning to reduce the number of
variables or features in a dataset while retaining the most important information. There are two main types of
dimensionality reduction techniques:
1. Feature selection: In feature selection, a subset of the original features is selected and used for
modeling. This is typically done by ranking the features based on their relevance or importance to the
outcome variable.
2. Feature extraction: In feature extraction, a new set of features is created that combines the original
features in a meaningful way. This is typically done using linear algebra techniques such as principal
component analysis (PCA) or singular value decomposition (SVD).
Some benefits of applying the dimensionality reduction technique to a given dataset are given below:
▪ By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
▪ Less computation and training time is required with reduced feature dimensions.
▪ Reduced feature dimensions help in visualizing the data quickly.
There are also some disadvantages of applying dimensionality reduction, which are given below:
• In the PCA dimensionality reduction technique, the number of principal components to retain is
sometimes not known in advance.
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant
features present in a dataset in order to build a high-accuracy model. In other words, it is a way of selecting the
optimal features from the input dataset. Three methods are used for feature selection:
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some
common techniques of the filter method are:
• Correlation
• ANOVA
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its
evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The
performance decides whether to add or remove those features in order to increase the accuracy of the model. This
method is more accurate than the filter method but more computationally expensive. Some common techniques of wrapper
methods are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination
3. Embedded Methods: Embedded methods evaluate the importance of each feature across the different training
iterations of the machine learning model itself. Some common techniques of embedded methods are:
• LASSO
• Elastic Net
• Ridge Regression, etc.
Feature Extraction
Feature extraction is the process of transforming data from a space with many dimensions into a space with fewer
dimensions. This approach is useful when we want to retain most of the information while using fewer resources
to process it.
Some common techniques used for feature extraction and dimensionality reduction are:
a. Principal Component Analysis (PCA), including Kernel PCA
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-encoders
Principal Component Analysis (PCA)
PCA works by considering the variance of each attribute, because an attribute with high variance indicates
a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of
PCA are image processing, movie recommendation systems, and optimizing the power allocation in various
communication channels.
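As a minimal sketch of how PCA is typically applied in practice (assuming scikit-learn is available; the random data and the choice of two components are purely illustrative):

```python
# A minimal PCA sketch with scikit-learn; data and component count are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 10)                    # hypothetical data: 100 samples, 10 features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale, so standardize first
pca = PCA(n_components=2)                      # keep the 2 directions of greatest variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (100, 2)
print(pca.explained_variance_ratio_)           # share of variance captured by each component
```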
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing a Linear
Regression or Logistic Regression model. The following steps are performed in this technique to
reduce the dimensionality or to perform feature selection:
• In this technique, firstly, all the n variables of the given dataset are taken to
train the model.
• The performance of the model is checked.
• Now we will remove one feature each time and train the model on n-1 features
for n times, and will compute the performance of the model.
• We will check the variable that has made the smallest or no change in the
performance of the model, and then we will drop that variable or features;
after that, we will be left with n-1 features.
• Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum
tolerable error rate, we can define the optimal number of features required for the machine
learning algorithm.
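A minimal sketch of this procedure using scikit-learn's SequentialFeatureSelector (an assumption; these notes do not prescribe a library, and the logistic regression model and the target of five features are illustrative):

```python
# Backward elimination sketch: start from all features and greedily drop the one
# whose removal hurts cross-validated performance the least.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,          # illustrative stopping point
    direction="backward",
    cv=5,
)
selector.fit(X, y)

print(selector.get_support())        # boolean mask of the retained features
X_reduced = selector.transform(X)    # dataset restricted to the selected features
```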
Forward Feature Selection
Forward feature selection follows the inverse of the backward elimination process. In this
technique, we do not eliminate features; instead, we find the best features that produce the
highest increase in the performance of the model. The following steps are performed in this technique:
• We start with a single feature only, and progressively add one feature at a time.
• The process is repeated until adding further features no longer gives a significant increase
in the performance of the model.
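The same SequentialFeatureSelector sketch shown for backward elimination can be run in the forward direction (again an illustrative assumption, not a prescribed implementation):

```python
# Forward selection sketch: start from an empty set and greedily add the feature
# that most improves cross-validated performance.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,          # illustrative stopping point
    direction="forward",
    cv=5,
)
selector.fit(X, y)

print(selector.get_support())        # boolean mask of the selected features
```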
Missing Value Ratio
If a dataset has too many missing values, we drop those variables, as they do not
carry much useful information. To perform this, we can set a threshold level: if the fraction of
missing values in a variable exceeds the threshold, we drop that variable. The lower the
threshold, the more variables are dropped and the more aggressive the reduction.
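A minimal sketch of the missing value ratio filter with pandas (the small DataFrame and the 0.5 threshold are illustrative assumptions):

```python
# Drop columns whose fraction of missing values exceeds a chosen threshold.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4, np.nan],
    "b": [1, 2, 3, 4, 5],
    "c": [np.nan, np.nan, np.nan, 4, 5],
})

threshold = 0.5                                  # drop columns missing more than 50% of values
missing_ratio = df.isna().mean()                 # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]

print(missing_ratio)
print(df_reduced.columns.tolist())               # ['a', 'b'] - column 'c' is dropped
```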
Low Variance Filter
Similar to the missing value ratio technique, data columns with very little change in their values
carry less information. Therefore, we calculate the variance of each variable, and
all data columns with variance lower than a given threshold are dropped, because low-variance
features contribute little to predicting the target variable.
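A minimal sketch of the low variance filter using scikit-learn's VarianceThreshold (the data and the 0.1 threshold are illustrative):

```python
# Remove features whose variance falls below a chosen threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 0.0, 10.0],
    [1.1, 0.0, 20.0],
    [0.9, 0.0, 30.0],
    [1.0, 0.0, 40.0],
])

selector = VarianceThreshold(threshold=0.1)   # drop features with variance below 0.1
X_reduced = selector.fit_transform(X)

print(selector.variances_)                    # per-feature variances
print(X_reduced.shape)                        # (4, 1): only the third column survives
```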
High Correlation Filter
High correlation refers to the case when two variables carry approximately the same
information, and this redundancy can degrade the performance of the model. The correlation
between independent numerical variables is measured by the correlation coefficient; if this
value is higher than a chosen threshold, we can remove one of the two variables from the
dataset. We prefer to keep the variable that shows the higher correlation with the target variable.
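A minimal sketch of a high correlation filter with pandas (the synthetic data and the 0.9 threshold are illustrative assumptions):

```python
# Drop one variable from every pair whose absolute correlation exceeds 0.9.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.01, size=200),   # nearly a copy of x1
    "x3": rng.normal(size=200),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)                                 # ['x2']
df_reduced = df.drop(columns=to_drop)
```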
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine
learning. This algorithm contains an in-built feature importance package, so we do not
need to program it separately. In this technique, we need to generate a large set of trees
against the target variable, and with the help of usage statistics of each attribute, we need
to find the subset of features.
The random forest algorithm takes only numerical variables, so we need to convert the input
data into numeric form using one-hot encoding.
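A minimal sketch of feature selection via random forest importances (scikit-learn assumed; the synthetic dataset and the top-5 cut-off are illustrative):

```python
# Rank features by the forest's built-in importance scores and keep the best ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_         # built-in feature importance scores
top_features = np.argsort(importances)[::-1][:5]  # indices of the 5 most important features

print(importances.round(3))
X_reduced = X[:, top_features]
```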
Factor Analysis
Factor analysis is a technique in which each variable is grouped according to its correlation
with other variables: variables within a group have a high correlation among themselves, but
they have a low correlation with variables of other groups.
We can understand this with an example. Suppose we have two variables, Income and Spend.
These two variables have a high correlation, which means that people with a higher income tend
to spend more, and vice versa. Such variables are put into a group, and that group is
known as a factor. The number of these factors will be small compared to the
original dimension of the dataset.
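A minimal sketch of this idea with scikit-learn's FactorAnalysis (the simulated income/spend/age variables are illustrative):

```python
# Two correlated variables (income, spend) should load on the same factor.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
income = rng.normal(size=300)
spend = 0.9 * income + rng.normal(scale=0.3, size=300)   # highly correlated with income
age = rng.normal(size=300)                               # unrelated variable

X = np.column_stack([income, spend, age])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

print(fa.components_)      # loadings: income and spend load heavily on the same factor
scores = fa.transform(X)   # factor scores, one row per observation
```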
Auto-encoders
An auto-encoder is a type of artificial neural network that learns to compress its input and then
reconstruct it. It consists of two parts:
• Encoder: The function of the encoder is to compress the input to form the latent-
space representation.
• Decoder: The function of the decoder is to recreate the output from the latent-
space representation.
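A minimal auto-encoder sketch in PyTorch (an assumption; the 10-to-3 bottleneck and the training settings are illustrative, not prescriptive):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=10, n_latent=3):
        super().__init__()
        # Encoder: compress the input to the latent-space representation
        self.encoder = nn.Sequential(nn.Linear(n_features, 6), nn.ReLU(),
                                     nn.Linear(6, n_latent))
        # Decoder: recreate the output from the latent-space representation
        self.decoder = nn.Sequential(nn.Linear(n_latent, 6), nn.ReLU(),
                                     nn.Linear(6, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

X = torch.randn(256, 10)                 # hypothetical data
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):                  # train the network to reconstruct its input
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    Z = model.encoder(X)                 # reduced 3-dimensional representation
print(Z.shape)                           # torch.Size([256, 3])
```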
Population and sample principal components, their uses and applications
A population in statistics refers to the entire set of individuals, objects, or events that we
are interested in studying. A sample is a smaller subset of the population that is used to
make inferences about the population.
In PCA, the principal components are computed using the covariance matrix of the data.
The first principal component captures the direction of greatest variance in the data, the
second principal component captures the direction of second greatest variance that is
orthogonal to the first component, and so on.
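As a sketch of this computation with numpy (the random data is purely illustrative):

```python
# Principal components as eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)      # 4 x 4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric

order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                              # variance along each principal component
scores = X_centered @ eigvecs               # principal component scores
```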
Population principal components are the principal components computed using the entire
population data. They can be used to understand the structure of the population data and
can be used for prediction or inference about new data points that come from the same
population.
Sample principal components, on the other hand, are the principal components computed
using a sample of the population data. They are used to reduce the dimensionality of the
sample data and can be used for exploratory data analysis, visualization, or as input to
other models.
Some uses and applications of population and sample principal components are:
2. Feature selection: PCA can be used to identify the most important features or
variables that explain the most variance in the data. This can be useful for
reducing the number of features in a dataset for further analysis.
3. Data compression: PCA can be used to compress the data by retaining only the
first few principal components, which capture most of the variation in the data.
This can be useful for reducing the storage requirements of the data.
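A minimal sketch of PCA-based compression and reconstruction (scikit-learn assumed; the digits dataset and the choice of 16 components are illustrative):

```python
# Compress 64-dimensional data to 16 principal components and reconstruct it.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # 1797 samples x 64 pixel features

pca = PCA(n_components=16)                  # keep only the first 16 principal components
X_compressed = pca.fit_transform(X)         # 1797 x 16: roughly a 4x reduction in size
X_restored = pca.inverse_transform(X_compressed)

print(pca.explained_variance_ratio_.sum())  # fraction of total variance retained
```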
In summary, both population and sample principal components have various uses and
applications in data analysis, machine learning, and statistical modeling.
In principal component analysis (PCA), large sample inference can be used to make
statistical inferences about population parameters based on a large sample of data.
Specifically, large sample inference can be used to test hypotheses about the principal
components and to construct confidence intervals around the principal component scores.
One common application of large sample inference in PCA is to test whether a particular
principal component is statistically significant. This can be done using a large sample
test, such as a t-test or a z-test, to compare the sample mean of the principal component to
its expected value under the null hypothesis. If the test statistic is sufficiently large, the
null hypothesis can be rejected, indicating that the principal component is statistically
significant.
It is important to note that large sample inference in PCA relies on the assumption that
the sample size is sufficiently large for the central limit theorem to apply. In general, a
sample size of at least 30 is recommended for large sample inference to be valid.
Additionally, it is important to carefully consider the assumptions underlying the
statistical tests and to verify that the data satisfies these assumptions, such as normality
and independence.
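As one hedged illustration, under the classical normal-theory approximation that a sample eigenvalue has sampling variance of roughly 2*lambda^2/n, an approximate large-sample confidence interval for the variance explained by a component can be sketched as follows (the data, the component index, and the 95% level are all illustrative assumptions):

```python
# Approximate large-sample 95% CI for the first eigenvalue of the covariance matrix.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
n = X.shape[0]

cov = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

lam = eigvals[0]                       # sample eigenvalue of the first principal component
z = stats.norm.ppf(0.975)              # two-sided 95% critical value
half_width = z * np.sqrt(2.0 / n)

ci = (lam / (1 + half_width), lam / (1 - half_width))   # asymptotic interval
print(lam, ci)
```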
Overall, biplots are a useful tool for understanding and communicating complex
patterns in data, and can help to identify important relationships between different
variables.
3. Score plot: A score plot is a two-dimensional plot that shows the scores of
observations on the first two principal components. It can help in visualizing the
clustering and separation of observations based on their scores on the principal
components. In a score plot, each observation is represented as a point, and the
location of the point shows its scores on the first two principal components.
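A minimal sketch of a score plot with scikit-learn and matplotlib (the iris dataset is used purely for illustration):

```python
# Scatter the observations by their scores on the first two principal components.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
scores = PCA(n_components=2).fit_transform(data.data)    # scores on PC1 and PC2

plt.scatter(scores[:, 0], scores[:, 1], c=data.target)   # one point per observation
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Score plot")
plt.show()
```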
Orthogonal Factor Model
In the orthogonal factor model, each variable is represented as a linear combination of the underlying
factors. The goal is to identify the underlying factors that explain the most variance in the
data, and to use these factors to understand the relationships between the variables.
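Formally, using standard textbook notation (an addition, not taken from these notes), the model for p observed variables and m common factors can be written as:

```latex
X - \mu = L\,F + \varepsilon, \qquad
\operatorname{Cov}(F) = I_m, \qquad
\operatorname{Cov}(\varepsilon) = \Psi \ \text{(diagonal)}, \qquad
\operatorname{Cov}(F, \varepsilon) = 0,
```

which implies Cov(X) = LL' + Psi, where L is the p x m matrix of factor loadings.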
The orthogonal factor model is often used in the field of psychology to study personality
traits. For example, researchers might use this model to identify the underlying factors
that contribute to a person's extroversion, conscientiousness, and openness to experience.
The factors identified in this way can then be used to better understand how these
personality traits are related to other aspects of a person's life, such as their career choices
or social behavior.
Overall, the orthogonal factor model is a powerful tool for exploring the relationships
between variables and identifying the underlying factors that drive those relationships.
An example of the orthogonal factor model is in the analysis of the stock market. Let's
say we have data on the daily closing prices of several stocks over a period of time. The
prices of these stocks are likely to be correlated with each other, meaning that if one
stock goes up, the others are likely to follow suit. However, the exact nature of these
correlations is not immediately apparent.
To apply the orthogonal factor model, we first calculate the correlation matrix of the
stock prices. This gives us a measure of the linear relationship between each pair of
stocks. We then use a statistical method called principal component analysis (PCA) to
identify the underlying factors that explain the most variance in the data.
In this case, the factors might represent things like market trends, industry-specific
factors, or macroeconomic variables that affect all stocks. By identifying these factors,
we can better understand the relationships between the stocks and potentially make more
informed investment decisions.
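A hedged sketch of this kind of analysis on simulated returns (the synthetic "market factor" data is purely illustrative, not real stock prices):

```python
# PCA on the correlation matrix of simulated stock returns: the first factor
# captures the common market-wide movement shared by all stocks.
import numpy as np

rng = np.random.default_rng(0)
market = rng.normal(size=250)                       # a common market-wide driver
returns = np.column_stack(
    [0.8 * market + rng.normal(scale=0.5, size=250) for _ in range(6)]
)                                                   # 6 correlated "stocks"

corr = np.corrcoef(returns, rowvar=False)           # 6 x 6 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # variance explained by each factor

print(eigvals / eigvals.sum())                      # the first factor dominates
```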
Overall, the orthogonal factor model is a useful tool for identifying hidden patterns in
complex data sets and uncovering the underlying factors that drive those patterns. It has
applications in many fields, from finance to psychology to biology.
Factor loadings represent the strength of the relationship between each observed variable
and the underlying factor. They indicate how much of the variation in the observed
variable can be explained by the factor. Factor loadings range from -1 to 1, with values
closer to 1 indicating a stronger relationship between the variable and the factor.
To estimate the factor loadings, factor analysis typically uses maximum likelihood
estimation or principal component analysis. The factor loading estimates can be
interpreted to identify which observed variables are most strongly associated with each
factor.
Factor scores, on the other hand, represent the values of the underlying factors for each
observation in the dataset. They are calculated by multiplying the observed variables by
their corresponding factor loadings and summing over all variables. Factor scores are
useful because they provide a way to summarize the information contained in the
observed variables into a smaller number of variables that capture the essential
information.
To estimate the factor scores, several methods can be used, such as regression-based
methods, Bartlett's method, Anderson-Rubin method, and others. The estimated factor
scores can be used for subsequent analyses, such as regression or cluster analysis.
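A minimal sketch of estimating loadings and scores with scikit-learn's FactorAnalysis (the variable names and simulated data are illustrative; scikit-learn computes scores as posterior means, akin to the regression method):

```python
# Fit a one-factor model and inspect the loadings and the factor scores.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
g = rng.normal(size=400)                              # hypothetical common factor
df = pd.DataFrame({
    "math":    0.8 * g + rng.normal(scale=0.4, size=400),
    "physics": 0.7 * g + rng.normal(scale=0.4, size=400),
    "reading": rng.normal(size=400),
})

fa = FactorAnalysis(n_components=1, random_state=0).fit(df)

loadings = pd.DataFrame(fa.components_.T, index=df.columns, columns=["Factor1"])
print(loadings)                # math and physics load strongly; reading does not

scores = fa.transform(df)      # one factor score per observation
print(scores[:5])
```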
Overall, factor loading and factor score estimation are key components of factor analysis
that provide insights into the relationships between observed variables and the underlying
factors.