Unit No.02 - Feature Extraction and Selection
Dimensionality Reduction
The number of input features, variables, or columns present in a given dataset is known as dimensionality, and the process of reducing these features is called dimensionality reduction.
In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning to obtain a better-fitting predictive model while solving classification and regression problems.
It is commonly used in fields that deal with high-dimensional data, such as speech recognition, signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, and cluster analysis.
Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples required also increases proportionally, and the chance of overfitting increases. A machine learning model trained on high-dimensional data tends to become overfitted and perform poorly, which is why it is often necessary to reduce the number of features through dimensionality reduction.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are
given below:
o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
There are also some disadvantages of applying dimensionality reduction, which are given below:
o Some amount of information may be lost when the dimensions are reduced.
o In PCA, the number of principal components to retain is sometimes not known in advance.
Methods of Feature Reduction
Feature Extraction methods:
- Principal Component Analysis
- Linear Discriminant Analysis
- Kernel PCA
- Quadratic Discriminant Analysis
Feature Selection methods:
- Correlation
- Chi-Square Test
- Forward Selection
- Backward Selection
- Information Gain
- LASSO & Ridge Regression
Feature Extraction: -
Feature extraction is usually used when the original raw data cannot be used directly for machine learning modeling, so we transform the raw data into the desired form. Feature extraction is the method of creating a new, smaller set of features that captures most of the useful information in the raw data.
When we work on real-world machine learning problems, we rarely get data in the shape of a CSV file, so we have to extract useful features from the raw data. Some popular types of raw data from which features (new feature creation) can be extracted are listed below; a small example for date and time data follows the list.
Texts
Images
Geospatial data
Date and time
Web data
Sensor Data
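For example, from date and time data we can derive features such as the hour, day of week, or month. Below is a minimal pandas sketch; the DataFrame and its "timestamp" column are hypothetical examples, not part of the original material.

    # Minimal sketch: extracting new features from date/time data with pandas.
    import pandas as pd

    df = pd.DataFrame({"timestamp": pd.to_datetime(
        ["2023-01-15 08:30", "2023-06-01 22:10", "2023-12-31 13:45"])})

    df["hour"] = df["timestamp"].dt.hour              # time of day
    df["day_of_week"] = df["timestamp"].dt.dayofweek  # 0 = Monday
    df["month"] = df["timestamp"].dt.month
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    print(df)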
Statistical features-
Statistics is a branch of mathematics that deals with collecting, analyzing,
interpreting, and visualizing empirical data. Descriptive statistics and inferential statistics
are the two major areas of statistics. Descriptive statistics are for describing the properties
of sample and population data (what has happened). Inferential statistics use those
properties to test hypotheses, reach conclusions, and make predictions (what can you
expect).
Median
Mode
Mean
Percentile
Bias
Variance
Standard Deviation (S.D)
Mean Absolute Deviation (M.A.D)
Z- Score
Skewness
Kurtosis
Mean- Mean is the most commonly used measure of central tendency. It actually
represents the average of the given collection of data.
It is equal to the sum of all the values in the collection of data divided by the total
number of values.
Suppose we have n values in a set of data, namely x1, x2, x3, …, xn; then the mean of the data is
Mean (x̄) = (x1 + x2 + x3 + … + xn) / n
Median- The median represents the middle value of the given set of data when the values are arranged in a particular order (ascending or descending).
Mode- The most frequent number occurring in the data set is known as the mode.
Percentile- A percentile is a term that describes how a score compares to the other scores in the same data set. For example, the 25th percentile and the 75th percentile mark the lower and upper quartiles of the data.
Bias- Bias is a systematic difference between the true parameter and the estimate obtained from the data (for example, between the true mean and the sample mean).
Variance- Variance is a measure that tells us how far the data points are spread out from the mean.
Skewness- Skewness is a measure of the asymmetry of a distribution, i.e., its lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
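As a quick illustration, here is a minimal Python sketch (assuming NumPy and SciPy 1.9+ are available, and using a small hypothetical sample) that computes the statistical features listed above.

    # Minimal sketch: statistical features of a small, hypothetical sample.
    import numpy as np
    from scipy import stats

    x = np.array([12.0, 15.0, 15.0, 18.0, 21.0, 24.0, 30.0])

    mean = np.mean(x)
    median = np.median(x)
    mode = stats.mode(x, keepdims=False).mode        # keepdims needs SciPy 1.9+
    p25, p75 = np.percentile(x, [25, 75])
    variance = np.var(x, ddof=1)                     # sample variance
    std = np.std(x, ddof=1)                          # sample standard deviation
    mad = np.mean(np.abs(x - mean))                  # mean absolute deviation
    z_scores = (x - mean) / std
    skewness = stats.skew(x)
    kurtosis = stats.kurtosis(x)                     # excess kurtosis

    print(mean, median, mode, p25, p75, variance, std, mad, skewness, kurtosis)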
Principal Component Analysis (PCA)-
Principal Component Analysis is a feature extraction technique that transforms a set of possibly correlated variables into a smaller set of uncorrelated variables called principal components. Its main steps are:
1. Standardize the data: The first step in PCA is to standardize the data so that each variable has zero mean and unit variance. This is done to ensure that all variables are on the same scale.
2. Calculate the covariance matrix: The next step is to calculate the covariance matrix, whose entries measure the linear relationship between each pair of variables. The covariance matrix is calculated by multiplying the transpose of the standardized data matrix by the data matrix itself and dividing by (n − 1), where n is the number of samples.
3. Calculate the eigenvectors and eigenvalues of the covariance matrix: The
eigenvectors and eigenvalues of the covariance matrix are calculated to find the
directions of maximum variance in the dataset. Eigenvectors are the directions and
eigenvalues are the magnitudes of the variance.
4. Select the principal components: The eigenvectors are sorted by their
corresponding eigenvalues in descending order. The eigenvectors with the largest
eigenvalues are called the principal components. These principal components
represent the directions of maximum variance in the dataset.
5. Project the data onto the principal components: Finally, the data is projected onto
the principal components to reduce the dimensionality of the dataset. The number
of principal components to be selected depends on the desired amount of variance
to be retained in the data.
PCA is useful in several applications such as image compression, feature extraction, data
visualization, and data analysis. PCA can also be used in combination with other machine
learning algorithms to improve their performance by reducing the dimensionality of the
dataset.
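As a rough illustration of the five steps above, here is a minimal NumPy sketch of PCA on a hypothetical data matrix; the random data and the choice of two components are arbitrary.

    # Minimal sketch of the five PCA steps, applied to toy data.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # 100 samples, 5 features (toy data)

    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix
    cov = (X_std.T @ X_std) / (X_std.shape[0] - 1)

    # 3. Eigenvectors and eigenvalues (eigh: the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort by eigenvalue (descending) and keep the top k components
    order = np.argsort(eigvals)[::-1]
    k = 2
    components = eigvecs[:, order[:k]]
    explained = eigvals[order[:k]] / eigvals.sum()

    # 5. Project the data onto the principal components
    X_reduced = X_std @ components
    print(X_reduced.shape, explained)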
Feature Selection
Feature Selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features.
While developing the machine learning model, only a few variables in the dataset are useful for building the model, and the remaining features are either redundant or irrelevant.
If we input the dataset with all these redundant and irrelevant features, it may
negatively impact and reduce the overall performance and accuracy of the model.
Hence it is very important to identify and select the most appropriate features from
the data and remove the irrelevant or less important features, which is done with
the help of feature selection in machine learning.
1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search problem,
in which different combinations are made, evaluated, and compared with other
combinations. It trains the algorithm by using the subset of features iteratively.
On the basis of the model's output, features are added or removed, and the model is trained again with the updated feature set.
o Forward selection - Forward selection is an iterative process that begins with an empty set of features. In each iteration, it adds one feature and evaluates whether the model's performance improves. The process continues until adding a new variable/feature no longer improves the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is the
opposite of forward selection. This technique begins the process by considering all the
features and removes the least significant feature. This elimination process continues until
removing the features does not improve the performance of the model.
o Exhaustive Feature Selection- Exhaustive feature selection is the most thorough feature selection method: it evaluates every feature subset by brute force, i.e., it tries every possible combination of features and returns the best-performing feature set.
o Recursive Feature Elimination- Recursive feature elimination is a recursive greedy optimization approach, where features are selected by repeatedly considering smaller and smaller subsets of features. An estimator is trained on each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute (a sketch of this method is given after this list).
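As an illustration of wrapper-style selection, here is a minimal scikit-learn sketch of recursive feature elimination; the synthetic dataset and the logistic regression estimator are stand-ins, not part of the original material.

    # Minimal sketch: recursive feature elimination (a wrapper method) with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=4, random_state=0)

    estimator = LogisticRegression(max_iter=1000)
    rfe = RFE(estimator, n_features_to_select=4)   # keep the 4 best features
    rfe.fit(X, y)

    print("Selected features:", rfe.support_)      # boolean mask of kept features
    print("Feature ranking:  ", rfe.ranking_)      # 1 = selected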
2. Filter Methods
In Filter Method, features are selected on the basis of statistics measures. This method does
not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out irrelevant features and redundant columns from the model by ranking them with different metrics.
The advantage of filter methods is that they need little computational time and do not overfit the data.
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Chi-square Test: Chi-square test is a technique to determine the relationship between the
categorical variables. The chi-square value is calculated between each feature and the target
variable, and the desired number of features with the best chi-square value is selected.
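As a small illustration, the chi-square filter can be applied with scikit-learn's SelectKBest; the toy count matrix and labels below are hypothetical (chi2 requires non-negative feature values).

    # Minimal sketch: chi-square filter selection with scikit-learn.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    X = np.array([[1, 0, 3, 4],
                  [2, 1, 0, 5],
                  [0, 2, 1, 6],
                  [1, 3, 0, 7]])
    y = np.array([0, 1, 0, 1])

    selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 best features
    X_new = selector.fit_transform(X, y)
    print("Chi-square scores:", selector.scores_)
    print("Reduced shape:", X_new.shape)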
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks the variables according to Fisher's criterion in descending order; we can then select the variables with the largest Fisher's scores.
Missing Value Ratio:
The missing value ratio can be used to evaluate a feature against a threshold value. It is obtained by dividing the number of missing values in a column by the total number of observations. Variables whose missing value ratio is higher than the threshold can be dropped.
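A minimal pandas sketch of this filter, assuming a hypothetical DataFrame and a threshold of 0.4, is given below.

    # Minimal sketch: dropping columns whose missing value ratio exceeds a threshold.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, np.nan, 4, 5],
                       "b": [np.nan, np.nan, np.nan, 4, 5],
                       "c": [1, 2, 3, 4, 5]})

    missing_ratio = df.isnull().sum() / len(df)    # missing values / total observations
    threshold = 0.4                                # hypothetical threshold
    keep = missing_ratio[missing_ratio <= threshold].index
    df_reduced = df[keep]
    print(missing_ratio, df_reduced.columns.tolist())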
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast, like filter methods, but more accurate.
These methods are also iterative: they evaluate each training iteration and optimally find the features that contribute the most to it. Some techniques of embedded methods are:
o Regularization (LASSO/L1 and Ridge/L2 penalties)
o Tree-based feature importance (decision trees, random forests)
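As an illustration of an embedded method, here is a minimal scikit-learn sketch that selects features via an L1 (LASSO) penalty; the synthetic dataset and the alpha value are assumptions made for the example.

    # Minimal sketch: embedded feature selection with an L1 (LASSO) penalty.
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=200, n_features=10,
                           n_informative=3, noise=0.1, random_state=0)

    selector = SelectFromModel(Lasso(alpha=1.0))   # keeps features with non-zero coefficients
    selector.fit(X, y)
    X_selected = selector.transform(X)

    print("Kept features:", selector.get_support())
    print("Reduced shape:", X_selected.shape)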
Decision Tree-
o A decision tree has a tree-like structure: the root node and internal decision nodes test a feature, branches represent the outcomes of the test, and leaf nodes give the final decision.
o Decision trees usually mimic human thinking ability while making a decision, so they are easy to understand.
o The logic behind a decision tree can be easily understood because it shows a tree-like structure.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called child nodes.
Example- Decision Tree
Let's say we have a dataset of weather conditions and a corresponding label recording whether people go outside or not.
In this case, the goal is to predict whether someone will go outside or not given the
weather conditions. A decision tree can be used to model this problem.
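A minimal scikit-learn sketch of such a decision tree is given below; the tiny encoded weather dataset is purely hypothetical and only stands in for the dataset discussed above.

    # Minimal sketch: fitting a decision tree to a small, hypothetical weather dataset.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features: [outlook (0=sunny, 1=overcast, 2=rainy), temperature (deg C)]
    X = [[0, 30], [0, 25], [1, 22], [2, 18], [2, 15], [1, 27], [0, 20], [2, 24]]
    y = [0, 0, 1, 1, 0, 1, 1, 0]   # 1 = goes outside, 0 = stays in

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X, y)
    print(export_text(tree, feature_names=["outlook", "temperature"]))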
Entropy-
Entropy is a measure of the impurity or randomness of a dataset in decision tree classification. It
is used to determine which feature is the best to split the data on. Entropy is calculated by
considering the proportion of each class in the dataset, and the more evenly the classes are
distributed, the higher the entropy.
The formula for entropy is:
H(S) = -∑(p_i)log2(p_i)
where p_i is the proportion of samples in the dataset that belong to the ith class.
Let's take an example to understand entropy better. Suppose we have a dataset of 100 samples,
where 70 samples belong to class A and 30 samples belong to class B. We want to calculate the
entropy of this dataset.
Step 1: Calculate the probability of each class
p_A = 70 / 100 = 0.7
p_B = 30 / 100 = 0.3
Step 2: Calculate the entropy
H(S) = -[0.7 * log2(0.7) + 0.3 * log2(0.3)] = 0.881
This means that the dataset has an entropy of 0.881. If the entropy is closer to 0, it means that the
dataset is more pure and easier to classify, while if the entropy is closer to 1, it means that the
dataset is more impure and harder to classify. Therefore, when building a decision tree, we want
to choose the feature that gives the lowest entropy, as this will result in the best split of the data
into the different classes.
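A minimal Python sketch of this entropy calculation, reproducing the 70/30 example above:

    # Minimal sketch: entropy of a class distribution.
    import math

    def entropy(proportions):
        # H(S) = -sum(p_i * log2(p_i)) over non-zero class proportions
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    print(entropy([0.7, 0.3]))   # ~0.881, matching the worked example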
Information Gain-
Information gain is a measure of the amount of information gained by splitting a dataset on a
particular feature in decision tree classification. It is calculated by comparing the entropy or Gini
index of the dataset before and after the split. The feature that gives the highest information gain
is chosen as the feature to split on.
For example, suppose a dataset of 100 samples, 40 of class A and 60 of class B, with a "color" feature that is red for 40 samples (20 of class A, 20 of class B) and blue for 60 samples (20 of class A, 40 of class B).
Step 1: Calculate the entropy before the split
H(S) = -[0.4 * log2(0.4) + 0.6 * log2(0.6)] = 0.971
Step 2: Calculate the entropy after splitting on "color"
For red:
H(S_red) = -[0.5 * log2(0.5) + 0.5 * log2(0.5)] = 1.0
For blue:
H(S_blue) = -[0.333 * log2(0.333) + 0.667 * log2(0.667)] = 0.918
Step 3: Calculate the information gain
IG(color) = H(S) - [(40/100) * H(S_red) + (60/100) * H(S_blue)]
= 0.971 - [0.4 * 1.0 + 0.6 * 0.918]
= 0.02
This means that the information gain for splitting on the "color" feature is about 0.02. If we compare this with the information gain for other features, we can determine which feature is the best to split on. The higher the information gain, the better the feature is for splitting the dataset.
Therefore, in summary, information gain measures the reduction in entropy or Gini index achieved
by splitting a dataset on a particular feature. The feature that provides the highest information gain
is chosen as the splitting feature for a decision tree.
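A minimal Python sketch reproducing the information gain computation from the example above:

    # Minimal sketch: information gain for the "color" split in the example above.
    import math

    def entropy(proportions):
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    h_parent = entropy([0.4, 0.6])                 # ~0.971 (40 class A / 60 class B)
    h_red = entropy([0.5, 0.5])                    # 1.0    (20 A / 20 B)
    h_blue = entropy([1/3, 2/3])                   # ~0.918 (20 A / 40 B)
    h_children = (40/100) * h_red + (60/100) * h_blue

    print(h_parent - h_children)                   # information gain ~0.02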
Gini Index-
Gini index is a metric used to measure the impurity of a dataset in decision tree classification. A
dataset is considered pure if all of its samples belong to the same class. If a dataset is impure, it
means that there is more than one class present in the dataset. The Gini index measures the
probability of misclassifying a randomly chosen sample in the dataset.
The Gini index can be calculated using the following formula:
Gini(S) = 1 - ∑(p_i)^2
where p_i is the proportion of samples in the dataset that belong to the ith class. A Gini index of 0 means the node is perfectly pure.
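A minimal Python sketch of the Gini index, applied to the same 40/60 class split used in the entropy example:

    # Minimal sketch: Gini index of a class distribution.
    def gini(proportions):
        # Gini(S) = 1 - sum(p_i^2)
        return 1 - sum(p * p for p in proportions)

    print(gini([0.4, 0.6]))   # 0.48; a value of 0 would mean a perfectly pure node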
The advantages of greedy backward selection include that it can be faster than greedy forward
selection and may find a better feature subset by removing irrelevant or redundant features.
However, like greedy forward selection, it may be prone to local optima and may not find the
globally optimal feature subset.
Overall, both greedy forward and backward selection methods are simple and effective methods
for feature selection, and their choice depends on the specific problem and dataset. It is
recommended to use cross-validation to evaluate the performance of the selected feature subset
and to avoid overfitting.
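As an illustration of greedy forward selection with cross-validated evaluation (assuming scikit-learn 0.24+ for SequentialFeatureSelector), here is a minimal sketch on a synthetic dataset; switching direction to "backward" gives greedy backward selection.

    # Minimal sketch: greedy forward selection with cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=12,
                               n_informative=4, random_state=0)

    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=4,
                                    direction="forward",   # "backward" for backward elimination
                                    cv=5)                  # cross-validated evaluation
    sfs.fit(X, y)
    print("Selected features:", sfs.get_support())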
Predictive Maintenance:
Predictive maintenance is the process of predicting when machines or equipment are likely to fail,
so that maintenance can be scheduled before a failure occurs. Feature extraction and selection
algorithms can be used to extract relevant features from sensors such as accelerometers,
thermocouples, and vibration sensors. These features can be used to train machine learning models
that can predict when equipment is likely to fail, and schedule maintenance accordingly. This can
help reduce downtime and maintenance costs.
Energy Efficiency:
Feature extraction and selection algorithms are also used in the field of energy efficiency. For
example, in heating, ventilation, and air conditioning (HVAC) systems, features such as
temperature, humidity, and airflow can be extracted from sensors to predict energy consumption
and optimize system performance. These features can be used to train machine learning models
that can adjust system parameters to minimize energy consumption while maintaining comfort
levels.
Robotics:
Feature extraction and selection algorithms are also used in robotics to extract relevant features
from sensor data such as lidar and camera images. These features can be used to enable
autonomous navigation, object recognition, and grasping. For example, machine learning models
can be trained to recognize objects in a cluttered environment, which can help robots pick and
place objects accurately.
In conclusion, feature extraction and selection algorithms have numerous applications in
Mechanical Engineering, ranging from structural health monitoring to robotics. These algorithms
can help extract relevant features from sensor data, and enable the development of machine
learning models that can improve safety, quality, efficiency, and productivity in various
mechanical engineering applications.
Dr. D.Y. Patil Institute of Technology, Pimpri, Pune. Prof. J.S. Narkhede