Feature Engineering and Normalization

Feature Engineering

What is a feature?

Generally, all machine learning algorithms take input data to generate output. The input
data is usually in tabular form, consisting of rows (instances or observations) and columns
(variables or attributes), and these attributes are often known as features. For example, in
computer vision an image is an instance, and a line in the image could be a feature.
Similarly, in NLP a document can be an observation, and a word count could be a
feature. So, we can say that a feature is an attribute that impacts a problem or is useful for
solving it.

What is Feature Engineering?

Feature engineering is the pre-processing step of machine learning that extracts
features from raw data. It helps represent the underlying problem to predictive models in
a better way, which in turn improves the model's accuracy on unseen data. A
predictive model contains predictor variables and an outcome variable, and the feature
engineering process selects the most useful predictor variables for the model.

Since 2016, automated feature engineering has also been used in different machine learning
software packages to extract features from raw data automatically. Feature engineering in ML
mainly consists of four processes: Feature Creation, Transformations, Feature Extraction,
and Feature Selection.

These processes are described below:

1. Feature Creation: Feature creation is finding the most useful variables to be used in
a predictive model. The process is subjective, and it requires human creativity and
intervention. New features are created by combining existing features using operations
such as addition, subtraction, and ratios, and these new features offer great flexibility
(a minimal sketch follows this list).
2. Transformations: The transformation step of feature engineering involves adjusting
the predictor variables to improve the accuracy and performance of the model. For
example, it ensures that the model can flexibly take a variety of data as input and
that all the variables are on the same scale, making the model easier to
understand. It improves the model's accuracy and keeps all the features
within an acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering process
that generates new variables by extracting them from the raw data. The main aim of
this step is to reduce the volume of data so that it can be easily used and managed for
data modelling. Feature extraction methods include cluster analysis, text analytics,
edge detection algorithms, and principal component analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few
variables in the dataset are useful for building the model, and the remaining features are
either redundant or irrelevant. Feeding the dataset with all these redundant and
irrelevant features into the model may negatively impact its overall performance and
accuracy. Hence, it is very important to identify and select the most
appropriate features from the data and remove the irrelevant or less important
ones, which is done with the help of feature selection in machine
learning. "Feature selection is a way of selecting the subset of the most relevant
features from the original feature set by removing redundant, irrelevant, or
noisy features."
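
As a small illustration of feature creation, the minimal sketch below (Python with pandas; the column names total_price and quantity are hypothetical) derives new features from existing ones using simple arithmetic:

```python
import pandas as pd

# Hypothetical dataset with two existing numeric features
df = pd.DataFrame({
    "total_price": [250.0, 120.0, 600.0],
    "quantity": [5, 2, 10],
})

# Feature creation: combine existing columns with simple arithmetic
df["unit_price"] = df["total_price"] / df["quantity"]        # ratio of two features
df["price_minus_qty"] = df["total_price"] - df["quantity"]   # subtraction of two features

print(df)
```

Which combinations are worth keeping is a judgment call; domain knowledge usually guides which derived features are meaningful.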

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that researchers can easily interpret it.
o It reduces the training time.
o It reduces overfitting, hence enhancing generalization.

Need for Feature Engineering in Machine Learning

In machine learning, the performance of a model depends on data pre-processing and data
handling. If we create a model without pre-processing or proper data handling, it may not
give good accuracy, whereas if we apply feature engineering to the same data, the
accuracy of the model is enhanced. Hence, feature engineering in machine learning improves
the model's performance. Below are some points that explain the need for feature
engineering:
o Better features mean flexibility.
In machine learning, we always try to choose the optimal model to get good results.
However, sometimes we can still get good predictions even after choosing a less
suitable model, and this is because of better features. Flexible features
enable you to select less complex models, which are faster to run and easier to
understand and maintain, which is always desirable.
o Better features mean simpler models.
If we feed well-engineered features to our model, then even with sub-optimal
parameters we can have good outcomes. After feature engineering, it is not necessary
to work as hard at picking the right model with the most optimized parameters. If we
have good features, we can better represent the complete data and use it to best
characterize the given problem.
o Better features mean better results.
As already discussed, in machine learning, the output we get depends on the data
we provide. So, to obtain better results, we must use better features.

Steps in Feature Engineering

The steps of feature engineering may vary from one data scientist or ML engineer to another.
However, some common steps are involved in most machine learning workflows, and these
steps are as follows:

o Data Preparation: The first step is data preparation. In this step, raw data acquired
from different sources is prepared and put into a suitable format so that it can be
used in the ML model. Data preparation may include cleaning, delivery, augmentation,
fusion, ingestion, or loading of data.
o Exploratory Analysis: Exploratory analysis, or exploratory data analysis (EDA), is an
important step of feature engineering, which is mainly used by data scientists. This
step involves analyzing and investigating the data set and summarizing its main
characteristics. Different data visualization techniques are used to better
understand the manipulation of data sources, to find the most appropriate statistical
technique for data analysis, and to select the best features for the data.
o Benchmark: Benchmarking is the process of setting a standard baseline for accuracy and
comparing all the variables against this baseline. The benchmarking process is used to
improve the predictability of the model and reduce the error rate.

Feature Engineering Techniques

Some of the popular feature engineering techniques include:


1. Imputation

Feature engineering deals with inappropriate data, missing values, human errors,
insufficient data sources, etc. Missing values within the dataset strongly affect
the performance of the algorithm, and the "Imputation" technique is used to deal with
them. Imputation is responsible for handling irregularities within the dataset.

For example, one option is to remove a complete row or column that has a huge
percentage of missing values. But at the same time, to maintain the data size, it is
often necessary to impute the missing data instead, which can be done as follows (a minimal
sketch follows this list):

o For numerical data imputation, a default value can be imputed in a column, or
missing values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most
frequently occurring value in the column.
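
The minimal sketch below illustrates both cases with scikit-learn's SimpleImputer; the dataset, the column names age and city, and the chosen strategies are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing numerical and categorical values
df = pd.DataFrame({
    "age":  [25, np.nan, 40, 35, np.nan],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Numerical imputation: fill missing values with the column median
num_imputer = SimpleImputer(strategy="median")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Categorical imputation: fill missing values with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)
```

A plain pandas fillna with the column mean, median, or mode works just as well for simple cases; SimpleImputer is convenient when imputation should be part of a scikit-learn pipeline.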

2. Handling Outliers

Outliers are deviated values or data points that lie too far away from the other data
points, in such a way that they badly affect the performance of the model. Outliers can be
handled with this feature engineering technique, which first identifies the outliers
and then removes them.

Standard deviation can be used to identify the outliers. For example, each value has a
certain distance from the mean, but if a value lies farther away than a chosen threshold,
it can be considered an outlier. The Z-score can also be used to detect outliers, as
sketched below.
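
A minimal Z-score sketch (Python with pandas and NumPy; the income values and the threshold of 2 are hypothetical, and a threshold of 3 is also common on larger datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with one extreme value
df = pd.DataFrame({"income": [42_000, 45_000, 39_000, 41_000, 44_000,
                              38_000, 43_000, 40_000, 250_000]})

# Z-score: how many standard deviations each value lies from the mean
z_scores = (df["income"] - df["income"].mean()) / df["income"].std()

# Flag values more than 2 standard deviations away as outliers
outliers = df[np.abs(z_scores) > 2]
df_clean = df[np.abs(z_scores) <= 2]

print(outliers)   # the extreme 250_000 row
print(df_clean)   # the remaining rows
```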

3. Binning

In machine learning, overfitting is one of the main issues that degrade the performance of the
model, and it occurs due to a large number of parameters and noisy data. However, "binning",
one of the popular feature engineering techniques, can be used to normalize the noisy data.
This process involves segmenting the values of a feature into bins.
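
A minimal binning sketch using pandas (the age values, bin edges, and labels are hypothetical):

```python
import pandas as pd

# Hypothetical continuous feature (age in years)
df = pd.DataFrame({"age": [3, 17, 25, 34, 47, 61, 78]})

# Fixed-width binning: segment the continuous values into labelled bins
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 100],
    labels=["child", "young_adult", "adult", "senior"],
)

# Quantile binning: each bin holds roughly the same number of rows
df["age_quartile"] = pd.qcut(df["age"], q=4, labels=False)

print(df)
```

Binning smooths out small fluctuations and noise at the cost of some precision, which is the trade-off that helps reduce overfitting.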

4. Feature Split

As the name suggests, feature split is the process of splitting a feature into two or
more parts to make new features. This technique helps the algorithms to
better understand and learn the patterns in the dataset.

The feature splitting process enables the new features to be clustered and binned, which
results in extracting useful information and improving the performance of the data models.
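
A minimal feature-split sketch in pandas (the full_name column and the split into first and last name are hypothetical):

```python
import pandas as pd

# Hypothetical dataset where one column holds two pieces of information
df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"]})

# Split the single feature into two new features
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

print(df)
```

The same idea applies to dates (splitting into year, month, day) or addresses (splitting into city and postcode).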

5. One hot encoding

One-hot encoding is a popular encoding technique in machine learning. It converts
categorical data into a form that machine learning algorithms can easily understand
and use to make good predictions. It enables the grouping of categorical data without
losing any information.
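
A minimal one-hot encoding sketch using pandas (the colour column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category, no ordering implied
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour")

print(encoded)
```

Scikit-Learn's OneHotEncoder achieves the same result and fits naturally into a preprocessing pipeline.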

Normalization in Machine Learning


Normalization is one of the most frequently used data preparation techniques. It helps
us change the values of numeric columns in the dataset to a common scale.

Although normalization is not mandatory for every dataset in machine learning, it is
used whenever the attributes of the dataset have different ranges. It helps to enhance the
performance and reliability of a machine learning model.

What is Normalization in Machine Learning?

Normalization is a scaling technique in machine learning applied during data preparation to
change the values of numeric columns in the dataset to a common scale. It is not
necessary for every dataset; it is required only when the features of a machine learning
model have different ranges.

Mathematically, we can calculate normalization with the below formula:

Xn = (X - Xmin) / (Xmax - Xmin)

o Xn = normalized value
o X = original value of the feature
o Xmax = maximum value of the feature
o Xmin = minimum value of the feature

Example: Let's assume we have a dataset whose feature has the maximum and minimum values
described above. To normalize it, the values are shifted and rescaled so that they range
between 0 and 1, as sketched below.
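
A minimal sketch of the formula in NumPy (the feature values are hypothetical):

```python
import numpy as np

# Hypothetical feature values
x = np.array([120.0, 150.0, 180.0, 200.0, 100.0])

# Xn = (X - Xmin) / (Xmax - Xmin): rescales every value into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # [0.2 0.5 0.8 1.  0. ] -- the minimum maps to 0 and the maximum to 1
```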

Normalization techniques in Machine Learning

Although there are many feature normalization techniques in machine learning, a few of
them are the most frequently used. These are as follows:

o Min-Max Scaling: This technique is also simply referred to as scaling. As we have already
discussed above, the Min-Max scaling method shifts and rescales the values of the
attributes so that they end up ranging between 0 and 1.

Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which values are centered
around the mean with a unit standard deviation, which means the mean of the attribute becomes
zero and the resulting distribution has a unit standard deviation. Mathematically, we can
calculate standardization by subtracting the mean from the feature value and dividing by the
standard deviation.

Hence, standardization can be expressed as follows:

X' = (X - µ) / σ

Here, µ represents the mean of the feature values, and σ represents the standard deviation of
the feature values.

However, unlike the Min-Max scaling technique, feature values are not restricted to a specific
range in the standardization technique.

This technique is helpful for various machine learning algorithms that use distance measures,
such as KNN, K-means clustering, and principal component analysis. Further, it is useful when
the model assumes that the data is normally distributed.
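
A minimal sketch of both scalers using Scikit-Learn (the two-feature array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical two-feature dataset with very different ranges
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Min-Max scaling: each column is rescaled into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (Z-score): each column ends up with mean 0 and unit standard deviation
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```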

Difference between Normalization and Standardization

Normalization | Standardization
Uses the minimum and maximum values of the feature for scaling. | Uses the mean and standard deviation of the feature for scaling.
It is helpful when features are on different scales. | It is helpful when we want the mean of a variable to be 0 and the standard deviation to be 1.
Scales values to the range [0, 1] or [-1, 1]. | Scaled values are not restricted to a specific range.
It is affected by outliers. | It is comparatively less affected by outliers.
Scikit-Learn provides the MinMaxScaler transformer for normalization. | Scikit-Learn provides the StandardScaler transformer for standardization.
It is also called scaling normalization. | It is also known as Z-score normalization.
It is useful when the feature distribution is unknown. | It is useful when the feature distribution is normal (Gaussian).

When to use Normalization or Standardization?

Which is suitable for our machine learning model, normalization or standardization? This is
a common point of confusion among data scientists and machine learning engineers.
Although both terms serve almost the same purpose, the choice between normalization and
standardization depends on your problem and the algorithm you are using.

1. Normalization is a transformation technique that helps to improve the performance as well
as the accuracy of your model. Normalization of a machine learning model is useful when you
don't know the feature distribution exactly, in other words, when the feature distribution of
the data does not follow a Gaussian (bell curve) distribution. Normalization maps values into
a bounded range, so if you have outliers in your data, they will strongly influence the
scaling.

Further, it is also useful for algorithms such as KNN and artificial neural networks that
make no assumptions about the distribution of the data.

2. Standardization in a machine learning model is useful when you know the feature
distribution of the data exactly, in other words, when your data follows a Gaussian
distribution (though this does not have to be strictly true). Unlike normalization,
standardization does not have a bounded range, so outliers in your data affect it
comparatively less.

Further, it is also useful for techniques that benefit from standardized inputs, such as
linear regression, logistic regression, and linear discriminant analysis.