Feature Engineering and Normalization

Feature Engineering

What is a feature?

Generally, all machine learning algorithms take input data to generate output. The input
data is usually in tabular form, consisting of rows (instances or observations) and columns
(variables or attributes), and these attributes are often known as features. For example, in
computer vision an image is an instance, and a line in the image could be a feature.
Similarly, in NLP a document can be an observation, and a word count could be a
feature. So, we can say that a feature is an attribute that impacts a problem or is useful for
solving it.

What is Feature Engineering?

Feature engineering is the pre-processing step of machine learning that extracts
features from raw data. It helps represent the underlying problem to predictive models in
a better way, which in turn improves the model's accuracy on unseen data. A
predictive model contains predictor variables and an outcome variable, and the feature
engineering process selects the most useful predictor variables for the model.

Since 2016, automated feature engineering has also been used in different machine learning
software packages to extract features from raw data automatically. Feature engineering in ML
mainly consists of four processes: Feature Creation, Transformations, Feature Extraction,
and Feature Selection.

These processes are described below:

1. Feature Creation: Feature creation is finding the most useful variables to be used in
a predictive model. The process is subjective, and it requires human creativity and
intervention. New features are created by combining existing features using operations
such as addition, subtraction, and ratios, and these new features offer great flexibility
(a minimal sketch follows this list).
2. Transformations: The transformation step of feature engineering involves adjusting
the predictor variables to improve the accuracy and performance of the model. For
example, it ensures that the model can flexibly take a variety of data as input and
that all the variables are on the same scale, making the model easier to
understand. It improves the model's accuracy and keeps all the features
within an acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering process
that generates new variables by extracting them from the raw data. The main aim of
this step is to reduce the volume of data so that it can be easily used and managed for
data modelling. Feature extraction methods include cluster analysis, text analytics,
edge detection algorithms, and principal component analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few
variables in the dataset are useful for building the model, and the remaining features are
either redundant or irrelevant. Feeding the dataset with all these redundant and
irrelevant features into the model may negatively impact its overall performance and
accuracy. Hence, it is very important to identify and select the most
appropriate features from the data and remove the irrelevant or less important
ones, which is done with the help of feature selection in machine
learning. "Feature selection is a way of selecting the subset of the most relevant
features from the original feature set by removing redundant, irrelevant, or
noisy features."
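
As a small illustration of feature creation, the minimal sketch below (Python with pandas; the column names total_price and quantity are hypothetical) derives new features from existing ones using simple arithmetic:

```python
import pandas as pd

# Hypothetical dataset with two existing numeric features
df = pd.DataFrame({
    "total_price": [250.0, 120.0, 600.0],
    "quantity": [5, 2, 10],
})

# Feature creation: combine existing columns with simple arithmetic
df["unit_price"] = df["total_price"] / df["quantity"]        # ratio of two features
df["price_minus_qty"] = df["total_price"] - df["quantity"]   # subtraction of two features

print(df)
```

Which combinations are worth keeping is a judgment call; domain knowledge usually guides which derived features are meaningful.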

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that researchers can easily interpret it.
o It reduces the training time.
o It reduces overfitting, hence enhancing generalization.

Need for Feature Engineering in Machine Learning

In machine learning, the performance of a model depends on data pre-processing and data
handling. If we create a model without pre-processing or proper data handling, it may not
give good accuracy, whereas if we apply feature engineering to the same data, the
accuracy of the model is enhanced. Hence, feature engineering in machine learning improves
the model's performance. Below are some points that explain the need for feature
engineering:
o Better features mean flexibility.
In machine learning, we always try to choose the optimal model to get good results.
However, sometimes we can still get good predictions even after choosing a less
suitable model, and this is because of better features. Flexible features
enable you to select less complex models, which are faster to run and easier to
understand and maintain, which is always desirable.
o Better features mean simpler models.
If we feed well-engineered features to our model, then even with sub-optimal
parameters we can have good outcomes. After feature engineering, it is not necessary
to work as hard at picking the right model with the most optimized parameters. If we
have good features, we can better represent the complete data and use it to best
characterize the given problem.
o Better features mean better results.
As already discussed, in machine learning, the output we get depends on the data
we provide. So, to obtain better results, we must use better features.

Steps in Feature Engineering

The steps of feature engineering may vary from one data scientist or ML engineer to another.
However, some common steps are involved in most machine learning workflows, and these
steps are as follows:

o Data Preparation: The first step is data preparation. In this step, raw data acquired
from different sources is prepared and put into a suitable format so that it can be
used in the ML model. Data preparation may include cleaning, delivery, augmentation,
fusion, ingestion, or loading of data.
o Exploratory Analysis: Exploratory analysis, or exploratory data analysis (EDA), is an
important step of feature engineering, which is mainly used by data scientists. This
step involves analyzing and investigating the data set and summarizing its main
characteristics. Different data visualization techniques are used to better
understand the manipulation of data sources, to find the most appropriate statistical
technique for data analysis, and to select the best features for the data.
o Benchmark: Benchmarking is the process of setting a standard baseline for accuracy and
comparing all the variables against this baseline. The benchmarking process is used to
improve the predictability of the model and reduce the error rate.

Feature Engineering Techniques

Some of the popular feature engineering techniques include:


1. Imputation

Feature engineering deals with inappropriate data, missing values, human errors,
insufficient data sources, etc. Missing values within the dataset strongly affect
the performance of the algorithm, and the "Imputation" technique is used to deal with
them. Imputation is responsible for handling irregularities within the dataset.

For example, one option is to remove a complete row or column that has a huge
percentage of missing values. But at the same time, to maintain the data size, it is
often necessary to impute the missing data instead, which can be done as follows (a minimal
sketch follows this list):

o For numerical data imputation, a default value can be imputed in a column, or
missing values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most
frequently occurring value in the column.
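
The minimal sketch below illustrates both cases with scikit-learn's SimpleImputer; the dataset, the column names age and city, and the chosen strategies are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing numerical and categorical values
df = pd.DataFrame({
    "age":  [25, np.nan, 40, 35, np.nan],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
})

# Numerical imputation: fill missing values with the column median
num_imputer = SimpleImputer(strategy="median")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Categorical imputation: fill missing values with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)
```

A plain pandas fillna with the column mean, median, or mode works just as well for simple cases; SimpleImputer is convenient when imputation should be part of a scikit-learn pipeline.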

2. Handling Outliers

Outliers are deviated values or data points that lie too far away from the other data
points, in such a way that they badly affect the performance of the model. Outliers can be
handled with this feature engineering technique, which first identifies the outliers
and then removes them.

Standard deviation can be used to identify the outliers. For example, each value has a
certain distance from the mean, but if a value lies farther away than a chosen threshold,
it can be considered an outlier. The Z-score can also be used to detect outliers, as
sketched below.
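
A minimal Z-score sketch (Python with pandas and NumPy; the income values and the threshold of 2 are hypothetical, and a threshold of 3 is also common on larger datasets):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with one extreme value
df = pd.DataFrame({"income": [42_000, 45_000, 39_000, 41_000, 44_000,
                              38_000, 43_000, 40_000, 250_000]})

# Z-score: how many standard deviations each value lies from the mean
z_scores = (df["income"] - df["income"].mean()) / df["income"].std()

# Flag values more than 2 standard deviations away as outliers
outliers = df[np.abs(z_scores) > 2]
df_clean = df[np.abs(z_scores) <= 2]

print(outliers)   # the extreme 250_000 row
print(df_clean)   # the remaining rows
```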

3. Binning

In machine learning, overfitting is one of the main issues that degrade the performance of the
model, and it occurs due to a large number of parameters and noisy data. However, "binning",
one of the popular feature engineering techniques, can be used to normalize the noisy data.
This process involves segmenting the values of a feature into bins.
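
A minimal binning sketch using pandas (the age values, bin edges, and labels are hypothetical):

```python
import pandas as pd

# Hypothetical continuous feature (age in years)
df = pd.DataFrame({"age": [3, 17, 25, 34, 47, 61, 78]})

# Fixed-width binning: segment the continuous values into labelled bins
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 100],
    labels=["child", "young_adult", "adult", "senior"],
)

# Quantile binning: each bin holds roughly the same number of rows
df["age_quartile"] = pd.qcut(df["age"], q=4, labels=False)

print(df)
```

Binning smooths out small fluctuations and noise at the cost of some precision, which is the trade-off that helps reduce overfitting.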

4. Feature Split

As the name suggests, feature split is the process of splitting a feature into two or
more parts to make new features. This technique helps the algorithms to
better understand and learn the patterns in the dataset.

The feature splitting process enables the new features to be clustered and binned, which
results in extracting useful information and improving the performance of the data models.
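
A minimal feature-split sketch in pandas (the full_name column and the split into first and last name are hypothetical):

```python
import pandas as pd

# Hypothetical dataset where one column holds two pieces of information
df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"]})

# Split the single feature into two new features
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

print(df)
```

The same idea applies to dates (splitting into year, month, day) or addresses (splitting into city and postcode).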

5. One hot encoding

One-hot encoding is a popular encoding technique in machine learning. It converts
categorical data into a form that machine learning algorithms can easily understand
and use to make good predictions. It enables the grouping of categorical data without
losing any information.
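
A minimal one-hot encoding sketch using pandas (the colour column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category, no ordering implied
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour")

print(encoded)
```

Scikit-Learn's OneHotEncoder achieves the same result and fits naturally into a preprocessing pipeline.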

Normalization in Machine Learning


Normalization is one of the most frequently used data preparation techniques. It helps
us change the values of numeric columns in the dataset to a common scale.

Although normalization is not mandatory for every dataset in machine learning, it is
used whenever the attributes of the dataset have different ranges. It helps to enhance the
performance and reliability of a machine learning model.

What is Normalization in Machine Learning?

Normalization is a scaling technique in machine learning applied during data preparation to
change the values of numeric columns in the dataset to a common scale. It is not
necessary for every dataset; it is required only when the features of a machine learning
model have different ranges.

Mathematically, we can calculate normalization with the below formula:

Xn = (X - Xmin) / (Xmax - Xmin)

o Xn = normalized value
o X = original value of the feature
o Xmax = maximum value of the feature
o Xmin = minimum value of the feature

Example: Let's assume we have a dataset whose feature has the maximum and minimum values
described above. To normalize it, the values are shifted and rescaled so that they range
between 0 and 1, as sketched below.
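
A minimal sketch of the formula in NumPy (the feature values are hypothetical):

```python
import numpy as np

# Hypothetical feature values
x = np.array([120.0, 150.0, 180.0, 200.0, 100.0])

# Xn = (X - Xmin) / (Xmax - Xmin): rescales every value into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # [0.2 0.5 0.8 1.  0. ] -- the minimum maps to 0 and the maximum to 1
```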

Normalization techniques in Machine Learning

Although there are many feature normalization techniques in machine learning, a few of
them are the most frequently used. These are as follows:

o Min-Max Scaling: This technique is also simply referred to as scaling. As we have already
discussed above, the Min-Max scaling method shifts and rescales the values of the
attributes so that they end up ranging between 0 and 1.

Standardization scaling:
Standardization scaling is also known as Z-score normalization, in which values are centered
around the mean with a unit standard deviation, which means the mean of the attribute becomes
zero and the resulting distribution has a unit standard deviation. Mathematically, we can
calculate standardization by subtracting the mean from the feature value and dividing by the
standard deviation.

Hence, standardization can be expressed as follows:

X' = (X - µ) / σ

Here, µ represents the mean of the feature values, and σ represents the standard deviation of
the feature values.

However, unlike the Min-Max scaling technique, feature values are not restricted to a specific
range in the standardization technique.

This technique is helpful for various machine learning algorithms that use distance measures,
such as KNN, K-means clustering, and principal component analysis. Further, it is useful when
the model assumes that the data is normally distributed.
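
A minimal sketch of both scalers using Scikit-Learn (the two-feature array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical two-feature dataset with very different ranges
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Min-Max scaling: each column is rescaled into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (Z-score): each column ends up with mean 0 and unit standard deviation
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```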

Difference between Normalization and Standardization

Normalization | Standardization
Uses the minimum and maximum values of the feature for scaling. | Uses the mean and standard deviation of the feature for scaling.
It is helpful when features are on different scales. | It is helpful when we want the mean of a variable to be 0 and the standard deviation to be 1.
Scales values to the range [0, 1] or [-1, 1]. | Scaled values are not restricted to a specific range.
It is affected by outliers. | It is comparatively less affected by outliers.
Scikit-Learn provides the MinMaxScaler transformer for normalization. | Scikit-Learn provides the StandardScaler transformer for standardization.
It is also called scaling normalization. | It is also known as Z-score normalization.
It is useful when the feature distribution is unknown. | It is useful when the feature distribution is normal (Gaussian).

When to use Normalization or Standardization?

Which is suitable for our machine learning model, normalization or standardization? This is
a common point of confusion among data scientists and machine learning engineers.
Although both terms serve almost the same purpose, the choice between normalization and
standardization depends on your problem and the algorithm you are using.

1. Normalization is a transformation technique that helps to improve the performance as well
as the accuracy of your model. Normalization of a machine learning model is useful when you
don't know the feature distribution exactly, in other words, when the feature distribution of
the data does not follow a Gaussian (bell curve) distribution. Normalization maps values into
a bounded range, so if you have outliers in your data, they will strongly influence the
scaling.

Further, it is also useful for algorithms such as KNN and artificial neural networks that
make no assumptions about the distribution of the data.

2. Standardization in a machine learning model is useful when you know the feature
distribution of the data exactly, in other words, when your data follows a Gaussian
distribution (though this does not have to be strictly true). Unlike normalization,
standardization does not have a bounded range, so outliers in your data affect it
comparatively less.

Further, it is also useful for techniques that benefit from standardized inputs, such as
linear regression, logistic regression, and linear discriminant analysis.