Feature Engineering

What is Feature Engineering?

Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model. It involves selecting relevant
information from raw data and transforming it into a format that can be easily understood by
a model. The goal is to improve model accuracy by providing more meaningful and relevant
information.

Feature engineering is the process of transforming raw data into features that are suitable for
machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.
The success of machine learning models heavily depends on the quality of the features used
to train them. Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones. These techniques help to highlight
the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.

What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.
 For example, in a dataset of housing prices, features could include the number of
bedrooms, the square footage, the location, and the age of the property. In a dataset of
customer demographics, features could include age, gender, income level, and
occupation.
 The choice and quality of features are critical in machine learning, as they can greatly
impact the accuracy and performance of the model.

Need for Feature Engineering in Machine Learning


We engineer features for various reasons, and some of the main reasons include:
 Improve User Experience: The primary reason we engineer features is to enhance the user
experience of a product or service. By adding new features, we can make the product
more intuitive, efficient, and user-friendly, which can increase user satisfaction and
engagement.
 Competitive Advantage: Another reason we engineer features is to gain a competitive
advantage in the marketplace. By offering unique and innovative features, we can
differentiate our product from competitors and attract more customers.
 Meet Customer Needs: We engineer features to meet the evolving needs of customers. By
analyzing user feedback, market trends, and customer behavior, we can identify areas
where new features could enhance the product’s value and meet customer needs.
 Increase Revenue: Features can also be engineered to generate more revenue. For
example, a new feature that streamlines the checkout process can increase sales, or a
feature that provides additional functionality could lead to more upsells or cross-sells.
 Future-Proofing: Engineering features can also be done to future-proof a product or
service. By anticipating future trends and potential customer needs, we can develop
features that ensure the product remains relevant and useful in the long term.

Processes Involved in Feature Engineering


Feature engineering in machine learning mainly consists of five processes:
 Feature Creation
 Feature Transformation
 Feature Extraction
 Feature Selection
 Feature Scaling
It is an iterative process that requires experimentation and testing to find the best combination
of features for a given problem. The success of a machine learning model largely depends on
the quality of the features used in the model.

1. Feature Creation
Feature Creation is the process of generating new features based on domain knowledge or by
observing patterns in the data. It is a form of feature engineering that can significantly
improve the performance of a machine-learning model.

Types of Feature Creation:


1. Domain-Specific: Creating new features based on domain knowledge, such as creating
features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as calculating
aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or synthesizing new
data points.

Why Feature Creation?


1. Improves Model Performance: By providing additional and more relevant information to
the model, feature creation can increase the accuracy and precision of the model.
2. Increases Model Robustness: By adding additional features, the model can become more
robust to outliers and other anomalies.
3. Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.
4. Increases Model Flexibility: By adding new features, the model can be made more
flexible to handle different types of data.
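
As an illustration, here is a minimal sketch of domain-specific and data-driven feature creation with pandas, assuming a hypothetical housing table with year_built, square_footage, and num_bedrooms columns:

```python
import pandas as pd

# Hypothetical housing data; the column names are illustrative only.
df = pd.DataFrame({
    "square_footage": [850, 1200, 1750, 2100],
    "num_bedrooms":   [2, 3, 3, 4],
    "year_built":     [1995, 2004, 2010, 2018],
})

# Domain-specific feature: age of the property (2024 is an illustrative reference year).
df["property_age"] = 2024 - df["year_built"]

# Data-driven feature: interaction between size and bedroom count.
df["sqft_per_bedroom"] = df["square_footage"] / df["num_bedrooms"]

print(df)
```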

2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model. This is done to ensure that the model can
effectively learn from the data.
Types of Feature Transformation:
1. Normalization: Rescaling the features to have a similar range, such as between 0 and 1, to
prevent some features from dominating others.
2. Scaling: Rescaling numerical features so that they share a similar scale, such as a
standard deviation of 1, so that they can be compared more easily and the model
considers all features equally.
3. Encoding: Transforming categorical features into a numerical representation. Examples
are one-hot encoding and label encoding.
4. Transformation: Transforming the features using mathematical operations to change the
distribution or scale of the features. Examples are logarithmic, square root, and reciprocal
transformations.
Why Feature Transformation?
1. Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
2. Increases Model Robustness: Transforming the features can make the model more robust
to outliers and other anomalies.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By transforming the features, it can be easier to
understand the model’s predictions.
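
For instance, here is a minimal sketch of two of the transformations listed above, encoding a categorical column with scikit-learn's LabelEncoder and applying a square-root transform to a numeric one; the data is purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Purely illustrative data.
df = pd.DataFrame({
    "city":   ["London", "Paris", "London", "Tokyo"],
    "visits": [1, 4, 9, 16],
})

# Encoding: map each category to an integer label.
df["city_encoded"] = LabelEncoder().fit_transform(df["city"])

# Transformation: square-root transform to compress larger values.
df["visits_sqrt"] = np.sqrt(df["visits"])

print(df)
```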

3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. This is done by transforming,
combining, or aggregating existing features.
Types of Feature Extraction:
1. Dimensionality Reduction: Reducing the number of features by transforming the data into
a lower-dimensional space while retaining important information. Examples
are PCA and t-SNE.
2. Feature Combination: Combining two or more existing features to create a new one. For
example, the interaction between two features.
3. Feature Aggregation: Aggregating features to create a new one. For example, calculating
the mean, sum, or count of a set of features.
4. Feature Transformation: Transforming existing features into a new representation. For
example, log transformation of a feature with a skewed distribution.
Why Feature Extraction?
1. Improves Model Performance: By creating new and more relevant features, the model can
learn more meaningful patterns in the data.
2. Reduces Overfitting: By reducing the dimensionality of the data, the model is less likely
to overfit the training data.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.
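
As a small example, the sketch below applies scikit-learn's PCA to a synthetic feature matrix in which five columns really carry only two underlying dimensions; the data is fabricated purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic feature matrix: 100 samples, 5 correlated features built from 2 latent ones.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Extract 2 new features (principal components) that retain most of the variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this synthetic data
```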

4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to
be used in a machine-learning model. It is an important step in the feature engineering
process as it can have a significant impact on the model’s performance.
Types of Feature Selection:
1. Filter Method: Based on the statistical measure of the relationship between the feature
and the target variable. Features with a high correlation are selected.
2. Wrapper Method: Based on the evaluation of the feature subset using a specific machine
learning algorithm. The feature subset that results in the best performance is selected.
3. Embedded Method: Based on the feature selection as part of the training process of the
machine learning algorithm.
Why Feature Selection?
1. Reduces Overfitting: By using only the most relevant features, the model can generalize
better to new data.
2. Improves Model Performance: Selecting the right features can improve the accuracy,
precision, and recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less computation
and storage resources.
4. Improves Interpretability: By reducing the number of features, it is easier to understand
and interpret the results of the model.
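
As an illustration of the filter approach, the sketch below uses scikit-learn's SelectKBest with an ANOVA F-score on a synthetic classification dataset; both the dataset and the choice of k = 4 are assumptions made for demonstration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 20 features, only a few of which are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

# Filter method: keep the 4 features with the strongest statistical
# relationship (ANOVA F-score) to the target.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                     # (200, 4)
print(selector.get_support(indices=True))   # indices of the kept features
```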

5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale.
This is important in machine learning because the scale of the features can affect the
performance of the model.
Types of Feature Scaling:
1. Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by
subtracting the minimum value and dividing by the range.
2. Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of 1
by subtracting the mean and dividing by the standard deviation.
3. Robust Scaling: Rescaling the features to be robust to outliers by subtracting the median
and dividing by the interquartile range.
Why Feature Scaling?
1. Improves Model Performance: By transforming the features to have a similar scale, the
model can learn from all features equally and avoid being dominated by a few large
features.
2. Increases Model Robustness: By transforming the features to be robust to outliers, the
model can become more robust to anomalies.
3. Improves Computational Efficiency: Many machine learning algorithms, such as k-
nearest neighbors and gradient-based methods, are sensitive to the scale of the features;
with scaled features they train faster and perform better.
4. Improves Model Interpretability: By transforming the features to have a similar scale, it
can be easier to understand the model’s predictions.
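
Here is a minimal sketch of the three scalers in scikit-learn, applied to a single illustrative feature that contains an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One numeric feature with an obvious outlier (1000).
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to the [0, 1] range
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
print(RobustScaler().fit_transform(X).ravel())    # median 0, scaled by the IQR
```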

What are the Steps in Feature Engineering?


The exact steps of feature engineering vary from one ML engineer or data scientist to
another. Some of the common steps involved in most machine learning workflows are:
1. Data Cleansing
 Data cleansing (also known as data cleaning or data scrubbing) involves identifying
and removing or correcting any errors or inconsistencies in the dataset. This step is
important to ensure that the data is accurate and reliable.
2. Data Transformation
 Data transformation involves converting the features into a representation that the
model can learn from effectively, for example through scaling, encoding, or
mathematical transformations.
3. Feature Extraction
 Feature extraction involves deriving new, more informative features from the existing
ones, for example through dimensionality reduction, feature combination, or
aggregation.
4. Feature Selection
 Feature selection involves selecting the most relevant features from the dataset for use
in machine learning. This can include techniques like correlation analysis, mutual
information, and stepwise regression.
5. Feature Iteration
 Feature iteration involves refining and improving the features based on the
performance of the machine learning model. This can include techniques like adding
new features, removing redundant features, and transforming features in different
ways.
Overall, the goal of feature engineering is to create a set of informative and relevant features
that can be used to train a machine learning model and improve its accuracy and
performance. The specific steps involved in the process may vary depending on the type of
data and the specific machine-learning problem at hand.
Techniques Used in Feature Engineering
There are various techniques that can be used in feature engineering to create new features
by combining or transforming the existing ones. The following are some of the most
commonly used feature engineering techniques:
Imputation
Real-world data often suffers from irregularities such as missing values, human errors,
inconsistent entries, and insufficient data sources. Missing values in particular can strongly
degrade the performance of an algorithm, and the "Imputation" technique is used to deal
with them.
One simple option is to drop any row or column that contains a large percentage of missing
values. However, to preserve the size of the dataset, it is usually better to impute the missing
data instead, which can be done as follows:
o For numerical data imputation, a default value can be placed in the column, or
missing values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most
frequently occurring value in the column.
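
For example, here is a minimal pandas sketch of both cases, using illustrative column names:

```python
import numpy as np
import pandas as pd

# Illustrative data with missing values; the column names are assumptions.
df = pd.DataFrame({
    "income": [42000, np.nan, 58000, 61000, np.nan],
    "city":   ["Oslo", "Oslo", np.nan, "Bergen", "Oslo"],
})

# Numerical imputation: fill missing values with the column median (or mean).
df["income"] = df["income"].fillna(df["income"].median())

# Categorical imputation: fill missing values with the most frequent category.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```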

Handling Outliers
Outliers are data points that lie so far away from the other data points that they badly affect
the performance of the model. This feature engineering technique first identifies the outliers
and then removes them.
Standard deviation can be used to identify outliers: every value lies at some distance from
the mean, and a value that lies further away than a chosen threshold (for example, three
standard deviations) can be considered an outlier, as in the sketch below. The Z-score can
also be used to detect outliers.
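
Here is a minimal sketch of Z-score based outlier removal on an illustrative age column; the threshold of three standard deviations is a common rule of thumb, not a fixed rule:

```python
import pandas as pd

# Illustrative data: ages 20-38 plus one obvious outlier (250).
df = pd.DataFrame({"age": list(range(20, 39)) + [250]})

# Z-score: how many standard deviations each value lies from the mean.
z = (df["age"] - df["age"].mean()) / df["age"].std()

# Keep only rows whose absolute Z-score is below the chosen threshold.
df_clean = df[z.abs() < 3]

print(len(df), "->", len(df_clean))  # 20 -> 19: the outlier row is dropped
```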
Log transform
Logarithm transformation, or log transform, is one of the most commonly used
mathematical techniques in machine learning. Log transform helps in handling skewed data
by making the distribution closer to normal after the transformation. It also reduces the
effect of outliers: because large differences in magnitude are compressed, the model
becomes more robust.
Note: Log transformation is only applicable to positive values; otherwise, it gives an error.
To avoid this, we can add 1 to the data before the transformation so that the values passed to
the logarithm are positive (assuming the original values are non-negative).
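
For example, numpy's log1p computes log(1 + x), which applies the "add 1" trick automatically so that zero values are handled safely; the income figures below are purely illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed feature (income-like values, including a zero).
df = pd.DataFrame({"income": [0, 1200, 1500, 2000, 3500, 250000]})

# log1p computes log(1 + x): safe for zeros and compresses the very large
# values so the distribution becomes closer to normal.
df["income_log"] = np.log1p(df["income"])

print(df)
```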

One-Hot Encoding
One-hot encoding is a technique used to transform categorical variables into numerical values
that can be used by machine learning models. In this technique, each category is transformed
into a binary value indicating its presence or absence. For example, consider a categorical
variable “Colour” with three categories: Red, Green, and Blue. One-hot encoding would
transform this variable into three binary variables: Colour_Red, Colour_Green, and
Colour_Blue, where the value of each variable would be 1 if the corresponding category is
present and 0 otherwise.
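
For example, with pandas' get_dummies on the Colour variable described above:

```python
import pandas as pd

df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Green"]})

# One 0/1 column per category: Colour_Blue, Colour_Green, Colour_Red.
encoded = pd.get_dummies(df, columns=["Colour"], dtype=int)

print(encoded)
```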

Binning
Binning is a technique used to transform continuous variables into categorical variables. In
this technique, the range of values of the continuous variable is divided into several bins, and
each bin is assigned a categorical value. For example, consider a continuous variable “Age”
with values ranging from 18 to 80. Binning would divide this variable into several age groups
such as 18-25, 26-35, 36-50, and 51-80, and assign a categorical value to each age group.
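
For example, with pandas' cut, using the age groups described above (the exact bin edges are an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({"Age": [19, 23, 31, 45, 52, 70]})

# Divide the continuous Age range into labelled bins.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[17, 25, 35, 50, 80],
    labels=["18-25", "26-35", "36-50", "51-80"],
)

print(df)
```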

Scaling
The most common scaling techniques are standardization and normalization. Standardization
scales the variable so that it has zero mean and unit variance. Normalization scales the
variable so that it has a range of values between 0 and 1.

Feature Split
Feature splitting is a powerful technique used in feature engineering to improve the
performance of machine learning models. It involves dividing single features into multiple
sub-features or groups based on specific criteria. This process unlocks valuable insights and
enhances the model’s ability to capture complex relationships and patterns within the data.
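
Here is a minimal sketch, splitting a hypothetical full_name column into first and last names, and a date column into month and day-of-week sub-features; all names and formats are assumptions for illustration:

```python
import pandas as pd

# Illustrative raw columns.
df = pd.DataFrame({
    "full_name":  ["Ada Lovelace", "Alan Turing"],
    "order_date": pd.to_datetime(["2023-01-15", "2023-07-04"]),
})

# Split a text feature into two sub-features.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Split a date feature into parts the model can use directly.
df["order_month"] = df["order_date"].dt.month
df["order_dayofweek"] = df["order_date"].dt.dayofweek

print(df)
```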

Text Data Preprocessing


Text data requires special preprocessing techniques before it can be used by machine learning
models. Text preprocessing involves removing stop words, stemming, lemmatization, and
vectorization. Stop words are common words that do not add much meaning to the text, such
as “the” and “and”. Stemming involves reducing words to their root form, such as converting
“running” to “run”. Lemmatization is similar to stemming, but it maps words to their
dictionary base form (the lemma), such as converting “better” to “good” or “ran” to “run”.
Vectorization involves transforming text data into numerical vectors that can be used by
machine learning models.
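
As a small example, the sketch below removes English stop words and vectorizes two toy documents with scikit-learn's TfidfVectorizer (assuming scikit-learn 1.0+ for get_feature_names_out); stemming and lemmatization would typically be done with a library such as NLTK or spaCy and are omitted here for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat",
    "The dog chased the cat and the ball",
]

# Drop common English stop words ("the", "and", ...) and turn each
# document into a numerical TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # remaining vocabulary
print(X.toarray())                         # one numeric vector per document
```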
