Feature Engineering
Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model. It involves selecting relevant
information from raw data and transforming it into a format that can be easily understood by
a model. The goal is to improve model accuracy by providing more meaningful and relevant
information.
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.
For example, in a dataset of housing prices, features could include the number of
bedrooms, the square footage, the location, and the age of the property. In a dataset of
customer demographics, features could include age, gender, income level, and
occupation.
The choice and quality of features are critical in machine learning, as they can greatly
impact the accuracy and performance of the model.
1. Feature Creation
Feature Creation is the process of generating new features based on domain knowledge or by
observing patterns in the data. It is a form of feature engineering that can significantly
improve the performance of a machine-learning model.
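As a small illustration (the DataFrame and column names below are hypothetical), new features can be derived from existing columns using domain knowledge, for example price per square foot in a housing dataset:

    import pandas as pd

    # Hypothetical housing data; the column names are illustrative only.
    df = pd.DataFrame({
        "price": [250000, 340000, 180000],
        "square_footage": [1500, 2100, 1100],
        "bedrooms": [3, 4, 2],
    })

    # Domain-knowledge features created from existing columns.
    df["price_per_sqft"] = df["price"] / df["square_footage"]
    df["sqft_per_bedroom"] = df["square_footage"] / df["bedrooms"]
    print(df)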
2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model. This is done to ensure that the model can
effectively learn from the data.
Types of Feature Transformation:
1. Normalization: Rescaling the features to have a similar range, such as between 0 and 1, to
prevent some features from dominating others.
2. Scaling: Rescaling numerical features to a common scale, for example to unit standard
deviation, so that features can be compared directly and no feature dominates simply
because of its magnitude (see the sketch after this list).
3. Encoding: Transforming categorical features into a numerical representation. Examples
are one-hot encoding and label encoding.
4. Transformation: Transforming the features using mathematical operations to change the
distribution or scale of the features. Examples are logarithmic, square root, and reciprocal
transformations.
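A minimal sketch of normalization, encoding, and a log transform using pandas, NumPy, and scikit-learn (the data and column names are illustrative, not from any particular dataset):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    # Illustrative data with one numerical and one categorical feature.
    df = pd.DataFrame({
        "income": [32000, 54000, 120000, 41000],
        "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    })

    # Normalization: rescale "income" into the [0, 1] range.
    df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

    # Encoding: one-hot encode the categorical "city" column.
    city_encoded = OneHotEncoder().fit_transform(df[["city"]]).toarray()

    # Transformation: log transform to reduce the skew of "income".
    df["income_log"] = np.log(df["income"])

    print(df)
    print(city_encoded)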
Why Feature Transformation?
1. Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
2. Increases Model Robustness: Transforming the features can make the model more robust
to outliers and other anomalies.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By transforming the features, it can be easier to
understand the model’s predictions.
3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. This is done by transforming,
combining, or aggregating existing features.
Types of Feature Extraction:
1. Dimensionality Reduction: Reducing the number of features by transforming the data into
a lower-dimensional space while retaining important information. Examples
are PCA and t-SNE.
2. Feature Combination: Combining two or more existing features to create a new one. For
example, the interaction between two features.
3. Feature Aggregation: Aggregating features to create a new one. For example, calculating
the mean, sum, or count of a set of features.
4. Feature Transformation: Transforming existing features into a new representation, for
example a log transformation of a feature with a skewed distribution (see the sketch after
this list).
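The sketch below illustrates feature combination, aggregation, and dimensionality reduction with pandas and scikit-learn (the columns, such as height, weight, and monthly spend, are hypothetical):

    import pandas as pd
    from sklearn.decomposition import PCA

    # Hypothetical data; column names are illustrative only.
    df = pd.DataFrame({
        "height_cm": [170, 160, 182, 175],
        "weight_kg": [68, 55, 90, 77],
        "spend_jan": [120, 300, 80, 210],
        "spend_feb": [100, 280, 95, 190],
    })

    # Feature combination: body mass index derived from two existing features.
    df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

    # Feature aggregation: mean spend across several monthly columns.
    df["mean_spend"] = df[["spend_jan", "spend_feb"]].mean(axis=1)

    # Dimensionality reduction: project all numeric features onto 2 components.
    components = PCA(n_components=2).fit_transform(df)
    print(df)
    print(components.shape)  # (4, 2)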
Why Feature Extraction?
1. Improves Model Performance: By creating new and more relevant features, the model can
learn more meaningful patterns in the data.
2. Reduces Overfitting: By reducing the dimensionality of the data, the model is less likely
to overfit the training data.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.
4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to
be used in a machine-learning model. It is an important step in the feature engineering
process as it can have a significant impact on the model’s performance.
Types of Feature Selection:
1. Filter Method: Features are ranked by a statistical measure of their relationship with the
target variable, such as correlation or mutual information, and the highest-scoring features
are selected (see the sketch after this list).
2. Wrapper Method: Based on the evaluation of the feature subset using a specific machine
learning algorithm. The feature subset that results in the best performance is selected.
3. Embedded Method: Based on the feature selection as part of the training process of the
machine learning algorithm.
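As a sketch of the filter and wrapper approaches, using scikit-learn's built-in breast cancer dataset purely for illustration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, SelectKBest, f_classif
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

    # Filter method: rank features by an ANOVA F-score against the target
    # and keep the 10 highest-scoring ones.
    X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

    # Wrapper method: recursively drop the least useful features according
    # to a model trained on each candidate subset.
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
              n_features_to_select=10)
    X_wrapper = rfe.fit_transform(X, y)

    print(X_filter.shape, X_wrapper.shape)  # (569, 10) (569, 10)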
Why Feature Selection?
1. Reduces Overfitting: By using only the most relevant features, the model can generalize
better to new data.
2. Improves Model Performance: Selecting the right features can improve the accuracy,
precision, and recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less computation
and storage resources.
4. Improves Interpretability: By reducing the number of features, it is easier to understand
and interpret the results of the model.
5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale.
This is important in machine learning because the scale of the features can affect the
performance of the model.
Types of Feature Scaling:
1. Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by
subtracting the minimum value and dividing by the range.
2. Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of 1
by subtracting the mean and dividing by the standard deviation.
3. Robust Scaling: Rescaling the features to be robust to outliers by subtracting the median
and dividing by the interquartile range (see the sketch after this list).
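A short sketch of the three scalers applied to a single illustrative feature containing one obvious outlier:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

    # One illustrative feature; the value 100.0 is a deliberate outlier.
    X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

    print(MinMaxScaler().fit_transform(X).ravel())    # rescaled into [0, 1]
    print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
    print(RobustScaler().fit_transform(X).ravel())    # median/IQR based, less outlier-sensitive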
Why Feature Scaling?
1. Improves Model Performance: By transforming the features to have a similar scale, the
model can learn from all features equally and avoid being dominated by a few large
features.
2. Increases Model Robustness: By transforming the features to be robust to outliers, the
model can become more robust to anomalies.
3. Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest
neighbors and gradient-descent-based models, are sensitive to the scale of the features and
converge faster or perform better when the features are scaled.
4. Improves Model Interpretability: By transforming the features to have a similar scale, it
can be easier to understand the model’s predictions.
Handling Outliers
Outliers are data points that lie unusually far from the rest of the data and can badly affect
the performance of a model. This feature engineering technique first identifies the outliers
and then removes them.
The standard deviation can be used to identify outliers: each value lies at some distance from
the mean, and if that distance exceeds a chosen number of standard deviations, the value can
be treated as an outlier. The Z-score, which expresses this distance in units of standard
deviation, is a common way to detect outliers, as sketched below.
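A minimal sketch of Z-score-based outlier removal with NumPy and pandas (the synthetic income values are invented for illustration):

    import numpy as np
    import pandas as pd

    # Synthetic data: 100 typical values plus two extreme ones.
    rng = np.random.default_rng(0)
    income = pd.Series(np.append(rng.normal(50, 5, size=100), [400, 500]),
                       name="income")

    # Z-score: distance of each value from the mean in standard deviations.
    z = (income - income.mean()) / income.std()

    # Keep only the values whose absolute Z-score is below a chosen threshold.
    income_clean = income[z.abs() < 3]
    print(len(income), len(income_clean))  # the two extreme values are dropped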
Log transform
Logarithm transformation, or log transform, is one of the most commonly used mathematical
techniques in machine learning. It helps handle skewed data, making the distribution closer to
normal after transformation. It also reduces the effect of outliers, because compressing large
magnitude differences makes the model more robust.
Note: Log transformation is defined only for positive values; applying it to zero or negative
values produces an error. For data containing zeros, a common workaround is to add 1 before
transforming (that is, compute log(1 + x)), which keeps the argument positive, as in the
sketch below.
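A brief sketch with NumPy, using log1p (the log of 1 + x) so that zero values are handled safely; the amounts are invented:

    import numpy as np

    # Illustrative right-skewed, non-negative values (e.g. transaction amounts).
    amounts = np.array([0, 3, 7, 15, 40, 2500])

    # log1p computes log(1 + x), so the zero does not cause an error
    # and the large value is compressed towards the rest of the data.
    print(np.log1p(amounts))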
One-Hot Encoding
One-hot encoding is a technique used to transform categorical variables into numerical values
that can be used by machine learning models. In this technique, each category is transformed
into a binary value indicating its presence or absence. For example, consider a categorical
variable “Colour” with three categories: Red, Green, and Blue. One-hot encoding would
transform this variable into three binary variables: Colour_Red, Colour_Green, and
Colour_Blue, where the value of each variable would be 1 if the corresponding category is
present and 0 otherwise.
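The “Colour” example above can be reproduced in a few lines with pandas:

    import pandas as pd

    df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Green"]})

    # One-hot encode "Colour" into Colour_Blue, Colour_Green and Colour_Red,
    # each holding 1 (True) where that colour is present and 0 (False) otherwise.
    print(pd.get_dummies(df, columns=["Colour"]))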
Binning
Binning is a technique used to transform continuous variables into categorical variables. In
this technique, the range of values of the continuous variable is divided into several bins, and
each bin is assigned a categorical value. For example, consider a continuous variable “Age”
with values ranging from 18 to 80. Binning would divide this variable into several age groups
such as 18-25, 26-35, 36-50, and 51-80, and assign a categorical value to each age group.
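The “Age” example above, sketched with pandas (the specific ages are illustrative):

    import pandas as pd

    ages = pd.Series([19, 23, 31, 45, 62, 78], name="Age")

    # Bin the continuous ages into the groups described above.
    age_group = pd.cut(ages,
                       bins=[18, 25, 35, 50, 80],
                       labels=["18-25", "26-35", "36-50", "51-80"],
                       include_lowest=True)
    print(age_group)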
Scaling
The most common scaling techniques are standardization and normalization. Standardization
scales the variable so that it has zero mean and unit variance. Normalization scales the
variable so that it has a range of values between 0 and 1.
Feature Split
Feature splitting is a powerful technique used in feature engineering to improve the
performance of machine learning models. It involves dividing single features into multiple
sub-features or groups based on specific criteria. This process unlocks valuable insights and
enhances the model’s ability to capture complex relationships and patterns within the data.
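For example, a single timestamp column can be split into several sub-features; a sketch with pandas (the column name order_time is hypothetical):

    import pandas as pd

    # Hypothetical column holding full timestamps.
    df = pd.DataFrame({"order_time": pd.to_datetime([
        "2023-01-15 09:30", "2023-06-02 18:45", "2023-11-20 23:10",
    ])})

    # Split the single feature into sub-features the model can use directly.
    df["order_month"] = df["order_time"].dt.month
    df["order_day_of_week"] = df["order_time"].dt.dayofweek
    df["order_hour"] = df["order_time"].dt.hour
    print(df)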