Unit 6 AICS
Feature Engineering
Prepared by
Prof. Kusuma
What is Feature Engineering?
Feature engineering is the process of converting raw data into useful input
variables (features) that improve the performance of machine learning models.
It involves selecting and constructing the most informative features so that a
model can learn patterns and make accurate predictions.
Feature engineering encompasses methods like feature scaling,
encoding categorical variables, feature selection, and building
interaction terms.
Why is Feature Engineering Important?
Feature engineering is one of the most critical steps in
machine learning. Even the most advanced algorithms can fail if they
are trained on poorly designed features. Here’s why it matters:
1. Improves Model Accuracy
A well-engineered feature set allows a model to capture patterns more effectively,
leading to higher accuracy. For example, converting a date column into “day of the
week” or “holiday vs. non-holiday” can improve sales forecasting models.
2. Reduces Overfitting and Underfitting
By removing irrelevant or highly correlated features, feature engineering prevents the
model from memorizing noise (overfitting) and ensures it generalizes well on unseen
data.
3. Enhances Model Interpretability
Features that align with domain knowledge make the model’s decisions more
explainable. For instance, in fraud detection, a feature like “number of transactions per
hour” is more informative than raw timestamps.
4. Boosts Training Efficiency
Reducing the number of unnecessary features decreases computational complexity,
making training faster and more efficient.
5. Handles Noisy and Missing Data
Raw data is often incomplete or contains outliers. Feature engineering helps clean and
structure this data, ensuring better learning outcomes.
Feature Selection
Selecting the most relevant features while eliminating redundant,
irrelevant, or highly correlated variables helps improve model
efficiency and accuracy.
Techniques:
Filter Methods: Use statistical measures such as correlation, variance
thresholds, or mutual information to select important features.
Wrapper Methods: Use iterative search techniques such as Recursive Feature
Elimination (RFE) and stepwise selection.
Embedded Methods: Feature selection is built into the algorithm,
as in Lasso Regression (L1 regularization) or decision-tree-based
models.
Example: Removing highly correlated features like “Total Sales” and
“Average Monthly Sales” if one can be derived from the other.
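A minimal sketch of filter-style and wrapper-style selection with scikit-learn; the dataset, column names, and threshold are illustrative assumptions, not part of the original example.

```python
# Minimal sketch: filter and wrapper feature selection (data is illustrative)
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "total_sales":       [100, 200, 150, 300],
    "avg_monthly_sales": [8.3, 16.7, 12.5, 25.0],  # derived from total_sales -> highly correlated
    "store_size":        [50, 80, 60, 120],
    "is_open":           [1, 1, 1, 1],              # zero variance -> uninformative
})
y = [0, 1, 0, 1]

# Filter method: drop near-constant features
selector = VarianceThreshold(threshold=0.0)
selector.fit(df)
print("Kept after variance filter:", df.columns[selector.get_support()].tolist())

# Filter method: inspect pairwise correlations to spot redundant columns
print(df.corr().abs())

# Wrapper method: Recursive Feature Elimination with a simple estimator
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2)
rfe.fit(df, y)
print("RFE-selected features:", df.columns[rfe.support_].tolist())
```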
Feature Transformation
Transforms raw data to improve model learning by making it more
interpretable or reducing skewness.
Techniques:
Normalization (Min-Max Scaling): Rescales values between 0 and
1. Useful for distance-based models like k-NN.
Standardization (Z-score Scaling): Transforms data to have a
mean of 0 and standard deviation of 1. Works well for gradient-based
models like logistic regression.
Log Transformation: Reduces skew in heavy-tailed data, making its
distribution closer to normal.
Power Transformation (Box-Cox, Yeo-Johnson): Used to stabilize
variance and make data more normal-like.
Example: Scaling customer income before using it in a model to
prevent high-value dominance.
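A minimal sketch of the transformations above using scikit-learn and NumPy; the income values are made up for illustration.

```python
# Minimal sketch: common feature scaling / transformation steps (data is illustrative)
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

income = pd.DataFrame({"income": [25_000, 40_000, 52_000, 250_000]})  # right-skewed

# Normalization: rescale to [0, 1]
minmax = MinMaxScaler().fit_transform(income)

# Standardization: mean 0, standard deviation 1
zscore = StandardScaler().fit_transform(income)

# Log transform: compress large values, reduce skew
logged = np.log1p(income["income"])

# Power transform (Yeo-Johnson also handles zero/negative values)
power = PowerTransformer(method="yeo-johnson").fit_transform(income)

print(minmax.ravel(), zscore.ravel(), logged.values, power.ravel(), sep="\n")
```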
Feature Encoding
Converts categorical variables into numerical representations that machine
learning models can process.
Techniques:
One-Hot Encoding (OHE): Creates binary columns for each
category (suitable for low-cardinality categorical variables).
Label Encoding: Assigns numerical values to categories (useful for
ordinal categories like “low,” “medium,” “high”).
Target Encoding: Replaces categories with the mean target value
(commonly used in regression models).
Frequency Encoding: Converts categories into their occurrence
frequency in the dataset.
Example (one-hot encoding):
City | New York | San Francisco | Chicago
NY   | 1        | 0             | 0
SF   | 0        | 1             | 0
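A minimal sketch of the four encodings using pandas; the city and price columns are illustrative assumptions.

```python
# Minimal sketch: common categorical encodings with pandas (data is illustrative)
import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF", "NY", "Chicago"],
                   "price": [300, 500, 320, 250]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Label encoding: map categories to integers (best suited to ordinal data)
df["city_label"] = df["city"].astype("category").cat.codes

# Frequency encoding: replace each category with its relative frequency
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Target encoding: replace each category with the mean of the target
df["city_target"] = df["city"].map(df.groupby("city")["price"].mean())

print(pd.concat([df, one_hot], axis=1))
```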
Feature Creation (Derived Features)
Feature creation involves constructing new features from existing
ones to provide additional insights and improve model performance.
Well-crafted features can capture hidden relationships in data,
making patterns more evident to machine learning models.
Techniques:
Polynomial Features: Useful for models that need to capture non-
linear relationships between variables.
Example: If the relationship is not purely linear, adding polynomial terms like
x², x³, or interaction terms (x1 * x2) can improve performance.
Use Case: Predicting house prices based on features like square footage
and number of rooms. Instead of just using square footage, a model could
benefit from an interaction term like square_footage * number_of_rooms.
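A minimal sketch of generating polynomial and interaction terms with scikit-learn; the house data is invented for illustration.

```python
# Minimal sketch: polynomial and interaction features (data is illustrative)
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

houses = pd.DataFrame({"square_footage": [800, 1200, 1500],
                       "number_of_rooms": [2, 3, 4]})

# degree=2 adds x1^2, x2^2, and the interaction term x1 * x2
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(houses)

print(poly.get_feature_names_out(houses.columns))
print(expanded)
```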
Binning (Discretization): Converts continuous variables into categorical
bins to simplify the relationship.
Example: Instead of using raw age values (22, 34, 45), we can group them into
bins:
Young (18-30)
Middle-aged (31-50)
Senior (51+)
Use Case: Credit risk modeling, where different age groups have different risk
levels.
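A minimal sketch of binning ages into the groups above with pandas; the sample ages are illustrative.

```python
# Minimal sketch: binning continuous ages into categories (bin edges as in the text)
import pandas as pd

ages = pd.Series([22, 34, 45, 63])
age_group = pd.cut(ages,
                   bins=[18, 30, 50, 120],
                   labels=["Young", "Middle-aged", "Senior"])
print(age_group)
```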
Ratio Features: Creating ratios between two related numerical values to
normalize the impact of scale.
Example: Instead of using income and loan amount separately, use Income-to-
Loan Ratio = Income / Loan Amount to standardize comparisons across different
income levels.
Use Case: Loan default prediction, where individuals with a higher debt-to-
income ratio are more likely to default.
Time-based Features: Extracts meaningful insights from
timestamps, such as:
Hour of the day (helps in traffic analysis)
Day of the week (useful for sales forecasting)
Season (important for retail and tourism industries)
Use Case: Predicting e-commerce sales by analyzing trends based on
weekdays vs. weekends.
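A minimal sketch of ratio and time-based features with pandas; the loan records and timestamps are illustrative assumptions.

```python
# Minimal sketch: ratio and time-based features (data is illustrative)
import pandas as pd

loans = pd.DataFrame({
    "income": [45_000, 80_000, 30_000],
    "loan_amount": [15_000, 20_000, 25_000],
    "application_time": pd.to_datetime(
        ["2024-03-04 09:15", "2024-07-13 18:40", "2024-12-24 11:05"]),
})

# Ratio feature: normalize loan size by income
loans["income_to_loan_ratio"] = loans["income"] / loans["loan_amount"]

# Time-based features extracted from the timestamp
loans["hour"] = loans["application_time"].dt.hour
loans["day_of_week"] = loans["application_time"].dt.dayofweek  # 0 = Monday
loans["is_weekend"] = loans["day_of_week"].isin([5, 6]).astype(int)
loans["month"] = loans["application_time"].dt.month            # proxy for season

print(loans)
```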
Example: Adding a missing-value indicator for Salary:
Customer ID | Age | Salary | Salary Missing Indicator
101         | 35  | 50,000 | 0
102         | 42  | NaN    | 1
103         | 29  | 40,000 | 0
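A minimal sketch reproducing the table above with pandas; the median imputation step is an illustrative choice, not prescribed by the text.

```python
# Minimal sketch: flagging and imputing a missing salary value
import numpy as np
import pandas as pd

df = pd.DataFrame({"customer_id": [101, 102, 103],
                   "age": [35, 42, 29],
                   "salary": [50_000, np.nan, 40_000]})

# Flag missing values, then impute (median used here as an illustrative choice)
df["salary_missing_indicator"] = df["salary"].isna().astype(int)
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df)
```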
Feature Extraction
Feature extraction involves deriving new, meaningful representations from complex
data formats like text, images, and time-series. This is especially useful in high-
dimensional datasets.
Techniques:
Text Features: Convert textual data into numerical form for machine learning
models.
Bag of Words (BoW): Represents text as word frequencies in a matrix.
TF-IDF (Term Frequency-Inverse Document Frequency): Gives importance to words
based on their frequency in a document vs. overall dataset.
Word Embeddings (Word2Vec, GloVe, BERT): Captures semantic meaning of words.
Use Case: Sentiment analysis of customer reviews.
Image Features: Extract essential patterns from images.
Edge Detection: Identifies object boundaries in images (useful in medical imaging).
Histogram of Oriented Gradients (HOG): Used in object detection.
CNN-based Feature Extraction: Uses deep learning models like ResNet and VGG for
automatic feature learning.
Use Case: Facial recognition, self-driving car object detection.
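A minimal sketch of classical image feature extraction with scikit-image; the random array stands in for a real grayscale image, and the HOG parameters are common defaults rather than values from the text.

```python
# Minimal sketch: edge and HOG features (input image is a random stand-in)
import numpy as np
from skimage.feature import hog
from skimage.filters import sobel

image = np.random.rand(64, 64)  # stand-in for a real grayscale image

# Edge detection: highlights object boundaries
edges = sobel(image)

# Histogram of Oriented Gradients: fixed-length descriptor used in object detection
hog_descriptor = hog(image, orientations=9,
                     pixels_per_cell=(8, 8), cells_per_block=(2, 2))

print(edges.shape, hog_descriptor.shape)
```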
Time-Series Features: Extract meaningful trends and seasonality
from time-series data.
Rolling Averages: Smooth out short-term fluctuations.
Seasonal Decomposition: Separates trend, seasonality, and residual
components.
Autoregressive Features: Uses past values as inputs for predictive
models.
Use Case: Forecasting electricity demand based on historical consumption
patterns.
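A minimal sketch of rolling-average and lag (autoregressive) features with pandas; the demand series is invented for illustration.

```python
# Minimal sketch: rolling and lag features for a time series (data is illustrative)
import pandas as pd

demand = pd.DataFrame(
    {"load_mw": [310, 295, 330, 360, 340, 355, 400]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"))

# Rolling average: smooths short-term fluctuations
demand["rolling_mean_3d"] = demand["load_mw"].rolling(window=3).mean()

# Autoregressive (lag) feature: yesterday's value as an input
demand["lag_1"] = demand["load_mw"].shift(1)

print(demand)
```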
Dimensionality Reduction (PCA, t-SNE, UMAP):
PCA (Principal Component Analysis) reduces high-dimensional data
while preserving variance.
t-SNE and UMAP are useful for visualizing clusters in large datasets.
Use Case: Reducing thousands of customer behavior variables into a few
principal components for clustering.
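A minimal sketch of PCA on standardized data with scikit-learn; the random matrix stands in for real customer behavior variables, and the number of components is an illustrative choice.

```python
# Minimal sketch: PCA after standardization (data is a random stand-in)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 50)                   # 200 customers, 50 behavior variables

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=5)                     # keep 5 principal components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (200, 5)
print(pca.explained_variance_ratio_.sum())    # share of variance retained
```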
Example:
For text analysis, TF-IDF converts raw sentences into numerical form:
Term         | "AI is transforming healthcare" | "AI is advancing research"
AI           | 0.4                             | 0.3
transforming | 0.6                             | 0.0
research     | 0.0                             | 0.7
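A minimal sketch of computing TF-IDF vectors for the two example sentences with scikit-learn; the resulting scores will differ from the illustrative values in the table above.

```python
# Minimal sketch: TF-IDF vectorization of the two example sentences
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["AI is transforming healthcare", "AI is advancing research"]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```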
Handling Outliers
Techniques:
Winsorization: Replaces extreme values with a specified percentile
(e.g., capping values at the 5th and 95th percentile).
Z-score Method: Removes values that are more than a certain
number of standard deviations from the mean (e.g., ±3σ).
IQR (Interquartile Range) Method: Removes values that fall more than 1.5
times the interquartile range below Q1 or above Q3.
Transformations (Log, Square Root): Reduces the impact of
extreme values by adjusting scale.
Example:
Employee | Salary  | Outlier (IQR Method)
A        | 50,000  | No
B        | 52,000  | No
C        | 200,000 | Yes
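A minimal sketch of IQR flagging and winsorization with pandas and NumPy; the salary list extends the three-row table above with a few invented values so the IQR bounds are meaningful.

```python
# Minimal sketch: IQR outlier flagging and winsorization (salaries are illustrative)
import numpy as np
import pandas as pd

salaries = pd.Series([48_000, 50_000, 52_000, 51_000, 49_500, 200_000])

# IQR method: flag values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)

# Winsorization: cap values at the 5th and 95th percentiles instead of removing them
lower, upper = np.percentile(salaries, [5, 95])
winsorized = salaries.clip(lower, upper)

print(pd.DataFrame({"salary": salaries,
                    "outlier_iqr": is_outlier,
                    "winsorized": winsorized}))
```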
Feature Interaction