
Unit-6

Feature Engineering
Prepared by
Prof. Kusuma
What is Feature Engineering?

 Feature engineering is the art of converting raw data into useful input
variables (features) that improve the performance of machine learning
models. It helps in choosing the most useful features to enhance a
model’s capacity to learn patterns and make accurate predictions.
 Feature engineering encompasses methods like feature scaling,
encoding categorical variables, feature selection, and building
interaction terms.
 Why is Feature Engineering Important?
 Feature engineering is one of the most critical steps in
machine learning. Even the most advanced algorithms can fail if they
are trained on poorly designed features. Here’s why it matters:
 1. Improves Model Accuracy
 A well-engineered feature set allows a model to capture patterns more effectively,
leading to higher accuracy. For example, converting a date column into “day of the
week” or “holiday vs. non-holiday” can improve sales forecasting models.
 2. Reduces Overfitting and Underfitting
 By removing irrelevant or highly correlated features, feature engineering prevents the
model from memorizing noise (overfitting) and ensures it generalizes well on unseen
data.
 3. Enhances Model Interpretability
 Features that align with domain knowledge make the model’s decisions more
explainable. For instance, in fraud detection, a feature like “number of transactions per
hour” is more informative than raw timestamps.
 4. Boosts Training Efficiency
 Reducing the number of unnecessary features decreases computational complexity,
making training faster and more efficient.
 5. Handles Noisy and Missing Data
 Raw data is often incomplete or contains outliers. Feature engineering helps clean and
structure this data, ensuring better learning outcomes.
 Feature Selection
 Selecting the most relevant features while eliminating redundant,
irrelevant, or highly correlated variables helps improve model
efficiency and accuracy.
 Techniques:
 Filter Methods: Uses statistical techniques like correlation, variance
threshold, or mutual information to select important features.
 Wrapper Methods: Uses iterative techniques like Recursive Feature
Elimination (RFE) and stepwise selection.
 Embedded Methods: Feature selection is built into the algorithm,
such as Lasso Regression (L1 regularization) or decision tree-based
models.
 Example: Removing highly correlated features like “Total Sales” and
“Average Monthly Sales” if one can be derived from the other.
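 Below is a minimal Python (pandas/scikit-learn) sketch of the three selection styles; the DataFrame, column names, and thresholds are illustrative assumptions, not part of the original slides.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LogisticRegression

# Illustrative data: "avg_monthly_sales" is derivable from "total_sales"
df = pd.DataFrame({
    "total_sales": [100, 200, 300, 400, 500, 600],
    "avg_monthly_sales": [8.3, 16.7, 25.0, 33.3, 41.7, 50.0],
    "store_age": [1, 5, 3, 8, 2, 7],
    "target": [0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="target"), df["target"]

# Filter method: flag one of any pair of highly correlated features
corr = X.corr().abs()
redundant = [c for c in corr.columns if (corr.loc[c, corr.columns != c] > 0.95).any()]
print("Highly correlated:", redundant)

# Wrapper method: Recursive Feature Elimination keeps the best 2 features
rfe = RFE(LogisticRegression(), n_features_to_select=2).fit(X, y)
print("RFE kept:", list(X.columns[rfe.support_]))

# Embedded method: Lasso (L1) shrinks weak features' coefficients to zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", dict(zip(X.columns, lasso.coef_)))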
 Feature Transformation
 Transforms raw data to improve model learning by making it more
interpretable or reducing skewness.
 Techniques:
 Normalization (Min-Max Scaling): Rescales values between 0 and
1. Useful for distance-based models like k-NN.
 Standardization (Z-score Scaling): Transforms data to have a
mean of 0 and standard deviation of 1. Works well for gradient-based
models like logistic regression.
 Log Transformation: Converts skewed data into a normal
distribution.
 Power Transformation (Box-Cox, Yeo-Johnson): Used to stabilize
variance and make data more normal-like.
 Example: Scaling customer income before using it in a model to
prevent high-value dominance.
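 A minimal scikit-learn/NumPy sketch of these transformations; the income values are an illustrative, right-skewed stand-in.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

income = pd.DataFrame({"income": [25_000, 32_000, 40_000, 55_000, 250_000]})

# Normalization (Min-Max): rescales values into [0, 1]
income["minmax"] = MinMaxScaler().fit_transform(income[["income"]]).ravel()

# Standardization (Z-score): mean 0, standard deviation 1
income["zscore"] = StandardScaler().fit_transform(income[["income"]]).ravel()

# Log transformation: compresses the long right tail
income["log"] = np.log1p(income["income"])

# Power transformation (Yeo-Johnson): stabilizes variance, tolerates zeros
income["yeo_johnson"] = PowerTransformer(method="yeo-johnson").fit_transform(income[["income"]]).ravel()

print(income.round(3))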
Feature Encoding
 Feature encoding converts categorical variables into numerical values that machine learning models can process.
 Techniques:
 One-Hot Encoding (OHE): Creates binary columns for each
category (suitable for low-cardinality categorical variables).
 Label Encoding: Assigns numerical values to categories (useful for
ordinal categories like “low,” “medium,” “high”).
 Target Encoding: Replaces categories with the mean target value
(commonly used in regression models).
 Frequency Encoding: Converts categories into their occurrence
frequency in the dataset.
 Example (One-Hot Encoding of the City column):

City | New York | San Francisco | Chicago
NY   |    1     |       0       |    0
SF   |    0     |       1       |    0
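 The sketch below reproduces these encodings with pandas; the small City/Purchased table is an assumed example.

import pandas as pd

df = pd.DataFrame({"City": ["NY", "SF", "NY", "Chicago", "SF"],
                   "Purchased": [1, 0, 1, 0, 1]})

# One-Hot Encoding: one binary column per category (as in the table above)
one_hot = pd.get_dummies(df["City"], prefix="City")

# Label Encoding: map ordered categories to integers (for ordinal data)
risk = pd.Series(["low", "high", "medium"])
risk_encoded = risk.map({"low": 0, "medium": 1, "high": 2})

# Target Encoding: replace each category with the mean of the target
df["City_target_enc"] = df["City"].map(df.groupby("City")["Purchased"].mean())

# Frequency Encoding: replace each category with its occurrence count
df["City_freq_enc"] = df["City"].map(df["City"].value_counts())

print(pd.concat([df, one_hot], axis=1))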
 Feature Creation (Derived Features)
 Feature creation involves constructing new features from existing
ones to provide additional insights and improve model performance.
Well-crafted features can capture hidden relationships in data,
making patterns more evident to machine learning models.
 Techniques:
 Polynomial Features: Useful for models that need to capture non-
linear relationships between variables.
 Example: If a model struggles with a linear relationship, adding
polynomial terms like x², x³, or interaction terms (x1 * x2) can improve
performance.
 Use Case: Predicting house prices based on features like square footage
and number of rooms. Instead of just using square footage, a model could
benefit from an interaction term like square_footage * number_of_rooms.
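 A short scikit-learn sketch of polynomial and interaction terms; the housing numbers and column names are illustrative assumptions.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

houses = pd.DataFrame({"square_footage": [850, 1200, 2000],
                       "number_of_rooms": [2, 3, 5]})

# Degree-2 expansion adds x1^2, x2^2 and the x1*x2 interaction term
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(houses)
print(poly.get_feature_names_out(houses.columns))

# Or build just the interaction feature by hand
houses["sqft_x_rooms"] = houses["square_footage"] * houses["number_of_rooms"]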
 Binning (Discretization): Converts continuous variables into categorical
bins to simplify the relationship.
 Example: Instead of using raw age values (22, 34, 45), we can group them into
bins:
 Young (18-30)
 Middle-aged (31-50)
 Senior (51+)
 Use Case: Credit risk modeling, where different age groups have different risk
levels.
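 A minimal pandas sketch of the age bins above (the raw ages are assumed values):

import pandas as pd

ages = pd.DataFrame({"age": [22, 34, 45, 67, 19, 52]})

# pd.cut maps each age into a labelled interval
ages["age_group"] = pd.cut(ages["age"],
                           bins=[18, 30, 50, 120],
                           labels=["Young", "Middle-aged", "Senior"])
print(ages)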

 Ratio Features: Creating ratios between two related numerical values to
normalize the impact of scale.
 Example: Instead of using income and loan amount separately, use Income-to-
Loan Ratio = Income / Loan Amount to standardize comparisons across different
income levels.
 Use Case: Loan default prediction, where individuals with a higher debt-to-
income ratio are more likely to default.
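 A one-line pandas sketch of the income-to-loan ratio (illustrative values):

import pandas as pd

loans = pd.DataFrame({"income": [60_000, 45_000, 90_000],
                      "loan_amount": [20_000, 30_000, 15_000]})

# Normalizes loan size by the borrower's income
loans["income_to_loan_ratio"] = loans["income"] / loans["loan_amount"]
print(loans)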
 Time-based Features: Extracts meaningful insights from
timestamps, such as:
 Hour of the day (helps in traffic analysis)
 Day of the week (useful for sales forecasting)
 Season (important for retail and tourism industries)
 Use Case: Predicting e-commerce sales by analyzing trends based on
weekdays vs. weekends.
 Example:

Timestamp        | Hour | Day of Week | Month | Season
2024-02-15 14:30 | 14   | Thursday    | 2     | Winter
2024-06-10 08:15 | 8    | Monday      | 6     | Summer
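 The pandas sketch below reproduces the table above via the .dt accessor; the season mapping is an assumed northern-hemisphere convention.

import pandas as pd

events = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2024-02-15 14:30", "2024-06-10 08:15"])})

events["hour"] = events["timestamp"].dt.hour
events["day_of_week"] = events["timestamp"].dt.day_name()
events["month"] = events["timestamp"].dt.month
events["is_weekend"] = events["timestamp"].dt.dayofweek >= 5
events["season"] = events["month"].map(
    {12: "Winter", 1: "Winter", 2: "Winter", 3: "Spring", 4: "Spring", 5: "Spring",
     6: "Summer", 7: "Summer", 8: "Summer", 9: "Autumn", 10: "Autumn", 11: "Autumn"})
print(events)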
Handling Missing Data

 Missing data is common in real-world datasets and can negatively
impact model performance if not handled properly. Instead of simply
dropping missing values, feature engineering techniques help retain
valuable information while minimizing bias.
 Techniques:
 Mean/Median/Mode Imputation:
 Fills missing values with the mean (for numerical data) or mode (for categorical data).
 Example: Filling missing salary values with the median salary of the dataset to prevent
skewing the distribution.
 Forward or Backward Fill (Time-Series Data):
 Forward fill: Uses the last known value to fill missing entries.
 Backward fill: Uses the next known value to fill missing entries.
 Use Case: Stock market data where missing prices can be filled with the previous day’s
prices.
 K-Nearest Neighbors (KNN) Imputation:
 Uses similar data points to estimate missing values.
 Example: If a person’s income is missing, KNN can predict it based on people with similar job
roles, education levels, and locations.
 Indicator Variable for Missingness:
 Creates a binary column (1 = Missing, 0 = Present) to retain missing data patterns.
 Use Case: Detecting fraudulent transactions where missing values themselves may indicate
suspicious activity.
Example:

Customer ID | Age | Salary | Salary Missing Indicator
101         | 35  | 50,000 | 0
102         | 42  | NaN    | 1
103         | 29  | 40,000 | 0
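 The sketch below applies these techniques to the example table with pandas and scikit-learn; the tiny three-row dataset is only for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"customer_id": [101, 102, 103],
                   "age": [35, 42, 29],
                   "salary": [50_000, np.nan, 40_000]})

# Indicator variable: preserve the fact that the value was missing
df["salary_missing"] = df["salary"].isna().astype(int)

# Median imputation for the numeric column
df["salary_median"] = df["salary"].fillna(df["salary"].median())

# Forward fill (typical for ordered time-series data)
df["salary_ffill"] = df["salary"].ffill()

# KNN imputation: estimate the gap from the most similar rows
df[["age", "salary"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "salary"]])
print(df)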
 Feature Extraction
 Feature extraction involves deriving new, meaningful representations from complex
data formats like text, images, and time-series. This is especially useful in high-
dimensional datasets.
 Techniques:
 Text Features: Converts textual data into numerical form for machine learning
models.
 Bag of Words (BoW): Represents text as word frequencies in a matrix.
 TF-IDF (Term Frequency-Inverse Document Frequency): Gives importance to words
based on their frequency in a document vs. overall dataset.
 Word Embeddings (Word2Vec, GloVe, BERT): Captures semantic meaning of words.
 Use Case: Sentiment analysis of customer reviews.
 Image Features: Extract essential patterns from images.
 Edge Detection: Identifies object boundaries in images (useful in medical imaging).
 Histogram of Oriented Gradients (HOG): Used in object detection.
 CNN-based Feature Extraction: Uses deep learning models like ResNet and VGG for
automatic feature learning.
 Use Case: Facial recognition, self-driving car object detection.
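 A small scikit-image sketch of classical image feature extraction; the built-in test photograph stands in for real data.

from skimage import data, feature, filters

image = data.camera()                      # built-in grayscale test image

# Edge detection: highlights object boundaries
edges = filters.sobel(image)

# Histogram of Oriented Gradients: fixed-length descriptor used in detection
hog_descriptor = feature.hog(image, orientations=9,
                             pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2))
print(edges.shape, hog_descriptor.shape)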
 Time-Series Features: Extract meaningful trends and seasonality
from time-series data.
 Rolling Averages: Smooth out short-term fluctuations.
 Seasonal Decomposition: Separates trend, seasonality, and residual
components.
 Autoregressive Features: Uses past values as inputs for predictive
models.
 Use Case: Forecasting electricity demand based on historical consumption
patterns.
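 A pandas sketch of rolling and lag (autoregressive) features on an assumed synthetic demand series:

import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=60, freq="D")
demand = pd.Series(100 + 10 * np.sin(np.arange(60) / 7) + np.random.randn(60),
                   index=idx, name="demand")

features = pd.DataFrame({"demand": demand})
features["rolling_mean_7d"] = demand.rolling(window=7).mean()   # smooths short-term noise
features["lag_1"] = demand.shift(1)                             # yesterday's value
features["lag_7"] = demand.shift(7)                             # value one week ago
print(features.tail())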
 Dimensionality Reduction (PCA, t-SNE, UMAP):
 PCA (Principal Component Analysis) reduces high-dimensional data
while preserving variance.
 t-SNE and UMAP are useful for visualizing clusters in large datasets.
 Use Case: Reducing thousands of customer behavior variables into a few
principal components for clustering.
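 A minimal scikit-learn sketch of PCA; the 200 x 50 random matrix stands in for real customer-behaviour data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))              # assumed stand-in data

# Standardize, then keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))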
Example:
For text analysis, TF-IDF converts raw sentences into numerical form:

Term         | “AI is transforming healthcare” | “AI is advancing research”
AI           | 0.4                             | 0.3
transforming | 0.6                             | 0.0
research     | 0.0                             | 0.7
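 A scikit-learn sketch that builds such a matrix (actual weights will differ from the illustrative numbers above):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["AI is transforming healthcare",
             "AI is advancing research"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)        # sparse document-term matrix

# One row per sentence, one column per term, values are TF-IDF weights
print(pd.DataFrame(tfidf.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=sentences).round(2))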
Handling Outliers
 Outliers are extreme values that differ sharply from the rest of the data and can distort model training if left untreated.
 Techniques:
 Winsorization: Replaces extreme values with a specified percentile
(e.g., capping values at the 5th and 95th percentile).
 Z-score Method: Removes values that are more than a certain
number of standard deviations from the mean (e.g., ±3σ).
 IQR (Interquartile Range) Method: Removes values beyond 1.5
times the interquartile range (Q1 and Q3).
 Transformations (Log, Square Root): Reduces the impact of
extreme values by adjusting scale.
Example:

Employee | Salary  | Outlier (IQR Method)
A        | 50,000  | No
B        | 52,000  | No
C        | 200,000 | Yes
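 The pandas sketch below applies the IQR rule, the z-score rule, and winsorization to a slightly larger assumed salary sample (with only three rows the IQR thresholds would be too wide to flag anything):

import pandas as pd

salaries = pd.DataFrame({"salary": [48_000, 50_000, 52_000, 51_000,
                                    49_000, 53_000, 200_000]})

# IQR method: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = salaries["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
salaries["outlier_iqr"] = ~salaries["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Z-score method: flag values more than 3 standard deviations from the mean
z = (salaries["salary"] - salaries["salary"].mean()) / salaries["salary"].std()
salaries["outlier_z"] = z.abs() > 3            # rarely triggers on tiny samples

# Winsorization: cap values at the 5th and 95th percentiles
low, high = salaries["salary"].quantile([0.05, 0.95])
salaries["salary_winsorized"] = salaries["salary"].clip(low, high)
print(salaries)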
Feature Interaction

 Feature interaction helps capture relationships between variables that
aren’t obvious in their raw form.
 Techniques:
 Multiplication or Division of Features:
 Example: Instead of using “Weight” and “Height” separately, create BMI =
Weight / Height².
 Polynomial Features:
 Example: Adding squared or cubic terms for better non-linear modelling.
 Clustering-Based Features:
 Assign cluster labels using k-means, which can be used as categorical
inputs.
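 A short sketch of the BMI ratio and a k-means cluster label as engineered features; all values and column names are illustrative.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

people = pd.DataFrame({"weight_kg": [70, 95, 54, 82],
                       "height_m": [1.75, 1.80, 1.62, 1.70],
                       "age": [30, 45, 27, 52]})

# Interaction via division: BMI = weight / height^2
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Clustering-based feature: k-means labels used as a categorical input
X = StandardScaler().fit_transform(people[["bmi", "age"]])
people["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(people)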
