04 - Feature Engineering

The document provides an introduction to feature engineering in data science, emphasizing its significance in transforming raw data into usable features for machine learning. It outlines various techniques for handling missing values, outliers, and categorical variables, as well as methods like log transformation and scaling. Additionally, it highlights the importance of expert knowledge and time investment in the feature engineering process.

Introduction to data science

Feature Engineering
The importance of data collection and preprocessing

What data scientists spend the most time doing


Feature engineering

◼ “Feature engineering is the art part of data science.” - Sergey Yurgenson

◼ “Coming up with features is difficult, time-consuming, requires expert
knowledge.” - Andrew Ng

◼ Feature engineering is the process of selecting, manipulating, and
transforming raw data into features that can be used in supervised
learning.

◼ Feature engineering is a very important step in machine learning: the
process of designing artificial features that a learning algorithm can use.
Feature engineering
Feature Engineering Methods
Feature Engineering Techniques

1. Imputation
The process of handling missing values, one of the most common
problems when preparing data for machine learning.
Feature Engineering Techniques

Common ways missing values are represented in a dataset:

• NaN (Not a Number): the default in libraries like Pandas in
Python.
• NULL or None: in SQL databases.
• Empty strings (""): in text-based data or CSV files where a
field is left blank.
• Special indicators (-999, 9999, or other unlikely values used to
signify missing data): in older datasets or specific industries
where such conventions were established.
• Blanks or spaces: in fixed-width text files.
Feature Engineering Techniques

Types of missing data

• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)
Feature Engineering Techniques

Handling missing values

• Delete rows with missing values: remove rows or columns
that contain missing values.

• Replace with an arbitrary or estimated value: fill missing
values with a constant, or with a statistic such as the mean
or median.

• Interpolation: estimate missing values from neighbouring
observations.
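The three strategies above can be sketched in pandas as follows; the column name and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in the "age" column.
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan, 40.0]})

# 1. Delete rows that contain missing values.
dropped = df.dropna()

# 2. Replace missing values with a statistic (here the column mean).
mean_filled = df["age"].fillna(df["age"].mean())

# 3. Interpolate missing values from neighbouring observations.
interpolated = df["age"].interpolate()
```

Note that deletion shrinks the dataset, while the two imputation strategies preserve its size.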
Feature Engineering Techniques

Replacing with an arbitrary value

• Replacing with the previous value – forward fill
• Replacing with the next value – backward fill
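A minimal sketch of forward and backward fill in pandas, using a hypothetical series with gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values.
s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan])

# Forward fill: propagate the previous observed value forward.
forward = s.ffill()

# Backward fill: propagate the next observed value backward.
backward = s.bfill()
```

Forward fill leaves leading gaps unfilled, and backward fill leaves trailing gaps unfilled (the final NaN above survives `bfill`), so the two are sometimes combined.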
Feature Engineering Techniques

2. Handling Outliers
▪ Outliers are observations in a dataset that lie far from the
rest of the observations.
▪ Outliers occur due to variability in the data, or due to
experimental or human error.

(Example: the highest sea levels recorded in Venice.)
Feature Engineering Techniques

2. Handling Outliers

▪ Some techniques for detecting outliers:
✓ Boxplots
✓ Z-score
✓ Interquartile Range (IQR)
Feature Engineering Techniques

2. Handling Outliers

Z-score
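A minimal sketch of z-score outlier detection, using hypothetical data and a threshold of 2 (3 is another common choice):

```python
import numpy as np

# Hypothetical sample with one extreme observation.
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Z-score: how many standard deviations each point lies from the mean.
z = (data - data.mean()) / data.std()

# Flag points whose absolute z-score exceeds the chosen threshold.
outliers = data[np.abs(z) > 2]
```

One caveat: extreme values inflate the mean and standard deviation themselves, so the z-score can understate how unusual an outlier is in small samples.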
Feature Engineering Techniques

2. Handling Outliers

Interquartile Range (IQR)
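The IQR rule can be sketched as follows, with hypothetical data; points beyond 1.5 × IQR from the quartiles are flagged:

```python
import numpy as np

# Hypothetical sample with one extreme observation.
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```

Because quartiles are robust to extreme values, the IQR rule is less affected by the outliers themselves than the z-score approach.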


Feature Engineering Techniques

2. Handling Outliers

Box-Plot
Feature Engineering Techniques

3. Log Transform
▪ A data transformation in which each variable x is replaced
with its logarithm, log(x), using base 10, base 2, or the
natural logarithm.
▪ Commonly used to compress the y-axis in histograms,
making visualization clearer and de-emphasizing outliers
in the data.
▪ A log transformation does not guarantee a normal
distribution, although the result is often close to normal.
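A minimal sketch of log transforms in NumPy, using hypothetical right-skewed values:

```python
import numpy as np

# Hypothetical right-skewed values (e.g. prices spanning several orders
# of magnitude).
x = np.array([1.0, 10.0, 100.0, 1000.0])

log10_x = np.log10(x)   # base-10 logarithm
natural = np.log(x)     # natural logarithm
safe = np.log1p(x)      # log(1 + x), usable when zeros are present
```

Note that log(x) is undefined at zero and for negative values, which is why log(1 + x) is often used for count-like features.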
Feature Engineering Techniques

3. Log Transform
Feature Engineering Techniques

4. One-Hot Encoding

▪ A technique for representing categorical variables as
numerical values in a machine learning model.
Example: a Gender category produces separate columns
for the Male and Female labels. Wherever the value is
Male, the Male column holds 1 and the Female column
holds 0, and vice versa.
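The Male/Female example above can be sketched with pandas (the column name and labels are taken from the example):

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# One column per category; 1 marks membership, 0 otherwise.
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
```

`get_dummies` creates one indicator column per distinct label (`gender_Female`, `gender_Male` here), so the number of new columns grows with the number of categories.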
Feature Engineering Techniques

5. Scaling
▪ A data calibration technique that facilitates the
comparison of different kinds of data.
▪ Useful for correcting how a model handles features with
very small and very large values.

Example: a comparative analysis of the planets is easier if
we normalize diameters as proportions of each other
instead of using the actual diameters.
Feature Engineering Techniques

5. Scaling
▪ Techniques

• Min-max scaling: rescales features to the range 0 to 1 by
subtracting the minimum value and dividing by the range.
• Standardization: rescales features to a mean of 0 and a
standard deviation of 1 using the formula (x - mean) / std_dev.
Useful for algorithms that assume normally distributed
feature values.
• Log transforms: apply a logarithmic transformation to highly
skewed features to reduce the impact of outliers.
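The first two techniques can be sketched directly in NumPy; the diameters below are illustrative planet values in km:

```python
import numpy as np

# Hypothetical planet diameters in km (Mercury, Earth, Jupiter).
x = np.array([4879.0, 12742.0, 139820.0])

# Min-max scaling: map values into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: shift to mean 0 and scale to standard deviation 1.
standardized = (x - x.mean()) / x.std()
```

In practice scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same formulas while remembering the training-set statistics for later use on new data.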
Feature Creation

The prices of properties in x city


Feature engineering – Examples

Melbourne Housing dataset.

◼ Dealing with missing values

◼ Scaling data
