Feature Engineering
https://colab.research.google.com/drive/1Jhj4YyLQlALfFgLiLu6mO67niGFtyGJg#scrollTo=CWNFSK_TlviA
Categorical Imputation
• Replacing the missing values with the most frequently occurring value (the mode) of a column is a good option for handling categorical columns.
• But if the values in the column are distributed uniformly and there is no dominant value, imputing a category like “Other” may be more sensible, because in such a case mode imputation amounts to little more than a random selection.
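A minimal sketch of both options in pandas, using a hypothetical color column:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', None, 'red', None, 'green']})

# Option 1: impute with the most frequent category (the mode).
df['color_mode'] = df['color'].fillna(df['color'].mode()[0])

# Option 2: impute an explicit "Other" category when no value dominates.
df['color_other'] = df['color'].fillna('Other')
print(df)
```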
Definition of Outliers
An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population. In a sense, this definition leaves it up
to the analyst (or a consensus process) to decide what will be considered
abnormal. Before abnormal observations can be singled out, it is necessary to
characterize normal observations.
Box Plot Construction
The box plot is a useful graphical display for describing the behavior of the
data in the middle as well as at the ends of the distributions. The box plot
uses the median and the lower and upper quartiles (defined as the 25th and
75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3,
then the difference (Q3 - Q1) is called the interquartile range, or IQR.
Box Plots with fences
• A box plot is constructed by drawing a box between the upper and lower
quartiles with a solid line drawn across the box to locate the median. The
following quantities (called fences) are needed for identifying extreme
values in the tails of the distribution:
• lower inner fence: Q1 - 1.5*IQR
• upper inner fence: Q3 + 1.5*IQR
• lower outer fence: Q1 - 3*IQR
• upper outer fence: Q3 + 3*IQR
Outlier Detection Criterion
A point beyond an inner fence on either side is considered a mild outlier. A
point beyond an outer fence is considered an extreme outlier.
Example of an Outlier Box Plot
The data set of N = 90 ordered observations as shown below is examined for
outliers: 30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451,
453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550,
559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637,
638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794,
802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068,
1441
Example (Cont.)
• The computations are as follows:
• Median = (N+1)/2 largest data point = the average of the 45th and 46th ordered
points = (559 + 560)/2 = 559.5
• Lower quartile = .25(N+1)th ordered point = 22.75th ordered point = 411
+ .75(436-411) = 429.75
• Upper quartile = .75(N+1)th ordered point = 68.25th ordered point = 739
+.25(752-739) = 742.25
• Interquartile range = 742.25 - 429.75 = 312.5
• Lower inner fence = 429.75 - 1.5 (312.5) = -39.0
• Upper inner fence = 742.25 + 1.5 (312.5) = 1211.0
• Lower outer fence = 429.75 - 3.0 (312.5) = -507.75
• Upper outer fence = 742.25 + 3.0 (312.5) = 1679.75
From an examination of the fence points and the data, one point (1441) exceeds the
upper inner fence and stands out as a mild outlier; there are no extreme outliers.
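The fence computations can be reproduced with NumPy; a minimal sketch follows. Note that np.quantile's method='weibull' option (available in recent NumPy versions) uses the p(N+1) positions, matching the hand calculation above; NumPy's default 'linear' method would give slightly different quartiles.

```python
import numpy as np

# The 90 ordered observations from the example above.
data = np.array([
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
    336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448,
    451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527,
    548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618,
    621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
    953, 991, 1000, 1005, 1068, 1441,
])

q1 = np.quantile(data, 0.25, method='weibull')   # 429.75
q3 = np.quantile(data, 0.75, method='weibull')   # 742.25
iqr = q3 - q1                                    # 312.5

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)         # (-39.0, 1211.0)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)         # (-507.75, 1679.75)

# Mild outliers fall beyond an inner fence but within the outer fences;
# extreme outliers fall beyond an outer fence.
beyond_inner = (data < inner[0]) | (data > inner[1])
beyond_outer = (data < outer[0]) | (data > outer[1])
print("mild:", data[beyond_inner & ~beyond_outer])   # [1441]
print("extreme:", data[beyond_outer])                # []
```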
Histogram with Box Plot
The outlier is identified as the largest value in the data set, 1441, and appears as
the circle to the right of the box plot.
Outliers may contain important information
Outliers should be investigated carefully. Often they contain valuable
information about the process under investigation or the data gathering and
recording process. Before considering the possible elimination of these points
from the data, one should try to understand why they appeared and whether
it is likely similar values will continue to appear. Of course, outliers are often
bad data points.
Handling Outliers
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
Binning Methods (Data Smoothing)
The data is first sorted and then the sorted values are distributed into a
number of buckets or bins. As binning methods consult the neighborhood of
values, they perform local smoothing.
• There are basically two types of binning approaches (both are sketched in code below):
• Equal-width (or distance) binning: the simplest binning approach is to
partition the range of the variable into k equal-width intervals. The
interval width is simply the range [A, B] of the variable divided by k,
w = (B - A) / k.
Thus, the i-th interval will be [A + (i-1)w, A + iw], where i = 1, 2, ..., k.
Skewed data is not handled well by this method.
• Equal-depth (or frequency) binning: in equal-frequency binning we divide
the range [A, B] of the variable into intervals that contain (approximately)
equal numbers of points; exactly equal frequencies may not be possible due to
repeated values.
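A minimal sketch of the two approaches using pandas (pd.cut for equal width, pd.qcut for equal frequency), on a hypothetical price column:

```python
import pandas as pd

prices = pd.Series([2, 6, 7, 9, 13, 20, 21, 24, 30])

# Equal-width binning: k = 3 intervals, each of width (B - A) / k.
equal_width = pd.cut(prices, bins=3)

# Equal-frequency binning: 3 bins with (approximately) equal counts.
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```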
Perform Smoothing on the Data
There are three approaches to performing smoothing:
• Smoothing by bin means: each value in a bin is replaced by the mean value
of the bin.
• Smoothing by bin medians: each value in a bin is replaced by the median
value of the bin.
• Smoothing by bin boundaries: the minimum and maximum values in a given bin
are identified as the bin boundaries, and each bin value is then replaced by
the closest boundary value.
Example
Sorted data for price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30
Partition using equal frequency approach:
Bin 1 : 2, 6, 7
Bin 2 : 9, 13, 20
Bin 3 : 21, 24, 30
Smoothing by bin mean :
Bin 1 : 5, 5, 5
Bin 2 : 14, 14, 14
Bin 3 : 25, 25, 25
Smoothing by bin median :
Bin 1 : 6, 6, 6
Bin 2 : 13, 13, 13
Bin 3 : 24, 24, 24
Smoothing by bin boundary :
Bin 1 : 2, 7, 7
Bin 2 : 9, 9, 20
Bin 3 : 21, 21, 30
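A minimal sketch that reproduces the three smoothing schemes on the example data (the tie rule for boundary smoothing, which is an assumption here, sends a value equidistant from both boundaries to the lower one):

```python
import numpy as np

data = [2, 6, 7, 9, 13, 20, 21, 24, 30]                 # already sorted
bins = [data[i:i + 3] for i in range(0, len(data), 3)]  # equal-frequency bins

for b in bins:
    means = [round(float(np.mean(b)))] * len(b)   # smoothing by bin means
    medians = [int(np.median(b))] * len(b)        # smoothing by bin medians
    lo, hi = b[0], b[-1]                          # bin boundaries
    bounds = [lo if x - lo <= hi - x else hi for x in b]
    print(b, "->", means, medians, bounds)
```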
Programming Examples
1. https://colab.research.google.com/drive/14MLastd8CPHKVNEvEk4TM1HPMVzP66s4
2. https://colab.research.google.com/drive/1cNHJww6QAVrB2NyVsVxRh3h2ipuU7zjh
Log Transform
Logarithm transformation (or log transform) is one of the most commonly
used mathematical transformations in feature engineering.
Benefits of Log Transform
• It helps to handle skewed data; after transformation, the distribution
becomes closer to normal.
• In many cases, the order of magnitude of the data changes within the range
of the data, and relative (multiplicative) differences matter more than
absolute ones.
• For instance, the difference between ages 15 and 20 is not the same as
the difference between ages 65 and 70. In terms of years they are identical,
but in every other respect a 5-year difference at a young age represents a
much larger relative difference. This type of data comes from a
multiplicative process, and the log transform normalizes magnitude
differences like that.
• It also decreases the effect of outliers, due to the normalization of
magnitude differences, and the model becomes more robust.
• A critical note: the data you apply a log transform to must contain only
positive values; otherwise you will get an error. You can add 1 to your data
before transforming it, which keeps the argument of the logarithm positive:
log(x + 1).
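A minimal sketch with a hypothetical skewed column; NumPy's log1p computes log(x + 1) directly:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature containing a zero value.
df = pd.DataFrame({'income': [0, 20_000, 25_000, 30_000, 45_000, 1_200_000]})

# log(x + 1) keeps the argument positive even when x == 0.
df['log_income'] = np.log1p(df['income'])
print(df)
```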
One-Hot Encoding
• One of the most common encoding methods in machine learning. This method
spreads the values in a column across multiple flag columns and assigns 0 or
1 to them. These binary values express the relationship between the original
(grouped) column and the encoded flag columns.
• This method converts categorical data, which is challenging for algorithms
to understand, into a numerical format, and enables you to group your
categorical data without losing any information.
Example
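A minimal sketch using pandas get_dummies, with a hypothetical city column:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Rome', 'Madrid', 'Madrid', 'Istanbul', 'Rome']})

# Spread the single categorical column into one binary flag column per value.
encoded = pd.get_dummies(df, columns=['city'])
print(encoded)
```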
Grouping Operations
GroupBy: Split, Apply, Combine:
Aggregating conditionally on some label or index is implemented in the
so-called groupby operation. The name "group by" comes from a command in
the SQL database language, but it is perhaps more illuminating to think of it
in the terms first coined by Hadley Wickham of Rstats fame: split, apply,
combine.
Example (Split, Apply, and Combine)
What GroupBy accomplishes
• This makes clear what the groupby accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on
the value of the specified key.
• The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output
array.
• While this could certainly be done manually using some combination of the
masking, aggregation, and merging commands covered earlier, an important
realization is that the intermediate splits do not need to be explicitly
instantiated. Rather, the GroupBy can (often) do this in a single pass over the
data, updating the sum, mean, count, min, or other aggregate for each group
along the way. The power of the GroupBy is that it abstracts away these
steps: the user need not think about how the computation is done under the
hood, but rather thinks about the operation as a whole.
Example
The most basic split-apply-combine operation can be computed with
the groupby() method of DataFrames, passing the name of the desired key
column:
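A minimal sketch of this basic operation (the DataFrame here is a hypothetical example; the linked notebook covers it in more detail):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split on 'key', apply a sum within each group, combine the results.
print(df.groupby('key').sum())
```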
https://colab.research.google.com/drive/18YkdZsK42-lqePHm519ZSJRaG22s4JeL#scrollTo=SDTSESS0lXWr