Feature Engineering

Features and Feature Engineering


• Input data comprise features, which are usually in the form of structured columns.
• Algorithms require features with specific characteristics to work properly.
• This is where the need for feature engineering arises.
Feature engineering efforts mainly have two goals:
• Preparing an input dataset that is compatible with the machine learning algorithm's requirements.
• Improving the performance of machine learning models.
• "The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering." (Luca Massaron)
According to a Forbes Survey
Data scientists spend about 80% of their time on data preparation.
List of Techniques
1. Imputation
2. Handling Outliers
3. Binning
4. Log Transform
5. One-Hot Encoding
6. Grouping Operations
7. Scaling
Missing Values
1. Missing values are one of the most common problems when preparing data for machine learning.
2. The reason for missing values might be human error, interruptions in the data flow, privacy concerns, and so on.
3. Whatever the reason, missing values affect the performance of machine learning models.
4. The simplest solution to missing values is to drop the affected rows or entire columns.
5. There is no universally optimal threshold for dropping, but a 70% rule can be used.
6. Drop the rows and columns whose proportion of missing values is higher than this threshold, as in the sketch below.
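A minimal pandas sketch of this rule, assuming a hypothetical DataFrame (the column contents are illustrative; the 0.7 threshold follows the 70% rule above):

import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "age": [25, None, 40, None, 31],
    "income": [50000, 62000, None, None, None],
})

threshold = 0.7  # drop when more than 70% of the values are missing

# Drop columns whose fraction of missing values exceeds the threshold
df = df.loc[:, df.isnull().mean() <= threshold]

# Drop rows whose fraction of missing values exceeds the threshold
df = df.loc[df.isnull().mean(axis=1) <= threshold]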
Numerical Imputation
• When applying a typical machine learning model to such data, we will
need to first replace such missing data with some appropriate fill value.
This is known as imputation of missing values, and strategies range from
simple (e.g., replacing missing values with the mean of the column) to
sophisticated (e.g., using matrix completion or a robust model to handle
such data).
• The sophisticated approaches tend to be very application-specific.

https://colab.research.google.com/drive/1Jhj4YyLQlALfFgLiLu6mO67niGFtyGJg#scrollTo=CWNFSK_TlviA
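As a simple illustration of mean imputation (separate from the linked notebook), a sketch using scikit-learn's SimpleImputer on a hypothetical array:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries (np.nan)
X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [40.0, np.nan],
              [31.0, 58000.0]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)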
Categorical Imputation
• Replacing missing values with the most frequently occurring value (the mode) of a column is a good option for handling categorical columns.
• However, if the values in the column are distributed uniformly and there is no dominant value, imputing a category such as "Other" may be more sensible, because in that case imputing the mode would amount to little more than a random selection.
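A short sketch of both options with pandas (the column and category names are made up for illustration):

import pandas as pd

# Hypothetical categorical column with missing values
df = pd.DataFrame({"city": ["Delhi", "Mumbai", None, "Delhi", None, "Pune"]})

# Option 1: fill with the most frequently occurring category (the mode)
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# Option 2: fill with an explicit "Other" category when no value dominates
df["city_other"] = df["city"].fillna("Other")
print(df)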
Definition of Outliers
An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population. In a sense, this definition leaves it up
to the analyst (or a consensus process) to decide what will be considered
abnormal. Before abnormal observations can be singled out, it is necessary to
characterize normal observations.
Box Plot Construction
The box plot is a useful graphical display for describing the behavior of the
data in the middle as well as at the ends of the distributions. The box plot
uses the median and the lower and upper quartiles (defined as the 25th and
75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3,
then the difference (Q3 - Q1) is called the interquartile range or IQ.
Box Plots with fences
• A box plot is constructed by drawing a box between the upper and lower
quartiles with a solid line drawn across the box to locate the median. The
following quantities (called fences) are needed for identifying extreme
values in the tails of the distribution:
• lower inner fence: Q1 - 1.5*IQ
• upper inner fence: Q3 + 1.5*IQ
• lower outer fence: Q1 - 3*IQ
• upper outer fence: Q3 + 3*IQ
Outlier Detection Criterion
A point beyond an inner fence on either side is considered a mild outlier. A
point beyond an outer fence is considered an extreme outlier.
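A small helper that computes these fences; note that np.percentile's default interpolation can differ slightly from the (N+1)-based quartiles used in the worked example on the next slide, so this is a sketch rather than an exact reproduction:

import numpy as np

def iqr_fences(values):
    # Quartiles and interquartile range (called IQ above)
    q1, q3 = np.percentile(values, [25, 75])
    iq = q3 - q1
    return {
        "lower_inner": q1 - 1.5 * iq,   # mild outliers lie beyond the inner fences
        "upper_inner": q3 + 1.5 * iq,
        "lower_outer": q1 - 3.0 * iq,   # extreme outliers lie beyond the outer fences
        "upper_outer": q3 + 3.0 * iq,
    }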
Example of an Outlier Box Plot
The data set of N = 90 ordered observations shown below is examined for
outliers: 30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451,
453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550,
559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637,
638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794,
802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068,
1441
Example(Cont.)
• The computations are as follows:
• Median = (n+1)/2 largest data point = the average of the 45th and 46th ordered
points = (559 + 560)/2 = 559.5
• Lower quartile = .25(N+1)th ordered point = 22.75th ordered point = 411
+ .75(436-411) = 429.75
• Upper quartile = .75(N+1)th ordered point = 68.25th ordered point = 739
+.25(752-739) = 742.25
• Interquartile range = 742.25 - 429.75 = 312.5
• Lower inner fence = 429.75 - 1.5 (312.5) = -39.0
• Upper inner fence = 742.25 + 1.5 (312.5) = 1211.0
• Lower outer fence = 429.75 - 3.0 (312.5) = -507.75
• Upper outer fence = 742.25 + 3.0 (312.5) = 1679.75
From an examination of the fence points and the data, one point (1441) exceeds the
upper inner fence and stands out as a mild outlier; there are no extreme outliers.
Histogram with Box Plot

The outlier is identified as the largest value in the data set, 1441, and appears as
the circle to the right of the box plot.
Outliers may contain important information
Outliers should be investigated carefully. Often they contain valuable
information about the process under investigation or the data gathering and
recording process. Before considering the possible elimination of these points
from the data, one should try to understand why they appeared and whether
it is likely similar values will continue to appear. Of course, outliers are often
bad data points.
Handling Outliers(?)
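The slide poses the question without prescribing a method; a minimal, assumed sketch of two common options based on the fences above (dropping or capping values beyond the inner fences):

import pandas as pd

# Hypothetical numeric column containing an outlier
s = pd.Series([30, 171, 184, 559, 742, 1441])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iq = q3 - q1
lower, upper = q1 - 1.5 * iq, q3 + 1.5 * iq

# Option 1: drop observations outside the inner fences
dropped = s[(s >= lower) & (s <= upper)]

# Option 2: cap (clip) values to the fences instead of dropping them
capped = s.clip(lower=lower, upper=upper)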
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
Binning Method(Data Smoothing)
The data is first sorted and then the sorted values are distributed into a
number of buckets or bins. As binning methods consult the neighborhood of
values, they perform local smoothing.
• There are basically two types of binning approaches –
• Equal width (or distance) binning : The simplest binning approach is to
partition the range of the variable into k equal-width intervals. The
interval width is simply the range [A, B] of the variable divided by k,
w = (B-A) / k.
Thus, the i-th interval is [A + (i-1)w, A + iw], where i = 1, 2, ..., k.
Skewed data are not handled well by this method.
• Equal depth (or frequency) binning : In equal-frequency binning we divide the range [A, B] of the variable into intervals that contain an (approximately) equal number of points; exactly equal frequencies may not be possible due to repeated values.
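A short pandas sketch of both approaches, using the price data from the example that follows (pd.cut for equal width, pd.qcut for equal frequency):

import pandas as pd

prices = pd.Series([2, 6, 7, 9, 13, 20, 21, 24, 30])

# Equal-width binning: k = 3 intervals of width w = (B - A) / k
equal_width = pd.cut(prices, bins=3)

# Equal-frequency (equal-depth) binning: roughly equal number of points per bin
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())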
Perform Smoothing on the Data
Three approaches to perform smoothing
• Smoothing by bin means : In smoothing by bin means, each value in a bin
is replaced by the mean value of the bin.
• Smoothing by bin median : In this method each bin value is replaced by
its bin median value.
• Smoothing by bin boundary : In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
Example
Sorted data for price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30
Partition using equal frequency approach:
Bin 1 : 2, 6, 7
Bin 2 : 9, 13, 20
Bin 3 : 21, 24, 30
Smoothing by bin mean :
Bin 1 : 5, 5, 5
Bin 2 : 14, 14, 14
Bin 3 : 25, 25, 25
Smoothing by bin median :
Bin 1 : 6, 6, 6
Bin 2 : 13, 13, 13
Bin 3 : 24, 24, 24
Smoothing by bin boundary :
Bin 1 : 2, 7, 7
Bin 2 : 9, 9, 20
Bin 3 : 21, 21, 30
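The same example can be reproduced with pandas groupby/transform; this is a sketch, with the bin assignment taken from pd.qcut rather than a manual partition:

import numpy as np
import pandas as pd

prices = pd.Series([2, 6, 7, 9, 13, 20, 21, 24, 30])

# Assign each value to one of 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
by_mean = prices.groupby(bins).transform("mean")

# Smoothing by bin medians
by_median = prices.groupby(bins).transform("median")

# Smoothing by bin boundaries: snap each value to the nearer bin boundary
lo = prices.groupby(bins).transform("min")
hi = prices.groupby(bins).transform("max")
by_boundary = np.where((prices - lo) <= (hi - prices), lo, hi)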
Programming Examples
1. https://colab.research.google.com/drive/14MLastd8CPHKVNEvEk4TM1HPMVzP66s4
2. https://colab.research.google.com/drive/1cNHJww6QAVrB2NyVsVxRh3h2ipuU7zjh
Log Transform
Logarithm transformation (or log transform) is one of the most commonly
used mathematical transformations in feature engineering.
Benefits of Log Transform
• It helps to handle skewed data; after the transformation, the distribution becomes closer to normal.
• In most cases, the order of magnitude of the data varies within the range of the data.
• For instance, the difference between ages 15 and 20 is not equivalent to the difference between ages 65 and 70. In terms of years they are identical, but in most other respects a 5-year gap at young ages represents a difference of higher magnitude. This type of data comes from a multiplicative process, and the log transform normalizes such magnitude differences.
• It also decreases the effect of outliers, due to this normalization of magnitude differences, and the model becomes more robust.
• A critical note: the data you apply a log transform to must contain only positive values; otherwise you will receive an error. You can add 1 to your data before transforming it, which also keeps the output of the transformation non-negative: log(x + 1).
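A one-line sketch with NumPy, using a hypothetical right-skewed income column; np.log1p computes log(x + 1) directly:

import numpy as np
import pandas as pd

# Hypothetical right-skewed feature
income = pd.Series([20_000, 35_000, 40_000, 55_000, 1_200_000])

# log(x + 1) transform to compress the large values
income_log = np.log1p(income)
print(income_log)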
One-Hot Encoding
• One of the most common encoding methods in machine learning. This method spreads the values in a column across multiple flag columns and assigns 0 or 1 to them. These binary values indicate which category each row belongs to.
• This method converts categorical data, which many algorithms cannot use directly, into a numerical format, and it lets you group your categorical data without losing any information.
Example
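The original slide's example figure is not reproduced here; a minimal sketch with pandas get_dummies (the column and city names are illustrative):

import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Delhi"]})

# Spread the 'city' column into one flag column per category, filled with 0/1
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded)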
Grouping Operations
GroupBy: Split, Apply, Combine:
Aggregate conditionally on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.
Example(Split, Apply and Combine)
What GroupBy accomplishes
• This makes clear what the groupby accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on
the value of the specified key.
• The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output
array.
• While this could certainly be done manually using some combination of the
masking, aggregation, and merging commands covered earlier, an important
realization is that the intermediate splits do not need to be explicitly
instantiated. Rather, the GroupBy can (often) do this in a single pass over the
data, updating the sum, mean, count, min, or other aggregate for each group
along the way. The power of the GroupBy is that it abstracts away these
steps: the user need not think about how the computation is done under the
hood, but rather thinks about the operation as a whole.
Example
The most basic split-apply-combine operation can be computed with
the groupby() method of DataFrames, passing the name of the desired key
column:

https://colab.research.google.com/drive/18YkdZsK42-lqePHm519ZSJRaG22s4JeL#scrollTo=SDTSESS0lXWr

Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy object.
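A minimal sketch in the spirit of the linked notebook (the key/data columns are a toy example assumed here):

import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data": [0, 1, 2, 3, 4, 5]})

# Split on 'key', apply a sum within each group, combine the results
grouped = df.groupby("key")   # a DataFrameGroupBy object, not a set of DataFrames
print(grouped.sum())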
Scaling
• In most cases, the numerical features of a dataset do not share a common range; they differ from each other.
• In real life, it is unreasonable to expect the age and income columns to have the same range.
• Scaling solves this problem.
• After a scaling process, the continuous features become comparable in terms of their range.
• This process is not mandatory for many algorithms, but it is often still beneficial to apply.
• However, algorithms based on distance calculations, such as k-NN or k-Means, need scaled continuous features as model input.
Standard Scaler
StandardScaler standardizes each feature toward a standard normal distribution: it removes the mean (making it 0) and scales the data to unit variance.
Min-Max Normalization
• MinMaxScaler scales each feature to a given range, [0, 1] by default (a range such as [-1, 1] can also be specified).
• This transformation does not change the shape of the feature's distribution, but because the standard deviation is reduced, the relative effect of outliers increases. Therefore, it is recommended to handle outliers before normalization.
RobustScaler
RobustScaler scales features using statistics that are robust to outliers: it removes the median and scales the data according to the interquartile range (IQR). Alternatively, remove the outliers first and then use either StandardScaler or MinMaxScaler to preprocess the dataset.
Python Example
Standard, MinMax and RobustScaler Implementation in Python
https://colab.research.google.com/drive/1wfhQ8GUTiq636hwUMXIz4loJt8uMSilI#scrollTo=UG7xSQUQFgPm&line=7&uniqifier=1
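Alongside the notebook, a minimal comparison of the three scalers on a hypothetical feature containing one large outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to [0, 1]
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by the IQR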
References
1. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
2. https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
3. https://www.geeksforgeeks.org/ml-binning-or-discretization/
4. https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb#scrollTo=Hk9q1xYbEKPy
5. https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/
