Feature Engineering
https://colab.research.google.com/drive/1Jhj4YyLQlALfFgLiLu6mO67niGFtyGJg#scrollTo=CWNFSK_TlviA
Categorical Imputation
• Replacing the missing values with the most frequently occurring value (the mode) of a column is a good option for handling categorical columns.
• But if the values in the column are distributed uniformly and there is no dominant value, imputing a category like “Other” may be more sensible, because in such a case mode imputation amounts to little more than a random selection.
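A minimal sketch of both options in pandas, using a hypothetical color column:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', None, 'red', None, 'green']})

# Option 1: impute with the most frequent category (the mode).
df['color_mode'] = df['color'].fillna(df['color'].mode()[0])

# Option 2: impute an explicit "Other" category when no value dominates.
df['color_other'] = df['color'].fillna('Other')
print(df)
```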
Definition of Outliers
An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population. In a sense, this definition leaves it up
to the analyst (or a consensus process) to decide what will be considered
abnormal. Before abnormal observations can be singled out, it is necessary to
characterize normal observations.
Box Plot Construction
The box plot is a useful graphical display for describing the behavior of the
data in the middle as well as at the ends of the distributions. The box plot
uses the median and the lower and upper quartiles (defined as the 25th and
75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3,
then the difference (Q3 - Q1) is called the interquartile range, or IQR.
Box Plots with fences
• A box plot is constructed by drawing a box between the upper and lower
quartiles with a solid line drawn across the box to locate the median. The
following quantities (called fences) are needed for identifying extreme
values in the tails of the distribution:
• lower inner fence: Q1 - 1.5*IQR
• upper inner fence: Q3 + 1.5*IQR
• lower outer fence: Q1 - 3*IQR
• upper outer fence: Q3 + 3*IQR
Outlier Detection Criterion
A point beyond an inner fence on either side is considered a mild outlier. A
point beyond an outer fence is considered an extreme outlier.
Example of an Outlier Box Plot
The data set of N = 90 ordered observations as shown below is examined for
outliers: 30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451,
453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550,
559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637,
638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794,
802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068,
1441
Example (Cont.)
• The computations are as follows:
• Median = (N+1)/2 largest data point = the average of the 45th and 46th ordered
points = (559 + 560)/2 = 559.5
• Lower quartile = .25(N+1)th ordered point = 22.75th ordered point = 411
+ .75(436-411) = 429.75
• Upper quartile = .75(N+1)th ordered point = 68.25th ordered point = 739
+.25(752-739) = 742.25
• Interquartile range = 742.25 - 429.75 = 312.5
• Lower inner fence = 429.75 - 1.5 (312.5) = -39.0
• Upper inner fence = 742.25 + 1.5 (312.5) = 1211.0
• Lower outer fence = 429.75 - 3.0 (312.5) = -507.75
• Upper outer fence = 742.25 + 3.0 (312.5) = 1679.75
From an examination of the fence points and the data, one point (1441) exceeds the
upper inner fence and stands out as a mild outlier; there are no extreme outliers.
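The fence computations can be reproduced with NumPy; a minimal sketch follows. Note that np.quantile's method='weibull' option (available in recent NumPy versions) uses the p(N+1) positions, matching the hand calculation above; NumPy's default 'linear' method would give slightly different quartiles.

```python
import numpy as np

# The 90 ordered observations from the example above.
data = np.array([
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
    336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448,
    451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527,
    548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618,
    621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
    953, 991, 1000, 1005, 1068, 1441,
])

q1 = np.quantile(data, 0.25, method='weibull')   # 429.75
q3 = np.quantile(data, 0.75, method='weibull')   # 742.25
iqr = q3 - q1                                    # 312.5

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)         # (-39.0, 1211.0)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)         # (-507.75, 1679.75)

# Mild outliers fall beyond an inner fence but within the outer fences;
# extreme outliers fall beyond an outer fence.
beyond_inner = (data < inner[0]) | (data > inner[1])
beyond_outer = (data < outer[0]) | (data > outer[1])
print("mild:", data[beyond_inner & ~beyond_outer])   # [1441]
print("extreme:", data[beyond_outer])                # []
```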
Histogram with Box Plot
The outlier is identified as the largest value in the data set, 1441, and appears as
the circle to the right of the box plot.
Outliers may contain important information
Outliers should be investigated carefully. Often they contain valuable
information about the process under investigation or the data gathering and
recording process. Before considering the possible elimination of these points
from the data, one should try to understand why they appeared and whether
it is likely similar values will continue to appear. Of course, outliers are often
bad data points.
Handling Outliers
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
Binning Methods (Data Smoothing)
The data is first sorted and then the sorted values are distributed into a
number of buckets or bins. As binning methods consult the neighborhood of
values, they perform local smoothing.
• There are basically two types of binning approaches (both are sketched in code below):
• Equal-width (or distance) binning: the simplest binning approach is to
partition the range of the variable into k equal-width intervals. The
interval width is simply the range [A, B] of the variable divided by k,
w = (B - A) / k.
Thus, the i-th interval will be [A + (i-1)w, A + iw], where i = 1, 2, ..., k.
Skewed data is not handled well by this method.
• Equal-depth (or frequency) binning: in equal-frequency binning we divide
the range [A, B] of the variable into intervals that contain (approximately)
equal numbers of points; exactly equal frequencies may not be possible due to
repeated values.
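A minimal sketch of the two approaches using pandas (pd.cut for equal width, pd.qcut for equal frequency), on a hypothetical price column:

```python
import pandas as pd

prices = pd.Series([2, 6, 7, 9, 13, 20, 21, 24, 30])

# Equal-width binning: k = 3 intervals, each of width (B - A) / k.
equal_width = pd.cut(prices, bins=3)

# Equal-frequency binning: 3 bins with (approximately) equal counts.
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```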
Perform Smoothing on the Data
There are three approaches to performing smoothing:
• Smoothing by bin means: each value in a bin is replaced by the mean value
of the bin.
• Smoothing by bin medians: each value in a bin is replaced by the median
value of the bin.
• Smoothing by bin boundaries: the minimum and maximum values in a given bin
are identified as the bin boundaries, and each bin value is then replaced by
the closest boundary value.
Example
Sorted data for price (in dollars): 2, 6, 7, 9, 13, 20, 21, 24, 30
Partition using equal frequency approach:
Bin 1 : 2, 6, 7
Bin 2 : 9, 13, 20
Bin 3 : 21, 24, 30
Smoothing by bin mean :
Bin 1 : 5, 5, 5
Bin 2 : 14, 14, 14
Bin 3 : 25, 25, 25
Smoothing by bin median :
Bin 1 : 6, 6, 6
Bin 2 : 13, 13, 13
Bin 3 : 24, 24, 24
Smoothing by bin boundary :
Bin 1 : 2, 7, 7
Bin 2 : 9, 9, 20
Bin 3 : 21, 21, 30
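A minimal sketch that reproduces the three smoothing schemes on the example data (the tie rule for boundary smoothing, which is an assumption here, sends a value equidistant from both boundaries to the lower one):

```python
import numpy as np

data = [2, 6, 7, 9, 13, 20, 21, 24, 30]                 # already sorted
bins = [data[i:i + 3] for i in range(0, len(data), 3)]  # equal-frequency bins

for b in bins:
    means = [round(float(np.mean(b)))] * len(b)   # smoothing by bin means
    medians = [int(np.median(b))] * len(b)        # smoothing by bin medians
    lo, hi = b[0], b[-1]                          # bin boundaries
    bounds = [lo if x - lo <= hi - x else hi for x in b]
    print(b, "->", means, medians, bounds)
```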
Programming Examples
1. https://colab.research.google.com/drive/14MLastd8CPHKVNEvEk4TM1HPMVzP66s4
2. https://colab.research.google.com/drive/1cNHJww6QAVrB2NyVsVxRh3h2ipuU7zjh
Log Transform
Logarithm transformation (or log transform) is one of the most commonly
used mathematical transformations in feature engineering.
Benefits of Log Transform
• It helps to handle skewed data; after transformation, the distribution
becomes closer to normal.
• In many cases, the order of magnitude of the data changes within the range
of the data, and relative (multiplicative) differences matter more than
absolute ones.
• For instance, the difference between ages 15 and 20 is not the same as
the difference between ages 65 and 70. In terms of years they are identical,
but in every other respect a 5-year difference at a young age represents a
much larger relative difference. This type of data comes from a
multiplicative process, and the log transform normalizes magnitude
differences like that.
• It also decreases the effect of outliers, due to the normalization of
magnitude differences, and the model becomes more robust.
• A critical note: the data you apply a log transform to must contain only
positive values; otherwise you will get an error. You can add 1 to your data
before transforming it, which keeps the argument of the logarithm positive:
log(x + 1).
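A minimal sketch with a hypothetical skewed column; NumPy's log1p computes log(x + 1) directly:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature containing a zero value.
df = pd.DataFrame({'income': [0, 20_000, 25_000, 30_000, 45_000, 1_200_000]})

# log(x + 1) keeps the argument positive even when x == 0.
df['log_income'] = np.log1p(df['income'])
print(df)
```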
One-Hot Encoding
• One of the most common encoding methods in machine learning. This method
spreads the values in a column across multiple flag columns and assigns 0 or
1 to them. These binary values express the relationship between the original
(grouped) column and the encoded flag columns.
• This method converts categorical data, which is challenging for algorithms
to understand, into a numerical format, and enables you to group your
categorical data without losing any information.
Example
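A minimal sketch using pandas get_dummies, with a hypothetical city column:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Rome', 'Madrid', 'Madrid', 'Istanbul', 'Rome']})

# Spread the single categorical column into one binary flag column per value.
encoded = pd.get_dummies(df, columns=['city'])
print(encoded)
```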
Grouping Operations
GroupBy: Split, Apply, Combine:
Aggregating conditionally on some label or index is implemented in the
so-called groupby operation. The name "group by" comes from a command in
the SQL database language, but it is perhaps more illuminating to think of it
in the terms first coined by Hadley Wickham of Rstats fame: split, apply,
combine.
Example (Split, Apply, and Combine)
What GroupBy accomplishes
• This makes clear what the groupby accomplishes:
• The split step involves breaking up and grouping a DataFrame depending on
the value of the specified key.
• The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output
array.
• While this could certainly be done manually using some combination of the
masking, aggregation, and merging commands covered earlier, an important
realization is that the intermediate splits do not need to be explicitly
instantiated. Rather, the GroupBy can (often) do this in a single pass over the
data, updating the sum, mean, count, min, or other aggregate for each group
along the way. The power of the GroupBy is that it abstracts away these
steps: the user need not think about how the computation is done under the
hood, but rather thinks about the operation as a whole.
Example
The most basic split-apply-combine operation can be computed with
the groupby() method of DataFrames, passing the name of the desired key
column:
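A minimal sketch of this basic operation (the DataFrame here is a hypothetical example; the linked notebook covers it in more detail):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split on 'key', apply a sum within each group, combine the results.
print(df.groupby('key').sum())
```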
https://colab.research.google.com/drive/18YkdZsK42-lqePHm519ZSJRaG22s4JeL#scrollTo=SDTSESS0lXWr