A Short Guide For Feature Engineering and Feature Selection
0. Basic Concepts
Machine Learning is a technique of data science that helps computers learn from existing data in
order to forecast future behaviors, outcomes and trends - Microsoft
The field of Machine Learning seeks to answer the question “How can we build computer systems that
automatically improve with experience, and what are the fundamental laws that govern all learning
processes?“ - Carnegie Mellon University
Narrowly speaking, in the data mining context, machine learning (ML) is the process of letting computers
learn from historical data, recognize patterns/relationships within the data, and then make predictions.
0.2 Methodology
A typical ML workflow/pipeline looks like this:
(figure: a typical ML workflow/pipeline — img source)
There are many ways to divide the tasks that make up the ML workflow into phases, but the basic steps are
generally similar to those in the figure above.
Anomaly Detection: identify outliers, e.g. fraud detection.
0.4 Terminology
Feature: also known as Attribute/ Independent Variable/ Predictor/ Input Variable. It's an individual
measurable property/characteristic of a phenomenon being observed [wiki]. The age of a person, etc.
Target: also known as Dependent Variable/ Response Variable/ Output Variable. It's the variable being
predicted in supervised learning.
Algorithm: the specific procedure used to implement a particular ML technique. Linear Regression,
etc.
Model: the algorithm applied to a dataset, complete with its settings (its parameters). Y=4.5x+0.8, etc.
We want the model that best captures the relationship between features and the target.
Supervised learning : train the model with labeled data to generate reasonable predictions for the
response to new data.
Unsupervised learning : train the model with un-labeled data to find intrinsic structures/ patterns
within the data.
Reinforcement learning: the model learns from a sequence of actions by maximizing a reward
function, which is shaped by rewarding good actions and/or penalizing bad ones. Self-driving cars, etc.
1. Data Exploration
1.1 Variables
Definition: any measurable property/characteristic of a phenomenon being observed. They are called
'variables' because the value they take may vary (and it usually does) in a population.
Types of Variable
Number of
Variables whose values are either finite or countably
Numerical Discrete children in a
infinite. wiki
family
Note: In reality we may have mixed-type variables for a variety of reasons. For example, in credit scoring,
"Missed payment status" is a common variable that can take the values 1, 2, 3, meaning that the customer has
missed 1-3 payments on their account. It can also take the value D if the customer defaulted on that
account. We may have to convert data types after certain data-cleaning steps.
| Variable type | Aspect | Statistics / plots |
| --- | --- | --- |
| Categorical | Shape | Histogram / Frequency table ... |
| Numerical | Central Tendency | Mean / Median / Mode |
| Numerical | Dispersion | Min / Max / Range / Quantile / IQR / MAD / Variance / Standard Deviation |
| Numerical | Shape | Skewness / Histogram / Boxplot ... |
Below are some pandas methods that give us basic statistics on a variable:
pandas.DataFrame.describe()
pandas.DataFrame.dtypes
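For example (a minimal sketch with a hypothetical toy DataFrame):

```python
import pandas as pd

# Hypothetical toy dataset -- replace with your own data
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [40000, 52000, 61000, 58000, 75000],
    "gender": ["M", "F", "F", "M", "F"],
})

print(df.dtypes)                       # data type of each column
print(df.describe())                   # basic stats for numerical columns
print(df.describe(include="object"))   # counts/uniques for categorical columns
```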
Barplot
Countplot
Boxplot
Distplot
Scatter Plot
Correlation Plot
Heat Map
Scatter Plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for
typically two variables for a set of data. If the pattern of dots slopes from lower left to upper right, it
indicates a positive correlation between the variables being studied. If the pattern of dots slopes from
upper left to lower right, it indicates a negative correlation. [wiki]
Correlation plot can be used to quickly find insights. It is used to investigate the dependence between
multiple variables at the same time and to highlight the most correlated variables in a data table.
Heat map (or heatmap) is a graphical representation of data where the individual values contained in a
matrix are represented as colors.
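A minimal sketch of these plots with seaborn/matplotlib, reusing the hypothetical `df` from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x="gender", data=df)             # countplot of a categorical variable
plt.show()

sns.boxplot(x="gender", y="income", data=df)   # boxplot of a numerical variable per category
plt.show()

sns.scatterplot(x="age", y="income", data=df)  # scatter plot of two numerical variables
plt.show()

# correlation plot / heat map of pairwise correlations between numerical variables
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```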
2. Feature Cleaning
A study on the impact of missing data on different ML algorithms can be found here.
It is important to understand the mechanisms by which missing fields are introduced in a dataset.
Depending on the mechanism, we may choose to process the missing values differently. The mechanisms
were first introduced by Rubin [2].
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the
observations. When data is MCAR, there is absolutely no relationship between the data missing and any
other values, observed or missing, within the dataset. In other words, those missing data points are a
random subset of the data. There is nothing systematic going on that makes some data more likely to be
missing than others.
If values for observations are missing completely at random, then disregarding those cases would not bias
the inferences made.
Missing at Random
Missing at Random (MAR) occurs when there is a systematic relationship between the propensity of missing
values and the observed data. In other words, the probability of an observation being missing depends only
on available information (other variables in the dataset), but not on the variable itself.
For example, if men are more likely to disclose their weight than women, weight is MAR (on variable
gender). The weight information will be missing at random for those men and women that decided not to
disclose their weight, but as men are more prone to disclose it, there will be more missing values for women
than for men.
In a situation like the above, if we decide to proceed with the variable with missing values, we might benefit
from including gender to control the bias in weight for the missing observations.
Missing Not at Random
Missingness depends on information that has not been recorded, and this information also predicts the
missing values. E.g., if a particular treatment causes discomfort, a patient is more likely to drop out of the
study (and 'discomfort' is not measured).
Missingness may also depend on the (potentially missing) variable itself. E.g., people with higher earnings are
less likely to reveal them.
By business understanding. In many situations we can assume the mechanism by probing into the
business logic behind that variable.
By statistical test. Divide the dataset into observations with and without the missing value and perform a
t-test to see whether there are significant differences in other variables; if there are, we can assume that the
data are not missing completely at random.
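A minimal sketch of that test with scipy, assuming a DataFrame `df` where `weight` may be missing and `age` is another observed variable (both names hypothetical):

```python
from scipy import stats

# Split observations by whether `weight` is missing, then compare
# the distribution of another variable (`age`) across the two groups.
missing_mask = df["weight"].isnull()
group_missing = df.loc[missing_mask, "age"].dropna()
group_observed = df.loc[~missing_mask, "age"].dropna()

t_stat, p_value = stats.ttest_ind(group_missing, group_observed, equal_var=False)
if p_value < 0.05:
    print("Significant difference -> missingness is likely not completely at random")
else:
    print("No significant difference detected on this variable")
```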
But we should keep in mind that we can hardly be 100% sure whether data are MCAR, MAR, or MNAR, because
unobserved predictors (lurking variables) are, by definition, unobserved.
| Method | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Mean/Median/Mode Imputation | replace the NA with the mean/median/most frequent value (for categorical features) of that variable | good practice if MCAR | 1. distorts the distribution 2. distorts the relationship with other variables |
| End of distribution Imputation | replace the NA with values at the far end of the distribution of that variable, calculated as mean + 3*std | captures the importance of missingness if there is one | 1. distorts the distribution 2. may be considered an outlier if NAs are few, or mask true outliers if NAs are many 3. if missingness is not important, this may mask the predictive power of the original variable |
| Arbitrary Value Imputation | replace the NA with an arbitrary value | captures the importance of missingness if there is one | 1. distorts the distribution 2. typically used values are -9999/9999, but be aware they may be regarded as outliers |
| Add a variable to denote NA | create an additional variable indicating whether the data was missing for that observation | captures the importance of missingness if there is one | expands the feature space |
In real settings, when it is hard to determine the missing mechanism or there is little time to study each
variable with missing values in depth, a popular approach is to combine one of the simple imputation methods
above with an additional variable denoting missingness.
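A minimal sketch of that combination with scikit-learn (column names are hypothetical):

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = df[["age", "income"]].to_numpy()

# Median imputation for the numerical columns
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# Additional 0/1 columns denoting where values were missing
indicator = MissingIndicator(features="all")
X_missing_flags = indicator.fit_transform(X).astype(int)

X_new = np.hstack([X_imputed, X_missing_flags])
```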
Note: some algorithms, such as XGBoost, incorporate missing-data handling into the model-building process, so
you may not need this step. However, it is important to understand how the algorithm treats missing values
and to be able to explain it to the business team.
2.2 Outliers
Definition: An outlier is an observation which deviates so much from the other observations as to arouse
suspicions that it was generated by a different mechanism. [3]
Note: outliers, depending on the context, either deserve special attention or should be completely ignored.
For example, an unusual transaction on a credit card is usually a sign of fraudulent activity, while a recorded
height of 1600cm for a person is very likely a measurement error and should be filtered out or imputed with
something else.
Some algorithms are very sensitive to outliers. For example, AdaBoost may treat outliers as "hard" cases
and put tremendous weight on them, therefore producing a model with poor generalization. Any
algorithm that relies on means/variances is sensitive to outliers, as those statistics are greatly influenced by
extreme values.
On the other hand, some algorithms are more robust to outliers. For example, decision trees tend to ignore
the presence of outliers when creating their branches. Typically, trees make splits by asking whether
variable x >= value t, so an outlier simply falls on one side of the split and is treated the same as the
remaining values, regardless of its magnitude.
In fact, outlier analysis and anomaly detection is a huge field of research. Charu Aggarwal's book "Outlier
Analysis" [4] offers great insight into the topic. PyOD [5] is a comprehensive Python toolkit which contains
many of the advanced methods in this field.
All the methods here listed are for univariate outlier detection. Multivariate outlier detection is beyond the
scope of this guide.
| Method | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Detect by arbitrary boundary | identify outliers based on arbitrary boundaries | flexible | requires business understanding |
| Mean & Standard Deviation method [6],[7] | outlier detection by the Mean & Standard Deviation method | good for variables with a Gaussian distribution (68-95-99 rule) | sensitive to the extreme values themselves (the outliers inflate the standard deviation) |
| MAD method [6],[7] | outlier detection by the Median and Median Absolute Deviation method | more robust than the Mean & SD method; resilient to extremes | can be too aggressive |
However, beyond these methods, it's more important to keep in mind that the business context should
govern how you define and react to these outliers. The meanings of your findings should be dictated by the
underlying context, rather than the number itself.
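A minimal sketch of the mean ± k·std rule and the MAD (modified z-score) rule on a single variable; the cut-offs k=3 and 3.5 and the 0.6745 constant follow common conventions rather than anything prescribed in this guide:

```python
import pandas as pd

def outliers_mean_std(x: pd.Series, k: float = 3.0) -> pd.Series:
    """Flag values further than k standard deviations from the mean."""
    mean, std = x.mean(), x.std()
    return (x < mean - k * std) | (x > mean + k * std)

def outliers_mad(x: pd.Series, k: float = 3.5) -> pd.Series:
    """Flag values whose modified z-score (based on MAD) exceeds k."""
    median = x.median()
    mad = (x - median).abs().median()
    modified_z = 0.6745 * (x - median) / mad
    return modified_z.abs() > k

x = pd.Series([10, 12, 11, 13, 12, 11, 95])   # toy data with one obvious outlier

# The mean/std rule misses 95 here because the outlier itself inflates the std
# (the weakness noted in the table above), while the MAD rule flags it.
print(x[outliers_mean_std(x)])
print(x[outliers_mad(x)])
```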
| Method | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Mean/Median/Mode Imputation | replace the outlier with the mean/median/most frequent value of that variable | preserves the distribution | loses the information of the outlier if there is one |
| Discretization | transform continuous variables into discrete variables | minimizes the impact from outliers | loses the information of the outlier if there is one |
| Discard outliers | drop all the observations that are outliers | / | loses the information of the outlier if there is one |
There are many strategies for dealing with outliers, and depending on the context and data set, any of them
could be the right or the wrong choice. It is important to investigate the nature of the outlier before deciding.
Note: in some situations rare values, like outliers, may contain valuable information about the dataset and
therefore deserve particular attention. For example, a rare value in a transaction variable may indicate fraud.
Rare values in categorical variables tend to cause over-fitting, particularly in tree based methods.
A big number of infrequent labels adds noise, with little information, therefore causing over-fitting.
Rare labels may be present in training set, but not in test set, therefore causing over-fitting to the train
set.
Rare labels may appear in the test set but not in the train set; thus, the model will not know how to
handle them.
| Method | Definition |
| --- | --- |
| Grouping into one new category | group the observations that show rare labels into a unique category |
When there is one predominant category (over 90%) in the variable: observe the relationship
between that variable and the target, then either discard the variable or keep it as it is. In this case, the
variable is often not useful for prediction, as it is quasi-constant (as we will see later in the Feature Selection
part).
When there is a small number of categories: keep the variable as it is, because a few categories are
unlikely to introduce much noise.
When there is high cardinality: try the two methods above, though neither guarantees better results than
the original variable.
Variables with too many labels tend to dominate over those with only a few labels, particularly in tree
based algorithms.
A big number of labels within a variable may introduce noise with little if any information, therefore
making the machine learning models prone to over-fit.
Some of the labels may only be present in the training data set, but not in the test set, therefore
causing algorithms to over-fit the training set.
Conversely, new labels may appear in the test set that were not present in the training set, therefore
leaving the algorithm unable to perform a calculation over the new observation.
All these methods attempt to group some of the labels and reduce cardinality. Grouping labels with a decision
tree is equivalent to the method introduced in section 3.2.2, Discretization with decision trees, which aims to
merge labels into more homogeneous groups. Grouping labels with rare occurrences into one category is
equivalent to the method in section 2.3.2.
3. Feature Engineering
If the range of the inputs varies, in some algorithms the objective functions will not work properly.
Gradient descent converges much faster with feature scaling done. Gradient descent is a common
optimization algorithm used in logistic regression, SVMs, neural networks etc.
Algorithms that involve distance calculation like KNN, Clustering are also affected by the magnitude
of the feature. Just consider how Euclidean distance is calculated: taking the square root of the sum of
the squared differences between observations. This distance can be greatly affected by differences in
scale among the variables. Variables with large variances have a larger effect on this measure than
variables with small variances.
Note: tree-based algorithms are almost the only algorithms unaffected by the magnitude of the input, as we
can easily see from how trees are built. When deciding how to make a split, a tree algorithm looks for
decisions like "is feature value X > 3.0" and computes the purity of the child nodes after the split, so the
scale of the feature does not matter.
| Method | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Min-Max scaling | transforms features by scaling each feature to a given range, default [0,1]: X_scaled = (X - X.min) / (X.max - X.min) | / | compresses the observations into a narrow range if the variable is skewed or has outliers, which impairs the predictive power |
As we can see, Normalization - Standardization and Min-Max method will compress most data to a narrow
range, while robust scaler does a better job at keeping the spread of the data, although it cannot remove
the outlier from the processed result. Remember removing/imputing outliers is another topic in data
cleaning and should be done beforehand.
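A minimal sketch comparing the three scikit-learn scalers on a toy variable that is skewed and contains an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # skewed data with an outlier

print(MinMaxScaler().fit_transform(X).ravel())     # most values squeezed near 0
print(StandardScaler().fit_transform(X).ravel())   # mean 0, unit variance, still squeezed
print(RobustScaler().fit_transform(X).ravel())     # centered on the median, scaled by the IQR
```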
If your feature is not Gaussian like, say, has a skewed distribution or has outliers, Normalization -
Standardization is not a good choice as it will compress most data to a narrow range.
However, we can transform the feature into Gaussian like and then use Normalization -
Standardization. Feature transformation will be discussed in section 3.4
When performing distance or covariance calculation (algorithm like Clustering, PCA and LDA), it is
better to use Normalization - Standardization as it will remove the effect of scales on variance and
covariance. Explanation here.
Min-Max scaling has the same drawbacks as Normalization - Standardization; in addition, new data may
not be bounded to [0,1], since it can fall outside the original range. Some algorithms, for example some
deep learning networks, prefer input on a 0-1 scale, in which case Min-Max scaling is a good choice.
A comparison of the three methods when facing skewed variables can be found here.
An in-depth study of feature scaling can be found here.
3.2 Discretize
Definition: Discretization is the process of transforming continuous variables into discrete variables by
creating a set of contiguous intervals that spans the range of the variable's values.
help to improve model performance by grouping values with similar predictive strength
introduce non-linearity and thus improve the fitting power of the model
enhance interpretability with grouped values
minimize the impact of extreme values/seldom-seen reversal patterns
prevent over-fitting that can occur with raw numerical variables
allow feature interaction between continuous variables (section 3.5.5)
| Method | Definition | Pros | Cons |
| --- | --- | --- | --- |
| Equal width binning | divides the range of possible values into N bins of the same width | / | sensitive to skewed distributions |
| Equal frequency binning | divides the range of possible values into N bins, where each bin carries the same number of observations | may help boost the algorithm's performance | this arbitrary binning may disrupt the relationship with the target |
| K-means binning | uses k-means to partition values into clusters | / | needs hyper-parameter tuning |
| Discretization using decision trees | uses a decision tree to identify the optimal splitting points that determine the bins | observations within each bin are more similar to each other than to those in other bins | 1. may cause over-fitting 2. may not get a well-performing tree |
In general there's no best choice of discretization method. It really depends on the dataset and the following
learning algorithm. Study carefully about your features and context before deciding. You can also try
different methods and compare the model performance.
Some literature reviews on feature discretization can be found here1, here2, here3.
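A minimal sketch of equal-width, equal-frequency and tree-based binning, assuming a hypothetical DataFrame `df` with a continuous `age` column and a binary `target`; the bin counts and tree depth are illustrative:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

x = df["age"]        # continuous feature (hypothetical)
y = df["target"]     # binary target (hypothetical)

# Equal-width binning: 5 bins of the same width
df["age_eq_width"] = pd.cut(x, bins=5, labels=False)

# Equal-frequency binning: 5 bins with (roughly) the same number of observations
df["age_eq_freq"] = pd.qcut(x, q=5, labels=False, duplicates="drop")

# Tree-based binning: fit a shallow tree on the single feature, then use its
# predicted probability as the discretized value (one value per leaf)
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(x.to_frame(), y)
df["age_tree_bin"] = tree.predict_proba(x.to_frame())[:, 1]
```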
We must transform the strings of categorical variables into numbers so that algorithms can handle them.
Even if an algorithm appears to accept categorical inputs, it most likely incorporates an encoding step
internally.
| Method | Definition | Pros | Cons |
| --- | --- | --- | --- |
| One-hot encoding | replace the categorical variable by a set of boolean variables (0/1) indicating whether each label is true for that observation | keeps all information of that variable | 1. expands the feature space dramatically if there are many labels in the variable 2. does not add additional value to make the variable more predictive |
| Count/frequency encoding | replace each label of the categorical variable by its count/frequency within that variable | / | 1. may yield the same encoding for two different labels (if they appear the same number of times) and lose valuable information 2. may not add predictive power |
| Mean encoding | replace the label by the mean of the target for that label (the target must be 0/1 valued or continuous) | 1. captures information within the label, rendering more predictive features 2. creates a monotonic relationship between the variable and the target 3. does not expand the feature space | prone to cause over-fitting |
| WOE encoding [9] | replace the label with the Weight of Evidence of that label. WOE is computed from the basic odds ratio: ln((Proportion of Good Outcomes) / (Proportion of Bad Outcomes)) | 1. establishes a monotonic relationship with the dependent variable 2. orders the categories on a "logistic" scale, which is natural for logistic regression 3. the transformed variables can be compared because they are on the same scale, so it is possible to determine which one is more predictive | 1. may incur loss of information (variation) due to binning into few categories 2. prone to cause over-fitting |
| Target encoding [10] | similar to mean encoding, but uses both the posterior probability and the prior probability of the target | 1. captures information within the label, rendering more predictive features 2. creates a monotonic relationship between the variable and the target 3. does not expand the feature space | prone to cause over-fitting |
Note: if we use one-hot encoding in linear regression, we should keep k-1 binary variables to avoid
multicollinearity. This is true for any algorithm that looks at all features at the same time during training,
including SVMs, neural networks and clustering. Tree-based algorithms, on the other hand, need the entire set
of binary variables to select the best split.
Note: one-hot encoding is not recommended with tree algorithms. One-hot encoding causes the splits to be
highly imbalanced (each label of the original categorical feature becomes its own feature), and as a result
neither of the two child nodes gains much purity. The predictive power of the one-hot features will be weaker
than that of the original feature, as the information has been broken into many pieces.
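A minimal sketch of one-hot, count and mean encoding with pandas, assuming a hypothetical DataFrame `df` with a categorical `city` column and a binary `target`:

```python
import pandas as pd

# One-hot encoding; drop_first=True keeps k-1 dummies to avoid multicollinearity
# in linear models (keep all k dummies for tree-based models)
one_hot = pd.get_dummies(df["city"], prefix="city", drop_first=True)

# Count/frequency encoding
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)

# Mean (target) encoding -- compute the mapping on the training set only,
# to limit over-fitting and data leakage
target_means = df.groupby("city")["target"].mean()
df["city_mean_enc"] = df["city"].map(target_means)
```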
Regression
Linear regression is a straightforward approach for predicting a quantitative response Y on the basis of one or
more predictor variables X1, X2, ..., Xn. It assumes that there is a linear relationship between the X(s) and Y.
Mathematically, we can write this linear relationship as Y ≈ β0 + β1X1 + β2X2 + ... + βnXn.
Classification
Similarly, for classification, Logistic Regression assumes a linear relationship between the variables and the
log of the odds.
If the machine learning model assumes a linear dependency between the predictors Xs and the outcome Y,
when there is not such a linear relationship, the model will have a poor performance. In such cases, we are
better off trying another machine learning model that does not make such an assumption.
If there is no linear relationship and we have to use the linear/logistic regression models, mathematical
transformation/discretization may help create the relationship, though it cannot guarantee a better result.
Linear Regression has the following assumptions over the predictor variables X:
Normality assumption means that every variable X should follow a Gaussian distribution.
Homoscedasticity, also known as homogeneity of variance, describes a situation in which the error term
(that is, the “noise” or random disturbance in the relationship between the independent variables (Xs) and
the dependent variable (Y)) is the same across all values of the independent variables.
The remaining machine learning models, including Neural Networks, Support Vector Machines, Tree based
methods and PCA do not make any assumption over the distribution of the independent variables.
However, in many occasions the model performance may benefit from a "Gaussian-like" distribution.
Why may models benefit from a "Gaussian-like" distribution? In variables with a normal distribution, the
observations of X available to predict Y vary across a greater range of values, that is, the values of X are
"spread" over a greater range.
In the situations above, transformation of the original variable can help give the variable more of a bell-
shape of the Gaussian distribution.
Log transformation is useful when applied to skewed distributions as they tend to expand the values
which fall in the range of lower magnitudes and tend to compress or reduce the values which fall in the
range of higher magnitudes, which helps to make the skewed distribution as normal-like as possible.
Square root transformation does a similar thing in this sense.
Box-Cox transformation in sklearn [13] is another popular function belonging to the power-transform family.
It has the prerequisite that the values to be transformed must be positive (similar to what the log transform
expects); if they are negative, shifting by a constant value helps. Mathematically, the Box-Cox transform is
x(λ) = (x^λ - 1) / λ for λ ≠ 0, and ln(x) for λ = 0, where λ is chosen to make the transformed variable as
Gaussian-like as possible.
Quantile transformation in sklearn [14] transforms the features to follow a uniform or a normal
distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent
values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
However, this transform is non-linear. It may distort linear correlations between variables measured at the
same scale but renders variables measured at different scales more directly comparable.
We can use Q-Q plot to check if the variable is normally distributed (a 45 degree straight line of the values
over the theoretical quantiles) after transformation.
An example showing the effect of sklearn's Box-Cox/Yeo-Johnson/Quantile transforms in mapping data from
various distributions to a normal distribution can be found at the img source.
On “small” datasets (less than a few hundred points), the quantile transformer is prone to overfitting. The
use of the power transform is then recommended.
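A minimal sketch of these transforms and a Q-Q plot check with scikit-learn and scipy (toy log-normal data):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))   # skewed, strictly positive data

# Box-Cox requires strictly positive values; Yeo-Johnson also handles zero/negatives
X_boxcox = PowerTransformer(method="box-cox").fit_transform(X)
X_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_quantile = QuantileTransformer(output_distribution="normal", random_state=0).fit_transform(X)

# Q-Q plot: points close to the 45-degree line indicate a Gaussian-like variable
stats.probplot(X_boxcox.ravel(), dist="norm", plot=plt)
plt.show()
```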
As mentioned in section 2.1, we can create a new binary feature (0/1) denoting whether an observation has a
missing value in the raw feature.
Creating new features by performing simple statistical calculations on the raw features, including:
count/sum
average/median/mode
max/min/stddev/variance/range/IQR/Coefficient of Variation
time span/interval
Take call log for example, we can create new features like: number of calls, number of call-in/call-out,
average calling duration, monthly average calling duration, max calling duration, etc.
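A minimal sketch of such aggregations, assuming a hypothetical call-log DataFrame `calls` with columns `user_id`, `direction` and `duration`:

```python
import pandas as pd

# Aggregate the raw call log to one row per user
agg = calls.groupby("user_id").agg(
    n_calls=("duration", "count"),
    n_call_out=("direction", lambda s: (s == "out").sum()),
    avg_duration=("duration", "mean"),
    max_duration=("duration", "max"),
    std_duration=("duration", "std"),
)
```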
After having some simple statistical derived features, we can have them crossed together. Common
dimensions used for crossing include:
time
region
business types
Still take call log for example, we can have crossed features like: number of calls during night times/day
times, number of calls under different business types (banks/taxi services/travelling/hospitalities), number
of calls during the past 3 months, etc. Many of the statistical calculations mentioned in section 3.5.2 can be
used again to create more features.
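Continuing the hypothetical call log (now assumed to also have `timestamp` and `business_type` columns), a minimal sketch of crossing the call counts with time-of-day and business-type dimensions:

```python
import pandas as pd

calls["hour"] = pd.to_datetime(calls["timestamp"]).dt.hour
calls["is_night"] = calls["hour"].between(22, 23) | calls["hour"].between(0, 6)

# Number of calls crossed with the time-of-day dimension
night_counts = (calls.groupby(["user_id", "is_night"])["duration"]
                     .count().unstack(fill_value=0))

# Number of calls crossed with the business-type dimension
type_counts = (calls.groupby(["user_id", "business_type"])["duration"]
                    .count().unstack(fill_value=0))
```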
Note: An open-source python framework named Featuretools that helps automatically generate such
features can be found here.
Personally I haven't used it in practice. You may try and discover if it can be of industry usage.
Ratios and proportions are common techniques. For example, in order to predict the future credit card sales
performance of a branch, ratios like credit card sales / number of sales people, or credit card sales /
marketing spend, would be more powerful than the absolute number of cards sold in the branch.
3.5.5 Cross Products between Categorical Features
Consider a categorical feature A, with two possible values {A1, A2}. Let B be a feature with possibilities {B1,
B2}. Then, a feature-cross between A & B would take one of the following values: {(A1, B1), (A1, B2), (A2, B1),
(A2, B2)}. You can basically give these ‘combinations’ any names you like. Just remember that every
combination denotes a synergy between the information contained by the corresponding values of A and B.
This is an extremely useful technique when certain features together denote a property better than they do
individually. Mathematically speaking, you are taking a cross product between all possible values of the
categorical features. The concept is similar to the feature crossing of section 3.5.3, but this one
particularly refers to the crossing of two categorical features.
The cross product can also be applied to numerical features, which results in a new interaction feature
between A and B. This can be done easily with sklearn's PolynomialFeatures, which generates a new feature
set consisting of all polynomial combinations of the features with degree less than or equal to the specified
degree. For example, with degree 2, three raw features {X1, X2, X3} generate the feature set
{1, X1, X2, X3, X1^2, X2^2, X3^2, X1X2, X1X3, X2X3}; with interaction_only=True, only
{1, X1, X2, X3, X1X2, X1X3, X2X3} are kept.
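A minimal sketch with scikit-learn's PolynomialFeatures:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0]])   # one sample with features X1, X2, X3

# degree=2 with interaction_only=True keeps only the pairwise interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=True)
X_cross = poly.fit_transform(X)
print(poly.get_feature_names_out())   # ['1', 'x0', 'x1', 'x2', 'x0 x1', 'x0 x2', 'x1 x2']

# without interaction_only, degree=2 also adds the squared terms x0^2, x1^2, x2^2
```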
In tree-based algorithms, each sample is assigned to a particular leaf node. The decision path to each
node can be seen as a new non-linear feature, and we can create N new binary features, where N equals the
total number of leaf nodes in the tree or tree ensemble. These features can then be fed into other
algorithms such as logistic regression.
The idea of using tree algorithms to generate new features was first introduced by Facebook in this paper.
The good thing about this method is that we get complex combinations of several features that are
informative (as they are constructed by the tree's learning algorithm). This saves us much time compared
to doing feature crossing manually, and it is widely used in the CTR (click-through rate) prediction of the
online advertising industry.
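A minimal sketch of this idea with scikit-learn (synthetic data; the estimator and its parameters are illustrative, not the exact setup from the paper): the gradient-boosted trees produce leaf indices via `apply()`, which are one-hot encoded and fed to a logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the gradient boosted trees
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

# apply() returns, for each sample, the index of the leaf it lands in for every tree
leaves_train = gbdt.apply(X_train)[:, :, 0]   # shape (n_samples, n_estimators)
leaves_test = gbdt.apply(X_test)[:, :, 0]

# One-hot encode the leaf indices and feed them to a linear model
encoder = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.fit_transform(leaves_train), y_train)
print(lr.score(encoder.transform(leaves_test), y_test))
```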
As we can see from the above, manual feature generation takes a lot of effort and may not guarantee good
returns, particularly when we have huge numbers of features to work with. Feature learning with trees can be
seen as an early attempt at creating features automatically, and as deep learning methods came into
fashion from around 2016, they have also achieved some success in this area, for example with autoencoders
and restricted Boltzmann machines. They have been shown to learn, automatically and in an unsupervised or
semi-supervised way, abstract representations of features (a compressed form) that in turn have
supported state-of-the-art results in domains such as speech recognition, image classification, object
recognition and other areas. However, such features have limited interpretability, and deep learning requires
much more data to extract high-quality results.
4. Feature Selection
Definition: Feature Selection is the process of selecting a subset of relevant features for use in machine
learning model building.
It is not always true that more features lead to better results. Including irrelevant features (ones that are
simply unhelpful to the prediction) and redundant features (irrelevant in the presence of others) only
overwhelms the learning process and makes over-fitting more likely.
We should keep in mind that different feature subsets render optimal performance for different algorithms,
so feature selection is not a process separate from the training of the machine learning model. Therefore, if
we are selecting features for a linear model, it is better to use selection procedures targeted at those models,
like importance by regression coefficients or Lasso; and if we are selecting features for trees, it is better to
use tree-derived importance.
Univariate filters evaluate and rank a single feature according to a certain criterion, while multivariate filters
evaluate the entire feature space. Filter methods rank features independently of any machine learning
algorithm and are computationally cheap; as a result, they are suited for a first-step quick screen and removal
of irrelevant features.
| Method | Definition |
| --- | --- |
| Variance | remove features that show the same value for the majority/all of the observations (constant/quasi-constant features) |
| Correlation | remove features that are highly correlated with each other |
| Chi-Square | compute chi-squared stats between each non-negative feature and the class |
| Mutual Information | measures how much information the presence/absence of a feature contributes to making the correct prediction on Y |
| Univariate ROC-AUC or MSE | builds one decision tree per feature to predict the target, then makes predictions and ranks the features according to the machine learning metric (roc-auc or mse) |
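A minimal sketch of some of these filters, assuming a pandas DataFrame `X` (with non-negative values where chi-square is used) and a target `y`; the thresholds and k are illustrative:

```python
import numpy as np
from sklearn.feature_selection import (SelectKBest, VarianceThreshold, chi2,
                                       mutual_info_classif)

# Variance filter: drop constant / quasi-constant features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Correlation filter: drop one of each pair of highly correlated features
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_corr = X.drop(columns=to_drop)

# Chi-square filter (non-negative features only)
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X, y)

# Mutual information scores, one per feature
mi_scores = mutual_info_classif(X, y)
```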
WOE encoding (see section 3.3.2) and IV (Information Value) often go hand in hand in scorecard development.
Both concepts derive from logistic regression and are standard practice in the credit card industry. IV is a
popular and widely used measure, as there are convenient rules of thumb for variable selection based on IV
(see the sketch below).
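A minimal sketch of computing WOE and IV per category, following the standard credit-scoring convention (WOE = ln of the good/bad distribution ratio, IV = Σ(%good − %bad)·WOE); the rule-of-thumb thresholds in the comment are the commonly cited convention, not taken from this guide:

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series, eps: float = 1e-6):
    """Compute WOE per category and the total IV, with target 1 = bad, 0 = good."""
    grouped = (pd.DataFrame({"x": feature, "y": target})
                 .groupby("x")["y"].agg(bad="sum", total="count"))
    grouped["good"] = grouped["total"] - grouped["bad"]

    dist_good = grouped["good"] / grouped["good"].sum()
    dist_bad = grouped["bad"] / grouped["bad"].sum()

    woe = np.log((dist_good + eps) / (dist_bad + eps))   # ln(%good / %bad)
    iv = ((dist_good - dist_bad) * woe).sum()
    return woe, iv

# Commonly cited rule of thumb for IV (credit-scoring convention):
#   < 0.02 not useful, 0.02-0.1 weak, 0.1-0.3 medium, 0.3-0.5 strong,
#   > 0.5 suspicious (too good to be true, check for leakage)
woe, iv = woe_iv(df["city"], df["target"])   # `df`, `city`, `target` are hypothetical
```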
However, all these filtering methods fail to consider interactions between features and may reduce our
predictive power. Personally I only use variance and correlation to filter out some absolutely unnecessary
features.
Note: one thing to keep in mind when using the chi-square test or univariate selection methods is that in very
big datasets most of the features will show a small p-value and therefore look highly predictive. This is in
fact an effect of the sample size, so care should be taken when selecting features with these procedures. An
ultra-tiny p-value does not highlight an ultra-important feature; rather, it reflects the large number of samples
in the dataset.
Note: correlated features do not necessarily affect model performance (trees, etc.), but high dimensionality
does, and too many features hurt model interpretability. So it is always better to reduce correlated features.
The most common group of search strategies is sequential search, including forward selection, backward
elimination and exhaustive search. Randomized search is another popular choice, including evolutionary
computation algorithms (such as genetic algorithms) and simulated annealing.
Another key element in wrappers is the stopping criterion: when do we stop the search? In general there are
three:
performance increases
performance decreases
a predefined number of features is reached
Step forward feature selection starts by evaluating all features individually and selects the one that
generates the best performing algorithm, according to a pre-set evaluation criteria. In the second step, it
evaluates all possible combinations of the selected feature and a second feature, and selects the pair that
produce the best performing algorithm based on the same pre-set criteria.
The pre-set criteria can be the roc_auc for classification and the r squared for regression for example.
This selection procedure is called greedy because, step by step, it evaluates a large number of single, double,
triple and larger feature combinations. It is therefore quite computationally expensive and sometimes, if the
feature space is big, even unfeasible.
There is a special package for python that implements this type of feature selection: mlxtend.
Step backward feature selection starts by fitting a model using all features. Then it removes one feature: the
one whose removal produces the best-performing model (i.e., the least important feature) according to a
certain evaluation criterion. In the second step, it removes a second feature, again the one whose removal
produces the best-performing model, and it proceeds, removing feature after feature, until a certain criterion
is met.
The pre-set criterion can be, for example, the roc_auc for classification and the r-squared for regression.
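A minimal sketch of both procedures with mlxtend's SequentialFeatureSelector, assuming hypothetical `X_train`/`y_train` (scikit-learn also ships a SequentialFeatureSelector with a `direction` parameter); the estimator, k_features and scoring are illustrative choices:

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestClassifier

# Step forward selection: add one feature at a time until 10 are selected
sfs = SFS(RandomForestClassifier(n_estimators=50, random_state=0),
          k_features=10, forward=True, floating=False,
          scoring="roc_auc", cv=3)
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_names_)

# Step backward selection: start from all features and remove one at a time
sbs = SFS(RandomForestClassifier(n_estimators=50, random_state=0),
          k_features=10, forward=False, floating=False,
          scoring="roc_auc", cv=3)
sbs = sbs.fit(X_train, y_train)
```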
In exhaustive feature selection, the best subset of features is selected over all possible feature subsets by
optimizing a specified performance metric for a given machine learning algorithm. For example, if the
classifier is a logistic regression and the dataset contains 4 features, the algorithm will evaluate all 15
possible feature combinations (every non-empty subset of 1, 2, 3 or 4 features) and select the one that
results in the best performance (e.g., classification accuracy) of the logistic regression classifier.
This exhaustive search is very computationally expensive. In practice, because of this computational cost, it
is rarely used.
TODO
Regularization consists in adding a penalty to the different parameters of the machine learning model to
reduce the freedom of the model. Hence, the model will be less likely to fit the noise of the training data so
less likely to be overfitting.
In linear model regularization, the penalty is applied over the coefficients that multiply each of the
predictors. For linear models there are in general 3 types of regularization:
L1 regularization (Lasso)
L2 regularization (Ridge)
L1/L2 (Elastic net)
Among the different types of regularization, Lasso (L1) has the property that it can shrink some of the
coefficients to exactly zero; the corresponding features can then be removed from the model.
Both for linear and logistic regression we can use Lasso regularization to remove non-important features.
Keep in mind that increasing the penalty will increase the number of features removed, so you need to
monitor that you do not set the penalty so high that it removes even important features, or so low that it
fails to remove non-important ones.
Having said this, if the penalty is too high and important features are removed, you should notice a drop in
the performance of the algorithm and then realize that you need to decrease the regularization.
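A minimal sketch of Lasso/L1-based selection with scikit-learn's SelectFromModel, assuming a hypothetical `X_train` DataFrame and `y_train`; the penalty strengths shown are illustrative:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression

# Classification: L1-penalised logistic regression (smaller C = stronger penalty)
selector = SelectFromModel(
    LogisticRegression(penalty="l1", C=0.5, solver="liblinear"))
selector.fit(X_train, y_train)
selected = X_train.columns[selector.get_support()]

# Regression: Lasso (larger alpha = stronger penalty = fewer features kept)
selector_reg = SelectFromModel(Lasso(alpha=0.01))
selector_reg.fit(X_train, y_train)
```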
Regularization is a large topic. For more information you can refer to:
Least angle and l1 penalised regression: A review
Penalised feature selection and classification in bioinformatics
Feature selection for classification: A review
Machine Learning Explained: Regularization
Random forests are one of the most popular machine learning algorithms. They are so successful because
they provide in general a good predictive performance, low overfitting and easy interpretability. This
interpretability is given by the fact that it is straightforward to derive the importance of each variable on the
tree decision. In other words, it is easy to compute how much each variable is contributing to the decision.
Random forest is a bagging algorithm consisting of a number of base estimators (decision trees), each of
them built on a random extraction of observations from the dataset and a random extraction of the features.
Not every tree sees all the features or all the observations, which guarantees that the trees are de-correlated
and therefore less prone to over-fitting.
Each tree is also a sequence of yes-no questions based on a single or combination of features. At each split,
the question divides the dataset into 2 buckets, each of them hosting observations that are more similar
among themselves and different from the ones in the other bucket. Therefore, the importance of each
feature is derived by how "pure" each of the buckets is.
For classification, the measure of impurity is either the Gini impurity or the information gain/entropy. For
regression the measure of impurity is variance. Therefore, when training a tree, it is possible to compute
how much each feature decreases the impurity. The more a feature decreases the impurity, the more
important the feature is. In random forests, the impurity decrease from each feature can be averaged
across trees to determine the final importance of the variable.
Selecting features by tree-derived feature importance is a very straightforward, fast and generally accurate
way of selecting good features for machine learning, in particular if you are going to build tree-based
methods.
Limitation: correlated features will show similar, and lowered, importance in a tree compared to what their
importance would be if the tree were built without their correlated counterparts.
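A minimal sketch of selecting features by random-forest-derived importance with scikit-learn's SelectFromModel, assuming a hypothetical `X_train` DataFrame and `y_train` target; the "median" threshold is an illustrative choice:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rf = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(rf, threshold="median")   # keep features above the median importance
selector.fit(X_train, y_train)

selected = X_train.columns[selector.get_support()]
importances = selector.estimator_.feature_importances_   # impurity-based importances
```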
Similarly to selecting features using Random Forests derived feature importance, we can select features
based on the importance derived by gradient boosted trees. And we can do that in one go, or in a recursive
manner, depending on how much time we have, how many features are in the dataset, and whether they
are correlated or not.
1. Rank the features according to their importance derived from a machine learning algorithm: it can be
tree importance, or LASSO / Ridge, or the linear / logistic regression coefficients.
2. Remove one feature -the least important- and build a machine learning algorithm utilizing the
remaining features.
3. Calculate a performance metric of your choice: roc-auc, mse, rmse, accuracy.
4. If the metric decreases by more than an arbitrarily set threshold, then that feature is important and
should be kept. Otherwise, we can remove that feature.
5. Repeat steps 2-4 until all features have been removed (and therefore evaluated) and the drop in
performance assessed.
The method combines a selection process like that of wrappers with feature-importance derivation from ML
models as in embedded methods, hence it is called hybrid.
The difference between this method and step backward feature selection is that it does not retrain the
model with each candidate feature removed in order to decide which one to drop. It removes the least
important feature, based on the machine-learning-model-derived importance, and then assesses whether
that feature should be kept or permanently removed. So it removes each feature only once during selection,
whereas step backward feature selection tries removing every remaining feature at each step of the selection.
This method is therefore faster than wrapper methods and generally better than embedded methods. In
practice it works extremely well. It does also account for correlations (depending on how stringent you set
the arbitrary performance-drop threshold). On the downside, the performance drop used to decide
whether a feature should be kept or removed is set arbitrarily: the smaller the threshold, the more features
will be selected, and vice versa.
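A minimal sketch of the recursive-elimination-by-importance procedure described above, assuming hypothetical `X_train`/`X_val` DataFrames and targets; the model, metric and 0.001 tolerance are illustrative, and a fully recursive variant would also re-rank importances after each removal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def recursive_elimination(X_train, y_train, X_val, y_val, tol=0.001):
    """Drop features one by one, least important first, keeping a feature
    whenever removing it makes the validation roc-auc drop by more than tol."""
    features = list(X_train.columns)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[features], y_train)
    base_auc = roc_auc_score(y_val, model.predict_proba(X_val[features])[:, 1])

    # iterate from the least to the most important feature (initial ranking)
    order = np.argsort(model.feature_importances_)
    for feature in [features[i] for i in order]:
        candidate = [f for f in features if f != feature]
        model.fit(X_train[candidate], y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val[candidate])[:, 1])
        if base_auc - auc < tol:          # negligible drop -> remove the feature
            features, base_auc = candidate, auc
    return features
```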
As we talked about in section 4.3.2, Random Forests assign equal or similar importance to features that are
highly correlated. In addition, when features are correlated, the importance assigned is lower than the
importance attributed to the feature itself, should the tree be built without the correlated counterparts.
Therefore, instead of eliminating features based on importance at one time (from all initial features), we
may get a better selection by removing one feature recursively, and recalculating the importance on each
round.
In this situation, when a feature that is highly correlated to another one is removed, then, the importance of
the remaining feature increases. This may lead to a better subset feature space selection. On the downside,
building several random forests is quite time consuming, in particular if the dataset contains a high number
of features.
1. Rank the features according to their importance derived from a machine learning algorithm: it can be
tree importance, or LASSO / Ridge, or the linear / logistic regression coefficients.
2. Build a machine learning model with only 1 feature, the most important one, and calculate the model
metric for performance.
3. Add one feature -the next most important- and build a machine learning algorithm using this feature
plus any features kept from previous rounds.
4. Calculate a performance metric of your choice: roc-auc, mse, rmse, accuracy.
5. If the metric increases by more than an arbitrarily set threshold, then that feature is important and
should be kept. Otherwise, we can remove that feature.
6. Repeat steps 3-5 until all features have been added (and therefore evaluated) and the increase in
performance assessed.
The difference between this method and step forward feature selection is analogous: it does not retrain the
model with every candidate feature added in order to determine which one to add, so it is faster than
wrappers.
5. Data Leakage
This section is a reminder to myself, as I have made huge mistakes by not being aware of the problem.
Data leakage is when information from outside the training dataset is used to create the model
[15]. The result is that you may be creating overly optimistic models that are practically useless and cannot
be used in production. The model shows great results on both your training and testing data, but not
because it truly generalizes well; rather, it is using information from the test data.
While it is well known that one should use cross-validation, or at least set aside a validation set, when training
and evaluating models, people can easily forget to do the same during the feature engineering & selection process.
Keep in mind that the test dataset must not be used in any way to make choices about the model, including
feature engineering & selection.
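A minimal sketch of one way to guard against this, assuming hypothetical `X_train`/`y_train`: keep the preprocessing inside a scikit-learn Pipeline so that its statistics are learned only from the training folds during cross-validation.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The imputer and scaler are fit inside each training fold only,
# so no information from the validation folds (or the test set) leaks into them
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="roc_auc")
```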
Reference
1. http://www.simonqueenborough.info/R/basic/missing-data
2. Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3): 581-592.
3. D. Hawkins. Identification of Outliers, Chapman and Hall , 1980.
4. https://www.springer.com/gp/book/9781461463955
5. https://github.com/yzhao062/pyod
6. https://docs.oracle.com/cd/E40248_01/epm.1112/cb_statistical/frameset.htm?ch07s02s10s01.html
7. https://www.academia.edu/5324493/Detecting_outliers_Do_not_use_standard_deviation_around_the_mean_use_absolute_deviation_around_the_median
8. https://www.purplemath.com/modules/boxwhisk3.htm
9. http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview
10. A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
11. https://www.aaai.org/Papers/AAAI/1992/AAAI92-019.pdf
12. http://onlinestatbook.com/2/transformations/box-cox.html
13. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer
14. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer
15. https://machinelearningmastery.com/data-leakage-machine-learning/