06 - Data Preprocessing
Data preprocessing
Real-world machine learning pipelines
Joaquin Vanschoren
Data transformations
Machine learning models make a lot of assumptions about the data
In reality, these assumptions are often violated
We build pipelines that transform the data before feeding it to the learners
Scaling (or other numeric transformations)
Encoding (convert categorical features into numerical ones)
Automatic feature selection
Feature engineering (e.g. binning, polynomial features,...)
Handling missing data
Handling imbalanced data
Dimensionality reduction (e.g. PCA)
Learned embeddings (e.g. for text)
Seek the best combinations of transformations and learning methods
Often done empirically, using cross-validation
Make sure that there is no data leakage during this process!
Scaling
Use when different numeric features have different scales (different range of values)
Features with much higher values may overpower the others
Goal: bring them all within the same range
Different methods exist
Why do we need scaling?
KNN: Distances depend mainly on the features with larger values
SVMs: (kernelized) dot products are also based on distances
Linear models: Feature scale affects regularization; with scaling, weights have similar scales and are more interpretable
Standard scaling (standardization)
Generally most useful, assumes data is more or less normally distributed
Per feature, subtract the mean value μ, then scale by the standard deviation σ
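A minimal sketch of standard scaling with scikit-learn's StandardScaler (the toy array below is illustrative, not from the slides):

from sklearn.preprocessing import StandardScaler
import numpy as np

X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])   # two features on very different scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)                          # per feature: subtract mean, divide by std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))              # ~0 and ~1 per feature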
Target encoding
Encode category i by the fraction of its samples that have class Y: n_iY / n_i
Blending: gradually decrease the encoding as you get more examples of category i with class Y=0

$$Enc(i) = \frac{1}{1+e^{-(n_i-1)}} \cdot \frac{n_{iY}}{n_i} + \left(1 - \frac{1}{1+e^{-(n_i-1)}}\right) \cdot \frac{n_Y}{n}$$

with n_iY: number of samples with category i and class Y, n_i: number of samples with category i, n_Y: number of samples with class Y, n: total number of samples
Same for regression, using n_iY/n_i: average target value for samples with category i, and n_Y/n: overall target mean
Example (figure): Enc(i) as a function of n when n_iY = 1
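A small sketch of the blended encoding formula above (the helper function and the numbers are illustrative):

import numpy as np

def blended_encoding(n_iY, n_i, n_Y, n):
    # The weight lam grows with the number of samples n_i of category i
    lam = 1.0 / (1.0 + np.exp(-(n_i - 1)))
    return lam * (n_iY / n_i) + (1 - lam) * (n_Y / n)

print(blended_encoding(n_iY=1, n_i=1, n_Y=500, n=1000))     # rare category: 0.75, pulled towards the prior 0.5
print(blended_encoding(n_iY=90, n_i=100, n_Y=500, n=1000))  # frequent category: ~0.9, close to its own ratio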
# Correct cross-validation: the preprocessing is fit on the training folds only (inside the pipeline)
scores = cross_val_score(pipe, X, y)
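For context, pipe could be built as follows; the StandardScaler/Ridge combination and the synthetic data are placeholders, not necessarily what the original notebook uses:

from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
# All preprocessing lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipe, X, y)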
If you want to apply different preprocessors to different columns, use ColumnTransformer
If you want to merge pipelines, you can use FeatureUnion to concatenate columns
# 2 sub-pipelines, one for numeric features, the other for categorical ones
numeric_pipe = make_pipeline(SimpleImputer(), StandardScaler())
categorical_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
# ColumnTransformer applies each preprocessor to its own list of columns
preprocessor = make_column_transformer((StandardScaler(), numeric_features),
                                       (OneHotEncoder(), categorical_features))
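A sketch of FeatureUnion as mentioned above; the PCA + SelectKBest combination and the synthetic data are illustrative, not from the slides:

from sklearn.datasets import make_classification
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Each transformer sees all columns; their outputs are concatenated side by side
union = FeatureUnion([("pca", PCA(n_components=2)), ("kbest", SelectKBest(k=3))])
X_combined = union.fit_transform(X, y)   # shape: (200, 2 + 3)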
Pipeline selection
Unsupervised feature selection
Remove features X_i (columns X_:,i) that are highly correlated (have a high correlation coefficient ρ):

$$\rho(X_i, X_j) = \frac{cov(X_i, X_j)}{\sigma_{X_i}\,\sigma_{X_j}}, \qquad cov(X_i, X_j) = \frac{1}{N-1}\sum_{n=1}^{N} (x_{n,i} - \bar{X_i})(x_{n,j} - \bar{X_j})$$
Should we remove feel_temp ? Or temp ? Maybe one correlates more with the target?
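A sketch of correlation-based filtering with pandas, mimicking the temp/feel_temp situation (the 0.9 threshold and the toy data are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(20, 5, size=200)
df = pd.DataFrame({"temp": temp,
                   "feel_temp": temp + rng.normal(0, 0.5, size=200),   # nearly a copy of temp
                   "humidity": rng.uniform(0, 1, size=200)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))      # upper triangle, excluding the diagonal
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]   # here: ['feel_temp']
df_reduced = df.drop(columns=to_drop)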
Supervised feature selection: overview
Univariate: F-test and Mutual Information
Model-based: Random Forests, Linear models, kNN
Wrapper methods (black-box search)
Permutation importance
Univariate statistics (F-test)
Consider each feature individually (univariate), independent of the model that you aim to apply
Use a statistical test: is there a statistically significant (linear) relationship between the feature and the target?
Use the F-statistic (or the corresponding p-value) to rank all features, then select features using a threshold
Best k, best k%, probability of removing useful features (FPR), ...
Cannot detect correlations (e.g. temp and feel_temp) or interactions (e.g. binary features)
F-statistic
For regression: does feature X_i correlate (positively or negatively) with the target y?

$$\text{F-statistic} = \frac{\rho(X_i, y)^2}{1 - \rho(X_i, y)^2} \cdot (N - 1)$$
For classification: do the class means of X_i differ more than expected given the spread within each class (ANOVA)?

$$\text{F-statistic} = \frac{\text{between-class variance}}{\text{within-class variance}} = \frac{var(\bar{X_i})}{\overline{var(X_i)}}$$

with var(\bar{X_i}): variance of the per-class means, \overline{var(X_i)}: average of the per-class variances
Mutual information
Measures how much information X_i gives about the target Y. In terms of entropy H:

$$MI(X_i, Y) = H(X_i) + H(Y) - H(X_i, Y)$$

Idea: estimate H(X) from the average distance between a data point and its k nearest neighbors
Captures complex dependencies (e.g. hour, month), but requires more samples to be accurate
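A sketch contrasting the two univariate scores on a purely nonlinear dependency (synthetic data, not from the slides): the F-test misses the quadratic relation, mutual information picks it up.

import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 1000)                 # related to y, but only quadratically
x2 = rng.uniform(-1, 1, 1000)                 # pure noise
y = x1 ** 2 + 0.05 * rng.normal(size=1000)
X = np.column_stack([x1, x2])

f_values, _ = f_regression(X, y)                            # low F-score for x1: no linear correlation
mi_values = mutual_info_regression(X, y, random_state=0)    # clearly positive MI for x1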
Model-based Feature Selection
Use a tuned(!) supervised model to judge the importance of each feature
Linear models (Ridge, Lasso, LinearSVM,...): features with highest weights (coefficients)
Tree-based models: features used in the first splits (high information gain)
Selection model can be different from the one you use for final modelling
Captures interactions: features are more/less informative in combination (e.g. winter, temp)
Random Forests: learn complex interactions (e.g. hour), but are biased towards high-cardinality features
Relief: Model-based selection with kNN
Increase feature weights if x_i and x_k have a different class (near miss), else decrease them:

$$w_i = w_{i-1} + (x_i - \text{nearMiss}_i)^2 - (x_i - \text{nearHit}_i)^2$$
Many variants: ReliefF (uses L1 norm, faster), RReliefF (for regression), ...
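A compact sketch of the basic Relief update rule above, for a binary classification problem with Euclidean nearest neighbors (the function and data are illustrative, not a full ReliefF implementation):

import numpy as np
from sklearn.datasets import make_classification

def relief(X, y, n_iter=100, random_state=0):
    # Basic Relief: reward features that differ on the nearest miss and agree on the nearest hit
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                                   # exclude the point itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))      # nearest neighbor with the same class
        miss = np.argmin(np.where(~same, dists, np.inf))    # nearest neighbor with a different class
        w += (X[i] - X[miss]) ** 2 - (X[i] - X[hit]) ** 2   # per-feature update from the formula above
    return w

X, y = make_classification(n_samples=200, n_features=5, n_informative=2, random_state=0)
weights = relief(X, y)   # higher weight = more relevant feature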
Iterative Model-based Feature Selection
Dropping many features at once is not ideal: feature importance may change in subset
Recursive Feature Elimination (RFE)
Remove the s least important feature(s), recompute the remaining importances, repeat
Univariate:
For regression: f_regression , mutual_info_regression
For classification: f_classif , chi2 , mutual_info_classif
Selecting: SelectKBest , SelectPercentile , SelectFpr ,...
selector = SelectPercentile(score_func=f_regression, percentile=50)
X_selected = selector.fit_transform(X, y)
selected_features = selector.get_support()                          # boolean mask of the selected features
f_values, p_values = f_regression(X, y)                             # F-score and p-value per feature
mi_values = mutual_info_regression(X, y, discrete_features=False)   # all features treated as continuous
Model-based:
SelectFromModel : requires a model and a selection threshold
RFE , RFECV (recursive feature elimination): requires a model and the final number of features (RFECV picks it via cross-validation)
selector = SelectFromModel(RandomForestRegressor(), threshold='mean')
X_selected = selector.fit_transform(X, y)
rfe_selector = RFE(RidgeCV(), n_features_to_select=20)
X_selected_rfe = rfe_selector.fit_transform(X, y)
rf_importances = RandomForestRegressor().fit(X, y).feature_importances_
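Permutation importance (listed in the overview above) is also available in scikit-learn; a minimal sketch on synthetic data:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
# Shuffle each feature on the held-out set and measure the drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)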