06 - Data Preprocessing
Data preprocessing
Real-world machine learning pipelines
Joaquin Vanschoren
Data transformations
Machine learning models make a lot of assumptions about the data
In reality, these assumptions are often violated
We build pipelines that transform the data before feeding it to the learners
Scaling (or other numeric transformations)
Encoding (convert categorical features into numerical ones)
Automatic feature selection
Feature engineering (e.g. binning, polynomial features,...)
Handling missing data
Handling imbalanced data
Dimensionality reduction (e.g. PCA)
Learned embeddings (e.g. for text)
Seek the best combinations of transformations and learning methods
Often done empirically, using cross-validation
Make sure that there is no data leakage during this process!
Scaling
Use when different numeric features have different scales (different range of values)
Features with much higher values may overpower the others
Goal: bring them all within the same range
Different methods exist
Why do we need scaling?
KNN: Distances depend mainly on the features with larger values
SVMs: (kernelized) dot products are also based on distances
Linear models: Feature scale affects regularization; with scaling, weights have similar scales and are more interpretable
Standard scaling (standardization)
Generally most useful, assumes data is more or less normally distributed
Per feature, subtract the mean value μ, then scale by the standard deviation σ
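A minimal sketch of standard scaling with scikit-learn's StandardScaler (the toy array below is illustrative, not from the slides):

from sklearn.preprocessing import StandardScaler
import numpy as np

X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])   # two features on very different scales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)                          # per feature: subtract mean, divide by std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))              # ~0 and ~1 per feature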
Target encoding
Encode category i by the fraction of its samples that have class Y: n_iY / n_i
Blending: gradually decrease the encoding as you get more examples of category i with class Y=0

$$Enc(i) = \frac{1}{1+e^{-(n_i-1)}} \cdot \frac{n_{iY}}{n_i} + \left(1 - \frac{1}{1+e^{-(n_i-1)}}\right) \cdot \frac{n_Y}{n}$$

with n_iY: number of samples with category i and class Y, n_i: number of samples with category i, n_Y: number of samples with class Y, n: total number of samples
Same for regression, using n_iY/n_i: average target value for samples with category i, and n_Y/n: overall target mean
Example (figure): Enc(i) as a function of n when n_iY = 1
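A small sketch of the blended encoding formula above (the helper function and the numbers are illustrative):

import numpy as np

def blended_encoding(n_iY, n_i, n_Y, n):
    # The weight lam grows with the number of samples n_i of category i
    lam = 1.0 / (1.0 + np.exp(-(n_i - 1)))
    return lam * (n_iY / n_i) + (1 - lam) * (n_Y / n)

print(blended_encoding(n_iY=1, n_i=1, n_Y=500, n=1000))     # rare category: 0.75, pulled towards the prior 0.5
print(blended_encoding(n_iY=90, n_i=100, n_Y=500, n=1000))  # frequent category: ~0.9, close to its own ratio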
# Correct cross-validation: the preprocessing is fit on the training folds only (inside the pipeline)
scores = cross_val_score(pipe, X, y)
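For context, pipe could be built as follows; the StandardScaler/Ridge combination and the synthetic data are placeholders, not necessarily what the original notebook uses:

from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
# All preprocessing lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipe, X, y)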
If you want to apply different preprocessors to different columns, use ColumnTransformer
If you want to merge pipelines, you can use FeatureUnion to concatenate columns
# 2 sub-pipelines, one for numeric features, the other for categorical ones
numeric_pipe = make_pipeline(SimpleImputer(), StandardScaler())
categorical_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
# ColumnTransformer applies each preprocessor to its own list of columns
preprocessor = make_column_transformer((StandardScaler(), numeric_features),
                                       (OneHotEncoder(), categorical_features))
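A sketch of FeatureUnion as mentioned above; the PCA + SelectKBest combination and the synthetic data are illustrative, not from the slides:

from sklearn.datasets import make_classification
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Each transformer sees all columns; their outputs are concatenated side by side
union = FeatureUnion([("pca", PCA(n_components=2)), ("kbest", SelectKBest(k=3))])
X_combined = union.fit_transform(X, y)   # shape: (200, 2 + 3)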
Pipeline selection
Unsupervised feature selection
Remove features X_i (columns X_:,i) that are highly correlated (have a high correlation coefficient ρ):

$$\rho(X_i, X_j) = \frac{cov(X_i, X_j)}{\sigma_{X_i}\,\sigma_{X_j}}, \qquad cov(X_i, X_j) = \frac{1}{N-1}\sum_{n=1}^{N} (x_{n,i} - \bar{X_i})(x_{n,j} - \bar{X_j})$$
Should we remove feel_temp ? Or temp ? Maybe one correlates more with the target?
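A sketch of correlation-based filtering with pandas, mimicking the temp/feel_temp situation (the 0.9 threshold and the toy data are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(20, 5, size=200)
df = pd.DataFrame({"temp": temp,
                   "feel_temp": temp + rng.normal(0, 0.5, size=200),   # nearly a copy of temp
                   "humidity": rng.uniform(0, 1, size=200)})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))      # upper triangle, excluding the diagonal
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]   # here: ['feel_temp']
df_reduced = df.drop(columns=to_drop)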
Supervised feature selection: overview
Univariate: F-test and Mutual Information
Model-based: Random Forests, Linear models, kNN
Wrapper methods (black-box search)
Permutation importance
Univariate statistics (F-test)
Consider each feature individually (univariate), independent of the model that you aim to apply
Use a statistical test: is there a statistically significant (linear) relationship between the feature and the target?
Use the F-statistic (or the corresponding p-value) to rank all features, then select features using a threshold
Best k, best k%, probability of removing useful features (FPR), ...
Cannot detect correlations (e.g. temp and feel_temp) or interactions (e.g. binary features)
F-statistic
For regression: does feature X_i correlate (positively or negatively) with the target y?

$$\text{F-statistic} = \frac{\rho(X_i, y)^2}{1 - \rho(X_i, y)^2} \cdot (N - 1)$$
For classification: do the class means of X_i differ more than expected given the spread within each class (ANOVA)?

$$\text{F-statistic} = \frac{\text{between-class variance}}{\text{within-class variance}} = \frac{var(\bar{X_i})}{\overline{var(X_i)}}$$

with var(\bar{X_i}): variance of the per-class means, \overline{var(X_i)}: average of the per-class variances
Mutual information
Measures how much information X_i gives about the target Y. In terms of entropy H:

$$MI(X_i, Y) = H(X_i) + H(Y) - H(X_i, Y)$$

Idea: estimate H(X) from the average distance between a data point and its k nearest neighbors
Captures complex dependencies (e.g. hour, month), but requires more samples to be accurate
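A sketch contrasting the two univariate scores on a purely nonlinear dependency (synthetic data, not from the slides): the F-test misses the quadratic relation, mutual information picks it up.

import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 1000)                 # related to y, but only quadratically
x2 = rng.uniform(-1, 1, 1000)                 # pure noise
y = x1 ** 2 + 0.05 * rng.normal(size=1000)
X = np.column_stack([x1, x2])

f_values, _ = f_regression(X, y)                            # low F-score for x1: no linear correlation
mi_values = mutual_info_regression(X, y, random_state=0)    # clearly positive MI for x1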
Model-based Feature Selection
Use a tuned(!) supervised model to judge the importance of each feature
Linear models (Ridge, Lasso, LinearSVM,...): features with highest weights (coefficients)
Tree-based models: features used in the first splits (high information gain)
Selection model can be different from the one you use for final modelling
Captures interactions: features are more/less informative in combination (e.g. winter, temp)
Random Forests: learn complex interactions (e.g. hour), but are biased towards high-cardinality features
Relief: Model-based selection with kNN
Increase feature weights if x_i and x_k have a different class (near miss), else decrease them:

$$w_i = w_{i-1} + (x_i - \text{nearMiss}_i)^2 - (x_i - \text{nearHit}_i)^2$$
Many variants: ReliefF (uses L1 norm, faster), RReliefF (for regression), ...
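A compact sketch of the basic Relief update rule above, for a binary classification problem with Euclidean nearest neighbors (the function and data are illustrative, not a full ReliefF implementation):

import numpy as np
from sklearn.datasets import make_classification

def relief(X, y, n_iter=100, random_state=0):
    # Basic Relief: reward features that differ on the nearest miss and agree on the nearest hit
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                                   # exclude the point itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))      # nearest neighbor with the same class
        miss = np.argmin(np.where(~same, dists, np.inf))    # nearest neighbor with a different class
        w += (X[i] - X[miss]) ** 2 - (X[i] - X[hit]) ** 2   # per-feature update from the formula above
    return w

X, y = make_classification(n_samples=200, n_features=5, n_informative=2, random_state=0)
weights = relief(X, y)   # higher weight = more relevant feature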
Iterative Model-based Feature Selection
Dropping many features at once is not ideal: feature importance may change in subset
Recursive Feature Elimination (RFE)
Remove the s least important feature(s), recompute the remaining importances, repeat
Univariate:
For regression: f_regression , mutual_info_regression
For classification: f_classif , chi2 , mutual_info_classif
Selecting: SelectKBest , SelectPercentile , SelectFpr ,...
selector = SelectPercentile(score_func=f_regression, percentile=50)
X_selected = selector.fit_transform(X, y)
selected_features = selector.get_support()                          # boolean mask of the selected features
f_values, p_values = f_regression(X, y)                             # F-score and p-value per feature
mi_values = mutual_info_regression(X, y, discrete_features=False)   # all features treated as continuous
Model-based:
SelectFromModel : requires a model and a selection threshold
RFE , RFECV (recursive feature elimination): requires a model and the final number of features (RFECV picks it via cross-validation)
selector = SelectFromModel(RandomForestRegressor(), threshold='mean')
X_selected = selector.fit_transform(X, y)
rfe_selector = RFE(RidgeCV(), n_features_to_select=20)
X_selected_rfe = rfe_selector.fit_transform(X, y)
rf_importances = RandomForestRegressor().fit(X, y).feature_importances_
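Permutation importance (listed in the overview above) is also available in scikit-learn; a minimal sketch on synthetic data:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
# Shuffle each feature on the held-out set and measure the drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)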