ADVICE ABOUT PRACTICAL
ASPECTS OF ML
Jesse Davis
Goals of this Lecture: Address Practical Aspects of Machine Learning
Massaging the data for better performance
Discussing how to set up an appropriate empirical evaluation
Identifying potential pitfalls
At a high level: a bunch of stuff I wish I had known for
Performing academic empirical evaluations
Dealing with real-world “applied” tasks
Part I: Selecting Features
Dimensionality Reduction
Represent data with fewer dimensions! ☺
Effectively: Alter the given feature space
Two broad ways
Construct new feature space
Simply drop dimensions in the given space
Why Dimensionality Reduction?
Easier learning – fewer parameters
What if |Features| ≫ |training examples|?
Better visualization
Hard to understand more than 3D or 4D
Discover “intrinsic dimensionality” of data
High dimensional data may truly be low dimensional
More interpretable models
Interested in which features are relevant for task
Improve efficiency
Fewer features = less memory / runtime
Don’t Some Algorithms Do This?
Decision trees:
Select the most promising feature at each node
Tree only contains a subset of features
Problem: Irrelevant attributes can degrade performance due to
data fragmentation
Data is split into smaller and smaller sets
Even a random attribute can look good by chance when little data remains
More data does not help
Principal Component Analysis
First principal component:
Direction of the largest variance
Each subsequent principal component:
Orthogonal to the previous ones, and
Direction of the largest variance of the residuals
[Figure: 2D data in the (x1, x2) plane with the first principal component direction u1]
Big Idea: Rotate the axes and drop irrelevant ones!
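To make this concrete, here is a minimal PCA sketch in Python (NumPy only; the data matrix X and the number of components k below are placeholder assumptions):

import numpy as np

def pca(X, k):
    # Center the data so the principal components pass through the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the principal directions,
    # sorted by decreasing variance (singular value).
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]            # top-k rotated axes
    Z = X_centered @ components.T  # N x k projection of the data
    return Z, components

# Hypothetical example: 100 images of 50x50 pixels reduced to 15 dimensions.
X = np.random.randn(100, 2500)
Z, components = pca(X, k=15)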
Eigenfaces [Turk, Pentland ’91]
Input images:
N images
Each 50×50 pixels = 2500 features
The figure is misleading: best to think of the data as an N × 2500 matrix, i.e., |Examples| × |Features|
Reduce Dimensionality 2500 → 15
[Figure: the average face, the first principal component, and the other eigenface components]
Problematic Data Set for PCA
PCA cannot capture NON-LINEAR structure!
PCA Conclusions
PCA
Rotate the axes and sort new dimensions in order of “importance”
Discard low significance dimensions
Uses:
Get compact description
Ignore noise
Improve classification (hopefully)
Not magic:
Doesn’t know class labels
Can only capture linear variations
One of many tricks to reduce dimensionality!
Feature Selection: Two Approaches
Filtering-based feature selection: all features → FS algorithm scores and ranks each feature and picks the top k → ML algorithm → model
Wrapper-based feature selection: all features → FS algorithm calls the ML algorithm many times and uses it to help select features → ML algorithm → model
Filter-Based Approaches
Idea: Measure each feature’s usefulness in isolation (i.e.,
independent of other features)
Pro: Very fast so scales to large feature sets or large data sets
Cons
Misses feature interactions
May select many redundant features
Approach 1: Correlation
Information gain (as used in decision trees) is one possible filter score:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Correlation between a feature f_i and the label y is another:
R(f_i, y) = cov(f_i, y) / √(var(f_i) · var(y))
Estimated from m training examples:
R(f_i, y) = Σ_{k=1}^{m} (f_{k,i} − f̄_i)(y_k − ȳ) / √( Σ_{k=1}^{m} (f_{k,i} − f̄_i)² · Σ_{k=1}^{m} (y_k − ȳ)² )
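A minimal sketch of this correlation score in NumPy (X is an m × d feature matrix and y the label vector; both are placeholders):

import numpy as np

def correlation_filter(X, y, k):
    # Pearson correlation of each feature column with the label,
    # computed exactly as in the estimate above.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / den
    # Rank by absolute correlation and keep the indices of the top-k features.
    return np.argsort(-np.abs(r))[:k]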
Approach 2: Single Variable Classifier
Select variables according to their individual predictive performance
Build classifier with just one variable
Discrete:
Decision stump
Continuous: Threshold the variable value
Measure performance using accuracy, balanced accuracy, AUC, etc.
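A minimal sketch of this scoring for continuous features, assuming binary labels; using the raw feature value as the classifier's score, AUC measures how well the feature alone ranks positives above negatives:

import numpy as np
from sklearn.metrics import roc_auc_score

def single_variable_auc(X, y):
    scores = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])   # feature value itself as the score
        scores.append(max(auc, 1 - auc))  # handle inversely related features
    return np.array(scores)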
Wrapper-Based Feature Selection
Feature selection = search
State = set of features
Start state
Forward selection: Empty feature set
Backward elimination: Full feature set
Operators:
Forward: add a feature
Backward: subtract a feature
Scoring function: Learned model's performance (on training data, a tuning set, or CV) with the state's feature set
Forward Feature Selection
Greedy search (aka “Hill Climbing”)
{}  50%
{F1}  62%    {F2}  72%    ...    {Fd}  52%
Add the best single feature, F2, then score its extensions (e.g., add F3):
{F1,F2}  74%    {F2,F3}  73%    ...    {F2,Fd}  84%
Backward Feature Selection
Greedy search (aka “Hill Climbing”)
{F1,…,Fd}  75%
{F2,…,Fd}  72%    {F1,F3,…,Fd}  82%    ...    {F1,…,Fd-1}  78%
Subtract the least useful feature, F2, then score further removals (e.g., subtract F3):
{F3,…,Fd}  80%    {F1,F4,…,Fd}  83%    ...    {F1,F3,…,Fd-1}  81%
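A minimal sketch of this greedy search, assuming a scikit-learn-style estimator, a NumPy feature matrix, and cross-validated accuracy as the scoring function:

import numpy as np
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(model, X, y, max_features):
    # Hill climbing: repeatedly add the feature that most improves CV score.
    selected, remaining, best_score = [], list(range(X.shape[1])), -np.inf
    while remaining and len(selected) < max_features:
        score, f = max((cross_val_score(model, X[:, selected + [f]], y).mean(), f)
                       for f in remaining)
        if score <= best_score:  # local optimum: no addition helps
            break
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected, best_score

Backward elimination is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least. scikit-learn ships both directions as sklearn.feature_selection.SequentialFeatureSelector.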
Forward vs. Backward Selection
Forward:
Faster in early steps because fewer features to test
Fast for choosing a small subset of the features
Misses features whose usefulness requires other features (feature synergy)
Backward:
Fast for choosing all but a small subset of the features
Preserves features whose usefulness requires other features (e.g., area requires both length & width)
Impact of Feature Selection on Classification of fMRI Data [Pereira et al. ’05]
Feature Selection vs. Dimensionality Reduction
Feature selection: Project onto a lower-dimensional subspace perpendicular to the removed feature
Dimensionality reduction: Allows other kinds of projection
[Figure: left, feature selection drops x2; right, dimensionality reduction projects onto rotated axes]
Feature Selection in Practice
You cannot globally select the best features
This is cheating
Data leakage from test set to training set
Results would be overoptimistic
Feature selection must be performed separately for each fold
Implication: Each fold could have a different feature set
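With scikit-learn, the easiest way to respect this rule is to put feature selection inside a Pipeline, so it is re-fit on each fold's training data only (SelectKBest, k=20, and logistic regression are placeholder choices; X and y are your data):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),    # fit on the training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])
# cross_val_score re-fits the whole pipeline per fold, so the selected
# features can differ from fold to fold, exactly as the slide implies.
scores = cross_val_score(pipe, X, y, cv=10)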
Part II: Advice for Evaluation
Empirical Evaluation: Think about What You Want to Demonstrate
Many relevant questions:
Do we beat competitors?
Are we more data efficient than the competition?
Are we faster than the competition?
Good practices:
Pose a question / hypothesis and answer it
Also include a naive baseline such as
◼ Always predict majority class
◼ Return mean value in training data
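scikit-learn's dummy estimators implement exactly these naive baselines; a minimal sketch:

from sklearn.dummy import DummyClassifier, DummyRegressor

majority = DummyClassifier(strategy="most_frequent")  # always predict the majority class
mean_reg = DummyRegressor(strategy="mean")            # always return the training-set mean
# Fit and score these like any other model to get a floor for comparison.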
Case Study: RPE for Professional Soccer Players
Given: GPS and accelerometer data from a player's training session
Predict: Player's Rate of Perceived Exertion (RPE)
Question: Is the model valid across seasons?
[Figure: MAE (0.00–1.20) of the train-set average, a neural net, and LASSO]
Results: Is an Individual Model More Accurate Than a Team Model?
[Figure: mean absolute error (0.65–0.90) of a neural net, a boosted tree, and LASSO, trained per individual vs. for the whole team; lower is better]
How Does Amount of Data Affect Performance?
[Figure: AUCPR (0.00–0.50) of TODTLER, DTM, LSM, and Random vs. the number of training databases (1–3)]
Learning curve: Show performance as a function of the amount of training data
Case Study: Activity Recognition
Given: 3D accelerometer data from a phone
Predict: Person’s activity (walking, ascending stairs, descending
stairs, cycling, jogging)
Hypothesis: Deriving new signals will help
Setup: Simulate different attachments by
rotating axes
Approaches compared:
TSFuse + GBT
TSFresh + GBT (Time series features, but no fusion)
RNN (LSTM)
Results: Activity Recognition
[Figure: results for TSFuse, TSFresh, and the RNN on the activity recognition task]
Case Study: Energy Efficient Prediction
Motivation: Learned models often deployed on devices with
resource constraints (e.g., battery)
Question: How does feature selection strategy affect performance?
Static selection: Always consider k features
Dynamic selection: May ignore some features
Approach: Fix max feature budget
RCV: Speedup and Weighted Accuracy vs. Feature Budget
Our approach: 4X more predictions on the same resource budget
[Figure: speedup factor (0.00–6.00) and Δ weighted accuracy (−0.01 to 0.02) vs. feature budget (0–1000), comparing IG and ΔCP]
Comparing Run Times Is A Dark Art
What to measure: Wall clock or CPU time?
Be sure to run everything on identically configured machines
Should you include time to tune models?
Easy to manipulate
Also very relevant…
Differences due to
Programming languages
How optimized the code is (definitely relevant)
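In Python, the two measurements contrasted above map onto two different clocks; a minimal sketch, where train() is a hypothetical stand-in for whatever you are benchmarking:

import time

wall_start = time.perf_counter()  # wall clock: includes I/O, waiting, other processes
cpu_start = time.process_time()   # CPU time consumed by this process only

train()                           # hypothetical workload

print(f"wall: {time.perf_counter() - wall_start:.2f}s, "
      f"cpu: {time.process_time() - cpu_start:.2f}s")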
Evaluate Design Decisions: Ablation or Lesion Study
When designing your algorithm / model, you make many design choices
Which features
Which normalizations
Which functionality
Ablative analysis tries to explain the difference between some
baseline (much poorer) performance and current performance
Remove aspects of the system and measure the effect on performance
Case Study: Fatigue Protocol Data
Rating of perceived exertion (RPE): 6 – 20
[Figure: IMU sensor placements on a runner: upper arm, wrist, tibia, or both]
Given: IMU data from a runner
Predict: Current fatigue level
Pre-processing: Normalizations Based on Domain
Knowledge
RPE evolution is trial-dependent: Normalize to the first value
Normalize features based on the change from the first window
Domain insight: Change in feature values over time is key
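A minimal NumPy sketch of this normalization, assuming features is a (windows × features) array for a single trial with the first window in row 0:

import numpy as np

def normalize_to_first_window(features):
    # Express each feature as the change relative to its value in the first
    # window, so the model sees within-trial evolution rather than absolute,
    # trial-dependent magnitudes.
    return features - features[0]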
Effects of Feature Normalization for Gradient Boosted Trees
No-learning baselines (constant predictions): median RPE and personalized median
[Figure: MAE of RPE (0.00–3.50) for the median RPE baseline, the personalized median baseline, and gradient boosted trees with and without normalization]
Case Study: Resource Monitoring
Univariate measurement: Sampled every 5 minutes
[Figure: water usage time series showing maintenance periods and abnormally high usage patterns]
Given: Real water usage data from a retail store
Do: Detect periods of abnormally high usage
Approach: Semi-supervised learning
Simple statistical features, day of week, etc.
Above features plus learned shape patterns
Results: Anomaly Detection for Water Usage
[Figure: area under ROC curve over time for simple features vs. simple features + learned patterns]
Part III: Potential Problems or Pitfalls
Cross Validation Errors
Must repeat entire data processing pipeline on every fold of
cross-validation using only that fold’s TRAINING DATA
E.g., cannot do preprocessing over the entire data set (feature selection, parameter tuning, etc.)
Did I tweak my algorithm a million times until I got good results?
Solution: Use one or two datasets for development, then expand the evaluation
Temporal dependencies in the data?
Temporal Data Is Trickier!
Setting: One season of data from training sessions of a professional football team
Season start ──────────────────── Season end
Training: first 80% of data | Testing: last 20% of data
Another temporal setting: Predict adverse drug reactions
Patient's history → First prescription → Adverse reaction?
Training data | Censoring window
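For the train-on-the-past, test-on-the-future splits sketched above, scikit-learn's TimeSeriesSplit yields chronologically growing training folds; a minimal sketch (model, X, and y are placeholders, with rows assumed to be in time order):

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each split trains on an initial segment of the season and tests on the
# segment that follows it, so no future data leaks into training.
scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))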
Class Imbalance
Real-world problems: Often more examples of one class
(negatives) than the other (positives)
One class rare: Anomaly detection, cancer, goals in a soccer
match, etc.
This causes difficulties for learners: Hard to beat always
predicting the majority class!
Idea 1: Sampling
Oversample the minority class: May lead to overfitting
Undersample the majority class: Odd to throw away data
SMOTE: Generate synthetic minority examples
Find nearest neighbors
Interpolate between them
[Figure: SMOTE generates a synthetic example by interpolating between a minority example and one of its nearest minority-class neighbors]
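A minimal sketch of the interpolation step itself (the imbalanced-learn package provides a full implementation as imblearn.over_sampling.SMOTE):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_minority, n_new, k=5, seed=0):
    # For each synthetic example: pick a minority point, pick one of its k
    # nearest minority neighbors, and interpolate a point between the two.
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)       # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        j = idx[i, rng.integers(1, k + 1)]   # a random true neighbor
        lam = rng.random()                   # interpolation weight in [0, 1]
        new.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(new)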
Idea 2: Manipulate the Learner
Change the cost function: Penalize mistakes on minority class
more heavily
Optimize towards something that is better at capturing skew
Balanced accuracy = 0.5 × TP / (TP + FN) + 0.5 × TN / (FP + TN)
F1 = 2 × (precision × recall) / (precision + recall)
ROC
Precision / Recall
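Both ideas are one-liners in scikit-learn: class_weight re-weights the training loss toward the minority class, and the skew-aware metrics above are available directly (a sketch with placeholder train/test splits):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score

# 'balanced' weights each class inversely to its frequency, so mistakes
# on the minority class cost more in the loss being optimized.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(balanced_accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))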
My Model Is Not Accurate Enough
Suppose: Activity recognition with classes walking, running, ascending stairs, and descending stairs
Five minutes of data from ten subjects
Divide data into 5 second windows which yields 600 examples
Use five simple features from X, Y, Z acceleration
Train linear separator using log loss
Optimize using gradient descent
Leave-one-subject-out CV: 70% accuracy
Question: What do I do?
Possible fixes
More data
More / better features
Change optimizer
Change objective function
Change model class
Question: What do I do?
Option 1: Grad student descent and try everything
Option 2: Debug the learning process
Look at Learning Curve
[Figure: two learning curves plotting train and test error vs. #training examples]
Left: train and test error are both high and close → more/better features, or a more expressive model?
Right: train error is low but test error is high → more data
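scikit-learn can compute these curves directly; a minimal sketch (model, X, and y are placeholders, and the returned values are scores, so error = 1 − score):

import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
# Compare the mean train vs. test curves to pick a fix, as diagnosed above.
print(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1))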
Conclusions
Feature selection is important in practice
Think about what you want to show in your empirical evaluation
Practical issues are hard
It is a lot of guess and check at first
Eventually you develop intuitions
Generally speaking: Features and data are more important than the model
Questions?