1 General Model Building Steps

1.1 Problem Definition

• Three main categories of predictive modeling problems (more than one category can apply in a given business problem):

  Category      Focus                             Aim
  Descriptive   What happened in the past         To "describe" or interpret observed trends by identifying relationships between variables
  Predictive    What will happen in the future    To make accurate "predictions"
  Prescriptive  The impacts of different actions  To answer the "what if?" question

• Characteristics of problems suited to predictive analytics:
  ▶ (Questions) The issue can be addressed with a few well-defined questions.
  ▶ (Data) Good and useful data is available for answering the questions above.
  ▶ (Impact) The predictions will likely drive actions or increase understanding.
  ▶ (Better solution) Predictive analytics likely produces a solution better than any existing approach.
  ▶ (Update) We can continue to monitor and update the models when new data becomes available.

• How to produce a meaningful problem definition?
  ▶ General strategy: Get to the root cause of the business issue and make it specific enough to be solvable.
1.2 Data Collection and Validation

Data design
• Population: Important for the data source to be a good proxy of the true population of interest.
• Time frame: Choose the time period which best reflects the business environment of interest. In general, recent history is better than distant history.
• Sampling: The process of taking a subset of observations from the data source to generate the dataset. (A sketch follows below.)
  ▶ Random sampling: "Randomly" draw observations from the underlying population without replacement; each record is equally likely to be sampled.
  ▶ Stratified sampling: Divide the underlying population into a no. of non-overlapping "strata" (often w.r.t. the target) non-randomly, then randomly sample a set no. of observations from each stratum ⇒ a more representative sample.
    A special case, systematic sampling: Draw observations according to a set pattern, with no random mechanism controlling which observations are sampled.
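A minimal base-R sketch of the two sampling schemes above; the data frame dat, its target column, and the 20% fraction are hypothetical:

```r
set.seed(42)
frac <- 0.20  # illustrative sampling fraction within each stratum

# Stratified sampling: split the row indices by the levels of the target
# (the strata), then draw the same fraction from each stratum at random.
strata <- split(seq_len(nrow(dat)), dat$target)
idx <- unlist(lapply(strata, function(rows) {
  sample(rows, size = max(1, round(frac * length(rows))))
}))
strat_sample <- dat[idx, ]

# For comparison, plain random sampling (without replacement):
rand_sample <- dat[sample(nrow(dat), size = round(frac * nrow(dat))), ]
```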
Data quality
• Consistency: Records in the data should be inputted consistently, on the same basis and rules.
• Sufficient documentation: Examples of useful elements:
  ▶ A description of the dataset overall, including the data source
  ▶ A clear description of each variable (definition and format)
  ▶ Notes about any past updates or other irregularities of the dataset
  ▶ A statement of accountability for the correctness of the dataset
  ▶ A description of the governance processes used to manage the dataset

Other data issues
• Personally identifiable information (PII): Information that can be used to trace an individual's identity, e.g., name, SSN, address, photographs, and biometric records.
• Sensitive variables: Differential treatment based on sensitive variables may lead to unfair discrimination and raise equity concerns.
  Examples: Race, ethnicity, gender, age, income, disability status, or other prohibited classes.
• Proxy variables: Variables that are closely related to (hence serve as a "proxy" of) prohibited variables.
• Target leakage: (Important to watch out for!)
  ▶ Definition: When predictors in a model "leak" information about the target variable that would not be available when the model is used to make predictions.

1.3 Exploratory Data Analysis
• (Important!) Decide which type of model (GLMs or trees) is more suitable, e.g., for a complex, non-monotonic relation, trees may do better.
• Univariate exploration tools (a sketch follows below):

  Variable Type  Summary Statistics      Visual Displays  Observations
  Categorical    Class frequencies       Bar charts       Which levels are most common? Any sparse levels?
  Numeric        Mean, median, minimum,  Histograms       Any skewness? Any outliers?
                 maximum
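A quick base-R sketch of these univariate checks, assuming a hypothetical data frame dat with a factor region and a numeric income:

```r
# Categorical: class frequencies and a bar chart
table(dat$region)            # which levels are most common? any sparse levels?
barplot(table(dat$region))

# Numeric: summary statistics and a histogram
summary(dat$income)          # minimum, quartiles, median, mean, maximum
hist(dat$income)             # any skewness? any outliers?
```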
• Options to handle skewness (for skewed numeric variables):
  ▶ Log transformation (requires positive values; remedy: add a small positive number to each value of the variable if there are zeros)
  ▶ Square root transformation (works for non-negative variables)

• Options to handle outliers:
  ▶ (Remove) If an outlier is unlikely to have a material effect on the model, then OK to remove it.
  ▶ (Keep) If the outliers make up only an insignificant proportion of the data, then OK to leave them in the data.
  ▶ (Modify) Modify the outliers to make them more reasonable, e.g., change negative values to zero.
  ▶ (Use robust model forms) Fit models by minimizing the absolute error (instead of squared error) between the predicted and observed values.
    Reason: Absolute error places much less relative weight on the large errors and reduces the impact of outliers on the fitted model.

• Bivariate exploration tools (two predictors):
  ▶ Numeric × Categorical: Split the numeric variable by the categorical variable; use split histograms (stacked or dodged). Any sizable differences among the factor levels?
  ▶ Categorical × Categorical: 2-way frequency table; bar charts (stacked, dodged, or filled). Any sizable differences in the class proportions among different factor levels?

• Common data issues for numeric variables:

  Issue 1: Highly correlated predictors
  ▶ Problems:
    – Difficult to separate out the individual effects of different predictors on the target variable.
    – For GLMs, the coefficients become widely varying in sign and magnitude, and difficult to interpret.
  ▶ Possible solutions (a correlation-check sketch follows below):
    – Drop one of the strongly correlated predictors.
    – Use PCA to compress the correlated predictors into a few PCs.
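A base-R sketch for flagging strongly correlated pairs before dropping a predictor or applying PCA; X is a hypothetical numeric predictor data frame, and the 0.8 threshold is illustrative:

```r
cm <- cor(X)                                  # pairwise correlations
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
high <- which(abs(cm) > 0.8, arr.ind = TRUE)  # strongly correlated pairs
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           cor  = cm[high])
```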
  Issue 2: Numeric variable or factor? Reasons to treat a variable as numeric:
  ▶ Variable values have a sense of numeric order that may be useful for predicting the target variable.
  ▶ Variable has a simple monotonic relationship with the target ⇒ its effect can be effectively captured by treating it as a numeric variable.
  ▶ Future observations will have new variable values (e.g., calendar year).

• Common issue for categorical predictors: Sparse levels
  ▶ Problem with high dimensionality/granularity: Sparse factor levels reduce the robustness of models and may cause overfitting.
  ▶ A solution: Combine sparse levels with more populous levels where the target variable behaves similarly, to form more representative and interpretable groups.

• Bivariate exploration tools (predictor vs. target; a plotting sketch follows below):

  Predictors                 Numeric Target                       Categorical Target
  Numeric × Categorical      Scatterplot colored by the           Boxplot for the numeric predictor split
                             categorical predictor                by the target and faceted by the
                                                                  categorical predictor
  Categorical × Categorical  Boxplot for the target split by one  Bar chart for one predictor filled by the
                             predictor and faceted by the other   target and faceted by the other predictor
                             predictor
  Numeric × Numeric          Bin one of the predictors (i.e., cut it into several ranges), or try a decision tree.

• Interaction vs. correlation: Literally similar, but different concepts.
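Hypothetical ggplot2 versions of two cells of the predictor-vs-target table above (numeric target charges, numeric predictor age, factor predictors smoker and region):

```r
library(ggplot2)

# Numeric x categorical predictors, numeric target:
# scatterplot colored by the categorical predictor
ggplot(dat, aes(x = age, y = charges, color = smoker)) +
  geom_point()

# Categorical x categorical predictors, numeric target:
# boxplot for the target split by one predictor, faceted by the other
ggplot(dat, aes(x = smoker, y = charges)) +
  geom_boxplot() +
  facet_wrap(~ region)
```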
1.4 Model Construction and Evaluation

• Training/test split: Before fitting models, split the data into the training set (70-80%) and the test set (20-30%) by stratified sampling.
  ▶ Models are fitted to → the training set.
  ▶ Prediction performance is evaluated on → the test set.
  Test set observations must be truly unseen to the trained model.

• Why do the split?
  ▶ Model performance on the training set tends to be overly optimistic and favor complex models.
  ▶ The test set provides a more objective ground for assessing the performance of models on new, unseen data.
  ▶ The split replicates the way the models will be used in practice.

• Trade-off about the sizes of the two sets (training vs. test data):
  ▶ Larger training set ⇒ model training is more robust.
  ▶ Larger test set ⇒ more reliable assessment of prediction performance.

Common performance metrics
• Training metrics measure goodness of fit to the training data; test metrics measure prediction performance on new, unseen data.
• Loss function: Most performance metrics use a loss function to capture the discrepancy between the actual and predicted values for each observation of the target variable. Examples:
  ▶ Square loss (most common for numeric targets)
  ▶ Absolute loss
  ▶ Zero-one loss (mostly for categorical targets)
• Metrics for regression problems (a computation sketch follows below):
  ▶ RMSE = √( (1/n) Σ_{i=1}^{n} (yi − ŷi)² )
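A sketch of the stratified split and the test RMSE; caret's createDataPartition() samples within the levels (or quantile groups) of the supplied target, and dat, target, and fit are hypothetical:

```r
library(caret)
set.seed(2024)

# 70/30 split, stratified on the target
in_train <- createDataPartition(dat$target, p = 0.7, list = FALSE)
train <- dat[in_train, ]
test  <- dat[-in_train, ]

# ... fit a model on train, producing `fit` ...

# Test RMSE
pred <- predict(fit, newdata = test)
sqrt(mean((test$target - pred)^2))
```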
• Metrics for (binary) classification problems (a pROC sketch follows below):
  ▶ Weighted average relation for the accuracy:
    accuracy = (n−/n) × specificity + (n+/n) × sensitivity.
  ▶ How confusion matrix metrics vary with the cutoff:
    Cutoff ↑ ⇒ sensitivity ↓ and specificity ↑.
  ▶ ROC curve and AUC: Plot sensitivity against specificity for all cutoffs from 0 to 1 and compute the area under the curve.
    Two special points on an ROC curve:
    (sensitivity, specificity) = (1, 0) if cutoff = 0, and (0, 1) if cutoff = 1.

• Cross-validation (CV):
  ▶ k-fold CV: Train the model on all but one fold and measure performance on the left-out fold
    ⇓
    Repeat with each fold left out in turn to get k performance values
    ⇓
    Average to get the overall CV metric, without using any test set.
  ▶ Hyperparameter tuning: Tune hyperparameters (= parameters with values supplied in advance, not optimized by the model fitting algorithm) by picking the values that produce the best CV performance (lowest MSE or highest accuracy).
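A sketch of the cutoff-based metrics and the AUC with the pROC package; test$target (0/1) and the predicted probabilities prob are hypothetical:

```r
# Confusion-matrix metrics at one cutoff
cutoff <- 0.5
pred_class <- ifelse(prob > cutoff, 1, 0)
tab <- table(actual = test$target, predicted = pred_class)
sens <- tab["1", "1"] / sum(tab["1", ])   # sensitivity (true +ve rate)
spec <- tab["0", "0"] / sum(tab["0", ])   # specificity (true -ve rate)
acc  <- mean(pred_class == test$target)   # weighted average of sens and spec

# ROC curve and AUC across all cutoffs
library(pROC)
roc_obj <- roc(test$target, prob)
plot(roc_obj)
auc(roc_obj)
```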
• Considerations when selecting the best model: Weigh prediction performance against interpretability, in light of the business problem (see Section 2.4).

Sidebar: Unbalanced data (for binary targets)
• Meaning: One class is much more dominant than the other.
• Problems with unbalanced data:
  ▶ A classifier implicitly places more weight on the majority class and tries to fit those observations well, but the minority class may be the +ve class of interest. Not a useful model!
• Solution 1 (undersampling): Keep all observations from the minority class, but draw fewer observations ("undersample") from the majority class.
  ▶ Drawback: Less data ⇒ training becomes less robust and the classifier becomes more prone to overfitting.
• Solution 2 (oversampling): Keep all observations from the majority class, but draw more observations ("oversample"), with replacement, from the minority class.
• Effects of undersampling and oversampling on the model: Both make the classifier place more weight on the minority (+ve) class, typically raising sensitivity at the cost of specificity (see the sketch below).

• Overfitting:
  ▶ Definition: The model is trying too hard to capture not only the signal, but also the noise specific to the training data.
  ▶ Indications: Small training error, but large test error.
  ▶ Problem: An overfitted model fits the training data well, but does not generalize well to new, unseen data (poor predictions).
  ▶ Effect of model complexity:

    Complexity ↓ (underfitting)   Complexity ↑ (overfitting)
    Bias ↑                        Bias ↓
    Variance ↓                    Variance ↑
    Training error ↑              Training error ↓
    Test error is U-shaped in complexity (falls, then rises).

  ▶ Bias-variance decomposition of the expected test MSE:
    E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε),
    i.e., variance + bias² + irreducible error.
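A base-R sketch of both rebalancing schemes (hypothetical training frame train with 0/1 target y, 1 being the minority +ve class); either scheme should be applied to the training set only, after the split:

```r
set.seed(1)
pos <- which(train$y == 1)   # minority (+ve) class
neg <- which(train$y == 0)   # majority class

# Undersampling: keep all minority rows, draw fewer majority rows
under <- train[c(pos, sample(neg, size = length(pos))), ]

# Oversampling: keep all majority rows, redraw minority rows with replacement
over <- train[c(neg, sample(pos, size = length(neg), replace = TRUE)), ]
```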
• Practical implications of the bias-variance trade-off:
  Need to set model complexity to a reasonable level
  ⇓
  Optimize the bias-variance trade-off (avoid underfitting & overfitting) ⇒ improve prediction performance

• Sidebar: Dimensionality vs. granularity
  ▶ Granularity ↑ ⇒ model complexity tends to ↑
  ▶ Two main differences between the two concepts:

    Concept         Applicability                       Comparability
    Dimensionality  Specific to categorical variables   Two categorical variables can always be
                                                        compared by dimension (no. of levels).
    Granularity     Applies to both numeric and         Two variables cannot always be compared
                    categorical variables               by granularity.

1.5 Model Validation
• Validation based on the training set (for GLMs): Check the "Residuals vs Fitted" plot (residuals should center on zero with roughly constant spread) and the Q-Q plot (standardized residuals should be close to normal).
• Validation methods based on the test set:
  ▶ Predicted vs. actual values of the target: The two sets of values should be close (can check this quantitatively or graphically).
  ▶ Benchmark model: Show that the recommended model outperforms a benchmark model, if one exists (e.g., intercept-only GLM, purely random classifier), on the test set.

1.6 Recommendations for Next Steps
• (Adjust the business problem) Changes in external factors, e.g., market conditions and regulations, may cause the initial assumptions to no longer hold.
• (Apply new types of models) Try new types of models when new data or modeling techniques become available.

2.1 GLMs
• Two key components:
  1 Target distribution: a member of the linear exponential family.
  2 Link function: the function g of the target mean that is set equal to the linear predictor.
• Assumptions: LMs vs. GLMs
  ▶ Independence: Given the predictor values, the observations of the target variable are independent. (Same for both LMs and GLMs.)
  ▶ Target distribution:
    – LMs: Given the predictor values, the target variable follows a normal distribution.
    – GLMs: Given the predictor values, the target distribution is a member of the linear exponential family.
  ▶ Mean:
    – LMs: The target mean directly equals the linear predictor:
      μ = η = β0 + β1X1 + ··· + βpXp.
    – GLMs: A function ("link") of the target mean equals the linear predictor:
      g(μ) = η.

• Common e.g. of target distributions and link functions (glm() sketches follow below):

  Variable Type                              Common Dist.             Common Link
  Real-valued with a bell-shaped dist.       Normal (Gaussian)        Identity
  Binary (0/1)                               Binomial                 Logit
  Count (≥ 0, integers)                      Poisson                  Log
  +ve, continuous with right skew            Gamma, inverse Gaussian  Log
  ≥ 0, continuous with a large mass at zero  Tweedie                  Log

  (Note: For gamma and inverse Gaussian, the target variable has to be strictly positive; values of zero are not allowed. Watch out if the variable has zero values; see Exercise 4.1.4 (c) in the manual.)
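Hypothetical glm() calls matching three rows of the table:

```r
# Binary target: binomial with logit link
glm(lapse ~ age + region, family = binomial(link = "logit"), data = dat)

# Count target: Poisson with log link
glm(n_claims ~ age + region, family = poisson(link = "log"), data = dat)

# Positive, right-skewed target: gamma with log link
glm(severity ~ age + region, family = Gamma(link = "log"), data = dat)
```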
• Binning: Replace a numeric predictor by dummy variables representing intervals over the range of the original variable.
  ▶ Pros: No definite order among the coefficients of the dummy variables corresponding to different bins ⇒ the target mean can vary highly irregularly over the bins.
  ▶ Cons:
    – Usually no clear choice of the no. of bins and the associated boundaries.
    – Results in a loss of information (exact values of the numeric predictor gone).

• Piecewise linear functions:
  ▶ Pros: A simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals.
  ▶ Cons: Usually no clear choice of the break points.

• Interactions:
  Need to "manually" include interaction terms of the product form XjXk
  ⇓
  Coefficient of Xj will vary with the value of Xk.

Interpretation of coefficients
• General statements:
  ▶ Coefficients express the effect (size and direction) of features on the target mean.
  ▶ p-values express the statistical significance of features; the smaller, the more significant.
• With a log link, the coefficient β̂j of a dummy variable acts multiplicatively:
  μ̂ (at the non-baseline level) = e^{β̂j} × μ̂ (at the baseline level).
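A sketch of this multiplicative reading under a log link (hypothetical Poisson fit):

```r
fit <- glm(n_claims ~ region + age, family = poisson(link = "log"), data = dat)

summary(fit)     # coefficients and their p-values
exp(coef(fit))   # e^beta_j: multiplicative change in the target mean
                 # (for a dummy variable: non-baseline mean / baseline mean)
```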
Other modeling techniques: Offsets vs. weights
• Form of the target variable:
  ▶ Offsets: Aggregate (e.g., total # claims in a group of similar policyholders).
  ▶ Weights: Average (e.g., average # claims in a group of similar policyholders).
• Do they affect the target mean or the variance?
  ▶ Offsets: The target mean is directly proportional to the exposure, e.g., with a log link, μi = Ei exp(···).
  ▶ Weights: The variance is inversely related to the exposure: Var(Yi) = (some terms)/Ei. Observations with a larger exposure have a smaller variance and carry more weight in the fitting.

Feature selection
• Selection criteria based on penalized likelihood:
  ▶ Idea: Prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount.
  ▶ Two common choices (a step() sketch follows below):

    Criterion  Definition               Penalty per Parameter
    AIC        −2l + 2(p + 1)           2
    BIC        −2l + [ln(n_tr)](p + 1)  ln(n_tr)

    (In R, −2l is treated as the deviance.)
  ▶ AIC vs. BIC:
    – For both, the lower the value, the better.
    – BIC is more conservative and results in simpler models.
• Manual binarization: Convert factor variables to dummy variables manually before running stepwise selection.
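A sketch using base R's step(); the penalty argument k is 2 (AIC) by default, and setting k = ln(n_tr) switches it to BIC (model and data hypothetical):

```r
full <- glm(target ~ ., data = train, family = poisson)

# Backward selection by AIC (k = 2, the default)
step(full, direction = "backward")

# Backward selection by BIC (k = ln(n_tr))
step(full, direction = "backward", k = log(nrow(train)))
```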
Regularization
• Idea: Reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features.
• How it works: Optimize the training loglikelihood (equivalently, the training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., minimize the training deviance plus:

  Method            Penalty Term              Feature Selection?
  Lasso             L = λ Σ_{j=1}^{p} |βj|    Some coef. may be zero
  Ridge regression  R = λ Σ_{j=1}^{p} βj²     None reduced to zero
  Elastic net       αL + (1 − α)R             Some coef. may be zero

• Two hyperparameters:
  1 λ: Regularization (a.k.a. shrinkage) parameter.
    ▶ Controls the amount of regularization:
      λ ↑ ⇒ more shrinkage ⇒ complexity ↓ ⇒ bias² ↑ and variance ↓.
    ▶ Typically tuned by CV, e.g., via cv.glmnet().
  2 α: Mixing parameter (α = 0 gives ridge, α = 1 gives lasso).
    ▶ Feature selection property: For elastic nets with α > 0, some coefficients may be reduced to exactly zero. Provided that λ is large enough, increasing α from 0 to 1 makes more coefficient estimates zero.
    ▶ Cannot be tuned by cv.glmnet(); need to tune manually. (A sketch follows below.)
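A glmnet sketch: cv.glmnet() tunes λ by CV for a fixed α, while α itself is varied manually (train and target hypothetical):

```r
library(glmnet)

# glmnet needs a numeric design matrix; factors are binarized here
X <- model.matrix(target ~ ., data = train)[, -1]
y <- train$target

# Elastic net with alpha = 0.5; lambda tuned by cross-validation
cvfit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)
coef(cvfit, s = "lambda.min")   # some coefficients may be exactly zero
```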
2.2 Single Decision Trees
• Idea: Divide the feature space into distinct regions according to the values/levels of the predictors; the resulting model is represented in the form of a "tree".
  ▶ Predictions: Observations in the same terminal node share the same predicted mean (for numeric targets) or the same predicted class (for categorical targets).
• Recursive binary splitting: Two terms describe the algorithm:
  ▶ Top-down: Start with the full feature space at the top of the tree and split it successively going down.
  ▶ Greedy: At each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better tree in a future step. (Repeat until a stopping criterion is met.)
• Node impurity measures (for classification trees): Classification error rate, Gini index, and entropy.
  ▶ Gini index and entropy are more sensitive to node impurity than the classification error rate.
    Reason: They depend on all the p̂mk, not just the maximum class proportion.
• Tree parameters:
  ▶ cp: Complexity parameter = the minimum improvement in fit required for a split to be made; higher cp ⇒ less complex tree.
• Cost-complexity pruning:
  ▶ Rationale: Reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount ⇒ prevent overfitting and ease interpretation.
  ▶ How: Minimize
    (model fit to training data) + cp × |T| (tree complexity)
    over all subtrees of T0, where the model fit term is the training RSS for regression, or the training classification error rate (not 100% right...) for classification.
  ▶ Alternative for choosing cp: the one-standard-error (1-SE) rule (see the rpart sketch below).

• Interpretation of trees: Things you can comment on:
  ▶ (Classification trees) Combinations of predictor values/levels leading to the +ve event.

• Effect of variable transformations: GLMs vs. trees:
  ▶ GLMs: Transformations matter; they alter the values of the predictors and target variable that go into the likelihood function.
  ▶ Trees, target variable: Transformations can alter the calculations of the node impurity measures, e.g., RSS, that define the tree splits.
  ▶ Trees, predictors: Monotonic transformations, e.g., log, will not change the way tree splits are made.
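An rpart sketch of growing a large tree and then pruning it by cost complexity (data hypothetical; cp values illustrative):

```r
library(rpart)

# Grow a deliberately large tree with a small cp
fit <- rpart(target ~ ., data = train, method = "anova",
             control = rpart.control(cp = 0.001))

printcp(fit)                     # CV error at each candidate cp
pruned <- prune(fit, cp = 0.01)  # prune back at a larger cp (or use the 1-SE rule)
```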
2.3 Ensemble Trees

Random forests
• Idea:
  ▶ (Variance reduction) Combine the results of multiple trees fitted to different bootstrapped training samples in parallel ⇒ reduce the variance of the overall predictions.
  ▶ (Randomization) Take a random sample of predictors as candidates for each split ⇒ reduce the correlation between base trees ⇒ further reduce the variance of the overall predictions.
• Combining base predictions to form the overall prediction (classification): Take the majority vote across the base trees (the default), or compute the average probability → overall class.
• Key parameters:
  ▶ mtry: # features considered as split candidates at each split.
    – Common choice: √p (classification) or p/3 (regression).
    – Typically tuned by CV.
  ▶ ntree: # trees to be grown.
    – Higher ntree ⇒ more variance reduction.
    – Often overfitting does not arise even if set to a large no.; set it to a relatively small value only to save run time.

Boosting
• Idea:
  ▶ In each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree's predictions from the residuals to form the new residuals.
  ▶ Each tree focuses on the observations the previous tree predicted poorly.
  ▶ Overall prediction: f̂(x) = Σ_{b=1}^{B} λ f̂_b(x).
• Key parameter:
  ▶ nrounds: Max. # rounds in the tree construction process.
    – Effect: Higher nrounds ⇒ the algorithm learns better, but is more prone to overfitting.
    – Rule of thumb: Set to a relatively large value. (A CV sketch follows below.)
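A sketch of choosing nrounds by CV with xgboost, assuming early stopping behaves as in current versions of the package (matrix X and vector y hypothetical):

```r
library(xgboost)

cv <- xgb.cv(data = xgb.DMatrix(X, label = y),
             objective = "reg:squarederror",
             nrounds = 1000,              # set relatively large...
             eta = 0.1, max_depth = 6,
             nfold = 5,
             early_stopping_rounds = 20)  # ...and stop once CV error stalls
cv$best_iteration
```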
• Random forests vs. boosted trees:

  Item                   Random Forest    Boosting
  Fitting process        In parallel      In series (sequential)
  Focus                  Variance         Bias
  Overfitting            Less vulnerable  More vulnerable
  Hyperparameter tuning  Less sensitive   More sensitive

Two interpretational tools for ensemble trees
• Variable importance plots:
  ▶ Definition of importance scores: The total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees.
• Partial dependence (PD) plots:
  ▶ Use: Plot PD(x1) against various x1 to show the marginal effect of X1 on the target variable.
  ▶ Limitations:
    – Assumes the predictor of interest is independent of the other predictors.
    – Some predictions may be based on practically unreasonable combinations of predictor values.
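A randomForest sketch producing both tools (data hypothetical; age is an assumed predictor name):

```r
library(randomForest)

rf <- randomForest(target ~ ., data = train,
                   ntree = 500,          # many trees for variance reduction
                   importance = TRUE)

varImpPlot(rf)                           # variable importance plot
partialPlot(rf, pred.data = train,
            x.var = "age")               # partial dependence of the target on age
```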
2.4 Pros and Cons of Different Models
• Tips for recommending a model: Refer to the business problem (prediction vs. interpretation) and the characteristics of the data (e.g., any complex, non-monotonic relations?).

• GLMs:
  ▶ Pros:
    1 (Target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable.
    2 (Interpretability) The model equation clearly shows how the target mean depends on the features; coefficients = interpretable measure of the directional effect of features.
    3 (Categorical variables) Categorical predictors are incorporated readily through binarization (dummy variables).
  ▶ Cons: (Interpretability) For some link functions (e.g., the inverse link), the coefficients are difficult to interpret.

• Regularized GLMs (glmnet()):
  ▶ Cons:
    1 (Categorical predictors) Possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.
    2 (Target distribution) Limited/restricted model forms allowed by glmnet(). (Weak point!)
    3 (Interpretability) Coefficient estimates are more difficult to interpret ∵ the variables are standardized. (Weak point!)

• Single trees:
  ▶ Pros:
    1 (Interpretability) If there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation.
    2 (Complex relationships) Trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually.
  ▶ Cons: (Categorical variables) Trees tend to favor categorical predictors with a large no. of levels.
    (Reason: Too many ways to split ⇒ easy to find a spurious split that looks good on the training data, but doesn't really exist in the signal.)

• Ensemble trees:
  ▶ Pros: Much more robust and predictive than base trees by combining the results of multiple trees.
  ▶ Cons:
    1 Opaque ("black box"), difficult to interpret.
      (Reason: Many base trees are used; variable importance and partial dependence plots can help.)
    2 Computationally prohibitive to implement.
      (Reason: Huge computational burden with fitting multiple base trees.)
3.1 Principal Components Analysis (PCA)
• Idea:
  ▶ Transform a set of numeric variables into a smaller set of representative variables (PCs) ⇒ reduce the dimension of the data.
  ▶ Especially useful for highly correlated data ⇒ a few PCs are enough to capture most of the information.
  ▶ PCs are linear combinations of the original features:
    z_im = φ1m xi1 + φ2m xi2 + ··· + φpm xip,
    with φ1m² + φ2m² + ··· + φpm² = 1 (normalization).
• Feature generation: Replace the original variables by PCs to reduce overfitting and improve prediction performance.
• Interpretation of PCs:
  ▶ Signs and magnitudes of PC loadings: What do the PCs represent, e.g., a proxy, average, or contrast of which variables? Which variables are more correlated with one another?
  ▶ Sizes of the proportions of variance explained (PVEs): Are the first few PVEs large enough (related to the strength of the correlations between variables)? If so, the PCs are useful.
• Biplots: Visualization of PCA output by displaying both the scores and the loading vectors of the first two PCs.
  ▶ PC loadings on the top and right axes ⇒ deduce the meaning of the PCs.
  ▶ PC scores on the bottom and left axes ⇒ deduce the characteristics of the observations (based on the meaning of the PCs).
• Number of PCs (M) to use:
  ▶ Trade-off: M ↑ ⇒ cumulative PVE ↑, but less dimension reduction.
• Drawbacks of PCA: Applies only to numeric variables, and the PCs, being blends of the original features, can be harder to interpret. (A prcomp sketch follows below.)
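A prcomp sketch (variables centered and scaled first, as is typical; X a hypothetical numeric data frame):

```r
pca <- prcomp(X, center = TRUE, scale. = TRUE)

summary(pca)          # PVE and cumulative PVE of each PC
pca$rotation[, 1:2]   # loadings: signs/magnitudes suggest what the PCs mean
head(pca$x[, 1:2])    # scores: characteristics of individual observations
biplot(pca)           # scores + loading vectors of the first two PCs
```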
3.2 Cluster Analysis
• Idea: Partition the observations into a set of subgroups ("clusters") and uncover hidden patterns.
  ▶ Observations within each cluster should be rather similar to one another.
  ▶ Observations in different clusters should be rather different (well separated).
• Two feature generation methods based on clustering:
  ▶ Cluster groups: as a new factor variable.
  ▶ Cluster means: as a new numeric variable.
• K-means clustering: choosing K (the no. of clusters):
  ▶ Elbow method: Make a plot of the proportion of variation explained (= between-cluster SS / total SS) against K; choose the K at the "elbow", beyond which the proportion rises only marginally, and check that the resulting clusters have materially different characteristics.
    [Figure: proportion of variation explained rising toward 1 as K grows, with the elbow marking the suggested K.]

• Hierarchical clustering: linkages:

  Linkage             Distance Between Two Clusters
  Complete (default)  Maximal pairwise distance
  Single              Minimal pairwise distance
  Average             Average of all pairwise distances
  Centroid            Distance between the two cluster centroids

  ▶ Complete and average linkage are commonly used.
    (Reason: They tend to result in more balanced clusters.)
  ▶ Single linkage tends to produce extended, trailing clusters. (Sketches of both algorithms follow below.)
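Base-R sketches of both algorithms (X a hypothetical scaled numeric matrix). The elbow plot uses the total within-cluster SS, which falls exactly as the between-cluster proportion rises:

```r
# K-means: elbow plot over K = 1..10
set.seed(7)
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# Hierarchical clustering with complete linkage
hc <- hclust(dist(X), method = "complete")
plot(hc)                        # dendrogram
clusters <- cutree(hc, k = 3)   # cut to obtain 3 clusters (illustrative)
```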
• Scaling: Standardize the features in advance so that they are on the same scale and share the same degree of importance.

• K-means vs. hierarchical clustering:

  Item                                   K-means                            Hierarchical
  Is randomization needed?               Yes (for initial cluster centers)  No
  Is the no. of clusters pre-specified?  Yes (K needs to be specified)      No (specify the height at which to
                                                                            cut the dendrogram later)
  Are the clusters nested?               No                                 Yes (a hierarchy of clusters)

• Alternative distance measures: correlation-based distance.
  ▶ Motivation: Focuses on the shapes of the feature values rather than their exact magnitudes.
  ▶ Limitation: Only makes sense when p ≥ 3, for otherwise the correlation between two observations always equals ±1.

• Clustering and the curse of dimensionality:
  ▶ Visualization of the results of cluster analysis becomes problematic in high dimensions (p ≥ 3).
  ▶ As the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate between observations that are close and those that are far apart.