
MESSAGE FROM AMBROSE

Exam PA Review Sheet (Last updated 06/04/2023)


This review sheet (a.k.a. a "cheat sheet") follows the order of topics in the ACTEX Study Manual for Exam PA and provides a "helicopter" view of the entire PA exam syllabus. As you probably know, the focus of Exam PA is on conceptual understanding and written communication. To write well and score high, there are quite a lot of things you have to memorize in advance (e.g., describe best subset selection, explain how cost-complexity pruning works, the pros and cons of GLMs vs. decision trees), and this cheat sheet collects, I believe, the most important items in one place for your convenience. Although no substitute for a thorough review of the study manual, this cheat sheet should be a valuable aid to enhance retention and memorization. When I myself took Exam PA in December 2019, I used a preliminary version of this cheat sheet and found it quite useful. (It was only 7 pages long then!) Please feel free to refine it and put in additional facts and tips that you think are valuable.

1 General Model Building Steps

1.1 Problem Definition

• Three main categories of predictive modeling problems:
  (More than one category can apply in a given business problem.)

  Category     | Focus                                           | Aim
  Descriptive  | What happened in the past                       | To "describe" or interpret observed trends by identifying relationships between variables
  Predictive   | What will happen in the future                  | To make accurate "predictions"
  Prescriptive | The impacts of different "prescribed" decisions | To answer the "what if?" and "what is the best course of action?" questions

• Characteristics of predictive modeling problems:
  B (Issue) There is a clearly identified and defined business issue to be addressed.
  B (Questions) The issue can be addressed with a few well-defined questions.
  B (Data) Good and useful data is available for answering the questions above.
  B (Impact) The predictions will likely drive actions or increase understanding.
  B (Better solution) Predictive analytics likely produces a solution better than any existing approach.
  B (Update) We can continue to monitor and update the models when new data becomes available.

• How to produce a meaningful problem definition?
  B General strategy: Get to the root cause of the business issue and make it specific enough to be solvable.
  B Specific strategies:
    - (Hypotheses) Use prior knowledge of the business problem to ask questions and develop testable hypotheses.
    - (KPIs) Select appropriate key performance indicators to provide a quantitative basis for measuring success.
• Constraints to face:
  B The availability of easily accessible and high-quality data
  B Implementation issues, e.g., the presence of the necessary IT infrastructure and technology to fit complex models efficiently, and the cost and effort required to maintain the selected model

1.2 Data Collection and Validation

Data design

• Relevance: Need to ensure that the data is unbiased, i.e., representative of the environment where the model will operate.
  B Population: Important for the data source to be a good proxy of the true population of interest.
  B Time frame: Choose the time period which best reflects the business environment of interest. In general, recent history is better than distant history.

• Sampling: The process of taking a subset of observations from the data source to generate the dataset.
  B Random sampling: "Randomly" draw observations from the underlying population without replacement. Each record is equally likely to be sampled.
  B Stratified sampling: Divide the underlying population into a no. of non-overlapping "strata" (often w.r.t. the target) non-randomly, then randomly sample a set no. of observations from each stratum ⇒ get a more representative sample.
    A special case—systematic sampling: Draw observations according to a set pattern; there is no random mechanism controlling which observations are sampled.

• Granularity: Refers to how precisely a variable is measured, i.e., the level of detail for the information contained by the variable.

Data quality issues

• Reasonableness: Data values should be reasonable (make sense) in the context of the business problem, e.g., variables such as age, time, and income should be non-negative.

• Consistency: Records in the data should be inputted consistently on the same basis and rules, e.g.:
  B Same measurement unit for numeric variables
  B Same coding scheme for categorical variables

• Sufficient documentation: Examples of useful elements:
  B A description of the dataset overall, including the data source
  B A clear description of each variable (definition and format)
  B Notes about any past updates or other irregularities of the dataset
  B A statement of accountability for the correctness of the dataset
  B A description of the governance processes used to manage the dataset

Other data issues

• Personally identifiable information (PII): Information that can be used to trace an individual's identity, e.g., name, SSN, address, photographs, and biometric records.

How to handle PII?
  B Anonymization: Anonymize or de-identify the data to remove the PII.
  B Data security: Ensure that the data receives sufficient protection.
  B Terms of use: Be well aware of the terms and conditions, and the privacy policy, related to the collection and use of the data.

• Variables with legal/ethical concerns:
  B Sensitive variables: Differential treatment based on sensitive variables may lead to unfair discrimination and raise equity concerns.
    Examples: Race, ethnicity, gender, age, income, disability status, or other prohibited classes
  B Proxy variables: Variables that are closely related to (hence serve as a "proxy" of) prohibited variables.
    Examples:
    - Occupation (possibly a proxy of gender)
    - Geographical location (possibly a proxy of age and income)

• Target leakage: (Important to watch out for!)
  B Definition: When predictors in a model "leak" information about the target variable that would not be available when the model is deployed in practice.
  B Key to detecting target leakage—timing: These variables are observed at the same time as or after the target variable.
  B Problem with this issue: These variables cannot serve as predictors in practice and would lead to artificially good model performance if mistakenly included.

1.3 Exploratory Data Analysis (EDA)

• Aim: Use summary statistics + graphical displays to gain insights into the distribution of variables on their own and in relation to one another (esp. the target variable). Some typical uses:
  B Clean and validate the data to make it ready for analysis
  B Identify potentially useful predictors
  B Generate useful features (e.g., variable transformations)
  B (Important!) Decide which type of model (GLMs or trees) is more suitable, e.g., for a complex, non-monotonic relation, trees may do better.

• Univariate exploration tools:

  Variable Type | Summary Statistics                       | Visual Displays       | Observations
  Numeric       | Mean, median, variance, minimum, maximum | Histograms, boxplots  | Any (right) skew? Any unusual values?
  Categorical   | Class frequencies                        | Bar charts            | Which levels are most common? Any sparse levels? (For binary targets) Presence of imbalance
• Bivariate exploration tools:

  Variable Pair             | Summary Statistics                                          | Visual Displays                                     | Observations
  Numeric × Numeric         | Correlations (only for linear relations)                    | Scatterplots                                        | Any noticeable relationships, e.g., monotonic, non-linear?
  Numeric × Categorical     | Mean/median of the numeric variable split by the categorical variable | Split boxplots, split histograms (stacked or dodged) | Any sizable differences in the means/medians among the factor levels?
  Categorical × Categorical | 2-way frequency table                                       | Bar charts (stacked, dodged, or filled)             | Any sizable differences in the class proportions among different factor levels?

• Common data issues for numeric variables:

Issue 1: Highly correlated predictors
  Problems:
  B Difficult to separate out the individual effects of different predictors on the target variable
  B For GLMs, coefficients become widely varying in sign and magnitude, and difficult to interpret.
  Possible Solutions:
  B Drop one of the strongly correlated predictors.
  B Use PCA to compress the correlated predictors into a few PCs.

Issue 2: Skewness (esp. right skewness due to outliers)
  Problems: Extreme values:
  B Exert a disproportionate effect on model fit
  B Distort visualizations (e.g., axes expanded inordinately to accommodate outliers)
  Possible Solutions: Apply transformations to reduce right skewness:
  B Log transformation (works only for strictly positive variables; remedy: add a small positive number to each value of the variable if there are zeros)
  B Square root transformation (works for non-negative variables)
  Options to handle outliers:
  B (Remove) If an outlier is unlikely to have a material effect on the model, then it is OK to remove it.
  B (Keep) If the outliers make up only an insignificant proportion of the data, then it is OK to leave them in the data.
  B (Modify) Modify the outliers to make them more reasonable, e.g., change negative values to zero.
  B (Use robust model forms) Fit models by minimizing the absolute error (instead of squared error) between the predicted and observed values.
    Reason: Absolute error places much less relative weight on large errors and reduces the impact of outliers on the fitted model.
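To make the transformation remedies concrete, here is a minimal R sketch (the data frame dat and its right-skewed variable income are hypothetical; the "+ 1" offset is an arbitrary example of the small positive constant mentioned above):

  # Log transformation (strictly positive variable)
  dat$log_income <- log(dat$income)

  # If the variable contains zeros, add a small positive constant first
  dat$log_income <- log(dat$income + 1)

  # Square root transformation (non-negative variable)
  dat$sqrt_income <- sqrt(dat$income)

  # Compare the distributions before and after transforming
  hist(dat$income)
  hist(dat$log_income)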
Issue 3: Should numeric variables be converted to a factor?
  Considerations: "Yes" if...
  B The variable has a small no. of distinct values, e.g., quarter of the year (1 to 4).
  B The variable values are merely numeric labels (no sense of numeric order, e.g., group no.).
  B The variable has a complex relationship with the target variable ⇒ factor conversion gives models (esp. GLMs) more flexibility to capture the relationship.
  "No" if...
  B The variable has a large no. of distinct values, e.g., hour of the day (would cause a high dimension and overfitting if converted into a factor).
  B The variable values have a sense of numeric order that may be useful for predicting the target variable.
  B The variable has a simple monotonic relationship with the target ⇒ its effect can be effectively captured by treating it as a numeric variable.
  B Future observations will have new variable values (e.g., calendar year).

• Common issue for categorical predictors: Sparse levels
  B Problem with high dimensionality/granularity: Sparse factor levels reduce the robustness of models and may cause overfitting.
  B A solution: Combine sparse levels with more populous levels where the target variable behaves similarly to form more representative and interpretable groups.
  B Trade-off: To strike a balance between:
    - Ensuring each level has a sufficient no. of observations
    - Preserving the differences in the behavior of the target variable among different factor levels for prediction
  B Tip: Knowledge of the meaning of the variables is often useful when making combinations, e.g., regrouping hour of day as "morning," "afternoon," and "evening." (Use common sense and check the data dictionary!)

• Interaction:
  B Definition: The relationship between a predictor and the target variable depends on the value/level of another predictor.
    (Tip: Good to include the definition in your response whenever an exam subtask tests interaction!)
  B Graphical displays to detect interactions:

    Predictor Combination     | Numeric Target                                        | Categorical Target
    Numeric × Categorical     | Scatterplot colored by the categorical predictor      | Boxplot for the numeric predictor split by the target and faceted by the categorical predictor
    Categorical × Categorical | Boxplot for the target split by one predictor and faceted by the other predictor | Bar chart for one predictor filled by the target and faceted by the other predictor
    Numeric × Numeric         | Bin one of the predictors (i.e., cut it into several ranges), or try a decision tree. | (same approach)
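As an illustration of the first row of the table, a minimal ggplot2 sketch for a numeric × categorical combination with a numeric target (the data frame dat and the variables age, smoker, and charges are hypothetical):

  library(ggplot2)

  # Scatterplot colored by the categorical predictor, with one fitted
  # line per level; clearly non-parallel lines suggest an interaction.
  ggplot(dat, aes(x = age, y = charges, color = smoker)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", se = FALSE)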
  B Interaction vs. correlation: Literally similar, but different:
    - Interaction: Concerns a 3-way relationship (1 target variable and 2 predictors)
    - Correlation: Concerns the relationship between two numeric predictors

1.4 Model Construction and Evaluation

Training/test set split

• How?
  Before fitting models, split the data into the training set (70–80%) and the test set (20–30%) by stratified sampling.
  Models are fitted to → the training set.
  Prediction performance is evaluated on → the test set.
  Test set observations must be truly unseen to the trained model.

• Why do the split?
  B Model performance on the training set tends to be overly optimistic and favor complex models.
  B The test set provides a more objective ground for assessing the performance of models on new, unseen data.
  B The split replicates the way the models will be used in practice.

• Why use stratified sampling: To produce representative training and test sets w.r.t. the target variable (not the predictors).

• Trade-off about the sizes of the two sets:
  Larger training set ⇒ training is more robust, but evaluation on the test set is less reliable.

• Alternative: Do the split based on a time variable, e.g., year, to evaluate how well a model extrapolates past time trends to future, unseen years.

Common performance metrics

• General

  B Regression vs. classification problems:
    - Regression: When the target is numeric (quantitative)
    - Classification: When the target is categorical (qualitative)
    (Note: The predictors can be numeric or categorical.)
  B What do metrics computed on the training and test sets measure?
    - Training: Goodness of fit to the training data
    - Test: Prediction performance on new, unseen data
  B Loss function: Most performance metrics use a loss function to capture the discrepancy between the actual and predicted values for each observation of the target variable. Examples:
    - Square loss (most common for numeric targets)
    - Absolute loss
    - Zero-one loss (mostly for categorical targets)

• Metrics for regression problems:
  B RMSE: sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )
  B Pearson χ² statistic: (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² / ŷ_i (often for count data)
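A minimal R sketch of the stratified split and the RMSE computation (assuming the caret package is available; the data frame dat and its numeric target column target are hypothetical):

  library(caret)

  set.seed(42)
  # Stratified split w.r.t. the target: createDataPartition() samples within
  # quantile groups of a numeric target (or within classes of a factor target).
  train_idx <- createDataPartition(dat$target, p = 0.7, list = FALSE)
  train <- dat[train_idx, ]
  test  <- dat[-train_idx, ]

  # Fit on the training set only; evaluate on the test set only.
  fit  <- lm(target ~ ., data = train)
  pred <- predict(fit, newdata = test)
  rmse <- sqrt(mean((test$target - pred)^2))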
• Metrics for (binary) classification problems:

  B Classification rule: Predicted probability for the "+" class > cutoff ⇔ predicted class = "+".

  B Confusion matrices:

                 Reference (= Actual)
    Prediction   −    +
    −            TN   FN
    +            FP   TP

    - Accuracy = (TN + TP)/n = proportion of correctly classified obs.
    - Classification error rate = (FN + FP)/n = proportion of misclassified obs.
    - Sensitivity = TP/(TP + FN) = proportion of +ve obs. correctly classified as +ve
    - Specificity = TN/(TN + FP) = proportion of −ve obs. correctly classified as −ve
    - Precision = TP/(TP + FP) = proportion of +ve predictions truly belonging to the +ve class

    Weighted average relation: accuracy = (n−/n) × specificity + (n+/n) × sensitivity.

  B How confusion matrix metrics vary with the cutoff:
    Cutoff ↑ ⇒ sensitivity ↓ and specificity ↑.
    May use a cost-benefit analysis to optimize the cutoff.

  B Area under the ROC curve (AUC):
    - Plot sensitivity against specificity for all cutoffs from 0 to 1 and compute the area under the curve.
    - Two special points on an ROC curve: (sensitivity, specificity) = (1, 0) if cutoff = 0, and (0, 1) if cutoff = 1.
    - Typically ranges between 0.5 (random classifier) and 1 (perfect classifier).

• Summary of performance metrics:

  Target Type | Metrics                                  | Criterion
  Numeric     | (R)MSE, Pearson chi-square               | Lower, better
  Categorical | Accuracy, sensitivity, specificity, AUC  | Higher, better
Cross-validation (CV)

• How it works:
  For a fixed +ve integer k (e.g., 10), randomly split the training data into k folds of approximately equal size.
  ⇓
  Train the model on all but one fold and measure performance on the left-out fold.
  ⇓
  Repeat with each fold left out in turn to get k performance values.
  ⇓
  Average to get the overall CV metric.

• Common uses of CV:
  B Model assessment: To evaluate a model's test set performance without using any test set.
  B Hyperparameter tuning: To tune hyperparameters (= parameters with values supplied in advance; not optimized by the model fitting algorithm) by picking the values that produce the best CV performance (lowest MSE or highest accuracy).

• Considerations when selecting the best model:
  B (Prediction performance) The model should perform well on test data w.r.t. certain performance metrics.
  B (Interpretability) The model should be reasonably interpretable, i.e., the predictions should be easily explained in terms of the predictors and lead to specific insights.
  B (Ease of implementation) The easier it is for a model to be implemented (computationally, financially, or logistically), the better the model.

Sidebar: Unbalanced data (for binary targets)

• Meaning: One class is much more dominant than the other.

• Problems with unbalanced data:
  B A classifier implicitly places more weight on the majority class and tries to fit those observations well, but the minority class may be the +ve class.
  B A high accuracy can be deceptive.

• Solution 1—Undersampling: Keep all observations from the minority class, but draw fewer observations ("undersample") from the majority class.
  B Drawback: Less data ⇒ training becomes less robust and the classifier becomes more prone to overfitting.

• Solution 2—Oversampling: Keep all observations from the majority class, but draw more observations ("oversample") from the minority class.
  B Drawback: More data ⇒ heavier computational burden
  B Caution: Should be done after the training/test set split.

• Effects of undersampling and oversampling on model results:
  The +ve class becomes more prevalent in the balanced data
  ⇓
  Predicted probabilities for the +ve class will increase
  ⇓
  For a fixed cutoff, sensitivity ↑ but specificity ↓

Controlling model complexity

• Overfitting:
  B Definition: The model is trying too hard to capture not only the signal, but also the noise specific to the training data.
  B Indications: Small training error, but large test error
  B Problem: An overfitted model fits the training data well, but does not generalize well to new, unseen data (poor predictions). Not a useful model!

• Quantitative framework—Bias-variance trade-off:

                 | Feature selection (complexity ↓) | Feature generation (complexity ↑)
  Bias           | ↑                                | ↓
  Variance       | ↓                                | ↑
  Training error | ↑                                | ↓
  Test error     | U-shaped: typically falls, then rises, as complexity increases

  B Bias-variance decomposition of the expected test MSE:
    E_{Tr,Y0}[ (Y0 − f̂(X0))² ] = [Bias_Tr(f̂(X0))]² + Var_Tr[f̂(X0)] + Var(ε0),
    where the first two terms form the reducible error and Var(ε0) is the irreducible error.
  Quantity                | Bias                                                                 | Variance
  Mathematical definition | Difference between the expected value of the prediction and the true value of the signal function | Amount of variability of the prediction
  Significance in PA      | Part of the test error caused by the model not being flexible enough to capture the signal (underfitting) | Part of the test error caused by the model being too complex (overfitting)

• Practical implications of the bias-variance trade-off:
  Need to set model complexity to a reasonable level
  ⇓
  Optimize the bias-variance trade-off ⇒ avoid underfitting & overfitting ⇒ improve prediction performance.

• Sidebar: Dimensionality vs. granularity
  B Granularity ↑ ⇒ model complexity tends to ↑
  B Two main differences between the two concepts:

    Concept        | Applicability                                | Comparability
    Dimensionality | Specific to categorical variables            | Two categorical variables can always be ordered by dimension.
    Granularity    | Applies to both numeric and categorical variables | Not always possible to order two variables by granularity.

1.5 Model Validation

• Aim: To check that the selected model has no obvious deficiencies and that the model assumptions are largely satisfied.

• Validation method based on the training set: For a "nice" GLM, the deviance residuals should:
  1. (Purely random) Have no systematic patterns.
  2. (Homoscedasticity) Have approximately constant variance upon standardization.
  3. (Normality) Be approximately normal (for most target distributions).
  Check the "Residuals vs Fitted" plot for 1 & 2, and the Q-Q plot for 3.

• Validation methods based on the test set:
  B Predicted vs. actual values of the target: The two sets of values should be close (can check this quantitatively or graphically).
  B Benchmark model: Show that the recommended model outperforms a benchmark model, if one exists (e.g., an intercept-only GLM or a purely random classifier), on the test set.
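A minimal R sketch of the training-set checks for a fitted GLM (the object fit is a hypothetical glm() fit):

  # Deviance residuals vs. fitted values: look for no systematic pattern
  # and roughly constant spread (checks 1 and 2).
  dev_res <- residuals(fit, type = "deviance")
  plot(fitted(fit), dev_res,
       xlab = "Fitted values", ylab = "Deviance residuals")
  abline(h = 0, lty = 2)

  # Q-Q plot of the deviance residuals: look for approximate normality (check 3).
  qqnorm(dev_res); qqline(dev_res)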
1.6 Recommendations for Next Steps

• (Adjust the business problem) Changes in external factors, e.g., market conditions and regulations, may cause initial assumptions to shift ⇒ need to modify the business problem to incorporate the new conditions.

• (Consult with subject matter experts) Seek validation of model results from external subject matter experts.

• (Gather additional data) Enlarge the training data with new observations and/or variables, and retrain the model to improve robustness.

• (Apply new types of models) Try new types of models when new technology or implementation possibilities are available.

• (Refine existing models) Try new combinations or transformations of predictors, alternative hyperparameter values, alternative accuracy measures, etc.

• (Field test proposed model) Implement the recommended model in the exact way it will be used to gain users' confidence.

2 Specific Types of Models

2.1 GLMs

• Assumptions: LMs vs. GLMs

  Assumption          | LMs                                                   | GLMs
  Independence        | Given the predictor values, the observations of the target variable are independent. | (Same for both LMs and GLMs.)
  Target distribution | Given the predictor values, the target variable follows a normal distribution. | Given the predictor values, the target distribution is a member of the linear exponential family.
  Mean                | The target mean directly equals the linear predictor: µ = η = β0 + β1X1 + · · · + βpXp. | A function ("link") of the target mean equals the linear predictor: g(µ) = η.
  Variance            | Constant, regardless of the predictor values          | Varies with µ and the predictor values

  (Note: The link function in a GLM is applied to the target mean µ; the target variable itself is not transformed.)

• Two key components:
  1. Target distribution: Choose one (in the linear exponential family) that aligns with the characteristics of the target.
  2. Link function: Some important considerations:
     B Ensure the predictions match the range of values of the target mean.
     B Ensure ease of interpretation, e.g., the log link.
     B (Minor) Canonical links make convergence more likely.
     (Note: The log link may or may not work when the target variable has zero values; see Exercise 4.1.4 (c) in the manual.)

• Common examples of target distributions and link functions:

  Variable Type                             | Common Distribution       | Common Link
  Real-valued with a bell-shaped dist.      | Normal (Gaussian)         | Identity
  Binary (0/1)                              | Binomial                  | Logit
  Count (≥ 0, integers)                     | Poisson                   | Log
  +ve, continuous with right skew           | Gamma, inverse Gaussian   | Log
  ≥ 0, continuous with a large mass at zero | Tweedie                   | Log

  (Note: For the gamma and inverse Gaussian distributions, the target variable has to be strictly positive. Values of zero are not allowed.)
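A minimal sketch of fitting two of the combinations in the table with glm() (the data frame train and the variable names are hypothetical):

  # Count target: Poisson distribution with log link
  freq_fit <- glm(claim_count ~ age + region,
                  family = poisson(link = "log"), data = train)

  # Positive, right-skewed target: gamma distribution with log link
  sev_fit <- glm(claim_size ~ age + region,
                 family = Gamma(link = "log"), data = train)

  summary(freq_fit)                                     # coefficients and p-values
  predict(freq_fit, newdata = test, type = "response")  # predictions on the mean scale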
Feature generation

• Methods for handling non-monotonic relations: GLMs, in their basic form, assume that numeric predictors have a monotonic relationship with the target variable.

  1. Polynomial regression: Add polynomial terms to the model equation:
       g(µ) = β0 + β1X + β2X² + · · · + βmX^m + · · · (polynomial terms).
     B Pros: Can take care of more complex relationships between the target variable and predictors. The more polynomial terms included, the more flexible the fit.
     B Cons:
       - Coefficients become harder to interpret (all polynomial terms move together).
       - Usually no clear choice of m; can be tuned by CV (EDA can also help).

  2. Binning: "Bin" the numeric variable and convert it into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable.
     B Pros: No definite order among the coefficients of the dummy variables corresponding to different bins ⇒ the target mean can vary highly irregularly over the bins.
     B Cons:
       - Usually no clear choice of the no. of bins and the associated boundaries.
       - Results in a loss of information (the exact values of the numeric predictor are gone).

  3. Adding piecewise linear functions: Add features of the form (X − c)+.
     B Pros: A simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals.
     B Cons: Usually no clear choice of the break points.

• Handling categorical predictors—Binarization:

  B How it works: (Done in R behind the scenes.)
    Categorical predictor
    ⇓
    A collection of dummy (binary) variables indicating one and only one level (= 1 for that level, = 0 otherwise)
    ⇓
    Dummy variables serve as predictors in the model equation

  B Baseline level: The level at which all dummy variables equal 0.
    - R's default: The alphanumerically first level
    - Good practice: Reset it to the most common level.

• Interactions:
  Need to "manually" include interaction terms of the product form XjXk
  ⇓
  The coefficient of Xj will then vary with the value of Xk.
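A minimal sketch of the three feature-generation methods and a manual interaction term inside a glm() formula (the data frame train and its variables are hypothetical, and the break points are arbitrary example values):

  fit <- glm(
    charges ~ poly(age, 2)                               # 1. polynomial terms
            + cut(bmi, breaks = c(0, 18.5, 25, 30, Inf)) # 2. binning a numeric variable
            + pmax(age - 60, 0)                          # 3. piecewise linear term (X - 60)+
            + smoker * region,                           # interaction (includes main effects)
    family = Gamma(link = "log"), data = train)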
Interpretation of coefficients

• General statements:
  B Coefficient estimates capture the effects (magnitude + direction) of features on the target mean.
  B p-values express the statistical significance of features; the smaller, the more significant.

• Specific statements based on the log link: Assume all else equal.
  B Numeric case: For a unit change in a numeric predictor with estimated coefficient β̂j,
      multiplicative change in target mean = e^{β̂j},   % change in target mean = e^{β̂j} − 1.
  B Categorical case: For a non-baseline level of a categorical predictor with estimated coefficient β̂j,
      µ̂ (at the non-baseline level) = e^{β̂j} × µ̂ (at the baseline level).

Other modeling techniques: Offsets vs. weights

  Item                       | Offsets                                                   | Weights
  Form of the target variable | Aggregate (e.g., total # claims in a group of similar policyholders) | Average (e.g., average # claims in a group of similar policyholders)
  Do they affect the target mean or variance? | The target mean is directly proportional to the exposure, e.g., with the log link, µi = Ei exp(· · ·). | The variance is inversely related to the exposure: Var(Yi) = (some terms)/Ei. Observations with a larger exposure will play a more important role in model fitting.

Stepwise selection

• Selection process: Sequentially add/drop features, one at a time, until there is no improvement in the selection criterion.

  Question                                          | Backward   | Forward
  1. Which model to start with?                     | Full model | Intercept-only model
  2. Add or drop variables?                         | Drop       | Add
  3. Which method tends to produce a simpler model? |            | Forward selection

• Selection criteria based on penalized likelihood:
  B Idea: Prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount.
  B Two common choices:

    Criterion | Definition              | Penalty per Parameter
    AIC       | −2l + 2(p + 1)          | 2
    BIC       | −2l + [ln(ntr)](p + 1)  | ln(ntr)

    (In R, −2l is treated as the deviance.)
  B AIC vs. BIC:
    - For both, the lower the value, the better.
    - BIC is more conservative and results in simpler models.
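A minimal sketch of backward selection with stepAIC() from the MASS package (full_fit is a hypothetical glm() fit of the full model; setting k = log(n) switches the penalty from AIC to BIC):

  library(MASS)

  # Backward selection using AIC (k = 2 is the default penalty per parameter)
  aic_fit <- stepAIC(full_fit, direction = "backward", k = 2)

  # Backward selection using BIC: penalty per parameter = ln(n_train)
  bic_fit <- stepAIC(full_fit, direction = "backward", k = log(nrow(train)))

  summary(aic_fit)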
• Manual binarization: Convert factor variables to dummy variables manually before running stepwise selection.
  B Pros: To be able to add (resp. drop) individual factor levels that are statistically significant (resp. insignificant) w.r.t. the baseline level.
  B Cons:
    - More steps in the stepAIC() procedure
    - Possibly non-intuitive results (e.g., only a few levels of a factor are retained)

Regularization

• Idea: Reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features.

• How it works: Optimize the training loglikelihood (equivalently, the training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., minimize
    deviance + regularization penalty.
  The formulation serves to strike a balance between goodness of fit and model complexity.

• Common forms of the penalty term:

  Method           | Penalty                        | Characteristic
  Lasso            | L = λ Σ_{j=1}^{p} |βj|         | Some coefficients may be zero
  Ridge regression | R = λ Σ_{j=1}^{p} βj²          | None reduced to zero
  Elastic net      | α L + (1 − α) R                | Some coefficients may be zero

• Two hyperparameters:

  1. λ: Regularization (a.k.a. shrinkage) parameter
     B Controls the amount of regularization:
       λ ↑ ⇒ more shrinkage ⇒ complexity ↓ ⇒ bias² ↑ and variance ↓.
     B Feature selection property: For elastic nets with α > 0 (lasso, in particular), some coefficient estimates become exactly zero when λ is large enough.
     B Typically tuned by CV: Choose the λ with the smallest CV error.

  2. α: Mixing parameter
     B Controls the mix between ridge (α = 0) and lasso (α = 1).
       (Note: Need to remember that α = 0 is ridge regression and α = 1 is lasso.)
     B Provided that λ is large enough, increasing α from 0 to 1 makes more coefficient estimates zero.
     B Cannot be tuned by cv.glmnet(); needs to be tuned manually.
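A minimal sketch with the glmnet package (the data frame train and target charges are hypothetical; alpha = 0.5 is an arbitrary elastic-net mix, and in practice several alpha values would be compared manually):

  library(glmnet)

  # glmnet() needs a numeric design matrix: binarize factors via model.matrix()
  X <- model.matrix(charges ~ ., data = train)[, -1]   # drop the intercept column
  y <- train$charges

  set.seed(42)
  cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)  # CV is over lambda only
  cv_fit$lambda.min                   # lambda with the smallest CV error
  coef(cv_fit, s = "lambda.min")      # some coefficients may be exactly zero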

2.2 Single Decision Trees

• Basics:
  B Idea: Divide the feature space into a set of non-overlapping regions containing relatively homogeneous observations (w.r.t. the target).
  B Deliverable: A set of classification rules based on the values/levels of the predictors, represented in the form of a "tree."
  B Predictions: Observations in the same terminal node share the same predicted mean (for numeric targets) or the same predicted class (for categorical targets).

• Recursive binary splitting:
  B Two terms: The algorithm is...
    - Greedy: At each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better tree in a future step. (Repeat until a stopping criterion is reached.)
    - Top-down: Start from the "top" of the tree, go "down," and sequentially partition the feature space in a series of splits.
  B Node impurity measures:

    Tree Type      | Name of Measure           | Formula
    Regression     | Residual sum of squares   | Σ_{i∈Rm} (yi − ŷ_{Rm})²
    Classification | Classification error rate | 1 − max_{1≤k≤K} p̂_{mk}
    Classification | Gini index                | Σ_{k=1}^{K} p̂_{mk}(1 − p̂_{mk})
    Classification | Entropy                   | −Σ_{k=1}^{K} p̂_{mk} log2(p̂_{mk})

    Properties:
    - The smaller the measure, the purer the observations in the node.
    - The Gini index and entropy are similar numerically.
    - The Gini index and entropy are more sensitive to node impurity than the classification error rate.
      Reason: They depend on all of the p̂_{mk}, not just the maximum class proportion.

• Tree parameters:

  Parameter            | Name in R | Meaning                                                  | Effect
  Minimum bucket size  | minbucket | Min. # obs. in a terminal node                           | Higher ⇒ tree less complex
  Complexity parameter | cp        | Min. improvement required for a split to be made (not 100% right...) | Higher ⇒ tree less complex
  Maximum depth        | maxdepth  | # edges from the root node to the furthest node          | Higher ⇒ tree more complex

  B Be sure to know how these parameters limit tree complexity!
  B cp can be tuned by CV within rpart(); minbucket and maxdepth have to be tuned by trial and error.

• Interpretation of trees: Things you can comment on:
  B No. of tree splits
  B Split sequence, e.g., start with X1, further split the larger bucket by X2, . . .
  B Which are the most important predictors (usually those in the early splits)?
  B Which terminal nodes have the most observations? Any sparse nodes?
  B Any prominent interactions?
  B (Classification trees) Combinations leading to the +ve event

• Cost-complexity pruning:
  B Rationale: To reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount ⇒ prevent overfitting and ease interpretation.
  B How it works:
    Step 1. Grow a large tree T0. (Note: Don't miss this step.)
    Step 2. Minimize the penalized objective function
        relative training error (model fit to training data) + cp × |T| (tree complexity)
      over all subtrees of T0, where the training error is the training RSS for regression and the # misclassifications for classification.
  B About the hyperparameter cp:
    - cp ↑ ⇒ tree less complex (smaller)
    - Typically tuned by CV: Set cp to the value that minimizes the CV error (xerror in cptable).
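A minimal sketch of growing and pruning a tree with rpart (the data frame train, the formula, and the control values are hypothetical examples):

  library(rpart)

  set.seed(42)
  # Step 1: grow a large tree by using a small cp
  big_tree <- rpart(charges ~ ., data = train, method = "anova",
                    control = rpart.control(cp = 0.001, minbucket = 5))

  # Step 2: pick the cp value that minimizes the cross-validation error (xerror)
  cp_table <- big_tree$cptable
  best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

  pruned_tree <- prune(big_tree, cp = best_cp)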
  B Alternative: One-standard-error (1-SE) rule
    - How: Select the smallest tree whose CV error is within 1 SE of the minimum CV error.
    - Rationale: Select a simpler and more interpretable tree with comparable prediction performance. (Occam's razor)

• Do variable transformations affect GLMs and trees?

  Transformation                         | GLMs | Trees
  Transformations on the target variable | Yes (The transformations alter the values of the predictors and target variable that go into the likelihood function.) | Yes (The transformations can alter the calculations of the node impurity measures, e.g., RSS, that define the tree splits.)
  Transformations on predictors          | Yes (Same reasoning as above) | Yes, unless the transformations are monotonic, e.g., log. (Monotonic transformations will not change the way tree splits are made.)

2.3 Ensemble Trees

Random forests

• Idea:
  B (Variance reduction) Combine the results of multiple trees fitted to different bootstrapped training samples in parallel ⇒ reduce the variance of the overall predictions.
  B (Randomization) Take a random sample of predictors as candidates for each split ⇒ reduce the correlation between base trees ⇒ further reduce the variance of the overall predictions.

• Combining base predictions to form the overall prediction:
  B Case 1 (Regression trees): By averaging:
      f̂_rf(x) = (1/B) Σ_{b=1}^{B} f̂^{*b}(x).
  B Case 2 (Classification trees): Two methods:
    - Probability: Average the base probabilities, then convert the average probability to an overall class based on the cutoff.
    - Class: Convert each base probability to a base class based on the cutoff, then take a "majority vote" to get the overall class.
    (The default is to take the majority vote.)

• Key parameters:
  B mtry: # features sampled as candidates at each split
    - Lower mtry ⇒ greater variance reduction
    - Common choice: √p (classification) or p/3 (regression)
    - Typically tuned by CV
  B ntree: # trees to be grown
    - Higher ntree ⇒ more variance reduction
    - Often overfitting does not arise even if ntree is set to a large no.
    - Set to a relatively small value to save run time
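A minimal sketch with the randomForest package (hypothetical data; mtry = 3 is an arbitrary example value that would normally be tuned):

  library(randomForest)

  set.seed(42)
  rf_fit <- randomForest(charges ~ ., data = train,
                         ntree = 500,       # number of trees grown in parallel
                         mtry = 3,          # predictors sampled as candidates at each split
                         importance = TRUE)

  predict(rf_fit, newdata = test)
  varImpPlot(rf_fit)    # variable importance plot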
Boosting

• Idea:
  B In each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree's predictions from the residuals to form the new residuals.
  B Each tree focuses on the observations the previous tree predicted poorly.
  B Overall prediction: f̂(x) = Σ_{b=1}^{B} λ f̂^{b}(x).

• Key parameters:
  B eta: Learning rate (or shrinkage) parameter
    - Effects of eta: Higher eta ⇒ the algorithm converges faster but is more prone to overfitting.
    - Rule of thumb: Set to a relatively small value.
  B nrounds: Max. # rounds in the tree construction process
    - Effects of nrounds: Higher nrounds ⇒ the algorithm learns better but is more prone to overfitting.
    - Rule of thumb: Set to a relatively large value.

• Random forests vs. boosted trees:

  Item                   | Random Forest   | Boosting
  Fitting process        | In parallel     | In series (sequential)
  Focus                  | Variance        | Bias
  Overfitting            | Less vulnerable | More vulnerable
  Hyperparameter tuning  | Less sensitive  | More sensitive
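A minimal boosted-tree sketch with the xgboost package (hypothetical data; the eta and nrounds values are illustrative, not recommendations):

  library(xgboost)

  X_train <- model.matrix(charges ~ ., data = train)[, -1]

  set.seed(42)
  xgb_fit <- xgboost(data = X_train, label = train$charges,
                     objective = "reg:squarederror",
                     eta = 0.05,        # relatively small learning rate
                     nrounds = 500,     # relatively large number of rounds
                     max_depth = 3,
                     verbose = 0)

  predict(xgb_fit, newdata = model.matrix(charges ~ ., data = test)[, -1])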
Two interpretational tools for ensemble trees

• Variable importance plots:
  B Definition of importance scores: The total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees:
      importance score = (1/B) × Σ (impurity reduction over all splits on that predictor).
  B Use: To identify important variables (those with a large score).
  B Limitation: Unclear how the important variables affect the target.

• Partial dependence plots:
  B Definition of partial dependence: The model prediction obtained after averaging over the values/levels of the variables not of interest:
      PD(x1) := (1/ntr) Σ_{i=1}^{ntr} f̂(x1, xi2, . . . , xip),
    where x1 is held fixed and the other predictors are averaged over the training observations.
  B Use: Plot PD(x1) against various values of x1 to show the marginal effect of X1 on the target variable.
  B Limitations:
    - Assumes the predictor of interest is independent of the other predictors.
    - Some predictions may be based on practically unreasonable combinations of predictor values.

2.4 Pros and Cons of Different Models

• Tips for recommending a model: Refer to the business problem (prediction vs. interpretation) and the characteristics of the data (e.g., any complex, non-monotonic relations?).

• GLMs:
  B Pros:
    1. (Target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable.
    2. (Interpretability) The model equation clearly shows how the target mean depends on the features; the coefficients are an interpretable measure of the directional effect of the features.
    3. (Implementation) Simple to implement.
  B Cons:
    1. (Complex relationships) Unable to capture non-monotonic (e.g., polynomial) or non-additive relationships (e.g., interaction), unless additional features are manually incorporated.
    2. (Interpretability) For some link functions (e.g., the inverse link), the coefficients may be difficult to interpret.

• Regularized GLMs:
  B Pros:
    1. (Categorical predictors) Via the use of model matrices, binarization of categorical variables is done automatically, and each factor level is treated as a separate feature to be removed.
    2. (Tuning) An elastic net can be tuned by CV using the same criterion (e.g., MSE, accuracy) ultimately used to judge the model against unseen test data.
    3. (Variable selection) For elastic nets with α > 0, variable selection can be done by making λ large enough.
  B Cons:
    1. (Categorical predictors) Possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.
    2. (Target distribution) Limited/restricted model forms allowed by glmnet(). (Weak point!)
    3. (Interpretability) Coefficient estimates are more difficult to interpret because the variables are standardized. (Weak point!)

• Single trees:
  B Pros:
    1. (Interpretability) If there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation.
    2. (Complex relationships) Trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually.
    3. (Categorical variables) Categorical predictors are automatically handled by separating their levels into two groups without the need for binarization.
    4. (Variable selection) Variables are automatically selected as part of the model building process. Variables that do not appear in the tree are filtered out, and the most important variables show up at the top of the tree.
  B Cons:
    1. (Overfitting) Strongly dependent on the training data (prone to overfitting) ⇒ predictions are unstable with a high variance ⇒ lower user confidence.
    2. (Numeric variables) Usually need to split based on a numeric predictor repeatedly to capture its effect effectively ⇒ the tree becomes large and difficult to interpret.
    3. (Categorical variables) Tend to favor categorical predictors with a large no. of levels.
       (Reason: Too many ways to split ⇒ easy to find a spurious split that looks good on the training data, but doesn't really exist in the signal.)

• Ensemble trees:
  B Pros: Much more robust and predictive than base trees by combining the results of multiple trees.
  B Cons:
    1. Opaque ("black box"), difficult to interpret.
       (Reason: Many base trees are used, but variable importance or partial dependence plots can help.)
    2. Computationally prohibitive to implement.
       (Reason: Huge computational burden with fitting multiple base trees.)

3 Unsupervised Learning

• Supervised vs. unsupervised learning:

  Item   | Supervised                                        | Unsupervised
  Target | Present                                           | Absent (or ignored if present)
  Goal   | To make inference or predictions for the target   | To extract relationships between variables

• Two reasons why unsupervised learning is often more challenging than supervised learning:
  1. (Objectives) The objectives in unsupervised learning are more fuzzy and subjective (no simple goal like prediction).
  2. (Hard to assess results) Methods for assessing model quality based on the target variable (e.g., CV) are generally not applicable.

3.1 Principal Components Analysis (PCA)

• Idea:
  B To transform a set of numeric variables into a smaller set of representative variables (PCs) ⇒ reduce the dimension of the data.
  B Especially useful for highly correlated data ⇒ a few PCs are enough to capture most of the information.

• Properties of PCs:
  B Linear combinations of the original features:
      zim = φ1m xi1 + φ2m xi2 + · · · + φpm xip,  with φ²1m + φ²2m + · · · + φ²pm = 1 (normalization).
  B Generated to capture as much information in the data (w.r.t. variance) as possible.
  B Mutually uncorrelated (different PCs capture different aspects of the data).
  B Relationship between PC scores and PC loadings: zm (scores) = X φm (loadings).
  B The amount of variance explained decreases with PC order, i.e., PC1 explains the most variance and subsequent PCs explain less and less.

• Two applications of PCA:
  1. EDA: Plot the scores of the 1st PC vs. the scores of the 2nd PC to gain a 2D view of the data in a scatterplot.
  2. Feature generation: Replace the original variables by PCs to reduce overfitting and improve prediction performance.

• Interpretation of PCs:
  B Signs and magnitudes of the PC loadings: What do the PCs represent, e.g., a proxy, average, or contrast of which variables? Which variables are more correlated with one another?
  B Sizes of the proportions of variance explained (PVEs):
      PVEm = (Variance explained by the mth PC) / (Total variance).
    Are the first few PVEs large enough (related to the strong correlations between variables)? If so, the PCs are useful.
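A minimal PCA sketch with prcomp() (the data frame of numeric variables X is hypothetical; scaling is generally recommended, as discussed below):

  pca <- prcomp(X, center = TRUE, scale. = TRUE)

  summary(pca)          # proportion of variance explained (PVE) by each PC
  pca$rotation[, 1:2]   # loadings of the first two PCs (signs/magnitudes)
  head(pca$x[, 1:2])    # scores of the first two PCs
  biplot(pca)           # scores and loading vectors of the first two PCs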
• Biplots: A visualization of PCA output displaying both the scores and loading vectors of the first two PCs.
  B PC loadings are read from the top and right axes ⇒ deduce the meaning of the PCs.
  B PC scores are read from the bottom and left axes ⇒ deduce the characteristics of the observations (based on the meaning of the PCs).

• Number of PCs (M) to use:
  B Trade-off: M ↑ ⇒ cumulative PVE ↑, but dimension ↑ and (if y exists) model complexity ↑.
  B How to choose M:
    - Scree plot: Eyeball the plot and locate the "elbow" (the point at which the PVEs of subsequent PCs have dropped off to a sufficiently low level).
    - CV: Treat M as a hyperparameter to be tuned if y exists.

• Drawbacks of PCA:
  B Loss of interpretability
    (Reason: PCs as composite variables can be hard to interpret.)
  B Not good for non-linearly related variables
    (Reason: PCs rely on linear transformations of variables.)
  B PCA does dimension reduction, but not feature selection.
    (Reason: PCs are constructed from all of the original features.)
  B The target variable is ignored. (Remember: PCA is unsupervised.)

3.2 Cluster Analysis

• Idea:
  B To partition the observations into a set of non-overlapping subgroups ("clusters") and uncover hidden patterns.
  B Observations within each cluster should be rather similar to one another.
  B Observations in different clusters should be rather different (well separated).

• Two feature generation methods based on clustering:
  B Cluster groups: As a new factor variable
  B Cluster means: As a new numeric variable

K-means clustering

• Idea: For a fixed K (a +ve integer), choose K clusters C1, . . . , CK to minimize the total within-cluster SS, Σ_{k=1}^{K} W(Ck).

• How the algorithm works:
  B Step 1 (Initialization): Given K, randomly select K points in the feature space as the initial cluster centers.
  B Step 2 (Iteration): Repeat the following steps until the cluster assignments no longer change:
    (a) Assign each observation to the cluster with the closest center.
    (b) Recalculate the K cluster centers (hence "K-means").

• Good practice: Set nstart to a large integer, e.g., ≥ 20.
  Reason: The algorithm produces a local optimum, which depends on the randomly selected initial cluster centers.
  ⇓
  Run the algorithm multiple times to improve the chance of finding a better local optimum.

• Selecting the value of K by the elbow method:
  B Make a plot of the proportion of variation explained (= between-cluster SS / total SS) against K.
  B Choose the "elbow," beyond which the increase in the proportion of variation explained is marginal.
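A minimal K-means sketch (hypothetical numeric data frame X; centers = 3 is an illustrative choice of K):

  X_scaled <- scale(X)          # put the variables on the same scale first

  set.seed(42)
  km <- kmeans(X_scaled, centers = 3, nstart = 25)   # nstart: multiple random starts

  km$cluster                    # cluster assignment for each observation
  km$betweenss / km$totss       # proportion of variation explained (input to an elbow plot)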

Hierarchical clustering

• Idea:
  B Algorithm:
    - Start with the individual observations, each treated as a separate cluster.
    - Successively fuse the closest pair of clusters, one at a time.
    - Stop when all clusters are fused into a single cluster containing all of the observations.
  B Output: A "hierarchy" of clusters, which can be visualized by a dendrogram.

• Linkage: Measures the dissimilarity between two clusters, at least one of which has ≥ 2 observations.

  Linkage            | The Inter-cluster Dissimilarity Is...
  Complete (default) | Maximal pairwise distance
  Single             | Minimal pairwise distance
  Average            | Average of all pairwise distances
  Centroid           | Distance between the two cluster centroids

  B Complete and average linkage are commonly used.
    (Reason: They tend to result in more balanced clusters.)
  B Single linkage tends to produce extended, trailing clusters with single observations fused one at a time.
  B Centroid linkage may lead to inversion (some later fusions occur at a lower height than an earlier fusion).

• Dendrogram: An upside-down tree showing the sequence of fusions and the inter-cluster dissimilarity ("Height") when each fusion occurs on the vertical axis.

  Some insights from a dendrogram:
  B (Similarities between clusters) Clusters joined towards the bottom of a dendrogram are rather similar to one another, while those fused towards the top are rather far apart.
  B (Considerations when choosing the no. of clusters) Try to cut the dendrogram at a height such that:
    - The resulting clusters have similar numbers of observations (balanced).
    - The difference between that height and the next fusion threshold is large enough ⇒ observations in different clusters have materially different characteristics.

• K-means vs. hierarchical clustering:

  Item                                | K-means                          | Hierarchical
  Is randomization needed?            | Yes (for initial cluster centers) | No
  Is the no. of clusters pre-specified? | Yes (K needs to be specified)   | No (specify the height at which to cut the dendrogram later)
  Are the clusters nested?            | No                               | Yes (a hierarchy of clusters)

Other issues

• Scaling of variables matters for both PCA and clustering.
  B Without scaling: Variables with a large order of magnitude will dominate the variance and distance calculations ⇒ they have a disproportionate effect on the PC loadings and cluster groups.
  B With scaling (generally recommended): All variables are on the same scale and share the same degree of importance.

• Alternative distance measures: Correlation-based distance
  B Motivation: Focuses on the shapes of the feature values rather than their exact magnitudes.
  B Limitation: Only makes sense when p ≥ 3, for otherwise the correlation between two observations always equals ±1.

• Clustering and the curse of dimensionality:
  B Visualization of the results of cluster analysis becomes problematic in high dimensions (p ≥ 3).
  B As the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate between observations that are close and those that are far apart.
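A minimal hierarchical clustering sketch, which also shows the scaling step discussed above (hypothetical numeric data frame X; k = 3 is an illustrative cut):

  X_scaled <- scale(X)

  hc <- hclust(dist(X_scaled), method = "complete")   # complete linkage (the default)
  plot(hc)                        # dendrogram: fusions and their heights

  clusters <- cutree(hc, k = 3)   # cut the dendrogram to obtain 3 clusters
  table(clusters)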

♦ ♣ ♥ ♠ THE END ♠ ♥ ♣ ♦
