1 General Model Building Steps

1.1 Problem Definition

• Three main categories of predictive modeling problems (more than one category can apply in a given business problem):

  Category      Focus                             Aim
  Descriptive   What happened in the past         To "describe" or interpret observed trends by identifying relationships between variables
  Predictive    What will happen in the future    To make accurate "predictions"
  Prescriptive  The impacts of different actions  To answer the "what if?" question

• Characteristics of problems suited to predictive analytics:
  ▶ (Questions) The issue can be addressed with a few well-defined questions.
  ▶ (Data) Good and useful data is available for answering the questions above.
  ▶ (Impact) The predictions will likely drive actions or increase understanding.
  ▶ (Better solution) Predictive analytics likely produces a solution better than any existing approach.
  ▶ (Update) We can continue to monitor and update the models when new data becomes available.

• How to produce a meaningful problem definition?
  ▶ General strategy: Get to the root cause of the business issue and make it specific enough to be solvable.
1.2 Data Collection and Validation

Data design
• Population: Important for the data source to be a good proxy of the true population of interest.
• Time frame: Choose the time period which best reflects the business environment of interest. In general, recent history is better than distant history.
• Sampling: The process of taking a subset of observations from the data source to generate the dataset. (A sketch follows below.)
  ▶ Random sampling: "Randomly" draw observations from the underlying population without replacement; each record is equally likely to be sampled.
  ▶ Stratified sampling: Divide the underlying population into a no. of non-overlapping "strata" (often w.r.t. the target) non-randomly, then randomly sample a set no. of observations from each stratum ⇒ a more representative sample.
    A special case, systematic sampling: Draw observations according to a set pattern, with no random mechanism controlling which observations are sampled.
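A minimal base-R sketch of the two sampling schemes above; the data frame dat, its target column, and the 20% fraction are hypothetical:

```r
set.seed(42)
frac <- 0.20  # illustrative sampling fraction within each stratum

# Stratified sampling: split the row indices by the levels of the target
# (the strata), then draw the same fraction from each stratum at random.
strata <- split(seq_len(nrow(dat)), dat$target)
idx <- unlist(lapply(strata, function(rows) {
  sample(rows, size = max(1, round(frac * length(rows))))
}))
strat_sample <- dat[idx, ]

# For comparison, plain random sampling (without replacement):
rand_sample <- dat[sample(nrow(dat), size = round(frac * nrow(dat))), ]
```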
Data quality
• Consistency: Records in the data should be inputted consistently, on the same basis and rules.
• Sufficient documentation: Examples of useful elements:
  ▶ A description of the dataset overall, including the data source
  ▶ A clear description of each variable (definition and format)
  ▶ Notes about any past updates or other irregularities of the dataset
  ▶ A statement of accountability for the correctness of the dataset
  ▶ A description of the governance processes used to manage the dataset

Other data issues
• Personally identifiable information (PII): Information that can be used to trace an individual's identity, e.g., name, SSN, address, photographs, and biometric records.
• Sensitive variables: Differential treatment based on sensitive variables may lead to unfair discrimination and raise equity concerns.
  Examples: Race, ethnicity, gender, age, income, disability status, or other prohibited classes.
• Proxy variables: Variables that are closely related to (hence serve as a "proxy" of) prohibited variables.
• Target leakage: (Important to watch out for!)
  ▶ Definition: When predictors in a model "leak" information about the target variable that would not be available when the model is used to make predictions.

1.3 Exploratory Data Analysis
• (Important!) Decide which type of model (GLMs or trees) is more suitable, e.g., for a complex, non-monotonic relation, trees may do better.
• Univariate exploration tools (a sketch follows below):

  Variable Type  Summary Statistics      Visual Displays  Observations
  Categorical    Class frequencies       Bar charts       Which levels are most common? Any sparse levels?
  Numeric        Mean, median, minimum,  Histograms       Any skewness? Any outliers?
                 maximum
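A quick base-R sketch of these univariate checks, assuming a hypothetical data frame dat with a factor region and a numeric income:

```r
# Categorical: class frequencies and a bar chart
table(dat$region)            # which levels are most common? any sparse levels?
barplot(table(dat$region))

# Numeric: summary statistics and a histogram
summary(dat$income)          # minimum, quartiles, median, mean, maximum
hist(dat$income)             # any skewness? any outliers?
```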
• Options to handle skewness (for skewed numeric variables):
  ▶ Log transformation (requires positive values; remedy: add a small positive number to each value of the variable if there are zeros)
  ▶ Square root transformation (works for non-negative variables)

• Options to handle outliers:
  ▶ (Remove) If an outlier is unlikely to have a material effect on the model, then OK to remove it.
  ▶ (Keep) If the outliers make up only an insignificant proportion of the data, then OK to leave them in the data.
  ▶ (Modify) Modify the outliers to make them more reasonable, e.g., change negative values to zero.
  ▶ (Use robust model forms) Fit models by minimizing the absolute error (instead of squared error) between the predicted and observed values.
    Reason: Absolute error places much less relative weight on the large errors and reduces the impact of outliers on the fitted model.

• Bivariate exploration tools (two predictors):
  ▶ Numeric × Categorical: Split the numeric variable by the categorical variable; use split histograms (stacked or dodged). Any sizable differences among the factor levels?
  ▶ Categorical × Categorical: 2-way frequency table; bar charts (stacked, dodged, or filled). Any sizable differences in the class proportions among different factor levels?

• Common data issues for numeric variables:

  Issue 1: Highly correlated predictors
  ▶ Problems:
    – Difficult to separate out the individual effects of different predictors on the target variable.
    – For GLMs, the coefficients become widely varying in sign and magnitude, and difficult to interpret.
  ▶ Possible solutions (a correlation-check sketch follows below):
    – Drop one of the strongly correlated predictors.
    – Use PCA to compress the correlated predictors into a few PCs.
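A base-R sketch for flagging strongly correlated pairs before dropping a predictor or applying PCA; X is a hypothetical numeric predictor data frame, and the 0.8 threshold is illustrative:

```r
cm <- cor(X)                                  # pairwise correlations
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
high <- which(abs(cm) > 0.8, arr.ind = TRUE)  # strongly correlated pairs
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           cor  = cm[high])
```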
  Issue 2: Numeric variable or factor? Reasons to treat a variable as numeric:
  ▶ Variable values have a sense of numeric order that may be useful for predicting the target variable.
  ▶ Variable has a simple monotonic relationship with the target ⇒ its effect can be effectively captured by treating it as a numeric variable.
  ▶ Future observations will have new variable values (e.g., calendar year).

• Common issue for categorical predictors: Sparse levels
  ▶ Problem with high dimensionality/granularity: Sparse factor levels reduce the robustness of models and may cause overfitting.
  ▶ A solution: Combine sparse levels with more populous levels where the target variable behaves similarly, to form more representative and interpretable groups.

• Bivariate exploration tools (predictor vs. target; a plotting sketch follows below):

  Predictors                 Numeric Target                       Categorical Target
  Numeric × Categorical      Scatterplot colored by the           Boxplot for the numeric predictor split
                             categorical predictor                by the target and faceted by the
                                                                  categorical predictor
  Categorical × Categorical  Boxplot for the target split by one  Bar chart for one predictor filled by the
                             predictor and faceted by the other   target and faceted by the other predictor
                             predictor
  Numeric × Numeric          Bin one of the predictors (i.e., cut it into several ranges), or try a decision tree.

• Interaction vs. correlation: Literally similar, but different concepts.
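Hypothetical ggplot2 versions of two cells of the predictor-vs-target table above (numeric target charges, numeric predictor age, factor predictors smoker and region):

```r
library(ggplot2)

# Numeric x categorical predictors, numeric target:
# scatterplot colored by the categorical predictor
ggplot(dat, aes(x = age, y = charges, color = smoker)) +
  geom_point()

# Categorical x categorical predictors, numeric target:
# boxplot for the target split by one predictor, faceted by the other
ggplot(dat, aes(x = smoker, y = charges)) +
  geom_boxplot() +
  facet_wrap(~ region)
```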
1.4 Model Construction and Evaluation

• Training/test split: Before fitting models, split the data into the training set (70-80%) and the test set (20-30%) by stratified sampling.
  ▶ Models are fitted to → the training set.
  ▶ Prediction performance is evaluated on → the test set.
  Test set observations must be truly unseen to the trained model.

• Why do the split?
  ▶ Model performance on the training set tends to be overly optimistic and favor complex models.
  ▶ The test set provides a more objective ground for assessing the performance of models on new, unseen data.
  ▶ The split replicates the way the models will be used in practice.

• Trade-off about the sizes of the two sets (training vs. test data):
  ▶ Larger training set ⇒ model training is more robust.
  ▶ Larger test set ⇒ more reliable assessment of prediction performance.

Common performance metrics
• Training metrics measure goodness of fit to the training data; test metrics measure prediction performance on new, unseen data.
• Loss function: Most performance metrics use a loss function to capture the discrepancy between the actual and predicted values for each observation of the target variable. Examples:
  ▶ Square loss (most common for numeric targets)
  ▶ Absolute loss
  ▶ Zero-one loss (mostly for categorical targets)
• Metrics for regression problems (a computation sketch follows below):
  ▶ RMSE = √( (1/n) Σ_{i=1}^{n} (yi − ŷi)² )
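A sketch of the stratified split and the test RMSE; caret's createDataPartition() samples within the levels (or quantile groups) of the supplied target, and dat, target, and fit are hypothetical:

```r
library(caret)
set.seed(2024)

# 70/30 split, stratified on the target
in_train <- createDataPartition(dat$target, p = 0.7, list = FALSE)
train <- dat[in_train, ]
test  <- dat[-in_train, ]

# ... fit a model on train, producing `fit` ...

# Test RMSE
pred <- predict(fit, newdata = test)
sqrt(mean((test$target - pred)^2))
```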
• Metrics for (binary) classification problems (a pROC sketch follows below):
  ▶ Weighted average relation for the accuracy:
    accuracy = (n−/n) × specificity + (n+/n) × sensitivity.
  ▶ How confusion matrix metrics vary with the cutoff:
    Cutoff ↑ ⇒ sensitivity ↓ and specificity ↑.
  ▶ ROC curve and AUC: Plot sensitivity against specificity for all cutoffs from 0 to 1 and compute the area under the curve.
    Two special points on an ROC curve:
    (sensitivity, specificity) = (1, 0) if cutoff = 0, and (0, 1) if cutoff = 1.

• Cross-validation (CV):
  ▶ k-fold CV: Train the model on all but one fold and measure performance on the left-out fold
    ⇓
    Repeat with each fold left out in turn to get k performance values
    ⇓
    Average to get the overall CV metric, without using any test set.
  ▶ Hyperparameter tuning: Tune hyperparameters (= parameters with values supplied in advance, not optimized by the model fitting algorithm) by picking the values that produce the best CV performance (lowest MSE or highest accuracy).
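A sketch of the cutoff-based metrics and the AUC with the pROC package; test$target (0/1) and the predicted probabilities prob are hypothetical:

```r
# Confusion-matrix metrics at one cutoff
cutoff <- 0.5
pred_class <- ifelse(prob > cutoff, 1, 0)
tab <- table(actual = test$target, predicted = pred_class)
sens <- tab["1", "1"] / sum(tab["1", ])   # sensitivity (true +ve rate)
spec <- tab["0", "0"] / sum(tab["0", ])   # specificity (true -ve rate)
acc  <- mean(pred_class == test$target)   # weighted average of sens and spec

# ROC curve and AUC across all cutoffs
library(pROC)
roc_obj <- roc(test$target, prob)
plot(roc_obj)
auc(roc_obj)
```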
• Considerations when selecting the best model: Weigh prediction performance against interpretability, in light of the business problem (see Section 2.4).

Sidebar: Unbalanced data (for binary targets)
• Meaning: One class is much more dominant than the other.
• Problems with unbalanced data:
  ▶ A classifier implicitly places more weight on the majority class and tries to fit those observations well, but the minority class may be the +ve class of interest. Not a useful model!
• Solution 1 (undersampling): Keep all observations from the minority class, but draw fewer observations ("undersample") from the majority class.
  ▶ Drawback: Less data ⇒ training becomes less robust and the classifier becomes more prone to overfitting.
• Solution 2 (oversampling): Keep all observations from the majority class, but draw more observations ("oversample"), with replacement, from the minority class.
• Effects of undersampling and oversampling on the model: Both make the classifier place more weight on the minority (+ve) class, typically raising sensitivity at the cost of specificity (see the sketch below).

• Overfitting:
  ▶ Definition: The model is trying too hard to capture not only the signal, but also the noise specific to the training data.
  ▶ Indications: Small training error, but large test error.
  ▶ Problem: An overfitted model fits the training data well, but does not generalize well to new, unseen data (poor predictions).
  ▶ Effect of model complexity:

    Complexity ↓ (underfitting)   Complexity ↑ (overfitting)
    Bias ↑                        Bias ↓
    Variance ↓                    Variance ↑
    Training error ↑              Training error ↓
    Test error is U-shaped in complexity (falls, then rises).

  ▶ Bias-variance decomposition of the expected test MSE:
    E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε),
    i.e., variance + bias² + irreducible error.
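A base-R sketch of both rebalancing schemes (hypothetical training frame train with 0/1 target y, 1 being the minority +ve class); either scheme should be applied to the training set only, after the split:

```r
set.seed(1)
pos <- which(train$y == 1)   # minority (+ve) class
neg <- which(train$y == 0)   # majority class

# Undersampling: keep all minority rows, draw fewer majority rows
under <- train[c(pos, sample(neg, size = length(pos))), ]

# Oversampling: keep all majority rows, redraw minority rows with replacement
over <- train[c(neg, sample(pos, size = length(neg), replace = TRUE)), ]
```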
• Practical implications of the bias-variance trade-off:
  Need to set model complexity to a reasonable level
  ⇓
  Optimize the bias-variance trade-off (avoid underfitting & overfitting) ⇒ improve prediction performance

• Sidebar: Dimensionality vs. granularity
  ▶ Granularity ↑ ⇒ model complexity tends to ↑
  ▶ Two main differences between the two concepts:

    Concept         Applicability                       Comparability
    Dimensionality  Specific to categorical variables   Two categorical variables can always be
                                                        compared by dimension (no. of levels).
    Granularity     Applies to both numeric and         Two variables cannot always be compared
                    categorical variables               by granularity.

1.5 Model Validation
• Validation based on the training set (for GLMs): Check the "Residuals vs Fitted" plot (residuals should center on zero with roughly constant spread) and the Q-Q plot (standardized residuals should be close to normal).
• Validation methods based on the test set:
  ▶ Predicted vs. actual values of the target: The two sets of values should be close (can check this quantitatively or graphically).
  ▶ Benchmark model: Show that the recommended model outperforms a benchmark model, if one exists (e.g., intercept-only GLM, purely random classifier), on the test set.

1.6 Recommendations for Next Steps
• (Adjust the business problem) Changes in external factors, e.g., market conditions and regulations, may cause the initial assumptions to no longer hold.
• (Apply new types of models) Try new types of models when new data or modeling techniques become available.

2.1 GLMs
• Two key components:
  1 Target distribution: a member of the linear exponential family.
  2 Link function: the function g of the target mean that is set equal to the linear predictor.
• Assumptions: LMs vs. GLMs
  ▶ Independence: Given the predictor values, the observations of the target variable are independent. (Same for both LMs and GLMs.)
  ▶ Target distribution:
    – LMs: Given the predictor values, the target variable follows a normal distribution.
    – GLMs: Given the predictor values, the target distribution is a member of the linear exponential family.
  ▶ Mean:
    – LMs: The target mean directly equals the linear predictor:
      μ = η = β0 + β1X1 + ··· + βpXp.
    – GLMs: A function ("link") of the target mean equals the linear predictor:
      g(μ) = η.

• Common e.g. of target distributions and link functions (glm() sketches follow below):

  Variable Type                              Common Dist.             Common Link
  Real-valued with a bell-shaped dist.       Normal (Gaussian)        Identity
  Binary (0/1)                               Binomial                 Logit
  Count (≥ 0, integers)                      Poisson                  Log
  +ve, continuous with right skew            Gamma, inverse Gaussian  Log
  ≥ 0, continuous with a large mass at zero  Tweedie                  Log

  (Note: For gamma and inverse Gaussian, the target variable has to be strictly positive; values of zero are not allowed. Watch out if the variable has zero values; see Exercise 4.1.4 (c) in the manual.)
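Hypothetical glm() calls matching three rows of the table:

```r
# Binary target: binomial with logit link
glm(lapse ~ age + region, family = binomial(link = "logit"), data = dat)

# Count target: Poisson with log link
glm(n_claims ~ age + region, family = poisson(link = "log"), data = dat)

# Positive, right-skewed target: gamma with log link
glm(severity ~ age + region, family = Gamma(link = "log"), data = dat)
```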
• Binning: Replace a numeric predictor by dummy variables representing intervals over the range of the original variable.
  ▶ Pros: No definite order among the coefficients of the dummy variables corresponding to different bins ⇒ the target mean can vary highly irregularly over the bins.
  ▶ Cons:
    – Usually no clear choice of the no. of bins and the associated boundaries.
    – Results in a loss of information (exact values of the numeric predictor gone).

• Piecewise linear functions:
  ▶ Pros: A simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals.
  ▶ Cons: Usually no clear choice of the break points.

• Interactions:
  Need to "manually" include interaction terms of the product form XjXk
  ⇓
  Coefficient of Xj will vary with the value of Xk.

Interpretation of coefficients
• General statements:
  ▶ Coefficients express the effect (size and direction) of features on the target mean.
  ▶ p-values express the statistical significance of features; the smaller, the more significant.
• With a log link, the coefficient β̂j of a dummy variable acts multiplicatively:
  μ̂ (at the non-baseline level) = e^{β̂j} × μ̂ (at the baseline level).
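A sketch of this multiplicative reading under a log link (hypothetical Poisson fit):

```r
fit <- glm(n_claims ~ region + age, family = poisson(link = "log"), data = dat)

summary(fit)     # coefficients and their p-values
exp(coef(fit))   # e^beta_j: multiplicative change in the target mean
                 # (for a dummy variable: non-baseline mean / baseline mean)
```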
Other modeling techniques: Offsets vs. weights
• Form of the target variable:
  ▶ Offsets: Aggregate (e.g., total # claims in a group of similar policyholders).
  ▶ Weights: Average (e.g., average # claims in a group of similar policyholders).
• Do they affect the target mean or the variance?
  ▶ Offsets: The target mean is directly proportional to the exposure, e.g., with a log link, μi = Ei exp(···).
  ▶ Weights: The variance is inversely related to the exposure: Var(Yi) = (some terms)/Ei. Observations with a larger exposure have a smaller variance and carry more weight in the fitting.

Feature selection
• Selection criteria based on penalized likelihood:
  ▶ Idea: Prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount.
  ▶ Two common choices (a step() sketch follows below):

    Criterion  Definition               Penalty per Parameter
    AIC        −2l + 2(p + 1)           2
    BIC        −2l + [ln(n_tr)](p + 1)  ln(n_tr)

    (In R, −2l is treated as the deviance.)
  ▶ AIC vs. BIC:
    – For both, the lower the value, the better.
    – BIC is more conservative and results in simpler models.
• Manual binarization: Convert factor variables to dummy variables manually before running stepwise selection.
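A sketch using base R's step(); the penalty argument k is 2 (AIC) by default, and setting k = ln(n_tr) switches it to BIC (model and data hypothetical):

```r
full <- glm(target ~ ., data = train, family = poisson)

# Backward selection by AIC (k = 2, the default)
step(full, direction = "backward")

# Backward selection by BIC (k = ln(n_tr))
step(full, direction = "backward", k = log(nrow(train)))
```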
Regularization
• Idea: Reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features.
• How it works: Optimize the training loglikelihood (equivalently, the training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., minimize the training deviance plus:

  Method            Penalty Term              Feature Selection?
  Lasso             L = λ Σ_{j=1}^{p} |βj|    Some coef. may be zero
  Ridge regression  R = λ Σ_{j=1}^{p} βj²     None reduced to zero
  Elastic net       αL + (1 − α)R             Some coef. may be zero

• Two hyperparameters:
  1 λ: Regularization (a.k.a. shrinkage) parameter.
    ▶ Controls the amount of regularization:
      λ ↑ ⇒ more shrinkage ⇒ complexity ↓ ⇒ bias² ↑ and variance ↓.
    ▶ Typically tuned by CV, e.g., via cv.glmnet().
  2 α: Mixing parameter (α = 0 gives ridge, α = 1 gives lasso).
    ▶ Feature selection property: For elastic nets with α > 0, some coefficients may be reduced to exactly zero. Provided that λ is large enough, increasing α from 0 to 1 makes more coefficient estimates zero.
    ▶ Cannot be tuned by cv.glmnet(); need to tune manually. (A sketch follows below.)
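A glmnet sketch: cv.glmnet() tunes λ by CV for a fixed α, while α itself is varied manually (train and target hypothetical):

```r
library(glmnet)

# glmnet needs a numeric design matrix; factors are binarized here
X <- model.matrix(target ~ ., data = train)[, -1]
y <- train$target

# Elastic net with alpha = 0.5; lambda tuned by cross-validation
cvfit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)
coef(cvfit, s = "lambda.min")   # some coefficients may be exactly zero
```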
2.2 Single Decision Trees
• Idea: Divide the feature space into distinct regions according to the values/levels of the predictors; the resulting model is represented in the form of a "tree".
  ▶ Predictions: Observations in the same terminal node share the same predicted mean (for numeric targets) or the same predicted class (for categorical targets).
• Recursive binary splitting: Two terms describe the algorithm:
  ▶ Top-down: Start with the full feature space at the top of the tree and split it successively going down.
  ▶ Greedy: At each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better tree in a future step. (Repeat until a stopping criterion is met.)
• Node impurity measures (for classification trees): Classification error rate, Gini index, and entropy.
  ▶ Gini index and entropy are more sensitive to node impurity than the classification error rate.
    Reason: They depend on all the p̂mk, not just the maximum class proportion.
• Tree parameters:
  ▶ cp: Complexity parameter = the minimum improvement in fit required for a split to be made; higher cp ⇒ less complex tree.
• Cost-complexity pruning:
  ▶ Rationale: Reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount ⇒ prevent overfitting and ease interpretation.
  ▶ How: Minimize
    (model fit to training data) + cp × |T| (tree complexity)
    over all subtrees of T0, where the model fit term is the training RSS for regression, or the training classification error rate (not 100% right...) for classification.
  ▶ Alternative for choosing cp: the one-standard-error (1-SE) rule (see the rpart sketch below).

• Interpretation of trees: Things you can comment on:
  ▶ (Classification trees) Combinations of predictor values/levels leading to the +ve event.

• Effect of variable transformations: GLMs vs. trees:
  ▶ GLMs: Transformations matter; they alter the values of the predictors and target variable that go into the likelihood function.
  ▶ Trees, target variable: Transformations can alter the calculations of the node impurity measures, e.g., RSS, that define the tree splits.
  ▶ Trees, predictors: Monotonic transformations, e.g., log, will not change the way tree splits are made.
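An rpart sketch of growing a large tree and then pruning it by cost complexity (data hypothetical; cp values illustrative):

```r
library(rpart)

# Grow a deliberately large tree with a small cp
fit <- rpart(target ~ ., data = train, method = "anova",
             control = rpart.control(cp = 0.001))

printcp(fit)                     # CV error at each candidate cp
pruned <- prune(fit, cp = 0.01)  # prune back at a larger cp (or use the 1-SE rule)
```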
2.3 Ensemble Trees

Random forests
• Idea:
  ▶ (Variance reduction) Combine the results of multiple trees fitted to different bootstrapped training samples in parallel ⇒ reduce the variance of the overall predictions.
  ▶ (Randomization) Take a random sample of predictors as candidates for each split ⇒ reduce the correlation between base trees ⇒ further reduce the variance of the overall predictions.
• Combining base predictions to form the overall prediction (classification): Take the majority vote across the base trees (the default), or compute the average probability → overall class.
• Key parameters:
  ▶ mtry: # features considered as split candidates at each split.
    – Common choice: √p (classification) or p/3 (regression).
    – Typically tuned by CV.
  ▶ ntree: # trees to be grown.
    – Higher ntree ⇒ more variance reduction.
    – Often overfitting does not arise even if set to a large no.; set it to a relatively small value only to save run time.

Boosting
• Idea:
  ▶ In each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree's predictions from the residuals to form the new residuals.
  ▶ Each tree focuses on the observations the previous tree predicted poorly.
  ▶ Overall prediction: f̂(x) = Σ_{b=1}^{B} λ f̂_b(x).
• Key parameter:
  ▶ nrounds: Max. # rounds in the tree construction process.
    – Effect: Higher nrounds ⇒ the algorithm learns better, but is more prone to overfitting.
    – Rule of thumb: Set to a relatively large value. (A CV sketch follows below.)
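A sketch of choosing nrounds by CV with xgboost, assuming early stopping behaves as in current versions of the package (matrix X and vector y hypothetical):

```r
library(xgboost)

cv <- xgb.cv(data = xgb.DMatrix(X, label = y),
             objective = "reg:squarederror",
             nrounds = 1000,              # set relatively large...
             eta = 0.1, max_depth = 6,
             nfold = 5,
             early_stopping_rounds = 20)  # ...and stop once CV error stalls
cv$best_iteration
```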
• Random forests vs. boosted trees:

  Item                   Random Forest    Boosting
  Fitting process        In parallel      In series (sequential)
  Focus                  Variance         Bias
  Overfitting            Less vulnerable  More vulnerable
  Hyperparameter tuning  Less sensitive   More sensitive

Two interpretational tools for ensemble trees
• Variable importance plots:
  ▶ Definition of importance scores: The total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees.
• Partial dependence (PD) plots:
  ▶ Use: Plot PD(x1) against various x1 to show the marginal effect of X1 on the target variable.
  ▶ Limitations:
    – Assumes the predictor of interest is independent of the other predictors.
    – Some predictions may be based on practically unreasonable combinations of predictor values.
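A randomForest sketch producing both tools (data hypothetical; age is an assumed predictor name):

```r
library(randomForest)

rf <- randomForest(target ~ ., data = train,
                   ntree = 500,          # many trees for variance reduction
                   importance = TRUE)

varImpPlot(rf)                           # variable importance plot
partialPlot(rf, pred.data = train,
            x.var = "age")               # partial dependence of the target on age
```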
2.4 Pros and Cons of Different Models
• Tips for recommending a model: Refer to the business problem (prediction vs. interpretation) and the characteristics of the data (e.g., any complex, non-monotonic relations?).

• GLMs:
  ▶ Pros:
    1 (Target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable.
    2 (Interpretability) The model equation clearly shows how the target mean depends on the features; coefficients = interpretable measure of the directional effect of features.
    3 (Categorical variables) Categorical predictors are incorporated readily through binarization (dummy variables).
  ▶ Cons: (Interpretability) For some link functions (e.g., the inverse link), the coefficients are difficult to interpret.

• Regularized GLMs (glmnet()):
  ▶ Cons:
    1 (Categorical predictors) Possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.
    2 (Target distribution) Limited/restricted model forms allowed by glmnet(). (Weak point!)
    3 (Interpretability) Coefficient estimates are more difficult to interpret ∵ the variables are standardized. (Weak point!)

• Single trees:
  ▶ Pros:
    1 (Interpretability) If there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation.
    2 (Complex relationships) Trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually.
  ▶ Cons: (Categorical variables) Trees tend to favor categorical predictors with a large no. of levels.
    (Reason: Too many ways to split ⇒ easy to find a spurious split that looks good on the training data, but doesn't really exist in the signal.)

• Ensemble trees:
  ▶ Pros: Much more robust and predictive than base trees by combining the results of multiple trees.
  ▶ Cons:
    1 Opaque ("black box"), difficult to interpret.
      (Reason: Many base trees are used; variable importance and partial dependence plots can help.)
    2 Computationally prohibitive to implement.
      (Reason: Huge computational burden with fitting multiple base trees.)
3.1 Principal Components Analysis (PCA)
• Idea:
  ▶ Transform a set of numeric variables into a smaller set of representative variables (PCs) ⇒ reduce the dimension of the data.
  ▶ Especially useful for highly correlated data ⇒ a few PCs are enough to capture most of the information.
  ▶ PCs are linear combinations of the original features:
    z_im = φ1m xi1 + φ2m xi2 + ··· + φpm xip,
    with φ1m² + φ2m² + ··· + φpm² = 1 (normalization).
• Feature generation: Replace the original variables by PCs to reduce overfitting and improve prediction performance.
• Interpretation of PCs:
  ▶ Signs and magnitudes of PC loadings: What do the PCs represent, e.g., a proxy, average, or contrast of which variables? Which variables are more correlated with one another?
  ▶ Sizes of the proportions of variance explained (PVEs): Are the first few PVEs large enough (related to the strength of the correlations between variables)? If so, the PCs are useful.
• Biplots: Visualization of PCA output by displaying both the scores and the loading vectors of the first two PCs.
  ▶ PC loadings on the top and right axes ⇒ deduce the meaning of the PCs.
  ▶ PC scores on the bottom and left axes ⇒ deduce the characteristics of the observations (based on the meaning of the PCs).
• Number of PCs (M) to use:
  ▶ Trade-off: M ↑ ⇒ cumulative PVE ↑, but less dimension reduction.
• Drawbacks of PCA: Applies only to numeric variables, and the PCs, being blends of the original features, can be harder to interpret. (A prcomp sketch follows below.)
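A prcomp sketch (variables centered and scaled first, as is typical; X a hypothetical numeric data frame):

```r
pca <- prcomp(X, center = TRUE, scale. = TRUE)

summary(pca)          # PVE and cumulative PVE of each PC
pca$rotation[, 1:2]   # loadings: signs/magnitudes suggest what the PCs mean
head(pca$x[, 1:2])    # scores: characteristics of individual observations
biplot(pca)           # scores + loading vectors of the first two PCs
```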
3.2 Cluster Analysis
• Idea: Partition the observations into a set of subgroups ("clusters") and uncover hidden patterns.
  ▶ Observations within each cluster should be rather similar to one another.
  ▶ Observations in different clusters should be rather different (well separated).
• Two feature generation methods based on clustering:
  ▶ Cluster groups: as a new factor variable.
  ▶ Cluster means: as a new numeric variable.
• K-means clustering: choosing K (the no. of clusters):
  ▶ Elbow method: Make a plot of the proportion of variation explained (= between-cluster SS / total SS) against K; choose the K at the "elbow", beyond which the proportion rises only marginally, and check that the resulting clusters have materially different characteristics.
    [Figure: proportion of variation explained rising toward 1 as K grows, with the elbow marking the suggested K.]

• Hierarchical clustering: linkages:

  Linkage             Distance Between Two Clusters
  Complete (default)  Maximal pairwise distance
  Single              Minimal pairwise distance
  Average             Average of all pairwise distances
  Centroid            Distance between the two cluster centroids

  ▶ Complete and average linkage are commonly used.
    (Reason: They tend to result in more balanced clusters.)
  ▶ Single linkage tends to produce extended, trailing clusters. (Sketches of both algorithms follow below.)
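Base-R sketches of both algorithms (X a hypothetical scaled numeric matrix). The elbow plot uses the total within-cluster SS, which falls exactly as the between-cluster proportion rises:

```r
# K-means: elbow plot over K = 1..10
set.seed(7)
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# Hierarchical clustering with complete linkage
hc <- hclust(dist(X), method = "complete")
plot(hc)                        # dendrogram
clusters <- cutree(hc, k = 3)   # cut to obtain 3 clusters (illustrative)
```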
• Scaling: Standardize the features in advance so that they are on the same scale and share the same degree of importance.

• K-means vs. hierarchical clustering:

  Item                                   K-means                            Hierarchical
  Is randomization needed?               Yes (for initial cluster centers)  No
  Is the no. of clusters pre-specified?  Yes (K needs to be specified)      No (specify the height at which to
                                                                            cut the dendrogram later)
  Are the clusters nested?               No                                 Yes (a hierarchy of clusters)

• Alternative distance measures: correlation-based distance.
  ▶ Motivation: Focuses on the shapes of the feature values rather than their exact magnitudes.
  ▶ Limitation: Only makes sense when p ≥ 3, for otherwise the correlation between two observations always equals ±1.

• Clustering and the curse of dimensionality:
  ▶ Visualization of the results of cluster analysis becomes problematic in high dimensions (p ≥ 3).
  ▶ As the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate between observations that are close and those that are far apart.