
PA Summary Sheet

Updated 09/23/24

DATA PROCESSING

Variable Types
• Target variable: Variable of interest to be predicted or analyzed
• Predictor variable: Used to reveal patterns of the target
• Continuous variable: Measures a quantity that can take any value within an interval
• Count variable: Records discrete counts
• Factor: Records levels (i.e. categories)

Other Data Terms
• Dimensionality: Number of factor levels
• Granularity: Degree of level precision
• Structured data: Suitable in tabular form
• Unstructured data: Not suitable in tabular form
• Semi-structured data: Has elements of both structured and unstructured

Types of Sampling
• Random sampling: Equal probability of sampling each record
• Stratified sampling: Sample from each stratum
• Oversampling: Sample higher proportion from minority
• Undersampling: Sample lower proportion from majority
• Systematic sampling: Follow specific pattern or set criterion

Reasons for Sampling
• Managing dataset size
• Handling irrelevant or misleading data
• Addressing imbalanced data
• Facilitating model testing

Common Univariate Numerical Summaries
• Mean
• Variance
• Quantiles (e.g. median)
• Frequency

Skewness Rule of Thumb
• Mean < median: possible left skewness
• Mean > median: possible right skewness
• Mean ≈ median: likely symmetric distribution

Common Bivariate Numerical Summaries
• Correlation: Measures linearity on a scale between −1 and 1
• Statistics by level
• Frequency (between two variables)

Multivariate Graphical Analyses
Common to depict a third variable by color-coding or faceting.

Transformation Techniques
• Transformations alter or convert variables to enhance the dataset
• Variables refer to data in original recorded form; features are derived predictors deemed suitable
• Removing variables is a part of the feature creation process
• Target leakage occurs when a predictor is derived from or influenced by the target

Handling Skewed Variables
• Right-skewed variables are good candidates to be log-transformed, as it compresses the range of values (see the R sketch below)
• Log transformation cannot be done if the variable has non-positive values without shifting
• Square root and cube root transformations are similar to a log transformation

Data Format
• A variable is standardized by subtracting the mean and then dividing by the standard deviation
• Binarization transforms a factor into one or more numeric variables with two possible values
• To decide whether a variable is better as numeric or factor, consider the "math test" and how many unique values the variable has

Identifying Relationships
• Polynomial transformations capture non-linear relationships
• Dividing a numeric variable into intervals and transforming it into a factor can potentially produce a non-linear pattern
• Reasons to combine levels of a factor include handling categories with insufficient data and grouping similar categories together, especially in terms of the target
• Releveling a factor changes its reference level

Transforming Multiple Variables
• A compound variable is a factor that merges multiple factors by pairing each level from one factor with each level from the other factor; it is for combining distinct but overlapping factors and to include interactions
• PCA is a dimensionality reduction technique
• Clustering groups similar data points together; multiple numeric variables are transformed into one factor
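
A minimal R sketch of the transformations above, assuming a hypothetical data frame dat with a right-skewed numeric income and a factor region:

  # Hypothetical data for illustration
  set.seed(1)
  dat <- data.frame(income = rexp(100, rate = 1/50000),
                    region = factor(sample(c("E", "N", "S"), 100, replace = TRUE)))

  dat$log_income <- log(dat$income)             # log transform; needs positive values
  dat$std_income <- scale(dat$income)[, 1]      # standardize: subtract mean, divide by sd
  dummies <- model.matrix(~ region, data = dat) # binarize; reference level "E" is dropped
  dat$region <- relevel(dat$region, ref = "S")  # relevel: "S" becomes the reference level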



Data Errors
• Identifying and handling data errors typically happens during or after EDA
• Strategies to identify errors include checking the minimum and maximum of numeric variables, the possible levels of factors, and the relationship between variables
• Strategies to handle errors include removing outliers, correcting inconsistencies, and removing rows or variables with significant errors

Missing Data
• Should distinguish between missing at random vs. not missing at random
• Strategies to handle missing data include removing a column, removing rows, and imputing values; the appropriate approach depends on the extent of missingness and the variable's nature
• Imputation does not replicate the inherent variability present in real data

Transforming Unstructured Data
• Unstructured data need preprocessing into structured data
• Examples include sentiment analysis and keyword identification

PARAMETRIC REGRESSION MODELS

Key Terminology in Statistical Learning
• The systematic component, often denoted as f, is what an observation's target gravitates to, usually the mean target
• f is a function of the predictors; it captures how the target and predictors relate
• f̂ is a model's estimate/estimator of f
• The random component, sometimes denoted as ε, is the part of the target that cannot be explained with predictors

Contrasting Terms
• Supervised vs. Unsupervised: has a target variable vs. no target variable
• Regression vs. Classification: the target is continuous or count vs. the target is categorical (i.e. binary)
• Parametric vs. Non-Parametric: functional form of f specified vs. not specified
• Flexibility vs. Interpretability: f̂'s ability to follow the data closely vs. f̂'s ability to be understood

Key Ideas
• Unsupervised learning techniques are useful for feature creation
• The disadvantage of parametric methods is the danger of choosing a form for f that is far from the truth
• The disadvantage of non-parametric methods is the need for an abundance of observations
• Flexibility and interpretability are usually inversely related
• Non-parametric methods tend to be more flexible than parametric methods



Model Accuracy
• y denotes the actual target; ŷ denotes the target prediction
• f̂ and ŷ are often interchangeable

Accuracy Metric
• A metric may measure model accuracy directly, or measure it inversely, i.e. model error
• A metric can be computed using the training set or the test set
• The training metric is not suitable to gauge model accuracy/error; use the test metric instead
• Usually, as flexibility increases, the training error(accuracy) metric decreases(increases)
• Usually, as a function of flexibility, the test error(accuracy) metric is u-shaped(n-shaped)
• Usually, at the same flexibility level, the training error(accuracy) metric is lower(higher) than the test error(accuracy) metric
• A very flexible f̂ likely suffers from overfitting; an excellent training metric with a much worse test metric signals an overfit

Accuracy with Regression
f denotes the mean target.

RMSE = √( ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n )
MSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n
MAE = ∑ᵢ₌₁ⁿ |yᵢ − ŷᵢ| / n

• MSE operates on a larger scale than RMSE
• MAE is more robust to outliers in the target

Accuracy with Classification
• Confusion matrix: Summarizes the actual and predicted targets of a dataset, i.e.

          y = 0            y = 1
  ŷ = 0   True negative    False negative
  ŷ = 1   False positive   True positive

• Classification error rate: Percentage of observations with wrong predictions
• Accuracy (rating): Complement of classification error rate
• Sensitivity (a.k.a. recall, TPR): Percentage of positive observations (y = 1) with correct predictions
• Specificity (a.k.a. TNR): Percentage of negative observations (y = 0) with correct predictions
• FPR: Complement of specificity
• Precision (a.k.a. positive predictive value): Percentage of positive predictions (ŷ = 1) that are correct
• The common model accuracy metric is AUC, which combines sensitivity and specificity
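
A minimal R sketch of these metrics on a test set, assuming a numeric target y with predictions y_hat, and a binary target coded 0/1 in y2 with predicted probabilities p_hat (all hypothetical vectors):

  # Regression accuracy metrics
  rmse <- sqrt(mean((y - y_hat)^2))
  mse  <- mean((y - y_hat)^2)
  mae  <- mean(abs(y - y_hat))            # more robust to outliers

  # Classification accuracy metrics with an illustrative 0.5 cutoff
  pred <- as.integer(p_hat > 0.5)
  table(Predicted = pred, Actual = y2)    # confusion matrix
  accuracy    <- mean(pred == y2)         # complement of classification error rate
  sensitivity <- mean(pred[y2 == 1] == 1) # recall / TPR
  specificity <- mean(pred[y2 == 0] == 0) # TNR
  precision   <- mean(y2[pred == 1] == 1) # positive predictive value
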
Bias-Variance Trade-Off
• For fixed inputs x₁, …, xₚ, the test RMSE can be decomposed into variance, bias, and irreducible error
• Variance: Amount of variation in f̂'s shape, i.e. the component of model error due to f̂ being too complex/flexible
• Bias: Average closeness between f̂ and f, i.e. the component of model error due to f̂ being not complex/flexible enough
• An inflexible f̂ has low variance and high bias; a very flexible f̂ has high variance and low bias
• The most accurate model strikes the right balance between variance and bias, thus minimizing(maximizing) the test error(accuracy) metric

Modeling Considerations

Before Modeling
• Defining the problem clearly comes from gaining clarity on the business issue, developing testable hypotheses, and identifying key performance indicators
• Checklist whether predictive modeling is the right approach:
  o Business issue is clear
  o Relevant data is available to address the issue
  o Confident that predictions will be useful/practical
  o Confident that modeling is a superior solution
  o The model can be monitored and updated
• Recognizing types of analyses:
  o Descriptive analytics: Focus on studying the past to identify relationships and patterns among the variables
  o Predictive analytics: Focus on anticipating the future by using models to make accurate predictions
  o Prescriptive analytics: Focus on the outcome of decisions

While Modeling
• Adjust the business problem
• Consult an expert
• Collect more data
• Attempt different models
• Refine current models

After Modeling
Modeling work ends by implementing or abandoning.

Data Partition
• Stratified sampling is the preferred method for dividing a dataset into training and test sets
• Random number generators and seeds are used to simulate randomness and maintain reproducibility
• When creating training and test sets, the target is commonly used as the stratifying variable to ensure similar distributions (see the sketch below)

Validation Set
A dataset could be partitioned into these three sets:
• Training set: Observations used for fitting
• Validation set: Observations used for validating model fits
• Test set: Observations used for final evaluation
The common exam practice is to split the dataset into "training set" and "test set", such that "test set" and "validation set" are used interchangeably.
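
As a sketch, one common way to create a stratified split is caret's createDataPartition(), which stratifies on the supplied variable (dat and target y are hypothetical; the 75/25 split is illustrative):

  library(caret)
  set.seed(123)                 # seed for reproducibility
  idx   <- createDataPartition(dat$y, p = 0.75, list = FALSE)
  train <- dat[idx, ]           # training set
  test  <- dat[-idx, ]          # test set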



Multiple Linear Regression (MLR)

Y = β₀ + β₁x₁ + ⋯ + βₚxₚ + ε
The ε's are normally distributed with mean 0 and variance σ².

Notation
• βⱼ: the jth regression coefficient
• bⱼ: estimate of βⱼ
• σ²: variance of target
• e: residual
• SSE: error sum of squares

Estimation
ŷ = b₀ + b₁x₁ + ⋯ + bₚxₚ
Ordinary least squares (OLS) finds the b's that minimize SSE; the MLR objective function is SSE.
The estimate of σ is the residual standard error = √( SSE / (n − p − 1) ).
SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = ∑ᵢ₌₁ⁿ eᵢ² on the training set.

Key Ideas
• Only w − 1 dummy variables are needed to represent w levels of a factor; the reference level does not have a dummy variable
• Two predictors have an interaction when the relationship between the target and one predictor changes based on the other predictor's value; faceted plots are useful for identifying interactions
• t tests only analyze one regression coefficient; a small p-value suggests a significant predictor to be kept
• F tests compare nested MLR models to test multiple coefficients
• Removing a dummy variable can lead to the combining of levels, creating a new reference level
• Hierarchical principle: When including an interaction, always keep its individual terms

MLR Assumptions
1. E[ε] = 0
2. Var[ε] = σ² (homoscedasticity)
3. ε's are independent
4. ε's are normally distributed
5. xⱼ is not a linear combination of the other predictors, for j = 0, 1, …, p

Concerns
• Residuals with non-zero averages (red flag)
• Heteroscedasticity (violation)
• Dependent ε's (violation)
• Non-normal ε's (violation)
• Outliers (red flag)
• Collinearity (red flag) / Perfect collinearity (violation)
• High dimensions (red flag)

Plots of Residuals
• Residuals vs. predictions (e vs. ŷ) plot: the residuals are well-behaved if
  o Residuals seem to average to 0 at reasonable intervals
  o Spread of residuals does not change
  o Points appear to be randomly scattered
• qq plot: check for normality and outliers
• Scale-location plot
• Standardized residuals vs. leverage plot

Perfect Collinearity
• Prevents OLS from obtaining unique coefficient estimates
• Arises when the model has more coefficients than necessary
• Examples include
  o Three dummy variables for a three-level factor
  o Interaction of factors with missing level pairings
  o Interaction of a numerical predictor and a factor when the numerical predictor doesn't vary for a factor level

R Formula Syntax
• Explained by: ~
• Include: +
• All remaining variables: .
• Exclude: -
• Interact: :
• Cross (include and interact): *
Symbols will take their arithmetic meaning when within the I() function, e.g. to create a squared term.
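
A minimal sketch of this syntax with lm(), assuming a hypothetical data frame dat with target y and predictors x1, x2, x3:

  fit1 <- lm(y ~ x1 + x2, data = dat)      # include x1 and x2
  fit2 <- lm(y ~ . - x3, data = dat)       # all remaining variables, excluding x3
  fit3 <- lm(y ~ x1 * x2, data = dat)      # x1, x2, and their interaction x1:x2
  fit4 <- lm(y ~ x1 + I(x1^2), data = dat) # I() gives ^ its arithmetic meaning
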
Stepwise Selection

Information Criteria
• Designed to mimic the test RMSE; computed using only the training set
• For each predictor added to a model, an information criterion deems it an improvement if twice the maximized log-likelihood increases by more than a certain amount; that amount for
  o AIC is 2
  o BIC is the natural log of the number of observations
• BIC tends to favor models with fewer predictors, relative to AIC

Forward Selection
Starts with the null model and iteratively adds the one term that improves model fit the most until no further improvement, according to a chosen information criterion.

Backward Selection
Starts with the full model and iteratively removes the one term that hinders model fit the most until no further improvement, according to a chosen information criterion.
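
A minimal sketch of both procedures with base R's step(), where the argument k = 2 corresponds to AIC and k = log(n) to BIC (train is a hypothetical training set with target y):

  null <- lm(y ~ 1, data = train)  # null model
  full <- lm(y ~ ., data = train)  # full model
  n    <- nrow(train)

  # Forward selection with BIC
  fwd <- step(null, scope = formula(full), direction = "forward", k = log(n))
  # Backward selection with AIC
  bwd <- step(full, direction = "backward", k = 2)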



Key Ideas
• Binarization: Process of creating dummy variables manually; they are considered separately in stepwise selection
• Both procedures are greedy and perform variable selection
• Forward selection works in high-dimensional settings
• Backward selection maximizes potential of finding complementing predictors
• Forward selection with BIC tends to result in fewer predictors; backward selection with AIC tends to result in more predictors

Regularization

Ridge Regression
Seeks coefficient estimates that minimize SSE + λ ∑ⱼ₌₁ᵖ bⱼ².
λ controls the shrinkage of the estimates; it is inversely related to flexibility.

Lasso Regression
Seeks coefficient estimates that minimize SSE + λ ∑ⱼ₌₁ᵖ |bⱼ|.

Elastic Net Regression
• A more general regularized regression where the restricted quantity is α ∑ⱼ₌₁ᵖ |bⱼ| + (1 − α) ∑ⱼ₌₁ᵖ bⱼ²
• α = 0 simplifies to ridge, α = 1 simplifies to lasso

Key Ideas
• All three require initial scaling of predictors
• All three binarize factors
• All three reduce model flexibility and discourage overfitting
• All three handle high-dimensional data well
• With a finite λ, lasso and elastic net can perform variable selection, while ridge cannot
• Lasso and elastic net may violate the hierarchical principle, while ridge cannot
• In a GLM setting, SSE is replaced by the negative log-likelihood

glmnet() Hyperparameters
• alpha: resemblance towards lasso versus ridge
• lambda: shrinkage parameter

k-Fold Cross-Validation
Tunes hyperparameters following these steps:
1. Randomly divide the dataset into k folds
2. Select one combination of hyperparameter values
3. For v = 1, …, k, obtain the vth fit by training with all observations except those in the vth fold
4. For v = 1, …, k, use ŷ from the vth fit to calculate an accuracy/error metric (e.g. RMSE) with observations in the vth fold
5. Average the k metrics in step 4 to calculate the CV metric
6. Repeat steps 3 to 5 for all other combinations of hyperparameter values; the best combination produces the best CV metric

One-Standard-Error Rule
For hyperparameters that measure flexibility, rather than selecting the value that produces the best CV metric, we may instead select the value producing a CV metric within one standard error of the best that results in the lowest flexibility.
Generalized Linear Models (GLM)
• Any distribution belonging to the linear exponential family can be chosen; its mean μ is embedded in its probability function; its variance is typically a function of the mean
• There are two modeling choices: distribution and link function
• A link function establishes the relationship between μ and the predictors, i.e. the function g where g(μ) = xᵀβ
• MLR is a special case of GLM with the target normally distributed and the identity link

Distributions
Ideal choices are as follows:
• For continuous targets: normal, gamma, inverse Gaussian
• For binary targets: binomial (i.e. Bernoulli)
• For count targets: Poisson

Link Functions
• Identity link: g(μ) = μ; ideal for real-valued μ
• Log link: g(μ) = ln μ; ideal for positive-valued μ
• Logit link: g(μ) = ln( μ / (1 − μ) ); ideal for μ between 0 and 1

Estimation
g(μ̂) = xᵀb ⇒ μ̂ = g⁻¹(xᵀb)
Maximum likelihood estimation (MLE) finds the b's that maximize the log-likelihood; the GLM objective function is the log-likelihood.
Numerical algorithms are typically used for estimation, as closed-form solutions are often not available; choosing the canonical link can help the estimates converge.

GLM Metrics
• Maximized log-likelihood
• Deviance
• Pearson chi-square statistic
• AIC and BIC

Negative Binomial Distribution
Good alternative to Poisson for count targets.

Tweedie Distribution
Good option for a target with many counts of 0 but is otherwise continuous on positive values.

Pearson and Deviance Residuals
• Analyze plots of Pearson/deviance residuals the same way as described for MLR with regular residuals
• Under MLR assumptions, we may take Pearson, deviance, and regular residuals as interchangeable



Overdispersion
• When data variability exceeds the model's estimated variance
• A deviance greater than its degrees of freedom signals overdispersion
• Using a quasi-likelihood adjusts the modeled variance; it does not change b, but z tests will become t tests

bⱼ Interpretation
• Identity link: for every unit increase in xⱼ, the predicted target changes by bⱼ
• Logit link: for every unit increase in xⱼ, the predicted odds changes by a factor of e^(bⱼ)
• Log link: for every unit increase in xⱼ, the predicted target changes by a factor of e^(bⱼ)
This assumes all other predictors remain constant, and that xⱼ has no higher-order terms or interaction with another predictor.

Weights
Positive constants that:
• Measure the relative importance of each observation
• Scale the variances of the target for all observations
• Are intuitive as integers, i.e. target interpretable as an average

Miscellaneous GLM Concepts
• Typically assumes a monotonic relationship between the target and a predictor
• Nested models can be compared using the likelihood ratio test
• Regularization uses the negative log-likelihood instead of SSE

GLM for Binary Targets

Odds
For Y ∼ Bernoulli (thus μ = Pr(Y = 1)), odds = μ / (1 − μ) ⇔ μ = odds / (1 + odds).

Logistic Regression
GLM with a binomial target and the logit link.

Area Under the Curve (AUC)
• For classification, target predictions ŷ come from setting a cutoff; observations with predicted probability (μ̂ for GLM) above the cutoff are predicted as positives
• The receiver operating characteristic (ROC) curve is the plot of sensitivity and specificity from considering all cutoff values
• AUC is the area under the ROC curve; a higher AUC implies there are more cutoffs with higher sensitivity and/or specificity

GLM for Count Targets

Exposures
A Poisson random variable counts within a given interval, where scaling the interval results in scaling the mean. An exposure is a chosen interval measure. Thus, it makes sense for a Poisson mean to equal the number of exposures times the rate.
A model that ignores exposures is equivalent to all observations having the same number of exposures.

Offset
The term that adds to xᵀβ, which together equals g(μ). Since the log link is typical in this situation, the offset is usually the natural log of the exposures.

Key R Takeaways – GLM
Functions of family objects, with canonical link defaults, include:
• binomial(link = "logit")
• gaussian(link = "identity")
• Gamma(link = "inverse")
• inverse.gaussian(link = "1/mu^2")
• poisson(link = "log")
• quasibinomial(link = "logit")
• quasipoisson(link = "log")
For glm(), its arguments include:
• family: a family object function
• weights: column name of weighting variable
• offset: expressions for offset
For glmnet(), its argument family can be "gaussian", "binomial", or "poisson".
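
A minimal sketch of two GLM fits, assuming a hypothetical data frame dat with count target claims, exposure column expo, binary target y, and predictors age and region:

  # Poisson GLM with the canonical log link and a log-exposure offset
  freq <- glm(claims ~ age + region, data = dat,
              family = poisson(link = "log"), offset = log(expo))

  # Logistic regression: binomial target with the canonical logit link
  logit_fit <- glm(y ~ age + region, data = dat, family = binomial(link = "logit"))
  head(predict(logit_fit, type = "response"))  # predicted probabilities mu-hat
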
DECISION TREES

Introduction to Decision Trees
• No underlying assumptions about target and predictors
• Decision tree components:
  o Root node
  o Splits
  o Terminal nodes / leaf nodes / leaves
  o Edges / branches
  o Parent and child nodes
  o Tree depth
• Trees are made of a series of splitting rules that divide the predictor space into multiple regions/nodes
• Splitting rules are determined using recursive binary splitting, a greedy algorithm that selects the best split at each node
• Flexibility is measured by the number of terminal nodes

Regression Trees
• The best split minimizes SSE
• Each node's prediction is the average of the target values in it
• Stopping criteria limit tree growth based on:
  o Minimum number of observations required to attempt a split
  o Minimum number of observations required in each terminal node after splitting
  o Maximum depth of any node
  o Minimum reduction in SSE required for a split to occur



Classification Trees
Differs from regression trees in:
• The best split maximizes the information gain in Gini index or entropy
• Each node's prediction is the majority class in it

With two classes (0 and 1), each node's
Gini index: G = 1 − p₀² − p₁²
Entropy: E = −p₀ log₂(p₀) − p₁ log₂(p₁)

Information gain is the difference between the measure for the parent node and the weighted average of the child nodes.
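
As a quick worked sketch in R: a parent node with 400 observations and p₀ = 0.5 split into children of sizes 300 (p₀ = 2/3) and 100 (p₀ = 1) gives an information gain in Gini index of about 0.167 (the counts are illustrative):

  gini <- function(p0) 1 - p0^2 - (1 - p0)^2
  g_parent   <- gini(0.5)                                     # 0.5
  g_children <- (300/400) * gini(2/3) + (100/400) * gini(1)   # weighted average ~0.333
  g_parent - g_children                                       # information gain ~0.167
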
Pruning Trees
• Pruning helps to identify the optimal (smaller) tree size
• Cost complexity pruning balances tree error with complexity by minimizing SSE_T + cp × |T| × SSE₀ for regression trees, where |T| is the number of terminal nodes and SSE₀ is the SSE with no splits
• For classification trees, the classification error rate is used instead of SSE
• Hyperparameter cp can be tuned using cross-validation

Other Topics

Competitor and Surrogate Splits
• Competitor splits are the best alternative splits from the predictors not chosen as the best split
• Surrogate splits are backup rules for handling missing data in the primary split

Pruning vs. Reconstructing Trees
Pruning an existing tree is theoretically better than constructing a new tree from scratch using the optimal cp value.

Advanced Concepts
• Balanced vs. unbalanced trees
• A right-skewed target can lead to an unbalanced tree focusing on the right tail; consider transforming the target first
• Collinearity can cause instability in tree structure and mask variable significance, though not as bad as in parametric models
• Interactions can be captured naturally
• Monotone transformations of numerical features don't affect the tree splits (besides changing the split point value)
• Factors with many levels are more likely to be chosen for splits

Advantages
• No assumptions about how target and predictors relate
• High interpretability and ease of explanation
• Minimal data preprocessing required
• Automatic feature selection

Disadvantages
• Prone to overfitting
• Lower predictive accuracy compared to more advanced models
• Unstable and sensitive to training data
• Greedy

Key R Takeaways – Decision Trees
For rpart(), its key arguments are:
• method: "anova", "class", or "poisson"
• parms: list(split = "gini") for Gini index or list(split = "information") for entropy
• control: see rpart.control()
For rpart.control(), its key arguments are:
• minsplit: minimum number of observations in a node
• minbucket: minimum number of observations permitted for a terminal node
• maxdepth: maximum depth of terminal nodes
• cp: complexity parameter
• maxcompete: maximum number of competitor splits stored for each split
• maxsurrogate: maximum number of surrogate splits stored for each split
• usesurrogate: indicates how to use the surrogate splits
Use predict() with argument type = "class" for predicted class or type = "prob" for predicted probabilities.
Use prune.rpart() to prune the tree to the desired level based on the chosen cp value.
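
A minimal sketch of growing and pruning a classification tree (train and its binary target y are hypothetical; the control values and cp are illustrative):

  library(rpart)
  tree <- rpart(y ~ ., data = train, method = "class",
                parms = list(split = "gini"),
                control = rpart.control(minbucket = 5, cp = 0.001, maxdepth = 10))
  tree$cptable                           # CV error by cp value
  pruned <- prune.rpart(tree, cp = 0.01) # prune back to the chosen cp
  predict(pruned, train, type = "prob")  # predicted probabilities
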
Random Forests
• Aggregating predictions from multiple trees from fits on multiple bootstrapped datasets
• Individual trees have high variance, low bias; aggregating reduces the high variance
• Considering a random subset of features at each split reduces the risk of correlated tree predictions, further reducing variance
• Bagging is a special case of random forests

Advantages and Disadvantages
• Decreased expected error by reducing variance without affecting bias
• Improved predictive power and robustness in exchange for lower interpretability and ease of explanation
• Retains most advantages of decision trees

Boosting
• Aggregating predictions from multiple trees that are fitted consecutively using updated datasets
• Individual trees have high bias, low variance; aggregating reduces the high bias

Key Hyperparameters
• Number of trees, B
• Number of splits, d
• Shrinkage parameter, λ: controls the amount of information gained from each tree



Algorithm
1. Set f̂(x) = 0 and rᵢ = yᵢ for all i in the training set
2. For b = 1, …, B:
   o Fit a tree f̂ᵇ with d splits to the training data (r as target)
   o Update f̂ by adding the shrunken version of the new tree: f̂(x) ← f̂(x) + λf̂ᵇ(x)
   o Update residuals for all i: rᵢ ← rᵢ − λf̂ᵇ(xᵢ)
3. Output the boosted model: f̂(x) = ∑ᵇ₌₁ᴮ λf̂ᵇ(x)
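
A from-scratch sketch of this algorithm for a regression target, using small rpart trees (train and its target y are hypothetical; B, λ, and maxdepth are illustrative, with maxdepth standing in for the number of splits d):

  library(rpart)
  B <- 100; lambda <- 0.1
  work  <- train                     # working copy; y will hold the residuals r
  f_hat <- rep(0, nrow(train))       # step 1: f_hat(x) = 0, r_i = y_i
  trees <- vector("list", B)
  for (b in 1:B) {                   # step 2
    trees[[b]] <- rpart(y ~ ., data = work,
                        control = rpart.control(maxdepth = 2, cp = 0))
    pred   <- predict(trees[[b]], work)
    f_hat  <- f_hat + lambda * pred  # add the shrunken new tree
    work$y <- work$y - lambda * pred # update residuals
  }
  # step 3: f_hat now holds the boosted model's training predictions
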
Advantages and Disadvantages
• Generally more predictive than most other models, but at the expense of interpretability
• Retains most advantages of decision trees
• Compared to random forests, tends to overfit, but this can be mitigated

Feature Importance
• Measures the contribution of each predictor to the model's predictions by quantifying the reduction in error across all trees
• Two main approaches:
  o Direct evaluation of the fitted model, such as mean decrease in SSE or Gini index
  o Calculate the change in accuracy after permuting data

Partial Dependence
• Calculates the marginal effect of a predictor on the target after integrating out the other predictors, showing how predictions change on average as the selected predictor varies
• Limitations include:
  o Does not depict exact relationships
  o Focuses on one predictor, masking interactions
  o Some average predictions may be unrealistic

Partial Dependence Plot Steps
1. Select a predictor
2. Identify all possible values of the predictor in the training set
3. Modify the training set by setting all rows of the predictor to one of its possible values
4. Use the trained model to predict the target on the modified data
5. Average these predictions and record the corresponding predictor value used in step 3
6. Repeat steps 3 to 5 for all other possible values of the predictor
7. Plot the average predictions against the predictor values recorded during step 5
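
A minimal R sketch of these steps for an already-fitted model fit and a hypothetical numeric predictor age in the training set train:

  vals <- sort(unique(train$age))        # steps 1-2
  avg_pred <- sapply(vals, function(v) {
    modified <- train
    modified$age <- v                    # step 3: set all rows to one value
    mean(predict(fit, modified))         # steps 4-5: average the predictions
  })                                     # step 6: sapply repeats for every value
  plot(vals, avg_pred, type = "l",       # step 7
       xlab = "age", ylab = "average prediction")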

Oversampling and Undersampling
• For classification problems, oversampling duplicates observations in the minority class; undersampling deletes observations in the majority class
• Should be performed on the training set only
• Oversampling may overfit the minority class; undersampling may underfit the majority class

Key R Takeaways – Ensemble Methods
For randomForest(), its key hyperparameters are:
• ntree: number of trees
• mtry: number of candidate predictors for each split
• nodesize: minimum number of observations permitted for a terminal node
• maxnodes: maximum number of terminal nodes
For gbm(), its key hyperparameters are:
• n.trees: number of trees
• interaction.depth: maximum depth of terminal nodes
• shrinkage: shrinkage parameter
• n.minobsinnode: minimum number of observations permitted for a terminal node
• bag.fraction: portion of observations used for each tree
For xgb.train(), its key hyperparameters are:
• nrounds: number of trees
• max_depth: maximum depth of terminal nodes
• eta: shrinkage parameter
• gamma: minimum reduction of splitting measure
• subsample: portion of observations used for each tree
• colsample_bytree: portion of predictors used for each tree

Hyperparameters tuned through caret's train(), by method:
• Decision Trees (method = "rpart"): cp
• Random Forests (method = "rf"): mtry
• Boosting (method = "gbm"): n.trees, interaction.depth, shrinkage, n.minobsinnode
• Boosting (method = "xgbTree"): nrounds, max_depth, eta, gamma, subsample, colsample_bytree, min_child_weight
• Elastic Net (method = "glmnet"): alpha, lambda
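
A minimal sketch of hyperparameter tuning through caret's train() (train_set and its target y are hypothetical; the grid and fold count are illustrative):

  library(caret)
  set.seed(7)
  ctrl <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
  grid <- expand.grid(cp = c(0.001, 0.01, 0.1))   # candidate cp values
  cv_tree <- train(y ~ ., data = train_set, method = "rpart",
                   trControl = ctrl, tuneGrid = grid)
  cv_tree$bestTune                                # cp with the best CV metric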



UNSUPERVISED FEATURE CONSTRUCTION

Principal Components Analysis
• Finds a lower-dimensional representation of a dataset while retaining maximum information
• Creates numerical variables called principal components

Notation
• zₘ: the mth principal component
• zᵢ,ₘ: the mth principal component score for the ith observation
• φⱼ,ₘ: the jth loading for the mth principal component

zₘ = φ₁,ₘ x₁ + φ₂,ₘ x₂ + ⋯ + φₚ,ₘ xₚ

Visuals
• A biplot visualizes both the loadings and the scores
• A scree plot visualizes variance explained and helps determine the number of principal components to retain, e.g. identify the elbow and retain all principal components before the elbow

Key Properties
• Best lower-dimensional approximations of the dataset
• Have maximized variances
• Uncorrelated with each other
• Total variance of the dataset is preserved

Technical Details
• Works with numeric variables; factors need to be binarized
• Loadings and scores are unique up to a sign flip
• Addresses collinearity
• Non-distinct principal components explain no variance
• Centering original variables is important for consistency in implementation
• Scaling original variables is important to prevent those with large variances from dominating the loadings

Purposes
• Feature transformation
• Feature extraction
• Dimension reduction (not variable selection)
• Creating uncorrelated variables
• Visualization tool
• Identification of latent variables
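
A minimal sketch with prcomp() on a hypothetical all-numeric data frame dat:

  pca <- prcomp(dat, center = TRUE, scale. = TRUE) # center and scale the variables
  summary(pca)                    # proportion of variance explained per component
  pca$rotation                    # loadings (phi)
  head(pca$x)                     # scores (z), usable as new features
  biplot(pca)                     # loadings and scores together
  screeplot(pca, type = "lines")  # scree plot for locating the elbow
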
Clustering
• Group observations that are similar in the same cluster, and observations that are different in different clusters
• Creates a factor based on the groupings

k-Means Clustering
• Groups similar observations using Euclidean distance
• Aims to minimize the total within-cluster variation
• The elbow method helps decide the optimal number of clusters
• The algorithm may converge to a local minimum; multiple runs are recommended to aim for the global minimum

Algorithm
1. Choose k, the number of clusters
2. Randomly assign each observation to one of the k clusters
3. Calculate each cluster's centroid
4. Reassign each observation to the closest centroid
5. Repeat steps 3 and 4 until the cluster assignments don't change

Hierarchical Clustering
• Creates clusters by joining the two closest clusters, then the next two closest, until all observations are in a single cluster
• Produces a dendrogram; agglomerative hierarchical clustering starts with each observation as its own cluster at the bottom
• Dissimilarity measures include Euclidean distance and correlation-based distance

Linkage and inter-cluster dissimilarity:
• Complete: largest pairwise dissimilarity
• Single: smallest pairwise dissimilarity
• Average: average pairwise dissimilarity
• Centroid: dissimilarity between centroids

• Complete and average linkages often produce balanced dendrograms; single linkage tends to create skewed dendrograms with trailing clusters
• Centroid linkage can lead to inversions

Algorithm
For k = n, n − 1, …, 2:
1. Calculate the inter-cluster dissimilarity between all k clusters
2. The two clusters with the lowest inter-cluster dissimilarity are fused

Comparing Clustering Methods
• Both face validation challenges, lack robustness, and are susceptible to the curse of dimensionality
• Both are only applicable to numeric variables
• Original variables are typically scaled/standardized beforehand
• Outliers can significantly affect cluster assignments
• k-means requires specifying the number of clusters beforehand; hierarchical determines the number of clusters after running the algorithm
• k-means involves randomization; hierarchical is deterministic
• k-means creates non-nested clusters; hierarchical creates nested clusters
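
A minimal sketch of both methods on a hypothetical all-numeric data frame dat (k = 3 and complete linkage are illustrative choices):

  X <- scale(dat)                            # standardize beforehand
  set.seed(10)
  km <- kmeans(X, centers = 3, nstart = 25)  # nstart > 1 runs multiple initializations
  km$tot.withinss                            # total within-cluster variation
  cluster_factor <- factor(km$cluster)       # clustering creates a factor

  hc <- hclust(dist(X), method = "complete") # Euclidean distance, complete linkage
  plot(hc)                                   # dendrogram
  cutree(hc, k = 3)                          # choose the number of clusters afterward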

© 2024 Coaching Actuaries. All Rights Reserved. www.coachingactuaries.com
