Exam PA Notes
aes() determines the roles different variables play in the plot, such as color, size, and shape.
Case Study: Personal injury insurance dataset. (NOTE: This case study illustrates how to
manipulate aes() and the geom functions to visualize a scatterplot.)
aes() determines what relationships we want to see in the plot (e.g., which variable is mapped to
the blue/red colors), while the geom functions determine how we want to see those relationships
(e.g., drawn as points in red or in blue).
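A minimal ggplot2 sketch of this division of labor, assuming a data frame persinj with columns amt and legrep and an operational-time variable op_time (hypothetical names, not taken from the case study output):

    library(ggplot2)
    # aes() maps variables to roles (x, y, color); the geoms decide how the mapping is drawn
    ggplot(persinj, aes(x = op_time, y = log(amt), color = factor(legrep))) +
      geom_point(alpha = 0.5) +                 # draw the mapped relationship as points
      geom_smooth(method = "lm", se = FALSE)    # and as fitted lines, one per color group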
2.2.1 Univariate Data Exploration – exploration that sheds light on the distribution of only ONE
variable at a time.
Numeric Variables
Descriptive statistics. Statistical summaries are mainly used to reveal two aspects of the
distribution of a numeric variable:
Central tendency – mean/median.
Dispersion – Common measures of dispersion include the variance/standard deviation and the
interquartile range. All measure, in some way, how spread out the values of the numeric
variable are over its range.
Statistical summary – a larger standard deviation means the values are more dispersed, or more spread out.
Histogram – adjusting the bin width can lead to a better visualization; the larger the bin width, the
smoother the histogram, but smoother is not necessarily better.
Problem with skewed data and possible solutions – In predictive modeling, it is often
undesirable to have a right-skewed target variable for two reasons:
Predictive power – Our objective is to study the association between the target variable
and the predictors in the data over a wide range of variable values. If most of the
observations of the target variable cluster narrowly within a small range of values, it will be
difficult to investigate the effect of the predictors on the target variable globally.
Model fitting – a number of predictive models, such as linear models and decision trees, are
fitted by minimizing the sum of squared discrepancies between the observed values and the
predicted values of the target variable. If the target variable is right-skewed, then the
outliers or extreme values will contribute substantially to this sum and have a
disproportionate effect on the model.
To correct skewness – Apply a monotone concave function to make the distribution more
symmetric.
Log transformation – can be applied to variables of interest that are strictly positive.
Square root transformation – can be applied to non-negative variables (it accommodates zeros,
which the log transformation cannot).
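A minimal sketch of both transformations, assuming a strictly positive variable amt in a data frame dat (hypothetical names):

    dat$log_amt  <- log(dat$amt)    # log transformation: strictly positive values only
    dat$sqrt_amt <- sqrt(dat$amt)   # square root transformation: works for non-negative values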
Categorical Variables
Descriptive statistics – Frequency tables: categorical variables do not always have a natural
order, so the mean and median may not make sense. Frequency tables help us understand the
distribution of a categorical variable.
Graphical displays – Bar charts: the more levels a categorical variable has, the more difficult its
frequency table is to read. Bar charts extract the information from a frequency table and present
the counts visually, highlighting the relative frequency of each level of the variable.
The geom_bar() function is for visualizing the distribution of a categorical variable given
individual (raw) data, while the geom_col() function serves the same purpose given
grouped (summarized) data.
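A small illustration of the two functions, assuming raw data dat with a categorical column inj and a separate summarized table inj_counts with columns inj and n (hypothetical names):

    library(ggplot2)
    ggplot(dat, aes(x = inj)) + geom_bar()                 # geom_bar() counts the raw rows itself
    ggplot(inj_counts, aes(x = inj, y = n)) + geom_col()   # geom_col() takes pre-summarized counts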
2.2.2 Bivariate Data Exploration – there are three types of bivariate combinations
(numeric vs. numeric, numeric vs. categorical, and categorical vs. categorical).
We can see that the claim amount is larger for claims with legal representation than without,
regardless of injury code.
Classification of Variables:
Target variable – interested in predicting.
Predictors – Associated variables which can be used to predict the target.
Numeric Variables:
1. Discrete
2. Continuous
Categorical Variables:
1. Gender
2. Smoking status
With the use of descriptive statistics and graphical displays, clean the data for incorrect,
unreasonable, and inconsistent entries, and understand the characteristics of and the key
relationships among variables in the data. The observations we make may suggest an appropriate
type of predictive model to use and the best form for the predictors to enter the model.
How to do the split? – If one of the variables in the data is a time variable, then we can split on
the basis of time: older observations go to the training set and recent observations to the test set.
How many observations should we assign to the two sets? – The trade-off is that more training
data will make for a more robust predictive model, more capable of learning the patterns in
the data and less susceptible to noise. If too little data is set aside for the test set, however, the
assessment of the prediction performance of the trained model on new, unseen
observations will be less reliable.
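When there is no time variable, a stratified random split is common; a minimal sketch using caret::createDataPartition, assuming a data frame dat with target amt (the 75/25 proportion is illustrative):

    library(caret)
    set.seed(123)
    idx   <- createDataPartition(dat$amt, p = 0.75, list = FALSE)  # 75% of rows for training
    train <- dat[idx, ]
    test  <- dat[-idx, ]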
Common performance metrics – To evaluate the performance of a model, we use performance metrics, which
measure the extent to which the model matches the observations in the data. This requires a loss function.
Regression problems: we usually look at the discrepancies between each observed value of the
target variable and the corresponding predicted value. Such a discrepancy is called a residual if the
observation is in the training set, and a prediction error if the observation is in the test set. The smaller the RMSE,
the better the fit of the model to the training/test data.
Mean absolute error (MAE) – It places much smaller weight on large losses and makes the fitted
model more robust against outliers. Unlike the squared-error loss, however, it is not differentiable at
zero, which makes model fitting harder.
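Both metrics are easy to compute directly; a minimal sketch assuming numeric vectors y (observed) and y_hat (predicted), hypothetical names:

    rmse <- sqrt(mean((y - y_hat)^2))   # squared loss: penalizes large errors heavily
    mae  <- mean(abs(y - y_hat))        # absolute loss: more robust to outliers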
Classification problems: the predictions (the y hats) are simply labels, i.e., certain factor levels of the target.
We can use the zero-one loss function, which equals 1 for a misclassified observation and 0 otherwise,
and use these indicator functions in aggregate to develop the classification error rate.
The sum of the 1s is the count of training observations incorrectly classified. The
smaller the training classification error rate, the better the fit of the classifier.
Cross-validation: you can assess the prediction performance of a model without using additional test
data. CV could be used to tune hyperparameters WITHOUT having to further divide the training set
into two parts.
1. For a given positive integer k, randomly split the training data into k folds of approximately
equal size.
2. One of the k folds is left out and the predictive model is fitted to the other k-1 folds. Then the fitted
model is used to make a prediction for each observation in the left-out fold and a
performance metric is computed on that fold.
3. Repeat this process with each fold left out in turn to get k performance values.
4. The overall prediction performance of the model can be estimated as the AVERAGE of the
k performance values.
Note: If we have a set of candidate hyperparameter values, then we can perform steps 2-4 above for
each set of values under consideration and select the combination that produces the best model
performance.
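A minimal tuning sketch with the caret package, assuming a training frame train with target amt; the learner (rpart) and cp grid are illustrative choices:

    library(caret)
    set.seed(42)
    ctrl <- trainControl(method = "cv", number = 10)              # 10-fold cross-validation
    fit  <- train(amt ~ ., data = train, method = "rpart",
                  trControl = ctrl,
                  tuneGrid  = expand.grid(cp = c(0.001, 0.01, 0.1)))
    fit$bestTune                                                  # hyperparameter value with the best CV performance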
Training error: the more complex or flexible a model, the lower its training error (a more
flexible model is capable of accommodating a wider variety of patterns in the training data).
Test error: it exhibits a U-shape, first declining as model complexity increases, then starting to
increase eventually as the model becomes inordinately complex.
1. Case 1: Underfitting
The model is too simple to capture the training data effectively. Without learning
enough about the signal in the data, an underfitted model matches both training and
test data POORLY.
2. Case 2: Overfitting
The model is overwhelmingly complex and is overfitting the data: it captures not only the
signal but also the noise, so it matches the training data closely but generalizes poorly to test data.
The signal: the general relationship between the target variable and the predictors.
The noise: patterns caused by random chance rather than by true properties of the
unknown function f.
Bias vs. variance. – They quantify the accuracy and precision of prediction, respectively.
For Bias:
On the top left and top right, the model predictions on average lie in the middle, hitting
the true signal value approximately. These two models have small bias; we say that they
make accurate predictions.
In contrast, the bottom-left and bottom-right model predictions are far away from the center on average, and
thus they have large bias, or their predictions are inaccurate.
For Variance:
For the top-left and bottom-left models, the predictions are concentrated in a small region; they have
small variance and they are precise. The high-variance models on the right side are imprecise.
A more flexible model generally has a lower bias but a higher variance than a less flexible model.
To construct a model that fares well on future data, we should strike a balance between bias and variance,
which should at least approximately correspond to the minimum point of the U-shaped curve followed by the
test error: a moderately flexible model that fits the training data reasonably well and strikes a good balance
between bias and variance.
To strike this bias-variance balance, feature generation and selection help us
effectively control model complexity.
Variable: a raw measurement that is recorded and constitutes the original dataset before any
transformations are applied.
Features: derived from the original variables; they provide an alternative, more useful representation of the
raw variables and serve as the final inputs into a predictive model.
Feature Generation – A process of generating new features based on existing variables in the data.
Predictive Power: Transform the data into a more useful form (or scale) so that the predictive model can
better “absorb” the information and capture the relationship between the target variable and the signal in
the data.
Interpretability: New features can make a model easier to interpret.
Feature selection and dimension reduction – A procedure of dropping features (or variables) with limited
predictive power and therefore reducing the dimension of the data.
Predictive Power: Feature selection is an attempt to control model complexity and prevent overfitting
by reducing the variance at the cost of a slight rise in bias.
Interpretability: Given two comparably effective models, we prefer the simpler and more
interpretable one, selecting the useful features and dropping the not-so-useful ones.
Dimensionality: for a categorical predictor, refers to the number of possible levels that the variable has.
Combining similar categories: If the target variable behaves similarly (in mean,
median, or other distributional measures) across two categories of a categorical
predictor, we can reduce the dimension of the predictor by consolidating these
two categories, EVEN if they are not sparse, without losing much information.
Granularity revisited.
One way to reduce the susceptibility of a predictive model to overfitting is to reduce the granularity of a
categorical predictor, recording the information contained in the predictor at a less detailed level and making
the number of factor levels more manageable. The optimal level of granularity is the one that optimizes the
bias-variance trade-off.
This approach consists in choosing the estimates of the β's to minimize the sum of squared differences between
the observed target values and the fitted values under the model:
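(standard least-squares criterion, reconstructed here rather than copied from the original screenshot)

    \min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2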
NOTE: When the random errors are normally distributed, the OLS estimators of β0, β1, β2, … coincide with the
maximum likelihood estimators.
Ideally, we would want the predicted/fitted values to be sufficiently close to the corresponding observed
target values but not overly close, in order to avoid overfitting. On the test set, we want them to be as close as
possible so that our linear model is as predictive as possible.
Residual – the discrepancy between the observed value and the fitted value; it measures how well the fitted
model matches the ith training observation.
R^2 – a scaled goodness-of-fit measure of a linear model, defined as the proportion of the variation of the target
variable that can be explained by the fitted model.
R^2 is on a scale from 0 to 1; the higher the value of R^2, the better the fit of the model to the training set.
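In symbols (standard definition, not taken from the note's screenshot):

    R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}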
Problem with RSS and R^2 – They are merely goodness-of-fit measures of a linear model, with no explicit
regard to its complexity or prediction performance.
Traditional model selection methods: Hypothesis testing.
t-test – Loosely speaking, it is a measure of the partial effect of Xj on the target variable, i.e., the effect of
adding Xj to the model after accounting for the effects of the other variables in the model.
F-test – tests the joint significance of the entire set of predictors.
For AIC:
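In its standard form (with l the maximized loglikelihood and p the number of estimated parameters):

    AIC = -2l + 2p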
Ideally, we would like both terms to be small, but they are typically in conflict with each other – a model that fits
well to the training data is usually more complex.
For BIC:
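In its standard form (with n the number of training observations):

    BIC = -2l + p \ln(n)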
For a reasonably large sample size, the BIC penalty term is more stringent: an additional parameter has to
improve the goodness of fit by a lot for it to be included.
Model diagnostics – A set of quantitative and graphical tools that are used to identify evidence against the
model assumptions. If any apparent deficiencies are present, we may refine the specification of the model,
taking these deficiencies into account.
If a linear model is properly specified according to the model equation and the model assumptions are valid,
then the residuals should resemble the random errors and have the following “nice” properties:
Normal Q-Q Plot:
Used for checking the normality of the random errors. The points should lie closely along the 45-degree straight
line if the errors are normally distributed.
Systematic departures from the 45-degree straight line, often in the two tails, suggest that the
normality assumption is not entirely fulfilled.
3.2.3 Feature Generation
How do we specify a useful linear model?
Numeric Predictors
Basic form: We can interpret the regression coefficient β1 as follows – a unit increase in X is associated with an
increase of β1 in Y on average, holding all other predictors fixed.
Pros: by using polynomial regression, we are able to handle substantially more complex
relationships between the target variable and the predictors than linear ones.
Cons:
1. Interpretability – difficult to interpret. We can no longer say β1 is the expected change in Y
associated with a unit change in X, holding others fixed, because the polynomial also contains X^2, X^3, etc.
2. Choice of m – No simple rule as to how to choose the value of m (a hyperparameter).
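A minimal sketch of a polynomial fit in R, assuming a training frame train with target y and numeric predictor x (hypothetical names; the degree m = 3 is illustrative):

    fit_poly <- lm(y ~ poly(x, degree = 3), data = train)  # degree m is a hyperparameter
    summary(fit_poly)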
Method 2: Binning – using piecewise constant functions.
An alternative to polynomial regression for incorporating non-linearity into a linear model is to not treat a
numeric variable as numeric. Rather, we “bin” the numeric variable and convert it into an ordered categorical
variable with different levels that are defined as non-overlapping intervals over the range of the original
variable.
Pros: Binning liberates the regression function from assuming any particular shape; there is no definite order
among the coefficients of the dummy variables corresponding to different bins. The larger the number of
bins, the wider the variety of relationships between the target variable and the predictor we can fit, and the more
flexible the model; thus its bias drops, but its variance increases.
Cons: No simple rule on how many bins we should use, and there is a loss of information due to binning – we used
to have exact values but now only have a range of intervals.
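A binning sketch using cut(), with illustrative break points, assuming a numeric variable x in train:

    train$x_band <- cut(train$x, breaks = c(-Inf, 10, 20, 50, Inf))  # ordered categorical bins
    fit_bin <- lm(y ~ x_band, data = train)                          # one dummy coefficient per non-baseline bin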
An extension of binning is to use piecewise linear functions. The regression function is “linear” over different
intervals in the range of the numeric variables.
Pros: Simple but powerful for handling non-linearity. As we add more and more payoff functions with
different strike prices (break points), we are able to fit a wider range of non-linear relationships. Unlike binning, this
method recognizes the variation of the target mean within each piece and retains the original values
of X.
Cons: Break points have to be provided in advance.
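A piecewise-linear sketch using hinge (“call payoff”) terms pmax(x - c, 0), with illustrative break points at 10 and 20:

    fit_pw <- lm(y ~ x + pmax(x - 10, 0) + pmax(x - 20, 0), data = train)  # slope changes at each break point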
Categorical predictors
The problem with categorical predictors – they cannot be handled algebraically in the same way as numeric
predictors and require special treatment.
Binarization: an important feature generation method; it turns a given categorical predictor into a
collection of artificial “binary” variables (a.k.a. dummy variables), each serving as an indicator of one and
only one level of the categorical predictor.
For a given level, it creates a binary variable that equals 1 for every observation whose categorical predictor
assumes that level, and 0 otherwise. This way, it shows whether or not each observation possesses a certain
characteristic.
Interaction: arises if the association between the target variable and one predictor depends on the value of
another predictor.
Two or more predictors may interact with each other and affect the target variable on a joint basis.
β1X1 and β2X2 capture the main effects due to the two numeric predictors, and β3X1X2 produces the
interaction effect. NOTE: even if X1 and X2 are strongly correlated, that has nothing to do with whether the
relationship between X1 and Y varies with the value of X2.
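In R an interaction is specified with : (or * for main effects plus the interaction); a minimal sketch with hypothetical predictors x1 and x2:

    fit_int <- lm(y ~ x1 * x2, data = train)   # expands to y ~ x1 + x2 + x1:x2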
Sidebar: Collinearity
Definition: Collinearity exists in a linear model when two or more features are closely, if not exactly, linearly
related.
Problem with collinearity. Presence of collinear variables means that some of the variables do not bring
much additional information because their values could be largely deduced from the values of other
variables, leading to redundancy.
Variance inflation – such as wildly large positive coefficient estimate for one feature accompanied by
a similarly large negative coefficient estimate for its correlated counterpart.
Interpretation of coefficient – hard to interpret a coefficient estimate as the expected impact of the
corresponding feature on the target variable when other variables are held constant, because
variables strongly correlated with the given feature tend to move together.
Best subset selection: Fitting a separate linear model for each possible combination of the available features
and selecting the “best” subset of features to form the best model. NOTE: this method is computationally
prohibitive if we have too many features.
Remarks:
Intercept Beta0 is not part of the regularization penalty
Standardization of predictors. (same scale)
Pros:
R automatically binarizes categorical predictors, and each factor level is treated as a separate feature
that can be dropped, so we can assess the significance of each factor level rather than only the entire
categorical predictor.
Tuning by CV – using the same criterion MSE that will ultimately be used to assess the model against
unseen test data.
Cons:
Interpretability – as with ridge regression, all of the features may be retained, which hurts interpretability.
NOTE: EXERCISE 3.2.21 offers insights on differences between stepwise vs. regularization
3.3 Case Study 1: Fitting Linear Models in R
Univariate data exploration:
For Newspaper, the median is remarkably lower than its mean, indicating there is a certain amount of right
skewness in its distribution
Bivariate data exploration – pairwise relationship between target variable and each predictor
Correlation
TV and Radio are strongly positively correlated with sales, so spending more money on these two would
theoretically drive up sales.
3.3.2 Simple Linear Regression
OLS coefficient estimates of the fitted linear model
Based on the screenshot below:
NOTE: Median, mean and maximum – the mean exceeding the median suggests that the distribution of balance is
skewed to the right.
If there are zeros, the log transformation is not applicable; apply the square root transformation instead.
Split boxplots
Only Student (yes/no) shows a significant difference in balance; the other three factors show no significant
differences.
3.4.2 Model Construction and Feature selection
Married “No” is the baseline, as is Student “No”; we can say that the credit card balance of an
unmarried person is 20.46469 higher than that of a married person.
NOTE that Caucasian has a smaller standard error than Asian – sparse levels tend to have larger standard
errors because their coefficient estimates are based on less data.
Binarization – Converting categorical predictors with three or more levels into dummy variables.
Advantages – able to drop/add individual factor levels when doing feature selection.
Disadvantages – stepwise selection may take longer because there are many more factor levels to choose
from, increasing the computational burden.
Feature selection by the stepAIC() function – at each step, the feature whose addition (forward selection) or
removal (backward selection) produces the greatest improvement in the model, according to a certain
criterion, is added or removed (see the sketch after this list).
BIC tends to favor models with fewer features because it is more conservative: for a more complex
model to beat a less complex model in terms of BIC, the goodness of fit of the former has to far
surpass that of the latter to make up for the increase in the penalty term.
AIC, with its propensity to retain more features, gives the client the option of considering more
features and less risk of missing out on important features.
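A minimal stepAIC() sketch from the MASS package, assuming a fitted full model full_fit and a training frame train (hypothetical names); setting k = log(nrow(train)) swaps the AIC penalty for the BIC penalty:

    library(MASS)
    fit_back_aic <- stepAIC(full_fit, direction = "backward")                        # AIC (default k = 2)
    fit_back_bic <- stepAIC(full_fit, direction = "backward", k = log(nrow(train)))  # BIC penalty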
3.4.4 Regularization
Reduces the magnitude of the coefficient estimates via the use of a penalty term and serves to prevent overfitting.
For the glmnet() function to work, binarization of categorical variables is required because it only works with
quantitative inputs.
glmnet() also standardizes the predictors to put them on the same scale.
Lambda = 0, 10, 100, 500, 1000: the bigger the lambda, the stronger the penalty effect; a large enough penalty
is able to shrink the coefficients of predictors with limited predictive power all the way to exactly zero in the elastic
net setting.
One-standard-error (1-SE) rule – choose the simplest regularized regression model whose CV error is within “one
standard error” of the minimum error.
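A minimal sketch with glmnet, assuming a binarized model matrix X and target vector y (hypothetical names); alpha = 1 gives the lasso, alpha = 0 ridge regression:

    library(glmnet)
    set.seed(1)
    cv_fit <- cv.glmnet(x = X, y = y, alpha = 1)   # CV over a grid of lambda values
    cv_fit$lambda.min                              # lambda with the lowest CV error
    cv_fit$lambda.1se                              # simplest model within one SE (1-SE rule)
    coef(cv_fit, s = "lambda.1se")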
4.1 Conceptual Foundations of GLMs (differences between
linear model and GLMs)
Distribution of the target variable – the target variable only needs to be a member of the so-called
exponential family of distributions, whereas a linear model needs it to be normally distributed.
Relationship between the target mean and the linear predictor – a GLM sets a function of the target mean to
be linearly related to the predictors.
Component 2: The link function – Given the target distribution, we can proceed to specify the link function.
Positive mean – for the Poisson, gamma and inverse Gaussian distributions, the mean is positive and
unbounded from above; we use the log link for these.
Appropriate predictions – unit-valued mean: for binary target variables, the target mean is the
probability that the event of interest occurs, which is always between 0 and 1. Thus, the link function
should ensure that the mean implied by the GLM is unit-valued. We use the logit link:
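In symbols (standard form), with π the event probability and η the linear predictor:

    \ln\!\left(\frac{\pi}{1-\pi}\right) = \eta, \qquad \text{equivalently} \qquad \pi = \frac{e^{\eta}}{1+e^{\eta}}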
IMPORTANT! Interpretation of GLM coefficients – quantifying the change in the target mean in response to
changes in the predictors in terms of the model coefficients. This depends on the choice of link function.
Log link: it always ensures positive predictions and is easy to interpret – a unit change in a numeric
predictor with coefficient βj is associated with a multiplicative change of exp(βj) in the target mean.
For a categorical variable, the same multiplicative change applies when the predictor moves from its
baseline level to another level.
Logit link – almost always used for binary data; it is a log link applied to the odds – a
unit change in a numeric predictor with coefficient βj is associated with a multiplicative change of
exp(βj) in the odds. It ensures the model predictions are always valued between 0 and 1 and makes the model
easily interpretable.
Weights – Observations in grouped data with a larger exposure have a smaller variance and a higher
degree of precision. Thus, we attach a higher weight to these more credible observations so that they carry
more weight in the estimation of the GLM coefficients. The goal is to improve the reliability of the fitting
procedure.
Offsets – The larger the number of policies, the larger the total number of claims (the target variable).
Thus the number of policies can be set as an “offset” to better account for the total number of claims; the
number of policies is in direct proportion to the total number of claims.
Weights – to use weights properly, the observations of the target variable should be averaged by
exposure. Due to the averaging, the variance of each observation is inversely related to the size of the
exposure. Weights do not affect the mean of the target variable directly.
Offsets – Observations are aggregated over the exposure units. The exposure, serving as an offset, is
directly proportional to the mean of the target variable but leaves its variance unaffected.
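A minimal sketch of the two devices in glm(), assuming grouped data grp with claim counts averaged per policy (freq), aggregate counts (total_claims), exposure counts (policies) and illustrative predictors agecat and area (all hypothetical names; non-integer averages will trigger a harmless warning with the Poisson family):

    # weights: the target is averaged over exposure, and each row is weighted by its exposure
    fit_w <- glm(freq ~ agecat + area, data = grp, family = poisson(), weights = policies)
    # offset: the target is the aggregate count; log(exposure) enters with its coefficient fixed at 1
    fit_o <- glm(total_claims ~ agecat + area, data = grp, family = poisson(), offset = log(policies))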
1.1 Deviance reduces to RSS for linear model – we can view deviance as a
generalization of RSS that works even for non-normal target variables in the
exponential family.
1.2 Deviance should be used carefully! – We can ONLY compare GLMs with the same
target distribution, so that they share the same maximized likelihood of the saturated
model.
1.3 Residuals derived from deviance
NOTE: The deviance of a GLM parallels the RSS of a linear model in the sense that it is merely a goodness-of-
fit measure on the training set and always decreases when new predictors are added.
A Traditional model selection method: Likelihood ratio test. – a generalization of the t-test and F-
test.
Restrictions of the LRT:
Case 1 – If the predicted probability is higher than the cutoff, then the event is predicted to occur
Case 2 – if the predicted probability is lower than the cutoff, then the event is predicted not to
occur.
Classification error rate:
Sensitivity – True positive rate, the proportion of positive observations in the data that are correctly
classified as positive.
Specificity – True negative rate, the proportion of negative observations in the data that are
correctly classified as negative.
Precision – the proportion of positive predictions that truly belong to the positive class.
Relationship between accuracy, sensitivity and specificity – Accuracy is a weighted average of
sensitivity and specificity.
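All of these can be read off a 2x2 confusion matrix; a minimal sketch assuming factor vectors pred and actual with levels "0" and "1" (hypothetical names):

    tab <- table(pred, actual)            # rows = predicted class, columns = actual class
    TP <- tab["1", "1"]; TN <- tab["0", "0"]
    FP <- tab["1", "0"]; FN <- tab["0", "1"]
    accuracy    <- (TP + TN) / sum(tab)
    sensitivity <- TP / (TP + FN)         # true positive rate
    specificity <- TN / (TN + FP)         # true negative rate
    precision   <- TP / (TP + FP)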
Effects of changing the cutoff – the selection of the cutoff is a trade-off between having high sensitivity
and having high specificity.
Extreme case 1: If the cutoff is 0, then every observation is predicted positive. Thus, we will have
sensitivity equal to 1 but specificity equal to 0.
Increasing the cutoff: As we increase the cutoff, more and more observations will be classified as
negative. Thus the specificity will increase at the cost of losing some sensitivity.
Extreme case 2: If the cutoff is 1, then every observation is predicted negative. Thus sensitivity = 0 but
specificity = 1.
ROC Curves (Receiver Operating Characteristic) – plotted in (specificity, sensitivity) coordinates, the curve begins at (1,0)
and ends at (0,1), corresponding to cutoffs of 1 and 0 respectively.
The predictive performance of a classifier can be summarized by computing the area under the curve (AUC);
the exact value of the AUC may NOT mean much for the quantitative assessment of a classifier – it is the
relative value of the AUC that matters. The higher, the better.
AUC = 1: the highest possible value of the AUC; a classifier with this value has perfect
discriminatory power and is able to separate all the observations correctly.
AUC = 0.5: a naïve classifier that classifies observations purely randomly, without using the information
contained in the predictors.
AUC = 0: a bad classifier that always classifies observations incorrectly.
Sidebar: Unbalanced Data – One class of the binary target variable is much more dominant than the other
class in terms of the proportion of observations.
Solutions:
Undersampling: it produces roughly balanced data by drawing fewer observations from the negative
class (the undersampled class) and retaining all of the positive-class observations.
Drawback – the classifier is now based on less data, so it could be less robust and more prone to
underfitting.
Oversampling: keeps the original data but oversamples (with replacement) the positive class to reduce
the imbalance between the two classes. NOTE: it is performed only on the training set, after the
training/test split. Otherwise, our test set might contain some of the same observations as the training
set and would not be completely unseen.
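A sketch of both remedies using caret's downSample()/upSample(), applied to the training set only and assuming a binary factor target clm (hypothetical name):

    library(caret)
    set.seed(7)
    train_down <- downSample(x = train[, names(train) != "clm"], y = train$clm, yname = "clm")
    train_up   <- upSample(x = train[, names(train) != "clm"], y = train$clm, yname = "clm")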
For inj with 6 levels – treating it as numeric implicitly assumes a global monotonic relationship
with claim amounts across all injury groups. To relax this restriction, we can convert inj to a factor.
To get a feel for how the OLS model performs, we compare its test RMSE with the test RMSE of the
intercept-only model fitted to the training set, which uses the mean of amt on the training set as the
prediction.
GLM 1: Log-link normal GLM on claim size.
This GLM uses the log link to connect the mean claim size to the linear predictor, which ensures positive
predictions. A drawback is that the normal distribution still allows for the possibility that observations of the
target variable are negative.
Using the log link ensures positive predictions, a characteristic of the target variable, and it makes the model
coefficients easier to interpret.
For “Residuals vs. Fitted” – most of the residual points scatter around 0 with no special pattern,
with some positive outliers such as 5422, 10224 and 18870; there seems to be more spread among the larger
positive residuals than the negative ones.
For the “Normal Q-Q plot” – this allows us to assess the normality of the standardized deviance
residuals. Most of the points lie on the 45-degree line from the left end to the middle but start to deviate
towards the right end, indicating some skewness that the gamma distribution fails to
handle; a distribution with a fatter tail may be more suitable.
Since the gamma GLM uses the log link, the expected claim size is multiplied by exp(βj) for every
unit increase in a numeric predictor with coefficient βj, or when a qualitative predictor moves from its baseline
level to a new level represented by a dummy variable with coefficient βj, holding everything else constant.
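A minimal sketch of such a log-link gamma GLM for claim size, with a hypothetical formula (predictors inj, legrep and op_time) and positive claim amounts amt in train:

    glm_gamma <- glm(amt ~ inj + legrep + op_time, data = train,
                     family = Gamma(link = "log"))
    exp(coef(glm_gamma))   # multiplicative effects on the expected claim size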
4.3 Case study 2: GLMs for Binary Target Variable
Target leakage – remove any variables whose values would not be known until the target variable is
observed, since they leak information about the target.
Questionable data that do not make sense – negative values of age, for example.
Numeric Predictors – use boxplots to see the relationships between the target variable and the numeric variables.
Vehicle value does not have much of an impact on claim occurrence.
Exposure has a significant impact on claim occurrence, as expected: the more one drives, the more
likely one is to have an accident.
Categorical Predictors – Combine similar levels in a factor to reduce dimension of the data.
Binarization – binarizing a feature with multiple levels allows us to view each level of the categorical predictor
separately and to add/remove one level at a time.
Odds-based interpretation –
For every unit increase in log_veh_value, the odds of claim occurrence increase by 1.1716752 – 1 = 17.17%, holding
everything else constant. The higher the vehicle value, the more likely a driver will submit a claim.
For agecat.1 (relative to the baseline age category), the odds of occurrence are higher by 1.2934442 – 1 =
29.34%, holding everything else constant. The younger the driver, the higher the chance the driver
will experience a claim.
Probability-based interpretation – the increase in the predicted probability due to a unit increase in a predictor Xj
depends on the current values of the predictors; here the “average policyholder” is represented by the first row.
A 10% increase in exposure leads to a 0.0785 – 0.0718 = 0.0067 increase in the predicted probability of
claim occurrence.
Being in agecat.5 changes the predicted probability of claim occurrence by 0.0566 – 0.0718 = -0.0152, i.e., a decrease.
4.4 Case Study 3: GLMs for count and Aggregate loss variable.
4.4.1 Data Exploration and Preparation
The number of claims
The sum of claim payments
Picking and justifying offsets.
Insured years is a good offset for the number of claims – the longer the policy has been insured, the larger the
number of claims submitted.
The number of claims is a good offset for total payments – the more claims submitted, the larger the total
payment.
Some additional observations. – factor conversions/ Univariate and bivariate data exploration.
4.4.3 Predictions
Predict Claims first using count prediction model and then predict PaymentPerClaim using severity
model, finally get the product of them.
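A minimal two-stage prediction sketch on a test set, assuming fitted models count_fit (for Claims) and sev_fit (for PaymentPerClaim), all hypothetical names:

    pred_claims   <- predict(count_fit, newdata = test, type = "response")  # expected claim counts
    pred_severity <- predict(sev_fit,   newdata = test, type = "response")  # expected payment per claim
    pred_total    <- pred_claims * pred_severity                            # predicted aggregate payments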
Chapter 5
Tree-Based Models
Decision trees divide the feature space into a finite set of non-overlapping regions containing relatively
homogeneous observations that are more amenable to analysis and prediction.
How it works: To predict a given observation, we simply locate the region to which it belongs and use the
mean or the most common class of target variable in that region as prediction.
Node – a point on a decision tree that corresponds to a subset of the feature space.
Binary tree: decision tree in which each node is followed by two sub-nodes.
Depth: the number of tree splits needed to go from the root node to the furthest terminal node; a measure of the
complexity of a decision tree.
Regression problems: for a numeric target variable, use the average of the target variable in the region as the
predicted value. To quantify the variability of a numeric target variable in a particular node of a decision tree,
we use the residual sum of squares – the sum of the squared discrepancies between the observed values of the
target variable and the fitted value (the node mean) over all training observations in the node. The lower, the better.
Classification problems: use the most common class of the target variable in that region as the predicted
value. Classification error rate – a measure of the impurity of a tree node.
Construction of decision trees: Recursive binary splitting – every time we make a binary split, there are
two inter-related decisions:
The predictor used to form the split
Given the predictor, the cutoff value
Each split should lead to the greatest information gain or, equivalently, the greatest reduction in impurity.
The classification error rate cares only about the maximum class proportion (the larger that proportion, the lower
Em) and does not say enough about node impurity; the Gini index and entropy are more sensitive measures of impurity.
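The three impurity measures in symbols (standard definitions, with \hat{p}_{m,k} the proportion of class k in node m):

    E_m = 1 - \max_k \hat{p}_{m,k}, \qquad
    G_m = \sum_k \hat{p}_{m,k}\,(1 - \hat{p}_{m,k}), \qquad
    D_m = -\sum_k \hat{p}_{m,k} \ln \hat{p}_{m,k}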
Controlling tree complexity by cost-complexity pruning – Although decision trees can be fitted efficiently,
they can also easily overfit, capturing not only the signal but also the noise of the training data.
When the penalty parameter cp = 0, there is no price to pay for making the tree more complex, and the tree is
identical to the large, overblown tree grown without pruning.
As cp increases from 0 to 1, the penalty term becomes greater and tree branches are snipped off
retroactively to form new, larger terminal nodes. The squared bias of the tree predictions will increase
but the variance will drop.
When cp = 1, we get the trivial tree: the drop in the relative training error due to an extra split can never
compensate for the increase in the complexity penalty.
A nested family of trees: if cp1 < cp2, then cp2 must produce a smaller tree, and all of its nodes, terminal
and non-terminal, must form a subset of the nodes of the larger tree fitted with cp1.
Hyperparameter tuning: use cross-validation. For a given cp value, fit a tree on
all but one of the k folds, generate predictions on the held-out fold, and measure the performance
based on the RSS for a regression tree or the classification error rate for a classification tree. Repeat this
process with each of the k folds left out in turn and compute the overall performance. Choose the
cp value with the lowest RSS or classification error rate.
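A minimal fit-and-prune sketch with rpart, assuming a training frame train with numeric target amt (hypothetical formula):

    library(rpart)
    set.seed(10)
    full_tree <- rpart(amt ~ ., data = train, method = "anova",
                       control = rpart.control(cp = 0))           # large, overblown tree
    cp_min <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
    pruned <- prune(full_tree, cp = cp_min)                       # prune back to the CV-error minimizer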
[IMPORTANT!] GLM vs. decision tree.
Numeric predictors:
1. GLMs: they assume that the effect of a numeric predictor on the target mean is monotonic. Non-
monotonic relationships can be captured by introducing higher-power polynomial terms.
2. Trees: Tree splits are made, possibly repeatedly, separating the training observations into distinct
groups according to the ordered values of the predictor. The predicted mean (regression) or probability
(classification) in these groups is allowed to behave irregularly as a function of the numeric predictor, without
imposing a monotonic structure on the predicted mean. Thus, trees excel at handling complex
relationships.
Categorical predictors:
Interactions:
Collinearity:
Variable transformations
Merits
1. Interpretability – Trees can be displayed graphically, which makes them easier to interpret. As long as there
are not too many buckets, trees are easy to explain to non-technical audiences.
2. Handling complex relationships between predictors and target – trees capture non-linear relationships
and interactions between the target variable and the predictors automatically, unlike GLMs, which need
polynomial or interaction terms to be inserted manually before the model is fitted.
3. Categorical variables – able to handle categorical variables without the need for binarization or
selecting a baseline level.
4. Variable selection – Trees automatically select important variables, which show up at the top nodes, and
filter out unimportant ones. GLMs, in contrast, include all variables and require stepwise
selection or regularization to filter out the irrelevant ones.
Demerits
1. Overfitting – More prone to overfitting; trees tend to produce unstable predictions with a high variance,
even with pruning.
2. Numeric variables – To capture the effect of a numeric variable, we need to make tree splits based
on this variable repeatedly. This gives rise to a complex, large-depth tree that is hard to interpret.
A GLM only requires a single coefficient to be estimated for a numeric predictor (assuming a
monotonic relationship).
3. Categorical variables – trees tend to favor variables with a large number of levels over those with
fewer levels. This can lead to a large information gain on the particular training data that is not
actually true signal – again, overfitting. (Combining factor levels into more representative groups prior
to fitting the decision tree can remedy this.)
Motivation behind ensemble methods – a decision tree is sensitive to noise and tends to overfit even with
pruning. With small perturbations in the training data, the fitted decision tree could be vastly different and
the resulting predictions can be highly unstable.
Ensemble method – Allows us to hedge the instability of decision trees and substantially improve their
prediction performance in many cases. Instead of relying on one model, we take the results of multiple base
models in aggregate to make an overall prediction.
Bias – Capturing complex relationships in the data due to use of the multiple base models each
working on different parts of the complex relationships.
Variance – Improving the stability of overall predictions by averaging the predictions of the individual
base models.
Drawback – computationally prohibitive to implement and difficult to interpret due to the need to
deal with hundreds and thousands of base models.
Bootstrap: samples generated by randomly sampling the original training observations with
replacement.
Random forest: generate multiple bootstrapped samples of the training set and fit unpruned
base trees in parallel, independently, on each of the bootstrapped training samples.
Numeric target variable: The overall prediction is the simple average(default) of the B base
predictions.
Categorical target variable: “Majority vote” approach, pick the predicted class that is most
commonly occurring.
Randomization at each split. A random forest injects a randomization element into the
growing process of the base trees.
At each split, a random sample of m (less than or equal to p) predictors is chosen as the split candidates
out of the p predictors. The candidate that produces the greatest impurity reduction is used to construct the
split. A new random sample of m predictors is drawn at each split, so a predictor that is sampled at one
split can still be sampled and used at another split.
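A minimal sketch with the randomForest package, assuming a training frame train with numeric target amt (hypothetical formula; the ntree and mtry values are illustrative):

    library(randomForest)
    set.seed(20)
    rf_fit <- randomForest(amt ~ ., data = train,
                           ntree = 500,   # number of bootstrapped base trees
                           mtry  = 3)     # predictors sampled as split candidates at each split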
Merit:
1. A random forest is more robust: although the base trees are unpruned and therefore have low bias and high
variance, averaging the results of these trees contributes a substantial variance reduction, especially
when the number of base trees is large, and produces much more precise predictions. (Bias is generally
not reduced by a random forest.)
Demerits:
1. Interpretability – random forests are difficult to interpret due to the multiple base models; it is difficult to see
how the predictions depend on each feature.
2. Computational power – it takes longer to implement compared to a single base decision tree.
5.1.3 Ensemble trees: Boosting – boosting builds a sequence of interdependent trees, each using information from the
previously grown tree. We fit a tree to the residuals of the preceding tree and subtract a scaled-down version
of the current tree's predictions from the preceding tree's residuals to form the new residuals. The whole process is
repeated, with the effect that each tree focuses on predicting the observations that the previous tree
predicted poorly.
The overall prediction of a boosted model is the sum of the scaled-down predictions of the base trees.
1. minsplit: the minimum number of observations that must exist in a node for a split to be attempted (the
smaller, the more complex the model).
2. minbucket: the minimum number of observations that must exist in any terminal node.
3. cp: complexity parameter, penalizing the tree by its size when cost-complexity pruning is
performed. The higher the cp, the heavier the penalty and the simpler the model.
4. maxdepth: the maximum number of branches from the root node to a terminal node.
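These four controls are passed to rpart() through rpart.control(); a minimal sketch with illustrative values and a hypothetical formula:

    library(rpart)
    ctrl     <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01, maxdepth = 10)
    tree_fit <- rpart(y ~ ., data = train, control = ctrl)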
Turning the numeric Wage into a binary Wage_flag that equals 1 for high earners (wage greater than or equal to
$100K a year) and 0 otherwise.
For a regression tree, the splits are chosen to MINIMIZE the residual sum of squares. For a right-skewed
target variable, the large observed values will exert a disproportionate effect on the RSS calculations.
GLMs assume a monotonic relationship between the target variable and the predictors.
Tree 2: Pruning Tree 1 (full tree) using a minimizer of Xerror – Choosing the one with the lowest CV Error
(xerror).
Tree 3: Pruning Tree 1 using the one-standard-error rule – To select the smallest tree whose CV error is
within “one-standard-error” from the minimum CV error.
mtry: among all the parameters of a random forest, the number of features considered at each split,
represented by mtry. The larger the value of mtry, the less variance reduction we get, because the
base trees are more similar to one another. E.g., if mtry = 2, then only two randomly sampled predictors are
considered at every split of every base tree built.
Ntree: Building more base trees tends to improve prediction performance of a random forest. With more
base trees, the variance reduction contributed by averaging becomes more significant and the model
prediction become more precise.
Variable importance
eta: shrinkage parameter/learning rate, 0 < eta < 1. The higher the learning rate, the faster the model will reach
optimality and the fewer iterations required, though the resulting model is more likely to overfit.
For larger eta, accuracy starts to drop beyond about 100 rounds (nrounds): the larger the eta, the faster the model
converges, which leads to overfitting.
For smaller eta, accuracy first rises – a sign that the base trees of the boosted model are
learning the signal more and more effectively – but when nrounds goes beyond 150 or 200, accuracy
starts to drop.
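A minimal boosting sketch with xgboost, where eta and nrounds play against each other as described; X is assumed to be a numeric feature matrix and y the target vector (hypothetical names and values):

    library(xgboost)
    dtrain  <- xgb.DMatrix(data = X, label = y)
    fit_xgb <- xgb.train(params = list(objective = "reg:squarederror",
                                       eta = 0.05,       # shrinkage / learning rate
                                       max_depth = 3),
                         data = dtrain,
                         nrounds = 200)                  # number of boosting iterations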
Principal components analysis (PCA) is an advanced data analytic technique that transforms a high-
dimensional dataset into a smaller, more manageable set of representative variables that capture most of the
information of the original dataset.
Applications of PCA.
1. Application 1: data visualization – Via PCA, we reduce the dimension of the
original dataset from p variables to a smaller set of variables while retaining most of the information,
as measured by variance.
2. Application 2: Feature generation – Unsupervised learning can be applied to improve predictive
modeling outcomes. NOTE: the new PCs created from the original variables are
mutually uncorrelated, so one variable being confounded by another is no longer an issue. By
reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-
variance trade-off and improve the prediction performance of the model.
Proportion of variance explained. One problem with PCA is deciding how many PCs, m, we should use.
We want to choose the smallest m required to visualize or understand the data well. This can be done by
assessing the variance explained by each PC in comparison to the total variance.
Choosing the number of PCs by a scree plot. – Choose the smallest number of PCs required to explain a
sizable proportion of variance.
Importance of centering and scaling. – another issue when applying PCA, whether to mean-center the
variable and/or scale the variables to have unit variance.
Centering: whether a variable has its sample mean subtracted (to yield mean zero) should
have NO effect on the PC loadings. The PC loadings are defined to maximize the variance of the PC scores, and
variance remains unchanged when the values of a variable are shifted by a constant.
Scaling:
1. If we conduct PCA on their original scale, we are determining the PC loadings based on the
COVARIANCE matrix of the variables.
2. If we conduct PCA on their standardized variables, we are determining the PC loadings based
on the CORRELATION matrix of the variables.
Scaling is important to the PCA results: if a dataset contains variables with vastly different
magnitudes (one variable ranges from 0 to 10 and another from 0 to 100,000), then, because the PC loadings
are defined to maximize the sample variance, the variable with the larger variance will be assigned a larger weight.
However, there is no guarantee that this variable explains much of the underlying patterns in the data. It is not
desirable for the results of a PCA to depend on an arbitrary choice of scaling; thus, scaling is recommended.
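A minimal PCA sketch with centering and scaling, assuming a numeric data frame dat (hypothetical name):

    pca <- prcomp(dat, center = TRUE, scale. = TRUE)  # PCA on the correlation matrix
    summary(pca)                                      # proportion of variance explained by each PC
    pca$rotation                                      # PC loadings
    head(pca$x)                                       # PC scores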
PCA with categorical features – We need to binarize the variables in advance, then run PCA on the
dummy variables and use the first few PCs to summarize most of the information.
Drawbacks of PCA.
Interpretability – PCs can be complicated linear combinations of the original features. What do they
actually represent?
Not good for non-linear relationships – by construction, PCA uses linear transformations to
summarize and visualize high-dimensional datasets in which the variables are highly linearly correlated.
PCA is not feature selection – PCA reduces the dimensionality of the data, but it is not feature
selection, because each PC is a linear combination of the original features and none of the original features
are removed.
The target variable is ignored. When using PCs in a supervised learning setting, the PC loadings and scores
are generated completely independently of the target variable. We are assuming that the feature directions
exhibiting the most variation are also the ones most associated with the target variable; often there is no guarantee
this is true.
Data description.
Murder/Assault/Rape are strongly positively correlated. This suggests that PCA may be an effective
technique to “compress” these three crime variables into one single measure.
UrbanPop does not have a strong linear relationship with the three crime-related variables.
Interpretation of the PCs.
Interpretation of biplot.
6.2 Cluster Analysis
Cluster analysis works by algorithmically partitioning the observations in a dataset into a set of distinct, non-
overlapping subgroups, better known as clusters, with the goal of discovering hidden patterns in the data
that would otherwise be missed.
Observations within each cluster share similar characteristics. (Measured by feature values.)
Observations in different clusters are rather different from one another.
6.2.1 K-means Clustering – Specify the number of clusters K upfront and assign each observation to one and
only one of the K clusters, each of which hosts relatively homogeneous observations.
Within-cluster SS and total SS. The variation of the observations within each cluster is quantified by the
within-cluster sum of squares (SS), defined as the sum of squared distances between each observation in
Ck and the centroid of Ck.
The smaller the within-cluster SS, the more homogeneous the observations within each cluster and the
better the separation works.
K-means clustering algorithm. It produces a local (not necessarily global) optimum; finding the global optimum is computationally infeasible when K and n are large.
Step 1: Initialization – Randomly select K points in the feature space; these will serve as the initial
cluster centers.
Step 2: Iteration – repeat the following steps until the cluster assignments no longer change.
a. Assign each observation to the closest cluster centroid (using Euclidean distance).
b. Recalculate the centroid of each of the K clusters.
The algorithm involves iteratively recalculating the K means (centroids) of the clusters, hence the name “K-means”
clustering.
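A minimal sketch in R, assuming a scaled numeric data frame dat_scaled (hypothetical name); nstart repeats the random initialization to mitigate the local-optimum issue:

    set.seed(30)
    km <- kmeans(dat_scaled, centers = 3, nstart = 25)
    km$cluster        # cluster assignment of each observation
    km$tot.withinss   # total within-cluster sum of squares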
Hierarchical clustering consists of a series of fusions of observations in the data. It starts with the
individual observations, each treated as a single cluster, and successively fuses the closest pair of clusters,
one pair at a time. The process goes on iteratively until all the clusters are fused into a single cluster
containing all of the observations. The closeness of two clusters is measured by an inter-cluster dissimilarity (linkage):
Average linkage – compute all pairwise distances between observations in the two clusters and then take the average.
Centroid linkage – take the average of the feature values in each cluster to get the centroids, then compute the
distance between the two centroids.
Dendrogram. An upside-down tree showing the sequence of fusions and the inter-cluster dissimilarity at which each fusion occurs.
Clusters are nested – The clusters formed by cutting the dendrogram at one height must be nested
within those formed by cutting the dendrogram at a greater height, e.g., {1, 3, 4} and {2, 5} at height 10,
then {2, 5}, {1}, {3, 4} at height 4.
One dendrogram suffices – The height of the cut controls the number of clusters, and it need not
be pre-specified.
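A hierarchical clustering sketch on the same scaled data; cutree() cuts the dendrogram at a chosen number of clusters:

    hc <- hclust(dist(dat_scaled), method = "average")  # average linkage on Euclidean distances
    plot(hc)                                            # dendrogram
    clusters <- cutree(hc, k = 4)                       # cut to obtain 4 clusters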
With p variables larger than or equal to 3, visualization is much harder. With 2 dimensions, we can visualize
the cluster assignments by using two-dimensional scatterplot.
As the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate
between observations that are close and those that are far apart.