Exam PA Notes

Chapter 2 Data Exploration and Visualization

Common graphs include the scatterplot, histogram, boxplot, and bar chart.

2.1 Making ggplots

2.1.1 Basic Features

aes() determines the roles that different variables play in the plot, by mapping them to aesthetics such as color, size, and shape.

geom_*() functions (geom_point(), geom_line(), geom_histogram(), geom_boxplot(), ...) determine how the mapped data are drawn and can also be used to modify visual characteristics directly.

Case Study: Personal injury insurance dataset. (NOTE: This case study illustrates how to
manipulate aes() and geom_*() to build a scatterplot.)

In short, aes() determines what relationships we want to see in the plot (e.g., mapping a variable to color),
while geom_*() determines how we want to see them (e.g., drawing every point in red or in blue).
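A minimal sketch of the aes()/geom_*() division of labor (the data frame persinj and the variables amt, op_time and legrep are assumptions based on the personal injury case study):

  library(ggplot2)
  # aes() says WHAT to map: operation time on x, claim amount on y, color by legal representation
  ggplot(persinj, aes(x = op_time, y = amt, color = factor(legrep))) +
    # geom_*() says HOW to draw it: points, plus a fitted line for each color group
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE)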

2.1.2 Customizing Your Plots (Fine-Tuning a ggplot)


2.2 Data Exploration

Exploratory data analysis (EDA) serves two main purposes:


 Data validation – Check for potential data errors or unreasonable values, such as a negative
age, and remove these inappropriate data values prior to analysis.
 Characteristics of variables – Understanding the variables helps us decide on an appropriate type of predictive
model that is likely to meet our business needs.

2.2.1 Univariate Data Exploration – exploration that sheds light on the distribution of only ONE
variable at a time.

 Numeric Variables

Descriptive statistics – statistical summaries are mainly used to reveal two aspects of the
distribution of a numeric variable:
 Central tendency – mean/median.
 Dispersion – Common measures of dispersion include the variance, standard deviation, and
interquartile range. All measure, in one way or another, how spread out the values of the numeric
variable are over its range.

 Graphical displays – Visualize the distribution of numeric variables using
histograms, boxplots, etc.

Statistical summary – a larger standard deviation means the values are more dispersed (more spread out).

Histogram – adjusting the number (or width) of the bins can lead to better visualization; fewer, wider bins
give a smoother histogram, but smoother is not necessarily better.

Problem with skewed data and possible solutions – In predictive modeling, it is often
undesirable to have a right-skewed target variable for two reasons:

 Predictive power – Our objective is to study the association between the target variable
and predictors in the data over a wide range of variable values. If most of the
observations of the target variable cluster narrowly within a small-value range, it will be
difficult to investigate the effect of predictors on target variable globally.
 Model fitting – a number of predictive models such as linear model/decision trees are
fitted by minimizing the sum of squared discrepancies between observed values and the
predicted values of the target variables. If target variable is right skewed, then the
outliers or extreme values will contribute substantially to the sum and have a
disproportionate effect on the model.
To correct skewness – Apply a monotone concave function to make the distribution more
symmetric.
 Log transformation – can be applied to variables of interest that are strictly positive.
 Square root transformation – can be applied to non-negative variables (zeros allowed).
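A minimal sketch of the two transformations in R (the data frame df and the skewed variable amt are hypothetical names):

  df$log_amt  <- log(df$amt)    # only valid for strictly positive values
  df$sqrt_amt <- sqrt(df$amt)   # works for non-negative values, including zeros
  hist(df$log_amt)              # re-inspect the distribution after transforming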

Boxplots – a convenient graphical aid for visualizing the distribution of a numeric variable.

Categorical Variables

Case study 1: Given raw data

 Descriptive statistics – Frequency tables: categorical variables do not always
have a natural order, so the mean and median may not make sense. Frequency tables
help us understand the distribution of a categorical variable.
 Graphical displays – Bar charts: the more levels a categorical variable has, the
more difficult its frequency table is to read. Bar charts extract the information from a
frequency table and present the counts visually, highlighting the relative frequency of each
level of the variable.

Case study 2: Given summarized data

 The geom_bar() function is for visualizing the distribution of a categorical variable given
individual (raw) data, while the geom_col() function serves the same purpose given
grouped (summarized) data.
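A short sketch of the distinction (hypothetical data frames: raw_df has one row per observation, summary_df has one row per level with a count column n):

  # raw data: geom_bar() counts the rows in each level for you
  ggplot(raw_df, aes(x = inj)) + geom_bar()
  # summarized data: geom_col() uses the counts you already have
  ggplot(summary_df, aes(x = inj, y = n)) + geom_col()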

2.2.2 Bivariate Data Exploration – there are three types of bivariate combinations

Combination 1: numeric vs. numeric


 Descriptive statistics: the correlation coefficient summarizes the linear relationship; the larger
the correlation in magnitude, the stronger the degree of linear association between the
two variables.
 However, two variables can have zero correlation (no linear relationship) and still
be related in more subtle, non-linear ways.
 Graphical display: the relationship between two numeric variables is typically visualized
by a scatterplot. Incorporating a categorical variable into the scatterplot (e.g., via color) lets us inspect
whether the relationship between the two numeric variables varies with the levels of the
third variable (an interaction).
Combination 2: numeric vs. categorical
 To summarize the association between a numeric and a categorical variable, we can
partition the data into subsets, one subset for each level of the categorical variable, and
compute the mean of the numeric variable in each subset. Conditional means that
vary substantially across levels suggest a strong relationship between the two variables.
 Graphical display: the conditional distribution of a numeric variable given a second,
categorical variable is best visualized by split boxplots.

We can see that the claim amount is larger for claims with legal representation than without,
regardless of injury code.
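A split-boxplot sketch (again assuming a data frame persinj with amt, inj and legrep, as in this case study):

  ggplot(persinj, aes(x = factor(inj), y = amt, fill = factor(legrep))) +
    geom_boxplot() +
    scale_y_log10()   # a log scale helps when the claim amounts are heavily right-skewed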

Combination 3: Categorical vs. categorical

 Descriptive statistics: frequency tables to compare counts; graphical displays: split bar charts to
visualize proportions.
3.1 A Primer on Predictive Analytics

3.1.1 Basic Terminology


 Descriptive – Focuses on what happened in the past and aims to explain the observed trends by
identifying the relationships between different variables in the data.
 Predictive – Focuses on what will happen in the future and is concerned with making accurate
“predictions”.
 Prescriptive – Uses combinations of optimization and simulation to investigate and
quantify the impact of different “prescribed” actions in different scenarios.

Classification of Variables:
 Target variable – interested in predicting.
 Predictors – Associated variables which can be used to predict the target.

Numeric Variables:
1. Discrete
2. Continuous
Categorical Variables:
1. Gender
2. Smoking status

Supervised vs. Unsupervised


 Supervised learning problems: There is a target variable “guiding” our analysis. Our goal is to understand
the relationship between the target variable and the predictors, and/or make accurate future
predictions for the target based on the predictors. (GLMs and decision trees)
 Unsupervised learning problems: There is no target variable; we are interested in extracting relationships
and structures between different variables in the data. (Principal components analysis and cluster analysis)
Regression vs. Classification problems
Supervised learning problems with a numeric target variable are referred to as regression problems; those with a
categorical target variable are classification problems.
3.1.2 Model Building Process

Stage 1: Problem Definition


 Characteristics of predictive modeling problems – a business issue that needs to be addressed/
the issue can be addressed with a few well-defined questions/ good and useful data is available
for answering the questions/ impact – the predictions will likely drive actions or increase
understanding/ better solution – PA will likely produce a solution better than any existing
approach/ update – continue to monitor and update the models when new data becomes
available.
 Problem definition – e.g., hypothesis: the more actuarial exams passed, the higher the salary.
 Constraints – availability of easily accessible and high-quality data/ implementation issues.

Stage 2: Data collection and validation


Data design
 Relevance
1. Population – it is important that the data source aligns with the true population we
are interested in.
2. Time Frame – recent history is more predictive of the near future than distant
history.
Sampling – Taking a subset of observations from the data source to generate our dataset.
 Random Sampling
1. Randomly draw observations from the underlying population without replacement
until we have the required number of observations. (Note that respondent bias
sometimes exists)
 Stratified Sampling
1. Stratified sampling involves dividing the underlying population into a number of
non-overlapping groups in a non-random fashion, and randomly sampling (with or
without replacement) a set number of observations from each group. This sampling
method has the notable advantage of ensuring that every stratum is properly
represented in the collected data.
Granularity – how precisely a variable in a dataset is measured, i.e., how detailed the information
contained by the variable is.

Data quality issues:


 Reasonableness: for a dataset to be useful, the data values should, at a minimum, be
reasonable.
 Consistency: for a numeric variable, the values should all be recorded in the same units (e.g., all kg or all lbs).
For a categorical variable, the levels should be coded consistently (e.g., IA/NY/CT or Iowa/New York/Connecticut, not a mix).
 Sufficient documentation: a good dataset should also be sufficiently documented so that
other users can easily gain an accurate understanding of different aspects of the data.

Variables with legal/ethical concerns:


 Proxy variable: a variable that is closely related to another (often sensitive) variable and hence serves as its
“proxy”, e.g., occupation can proxy for gender: nurses are mostly female and mine workers are mostly male.
 Target leakage: the predictors in a model include information about the target variable that
will not be available when the model is applied in practice. These predictors are strongly
associated with the target variable, but their values are not known until the target variable is
observed.
Stage 3: Exploratory Data Analysis

With the use of descriptive statistics and graphical displays, clean the data for incorrect,
unreasonable, and inconsistent entries, and understand the characteristics of and the key
relationships among variables in the data. The observations we make may suggest an appropriate
type of predictive model to use and the best form for the predictors to enter the model.

Stage 4: Model Construction, Evaluation and selection

 Training/test set split – split the data into a training set (used to fit the model) and a test set (used to
assess prediction performance on unseen data).

 How to do the split? – If one of the variables in the data is a time variable, then we can split on
the basis of time: older observations go to the training set and recent observations to the test set.
 How many observations should we assign to the two sets? – The trade-off is that more training
data makes for a more robust predictive model, more capable of learning the patterns in
the data and less susceptible to noise. If too little data is set aside for the test set, however, the
assessment of the prediction performance of the trained model on new, unseen
observations will be less reliable.

Common performance metrics – To evaluate the performance of the model, we use performance metrics that
measure the extent to which the model matches the observations in the data. This requires a loss function.
 Regression problems: usually assessed by looking at the discrepancies between each observed value of the
target variable and the corresponding predicted value. The discrepancy is called a residual if the observation
is in the training set, and a prediction error if the observation is in the test set. A common metric is the root
mean squared error (RMSE); the smaller the RMSE, the better the fit of the model to the training/test data.

 Mean absolute error (MAE) – It places much smaller weight on large errors and makes the fitted
model more robust against outliers. Unlike the squared error loss, however, the absolute error loss is not
differentiable everywhere, which makes model fitting less convenient.

 Classification problems: the predictions (y hats) are simply labels, i.e., certain factor levels of the target.
We can use the zero-one loss function, an indicator of whether the predicted label differs from the observed one,
and use these indicator functions in aggregate to develop the classification error rate.

The sum of the indicators counts the number of training observations incorrectly classified. The
smaller the training classification error rate, the better the fit of the classifier.
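A quick sketch of these metrics computed by hand in R (y, y_hat, labels and labels_hat are hypothetical vectors of observed/predicted values):

  rmse <- sqrt(mean((y - y_hat)^2))     # root mean squared error
  mae  <- mean(abs(y - y_hat))          # mean absolute error
  err  <- mean(labels != labels_hat)    # classification error rate (average zero-one loss)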

 Cross-validation (CV): allows us to assess the prediction performance of a model without using additional test
data. CV can also be used to tune hyperparameters WITHOUT having to further divide the training set
into two parts.
1. For a given positive integer k, randomly split the training data into k folds of approximately
equal size.
2. One of the k folds is left out and the predictive model is fitted to the other k-1 folds. The fitted
model is then used to make a prediction for each observation in the left-out fold and a
performance metric is computed on that fold.
3. Repeat this process with each fold left out in turn to get k performance values.
4. The overall prediction performance of the model can be estimated as the AVERAGE of the
k performance values.
Note: If we have a set of candidate hyperparameter values, then we can run the CV procedure above for
each set of values under consideration and select the combination that produces the best model
performance.
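A bare-bones k-fold CV sketch for a linear model (assumes a hypothetical data frame train_df with target y; folds are assigned at random):

  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(train_df)))
  cv_rmse <- sapply(1:k, function(j) {
    fit  <- lm(y ~ ., data = train_df[folds != j, ])        # fit on the other k-1 folds
    pred <- predict(fit, newdata = train_df[folds == j, ])  # predict on the held-out fold
    sqrt(mean((train_df$y[folds == j] - pred)^2))           # fold-level RMSE
  })
  mean(cv_rmse)  # overall CV performance estimate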

Selecting the best model:


 Prediction performance: For a regression problem, choose the model with the smallest test RMSE; for a
classification problem, choose the model with the smallest test classification error rate.
 Interpretability: A model that makes good predictions is not always preferable. For the final model to
earn trust from stakeholders with no expertise in predictive analytics, it should be reasonably
interpretable.
 Ease of implementation: other things equal, the easier a model is to implement, the better.

Stage 5: Model Validation:


 Training set: for GLMs, there is a set of model diagnostic tools designed to check the model
assumptions based on the training set.
 Test set: For both GLMs and decision trees, compare the predicted values and the observed
values of the target variable on the test set (quantitatively, e.g., via summary statistics, or visually, via graphs).

Stage 6: Model Maintenance:


 Adjust the business problem – regulations may cause assumptions to shift, market conditions may change, etc.
 Consult subject matter experts – understand limitations on what can be reasonably implemented.
 Gather additional data: as time passes by, gather new observations and retrain model to ensure that
it will continue to be predictive in new environments. Or gather new variables.
 Apply new types of models – try new types of models with different strengths and weaknesses when
they are available.
 Refine the existing model: Try new combinations or transformations of predictors, different training/test
splits, or altered hyperparameter values.
 Field test proposed model: implement the recommended model in the exact way it will be used to
gain users’ confidence.

3.1.3 Bias-Variance Trade-Off

Complexity does not guarantee good prediction performance

Goodness of fit vs. Prediction accuracy

 Training error: the more complex or flexible a model, the lower its training error (a more
flexible model is capable of accommodating a wider variety of patterns in the training data).
 Test error: it exhibits a U-shape, first declining as model complexity increases, then
increasing eventually as the model becomes inordinately complex.

1. Case 1: Underfitting
The model is too simple to capture the training data effectively. Without learning
enough about the signal in the data, an underfitted model matches both training and
test data POORLY.
2. Case 2: Overfitting
The model is overwhelmingly complex and overfits the data, capturing the noise in
addition to the signal.
 The Signal: the general relationship between the target variable and the predictors.
 The Noise: patterns caused by random chance rather than by true properties of the
unknown function f.

Decomposition of the expected test error.
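A common way to write this decomposition at a fixed test point x_0 (a sketch in standard notation; Var(\varepsilon) is the irreducible error):

  E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)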


 Bias: the part of the expected test error that is caused by the model not being flexible enough to
capture the underlying signal. The more complex a model, the lower its bias (in magnitude), due to its
greater ability to capture the signal in the data.
 Variance: the part of the expected test error due to the model’s sensitivity to the particular training set.
A more flexible model has a higher variance because it matches the training observations
more closely and is therefore more sensitive to the training data.

Bias vs. variance. – They quantify the accuracy and precision of prediction, respectively.

For Bias:
 On the top left and top right, the two model predictions on average lie in the middle, hitting
the true signal value approximately. These two models have small bias, we say that they
make accurate predictions.
 In contrast, bottom left and right model predictions are far away from center on average and
thus they have large bias, or their predictions are inaccurate.
For Variance:
 For the top-left and bottom-left models, the predictions are concentrated in a small region; they have
small variance and are precise. The high-variance models on the right are imprecise.

A more flexible model generally has a lower bias but a higher variance than less flexible model.

For a model to fare well on future data, we should strike a balance between bias and variance,
which should at least approximately correspond to the minimum point of the U-shaped curve followed by the
test error: a moderately flexible model that fits the training data reasonably well and strikes a good balance
between bias and variance.

Visualizing Bias-Variance Trade-off: a MINI study case


 Accuracy: the 0-degree-of-freedom model has high bias and low variance. The model is too simple and
underfitted; its predictions are far away from the dashed line (the target) and fail to capture the true signal
of the data. The models with 1, 2, 4 and 8 degrees of freedom are all making accurate predictions, in that their
predictions are close to the true signal value.
 Precision: the models with 4 and 8 degrees of freedom have a higher variance, given the size of their boxes and the larger
discrepancies between outliers and the boxes; thus they have a higher MSE than models 1 and
2. They seem to be overfitting.

3.1.4 Feature Generation and Selection

To strike a balance in the bias-variance trade-off, feature generation and selection help us
effectively control model complexity.

 Variables: raw measurements that are recorded and constitute the original dataset before any
transformations are applied.
 Features: quantities derived from the original variables that provide an alternative, more useful representation
of the raw variables and serve as the final inputs into a predictive model.

Feature Generation – A process of generating new features based on existing variables in the data.
 Predictive Power: Transform the data into a more useful form (or scale) so that the predictive model can
better “absorb” the information and capture the signal relating the target variable to
the predictors.
 Interpretability: New features can make a model easier to interpret.

Feature selection and dimension reduction – A procedure of dropping features (or variables) with limited
predictive power and therefore reducing the dimension of the data.
 Predictive Power: Feature selection is an attempt to control model complexity and prevent overfitting
by reducing the variance at the cost of a slight rise in bias.
 Interpretability: Given two comparably effective models, we prefer the simpler and more
interpretable one, obtained by selecting the useful features and dropping the not-so-useful ones.
 Dimensionality (of a categorical predictor): refers to the number of possible levels that the variable has.

Ways to reduce dimensionality of a categorical predictor:


1. Combining sparse categories with others: Combine categories with very few
observations with other categories. Due to the small number of observations,
it is difficult to reliably estimate the effects of these categories on the target variable.

1.a – Ensure that each remaining level has a sufficient number of observations.


1.b – Preserve the differences in the behavior of the target variable among the
different factor levels.

2. Combining similar categories: If the target variable behaves similarly (in the mean,
median, or other distributional measures) in two categories of a categorical
predictor, we can reduce the dimension of the predictor by consolidating these
two categories, EVEN if they are not sparse, without losing much information (see the sketch after this list).
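A sketch of both ideas in base R (df, the factor occupation, and the levels shown are hypothetical placeholders):

  # combine a sparse level with an existing, larger level (duplicate level names are merged)
  levels(df$occupation)[levels(df$occupation) == "miner"] <- "manual"
  # combine two levels whose target behavior is similar, even if neither is sparse
  df$agecat <- factor(ifelse(df$agecat %in% c("5", "6"), "5+", as.character(df$agecat)))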
Granularity revisited.
One way to reduce the susceptibility of a predictive model to overfitting is to reduce the granularity of a
categorical predictor, recording the information contained by the predictor at a less detailed level and making
the number of factor levels more manageable. The optimal level of granularity is the one that optimizes the
bias-variance trade-off.

Differences between Granularity and dimensionality:


 Applicability: Dimensionality is a concept specific to categorical variables, while granularity applies
to both numeric and categorical variables.
 Comparability: we can always order two categorical variables by their dimensions (by counting their number of
factor levels). However, it is not always possible to order two variables by granularity. For variable 1 to be more
granular than variable 2, variable 1 must contain more detailed information than variable 2, so that
variable 2 can be deduced from variable 1.

3.2 Linear Models: Conceptual Foundations


3.2.1 Model Formulation

Model fitting by Ordinary least squares (OLS)

This approach consists of choosing the estimates of the β’s to make the sum of squared differences between
the observed target values and the fitted values under the model as small as possible:
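In symbols, the quantity being minimized is (standard notation; n_tr is the number of training observations and p the number of predictors):

  \sum_{i=1}^{n_{tr}} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n_{tr}} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_p x_{ip} \right)^2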

NOTE: When the random errors are normally distributed, the OLS estimators of β0, β1, β2, … coincide with the
maximum likelihood estimators.

3.2.2 Model Evaluation and Validation


Goodness of fit measures. – measure how well the linear model fits the training data.

Ideally, we would want the predicted/fitted values to be sufficiently close to the corresponding observed
target value but not overly close in order to avoid overfitting. On test set, we want them to be as close as
possible so our linear model is as predictive as possible.

Residual – the discrepancy between the observed value and the fitted value and measures how well the fitted
model matches the ith training observation.

Residual sum of squares (RSS) – the sum of the squared residuals over the training set; a global goodness-of-fit measure.

Coefficient of determination R^2

A scaled goodness-of-fit measure of a linear model, it’s defined as the proportion of variation of the target
variable that can be explained by the fitted model.

R^2 is on a scale from 0 to 1; the higher the value of R^2, the better the fit of the model to the training set.
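In symbols (a standard formulation; TSS is the total sum of squares of the target about its training mean):

  R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_{i=1}^{n_{tr}} (y_i - \bar{y})^2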
Problem with RSS and R^2 – They are merely goodness-of-fit measures of a linear model, with no explicit
regard to its complexity or prediction performance.
Traditional model selection methods: Hypothesis testing.

 t-test – Loosely speaking, it measures the partial effect of Xj on the target variable, i.e., the effect of
adding Xj to the model after accounting for the effects of the other variables in the model.
 F-test – tests the joint significance of the entire set of predictors.

General model selection measures: AIC and BIC:

For AIC (a goodness-of-fit term plus a complexity penalty):

Ideally, we would like both terms to be small, but they are typically in conflict with each other – a model that fits
the training data well is usually more complex.

For BIC:
BIC carries a more stringent penalty term; for a reasonably large sample size, an additional parameter has to
improve the goodness of fit by a lot for it to be worth including.
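For reference, the usual definitions (a sketch; l is the maximized loglikelihood, p the number of estimated parameters, and n_tr the training set size — the first term rewards goodness of fit, the second penalizes complexity):

  \mathrm{AIC} = -2l + 2p, \qquad \mathrm{BIC} = -2l + p \ln(n_{tr})

Since ln(n_tr) exceeds 2 once n_tr is larger than about 7, the BIC penalty per parameter is the heavier of the two.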

Model Diagnostics. – A set of quantitative and graphical tools that are used to identify evidence against the
model assumption. If any apparent deficiencies are present, we may refine the specification of the model
taking these deficiencies into account.

If a linear model is properly specified according to the model equation and the model assumptions are valid,
then the residuals should resemble the random errors and have the following “nice” properties:
Normal Q-Q Plot:
Used for checking the normality of the random errors. The points should lie closely along the 45-degree straight line if the
errors are normally distributed.

Systematic departures from the 45-degree straight line, often in the two tails, suggest that the
normality assumption is not entirely fulfilled.
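A quick way to produce these diagnostics for a fitted lm object in R (fit, y and train_df are placeholder names):

  fit <- lm(y ~ ., data = train_df)
  plot(fit, which = 1)  # Residuals vs Fitted
  plot(fit, which = 2)  # Normal Q-Q plot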
3.2.3 Feature Generation
How do we specify a useful linear model?

1. How do we represent different types of predictors (numeric/categorical) in a linear model?
2. How do we configure the model equation to accommodate different kinds of relationships between
target variable and predictors?
We generate new features that can be added to the model to enhance its flexibility and hopefully its
prediction performance.

Numeric Predictors

Basic form: We can interpret the regression coefficient Beta1 as follows – a unit increase in X is associated with an
increase of Beta1 in Y on average, holding all other predictors fixed.

Tweaking a linear model to handle more complex non-linear relationships:

Method 1: Polynomial regression:

 Pros: by using polynomial regression, we are able to handle substantially more complex
relationships between the target variable and predictors than linear ones.
 Cons:
1. Interpretability – polynomial terms are difficult to interpret. We can no longer say that Beta1 is the expected
change in Y associated with a unit change in X, holding others fixed, because the polynomial also
contains X^2 (and possibly higher powers) of the same X.
2. Choice of m – No simple rule for how to choose the value of m (a hyperparameter).
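A polynomial regression sketch in R (degree m = 3; raw = TRUE keeps ordinary powers of X rather than orthogonal polynomials; y, x and train_df are placeholders):

  fit_poly <- lm(y ~ poly(x, degree = 3, raw = TRUE), data = train_df)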
Method 2: Binning – using piecewise constant functions.

An alternative to polynomial regression for incorporating non-linearity into a linear model is to not treat a
numeric variable as numeric. Rather, we “bin” the numeric variable and convert it into an ordered categorical
variable with different levels that are defined as non-overlapping intervals over the range of the original
variable.

 Pros: Binning liberates the regression function from assuming any particular shape; no definite order
is imposed among the coefficients of the dummy variables corresponding to different bins. The larger the number of
bins, the wider the variety of relationships between the target variable and predictors we can fit and the more
flexible the model; its bias drops, but its variance increases.
 Cons: No simple rule on how many bins we should use / loss of information due to binning – we used
to have exact values, but now we only know the interval each value falls in.
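A binning sketch using cut() (the break points shown are arbitrary placeholders):

  train_df$x_bin <- cut(train_df$x, breaks = c(0, 10, 25, 50, Inf))  # ordered, non-overlapping intervals
  fit_bin <- lm(y ~ x_bin, data = train_df)                          # one dummy per non-baseline bin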

Method 3: Using piecewise linear functions.

An extension of binning is to use piecewise linear functions. The regression function is “linear” over different
intervals in the range of the numeric variables.

 Pros: Simple but powerful for handling non-linearity. As we add more and more “call payoff” (hinge) functions with
different break points (strike prices), we are able to fit a wider range of non-linear relationships. Unlike binning, this
method recognizes the variation of the target mean within each piece and retains the original values
of X.
 Cons: The break points have to be specified in advance.
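A piecewise linear sketch using a hinge ("call payoff") term; the break point 20 is a placeholder:

  fit_pw <- lm(y ~ x + pmax(x - 20, 0), data = train_df)
  # the fitted slope is Beta1 below 20 and Beta1 + Beta2 above 20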
Categorical predictors
The problem with categorical predictors – they cannot be handled algebraically in the same way as numeric
predictors and require special treatment.

Binarization: It is an important feature generation method; it turns a given categorical predictor into a
collection of artificial “binary” variables. (a.k.a dummy variables) Each will serve as an indicator of one and
only one level of the categorical predictor.

It creates a binary variable that equals 1 for every observation whose categorical predictor assumes that level,
and 0 otherwise. This way, it shows whether or not each observation possesses a certain
characteristic.
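A sketch of binarization in R, which lm()/glm() perform automatically for factors but which can also be done explicitly (df and the factor variable region are placeholders):

  dummies <- model.matrix(~ region, data = df)[, -1]  # one 0/1 column per non-baseline level
  head(dummies)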

Baseline level and interpretation of coefficients:


Choice of baseline:

Should we treat a numeric variable as a factor?

Typically yes, when the numeric variable has one or more of the following characteristics.

Differences between treating a variable as numeric vs. as a factor in a linear model:


 Numeric: we are assuming that the target mean varies linearly with the variable’s (integer) values.
 Factor: provides the linear model with more flexibility in capturing the effect of the predictor on the target
mean, but when there are many levels it is prone to overfitting.
Q: Calendar years 2011–2015 – should we treat this as a factor variable or keep it numeric?

Interaction: an interaction exists if the association between the target variable and one predictor depends on the
value of another predictor.

Two or more predictors may interact with each other and affect the target variable on a joint basis.

Interactions between different types of predictors:

Beta1 X1 and Beta2 X2 capture the main effects of the two numeric predictors, and Beta3 X1 X2 produces the
interaction effect. NOTE: even if X1 and X2 are strongly correlated, that correlation has nothing to do with whether the
relationship between X1 and Y varies with the value of X2 (the R formula syntax is sketched after the list of cases below).

There are three interactions:


 Case 1: Numeric & Numeric
 Case 2: Numeric & Categorical
 Case 3: Categorical & categorical
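A sketch of how such interactions are specified in R's formula syntax (y, x1, x2, f and train_df are placeholders):

  lm(y ~ x1 * x2, data = train_df)  # expands to x1 + x2 + x1:x2 (numeric & numeric)
  lm(y ~ x1 * f,  data = train_df)  # numeric & categorical: a separate slope for x1 at each level of f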

Sidebar: Collinearity
Definition: Collinearity exists in a linear model when two or more features are closely, if not exactly, linearly
related.
Problem with collinearity. Presence of collinear variables means that some of the variables do not bring
much additional information because their values could be largely deduced from the values of other
variables, leading to redundancy.
 Variance inflation – coefficient estimates become unstable, e.g., a wildly large positive coefficient estimate for one
feature accompanied by a similarly large negative coefficient estimate for its correlated counterpart.
 Interpretation of coefficient – hard to interpret a coefficient estimate as the expected impact of the
corresponding feature on the target variable when other variables are held constant, because
variables strongly correlated with the given feature tend to move together.

Rank-deficiency: coefficient estimates of the model cannot be determined uniquely.

Possible solution to collinearity:


 Delete one of the problematic predictors causing collinearity.
 PCA

3.2.4 Feature Selection

Best subset selection: Fitting a separate linear model for each possible combination of the available features
and selecting the “best” subset of features to form the best model. NOTE: this method is computationally
prohibitive if we have too many features.

Stepwise selection: Backward / Forward

Stepwise vs. best subset:


 Stepwise selection does not consider ALL possible combinations of features, so it may only produce a local optimum.
However, its computational burden is much lower than best subset selection, and its results are easy to interpret.
 Best subset selection considers ALL possible combinations of features and produces the global optimum. However, its
computational burden is heavy and it can be hard to implement.

3.2.5 Regularization – incorporates a penalty term that reflects
the complexity of the model into the objective
function being minimized.
It is an alternative to stepwise selection for choosing features and, more generally, for reducing the
complexity of a linear model by shrinking the coefficient estimates.
 Goodness of fit to training data: we desire coefficient estimates that match the training data
effectively (small RSS)
 Model Complexity: it is desirable to have coefficient estimates that are relatively small in absolute
value so that the model has lower variance and is less prone to overfitting.

Remarks:
 The intercept Beta0 is not part of the regularization penalty.
 The predictors should be standardized (put on the same scale) before the penalty is applied.
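One common way to write the regularized objective, as a sketch (lambda >= 0 controls the overall strength of the penalty and alpha in [0, 1] mixes the ridge and lasso penalties, matching the glmnet parameterization up to scaling constants):

  \min_{\beta} \; \mathrm{RSS} + \lambda \left[ \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right]

Setting alpha = 1 gives the lasso, alpha = 0 gives ridge regression, and intermediate values give the elastic net.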

Effect of Lambda and bias-variance trade off:


 Lambda = 0: the regularization penalty term vanishes and the coefficient estimates are identical to the
OLS estimates.
 As Lambda increases, the regularization penalty becomes heavier and shrinks the coefficient
estimates closer and closer to zero. As the flexibility drops, the variance drops at the cost of a small
increase in bias, which can improve the model's prediction performance.

Feature selection of elastic net:


Pros and Cons of regularization techniques for feature selection:

Pros:
 Categorical predictors are automatically binarized, and each factor level is treated separately as a feature
that can be retained or removed, so we can assess each factor level on its own rather than only the entire
categorical predictor.
 Tuning by CV – uses the same criterion (MSE) that will ultimately be used to assess the model against
unseen test data.
Cons:
 Interpretability – with some settings, as in ridge regression, all the features are retained, so the model may remain hard to interpret.

NOTE: Exercise 3.2.21 offers insights on the differences between stepwise selection and regularization.
3.3 Case Study 1: Fitting Linear Models in R
Univariate data exploration:

For Newspaper, the median is remarkably lower than its mean, indicating there is a certain amount of right
skewness in its distribution

Bivariate data exploration – pairwise relationship between target variable and each predictor

 Correlation

TV and Radio are strongly positively correlated with sales, so spending more money on these two media would
theoretically drive up sales.
3.3.2 Simple Linear Regression
OLS coefficient estimates of the fitted linear model
Based on the screenshot below:

1. Coefficient estimates table: TV is a significantly positive factor associated with sales


2. Based on the “Multiple R-squared”, 61.19% of the variation of sales is explained by the fitted
model, indicating that a moderate amount of the variability of sales is explained by regressing sales on
TV.

Making predictions: The predict () function.


3.3.3 Multiple Linear Regression
Interaction in multiple linear regression page 269.

3.3.4 Evaluation of Linear Models


3.4 Case Study 2: Feature Selection and Regularization
3.4.1 Preparatory work

Some factor variables may have potential ethical concerns – e.g., Ethnicity.


 Positive – potentially useful information for understanding and predicting credit card balance.
 Negative – protected class, maybe criticized on grounds of unfair discrimination and raise legal issues.

Discuss the distribution of the target variable before jumping into the analysis.

NOTE: Median, mean and maximum – the mean exceeding the median suggests that the distribution of Balance is
skewed to the right.
 If there are zeros, then the log transformation is not applicable.
 If there are zeros, apply the square root transformation instead.

Descriptive Statistics: Correlation Matrix

 Observation 1: limit/Rating/Income are positively correlated with Balance


 Observation 2: Correlations between predictors – Limit and Rating are almost perfectly positively collinear (0.9969);
we can delete one and keep the other in the data.

Scatterplot to show relationships between numeric and target variable:


 Income and balance are moderately positive correlated
 Rating and balance are strongly positive correlated

Split boxplots

 Only Student (yes/no) shows a significant difference in Balance; the other three factors show no significant
differences.
3.4.2 Model Construction and Feature selection

 Married = “No” is the baseline, as is Student = “No”; we can say that the credit card balance of an
unmarried person is 20.46469 higher than that of a married person, holding other predictors fixed.
 NOTE that the Caucasian level has a smaller standard error than the Asian level – sparse levels tend to have larger
standard errors because their coefficient estimates are based on less data.

Binarization – Converting categorical predictors with three or more levels into dummy variables.
 Advantages – Able to drop/add individual factor levels when we’re doing feature selection.
 Disadvantages – Stepwise selection might take longer due to many more factor levels to choose
from, thus increased computational burden.

Feature selection by the stepAIC() function – at each step, add (forward) or remove (backward) the feature that
produces the greatest improvement in the model according to a certain criterion.
 BIC tends to favor models with fewer features because it is more conservative. For a more complex
model to beat a less complex model in terms of BIC, the goodness of fit of the former has to far
surpass that of the latter to make up for the increase in the penalty term.
 AIC, being less stringent, tends to retain more features; the client then has the option of considering more
features, and there is less risk of missing out on important features.

3.4.3 Model Validation


 The “Residuals vs Fitted” graph – if the linear model has sufficiently captured the relationship
between the target variable and the predictors, the points should exhibit no special pattern and be
dispersed in an erratic manner.
 Homogeneity of error variance – if the amount of fluctuation of the residuals increases as the
magnitude of the fitted values increases, then homoscedasticity is in doubt.
 Normal Q-Q plot – a check of normality; if the two extreme tails deviate from the 45-degree line, a
distribution with fatter tails than the normal distribution may be more appropriate.

3.4.4 Regularization
Reduce magnitude of the coefficient estimates via the use of a penalty term and serves to prevent overfitting.

For the glmnet() function to work, binarization of categorical variables is required because it only accepts
quantitative inputs.

glmnet() also standardizes the predictors, by default, to put them on the same scale.

Lambda = 0, 10, 100, 500, 1000: the bigger lambda, the stronger the penalty effect; with the elastic net penalty, the
coefficients of predictors with limited predictive power can be shrunk all the way to exactly
zero.

One-standard-error (1-SE) rule – choose the simplest regularized regression model whose CV error is within “one
standard error” of the minimum CV error.
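A sketch with the glmnet package (the design matrix must be numeric, e.g. built with model.matrix(); Balance is the target from this case study, while train_df and X_test are placeholders):

  library(glmnet)
  X <- model.matrix(Balance ~ ., data = train_df)[, -1]   # binarized predictors, intercept column dropped
  m <- cv.glmnet(X, train_df$Balance, alpha = 0.5)        # cross-validation over a grid of lambda values
  m$lambda.min                                            # lambda with the smallest CV error
  m$lambda.1se                                            # simplest model within 1 SE (the 1-SE rule)
  predict(m, newx = X_test, s = "lambda.1se")             # predictions at the chosen lambda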
4.1 Conceptual Foundations of GLMs (differences between
linear model and GLMs)
 Distribution of the target variable – the target variable only needs to be a member of the so-called
exponential family of distributions, whereas a linear model requires the target variable to be normally distributed.
 Relationship between the target mean and linear predictor – a GLM sets a function of the target mean (the
link function) to be linearly related to the predictors.

GLM vs. Linear Model


 In a linear model, a transformation applies directly to the target variable to bring it closer to normal so
that it can be reasonably described by a normal linear model.
 In a GLM, the transformation (link) applies only to the TARGET MEAN, not to the target variable itself.

4.1.1 Selection of Target distribution and Link Functions


Component 1: The target distribution – GLM can accommodate a variety of distributions.
 Positive, continuous, and right-skewed data such as claim amounts and amount of insurance
coverage. Use gamma or inverse Gaussian distribution.
 Binary data – for a classification problem coded 1/0 for occurrence/non-occurrence, the binomial
distribution is a reasonable choice of target distribution. The target mean in this case is the
probability that the event of interest occurs.
 Count data – for the number of times a certain event happens over a reference time period, we use the
Poisson distribution; such variables assume only non-negative integer values. One drawback of the
Poisson distribution is that it forces the mean to equal the variance; when the variance exceeds the mean,
the data are overdispersed.
 Aggregate data: to handle a mix of discrete and continuous data, the Tweedie distribution comes to the rescue;
it is a compound Poisson-gamma distribution.
o Tweedie is an in-between distribution of the Poisson and the gamma.
o It has a discrete probability mass at zero and a probability density function on the positive real line.
The nature of the distribution makes it suitable for modeling aggregate claim losses: a large mass
at zero, as most policies have no claims, and a continuous, right-skewed distribution
reflecting total claim sizes (the gamma is right-skewed).

Component 2: The link function – Given the target distribution, we can proceed to specify the link function.
 Positive mean – the Poisson, gamma and inverse Gaussian distributions have means that are positive and
unbounded from above. We use the log link for them,

as it ensures the target mean of the GLM is always positive: mu = exp(linear predictor) > 0.

 Appropriate predictions – unit-valued mean – for binary target variables, the target mean is the
probability that the event of interest occurs, which always lies between 0 and 1. Thus, the link function
should ensure that the mean implied by the GLM is unit-valued. Using the logit link,
g(mu) = ln(mu / (1 - mu)), the mean mu = exp(linear predictor) / (1 + exp(linear predictor)) always lies in (0, 1).

 Interpretability – ease of interpretation for good link function.


IMPORTANT! Interpretation of GLM coefficients – quantifying the change in the target mean in response to
changes in the predictors in terms of the model coefficients. This depends on the choice of link function.
 Log link: it always ensures positive predictions and is easy to interpret – a unit change in a numeric predictor
with coefficient beta j is associated with a multiplicative change of exp(beta j) in the target mean; for a categorical
variable, the target mean is multiplied by exp(beta j) when the variable moves from its baseline level to the level
represented by the dummy variable with coefficient beta j.
 Logit link – used almost always for binary data; it is another form of the log link, applied to the odds – a
unit change in a numeric predictor with coefficient beta j is associated with a multiplicative change of
exp(beta j) in the odds. It ensures the model predictions are always valued between 0 and 1 and keeps the model
easily interpretable.
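A sketch of a logit-link GLM and the odds interpretation (the data frame and variable names are placeholders loosely based on the claim occurrence case study):

  fit_logit <- glm(clm ~ agecat + veh_value, family = binomial(link = "logit"), data = train_df)
  exp(coef(fit_logit))                                       # multiplicative changes in the odds
  predict(fit_logit, newdata = test_df, type = "response")   # predicted probabilities, always in (0, 1)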

4.1.2 Weights and Offsets


Both of these tools are used to incorporate a measure of exposure into a GLM to improve the fitting.

 Weights – Observations in grouped data with a larger exposure have a smaller variance and a higher
degree of precision. Thus, we attach a higher weight to these more credible observations so that they carry
more weight in the estimation of the GLM coefficients. The goal is to improve the reliability of the fitting
procedure.
 Offsets – The larger the number of policies, the larger the total number of claims (target variable).
Thus the number of policies can be set as an “offset” to better account for the total number of claims: the
number of policies is in direct proportion to the total number of claims.
 Weights – to use weights properly, the observations of the target variable should be averaged by
exposure. Due to the averaging, the variance of each observation is inversely related to the size of the
exposure. Weights do not affect the mean of the target variable directly.
 Offsets – Observations are aggregated over the exposure units. The exposure, serving as an offset, is
directly proportional to the mean of the target variable but leaves its variance unaffected.
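A sketch of how weights and offsets enter glm(), using the variable names from the Section 4.4 case study (the data frame name sweden is a placeholder; the weighted model assumes rows with zero claims have been removed):

  # averaged target (severity) with the number of claims as a weight
  glm(PaymentPerClaim ~ Kilometres + Bonus, family = Gamma(link = "log"),
      weights = Claims, data = sweden)
  # aggregated target (claim counts) with log(exposure) as an offset
  glm(Claims ~ Kilometres + Bonus, family = poisson(link = "log"),
      offset = log(Insured), data = sweden)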

4.1.3 Fitting and Assessing the performance of a GLM

Goodness of fit measure –


 A global goodness-of-fit measure: Deviance – it measures the extent to which the GLM departs from the
saturated model (the model with as many parameters as the size of the training set; it perfectly fits each
training observation and is maximally flexible, i.e., overfitted).

1.1 Deviance reduces to RSS for linear model – we can view deviance as a
generalization of RSS that works even for non-normal target variables in the
exponential family.
1.2 Deviance should be used carefully! – We can ONLY compare GLMs with the same
target distribution, so they share the same maximized likelihood of the saturated
model.
1.3 Residuals derived from deviance

 A local goodness of fit measure: Deviance residuals

NOTE: The deviance of a GLM parallels the RSS of a linear model in the sense that it is merely a goodness-of-fit
measure on the training set and always decreases when new predictors are added.

A Traditional model selection method: Likelihood ratio test. – a generalization of the t-test and F-
test.
Restrictions of the LRT:

General model selection measures: AIC and BIC

4.1.4 Performance Metrics for Classifier


Confusion matrices – A binary classifier only predicts the probability that the event of interest occurs
for a given set of feature values. It does not directly say whether the event is predicted to happen
or not. Thus, we need to introduce a CUTOFF.

 Case 1 – If the predicted probability is higher than the cutoff, then the event is predicted to occur
 Case 2 – if the predicted probability is lower than the cutoff, then the event is predicted not to
occur.
Classification error rate:

Accuracy – the proportion of all observations (positive and negative) that are correctly classified.

Sensitivity – True positive rate: the proportion of positive observations in the data that are correctly
classified as positive.

Specificity – True negative rate: the proportion of negative observations in the data that are
correctly classified as negative.

Precision – the proportion of positive predictions that truly belong to the positive class.
Relationship between accuracy, sensitivity and specificity – Accuracy is a weighted average of
sensitivity and specificity.
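A sketch computing these metrics from predicted probabilities (pred_prob and actual are placeholder vectors; the cutoff 0.5 is arbitrary):

  pred_class <- ifelse(pred_prob > 0.5, 1, 0)
  table(actual, pred_class)                           # confusion matrix
  accuracy    <- mean(actual == pred_class)
  sensitivity <- mean(pred_class[actual == 1] == 1)   # true positive rate
  specificity <- mean(pred_class[actual == 0] == 0)   # true negative rate
  precision   <- mean(actual[pred_class == 1] == 1)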

Effects of changing the cutoff. – the selection of cutoff is a trade-off between having high sensitivity
or having high specificity

 Extreme case 1: If the cutoff is 0, then every observation is predicted positive. Thus, we have
sensitivity equal to 1 but specificity equal to 0.
 Increasing the cutoff: As we increase the cutoff, more and more observations are classified as
negative. Thus the specificity increases at the cost of losing some sensitivity.
 Extreme case 2: If the cutoff is 1, then every observation is predicted negative. Thus sensitivity = 0 but
specificity = 1.

ROC curves (Receiver Operating Characteristic) – a plot of sensitivity against specificity over all cutoffs; the curve
begins at (1,0) and ends at (0,1), corresponding to cutoffs of 1 and 0 respectively.

The predictive performance of a classifier can be summarized by computing the area under the curve (AUC).
The exact value of the AUC may NOT mean much for the quantitative assessment of a classifier; it is the
relative value of the AUC (when comparing classifiers) that matters. The higher, the better.
 AUC = 1: the highest possible value of the AUC; a classifier with this value has perfect
discriminatory power and is able to separate all the observations correctly.
 AUC = 0.5: a naïve classifier that classifies observations purely randomly, without using the information
contained in the predictors.
 AUC = 0: a (worst-case) classifier that always separates the observations incorrectly.
Sidebar: Unbalanced Data – one class of the binary target variable is much more dominant than the other
class in terms of the proportion of observations.

What is so bad about unbalanced data?


1. The classifier places more weight on the majority class and tries to match the training
observations in that class.
2. Without paying enough attention to the minority class, this could be problematic
especially when the minority class is a positive class and class of interest.

Solutions:
 Undersampling: it produces roughly balanced data by drawing fewer observations from the negative
class(undersampled class) and retaining all of the positive class observations.
 Drawback – now our classifier is based on less data, thus could be less robust and more prone to
underfitting.
 Oversampling: Keeps the original data but oversamples the positive class with replacement, reducing
the imbalance between the two classes. NOTE: oversampling should be performed only on the training set, after the
training/test split. Otherwise, our test set might contain some of the same observations as the training set, so it is
not completely unseen.

4.2 Case study: GLMs for continuous Target variables

For inj with 6 levels – treating it as numeric implicitly implies that it has a global monotonic relationship
with claim amounts across all injury groups. To relax this restriction, we can convert inj to a factor.

4.2.2 Model Construction and Evaluation


 For a highly skewed target variable, its mean can easily be distorted by just a small number of
extreme observations, and it is more reliable to look at a central tendency measure like the median.
 RMSE places a large weight on large errors, which are not uncommon for a skewed target variable, and is
therefore rather sensitive to outliers.

To get a feel for how the OLS model performs, we compare its test RMSE with the test RMSE of the
intercept-only model fitted to the training set, which uses the mean of amt on the training set as the
prediction.
GLM 1: Log-link normal GLM on claim size.
This GLM ensures positive predictions by using the log link to connect the mean claim size to the linear predictor. A
drawback is that the normal distribution still allows for the possibility that observations of the target variable are negative.

GLM 2: Log-link gamma GLM

Using the log link ensures positive predictions, a characteristic of the target variable, and makes the model
coefficients easier to interpret.
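A sketch of the two candidate models (amt and train_df are placeholders based on this case study; the log-link gaussian fit may need care with starting values):

  glm_normal <- glm(amt ~ ., family = gaussian(link = "log"), data = train_df)
  glm_gamma  <- glm(amt ~ ., family = Gamma(link = "log"),    data = train_df)
  exp(coef(glm_gamma))   # multiplicative effects on the expected claim size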

4.2.3 Model Validation

 For “Residuals vs. Fitted” – most of the residual points scatter around 0 with no special pattern,
with some positive outliers such as 5422, 10224 and 18870; there seems to be more spread among large
positive residuals than among negative ones.
 For the “Normal Q-Q plot” – this allows us to assess the normality of the standardized deviance
residuals. Most of the points lie on the 45-degree line from the left end to the middle but start to deviate
toward the right end, indicating some skewness that the gamma distribution failed to
handle; a distribution with a fatter tail may be more suitable.

Interpret the coefficient estimates:

Since the gamma GLM uses the log link, the expected claim size is multiplied by exp(beta j) for every
unit increase in a numeric predictor with coefficient beta j, or when a qualitative predictor moves from its baseline
level to a new level represented by a dummy variable with coefficient beta j, holding everything else constant.
4.3 Case study 2: GLMs for Binary Target Variable

Removing inappropriate observations and variables

 Target leakage – remove any variables whose values will not be known until the target variable is observed.
 Questionable data that do not make sense – negative in age for example.

Factor conversions – converting numeric variable into factors.


 Advantages: treating a numeric variable with no real numeric “sense” as numeric may impose a global monotonic
association with the target variable. Converting it into a categorical variable represented by dummy
variables with different coefficients lifts this restriction and provides the GLM with more flexibility to
capture the effect of each level on the target variable more effectively.
 Disadvantages: this inflates the dimension of the data and may result in overfitting. Stepwise selection or
regularization can help reduce the large number of factor levels, but the computation will
take longer.

Numeric Predictors – using boxplot to see the relationships between target variable and numeric variables.
 Vehicle values do not have much of impact on claim occurrence.
 Exposure has a significant impact on claim occurrence as expected, the more one drives the more
likely one will have an accident.

Categorical Predictors – Combine similar levels in a factor to reduce dimension of the data.

4.3.2 Model Construction and Selection


1. Split the data into training and test sets.
2. Choose an appropriate target distribution and link function for the GLM to fit to the data.
3. Check model performance on the training set, weighing goodness of fit against complexity (AIC/BIC) and
interpretability.
4. For the offset – since the target mean is connected to the linear predictor through a log-type link, we impose the
offset on the log scale, meaning that the target mean is directly proportional to the value of the offset.
5. Training set vs. test set – if the test set’s accuracy/sensitivity/specificity are slightly lower
than the training set’s, this indicates a small extent of overfitting in our model.

Binarization – binarizing a feature with multiple levels allows us to view each level of the categorical predictor
separately and add/remove one level at a time.

Then perform stepwise selection to add/remove features to reach a better model.


4.3.3 Interpretation of Model Results.

Odds-based interpretation –
 For every unit increase in log_veh_value, the odds of claim occurrence increase by 1.1716752 – 1 = 17.17%, holding
everything else constant. The higher the vehicle value, the more likely a driver is to submit a claim.
 When agecat.1 goes from 0 to 1, the odds of occurrence increase by 1.2934442 – 1 =
29.34%, holding everything else constant. The younger the driver, the higher the chance the driver
will experience a claim.

Probability-based interpretation – a unit increase in a predictor Xj changes the predicted probability by an amount
that depends on the current value of Xj; the “average policyholder” is the first row of the prediction table.

 10% increase in exposure will lead to a 0.0785 – 0.0718 = 0.0067 increase in claim occurrence
predicted probability
 Being in agecat.5 will lead to a 0.0566 – 0.0718 = -0.0152 decrease in claim occurrence.

4.4 Case Study 3: GLMs for count and Aggregate loss variable.
4.4.1 Data Exploration and Preparation
 The number of claims
 The sum of claim payments
Picking and justifying offsets.

 Insured (policy-years) is a good offset for Claims – the longer the total insured exposure, the larger the number
of claims submitted.
 Claims is a good offset for Payment – the more claims submitted, the larger the total payment.

Some additional observations. – factor conversions/ Univariate and bivariate data exploration.

4.4.2 Model Construction and Evaluation

A GLM for Claims with Kilometres and Bonus as numeric variables.


 Claims is a non-negative, integer-valued variable with a right skew, so the Poisson distribution is a suitable
target distribution. The log link is most commonly employed in conjunction with the Poisson distribution
because it ensures the model predictions are always non-negative and is easy to interpret.

Digression: Can we model ClaimPerInsured Directly?


 We can, but the AIC will not be available for such a model, so stepwise selection based on AIC cannot be performed.
 Pearson goodness-of-fit statistic: a commonly used performance metric for a count GLM; the lower, the better.

A GLM for claim severity.


 The variability of PaymentPerClaim decreases as Claims increases; this motivates the use of Claims as
a weight.
 The log link and gamma distribution are a good choice for claim severity – continuous, non-negative
values.

4.4.3 Predictions

 Predict Claims first using the count (frequency) model, then predict PaymentPerClaim using the severity
model, and finally take the product of the two predictions.
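A sketch of combining the two models (model objects and data frame names are placeholders):

  pred_freq <- predict(glm_freq, newdata = test_df, type = "response")  # expected number of claims
  pred_sev  <- predict(glm_sev,  newdata = test_df, type = "response")  # expected payment per claim
  pred_total_payment <- pred_freq * pred_sev                            # frequency x severity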
Chapter 5
Tree-Based Models

Decision trees divide the feature space into a finite set of non-overlapping regions containing relatively
homogeneous observations that are more amenable to analysis and prediction.

How it works: To predict a given observation, we simply locate the region to which it belongs and use the
mean (regression) or the most common class (classification) of the target variable in that region as the prediction.

5.1 Conceptual foundations of decision trees.


5.1.1 Single Decision trees.

 Node – a point on a decision tree that corresponds to a subset of the feature space.
 Binary tree: decision tree in which each node is followed by two sub-nodes.
 Depth: the number of tree splits needed to go from the root node to the furthest terminal node; a measure of the
complexity of a decision tree.

Regression problems: for a numeric target variable, use the average of the target variable in the region as the
predicted value. To quantify the variability of a numeric target variable in a particular node of a decision tree,
we use the residual sum of squares – the sum of the squared discrepancies between the observed values of the
target variable and the node mean (the fitted value) over all training observations in that node. The lower, the better.

Classification problem: use the most common class of the target variable in that region as the predicted
value. Classification error rate – measure of impurity of a tree node.

Construction of decision trees: recursive binary splitting – every time we make a binary split, there are
two inter-related decisions:
 The predictor used to form the split
 Given the predictor, the cutoff value
 Each split should lead to the greatest information gain or, equivalently, the greatest reduction in impurity.

Three ways to measure the impurity of a node for a classification tree: the classification error rate, the Gini index,
and the entropy (see below).

The classification error rate Em cares only about the maximum class proportion (the larger that proportion, the
lower Em), so it does not say enough about node impurity; the Gini index and entropy also capture this but are
more sensitive to the full distribution of the class proportions.
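For reference, the standard definitions for a node m with class proportions \hat{p}_{m,k} (classification error rate, Gini index, entropy):

  E_m = 1 - \max_k \hat{p}_{m,k}, \qquad G_m = \sum_k \hat{p}_{m,k}\,(1 - \hat{p}_{m,k}), \qquad D_m = -\sum_k \hat{p}_{m,k} \ln \hat{p}_{m,k}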
Controlling tree complexity by cost-complexity pruning – although decision trees can be fitted efficiently, they
can also easily overfit, capturing the noise of the training data in addition to the signal.

Ways to prune a tree:


 Pre-specify an impurity reduction threshold, then start from the root node and only make a split if the
reduction in impurity exceeds this threshold.
Cons: This is a short-sighted method, because a “not-so-good split” early on in the tree-building process
could be followed by a “good split”.
 Cost-complexity pruning: First grow a large tree, then prune it back to get a smaller subtree. We
trade off the goodness of fit of the tree to the training data against the size of the tree using a penalized
objective function given by:
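One common way to write this penalized objective for a subtree T, as a sketch (|T| is the number of terminal nodes and c_p >= 0 is the complexity parameter):

  \text{penalized objective}(T) = \text{relative training error}(T) + c_p \times |T|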

 Cp = 0: there is no price to pay for making the tree more complex, so the tree is identical to
the large, overblown tree grown without pruning.
 As Cp increases from 0 to 1, the penalty term becomes greater and tree branches are snipped off
retroactively to form new, larger terminal nodes. The squared bias of the tree predictions will increase,
but the variance will drop.
 Cp = 1: the trivial tree (root node only). The drop in the relative training error due to an extra split can never
compensate for the increase in the complexity penalty.
 A nested family of trees: if c1 < c2, then the tree fitted with c2 must be smaller, and all of its nodes, terminal
and non-terminal, must form a subset of the nodes of the larger tree fitted with c1.
 Hyperparameter tuning: using cross-validation. For a given Cp value, we use it to fit a tree on
all but one of the k folds, generate predictions on the held-out fold, and measure performance
based on the RSS for a regression tree or the classification error rate for a classification tree. Repeat this
process with each of the k folds left out in turn and compute the overall performance. Choose whichever
Cp value has the lowest RSS or classification error rate.
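As a sketch of the penalized objective referenced earlier in this list (written in the relative-error scaling that rpart uses, which is consistent with Cp running from 0 to 1 as described above):

$$\frac{\text{training error of } T}{\text{training error of the root-only tree}} \;+\; c_p \times |T|,$$

where $|T|$ is the number of terminal nodes of the subtree $T$. The cross-validation search over Cp is automated by rpart; the sketch below is hypothetical (data frame and target names are placeholders), but the rpart functions shown are real.

```r
library(rpart)

# Grow a large tree; xval = 10 requests 10-fold cross-validation of candidate cp values.
full_tree <- rpart(target ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0.0001, xval = 10))

printcp(full_tree)   # cp table: CP, nsplit, rel error, xerror (CV error), xstd
plotcp(full_tree)    # CV error plotted against cp

# Keep the cp value with the lowest cross-validation error and prune back to it.
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)
```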
[IMPORTANT!] GLM vs. decision tree.

Numeric predictors:

1. GLMs: a GLM assumes that the effect of a numeric predictor on the target mean is monotonic. Non-
monotonic relationships can be captured by introducing higher-order polynomial terms.
2. Trees: tree splits are made, possibly repeatedly, separating the training observations into distinct
groups according to the ordered values of the predictor. The predicted mean (regression) or probability
(classification) in these groups is allowed to behave irregularly as a function of the numeric predictor,
without imposing a monotonic structure on the predicted mean. Thus, trees excel at handling complex
relationships.

Categorical predictors:

Interactions:
Collinearity:

Variable transformations

Pros and Cons of decision trees.

Merits
1. Interpretability – Trees can be displayed graphically, which makes them easier to interpret. As long as
there are not too many buckets, trees are easy to explain to non-technical audiences.
2. Handling complex relationships between predictors and target – trees automatically capture non-linear
relationships and interactions between the target variable and predictors, unlike GLMs, which need
polynomial or interaction terms to be inserted manually before the model is fitted.
3. Categorical variables – trees are able to handle categorical variables without the need for binarization or
the selection of a baseline level.
4. Variable selection – trees automatically select important variables, which show up in the top splits, and
filter out unimportant ones. GLMs, in contrast, include all supplied variables and require stepwise
selection or regularization to filter out irrelevant ones.

Demerits
1. Overfitting – trees are more prone to overfitting and tend to produce unstable predictions with high
variance, even with pruning.
2. Numeric variables – to capture the effect of a numeric variable, a tree needs to split on that variable
repeatedly. This gives rise to a complex, large-depth tree that is hard to interpret, whereas a GLM
requires only a single coefficient to be estimated for a numeric predictor (assuming a monotonic
relationship).
3. Categorical variables – trees tend to favor variables with a large number of levels over those with
fewer levels. A split on a many-level factor can produce a large apparent information gain on the
particular training data even when it is not true signal; again, overfitting. (Combining factor levels into
more representative groups prior to fitting the decision tree can remedy this; a small sketch follows this list.)
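For instance, a minimal sketch of the remedy mentioned in point 3; the variable name and the groupings here are hypothetical, not taken from the study text.

```r
# Collapse several related (or sparse) occupation levels into one broader group
# before fitting the tree, so no single many-level factor dominates the splits.
dat$occupation <- as.character(dat$occupation)
dat$occupation[dat$occupation %in% c("lawyer", "judge", "paralegal")] <- "legal"
dat$occupation <- factor(dat$occupation)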

5.1.2 Ensemble Tree Model I: Random Forests

Motivation behind ensemble methods – a decision tree is sensitive to noise and tends to overfit even with
pruning. With small perturbations in the training data, the fitted decision tree could be vastly different and
the resulting predictions can be highly unstable.

Ensemble method – Allows us to hedge the instability of decision trees and substantially improve their
prediction performance in many cases. Instead of relying on one model, we take the results of multiple base
models in aggregate to make an overall prediction.

 Bias – reduced by capturing complex relationships in the data, because the multiple base models each
work on different parts of those relationships.
 Variance – reduced by averaging the predictions of the individual base models, which improves the
stability of the overall predictions.
 Drawback – computationally prohibitive to implement and difficult to interpret, due to the need to
deal with hundreds or thousands of base models.

Bootstrap. Samples that are generated by randomly sampling the original training observations with
replacement.
 Random forest: Generate multiple bootstrapped samples of the training set and fit unpruned
base trees in parallel, independently, on each bootstrapped training sample.
 Numeric target variable: The overall prediction is the simple average (default) of the B base
predictions.
 Categorical target variable: The “majority vote” approach – pick the predicted class that occurs most
commonly among the base predictions.

 Randomization at each split. A random forest injects a randomization element into the
growing process of the base trees.

At each split, a random sample of m (less than or equal to p) predictors is chosen as the split candidates
out of the p predictors. The candidate feature that yields the greatest impurity reduction is used to construct the
split. A new random sample of m predictors is drawn at each split, so a predictor that is sampled in one
split can still be sampled and used in another split.

Pros and Cons of random forests (relative to a single decision tree).

Merit:
1. Random forests are more robust: although the base trees are unpruned and therefore have low bias and high
variance, averaging the results of these trees contributes substantial variance reduction, especially
when the number of base trees is large, and produces much more precise predictions. (Bias is generally
not reduced by a random forest.)

Demerits:
1. Interpretability – random forests are difficult to interpret due to the multiple base models; it is difficult to see
how the predictions depend on each feature.
2. Computational power – a random forest takes longer to fit than a single decision tree.

5.1.3 Ensemble Tree Models II: Boosting – boosting builds a sequence of interdependent trees, each using information from
previously grown trees. We fit a tree to the residuals of the preceding fit and subtract a scaled-down version
of the current tree's predictions from the preceding residuals to form the new residuals. The whole process is
repeated, with the effect that each tree focuses on predicting the observations that the previous trees
predicted poorly.
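A minimal sketch of this residual-fitting idea for a squared-error regression boost. The data frame train, the target column y, and the tuning values are placeholders; real implementations (e.g., xgboost) add many refinements.

```r
library(rpart)

B   <- 100    # number of base trees
eta <- 0.1    # shrinkage / learning rate

pred  <- rep(0, nrow(train))   # running ensemble prediction
resid <- train$y               # initial residuals are just the observed target values
trees <- vector("list", B)

for (b in 1:B) {
  # Fit a shallow tree to the current residuals.
  boost_data <- data.frame(train[, setdiff(names(train), "y")], r = resid)
  trees[[b]] <- rpart(r ~ ., data = boost_data,
                      control = rpart.control(maxdepth = 2, cp = 0))
  step  <- predict(trees[[b]], newdata = boost_data)
  pred  <- pred  + eta * step   # add a scaled-down version of this tree's predictions
  resid <- resid - eta * step   # subtract it from the residuals: the next tree targets what is left
}
```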

Key differences between Random forest and Boosting:


1. Base trees are fitted in parallel in a random forest, whereas the base trees in a boosted model are fitted in
series.
2. Boosted trees focus on reducing model bias, while random forests focus on reducing model
variance. (Bias here refers to the ability to capture the signal of the data.)

The overall prediction of a boosted tree is the sum of the scaled-down predictions of the base trees.

Boosting algorithm with two features.

Pros and Cons of boosting (relative to random forest)


 Boosted trees usually perform better than random forests in terms of prediction accuracy, due to
their emphasis on bias reduction. They are more vulnerable to overfitting, because successive trees keep
trying to capture signal (and eventually noise) from the training data (regularization can remedy this issue),
and they are more sensitive to hyperparameter values.
 Due to the complexity of the ensemble, boosted models are often hard to interpret and the computational burden is heavy.

5.2 Mini case study: a toy decision tree

1. Minsplit: the minimum number of observations that must exist in a node for a split to be attempted (the
smaller, the more complex the model).
2. Minbucket: the minimum number of observations that must exist in any terminal node.
3. Cp: the complexity parameter, penalizing the tree by its size when cost-complexity pruning is
performed. The higher the cp, the heavier the penalty and the simpler the model.
4. Maxdepth: the maximum number of branches from the root node to any terminal node. The four parameters map onto rpart.control as sketched below.
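A hypothetical rpart call showing where each of the four parameters goes; the data frame and target names are placeholders.

```r
library(rpart)

toy_tree <- rpart(
  target ~ ., data = train, method = "class",
  control = rpart.control(
    minsplit  = 20,    # a node needs at least 20 observations before a split is attempted
    minbucket = 7,     # every terminal node must contain at least 7 observations
    cp        = 0.01,  # complexity parameter used in cost-complexity pruning
    maxdepth  = 4      # at most 4 branches from the root node to any terminal node
  )
)
```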

5.3 Extended Case Study: Classification tree


5.3.1 Problem set-up and preparatory steps

Turn the numeric Wage variable into a binary Wage_flag that equals 1 for high earners (wage greater than or
equal to $100K a year) and 0 otherwise.
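A minimal sketch of this step, assuming wages are recorded in thousands of dollars in a column named wage (both the data frame name and the unit are assumptions):

```r
# 1 = high earner (wage >= $100K per year), 0 = otherwise
wages$Wage_flag <- factor(ifelse(wages$wage >= 100, 1, 0))
wages$wage <- NULL   # drop the numeric target so it cannot leak into the classifier
```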

For a regression tree, the splits are chosen to MINIMIZE the residual sum of squares. For a right-skewed
target variable, the large observed values will exert a disproportionate effect on the RSS calculations.
GLMs, in contrast, assume a monotonic relationship between the target variable and the predictors.

Year as a factor variable and

5.3.2 Construction and Evaluation of a single classification tree

Tree 2: Prune Tree 1 (the full tree) using the minimizer of xerror – choose the cp value with the lowest CV error
(xerror).
Tree 3: Prune Tree 1 using the one-standard-error rule – select the smallest tree whose CV error is
within one standard error of the minimum CV error.

Using Tree 3 to make a prediction. – how to fully Interpret a decision tree.

How do decision trees handle non-linearity?


It divides the feature space into a set of mutually exclusive regions, each of which has a possibly different
target mean. As a function of the numeric predictor of interest, these target means can behave in a highly
non-linear fashion, depending on true relationship between numeric predictors and target variable.

5.3.3 Construction and Evaluation of Ensemble Tree.

Ensemble Tree 1: Random Forest.

Mtry: Among all the parameters of a random forest, mtry is the number of features considered as split
candidates at each split. The larger the value of mtry, the more similar the base trees become and the less
variance reduction we get from averaging. E.g., if mtry = 2, then only two randomly sampled predictors are
considered in every split of every base tree.

Ntree: Building more base trees tends to improve the prediction performance of a random forest. With more
base trees, the variance reduction contributed by averaging becomes more significant and the model
predictions become more precise.
Variable importance

Partial dependence plot.
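A hypothetical randomForest fit tying these pieces together; the data frame, target, and the feature shown in the partial dependence plot ("age") are placeholders, while the functions are the package's own.

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Wage_flag ~ ., data = train,
                   ntree = 500,        # number of bootstrapped base trees
                   mtry  = 2,          # predictors sampled as split candidates at each split
                   importance = TRUE)

varImpPlot(rf)                                      # variable importance
partialPlot(rf, pred.data = train, x.var = "age")   # partial dependence on one feature
```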


Ensemble Tree 2: Boosted tree.
Nrounds: maximum number of trees to grow. Often set to 1,000, which is large enough to capture the signal
in the data but not excessively large, so as to avoid overfitting.

Eta: shrinkage parameter / learning rate, 0 < eta < 1. The higher the learning rate, the faster the model reaches
optimality and the fewer iterations are required, though the resulting model is more likely to overfit.

 For larger eta, the accuracy starts to drop beyond about 100 rounds. The larger the eta, the faster the model
converges, which can lead to overfitting.
 For smaller eta, the accuracy first rises, a sign that the base trees of the boosted model are
learning the signal more and more effectively. Once nrounds goes beyond roughly 150 or 200, the accuracy
starts to drop.
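A sketch using the classic xgboost() interface; the model matrix, target, and tuning values below are placeholders for the case-study data.

```r
library(xgboost)

X <- model.matrix(Wage_flag ~ . - 1, data = train)   # numeric feature matrix (dummies for factors)
y <- as.numeric(train$Wage_flag) - 1                 # 0/1 target

set.seed(42)
bst <- xgboost(data = X, label = y,
               objective = "binary:logistic",
               nrounds   = 1000,   # maximum number of boosting rounds (trees)
               eta       = 0.01,   # shrinkage / learning rate
               max_depth = 3,
               verbose   = 0)
```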

6.1.1 Conceptual Foundations

Principal components analysis (PCA) is an advanced data analytic technique that transforms a high-
dimensional dataset into a smaller, more manageable set of representative variables that capture most of the
information of the original dataset.

Principal components are constructed from the features.

PC loadings – there are two ways to choose PC loadings


 PC loadings are defined to maximize variance. (the larger the spread, the more information we have)
 Line with minimal distance – line that is as close as possible to the observations in the sense that it
minimizes the sum of the squared perpendicular distances between each data point and the line.

Applications of PCA.
1. Application 1: data visualization – via PCA, we reduce the dimension of the original dataset from p
variables to a smaller set of variables while retaining most of the information, as measured by variance.
2. Application 2: feature generation – unsupervised learning can be applied to “improve predictive
modeling outcomes.” NOTE: once we create new PCs from the original variables, these predictors are
mutually uncorrelated, so the issue of one variable being confounded with another no longer arises. By
reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-
variance trade-off and improve the prediction performance of the model.

6.1.2 Additional PCA Issues

Proportion of variance explained. One problem with PCA is deciding how many PCs, m, to retain.
We want to choose the smallest m required to visualize or understand the data well. This can be done by
assessing the variance explained by each PC in comparison to the total variance.
Choosing the number of PCs by a scree plot. – Choose the smallest number of PCs required to explain a
sizable proportion of the variance.

Importance of centering and scaling. – Another issue when applying PCA is whether to mean-center the
variables and/or scale the variables to have unit variance.

 Centering: whether a variable has its sample mean subtracted (to yield mean zero) has NO effect on the
PC loadings. PC loadings are defined to maximize the variance of the PC scores, and variance remains
unchanged when the values of a variable are shifted by a constant.
 Scaling:
1. If we conduct PCA on their original scale, we are determining the PC loadings based on the
COVARIANCE matrix of the variables.
2. If we conduct PCA on their standardized variables, we are determining the PC loadings based
on the CORRELATION matrix of the variables.

Scaling is important to PCA results: if a dataset contains variables with vastly different magnitudes, say one
variable ranges from 0 to 10 and another from 0 to 100,000, then because the PC loadings are defined to
maximize the sample variance, the variable with the larger variance will be assigned the larger weight. However,
there is no guarantee that this variable explains much of the underlying patterns in the data. It is not desirable for
the results of a PCA to depend on an arbitrary choice of scaling, so scaling is recommended.
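A small sketch contrasting the two choices on a generic numeric data frame dat (a placeholder), also showing how the PVE and scree plot discussed above are obtained:

```r
pca_cov <- prcomp(dat, center = TRUE, scale. = FALSE)  # loadings based on the covariance matrix
pca_cor <- prcomp(dat, center = TRUE, scale. = TRUE)   # loadings based on the correlation matrix

summary(pca_cor)                     # proportion of variance explained (PVE) by each PC
screeplot(pca_cor, type = "lines")   # scree plot for choosing the number of PCs to retain
```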

PCA with categorical features. – We need to binarize the variables in advance, then run PCA on the
dummy variables and use the first few PCs to summarize most of the information.
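A minimal sketch of the binarization step; dat_fac is a placeholder data frame of factor variables.

```r
X_dummy <- model.matrix(~ . - 1, data = dat_fac)   # dummy-code every factor level
# (drop any constant dummy columns before scaling, since they cannot be standardized)
pca_dum <- prcomp(X_dummy, scale. = TRUE)          # PCA on the dummy variables
summary(pca_dum)                                   # first few PCs summarize most of the information
```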

Drawbacks of PCA.
 Interpretability – PCs can be complicated linear combinations of the original features. What do they
actually represent?
 Not good for non-linear relationships – by construction, PCA uses linear transformations to
summarize and visualize high-dimensional datasets, and it works best when the variables are highly linearly correlated.
 PCA is not feature selection – PCA reduces the dimensionality of the model, but it is not feature
selection, because each PC is a linear combination of the original features; none of the original features
are removed.
 The target variable is ignored – when PCs are used in a supervised learning setting, the PC loadings and scores
are generated completely independently of the target variable. We are assuming that the directions in which the
features exhibit the most variation are also the directions most associated with the target variable, and there is
often no guarantee that this is true.

6.1.3 A Simple Case study

Data description.

 UrbanPop as percentage – range from 32 to 91


 The other three variables – much larger ranges, and no such restriction. More serious crimes have lower
arrest numbers, which makes sense.

 Murder/Assault/Rape are strongly positively correlated. This suggests that PCA may be an effective
technique to “compress” these three crime levels into a single measure.
 UrbanPop does not have a strong linear relationship with the three crime-related variables.
Interpretation of the PCs.
Interpretation of biplot.
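The variables named above match the familiar USArrests data, so a hedged sketch of the fit, the loadings, and the biplot used for interpretation is:

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pca$rotation            # PC loadings on Murder, Assault, UrbanPop and Rape
biplot(pca, scale = 0)  # scores and loadings on one plot, used to interpret the PCs
```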
6.2 Cluster Analysis
Cluster analysis works by algorithmically partitioning the observations in a dataset into a set of distinct, non-
overlapping subgroups, better known as clusters, with the goal of discovering hidden patterns in the data
that would otherwise be missed.
 Observations within each cluster share similar characteristics. (Measured by feature values.)
 Observations in different clusters are rather different from one another.

6.2.1 K-means Clustering – Specify the number of clusters K upfront and assign each observation to one and
only one of the K clusters, where each cluster hosts relatively homogeneous observations.

How to quantify the homogeneity of the observations?

Within-cluster SS and total SS. The variation of the observations living in each cluster is quantified by the
within-cluster sum of squares (SS), defined as the sum of squared distances between each observation in
Ck and the centroid of Ck.

The smaller the Within-cluster SS, the more homogeneous the observations within each cluster and the
better the separation works.
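In symbols (the standard definition, matching the verbal description above), for cluster $C_k$ with centroid coordinates $\bar{x}_{kj}$:

$$W(C_k) = \sum_{i \in C_k} \sum_{j=1}^{p} \left(x_{ij} - \bar{x}_{kj}\right)^2, \qquad \text{total within-cluster SS} = \sum_{k=1}^{K} W(C_k).$$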

K-means clustering algorithm (note: it generally produces only a local optimum, especially when K and n are large).
 Step 1: Initialization – Randomly select K points in the feature space; these will serve as the initial
cluster centers.
 Step 2: Iteration – Repeat the following steps until the cluster assignments no longer change.
a. Assign each observation to the closest cluster centroid (using Euclidean distance).
b. Recalculate the center of each of the K clusters.

This algorithm involves iteratively recalculating the K means of the clusters. Thus named “K-means”
clustering.

Example of recalculating centroid.

Two practical issues in K-means clustering


 Global vs. local optimum. K-means clustering is only guaranteed to produce a local, not necessarily a GLOBAL,
optimum. Different choices of the initial cluster centers can result in a different final set of clusters, and
the clustering method is generally sensitive to outliers in the data. To mitigate the randomness
associated with the initial cluster centers and increase the chance of identifying a global optimum, run the
algorithm with a number of different random starts (say 20 to 50) and choose the run with the lowest total
within-cluster SS; a sketch follows this list.
 Choice of K: The Elbow method. A good cluster grouping is characterized by a small within-cluster SS
but a large BETWEEN-cluster SS.
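A sketch of both points with the built-in kmeans() function; X_scaled is a placeholder standardized feature matrix.

```r
set.seed(42)
km <- kmeans(X_scaled, centers = 3, nstart = 25)   # 25 random starts to guard against poor local optima
km$tot.withinss                                    # total within-cluster SS for the best start

# Elbow method: plot total within-cluster SS against K and look for the "elbow"
wss <- sapply(1:10, function(k) kmeans(X_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
```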

6.2.2 Hierarchical Clustering

Inter-cluster dissimilarity. Hierarchical clustering consists of a series of fusions of observations in the data. It starts with the
individual observations, each treated as a single cluster, and successively fuses the closest pair of clusters,
one pair at a time. The process goes on iteratively until all the clusters are fused into a single cluster
containing all of the observations.

 Average linkage – compute all pairwise distances between the observations in the two clusters and then take
the average.
 Centroid linkage – take the average of the feature values within each cluster to get the centroids, then
compute the distance between the two centroids.
Dendrogram. Upside-down tree showing the sequence of fusions and the inter-cluster dissimilarity.

 Clusters are nested – The clusters formed by cutting the dendrogram at one height must be nested
within those formed by cutting the dendrogram at a greater height. For example, a cut at height 10 giving
[1,3,4] and [2,5] might become [2,5], [1] and [3,4] when cut at height 4.
 One dendrogram suffices – The height of the cut controls the number of clusters, and it need not
be pre-specified.
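A sketch of the corresponding base-R workflow; X_scaled is again a placeholder standardized feature matrix.

```r
d  <- dist(X_scaled)                  # pairwise Euclidean distances
hc <- hclust(d, method = "average")   # average linkage; "centroid" is another option
plot(hc)                              # dendrogram

clusters_k4 <- cutree(hc, k = 4)      # cut so as to obtain 4 clusters...
clusters_h  <- cutree(hc, h = 10)     # ...or cut at a chosen height
```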

Similarity and dissimilarity between K-means and Hierarchical clustering.

6.2.3 Practical Issues in Clustering

Scaling of variables matters for both K-means and hierarchical clustering.


 Without scaling. If one variable is on a much larger order of magnitude, it will dominate the
Euclidean distance calculations and exert a disproportionate impact on the cluster arrangements.
 With Scaling. When the features are standardized, we essentially attach equal importance to each
feature when performing distance calculations.

Choice of dissimilarity measure: Euclidean vs. correlation


Using cluster analysis to generate features, as with PCA.
 Cluster groups. The group assignment created as a result of clustering is a factor variable that may
be used in place of the original variables and can serve as a potentially useful feature for predicting the
target variable.
 Cluster centers. We may also replace the original variables by cluster centers, which will serve as new
numeric features.
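A minimal sketch of both ideas, assuming km is a fitted kmeans object on the same rows as a placeholder data frame dat:

```r
dat$cluster_group <- factor(km$cluster)   # cluster assignment as a new factor feature

centers     <- as.data.frame(km$centers)  # one row of feature means per cluster
dat_centers <- centers[km$cluster, ]      # each observation replaced by its cluster center
```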

Clustering and curse of dimensionality.

With 3 or more variables (p greater than or equal to 3), visualization is much harder. With 2 dimensions, we can
visualize the cluster assignments using a two-dimensional scatterplot.

As the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate
between observations that are close and those that are far apart.
