Week 4
Machine Learning
(112-1: EE5184)
劉子毓 Joyce Liu
Outline
● A quick review of the materials last week
● Tree-based methods
Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2017. [Author's pdf link]
A quick review of the materials last week
Linear regression of an indicator matrix
● Suppose each of the response categories is coded via an indicator variable. If the class output variable G has K classes, there will be K indicators Y_k, k = 1, ..., K, with Y_k = 1 if G = k and Y_k = 0 otherwise. These are collected together in a vector Y = (Y_1, ..., Y_K).
● Fit a linear regression model to each of the columns of Y simultaneously (a minimal sketch follows below).
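As a quick illustration (my own sketch, not from the slides; the toy data and variable names are made up), the indicator-matrix approach can be written in a few lines of numpy: build the K indicator columns, solve one least-squares problem for all columns at once, and classify a new point by the largest fitted value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-class data (made up for illustration): class means at 0, 3, 6 along the first axis.
K, n_per_class, p = 3, 50, 2
X = np.vstack([rng.normal(loc=[3 * k, 0.0], size=(n_per_class, p)) for k in range(K)])
g = np.repeat(np.arange(K), n_per_class)          # class labels G in {0, ..., K-1}

# Indicator response matrix Y: Y[i, k] = 1 if g[i] == k, else 0.
Y = np.eye(K)[g]

# One linear regression per column of Y, fitted simultaneously (with an intercept column).
X1 = np.hstack([np.ones((X.shape[0], 1)), X])
Bhat, *_ = np.linalg.lstsq(X1, Y, rcond=None)      # shape (p + 1, K)

# Classify a new point by the largest fitted value among the K columns.
def predict(x_new):
    return np.argmax(np.hstack([1.0, x_new]) @ Bhat)

print(predict(np.array([0.2, 0.0])), predict(np.array([6.1, 0.0])))   # typically 0 and 2
```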
LDA and QDA
Logistic regression
The logistic regression model specifies the K − 1 log-odds (relative to the last class K) as linear functions of x:

$$\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{k0} + \beta_k^T x,\qquad k = 1,\dots,K-1.$$
Logistic regression
After some calculations, one can show that

$$\Pr(G=k\mid X=x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0} + \beta_\ell^T x)},\qquad k = 1,\dots,K-1,$$
$$\Pr(G=K\mid X=x) = \frac{1}{1 + \sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0} + \beta_\ell^T x)},$$

and these probabilities sum to 1. In what follows, we denote the parameter set θ = {β_10, β_1^T, ..., β_(K−1)0, β_(K−1)^T} and write the probabilities as Pr(G = k | X = x) = p_k(x; θ).
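A small numpy sketch (my own illustration; the coefficients below are made up) of these class probabilities p_k(x; θ), checking that they are positive and sum to 1:

```python
import numpy as np

def class_probs(x, intercepts, betas):
    """p_k(x; theta) for k = 1..K, given the K-1 intercepts beta_k0 and slope vectors beta_k."""
    scores = np.exp(intercepts + betas @ x)      # exp(beta_k0 + beta_k^T x), k = 1, ..., K-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom        # the last class K gets probability 1 / denom

# Made-up parameters for a K = 3 class, p = 2 feature problem.
intercepts = np.array([0.5, -1.0])               # beta_10, beta_20
betas = np.array([[1.0, -2.0],                   # beta_1
                  [0.3,  0.7]])                  # beta_2
probs = class_probs(np.array([0.2, 1.5]), intercepts, betas)
print(probs, probs.sum())                        # probabilities sum to 1
```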
Separating hyperplanes
The optimal separating hyperplane solves

$$\max_{\beta,\,\beta_0,\,\|\beta\|=1} M \quad\text{subject to}\quad y_i(x_i^T\beta + \beta_0) \ge M,\; i = 1,\dots,N.$$

Note that this optimization is conducted over all training points. The set of conditions ensures that all the points are at least a signed distance M from the decision boundary defined by β and β_0.
● We can represent the support vector classifier optimization problem and its solution in a special way that only involves the input features via inner products.
● Suppose we first transform our input features x using h(x). One can show that the solution f(x) can be written as

$$f(x) = h(x)^T\beta + \beta_0 = \sum_{i=1}^{N} \alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0,$$

so the transformed features enter only through inner products (see the sketch below).
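One way to see the inner-product form in practice (a sketch under the assumption that scikit-learn is available; not part of the slides): for a fitted SVC, `dual_coef_` stores α_i y_i for the support vectors, so the decision function can be rebuilt as a kernel-weighted sum over the support vectors plus the intercept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy binary data (made up for illustration).
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_i alpha_i y_i <h(x), h(x_i)> + beta_0, where the sum runs over support vectors
# and the inner products are given by the RBF kernel.
K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(f_manual, clf.decision_function(X)))   # True: matches the library's f(x)
```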
The “+” notation indicates the positive part. This formulation has the loss + penalty format; the loss [1 − y f(x)]_+ is known as the hinge loss function. The support vector classifier optimization problem is equivalent to

$$\min_{\beta_0,\,\beta}\ \sum_{i=1}^{N}\bigl[1 - y_i f(x_i)\bigr]_+ + \frac{\lambda}{2}\|\beta\|^2,\qquad f(x) = \beta_0 + x^T\beta.$$
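To make the loss + penalty view concrete, here is a small subgradient-descent sketch (my own illustration, with made-up data) that minimizes Σ_i [1 − y_i(x_i^T β + β_0)]_+ + (λ/2)‖β‖² directly:

```python
import numpy as np

def fit_hinge(X, y, lam=0.1, lr=0.01, n_iter=2000):
    """Minimize sum_i [1 - y_i (x_i^T beta + beta0)]_+ + (lam/2)||beta||^2 by subgradient descent.

    Labels y must be coded as +1 / -1.
    """
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta0)
        active = margins < 1                   # only points inside the margin contribute
        grad_beta = -(y[active, None] * X[active]).sum(axis=0) + lam * beta
        grad_beta0 = -y[active].sum()
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta, beta0

# Tiny made-up separable example.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = fit_hinge(X, y)
print(np.sign(X @ beta + beta0))               # recovers the labels [1, 1, -1, -1]
```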
Some methods related to LDA and QDA
○ The above estimate could be bumpy, and often a smooth estimate is preferred.
Naive Bayes
Another popular technique is to assume that, given a class G = j, the features X_k are independent, so that the class density factorizes as f_j(X) = ∏_k f_jk(X_k).
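A minimal Gaussian naive Bayes sketch (my own illustration, not from the slides): fit one univariate Gaussian per class and per feature, then combine them under the independence assumption f_j(X) = ∏_k f_jk(X_k).

```python
import numpy as np

class GaussianNaiveBayes:
    """Per-class Gaussian fit to each feature independently: f_j(X) = prod_k f_jk(X_k)."""

    def fit(self, X, g):
        self.classes_ = np.unique(g)
        self.priors_ = np.array([(g == c).mean() for c in self.classes_])
        self.means_ = np.array([X[g == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[g == c].var(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # log pi_j + sum_k log f_jk(X_k), one column of scores per class j
        log_post = []
        for pi, mu, var in zip(self.priors_, self.means_, self.vars_):
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_post.append(np.log(pi) + log_lik)
        return self.classes_[np.argmax(np.column_stack(log_post), axis=1)]

# Toy data (made up): two well-separated Gaussian classes in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
g = np.repeat([0, 1], 50)
print((GaussianNaiveBayes().fit(X, g).predict(X) == g).mean())   # training accuracy, close to 1
```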
Naive Bayes
We can further derive the logit-transform as follows:

$$\log\frac{\Pr(G=\ell\mid X)}{\Pr(G=K\mid X)} = \log\frac{\pi_\ell \prod_k f_{\ell k}(X_k)}{\pi_K \prod_k f_{K k}(X_k)} = \alpha_\ell + \sum_{k=1}^{p} g_{\ell k}(X_k).$$

It has the form of a generalized additive model, which we will discuss further later in the class.
Generalized additive models
Smoothing splines
● We have made use of models linear in the input features, both for regression and
classification. For example, linear regression, linear discriminant analysis, logistic regression
and separating hyperplanes all rely on a linear model.
● The idea is to augment/replace the vector of inputs X with additional variables, which are
transformations of X, and then use linear models in this new space of derived input features.
Smoothing splines
● Once the basis functions h_m have been determined, the models are linear in these new variables, and the fitting proceeds as before (a short sketch follows below).
● Examples: h_m(X) = X_m (the original linear model); h_m(X) = X_j^2 or h_m(X) = X_j X_k (polynomial terms); h_m(X) = log(X_j) or √X_j (other nonlinear transformations); h_m(X) = I(L_m ≤ X_k < U_m) (piecewise constant indicators).
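A brief sketch of the idea (my own illustration; the particular transformations and toy data are made up): derive new features h_m(X) from X and fit an ordinary least-squares model that is linear in the derived features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (made up) with a nonlinear trend.
x = rng.uniform(0.1, 3.0, size=200)
y = np.sin(2 * x) + 0.5 * np.log(x) + rng.normal(scale=0.1, size=x.size)

# Derived inputs h_m(x): 1, x, x^2, x^3, log(x). The model is linear in these.
H = np.column_stack([np.ones_like(x), x, x**2, x**3, np.log(x)])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Prediction uses the same linear form evaluated at new x.
x_new = np.array([0.5, 1.5, 2.5])
H_new = np.column_stack([np.ones_like(x_new), x_new, x_new**2, x_new**3, np.log(x_new)])
print(H_new @ beta)
```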
Smoothing splines
Consider the following problem: among all functions f(x) with two continuous derivatives, find the one that minimizes the penalized residual sum of squares

$$\mathrm{RSS}(f,\lambda) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 + \lambda\int \bigl(f''(t)\bigr)^2\,dt.$$

The first term measures closeness to the data, while the second term penalizes curvature in the function, and λ establishes a tradeoff between the two.
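As a quick illustration (assuming a recent SciPy, roughly 1.10 or later; not part of the slides), `scipy.interpolate.make_smoothing_spline` fits this penalized least-squares criterion directly, with its `lam` argument playing the role of λ:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)

# Noisy toy data (made up); x must be strictly increasing for make_smoothing_spline.
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Larger lam -> smoother (straighter) fit; smaller lam -> fit closer to the data.
spl_smooth = make_smoothing_spline(x, y, lam=10.0)
spl_rough = make_smoothing_spline(x, y, lam=0.001)

x_grid = np.linspace(0, 10, 5)
print(spl_smooth(x_grid))
print(spl_rough(x_grid))
```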
Smoothing splines
● The solution is a natural cubic spline with knots at the unique values of the x_i, so it can be written as

$$f(x) = \sum_{j=1}^{N} N_j(x)\,\theta_j,$$

where the N_j(x) are an N-dimensional set of basis functions for representing this family of natural splines (see Section 5.2).
Smoothing splines
A smoothing spline is an example of a linear smoother, because the estimated parameters are a linear combination of the y_i:

$$\hat\theta = (\mathbf{N}^T\mathbf{N} + \lambda\Omega_N)^{-1}\mathbf{N}^T y.$$

The N-vector of fitted values at the training predictors x_i is also linear in y:

$$\hat{f} = \mathbf{N}(\mathbf{N}^T\mathbf{N} + \lambda\Omega_N)^{-1}\mathbf{N}^T y = \mathbf{S}_\lambda y.$$

The finite linear operator S_λ is known as the smoother matrix.
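To make the linear-smoother point concrete, here is a small numpy sketch (my own illustration; the truncated power basis B and the second-difference coefficient penalty standing in for Ω_N are assumptions, not the exact construction in the text): the fitted values are f̂ = B(BᵀB + λΩ)⁻¹Bᵀy = S_λ y, and S_λ depends on the x_i and λ but not on y.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=80))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Assumed basis (for illustration): cubic truncated power basis with a few interior knots.
knots = np.linspace(1, 9, 8)
B = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                    [np.clip(x - k, 0, None) ** 3 for k in knots])

# Simple roughness penalty Omega (an assumption standing in for the integral penalty):
# penalize second differences of the basis coefficients.
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)
Omega = D.T @ D

lam = 1.0
S = B @ np.linalg.solve(B.T @ B + lam * Omega, B.T)   # smoother matrix S_lambda
f_hat = S @ y                                          # fitted values are linear in y

# Linearity check: doubling y doubles the fit, and S itself never touches y.
print(np.allclose(S @ (2 * y), 2 * f_hat))             # True
print(np.trace(S))                                     # effective degrees of freedom df(lambda)
```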
● This is helpful since, in real life, effects are often not linear.
● The additive logistic regression model replaces each linear term by a more general functional form:

$$\log\frac{\mu(X)}{1-\mu(X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p),\qquad \mu(X) = \Pr(Y=1\mid X).$$

● In the generalized additive model, not all of the functions f_j need to be nonlinear. We can mix in linear and other nonlinear terms.
Given observations (x_i, y_i), a criterion like the penalized sum of squares can be written as

$$\mathrm{PRSS}(\alpha, f_1,\dots,f_p) = \sum_{i=1}^{N}\Bigl(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\Bigr)^2 + \sum_{j=1}^{p}\lambda_j\int \bigl(f_j''(t_j)\bigr)^2\,dt_j.$$

The idea is to iteratively solve for each function f_j (backfitting). When solving for a function f_j, we can adopt what we learned in smoothing splines.
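A compact backfitting sketch (my own illustration; it reuses the SciPy smoothing spline from above, so it again assumes a recent SciPy, and the additive toy data are made up): repeatedly smooth the partial residual y − α − Σ_{k≠j} f_k against x_j and re-center each f_j.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def backfit(X, y, lam=0.01, n_iter=20):
    """Fit y ~ alpha + sum_j f_j(X_j) by backfitting, smoothing each partial residual."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))                                  # current fitted values f_j(x_ij)
    splines = [None] * p
    for _ in range(n_iter):
        for j in range(p):
            r = y - alpha - f.sum(axis=1) + f[:, j]       # partial residual for coordinate j
            order = np.argsort(X[:, j])                   # smoother needs increasing x
            splines[j] = make_smoothing_spline(X[order, j], r[order], lam=lam)
            f[:, j] = splines[j](X[:, j])
            f[:, j] -= f[:, j].mean()                     # keep each f_j centered
    return alpha, splines, f

# Toy additive data (made up): y = sin(x1) + 0.5 * x2^2 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
alpha, splines, f = backfit(X, y)
print(np.corrcoef(alpha + f.sum(axis=1), y)[0, 1])        # close to 1
```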
Tree-based methods
Tree-based methods
Tree-based methods partition the feature space into a set of rectangles, and then
fit a simple model in each one. Here we introduce tree-based regression and
classification, known as CART (Classification And Regression Trees).
● Suppose we have a regression problem with
continuous response Y and inputs X1 and
X2.
Tree-based methods
To simplify, we restrict to recursive binary partitions.
Regression tree
● Our data consists of p inputs and a response, for each of N observations:
● Suppose we have a partition into M regions, R1,..., RM, and we model the
response as a constant cm.
Regression tree
Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible, so we proceed with a greedy algorithm (see the sketch below).
● Start with all of the data. Consider a splitting variable j and split point s, and define the pair of half-planes R_1(j, s) = {X | X_j ≤ s} and R_2(j, s) = {X | X_j > s}. Choose the pair (j, s) that minimizes the sum of squares in the two regions.
● Having found the best split, partition the data into the two resulting regions and repeat the splitting process on each of the two regions.
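A bare-bones numpy sketch of one greedy step (my own illustration with made-up data): for each splitting variable j and each candidate split point s, compute the two half-plane sums of squares around the region means and keep the pair (j, s) with the smallest total. Growing the tree just repeats this on each resulting region.

```python
import numpy as np

def best_split(X, y):
    """Greedy search over splitting variable j and split point s minimizing the two-region RSS."""
    best = (np.inf, None, None)                    # (rss, j, s)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            if left.all() or not left.any():
                continue
            rss = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if rss < best[0]:
                best = (rss, j, s)
    return best

# Toy data (made up): the response jumps when X_0 crosses 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] <= 0, 1.0, 5.0) + rng.normal(scale=0.2, size=200)
print(best_split(X, y))   # chooses j = 0 with a split point near 0
```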
Regression tree
We have the algorithm to grow the tree. But how large should we grow the tree?
● Strategy 1: Split tree nodes only if the decrease in sum-of-squares due to the
split exceeds some threshold.
○ However, sometimes a seemingly worthless split might lead to a good split below it. This
strategy can be short-sighted.
● Strategy 2: Grow a large tree, stopping the splitting process only when some minimum node size is reached. Then prune the tree using cost-complexity pruning (see the sketch below).
Regression tree
Cost-complexity pruning: for a subtree T of the large tree T_0 with |T| terminal nodes, let N_m and Q_m(T) denote the number of observations and the within-node squared error in terminal node m, and define

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha\,|T|.$$

For each α we seek the subtree T_α that minimizes C_α(T); the tuning parameter α ≥ 0 governs the tradeoff between tree size and goodness of fit, and is typically chosen by cross-validation.
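For Strategy 2, scikit-learn exposes cost-complexity pruning directly (a sketch assuming a reasonably recent scikit-learn, 0.22 or later, and made-up data): `cost_complexity_pruning_path` returns the sequence of effective α values for a grown tree, and refitting with a chosen `ccp_alpha` gives the pruned tree; in practice α would be selected by cross-validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (made up).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Grow a large tree first (Strategy 2), then examine the pruning path.
full_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
print(len(path.ccp_alphas), "candidate alphas")

# Refit with one alpha from the path; larger alpha -> smaller (more pruned) tree.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=alpha, random_state=0).fit(X, y)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```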
Classification tree
If the target is a classification outcome taking values 1, 2, ..., K, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree.
Classification tree
In a node m representing region R_m with N_m observations, let p̂_mk be the proportion of class k observations in the node, and classify the node to class k(m) = argmax_k p̂_mk. Common measures of node impurity Q_m(T) are
● Misclassification error: 1 − p̂_m,k(m)
● Gini index: Σ_k p̂_mk (1 − p̂_mk)
● Cross-entropy (deviance): −Σ_k p̂_mk log p̂_mk
Classification tree
● Cross-entropy and the Gini index are differentiable and hence more amenable to
numerical optimization.
● Cross-entropy and the Gini index are more sensitive to changes in the node
probabilities than the misclassification rate.
○ Suppose we have (400, 400) observations, i.e., 400 observations in each class in a 2-class problem.
■ Scenario 1: a split that created nodes (300, 100) and (100, 300)
■ Scenario 2: a split that created nodes (200, 400) and (200, 0)
○ Both the Gini index and cross-entropy are lower for scenario 2, while the two scenarios achieve the same misclassification error rate (0.25 in both cases; verified numerically below). Note that scenario 2 produces a pure node and is probably preferable.
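The comparison is easy to check numerically (a small sketch verifying the numbers in this example):

```python
import numpy as np

def node_impurities(counts):
    """Misclassification error, Gini index, and cross-entropy (base 2) for one node."""
    p = counts / counts.sum()
    miscls = 1 - p.max()
    gini = 1 - (p ** 2).sum()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return np.array([miscls, gini, entropy])

def split_impurity(nodes):
    """Impurity of a split: node impurities weighted by node sizes."""
    nodes = [np.array(c, dtype=float) for c in nodes]
    w = np.array([c.sum() for c in nodes])
    return sum(wi * node_impurities(c) for wi, c in zip(w, nodes)) / w.sum()

s1 = split_impurity([(300, 100), (100, 300)])
s2 = split_impurity([(200, 400), (200, 0)])
print("scenario 1 (miscls, gini, entropy):", s1)   # approx [0.25, 0.375, 0.811]
print("scenario 2 (miscls, gini, entropy):", s2)   # approx [0.25, 0.333, 0.689]
```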
Homework
● Reading materials
● Colab experiment
● You are welcome to explore different regularization terms in each method, and even other
classification methods.