Week 4

M2Lab

Machine Learning
(112-1: EE5184)
劉子毓 Joyce Liu

1

Outline
● A quick review of the materials last week

● Some methods related to LDA and QDA


○ Kernel density classification
○ Naive Bayes

● Generalized additive models


○ Smoothing Splines
○ Additive logistic regression algorithm

● Tree-based methods

Hastie, Trevor, et al. The elements of statistical learning: data mining, inference, and prediction. New York: springer, 2017. [Author’s pdf link]
2

A quick review of the materials last week

3

Linear regression of an indicator matrix


Can we apply the linear regression method to this problem?

● Suppose each of the response categories is coded via an indicator variable. If the class output variable G has K classes, there will be K indicators, Yk, k = 1, . . . , K, with Yk = 1 if G = k and 0 otherwise. These are collected together in a vector Y = (Y1, . . . , YK).
● Fit a linear regression model to each of the columns of Y simultaneously.

● X is the model matrix with p+1 columns corresponding to the p inputs, and a leading column of 1’s for the intercept.
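For reference, the least-squares fit in this setting (ESL, Section 4.2) is
\[
\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{Y} = \mathbf{X}\hat{\mathbf{B}},
\]
and a new observation with input x is classified by computing the fitted vector \(\hat f(x)^{\mathsf T} = (1, x^{\mathsf T})\hat{\mathbf{B}}\) and choosing the largest component, \(\hat G(x) = \arg\max_k \hat f_k(x)\).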

4

Linear discriminant analysis (LDA)


● Suppose fk(x) is the class-conditional density of X in class G = k, and let
πk be the prior probability of class k.

● The class posteriors can be written via Bayes’ theorem, as shown below.

● Assume that we model each class density as a multivariate Gaussian.

● Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix.
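In the textbook’s notation, the Bayes-theorem expression for the class posteriors is
\[
\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}.
\]
With Gaussian class densities sharing a common covariance matrix, the log posterior odds between any two classes are linear in x, which is why the resulting decision boundaries are linear.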
5

Linear discriminant analysis (LDA)


● Now we know how to make a prediction given the parameters of the Gaussian distributions. The remaining question is: how do we estimate these parameters from the training data?
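For reference, the standard estimates (ESL, Section 4.3), with Nk the number of class-k observations, are
\[
\hat\pi_k = \frac{N_k}{N}, \qquad
\hat\mu_k = \frac{1}{N_k}\sum_{g_i = k} x_i, \qquad
\hat\Sigma = \frac{1}{N-K}\sum_{k=1}^{K}\sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^{\mathsf T}.
\]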

6

Quadratic discriminant analysis (QDA)


Note that in LDA, we assumed that the covariance matrices of each class are the same. Without
the assumption, the convenient cancellation would not occur, and the pieces quadratic in x remain.
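For reference, the two discriminant functions, written in the textbook’s notation, are
\[
\text{LDA: } \delta_k(x) = x^{\mathsf T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{\mathsf T}\Sigma^{-1}\mu_k + \log\pi_k,
\qquad
\text{QDA: } \delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k,
\]
and in both cases we classify to the class with the largest \(\delta_k(x)\): linear in x for LDA, quadratic in x for QDA.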


7

Logistic regression

For a binary response with a 0/1 coding as above, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities!

8

Logistic regression
After some calculations, one can show that the class probabilities take the form below, and these probabilities sum to 1. In what follows, we denote the full parameter set by θ and write pk(x; θ) = Pr(G = k | X = x; θ).
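For reference, the multiclass logistic model expresses the posteriors as
\[
\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^{\mathsf T} x)}{1 + \sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0} + \beta_\ell^{\mathsf T} x)}, \quad k = 1, \ldots, K-1,
\qquad
\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{\ell=1}^{K-1}\exp(\beta_{\ell 0} + \beta_\ell^{\mathsf T} x)},
\]
with parameter set \(\theta = \{\beta_{10}, \beta_1^{\mathsf T}, \ldots, \beta_{(K-1)0}, \beta_{K-1}^{\mathsf T}\}\).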

9

Separating hyperplanes

We have seen that LDA and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Now, we will describe separating hyperplane classifiers, in which we construct linear decision boundaries that explicitly try to separate the data into different classes as much as possible.

10

Optimal separating hyperplanes


The optimal separating hyperplane separates the two classes and maximizes the
distance to the closest point from either class.

Note that this optimization is conducted over all training points. The set of
conditions ensure that all the points are at least a signed distance M from the
decision boundary defined by beta and beta_0.
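For reference, the optimization problem being described is
\[
\max_{\beta,\,\beta_0,\,\lVert\beta\rVert = 1} \; M
\quad \text{subject to} \quad
y_i(x_i^{\mathsf T}\beta + \beta_0) \ge M, \quad i = 1, \ldots, N.
\]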

11

Support vector classifier


● Define the slack variables ξi ≥ 0 and incorporate them into the optimization problem.
The two formulations, the optimal separating hyperplane versus the support vector classifier, are contrasted below.
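For reference, the constraints in the two problems are
\[
\text{optimal separating hyperplane: } y_i(x_i^{\mathsf T}\beta + \beta_0) \ge M,
\qquad
\text{support vector classifier: } y_i(x_i^{\mathsf T}\beta + \beta_0) \ge M(1 - \xi_i),
\]
with \(\xi_i \ge 0\) and \(\sum_{i=1}^{N}\xi_i\) bounded by a constant, so that a limited total amount of margin violation is tolerated.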

12

Support vector machines (SVM)


● The support vector classifier we described so far finds linear boundaries in the
input feature space.

● We can represent the support vector classifier optimization problem and its solution in a special way that only involves the input features via inner products.

● Suppose we transform our input features x first using h(x). The solution f(x) can then be written as shown below.
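For reference, the solution function has the form
\[
f(x) = h(x)^{\mathsf T}\hat\beta + \hat\beta_0
      = \sum_{i=1}^{N} \hat\alpha_i\, y_i \,\langle h(x), h(x_i)\rangle + \hat\beta_0,
\]
so that only inner products of the transformed features are needed; replacing the inner product by a kernel \(K(x, x') = \langle h(x), h(x')\rangle\) gives \(f(x) = \sum_i \hat\alpha_i y_i K(x, x_i) + \hat\beta_0\).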

13

The SVM as a penalization method


Consider the optimization problem shown below, in which the “+” notation indicates the positive part. This formulation has the loss + penalty format; the loss term is known as the hinge loss. The optimization problem is equivalent to the support vector classifier problem described earlier.
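For reference, the penalization form of the problem is
\[
\min_{\beta_0,\,\beta} \;\; \sum_{i=1}^{N}\bigl[\,1 - y_i f(x_i)\,\bigr]_{+} \;+\; \frac{\lambda}{2}\lVert\beta\rVert^2,
\qquad f(x) = h(x)^{\mathsf T}\beta + \beta_0,
\]
where the tuning parameter λ plays the same role as the cost/budget parameter in the support vector classifier formulation.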

14

Some methods related to LDA and QDA

15

Kernel density classification


● Kernel density estimation:
○ Suppose we have a random sample x1, x2, …, xN drawn from a probability density function
fx(x), and we wish to estimate fx at a point x0.
○ A natural local estimate has the form below, in which N(x0) is a small metric neighborhood
around x0 of width λ.

○ The above estimate could be bumpy, and often a smooth estimate is preferred.

○ A popular choice for Kλ is the Gaussian kernel.
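For reference, the two estimates being referred to are the rough local estimate
\[
\hat f_X(x_0) = \frac{\#\{x_i \in N(x_0)\}}{N\lambda},
\]
and its smooth (Parzen) counterpart
\[
\hat f_X(x_0) = \frac{1}{N\lambda}\sum_{i=1}^{N} K_\lambda(x_0, x_i),
\]
which, with the Gaussian kernel \(K_\lambda(x_0, x) = \phi\!\bigl(|x - x_0|/\lambda\bigr)\), counts each observation with a weight that decays smoothly with its distance from x0.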


16

Kernel density classification


● An example of kernel density
estimation is shown on the right.

● Kernel density classification: We can apply kernel density estimation to classification by using Bayes’ theorem.

● Suppose we fit a separate density estimate f̂j(X) in each class j, together with estimates π̂j of the class priors (usually the class sample proportions). The resulting posterior estimate is shown below.
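Combining the per-class density estimates with the priors via Bayes’ theorem gives
\[
\widehat{\Pr}(G = j \mid X = x_0) = \frac{\hat\pi_j \hat f_j(x_0)}{\sum_{k=1}^{J}\hat\pi_k \hat f_k(x_0)}.
\]
A minimal Python sketch of this idea using scikit-learn’s KernelDensity (the function name and the bandwidth value here are illustrative, not from the slides):

import numpy as np
from sklearn.neighbors import KernelDensity

def kde_classify(X_train, y_train, X_test, bandwidth=1.0):
    # Fit one Gaussian kernel density estimate per class and combine with the
    # estimated class priors via Bayes' theorem.
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    classes = np.unique(y_train)
    priors = np.array([np.mean(y_train == c) for c in classes])
    kdes = [KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X_train[y_train == c])
            for c in classes]
    # score_samples returns log f_hat_j(x); add the log priors and take the argmax
    log_post = np.column_stack([kde.score_samples(X_test) for kde in kdes]) + np.log(priors)
    return classes[np.argmax(log_post, axis=1)]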

17

Naive Bayes
Another popular technique is to assume that given a class G=j, the features Xk are
independent.
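Concretely, the independence assumption means that the class-conditional density factors into a product of marginals:
\[
f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k).
\]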

The assumption is generally not true, but it simplifies the estimation.


● The individual class-conditional marginal densities fjk can each be estimated separately using one-dimensional kernel density estimates. The original naive Bayes model uses univariate Gaussians to represent the marginals.
● If an input Xk is discrete, an appropriate histogram estimate can be used instead.

18

Naive Bayes
We can further derive the logit-transform as follows.
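For reference, with a base class J the derivation gives
\[
\log\frac{\Pr(G=\ell \mid X)}{\Pr(G=J \mid X)}
= \log\frac{\pi_\ell f_\ell(X)}{\pi_J f_J(X)}
= \log\frac{\pi_\ell}{\pi_J} + \sum_{k=1}^{p}\log\frac{f_{\ell k}(X_k)}{f_{Jk}(X_k)}
= \alpha_\ell + \sum_{k=1}^{p} g_{\ell k}(X_k).
\]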

It has the form of a generalized additive model, which we will discuss further later
in the class.
19

Generalized additive models

20

Smoothing splines
● We have made use of models linear in the input features, both for regression and
classification. For example, linear regression, linear discriminant analysis, logistic regression
and separating hyperplanes all rely on a linear model.

● Here we will take a look at an example for moving beyond linearity.

● The idea is to augment/replace the vector of inputs X with additional variables, which are
transformations of X, and then use linear models in this new space of derived input features.

21

Smoothing splines

● This is a linear basis expansion in X.

● Once the basis functions hm have been determined, the models are linear in these new
variables, and the fitting proceeds as before.

● Examples of commonly used basis functions hm are given below.

Read Section 5.2 on piecewise polynomials and splines.
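For reference, the linear basis expansion referred to above is
\[
f(X) = \sum_{m=1}^{M}\beta_m h_m(X),
\]
and typical choices of the basis functions include \(h_m(X) = X_m\) (recovering the original linear model), \(h_m(X) = X_j^2\) or \(X_j X_k\) (polynomial terms), \(h_m(X) = \log(X_j)\) or \(\sqrt{X_j}\) (nonlinear transformations), and \(h_m(X) = I(L_m \le X_k < U_m)\) (piecewise-constant indicators).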

22

Smoothing splines
Consider the following problem.

The first term measures closeness to the data, while the second term penalizes
curvature in the function, and λ establishes a tradeoff between the two.
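For reference, the criterion is the penalized residual sum of squares
\[
\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N}\bigl\{y_i - f(x_i)\bigr\}^2 + \lambda\int \bigl\{f''(t)\bigr\}^2\,dt.
\]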

23

Smoothing splines
● The solution is a natural cubic spline with knots at the unique values of the xi.

● Nj(x) are an N-dimensional set of basis functions for representing this family
of natural splines (See section 5.2).

● The loss function RSS, its solution, and the fitted spline then take the matrix form shown below.
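Writing \(\{\mathbf N\}_{ij} = N_j(x_i)\) and \(\{\Omega_N\}_{jk} = \int N_j''(t)N_k''(t)\,dt\), we have
\[
f(x) = \sum_{j=1}^{N} N_j(x)\,\theta_j,
\qquad
\mathrm{RSS}(\theta, \lambda) = (\mathbf y - \mathbf N\theta)^{\mathsf T}(\mathbf y - \mathbf N\theta) + \lambda\,\theta^{\mathsf T}\Omega_N\theta,
\]
with solution
\[
\hat\theta = (\mathbf N^{\mathsf T}\mathbf N + \lambda\Omega_N)^{-1}\mathbf N^{\mathsf T}\mathbf y,
\qquad
\hat f(x) = \sum_{j=1}^{N} N_j(x)\,\hat\theta_j.
\]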


24

Smoothing splines
A smoothing spline is an example of a linear smoother. This is because the estimated parameters are a linear combination of the yi.

The N-vector of the fitted values at the training predictors xi is also linear in y.
The finite linear operator S𝛌 is known as the smoother matrix.
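For reference, the fitted vector can be written as
\[
\hat{\mathbf f} = \mathbf N(\mathbf N^{\mathsf T}\mathbf N + \lambda\Omega_N)^{-1}\mathbf N^{\mathsf T}\mathbf y \equiv \mathbf S_\lambda \mathbf y.
\]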

25

Smoothing splines

The figure on the right shows the results of applying a cubic smoothing spline to some air pollution data. The larger lambda is, the smoother the fit becomes.
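As a practical aside (not from the slides), one convenient way to fit a cubic smoothing spline in Python is scipy’s UnivariateSpline; the data below are synthetic, and the smoothing factor s controls the roughness, with larger s giving a smoother fit, similar in spirit to increasing λ:

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))            # predictors must be in increasing order
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

rough = UnivariateSpline(x, y, k=3, s=2)        # small s: wigglier fit
smooth = UnivariateSpline(x, y, k=3, s=20)      # large s: smoother fit
grid = np.linspace(0, 10, 200)
rough_fit, smooth_fit = rough(grid), smooth(grid)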

26

Generalized additive models


● We just discussed the smoothing spline, in which the model is a flexible, possibly nonlinear function of a single predictor.

● What about the case when we have multiple predictors?


● Recall that in the 2nd week, we extended linear regression to multiple regression.

● Here, we can extend the smoothing spline to generalized additive models.

27

Generalized additive models

● This is helpful since in real life, effects are often not linear.

● In general, the conditional mean 𝜇(X) of a response Y, e.g., 𝜇(X) = P(Y=1|X), is related to an additive function of the predictors via a link function g, as shown below.

● For example, using the logit link g(𝜇) = logit(𝜇) = log(𝜇(X)/(1-𝜇(X))) together with a linear additive function is equivalent to ordinary logistic regression.

● The additive logistic regression model replaces each linear term by a more general functional form.

● In the generalized additive model, not all of the functions fj need to be nonlinear; we can mix in linear and other nonlinear terms.
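For reference, the three forms being referred to are the additive model
\[
Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon,
\]
the general link form
\[
g\bigl[\mu(X)\bigr] = \alpha + f_1(X_1) + \cdots + f_p(X_p),
\]
and, with the logit link, the additive logistic regression model
\[
\log\frac{\mu(X)}{1-\mu(X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p).
\]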
28

Fitting additive models


The additive model, and a criterion like the penalized sum of squares for fitting it given observations (xi, yi), are written below.

The idea is to iteratively solve for each function fj. When solving for a function fj, we
can adopt what we learned in smoothing splines.
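For reference, the model and the penalized criterion are
\[
Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon,
\qquad
\mathrm{PRSS}(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^{N}\Bigl(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\Bigr)^2 + \sum_{j=1}^{p}\lambda_j\int f_j''(t_j)^2\,dt_j.
\]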

29

Fitting additive models
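The iterative procedure just described is the backfitting algorithm: initialize α̂ as the mean of the yi and set each f̂j to zero; then repeatedly cycle over j = 1, ..., p, smoothing the partial residuals yi − α̂ − Σk≠j f̂k(xik) against Xj and re-centering each f̂j to have mean zero. A minimal Python sketch of this idea, using scipy smoothing splines as the per-coordinate smoothers (function and parameter names are illustrative, not from the slides):

import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit(X, y, n_iter=20, s=None):
    # Backfitting sketch for an additive model y ~ alpha + sum_j f_j(X_j),
    # with univariate smoothing splines as the coordinate-wise smoothers.
    N, p = X.shape
    alpha = y.mean()
    fitted = np.zeros((N, p))                # current values f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            # partial residuals: remove the intercept and all other functions
            r = y - alpha - fitted.sum(axis=1) + fitted[:, j]
            order = np.argsort(X[:, j])      # UnivariateSpline needs increasing x
            spline = UnivariateSpline(X[order, j], r[order], k=3, s=s)
            fj = spline(X[:, j])
            fitted[:, j] = fj - fj.mean()    # center each f_j for identifiability
    return alpha, fitted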

30

Tree-based methods

31

Tree-based methods
Tree-based methods partition the feature space into a set of rectangles, and then
fit a simple model in each one. Here we introduce tree-based regression and
classification, known as CART (Classification And Regression Trees).
● Suppose we have a regression problem with
continuous response Y and inputs X1 and
X2.

● In each partition element we can model Y with a different constant.

● However, not every partitioning line has a simple description like X1 = c. Some of the resulting regions are complicated to describe.
32

Tree-based methods
To simplify, we restrict to recursive binary partitions.

● We first split the space into two regions and model the response by the mean of Y in each region.
● Then one or both of these regions are split further
into two more regions.
● Continue the process.

The result of this process is a partition into the five regions R1, ..., R5, and the corresponding regression model predicts Y with a constant cm in region Rm.

Now, let’s look at the details for regression trees and classification trees respectively.

33

Regression tree
● Our data consists of p inputs and a response, for each of N observations:

(xi, yi) for i =1, 2,..., N, with xi=(xi1, xi2, …, xip)

● Suppose we have a partition into M regions, R1,..., RM, and we model the
response as a constant cm.

● If we adopt as our criterion minimization of the sum of squares, the solution is the average of yi in region Rm.
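For reference, the regression-tree model and the optimal constants are
\[
f(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m),
\qquad
\hat c_m = \operatorname{ave}(y_i \mid x_i \in R_m).
\]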

34

Regression tree
Finding the best binary partition in terms of minimum sum of squares is generally
computationally infeasible. We proceed with a greedy algorithm.

● Start with all of the data, consider a splitting variable j and a split point s, and define the pair of half-planes shown below.

● Seek the splitting variable j and split point s that solve the minimization problem below.

● For any choice of j and s, the inner minimization is solved by the average of yi within each half-plane.

● Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions.
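For reference, the half-planes, the splitting criterion, and the inner solutions are
\[
R_1(j, s) = \{X \mid X_j \le s\}, \qquad R_2(j, s) = \{X \mid X_j > s\},
\]
\[
\min_{j,\, s}\Bigl[\;\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\Bigr],
\qquad
\hat c_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j,s)), \quad \hat c_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j,s)).
\]
A minimal Python sketch of the greedy search for a single best split (the function name is illustrative, not from the slides):

import numpy as np

def best_split(X, y):
    # Scan every variable j and every observed value s of X_j, and return the
    # split minimizing the total within-region sum of squares. Real CART code
    # also handles minimum node sizes, ties, categorical inputs, etc.
    N, p = X.shape
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss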
35

Regression tree
We have the algorithm to grow the tree. But how large should we grow the tree?

● Strategy 1: Split tree nodes only if the decrease in sum-of-squares due to the
split exceeds some threshold.
○ However, sometimes a seemingly worthless split might lead to a good split below it. This
strategy can be short-sighted.

● Strategy 2: Grow a large tree. Stop the splitting process only when some
minimum node size is reached. Then prune the tree using cost-complexity
pruning.

36

Regression tree

The parameter α governs the tradeoff between tree size and its goodness of fit to the data.
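For reference, the cost-complexity criterion is defined, for a subtree T of the large tree with |T| terminal nodes, by
\[
N_m = \#\{x_i \in R_m\}, \qquad
\hat c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i, \qquad
Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m} (y_i - \hat c_m)^2,
\]
\[
C_\alpha(T) = \sum_{m=1}^{|T|} N_m\, Q_m(T) + \alpha\,|T|.
\]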

37

Classification tree
If the target is a classification outcome taking values 1, 2, …, K, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree.

● In a node m, representing a region Rm with Nm observations, let p̂mk denote the proportion of class k observations in node m, as defined below.

● We classify the observations in node m to the majority class in that node.

● We need to define the measure of node impurity suitable for classification.
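For reference, the node proportions and the classification rule are
\[
\hat p_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k),
\qquad
k(m) = \arg\max_k \hat p_{mk}.
\]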

38

Classification tree

For two classes, suppose p is the proportion in the second class.

● Misclassification error = 1 - max(p, 1-p)
● Gini index = 2p(1-p)
● Cross-entropy = -p log p - (1-p) log(1-p)

39

Classification tree

● Cross-entropy and the Gini index are differentiable and hence more amenable to
numerical optimization.
● Cross-entropy and the Gini index are more sensitive to changes in the node
probabilities than the misclassification rate.
○ Suppose we have (400, 400) observations, i.e., 400 observations in each class in a 2-class problem.
■ Scenario 1: a split that created nodes (300, 100) and (100, 300)
■ Scenario 2: a split that created nodes (200, 400) and (200, 0)
○ Both Gini index and cross-entropy are lower for scenario 2. As for misclassification error, these two
scenarios achieved the same error rate. Note that the 2nd scenario produces a pure node and is
probably preferable.
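A minimal Python sketch that checks these numbers (written for this example, not taken from the slides): it computes the size-weighted impurity of each candidate split under the three measures.

import numpy as np

def weighted_impurity(nodes, measure):
    # nodes: list of (n_class1, n_class2) counts; returns the size-weighted impurity
    total = sum(n1 + n2 for n1, n2 in nodes)
    out = 0.0
    for n1, n2 in nodes:
        n = n1 + n2
        p = n2 / n                               # proportion in the second class
        if measure == "misclass":
            imp = 1 - max(p, 1 - p)
        elif measure == "gini":
            imp = 2 * p * (1 - p)
        else:                                    # cross-entropy, with 0 log 0 = 0
            imp = -sum(q * np.log(q) for q in (p, 1 - p) if q > 0)
        out += (n / total) * imp
    return out

splits = {"scenario 1": [(300, 100), (100, 300)],
          "scenario 2": [(200, 400), (200, 0)]}
for name, nodes in splits.items():
    print(name, {m: round(weighted_impurity(nodes, m), 3)
                 for m in ("misclass", "gini", "entropy")})
# Misclassification error is 0.25 for both splits, while the Gini index and
# cross-entropy are lower for scenario 2, which produces a pure node.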
40

Homework

● Reading materials
● Colab experiment

41

Reading materials after this week’s class


● Some methods related to LDA and QDA
○ Kernel density classification (Ch 6.6.2)
○ Naive Bayes (Ch 6.6.3)

● Generalized additive models (Ch 9.1)


○ Smoothing Splines (Ch 5.4)
○ Additive logistic regression algorithm (Ch 9.1)

● Tree-based methods (Ch 9.2)

Hastie, Trevor, et al. The elements of statistical learning: data mining, inference, and prediction. New York: springer, 2017. [Author’s pdf link]

42

Colab Experiment: Breast cancer classification


● Data: Breast cancer wisconsin dataset for classification
○ The detailed information can be found below
○ https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

● Compare the classification performance of different methods that we have discussed in class. You should include at least the following.
○ Logistic regression
○ SVM
○ Decision tree

● You are welcome to explore different regularization terms in each method, and even other
classification methods.

● Classification performance: Use 5-fold CV to evaluate the performance of each method in terms of error rate. Report the mean and standard deviation of the error rate over these 5 folds.
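A minimal starting-point sketch for this experiment (model settings such as max_iter=1000, the standardization step, and random_state are illustrative choices, not requirements from the slides):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5)     # 5-fold CV accuracy per fold
    err = 1 - acc                                # error rate per fold
    print(f"{name}: mean error {err.mean():.3f}, std {err.std():.3f}")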
43
