L19 : SVM :SVC : b
I The "budget" ε_i tells us for each observation x_i by how much it violates the margin: it
is on the correct side of the margin if ε_i = 0; on the wrong side of the margin if ε_i > 0; on
the wrong side of the hyperplane if ε_i > 1.
I The constant b > 0 is a tuning/regularization parameter, the total budget for the
amount that the margin can be violated by the n observations.
I Here, the support vectors are the observations that are on the wrong side of the margin for
their class (ε_i > 0) or lie directly on the margin, and only they affect the classifier!
I Thus, b controls the number of support vectors, and therefore the bias-variance
trade-off of the support vector classifier.
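The three cases for the slack values can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the budget check assumes the usual formulation in which the slacks sum to at most b.

```python
def margin_status(eps: float) -> str:
    """Classify one observation by its slack value eps (eps >= 0)."""
    if eps < 0:
        raise ValueError("slack values are non-negative")
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "wrong side of the margin"
    return "wrong side of the hyperplane"


def within_budget(slacks, b):
    """Assumed budget constraint: the slacks sum to at most b."""
    return sum(slacks) <= b


slacks = [0.0, 0.4, 1.7, 0.0]
statuses = [margin_status(e) for e in slacks]
```

A larger b admits more violations, hence more support vectors and a wider margin.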
Support vector classifier: the tuning parameter b
I Large b gives wide margins, and high bias/small variance for the SV classifier.
I Small b gives tight margins, and small bias/high variance for the SV classifier.
I The optimal b can be found by cross-validation.
Support vector classifier: limitations
I A linear decision boundary can fail if the true boundary between the classes is not
linear in the feature space of (X_1, ..., X_p)! Any linear classifier will perform poorly here.
I One might want to enlarge the feature space, say with order-two polynomials
X_1, X_1^2, ..., X_p, X_p^2,
and then use an SV classifier with the 2p features.
I There are many ways to enlarge the space, and the huge number of features will make the
computations infeasible.
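To get a feeling for how fast the enlarged feature space grows, here is a small sketch. The helper names are hypothetical; the count uses the standard formula for the number of monomials of total degree 1 to d in p variables, which is C(p + d, d) - 1.

```python
from math import comb


def add_squares(x):
    """Map (x1, ..., xp) to (x1, x1^2, ..., xp, xp^2): 2p features."""
    out = []
    for v in x:
        out.extend([v, v * v])
    return out


def n_poly_features(p, degree):
    """Number of monomials of total degree 1..degree in p variables."""
    return comb(p + degree, degree) - 1


enlarged = add_squares([2.0, 3.0])
n_small = n_poly_features(2, 3)     # two inputs, degree 3
n_large = n_poly_features(100, 3)   # one hundred inputs, degree 3
```

With interaction terms included, 100 inputs already give well over 100000 cubic features, which is why explicit enlargement quickly becomes infeasible.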
L20 : SVM : λ (tuning parameters,
Support vector machines: comments
I SVMs are a very powerful and flexible tool for classification.
I The mathematics are quite hard, but thanks to the kernel trick we do not need to know
the details!
I For fitting, we only compute the similarities K(x_i, x_j) for all
distinct pairs i, j = 1, ..., n
in the training set.
I Many different kernels have been developed, but even the "standard" ones usually
perform well.
I For more than two classes, there are simple extensions, so-called One-Versus-One
classification, or One-Versus-All classification.
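As an illustration of "computing all similarities", the following sketch builds the n × n matrix of kernel values for a radial kernel. The kernel form exp(-gamma * ||x_i - x_j||^2), the value of gamma and the toy data are assumptions for the example.

```python
import numpy as np


def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # toy training set
K = rbf_gram(X, gamma=0.5)
```

The matrix is symmetric with ones on the diagonal, so only the distinct pairs need to be computed.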
L21 : SLNN
A (deep) neural network is a highly flexible, non-linear prediction method.
I Neural networks can be used for both regression and classification.
I As for regression, the default choice for the activation is the ReLU g(x) = max{0, x}.
I For classification, the functions h_i are the softmax functions.
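The two functions named above can be written down directly. This is a minimal sketch in plain Python; subtracting the maximum inside the softmax is a common numerical safeguard and does not change the result.

```python
import math


def relu(x):
    """ReLU activation g(x) = max{0, x}."""
    return max(0.0, x)


def softmax(z):
    """Map a list of scores to probabilities that sum to one."""
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```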
L22: Deep neural networks
I A deep neural network is a neural network with multiple hidden layers between the
input and the output layers.
I The predictive model is then defined sequentially.
The architecture of deep neural networks
I The depth of the neural network is the total number of layers.
I For each layer we call the number of neurons the width of that layer.
I The architecture of the network is mainly determined by the depth and width.
I There can be further architectural considerations concerning the connections between
neurons or shared parameters; see convolutional neural networks.
I A good architecture is based on experience and cross-validation.
Comments on deep neural networks
I As before, the default choice for all activation functions is the ReLU
g_j(x) = max{0, x}, j = 1, ..., M.
I For regression the number of outputs is q = 1 and the function h = h_1 is the identity
h(z) = z.
I For classification, the functions h_i are the softmax functions.
I The hidden layers of the network extract patterns or features of the data.
I Especially the first layer that is directly linked to the inputs allows for interpretation
and visualization of the extracted features; more on this for image data later.
I Deep neural networks are very flexible. Indeed, if the number of neurons is large enough,
they can approximate any continuous function on R^p arbitrarily well (under mild
assumptions).
I Deep neural networks have many (!) parameters, meaning that fitting them is
computationally very costly and overfitting must actively be prevented.
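To see how quickly the number of parameters grows with depth and width, one can count them for a fully connected architecture. The function name and the example widths are hypothetical.

```python
def count_parameters(layer_widths):
    """Weights plus biases of a fully connected network.

    layer_widths = [p, width_1, ..., width_M, q].
    """
    total = 0
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


small = count_parameters([3, 4, 1])
wider = count_parameters([10, 64, 64, 1])
```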
L23: Deep learning: Fitting neural networks and back-propagation
Loss functions for neural networks
I Neural networks are non-linear parametric models, with parameter vector θ consisting of the weight matrices and the bias
vectors.
I For regression (y_i ∈ R) we use the squared-error loss.
I For classification (y_i ∈ {1, ..., q}) we use the cross-entropy (deviance) loss, which is the negative log-likelihood of the multinomial
distribution.
I Since we have so many parameters, we are typically not interested in the global/local
minimum of J(θ) since this would be an overfit.
I Instead we will prevent overfitting by regularization or early stopping (more on this later).
I For now, we first consider how we can improve the initial model using gradient descent.
I Back-propagation is a way of efficiently deriving the gradient using the chain rule for differentiation.
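The two loss functions can be sketched as follows. Averaging over the observations (rather than summing) and indexing the classes from 0 instead of 1 are assumptions of this example.

```python
import math


def squared_error_loss(y, y_hat):
    """Mean squared error for regression."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def cross_entropy_loss(labels, probs):
    """Mean negative log-probability of the observed class.

    labels: class indices 0, ..., q-1; probs: one probability vector per observation.
    """
    return -sum(math.log(p[k]) for k, p in zip(labels, probs)) / len(labels)


mse = squared_error_loss([1.0, 2.0], [1.0, 4.0])
ce = cross_entropy_loss([0], [[0.5, 0.5]])
```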
Advantages of back-propagation
I Back-propagation for deep networks works similarly, where we have error vectors
δ^(1), ..., δ^(M+1) at all M hidden layers and the output layer.
I Back-propagation is very efficient since the quantities δ^(m) can be stored and re-used,
especially if the network is large.
I Only the local derivatives at each node are required.
I It can be efficiently parallelized on GPUs.
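A minimal sketch of back-propagation for one hidden layer may help here. It assumes a ReLU hidden layer, an identity output and the loss 0.5 * (prediction - y)^2 for a single observation; the network sizes and random data are made up. The error vectors are computed once and re-used for all gradients, and one entry is compared with a finite-difference approximation.

```python
import numpy as np


def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1           # linear scores of the hidden layer
    a1 = np.maximum(z1, 0.0)   # ReLU activation
    y_hat = W2 @ a1 + b2       # identity output for regression
    return z1, a1, y_hat


def loss(params, x, y):
    return 0.5 * float(np.sum((forward(params, x)[2] - y) ** 2))


def backprop(params, x, y):
    W1, b1, W2, b2 = params
    z1, a1, y_hat = forward(params, x)
    delta2 = y_hat - y                    # error vector at the output layer
    delta1 = (W2.T @ delta2) * (z1 > 0)   # error vector at the hidden layer
    # Gradients with respect to W1, b1, W2, b2, re-using the stored errors.
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]


rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(1, 4)), rng.normal(size=1)]
x, y = rng.normal(size=3), np.array([0.7])
grads = backprop(params, x, y)

# Central finite-difference check on the single weight W1[1, 2].
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][1, 2] += eps
minus[0][1, 2] -= eps
numeric = (loss(plus, x, y) - loss(minus, x, y)) / (2 * eps)
```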
L24: Regularization
α > 0 is a tuning parameter
I The interpretation is similar to that for regularized linear models, and the tuning parameter is
also chosen by cross-validation.
Dropout
I Dropout is a regularization method where in each update step in the gradient descent, the
network structure is changed.
I In fact, before a forward pass and back-propagation, each non-output neuron is
independently dropped out with some probability q ∈ [0, 1].
I For hidden units this probability is usually q = 0.5, and for input units q = 0.2.
I This robustifies the fitting, since every unit must learn to work independently from the
dropped units.
I Dropout is a type of bagging, where the final model is a combination of the fitted weights
and the dropout probabilities.
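The random dropping of units in one update step can be sketched as below. Rescaling the surviving activations by 1 / (1 - q) ("inverted dropout") is an assumption of this example, not something stated above, and it is the reason the sketch requires q < 1.

```python
import random


def dropout(activations, q, rng):
    """Drop each unit independently with probability q.

    Survivors are rescaled by 1 / (1 - q) so the expected activation is unchanged.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1) for this sketch")
    keep = 1.0 - q
    return [a / keep if rng.random() >= q else 0.0 for a in activations]


rng = random.Random(0)
out = dropout([1.0] * 1000, 0.5, rng)
n_dropped = sum(1 for v in out if v == 0.0)
```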
Early stopping
I Probably the most commonly used method for regularization is early stopping in the
gradient descent algorithm.
I For large networks with enough capacity to overfit, the training error decreases steadily
over time, but the validation error begins to rise at some point.
I Early stopping means to return the parameter value θ^(i*) at step i* where a minimum of the
validation error is reached, and then stop the descent.
I This means we do not go all the way to the (local) minimum of J(θ), but stop at a
parameter set that gives the best validation error.
I In some cases, early stopping can be shown to be equivalent to ℓ2 regularization.
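In practice one cannot know that a minimum of the validation error has been reached until the error has failed to improve for a while. The sketch below uses a "patience" parameter for this; the parameter and the function name are assumptions of the example.

```python
def early_stopping_step(val_errors, patience=3):
    """Return (best_step, stop_step) for a sequence of validation errors.

    The descent stops once the error has not improved for `patience` steps,
    and the parameters from best_step are the ones returned to the user.
    """
    best_step, best_err = 0, float("inf")
    for step, err in enumerate(val_errors):
        if err < best_err:
            best_step, best_err = step, err
        elif step - best_step >= patience:
            return best_step, step
    return best_step, len(val_errors) - 1


errors = [1.0, 0.8, 0.6, 0.65, 0.7, 0.72, 0.5]
result = early_stopping_step(errors, patience=3)
```

Note that the later improvement at the last step is never seen: with a small patience, early stopping can miss a better minimum further along the path.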
Parameter sharing
I Due to invariances or other domain knowledge, often it makes sense to force sets of
parameters to have the same value.
I This can drastically reduce the number of unique parameters and thus performs
regularization.
I The most popular use of this parameter sharing occurs in convolutional neural
networks applied to image recognition.
L25: Deep learning: Convolutional neural networks
Example: the CIFAR-10 data set
I It consists of 60000 32 × 32 RGB colour images with 10 different classes
G = {airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck}.
I Each image has three channels (red, green, blue), each consisting of the intensities of
that color in each pixel of that image.
I The predictor space has thus dimension p = 32 × 32 × 3 = 3072.
A dense neural network
I For the CIFAR-10 data set, the number of predictors is p = 3072.
I A dense neural network is one where all neurons are connected with individual
weights.
I For this data set, a dense layer with as many neurons as inputs has 3072^2 ≈ 9.4 million weights!
I Fitting so many parameters is computationally very expensive.
I Moreover, with only 60000 samples, heavy overfitting will necessarily occur.
Convolutional neural networks
I Convolutional neural networks are deep neural networks with many hidden layers.
I The hidden layers can be of different types, meaning that different operations or
parameter sharings are used.
I We already know one type of hidden layer, namely the dense/fully connected layer
whose neurons are connected to all neurons of the previous layer (either input or
hidden).
I There are many other layer types and we will study some of the most important.
I A convolutional layer tries to extract local features of images in three stages:
a) the convolution stage;
b) the activation stage;
c) the pooling stage.
The convolution stage
I Convolutional layers have several hyperparameters, e.g., the width and the height of the
filter.
I There is also the depth of the layer, that is, the number of different filters applied in
the same step.
I The stride specifies by how much we shift the filter to obtain the next pixel of the
output.
I The output image may have smaller size than the input image.
I Zero-padding (also called "same" padding) might be used to keep the same sizes of input
and output; in this case the image is sufficiently padded with zeros at the edges.
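How filter size, stride and padding determine the output size can be checked with the standard formula floor((n + 2 * padding - f) / stride) + 1 along each axis. The function name is hypothetical.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output length along one axis for input length n and filter length f."""
    return (n + 2 * padding - f) // stride + 1


no_padding = conv_output_size(32, 5)                    # output shrinks
same_padding = conv_output_size(32, 3, padding=1)       # size is preserved
strided = conv_output_size(32, 3, stride=2, padding=1)  # roughly halved
```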
The detector and pooling stages
I The second stage of the convolutional layer is the detector stage where a non-linearity is
applied to each neuron, usually the ReLU.
I A neuron is active and the feature is present in that part of the image only if the linear
score of the filter is high enough.
I This stage does not have any parameters when the non-linearity is fixed.
I The third stage is pooling, which is a form of downsampling.
I Typically, max-pooling is applied to aggregate chunks of the image into a single value by
taking the maximum activation of the convolution output in that chunk (often 2 × 2).
The convolutional layer
I The convolutional layer consists of the three stages together: convolution operation → ReLU → pooling.
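The three stages can be chained in a few lines of NumPy. This is a sketch for a single-channel image and a single filter, without padding and with stride 1; the filter values are hypothetical, and the "convolution" is implemented as a cross-correlation, as deep-learning libraries usually do.

```python
import numpy as np


def conv2d(image, kernel):
    """Slide the filter over the image and sum the element-wise products."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x):
    return np.maximum(x, 0.0)


def max_pool(x, size=2):
    """Maximum over non-overlapping size x size chunks (ragged edges are dropped)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))


image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 image
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])      # hypothetical edge filter
features = max_pool(relu(conv2d(image, kernel)))   # 6x6 -> 5x5 -> 5x5 -> 2x2
```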
PS 9 SVM
2 settings: the tuning parameter C and the loss function (hinge)
If the data are not linearly separable, we solve the margin optimization problem with a budget. What this does is allow
for misclassified points on the inside of the margin, and also on the wrong side of the separating hyperplane, up to the maximal budget. So
you can tune this budget to control the bias-variance trade-off, as with any hyperparameter. This is the way you've seen it in class.
We start by fitting a linear SVM.
Create a pipeline to scale the features, and then train a linear SVM model (using the
class LinearSVC with C=1 and the hinge loss function).
Hint: recall that SVMs are sensitive to feature scaling, so it's good practice to scale the
features before fitting an SVM.
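A possible version of this pipeline is sketched below. The two-moons toy data set is an assumption standing in for the lab's data, and dual=True, max_iter and random_state are added only so that the hinge loss is accepted and the run is reproducible.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: toy data in place of the data set used in the lab.
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

linear_svm = make_pipeline(
    StandardScaler(),  # scale the features first
    LinearSVC(C=1, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)
linear_svm.fit(X, y)
train_accuracy = linear_svm.score(X, y)
```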
# Now we get the linear decision boundary; it's the best linear decision boundary,
# but it doesn't fit the data very well. We need more flexibility;
# we can get it with the polynomial feature transformation.
Although linear SVM classifiers are efficient and work surprisingly well in many cases, you can see
that this dataset is not even close to being linearly separable. One approach to handling nonlinear
datasets is to add more features, such as polynomial features; in some cases this can result in a
linearly separable dataset.
To implement this idea using Scikit-Learn, create a Pipeline containing a PolynomialFeatures
with degree = 3, followed by a StandardScaler and a LinearSVC with C=1 and hinge loss. We
get 10 variables (1, x1, x2, x1^2, x1x2, x2^2, x1^3, x1^2x2, x1x2^2, x2^3); previously there were just 2 (x1,
x2).
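The pipeline with polynomial features could look as follows. As before, the two-moons data set is an assumption, and dual=True, max_iter and random_state are extra settings for this sketch.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Assumption: toy data in place of the data set used in the lab.
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

poly_svm = make_pipeline(
    PolynomialFeatures(degree=3),  # 2 inputs -> 10 polynomial features
    StandardScaler(),
    LinearSVC(C=1, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)
poly_svm.fit(X, y)

n_features = poly_svm.named_steps["polynomialfeatures"].n_output_features_
train_accuracy = poly_svm.score(X, y)
```

The classifier is still linear, but in the 10-dimensional polynomial feature space; in the original two features its decision boundary is curved.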
This feature expansion is something you could do in exactly the same way with logistic regression. Logistic regression has a linear decision boundary.
If you create polynomial features, then in the original feature space the decision boundary of logistic regression will generally not be
linear anymore. So this is not specific to SVM.
Another useful transformation to apply to the features is the radial basis function. Create a
Pipeline containing a RBFSampler from sklearn.kernel_approximation, followed by a
StandardScaler and a LinearSVC with C=0.1 and hinge loss.
Hint: The RBFSampler has a tuning parameter gamma. What is its role?
The RBF sampler has a gamma hyperparameter, which controls the flexibility of this feature
mapping. It controls how local the predictions will be when you classify on these new
feature sets. A large value of gamma gives your initial features a more local influence,
and a low value of gamma gives them a wider influence.
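A possible version of the RBF pipeline is sketched below. The toy data, gamma=1.0 and n_components=100 are assumptions chosen for illustration; in practice gamma would be tuned by validation.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: toy data in place of the data set used in the lab.
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

rbf_svm = make_pipeline(
    RBFSampler(gamma=1.0, n_components=100, random_state=42),  # approximate RBF features
    StandardScaler(),
    LinearSVC(C=0.1, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)
rbf_svm.fit(X, y)

mapped_shape = rbf_svm.named_steps["rbfsampler"].transform(X).shape
train_accuracy = rbf_svm.score(X, y)
```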
# RBF: gamma controls how "local" the predictions are. You can think of gamma as the
# inverse of k in KNN.
As explained in the lecture, the radial basis function kernel corresponds to an
infinite-dimensional feature space. So when you use this sampler you specify a certain number of
components, and it will approximate this kernel. It cannot do so perfectly: the feature space is
infinite-dimensional, so you cannot compute infinitely many values. The sampler approximates
the RBF features up to the number of components you specify. So here we specify some value.
# With a low value of gamma (like 0.001), the feature mapping is not flexible.
1. Validation. That's how you would know in practice. When you have a data set, you
don't know how it was generated; validation gives you some idea of which model you
should choose for this data set. That's the whole point. Any other questions?
OK, so let's get back to the polynomial one. The next question asked you, instead of
using the polynomial feature transformation, to use the so-called Radial Basis Function
Sampler. It also creates new features: it is a feature mapping that
transforms a 2-dimensional feature space into a higher-dimensional feature space, but
based on the radial kernel, or radial basis function kernel, that you've seen in
class. As explained in the lecture, this kernel corresponds to an
infinite-dimensional feature space, so when you use the RBF sampler you have to specify a certain
number of components, and it will approximate the kernel. It cannot do so perfectly, because
you cannot compute infinitely many values.
We can find the solution of the minimization problem that the SVC is trying to solve, and you've seen what this
solution looks like. In the solution there is an inner product between the value at which we want to
evaluate our classifier and each individual observation in the training set, and you've seen that you can swap this
inner product for a nonlinear kernel that takes the same two inputs. You've seen a few examples that correspond to the feature mappings
we have just used, and you've also seen that using the kernel
instead of the inner product is equivalent to working in an extended, higher-dimensional feature space built from
transformations of our initial features. So, for a fixed sample size, this trick lets us work with many more features
and get much more flexibility without needing to compute each feature explicitly.
If you try to do this mapping explicitly for the radial basis function kernel, you will see
that it corresponds to an infinite-dimensional feature space. So we can get something much more
flexible than we could with explicit transformations of the features.
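The equivalence between a kernel and an explicit feature mapping can be checked numerically for a simple case. The sketch uses the degree-2 polynomial kernel K(x, z) = (1 + x·z)^2 in two dimensions, chosen here only because its feature map is finite; the function names are hypothetical.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel for 2-dimensional inputs."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 6-dimensional feature map whose inner product equals the kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = [0.5, -1.0], [2.0, 0.25]
kernel_value = poly_kernel(x, z)
explicit_value = dot(feature_map(x), feature_map(z))
```

The kernel needs one inner product in 2 dimensions, while the explicit route needs 6 features per point; for the radial kernel the explicit route is not available at all.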
Boosting: the goal of boosting is to take weak learners and improve their flexibility.
Boosting refers to any Ensemble method that can combine several weak learners into a strong
learner. The general idea of most boosting methods is to train predictors sequentially, each trying to
correct its predecessor.
Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its
predecessor. This method tries to fit the new predictor to the residual errors made by the previous
predictor. Using Decision Trees as the base predictors, this is called Gradient Tree Boosting, or
Gradient Boosted Regression Trees (GBRT).
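The idea of fitting each new tree to the residuals of the previous ones can be sketched by hand. The noisy quadratic data, the three trees and max_depth=2 are assumptions for the example, and no shrinkage is applied (learning rate 1).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumption: hypothetical noisy quadratic data.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

trees, residual = [], y.copy()
for _ in range(3):
    # Fit the next tree to what the current ensemble still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual)
    residual = residual - tree.predict(X)
    trees.append(tree)


def boosted_predict(X_new):
    """Ensemble prediction: the sum of the predictions of all trees."""
    return sum(tree.predict(X_new) for tree in trees)


mse_first_tree = np.mean((y - trees[0].predict(X)) ** 2)
mse_ensemble = np.mean((y - boosted_predict(X)) ** 2)
```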
A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor
class. Similarly to RandomForestRegressor, it has hyperparameters to control the growth of
Decision Trees (e.g., max_depth, min_samples_leaf), as well as hyperparameters to control the
ensemble training, such as the number of trees (n_estimators) and the shrinkage parameter
(learning_rate).
We now apply boosting to the Heart_ISL.csv dataset. Recall that this is a classification problem,
therefore we need to use GradientBoostingClassifier.
A better test score could be achieved using a cross-validated grid search to find the best values for
the hyperparameters max_depth, learning_rate and n_estimators, as usual.
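Such a grid search could be sketched as follows. Synthetic data from make_classification stands in for Heart_ISL.csv, and the grid values are arbitrary small choices to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic data in place of the Heart_ISL.csv data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {
    "max_depth": [1, 2],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # refitted best model on held-out data
```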
One very important difference from bagging is that in boosting, since you are each time regressing on the residuals, at some point you will definitely
begin to overfit. So you don't just increase the number of trees as much as you can, as you would with bagging.
Now the goal will be to build a boosting regressor to learn the regression function between X and Y. To do so, we
will do boosting with trees. A comparison that's often made is between boosting and bagging,
because these are two ensemble methods that you can use, in principle, with any method to improve the performance. But the
motivation is the opposite. In bagging, we have a lot of estimators that have a large variance but a small bias, and we
average them to reduce the variance without increasing the bias too much. Boosting is the opposite idea. We
have what are called weak classifiers, classifiers that do not learn much, so they have a large bias but low variance, and the goal
is to reduce the bias of these weak learners step by step without increasing the variance of the ensemble too much. So let's see
how we do.
The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low
value, such as 0.01 or 0.001, you will need more trees in the ensemble to fit the training set, but the
predictions will usually generalize better. This is a regularization technique called shrinkage.
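One way to choose the number of trees for a given learning rate is to track the validation error after each added tree. The sketch below uses staged_predict for this; the data, the learning rate and the other settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: hypothetical noisy quadratic data.
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingRegressor(
    max_depth=2, n_estimators=200, learning_rate=0.1, random_state=0
)
gbrt.fit(X_train, y_train)

# Validation error after 1, 2, ..., 200 trees.
val_errors = [mean_squared_error(y_val, pred) for pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(val_errors)) + 1
```

Unlike in bagging, the validation error of a boosted ensemble can rise again once too many trees are added, so the number of trees is a genuine tuning parameter.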
We select the features and the response variable. Remember, we want to predict whether the patient has heart disease or not based on the set of features. We do
the same train-test splitting as we did in the previous lab, and here it's a classification problem. Remember, here we
looked at boosting for regression trees, but there is also a gradient boosting classifier. It does the same, except that the weak
learners used in the boosting are classification trees instead of regression trees. The idea is exactly the same; this is just to
show you that you can also use this gradient boosting classifier. You have the same hyperparameters as before: the maximum depth of the
classification trees, the number of estimators and the learning rate.
The gradient boosting regressor has two types of hyperparameters. It has the hyperparameters of each individual tree, as was the case for
random forests, for example. So you can specify the maximum depth, the minimum number of samples per leaf, and so on, of the trees, that is, of the base
estimator that it is trying to boost. And it has hyperparameters for the ensemble, or the boosting procedure: these two
hyperparameters are the number of estimators that you fit and the learning rate. The learning rate that I've used here, if you remember
the equation from the class, is just one. So I'm adding all the trees without slowing down the learning. But what's done in
practice is to scale down the contribution of each tree. One very important difference from bagging is that in boosting, since you are each time regressing on the residuals, at
some point you will definitely begin to overfit. So you don't just increase the number of trees as much as you can, as you would
with bagging; here you have to cross-validate it, and you can fine-tune even further with the learning
rate. We will see this now. So let's suppose that now we have 1.0 as the learning rate.