Lecture4 Foundations Supervised Learning

This document discusses the foundations of supervised learning. It begins by explaining that supervised learning works because training datasets are assumed to be samples from an underlying probability distribution, and models are trained to generalize to new samples from the same distribution. It then discusses how holding out a separate test set allows evaluating if a model generalizes. Finally, it introduces the concepts of overfitting and underfitting as potential failures of supervised learning models to generalize.

Uploaded by

Jeremy Wang

lecture4-foundations-supervised-learning

September 15, 2020

1 Lecture 4: Foundations of Supervised Learning

1.0.1 Applied Machine Learning

Volodymyr Kuleshov, Cornell Tech

2 Why Does Supervised Learning Work?

Previously, we learned about supervised learning, derived our first algorithm, and used it to predict diabetes risk.
In this lecture, we are going to dive deeper into why supervised learning really works.

3 Part 1: Data Distribution

First, let’s look at the data, and define where it comes from.
Later, this will be useful to precisely define when supervised learning is guaranteed to work.

4 Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$\underbrace{\text{Training Dataset}}_{\text{Attributes + Features}} + \underbrace{\text{Learning Algorithm}}_{\text{Model Class + Objective + Optimizer}} \to \text{Predictive Model}$$

Where does the dataset come from?

5 Data Distribution

We will assume that the dataset is sampled from a probability distribution $P$, which we will call the data distribution. We will denote this as

$$x, y \sim P.$$

The training set $D = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$ consists of independent and identically distributed (IID) samples from $P$.

6 Data Distribution: IID Sampling

The key assumption is that the training examples are independent and identically distributed (IID).
• Each training example is from the same distribution.
• This distribution doesn’t depend on previous training examples.

Example: Flipping a coin. Each flip has the same probability of heads and tails and doesn’t depend on previous flips.
Counter-Example: Yearly census data. The population in each year will be close to that of the previous year.
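
To make the contrast concrete, here is a minimal sketch (not part of the original lecture code) that compares an IID sequence of coin flips with a random walk standing in for yearly population counts. The random-walk model, the helper name `lag1_correlation`, and all constants are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# IID: every flip comes from the same Bernoulli(0.5) distribution,
# independently of all earlier flips.
flips = rng.integers(0, 2, size=10_000)

# Not IID (illustrative assumption): a random walk standing in for a yearly
# population count. Each value is the previous value plus a small change.
population = 1_000 + np.cumsum(rng.normal(0, 10, size=10_000))

def lag1_correlation(series):
    """Correlation between consecutive elements of a 1-D series."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]

flip_corr = lag1_correlation(flips)
pop_corr = lag1_correlation(population)
print(f"lag-1 correlation, coin flips: {flip_corr:.3f}")
print(f"lag-1 correlation, population: {pop_corr:.3f}")
```

Consecutive coin flips should be nearly uncorrelated, while consecutive population values should be strongly correlated, which is what makes the census example violate the independence assumption.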

7 Data Distribution: Example

Let’s implement an example of a data distribution in numpy.

[1]: import numpy as np
np.random.seed(0)

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

Let’s visualize it.

[2]: import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]

X_test = np.linspace(0, 1, 100)

plt.plot(X_test, true_fn(X_test), label="True function")
plt.legend()

[Figure: the true function plotted on the interval [0, 1].]

Let’s now draw samples from the distribution. We will generate random $x$, and then generate random $y$ using

$$y = f(x) + \epsilon$$

for a random noise variable $\epsilon$.

[3]: n_samples = 30

X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1

We can visualize the samples.

[4]: plt.plot(X_test, true_fn(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend()

[Figure: the true function with the sampled data points.]

8 Data Distribution: Motivation

Why assume that the dataset is sampled from a distribution?

• There is inherent uncertainty in the data. The data may consist of noisy measurements
(readings from an imperfect thermometer).
• There is uncertainty in the process we model. If y is a stock price, there is randomness in the
market that cannot be modeled.
• We can use probability and statistics to analyze supervised learning algorithms and prove
that they work.

Part 2: Why Does Supervised Learning Work?

We made the assumption that the training dataset is sampled from a data distribution.
Let’s now use it to gain intuition about why supervised learning works.

9 Review: Data Distribution

We will assume that the dataset is sampled from a probability distribution $P$, which we will call the data distribution. We will denote this as

$$x, y \sim P.$$

The training set $D = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$ consists of independent and identically distributed (IID) samples from $P$.

10 Review: Supervised Learning Model

We’ll say that a model is a function

$$f : \mathcal{X} \to \mathcal{Y}$$

that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

11 What Makes A Good Model?

There are several things we may want out of a good model:
1. Interpretable features that explain how $x$ affects $y$.
2. Confidence intervals around $y$ (we will see later how to obtain these).
3. Accurate predictions of the targets $y$ from inputs $x$.

In this lecture, we will focus on the last of these.

12 Hold-Out Dataset: Definition

A hold-out dataset

$$\dot{D} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1, 2, \ldots, m\}$$

is another dataset that is sampled IID from the same distribution $P$ as the training dataset $D$, such that the two datasets are disjoint.
Let’s generate a hold-out dataset for the example we saw earlier. We start by recreating the true function.
[5]: import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

X_test = np.linspace(0, 1, 100)

plt.plot(X_test, true_fn(X_test), label="True function")
plt.legend()

[Figure: the true function plotted on the interval [0, 1].]

We now sample a training set and a hold-out set from the same distribution.

[6]: n_samples, n_holdout_samples = 30, 30

X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
X_holdout = np.sort(np.random.rand(n_holdout_samples))
y_holdout = true_fn(X_holdout) + np.random.randn(n_holdout_samples) * 0.1

plt.plot(X_test, true_fn(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.scatter(X_holdout, y_holdout, edgecolor='r', s=20, label="Holdout Samples")
plt.legend()

[Figure: the true function with training samples and hold-out samples.]

13 Defining What is an Accurate Model

Suppose that we have a function $\mathrm{isaccurate}(y, y')$ that determines if $y$ is an accurate estimate of $y'$, e.g.:

• Is the target variable close enough to the true target?

$$\mathrm{isaccurate}(y, y') = \text{true if } |y - y'| \text{ is small, else false}$$

• Did we predict the right class?

$$\mathrm{isaccurate}(y, y') = \text{true if } y = y' \text{, else false}$$

This defines accuracy on a data point. We say a supervised learning model is accurate if it correctly
predicts the target on new (held-out) data.
We can say that a predictive model $f$ is accurate if its probability of making an error on a random holdout sample is small:

$$1 - P[\mathrm{isaccurate}(\dot{y}, f(\dot{x}))] \leq \epsilon$$

for $\dot{x}, \dot{y} \sim P$, for some small $\epsilon > 0$ and some definition of accuracy.
We can also say that a predictive model $f$ is inaccurate if its probability of making an error on a random holdout sample is large:

$$1 - P[\mathrm{isaccurate}(\dot{y}, f(\dot{x}))] \geq \epsilon$$

or equivalently

$$P[\mathrm{isaccurate}(\dot{y}, f(\dot{x}))] \leq 1 - \epsilon.$$
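
The error probability in this definition can be estimated empirically as the fraction of hold-out points on which the model is not accurate. The sketch below is an illustration rather than part of the lecture code: the tolerance `tol=0.2`, the helper names, and the two toy models are all assumptions.

```python
import numpy as np

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

def is_accurate(y_true, y_pred, tol=0.2):
    """Regression-style accuracy: the prediction is within `tol` of the target."""
    return np.abs(y_true - y_pred) <= tol

def estimated_error_rate(model, X_holdout, y_holdout, tol=0.2):
    """Fraction of hold-out points on which `model` is not accurate."""
    return 1.0 - np.mean(is_accurate(y_holdout, model(X_holdout), tol))

# Hold-out samples drawn from the same synthetic distribution as in the lecture.
rng = np.random.default_rng(0)
X_holdout = rng.random(1000)
y_holdout = true_fn(X_holdout) + rng.normal(0, 0.1, size=1000)

good_model = true_fn                     # the noiseless ground truth
bad_model = lambda X: np.zeros_like(X)   # always predicts 0

err_good = estimated_error_rate(good_model, X_holdout, y_holdout)
err_bad = estimated_error_rate(bad_model, X_holdout, y_holdout)
print(f"estimated error rate, good model: {err_good:.3f}")
print(f"estimated error rate, bad model:  {err_bad:.3f}")
```

With a threshold such as $\epsilon = 0.1$, the first model would count as accurate and the second as inaccurate under this particular choice of tolerance.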

14 Generalization

In machine learning, generalization is the property of predictive models to achieve good performance on new, held-out data that is distinct from the training set.
Will supervised learning return a model that generalizes?

15 Recall: Supervised Learning

Recall our intuitive definition of supervised learning.

1. First, we collect a dataset of labeled training examples.
2. We train a model to output accurate predictions on this dataset.
3. When the model sees new, similar data, it will also be accurate.

16 Applying Supervised Learning

Let’s say that we apply supervised learning to our problem.
1. We define a model class $M$ containing $H$ different models.
2. One of these models fits the training data perfectly (is accurate on every point) and we choose that model.

(These are assumptions that could be changed.)

17 Why Supervised Learning Works

Claim: The probability that supervised learning will return an inaccurate model decreases exponentially with training set size $n$.
1. A model $f$ is inaccurate if $P[\mathrm{isaccurate}(\dot{y}, f(\dot{x}))] \leq 1 - \epsilon$. Because the training examples are IID, the probability that an inaccurate model $f$ perfectly fits the training set is at most $\prod_{i=1}^{n} P[\mathrm{isaccurate}(y^{(i)}, f(x^{(i)}))] \leq (1 - \epsilon)^n$.
2. We have $H$ models in $M$, and any of them could be inaccurate. By the union bound, the probability that at least one of the at most $H$ inaccurate models will fit the training set perfectly is $\leq H(1 - \epsilon)^n$.

Therefore, the claim holds.
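
To get a feel for how quickly the bound $H(1 - \epsilon)^n$ shrinks, here is a small numerical sketch. The function names and the example values of $H$, $\epsilon$, and the target failure probability `delta` are assumptions made for illustration; the formula for the required $n$ simply solves $H(1 - \epsilon)^n \leq \delta$ for $n$.

```python
import math

def failure_bound(H, eps, n):
    """Upper bound H * (1 - eps)^n on the probability of returning an inaccurate model."""
    return H * (1.0 - eps) ** n

def samples_needed(H, eps, delta):
    """Smallest n with H * (1 - eps)^n <= delta, i.e. n >= log(H / delta) / -log(1 - eps)."""
    return math.ceil(math.log(H / delta) / -math.log(1.0 - eps))

H, eps, delta = 1000, 0.1, 0.01
for n in [10, 50, 100, 200]:
    print(f"n = {n:3d}  ->  bound = {failure_bound(H, eps, n):.3g}")

n_required = samples_needed(H, eps, delta)
print(f"n needed for bound <= {delta}: {n_required}")
```

Note that the required $n$ grows only logarithmically in the number of models $H$, which is why the bound stays useful even for fairly large model classes.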

Part 3: Overfitting and Underfitting

Let’s now dive deeper into the concept of generalization and two possible failure modes of supervised
learning: overfitting and underfitting.

18 Review: Generalization

We will assume that the dataset is governed by a probability distribution $P$, which we will call the data distribution. We will denote this as

$$x, y \sim P.$$

A hold-out set $\dot{D} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1, 2, \ldots, m\}$ consists of independent and identically distributed (IID) samples from $P$ and is distinct from the training set.
A model that generalizes is accurate on a hold-out set.

19 Review: Polynomial Regression

In 1D polynomial regression, we fit a model

$$f_\theta(x) := \theta^\top \phi(x)$$

that is linear in $\theta$ but non-linear in $x$ because the features $\phi(x) : \mathbb{R} \to \mathbb{R}^{p+1}$ are non-linear.
By using polynomial features such as $\phi(x) = [1 \; x \; \ldots \; x^p]$, we can fit any polynomial of degree $p$.
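
The following sketch spells out the feature map explicitly with NumPy instead of scikit-learn. It is an illustration under assumptions: the helper name `poly_features` and the cubic `theta_true` are made up for the example, and the data is noiseless so that the fit can be checked exactly.

```python
import numpy as np

def poly_features(x, p):
    """Map each scalar x to the row [1, x, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)

# Noiseless data from a known cubic (hypothetical coefficients).
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
x = np.linspace(-1, 1, 20)
Phi = poly_features(x, 3)
y = Phi @ theta_true

# The model is linear in theta, so ordinary least squares applies
# even though phi is non-linear in x.
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("recovered coefficients:", np.round(theta_hat, 6))
```

Because the model is linear in $\theta$, the same least squares machinery used for linear regression recovers the polynomial coefficients.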

20 Polynomials Better Fit the Data

When we switch from linear models to polynomials, we can better fit the data and increase the
accuracy of our models.
Consider the synthetic dataset that we have seen earlier.
[7]: from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1

X_test = np.linspace(0, 1, 100)

plt.plot(X_test, true_fn(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")

[Figure: the true function with the sampled data points.]

Although fitting a linear model does not work well, quadratic or cubic polynomials improve the fit.
[8]: degrees = [1, 2, 3]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    # polynomial features followed by ordinary least squares
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ax.plot(X_test, true_fn(X_test), label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))

21 Towards Higher-Degree Polynomial Features?

As we increase the complexity of our model class $M$ to even higher degree polynomials, we are able to fit the data increasingly well.
What happens if we further increase the degree of the polynomial?

[10]: degrees = [30]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    # polynomial features followed by ordinary least squares
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)

    ax.plot(X_test, true_fn(X_test), label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))

22 The Problem With Increasing Model Capacity

As the degree of the polynomial increases to the size of the dataset, we are increasingly able to fit
every point in the dataset.
However, this results in a highly irregular curve: its behavior outside the training set is wildly
inaccurate.

23 Overfitting

24 Underfitting

A related failure mode is underfitting.

• A small model (e.g. a straight line) will not fit the training data well.
• Held-out data is similar to training data, so the model will not be accurate on it either.

Finding the tradeoff between overfitting and underfitting is one of the main challenges in applying machine learning.

25 Overfitting vs. Underfitting: Evaluation

We can measure overfitting and underfitting by estimating accuracy on held-out data and comparing it to accuracy on the training data.
• If training performance is high but held-out performance is low, we are overfitting.
• If training performance is low and held-out performance is also low, we are underfitting.
[11]: degrees = [1, 20, 5]
titles = ['Underfitting', 'Overfitting', 'A Good Fit']
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    # polynomial features followed by ordinary least squares
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    ax.plot(X_test, true_fn(X_test), label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples", alpha=0.2)
    # show every third hold-out point to keep the plot readable
    ax.scatter(X_holdout[::3], y_holdout[::3], edgecolor='r', s=20, label="Holdout Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("{} (Degree {})".format(titles[i], degrees[i]))
    # mean squared error on the full hold-out set
    ax.text(0.05, -1.7, 'Holdout MSE: %.4f' % ((y_holdout - pipeline.predict(X_holdout[:, np.newaxis]))**2).mean())
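
The comparison described in this section can also be done purely numerically by reporting both the training and the hold-out error for each degree. The sketch below is an illustration, not the lecture's own code: it regenerates the synthetic data with a different random number generator, and the helper names `sample_dataset` and `mse` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

def sample_dataset(rng, n):
    """Draw n IID samples (x, y) with y = true_fn(x) + Gaussian noise."""
    X = rng.random(n)
    y = true_fn(X) + rng.normal(0, 0.1, size=n)
    return X[:, np.newaxis], y

def mse(model, X, y):
    """Mean squared error of a fitted model on (X, y)."""
    return float(np.mean((y - model.predict(X)) ** 2))

rng = np.random.default_rng(0)
X_train, y_train = sample_dataset(rng, 30)
X_holdout, y_holdout = sample_dataset(rng, 30)

results = {}
for degree in [1, 5, 20]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    results[degree] = (mse(model, X_train, y_train), mse(model, X_holdout, y_holdout))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"holdout MSE = {results[degree][1]:.4f}")
```

A high error on both sets (degree 1 here) points to underfitting, whereas a training error far below the hold-out error is the signature of overfitting.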

26 Dealing with Underfitting

Balancing overfitting vs. underfitting is a major challenge in applying machine learning. Briefly, here are some approaches:
• To fight underfitting, we may enlarge our model class to encompass more expressive models.
• We may also create richer features for the data that will make the dataset easier to fit.

27 Dealing with Overfitting

We will see many ways of dealing with overfitting, but here are some ideas:
• If we’re overfitting, we may reduce the complexity of our model by reducing the size of $M$.
• We may also modify our objective to penalize complex models that may overfit the data.

Part 4: Regularization

We will now see a very important way to reduce overfitting — regularization. We will also see
several important new algorithms.

28 Review: Generalization

We will assume that the dataset is governed by a probability distribution $P$, which we will call the data distribution. We will denote this as

$$x, y \sim P.$$

A hold-out set $\dot{D} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1, 2, \ldots, m\}$ consists of independent and identically distributed (IID) samples from $P$ and is distinct from the training set.

29 Review: Overfitting

Overfitting is one of the most common failure modes of machine learning.
• A very expressive model (a high degree polynomial) fits the training dataset perfectly.
• The model also makes wildly incorrect predictions outside this dataset, and doesn’t generalize.

We can visualize overfitting by trying to fit a small dataset with a high degree polynomial.
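[12]: degrees = [30]
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)

    # polynomial features followed by ordinary least squares
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    X_test = np.linspace(0, 1, 100)

    # plotting calls as in the identical degree-30 cell [10] above
    ax.plot(X_test, true_fn(X_test), label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Polynomial of Degree {}".format(degrees[i]))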

30 Regularization: Intuition

The idea of regularization is to penalize complex models that may overfit the data.
In the previous example, a less complex model would rely less on polynomial terms of high degree.

31 Regularization: Definition

The idea of regularization is to train models with an augmented objective $J : M \to \mathbb{R}$ defined over a training dataset $D$ of size $n$ as

$$J(f) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f(x^{(i)})) + \lambda \cdot R(f).$$

Let’s dissect the components of this objective:

• A loss function $L(y, f(x))$ such as the mean squared error.
• A regularizer $R : M \to \mathbb{R}$ that penalizes models that are overly complex.
• A regularization parameter $\lambda > 0$, which controls the strength of the regularizer.

When the model $f_\theta$ is parametrized by parameters $\theta$, we can also use the following notation:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f_\theta(x^{(i)})) + \lambda \cdot R(\theta).$$
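
As a minimal sketch of how this objective could be evaluated for a linear model $f_\theta(x) = \theta^\top x$, consider the code below. The function `regularized_objective`, the toy data, and the particular choice of squared-error loss and squared-norm regularizer are assumptions for illustration.

```python
import numpy as np

def regularized_objective(theta, X, y, loss, regularizer, lam):
    """J(theta) = (1/n) * sum_i loss(y_i, theta^T x_i) + lam * regularizer(theta)."""
    predictions = X @ theta
    return np.mean(loss(y, predictions)) + lam * regularizer(theta)

squared_error = lambda y, y_hat: (y - y_hat) ** 2   # per-example loss L
squared_norm = lambda theta: np.sum(theta ** 2)     # regularizer R

# Toy data that theta = [1, 2] fits exactly, so the loss term is zero.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])

J_unreg = regularized_objective(theta, X, y, squared_error, squared_norm, lam=0.0)
J_reg = regularized_objective(theta, X, y, squared_error, squared_norm, lam=0.1)
print("J without regularization:", J_unreg)
print("J with lambda = 0.1:     ", J_reg)
```

Even though these parameters fit the data perfectly, the regularized objective is positive: the penalty term charges the model for the size of its parameters.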

32 L2 Regularization: Definition

How can we define a regularizer $R : M \to \mathbb{R}$ to control the complexity of a model $f \in M$?

In the context of linear models $f(x) = \theta^\top x$, a widely used approach is L2 regularization, which defines the following objective:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, \theta^\top x^{(i)}) + \frac{\lambda}{2} \cdot \|\theta\|_2^2.$$

Let’s dissect the components of this objective.

• The regularizer $R : M \to \mathbb{R}$ is the function $R(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{d} \theta_j^2$. This is the squared L2 norm of $\theta$.
• The regularizer penalizes large parameters. This prevents us from over-relying on any single feature and penalizes wildly irregular solutions.
• L2 regularization can be used with most models (linear, neural, etc.)

33 L2 Regularization for Polynomial Regression

Let’s consider an application to the polynomial model we have seen so far. Given polynomial features $\phi(x)$, we optimize the following objective:

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top \phi(x^{(i)}) \right)^2 + \frac{\lambda}{2} \cdot \|\theta\|_2^2.$$

We are going to implement regularized and standard polynomial regression on three random training
sets sampled from the same distribution.
[13]: from sklearn.linear_model import Ridge

degrees = [15, 15, 15]
plt.figure(figsize=(14, 5))
for idx, degree in enumerate(degrees):
    # sample a dataset
    np.random.seed(idx)
    n_samples = 30
    X = np.sort(np.random.rand(n_samples))
    y = true_fn(X) + np.random.randn(n_samples) * 0.1

    # fit a least squares model
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # fit a Ridge model
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = Ridge(alpha=0.1)  # sklearn uses alpha instead of lambda
    pipeline2 = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline2.fit(X[:, np.newaxis], y)

    # visualize results
    ax = plt.subplot(1, len(degrees), idx + 1)
    # ax.plot(X_test, true_fn(X_test), label="True function")
    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="No Regularization")
    ax.plot(X_test, pipeline2.predict(X_test[:, np.newaxis]), label="L2 Regularization")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples")
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("Dataset sample #{}".format(idx))

We can show that by keeping the weights small, we prevent the model from learning irregular functions.
[14]: print('Non-regularized weights of the polynomial model need to be large to fit every point:')
print(pipeline.named_steps['lr'].coef_[:4])
print()

print('By regularizing the weights to be small, we force the curve to be more regular:')
print(pipeline2.named_steps['lr'].coef_[:4])

Non-regularized weights of the polynomial model need to be large to fit every point:
[-3.02370887e+03 1.16528860e+05 -2.44724185e+06 3.20288837e+07]

By regularizing the weights to be small, we force the curve to be more regular:
[-2.70114811 -1.20575056 -0.09210716 0.44301292]

34 How to Choose λ?

In brief, the most common approach is to choose the value of λ that results in the best performance on a held-out validation set.
We will later see this strategy and several others in more detail.
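
A minimal sketch of this selection procedure, assuming the same synthetic data distribution as earlier in the lecture, is shown below. The candidate grid, the degree-15 features, and the variable names are assumptions; scikit-learn's `Ridge` calls the regularization strength `alpha`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

# Separate training and validation sets drawn from the same distribution.
rng = np.random.default_rng(0)
X_train = rng.random(30)
y_train = true_fn(X_train) + rng.normal(0, 0.1, size=30)
X_val = rng.random(30)
y_val = true_fn(X_val) + rng.normal(0, 0.1, size=30)

candidate_lambdas = [1e-6, 1e-4, 1e-2, 1.0, 100.0]   # hypothetical grid
val_mse = {}
for lam in candidate_lambdas:
    model = make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        Ridge(alpha=lam),
    )
    model.fit(X_train[:, np.newaxis], y_train)
    residuals = y_val - model.predict(X_val[:, np.newaxis])
    val_mse[lam] = float(np.mean(residuals ** 2))
    print(f"lambda = {lam:g}: validation MSE = {val_mse[lam]:.4f}")

# Keep the value with the lowest validation error.
best_lambda = min(val_mse, key=val_mse.get)
print("selected lambda:", best_lambda)
```

The validation set plays the role of the hold-out data here: it is never used for fitting, only for comparing the candidate values.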

35 Normal Equations for Regularized Models

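How do we fit regularized models? In the linear case, we can do this easily by deriving generalized normal equations!
Let $L(\theta) = \frac{1}{2} (X\theta - y)^\top (X\theta - y)$ be our least squares objective. We can write the Ridge objective as:

$$J(\theta) = \frac{1}{2} (X\theta - y)^\top (X\theta - y) + \frac{1}{2} \lambda \|\theta\|_2^2$$

This allows us to derive the gradient as follows:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \left( \frac{1}{2} (X\theta - y)^\top (X\theta - y) + \frac{1}{2} \lambda \|\theta\|_2^2 \right) \\
&= \nabla_\theta \left( L(\theta) + \frac{1}{2} \lambda \|\theta\|_2^2 \right) \\
&= \nabla_\theta L(\theta) + \lambda \theta \\
&= (X^\top X)\theta - X^\top y + \lambda \theta \\
&= (X^\top X + \lambda I)\theta - X^\top y
\end{aligned}$$

We used the derivation of the normal equations for least squares to obtain $\nabla_\theta L(\theta)$ as well as the fact that $\nabla_x x^\top x = 2x$.
We can set the gradient to zero to obtain normal equations for the Ridge model:

$$(X^\top X + \lambda I)\theta = X^\top y.$$

Hence, the value $\theta^*$ that minimizes this objective is given by:

$$\theta^* = (X^\top X + \lambda I)^{-1} X^\top y.$$

Note that for $\lambda > 0$ the matrix $(X^\top X + \lambda I)$ is positive definite and therefore always invertible, which addresses a problem with least squares that we saw earlier.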
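
The closed-form solution can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function name `ridge_normal_equations` and the random toy data are made up, and the comparison with scikit-learn relies on `Ridge(alpha=lam, fit_intercept=False)` minimizing $\|X\theta - y\|_2^2 + \lambda\|\theta\|_2^2$, which has the same minimizer as the objective above.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_normal_equations(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy regression problem (hypothetical coefficients and noise level).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=40)
lam = 0.5

theta_star = ridge_normal_equations(X, y, lam)

# The gradient (X^T X + lam * I) theta - X^T y should vanish at the minimizer.
gradient = (X.T @ X + lam * np.eye(3)) @ theta_star - X.T @ y

# Cross-check against scikit-learn's Ridge without an intercept.
sk_theta = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("normal equations:", theta_star)
print("scikit-learn:    ", sk_theta)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; both correspond to the same formula for $\theta^*$.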

36 Algorithm: Ridge Regression

• Type: Supervised learning (regression)

• Model family: Linear models
• Objective function: L2-regularized mean squared error

• Optimizer: Normal equations

Part 5: Regularization and Sparsity

We will now look at another form of regularization, which will have an important new property called sparsity.

37 Regularization: Definition

The idea of regularization is to train models with an augmented objective $J : M \to \mathbb{R}$ defined over a training dataset $D$ of size $n$ as

$$J(f) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f(x^{(i)})) + \lambda \cdot R(f).$$

Let’s dissect the components of this objective:

• A loss function $L(y, f(x))$ such as the mean squared error.
• A regularizer $R : M \to \mathbb{R}$ that penalizes models that are overly complex.

38 L1 Regularization: Definition

Another closely related approach to regularization is to penalize the size of the weights using the
L1 norm.
In the context of linear models $f(x) = \theta^\top x$, L1 regularization yields the following objective:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, \theta^\top x^{(i)}) + \lambda \cdot \|\theta\|_1.$$

Let’s dissect the components of this objective.

• The regularizer $R : M \to \mathbb{R}$ is the function $R(\theta) = \|\theta\|_1 = \sum_{j=1}^{d} |\theta_j|$. This is also known as the L1 norm of $\theta$.
• The regularizer also penalizes large weights. It also forces more weights to decay to zero, as opposed to just being small.
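
One way to see why L1 drives weights exactly to zero while L2 only shrinks them is a one-dimensional toy problem that is not part of the lecture: minimize $\frac{1}{2}(\theta - z)^2$ plus a penalty, for a fixed number $z$. The two closed-form minimizers below are standard results; the function names and the example values are assumptions.

```python
import numpy as np

def l2_shrink(z, lam):
    """Minimizer of 0.5 * (theta - z)**2 + 0.5 * lam * theta**2."""
    return z / (1.0 + lam)

def l1_shrink(z, lam):
    """Minimizer of 0.5 * (theta - z)**2 + lam * |theta| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])
lam = 0.5

ridge_like = l2_shrink(z, lam)
lasso_like = l1_shrink(z, lam)
print("L2 penalty:", ridge_like)
print("L1 penalty:", lasso_like)
```

The L2 penalty rescales every entry by the same factor, so nothing becomes exactly zero, whereas the L1 penalty sets every entry with $|z| \leq \lambda$ to zero.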

39 Algorithm: Lasso

L1-regularized linear regression is also known as the Lasso (least absolute shrinkage and selection
operator).
• Type: Supervised learning (regression)
• Model family: Linear models
• Objective function: L1-regularized mean squared error
• Optimizer: gradient descent, coordinate descent, least angle regression (LARS) and others

40 Regularizing via Constraints

Consider a regularized problem with a penalty term:

$$\min_{\theta \in \Theta} L(\theta) + \lambda \cdot R(\theta).$$

We may also enforce an explicit constraint on the complexity of the model:

$$\min_{\theta \in \Theta} L(\theta) \quad \text{such that } R(\theta) \leq \lambda'$$

We will not prove this, but solving this problem is equivalent to solving the penalized problem for some $\lambda > 0$ that’s different from $\lambda'$.
In other words,
• We can regularize by explicitly enforcing $R(\theta)$ to be less than a value instead of penalizing it.
• For each value of $\lambda$, we are implicitly setting a constraint on $R(\theta)$.

41 Regularizing via Constraints: Example

This is what it looks like for a linear model:

$$\min_{\theta \in \Theta} \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \quad \text{such that } \|\theta\| \leq \lambda'$$

where $\|\cdot\|$ can either be the L1 or L2 norm.
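
The link between the penalty and the constraint can be observed numerically in the L2 case: as λ grows, the norm of the penalized solution shrinks, so each λ corresponds to some implicit bound on $\|\theta\|_2$. The sketch below uses the closed-form Ridge solution on random toy data; the data, the λ grid, and the helper name `ridge_solution` are assumptions for illustration.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Minimizer of 0.5 * ||X theta - y||^2 + 0.5 * lam * ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy regression problem (hypothetical coefficients and noise level).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 0.1, size=50)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_solution(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.2f}  ->  ||theta*||_2 = {norm:.4f}")
```

Reading the printed table from top to bottom, a larger penalty weight corresponds to a tighter implicit constraint on the norm of the solution.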

42 L1 vs. L2 Regularization

The following image by Divakar Kapil and Hastie et al. explains the difference between the two
norms.

43 Sparsity: Definition

A vector is said to be sparse if a large fraction of its entries is zero.

L1-regularized linear regression produces sparse weights.
• This makes the model more interpretable.
• It also makes it computationally more tractable in very large dimensions.
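
As a quick check of this property, the sketch below fits scikit-learn's `Lasso` and `Ridge` to the diabetes dataset and counts the coefficients that are exactly zero. The choice `alpha=1.0` for both models is an assumption made only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# Same regularization strength for both models (illustrative choice).
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Lasso: {n_zero_lasso} of {lasso.coef_.size} weights are exactly zero")
print(f"Ridge: {n_zero_ridge} of {ridge.coef_.size} weights are exactly zero")
```

The features whose Lasso weights remain non-zero can be read as the ones the model actually uses, which is the sense in which sparsity helps interpretability.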

44 Sparsity: Ridge Model

To better understand sparsity, we will fit L2-regularized linear models to the UCI diabetes dataset and observe the magnitude of each weight (colored lines) as a function of the regularization parameter.
[15]: # based on https://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from matplotlib import pyplot as plt

X, y = load_diabetes(return_X_y=True)

# create ridge coefficients
alphas = np.logspace(-5, 2)
ridge_coefs = []
for a in alphas:
    ridge = Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X, y)
    ridge_coefs.append(ridge.coef_)

# plot ridge coefficients
plt.figure(figsize=(14, 5))
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Regularization parameter (lambda)')
plt.ylabel('Magnitude of model parameters')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')

[Figure: Ridge coefficients as a function of the regularization parameter.]

45 Sparsity: Lasso Model

The above Ridge model did not produce sparse weights. Let’s now compare it to a Lasso model.
[16]: # Based on: https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars.html
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

# create lasso coefficients
_, _, lasso_coefs = lars_path(X, y, method='lasso')
xx = np.sum(np.abs(lasso_coefs.T), axis=1)

# plot ridge coefficients
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.plot(alphas, ridge_coefs)
plt.xscale('log')
plt.xlabel('Regularization Strength (alpha)')
plt.ylabel('Coefficients')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')

# plot lasso coefficients
plt.subplot(1, 2, 2)
plt.plot(3500 - xx, lasso_coefs.T)
plt.xlabel('Regularization Strength')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')

[Figure: Ridge coefficients (left) and the LASSO path (right) as functions of the regularization strength.]
