Statistical Learning


What is Statistical Learning?

[Figure: scatterplots of Sales versus TV, Radio, and Newspaper advertising budgets, each panel with a fitted regression line.]

Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model

Sales ≈ f(TV, Radio, Newspaper)
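A minimal sketch of these three separate simple regressions, assuming a hypothetical Advertising.csv file with columns TV, Radio, Newspaper and Sales (adjust names and path to your own copy of the data):

```python
import numpy as np
import pandas as pd

ads = pd.read_csv("Advertising.csv")   # hypothetical file name and columns

for col in ["TV", "Radio", "Newspaper"]:
    # Simple linear regression of Sales on a single predictor.
    slope, intercept = np.polyfit(ads[col], ads["Sales"], deg=1)
    print(f"Sales ≈ {intercept:.2f} + {slope:.3f} × {col}")
```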
Notation
Here Sales is a response or target that we wish to predict. We generically refer to the response as Y.
TV is a feature, or input, or predictor; we name it X1. Likewise we name Radio X2, and so on.
We can refer to the input vector collectively as

X = (X1, X2, X3)^T

Now we write our model as

Y = f(X) + ϵ

where ϵ captures measurement errors and other discrepancies.
What is f(X) good for?

• With a good f we can make predictions of Y at new points X = x.
• We can understand which components of X = (X1, X2, . . . , Xp) are important in explaining Y, and which are irrelevant. E.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.
• Depending on the complexity of f, we may be able to understand how each component Xj of X affects Y.
[Figure: simulated scatterplot of y versus x, for x between 1 and 7; at any given x there are many possible values of y.]
Is there an ideal f(X)? In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (average) of Y given X = 4.
This ideal f(x) = E(Y | X = x) is called the regression function.
The regression function f(x)

• Is also defined for vector X; e.g.
  f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3)
• Is the ideal or optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y − g(X))² | X = x] over all functions g at all points X = x.
• ϵ = Y − f(x) is the irreducible error — i.e. even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
• For any estimate f̂(x) of f(x), we have

  E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ϵ)

  where [f(x) − f̂(x)]² is the reducible error and Var(ϵ) is the irreducible error.
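A small simulation sketch of this decomposition, using a synthetic true function and an arbitrary (imperfect) estimate at one point, both chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)        # assumed "true" regression function
sigma = 0.5                    # standard deviation of the noise eps
x0, fhat_x0 = 4.0, 0.2         # a fixed point and some imperfect estimate there

# Many draws of Y at X = x0; the mean squared error should split into the
# reducible part [f(x0) - fhat(x0)]^2 plus the irreducible Var(eps).
y0 = f(x0) + sigma * rng.standard_normal(100_000)
print(np.mean((y0 - fhat_x0) ** 2))
print((f(x0) - fhat_x0) ** 2 + sigma ** 2)   # the two numbers should agree
```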
How to estimate f
• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let

  f̂(x) = Ave(Y | X ∈ N(x))

  where N(x) is some neighborhood of x.

[Figure: simulated y versus x data illustrating the idea: f̂(x) is the average of the responses yi whose xi fall in a neighborhood N(x) of x.]
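A minimal sketch of such neighborhood averaging on synthetic data, taking N(x) to be the k nearest observed x values (one simple choice of neighborhood):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)

def neighborhood_average(x0, x, y, k=20):
    """Estimate f(x0) = E(Y | X = x0) by averaging y over the k nearest x's."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

print(neighborhood_average(4.0, x, y))   # compare with the truth sin(4) ≈ -0.76
```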
• Nearest neighbor averaging can be pretty good for small p — i.e. p ≤ 4 and large-ish N.
• We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
• We need to average over a reasonable fraction of the N values of yi to bring the variance down — e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.
The curse of dimensionality

[Figure: left, a 10% neighborhood of a target point in two dimensions (x1, x2); right, the radius needed to capture a given fraction of the volume, plotted for p = 1, 2, 3, 5, 10. The required radius grows quickly with the dimension p.]
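A back-of-the-envelope version of the same effect, using cube-shaped neighborhoods in a unit hypercube instead of the spherical neighborhoods in the figure, since that keeps the arithmetic simple:

```python
# Edge length a sub-cube needs in order to contain 10% of points spread
# uniformly over the unit hypercube [0, 1]^p: 0.10 ** (1 / p).
for p in (1, 2, 3, 5, 10):
    edge = 0.10 ** (1 / p)
    print(f"p = {p:2d}: edge of a 10% neighborhood ≈ {edge:.2f}")
# At p = 10 the edge is about 0.79 -- most of the range, so not "local" at all.
```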
Parametric and structured models

The linear model is an important example of a parametric model:

f_L(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp.

• A linear model is specified in terms of p + 1 parameters β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).
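A minimal sketch of estimating the p + 1 parameters by least squares on synthetic training data (plain numpy here; statsmodels or scikit-learn would give the same fit):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])        # intercept plus p slopes
y = beta_true[0] + X @ beta_true[1:] + 0.5 * rng.standard_normal(n)

X1 = np.column_stack([np.ones(n), X])              # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit
print(beta_hat)                                    # close to beta_true
```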
A linear model f̂_L(X) = β̂0 + β̂1 X gives a reasonable fit here.

[Figure: the simulated y versus x data with the fitted straight line.]
A quadratic model f̂_Q(X) = β̂0 + β̂1 X + β̂2 X² fits slightly better.

[Figure: the same data with the fitted quadratic curve.]
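A small sketch comparing the two fits on synthetic data with a mildly curved truth (training error only; test-set assessment comes later in these slides):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(1, 6, size=60))
y = 0.5 * (x - 3.5) ** 2 + rng.standard_normal(60)   # mildly curved truth

for deg, name in [(1, "linear"), (2, "quadratic")]:
    fit = np.polynomial.Polynomial.fit(x, y, deg)    # least-squares polynomial
    mse = np.mean((y - fit(x)) ** 2)
    print(f"{name}: training MSE = {mse:.3f}")
```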
[Figure: 3D plot of Income against Years of Education and Seniority.]

Simulated example. Red points are simulated values for income from the model

income = f(education, seniority) + ϵ

f is the blue surface.
[Figure: a fitted regression plane for Income against Years of Education and Seniority.]

Linear regression model fit to the simulated data:

f̂_L(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority
[Figure: a smooth fitted surface for Income against Years of Education and Seniority.]

More flexible regression model f̂_S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit.
[Figure: a very rough fitted surface that passes through every training point.]

Even more flexible spline regression model f̂_S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting.
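A minimal sketch of both behaviours on synthetic two-predictor data, using scipy's thin-plate-spline radial basis interpolator as a stand-in for the surfaces above; its smoothing parameter plays the role of the roughness control, and smoothing = 0 reproduces the overfitting case by interpolating the training data exactly:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(150, 2))                  # e.g. education, seniority
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(150)

for smoothing in (1.0, 0.0):                          # roughness control; 0 = interpolate
    fit = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=smoothing)
    mse_tr = np.mean((fit(X) - y) ** 2)
    print(f"smoothing = {smoothing}: training MSE = {mse_tr:.4f}")
# smoothing = 0 gives training MSE ≈ 0: no errors on the training data.
```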
Some trade-offs

• Prediction accuracy versus interpretability.
  — Linear models are easy to interpret; thin-plate splines are not.
• Good fit versus over-fit or under-fit.
  — How do we know when the fit is just right?
• Parsimony versus black-box.
  — We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure: interpretability (high to low) plotted against flexibility (low to high). Subset selection and the lasso sit at the interpretable, inflexible end; least squares, generalized additive models and trees are intermediate; bagging, boosting and support vector machines are the most flexible and least interpretable.]
Assessing Model Accuracy

Suppose we fit a model f̂(x) to some training data Tr = {xi, yi}_1^N, and we wish to see how well it performs.
• We could compute the average squared prediction error over Tr:

  MSE_Tr = Ave_{i∈Tr}[yi − f̂(xi)]²

  This may be biased toward more overfit models.
• Instead we should, if possible, compute it using fresh test data Te = {xi, yi}_1^M:

  MSE_Te = Ave_{i∈Te}[yi − f̂(xi)]²
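A minimal sketch of the two quantities on synthetic data, with polynomial fits of increasing degree standing in for models of increasing flexibility:

```python
import numpy as np

rng = np.random.default_rng(5)
def make_data(n):
    x = rng.uniform(0, 10, n)
    return x, np.sin(x) + 0.4 * rng.standard_normal(n)

x_tr, y_tr = make_data(50)     # training set Tr
x_te, y_te = make_data(500)    # fresh test set Te

for deg in (1, 4, 12):         # increasing flexibility
    fit = np.polynomial.Polynomial.fit(x_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - fit(x_tr)) ** 2)
    mse_te = np.mean((y_te - fit(x_te)) ** 2)
    print(f"degree {deg:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```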
[Figure: left, simulated data with the true curve and three fits of increasing flexibility; right, mean squared error against flexibility.]

Black curve is truth. Red curve on right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility.
[Figure: a second example, same layout as before.]

Here the truth is smoother, so the smoother fit and linear model do really well.
[Figure: a third example, same layout as before.]

Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
Bias-Variance Trade-off

Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ϵ (with f(x) = E(Y | X = x)), then

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ϵ).

The expectation averages over the variability of y0 as well as the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] − f(x0).

Typically as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
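A simulation sketch of the decomposition at a single point x0, refitting the same model on many synthetic training sets (the true function, noise level, sample size and cubic fit are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(x)
sigma, x0, n, deg = 0.3, 2.0, 30, 3      # noise sd, test point, train size, flexibility

preds = []
for _ in range(2000):                     # many independent training sets Tr
    x = rng.uniform(0, 6, n)
    y = f(x) + sigma * rng.standard_normal(n)
    preds.append(np.polynomial.Polynomial.fit(x, y, deg)(x0))
preds = np.array(preds)

y0 = f(x0) + sigma * rng.standard_normal(preds.size)    # test responses at x0
lhs = np.mean((y0 - preds) ** 2)                         # E[(y0 - fhat(x0))^2]
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma ** 2
print(lhs, rhs)                           # the two estimates should be close
```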
Bias-variance trade-off for the three examples

[Figure: for each of the three examples, test MSE, squared bias, and variance plotted against flexibility.]
Classification Problems

Here the response variable Y is qualitative — e.g. email is one of C = (spam, ham) (ham = good email), digit class is one of C = {0, 1, . . . , 9}. Our goals are to:
• Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among X = (X1, X2, . . . , Xp).
[Figure: simulated two-class data shown as tick marks along x; a small barplot at x = 5 shows the conditional class probabilities there.]

Is there an ideal C(X)? Suppose the K elements in C are numbered 1, 2, . . . , K. Let

pk(x) = Pr(Y = k | X = x),  k = 1, 2, . . . , K.

These are the conditional class probabilities at x; e.g. see the little barplot at x = 5. Then the Bayes optimal classifier at x is

C(x) = j  if  pj(x) = max{p1(x), p2(x), . . . , pK(x)}.
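A minimal sketch of the Bayes classifier for a synthetic two-class problem in which the class-conditional densities are known Gaussians, so that pk(x) can be computed exactly via Bayes' rule:

```python
import numpy as np
from scipy.stats import norm

prior = {1: 0.5, 2: 0.5}                       # class priors Pr(Y = k)
dens = {1: norm(loc=2.0, scale=1.0),           # class-conditional densities of X
        2: norm(loc=4.0, scale=1.0)}

def p_k(x, k):
    """Conditional class probability pk(x) = Pr(Y = k | X = x)."""
    joint = {j: prior[j] * dens[j].pdf(x) for j in prior}
    return joint[k] / sum(joint.values())

def bayes_classifier(x):
    return max(prior, key=lambda k: p_k(x, k))

print(p_k(5.0, 2), bayes_classifier(5.0))      # class 2 dominates at x = 5
```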
[Figure: the same style of two-class data with nearest-neighbor estimates of the conditional class probabilities.]

Nearest-neighbor averaging can be used as before. It also breaks down as the dimension grows. However, the impact on Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.
Classification: some details

Bayes classifier: in a 2-class problem, assign Class 1 if Pr(Y = 1 | X = x) > 0.5, else Class 2.

Bayes decision boundary

Bayes error rate

K-nearest neighbors
Classification: some details

• Typically we measure the performance of Ĉ(x) using the misclassification error rate:

  Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)]

• The Bayes classifier (using the true pk(x)) has smallest error (in the population).
• Support-vector machines build structured models for C(x).
• We will also build structured models for representing the pk(x), e.g. logistic regression and generalized additive models.
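The error rate is just an average of 0/1 indicators; a tiny sketch on toy labels:

```python
import numpy as np

y_te  = np.array([1, 2, 1, 1, 2, 2, 1])    # true labels on a toy test set
y_hat = np.array([1, 2, 2, 1, 2, 1, 1])    # predictions from some classifier
err_te = np.mean(y_te != y_hat)            # Ave over Te of I[yi != C_hat(xi)]
print(err_te)                              # 2/7 ≈ 0.286
```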
Example: K-nearest neighbors in two dimensions
[Figure: simulated two-class training data plotted in two dimensions, X1 and X2.]
KNN: K = 10

[Figure: the K = 10 nearest-neighbor decision boundary overlaid on the training data.]
KNN: K = 1 and K = 100

[Figure: the K = 1 (left) and K = 100 (right) nearest-neighbor decision boundaries on the same data.]
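A minimal sketch of the same comparison on synthetic two-dimensional two-class data, using scikit-learn's KNN classifier and reporting training and test error rates:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
def make_data(n):
    X = rng.standard_normal((n, 2))
    y = (X[:, 0] + X[:, 1] + 0.7 * rng.standard_normal(n) > 0).astype(int)
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(1000)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err_tr = np.mean(knn.predict(X_tr) != y_tr)
    err_te = np.mean(knn.predict(X_te) != y_te)
    print(f"K = {k:3d}: training error = {err_tr:.3f}, test error = {err_te:.3f}")
```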
[Figure: training and test error rates for KNN plotted against 1/K.]
