
STAT6121-ML

Introduction
Machine learning
Machine learning refers to approaches to data analysis
that cannot be carried out without computers
In common with statistical modelling/analysis
• For prediction and classification
• Requires an optimisation procedure
• Obtains parameters or functions from observations
• Uncertainty of learning vs. prediction/classification
New elements or techniques
• Explanation or theoretical construct is not the emphasis
• Data can be ‘organic’, such as text or images
• Distinction between training and validation/test data
• Reliance on ready-made software for implementation

Some broad remarks
Supervised vs. unsupervised learning
• Is there a target outcome y given covariates/features x?
NB. log-linear models of contingency tables
• Can the learned result be applied to unseen units?
NB. principal components, clustering
Prediction vs. classification
• Best prediction of y is its expectation µx = E(y | x), since
E{(y − µ)² | x} = (µx − µ)² + E{(y − µx)² | x}
for any constant µ, which is minimised at µ = µx

• Best classification of categorical y is
y0 = arg max_{y′} Pr(y = y′ | x)
e.g. if y ∼ N(µ, σ²), then E(y) = µ but Pr(y = µ) = 0;
however, let z = I(y > µ − σ), then z0 = 1,
since Pr(z = 1) = Φ(1) ≈ 0.84 > 0.5 (see the sketch below)
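A minimal simulation sketch of this contrast (the values µ = 2, σ = 1 are
illustrative, not from the notes):

# best prediction vs. best classification for y ~ N(mu, sigma^2)
set.seed(1)
mu = 2; sigma = 1
y = rnorm(1e5, mu, sigma)
mean((y - mu)^2)          # approx sigma^2: the mean minimises squared error
mean((y - (mu + 0.5))^2)  # any other constant predictor does worse
z = as.integer(y > mu - sigma)
mean(z)                   # approx pnorm(1) = 0.84 > 0.5, hence z0 = 1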

Some broad remarks
Parametric vs. non-parametric models
• function/model f(x; θ) is fixed given θ, i.e. the parameters,
where f(x) = E(y | x) or f(x) = Pr(y | x)
• parametric if θ contains a fixed number of constants
NB. linear regression model as a typical example
• non-parametric if the no. of unknowns in θ grows with the
no. of observations, or if f is indeterminate in advance
Error vs. residual
• Given f(x) = E(y | x) or y0 = arg max_{y′} Pr(y = y′ | x), the error is
e = y − f(x) or e = I(y ≠ y0)
• Given fˆ or ŷ0 as the estimate of f or y0, the residual is
ê = y − fˆ(x) or ê = I(y ≠ ŷ0),
if (y, x) are used for obtaining fˆ or ŷ0 (illustrated below)
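A sketch of the distinction on simulated data, where the true f is known
by construction (the set-up is illustrative):

# error vs. residual in a simulated linear model
set.seed(2)
n = 100
x = seq(1, 10, length=n)
f = 0.5 + x                 # true f(x) = E(y | x)
y = f + rnorm(n)
fit = lm(y ~ x)
e = y - f                   # error: requires the true f
ehat = residuals(fit)       # residual: based on the estimated fhat
c(var(e), var(ehat))        # close, but not identical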
Bias-variance trade-off

Eq. (2.7), mean squared error (MSE) of fˆ(x) for y given x:

E{(y − fˆ(x))²} = E{(y − f(x) + f(x) − fˆ(x))²}
= E{(y − f(x))²} + E{(f(x) − fˆ(x))²} − 2E{(y − f(x))(f(x) − fˆ(x))}
= V(e(x)) + V(fˆ(x)) + Bias(fˆ(x))²

over fˆ(x) and y that are independent of each other,
so that the cross term vanishes
NB. V(e(x)) is unaffected by whichever fˆ

Q: Reduce V(fˆ(x)) and Bias(fˆ(x))² at the same time?
• to reduce V(fˆ(x)), let fˆ(x) be obtained based on many
observations, e.g. by using parametric f(x; θ)...
• to reduce Bias(fˆ(x))², let fˆ(x) only depend on close-by
observations, provided f is reasonably smooth...
• hence, the bias-variance trade-off (a Monte Carlo check is
sketched below)
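A possible Monte Carlo check of the decomposition at a single point
(the set-up is illustrative: true f(x) = 0.5 + log(x), N(0, 1) noise,
x0 = 5):

# MSE decomposition: compare a rigid and a flexible fit at x0
set.seed(3)
R = 2000; n = 100; x0 = 5
f = function(x) 0.5 + log(x)
fhat.lin = fhat.cub = numeric(R)
for (r in 1:R) {
  x = seq(1, 10, length=n)
  y = f(x) + rnorm(n)
  fhat.lin[r] = predict(lm(y ~ x), data.frame(x=x0))
  fhat.cub[r] = predict(lm(y ~ poly(x, 3)), data.frame(x=x0))
}
rbind(linear = c(bias2 = (mean(fhat.lin) - f(x0))^2, var = var(fhat.lin)),
      cubic  = c(bias2 = (mean(fhat.cub) - f(x0))^2, var = var(fhat.cub)))
# the rigid linear fit has the larger squared bias, the flexible cubic
# fit the larger variance; V(e(x0)) = 1 is common to both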
Ch. 3, exercise 4
Answer by ML, e.g. x ∈ (1, 10) and f(x) = β0 + β1 log(x) for (c)
get.dta <- function(n=100, beta=c(0.5,1), nonlnr=F)
{
  # n observations on an equally spaced grid over (1, 10)
  x = seq(1,10,length=n)
  if (nonlnr) { f = beta[1] + beta[2]*log(x) }  # f(x) = beta0 + beta1*log(x)
  else { f = beta[1] + beta[2]*x }              # f(x) = beta0 + beta1*x
  y = f + rnorm(n, 0, 1)                        # N(0,1) noise
  x2 = x^2; x3 = x^3                            # terms for the cubic fit
  data.frame(y,x,x2,x3,f)
}

main <- function(n=100, beta=c(0.5,1), nonlnr=F, vis=F)
{
  # generate data, then compare a linear and a cubic fit
  dta = get.dta(n=n, beta=beta, nonlnr=nonlnr)
  if (nonlnr) { cat("data generated under nonlinear model\n\n") }
  else { cat("data generated under linear model\n\n") }
  cat("fitting simple linear regression:\n")
  print(summary(lm(y ~ x, data=dta)))
  cat("fitting cubic (polynomial) regression:\n")
  print(summary(lm(y ~ x + x2 + x3, data=dta)))
  if (vis) { plot(dta$x, dta$y); lines(dta$x, dta$f) }  # data with true f overlaid
}
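For instance, part (c) might be reproduced with:

# data from the nonlinear model, with both fits and a visual check
main(n=100, beta=c(0.5,1), nonlnr=TRUE, vis=TRUE)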

Additional exercise

[Figure: scatter plot of y against equally spaced x, with fitted lines
fˆ1(x) = β̂x (solid) and fˆ2(x) = y (dashed)]


• What is V(fˆ(x)) at any given x, for fˆ = fˆ1 or fˆ2?
What can you say about Bias(fˆ(x))?
Consider the KNN predictor given K (a minimal implementation
is sketched below)
• How would you apply the method if x = 5 or 10?
• What about V(fˆ(x)) and Bias(fˆ(x)) in this case?
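A minimal KNN predictor one might use here (knn.predict is a
hypothetical helper, not from the notes):

# KNN prediction of y at x0: average the y-values of the K nearest x-values
knn.predict = function(x0, x, y, K) {
  nbr = order(abs(x - x0))[1:K]   # indices of the K nearest neighbours
  mean(y[nbr])
}
# at x0 = 5 the neighbours lie on both sides; at the boundary x0 = 10
# they all lie below, which tends to increase the bias there
x = seq(1, 10, length=100)
y = 0.5 + x + rnorm(100)
c(knn.predict(5, x, y, K=10), knn.predict(10, x, y, K=10))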
 

Additional exercise
β̂ = Σ_{i=1}^n x_i y_i / Σ_{i=1}^n x_i²

V(fˆ1(x)) = V(β̂x) = x² V(β̂)
= x² Σ_{i=1}^n x_i² V(y_i | x_i) / (Σ_{i=1}^n x_i²)² = x² V(y_i | x_i) / Σ_{i=1}^n x_i²

V̂(y_i | x_i) = Σ_{i=1}^n (y_i − β̂x_i)² / (n − 1)

V(fˆ2(x)) = V(y | x) = V(y_i | x_i)
NB. non-existent V̂(y_i | x_i), since there is only one observation at each x_i

For the KNN predictor:

fˆ(x) = Σ_{j=1}^K y_j(x) / K

V(fˆ(x)) = Σ_{j=1}^K V(y_j(x)) / K² = V(y | x) / K

V̂(y_j(x)) = Σ_{j=1}^K (y_j(x) − fˆ(x))² / (K − 1)   NB. from the K obs.

Assume unbiasedness in all the cases...
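The formula for V(fˆ1(x)) can be checked by simulation; a sketch under
an assumed set-up (y = x + N(0, 1) noise on an equally spaced grid):

# Monte Carlo check of V(fhat1(x)) = x^2 * V(y_i | x_i) / sum(x_i^2)
set.seed(4)
R = 5000; n = 100; x0 = 5
x = seq(1, 10, length=n)
bhat = replicate(R, {
  y = x + rnorm(n)                # true beta = 1, sigma = 1
  sum(x * y) / sum(x^2)           # least-squares slope through the origin
})
c(empirical = var(x0 * bhat), theory = x0^2 / sum(x^2))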
