
INFO-F-422: Statistical foundations of machine learning
Linear regression

Gianluca Bontempi
Machine Learning Group
Computer Science Department
mlg.ulb.ac.be

Beyond parameter estimation

So far, a simple task: estimation of parameters of univariate distributions.
More complex estimation tasks:
◮ Parameters of multivariate distributions: consider for example multivariate gaussians.
◮ More complex functionals.
◮ Discrete conditional distributions: pattern recognition or pattern classification.
◮ Continuous conditional distributions: regression.

Bivariate continuous random scatterplot / Bivariate density distribution

[Figure: scatterplot of an input/output training set, y versus x.]
Prediction problems

◮ Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, on the basis of demographic, diet and clinical measurements.
◮ Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
◮ Identify the risk factors for breast cancer, based on clinical, demographic and genetic variables.
◮ Classify the category of a text email (spam or not) on the basis of its text content.
◮ Characterize the mechanical property of a steel plate on the basis of its physical and chemical composition.

Input/output problems

All the previous examples are characterized by
1. an outcome measurement, also called output, usually quantitative (like a stock price) or categorical (like heart attack/no heart attack);
2. a set of features or inputs, also quantitative or categorical, that we wish to use to predict the output.

Assumption: the input variables provide some explanation for the variability of the output.

Having collected a set of input/output data (training set), we use statistical methods to build a prediction model (learner) to predict the outcome for new, unseen objects.

Supervised learning

[Diagram: an INPUT feeds an UNKNOWN DEPENDENCY producing an OUTPUT; a TRAINING DATASET is used to fit a MODEL whose PREDICTION is compared with the output, giving a PREDICTION ERROR.]

The learning is called supervised because of the presence of the outcome variable which guides the learning process. Collecting a set of training data is like having a teacher suggesting the correct answer for each input.

Regression and classification

According to the type of output, there are two prediction tasks:
◮ Regression: quantitative outputs, e.g. real or integer numbers.
◮ Classification (or pattern recognition): qualitative or categorical outputs which take values in a finite set of classes (e.g. black, white and red) where there is no explicit ordering. Qualitative variables are also referred to as factors.
Simple linear model

The simplest regression model is the linear model

  y = β0 + β1 x + w

where
◮ x ∈ R is the regressor (or independent) variable,
◮ y ∈ R is the measured response (or dependent) variable,
◮ β0 is the intercept, β1 is the slope,
◮ E[w] = 0, where w is the model error.

This implies that

  E[y|x] = f(x) = β0 + β1 x,    Var[y|x] = Var[w]

Linear regression function

The function f(x) = E[y|x] is also known as the regression function.

[Figure: a regression function f(x) with the conditional expectations E[y|x1] and E[y|x2] marked at two input values x1 and x2.]
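A minimal R sketch of this generative model and of its regression function f(x) = E[y|x]; the values β0 = 1, β1 = 0.5 and σw = 0.3 are illustrative assumptions, not taken from the course material:

  ## Simulate N draws from y = beta0 + beta1*x + w with E[w] = 0
  set.seed(1)                      # for reproducibility
  N     <- 100
  beta0 <- 1; beta1 <- 0.5         # assumed "true" parameters (illustrative)
  sigw  <- 0.3                     # standard deviation of the model error w
  x <- runif(N, 0, 10)             # regressor values
  w <- rnorm(N, mean = 0, sd = sigw)
  y <- beta0 + beta1 * x + w       # conditional mean E[y|x] = beta0 + beta1*x
  plot(x, y, main = "Input/output training set")
  abline(beta0, beta1, lty = 2)    # the regression function f(x) = E[y|x]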

What does “linear” mean?

In the following, a linear model is any input/output relationship which is linear in the model parameters and not necessarily in the dependent variables. This means that
1. any value of the response variable y is described by a linear combination of a series of parameters (regression slopes, intercept);
2. no parameter appears as an exponent or is multiplied or divided by another parameter.

Example of linear models

According to our definition of linear model,
◮ y = β0 + β1 x is a linear model.
◮ y = β0 + β1 x² is again a linear model: simply by making the transformation X = x², the model can be put in the linear form y = β0 + β1 X.
◮ y = B0 x^β1 can be studied as a linear model between Y = log(y) and X = log(x), with β0 = log(B0), thanks to the equality
    log(y) = log(B0) + β1 log(x)  ⇔  Y = β0 + β1 X
  (see the R sketch after this list).
◮ The relationship y = β0 + β1 β2^x cannot be linearized.
◮ Let z be a categorical variable taking 4 possible values {c1, ..., c4}. It is possible to model a linear dependence with y by creating four binary variables xj such that xj = 1 ⇔ z = cj.
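A minimal R sketch of the log-log linearization mentioned above; the values B0 = 2, β1 = 0.7 and the multiplicative noise are illustrative assumptions:

  ## Fit y = B0 * x^beta1 by regressing log(y) on log(x)
  set.seed(2)
  N  <- 100
  x  <- runif(N, 1, 10)
  B0 <- 2; beta1 <- 0.7                      # assumed "true" parameters
  y  <- B0 * x^beta1 * exp(rnorm(N, 0, 0.1)) # multiplicative noise keeps y > 0
  fit <- lm(log(y) ~ log(x))                 # Y = beta0 + beta1*X with Y = log(y), X = log(x)
  exp(coef(fit)[1])                          # estimate of B0 = exp(beta0hat)
  coef(fit)[2]                               # estimate of beta1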
Model estimation

◮ N i.i.d. pairs of observations DN = {⟨xi, yi⟩}, i = 1, ..., N.
◮ Data generated by the stochastic process

    yi = β0 + β1 xi + wi,   i = 1, ..., N

  where
  1. wi are i.i.d. realizations of the r.v. w having mean zero and constant variance σ²w (homoscedasticity),
  2. xi are non-random and observed with negligible error.
◮ Then the unknown parameters (also known as regression coefficients) β0 and β1 can be estimated by the least-squares method.

Least squares formulation

The method of least squares is designed to provide
1. estimations β̂0 and β̂1 of β0 and β1,
2. the fitted values of the response y

    ŷi = β̂0 + β̂1 xi,   i = 1, ..., N

so that the residual sum of squares

  SSEemp(b0, b1) = Σ_{i=1}^N (yi − b0 − b1 xi)²

is minimized.

See the Shiny script leastsquares.R.
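As an illustration of the criterion (this is not the leastsquares.R script, which is not reproduced here), a minimal R sketch that minimizes SSEemp numerically and compares the result with lm(); the simulated data reuse the illustrative values β0 = 1, β1 = 0.5:

  ## Minimize SSEemp(b0, b1) numerically and compare with lm()
  set.seed(3)
  N <- 50
  x <- runif(N, 0, 10)
  y <- 1 + 0.5 * x + rnorm(N, 0, 0.3)          # assumed generating process
  SSEemp <- function(b) sum((y - b[1] - b[2] * x)^2)
  opt <- optim(c(0, 0), SSEemp)                # generic numerical minimization
  opt$par                                      # close to the least-squares solution
  coef(lm(y ~ x))                              # closed-form least-squares estimates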

Least-squares solution

Since the error function SSEemp(b0, b1) is a quadratic function of the coefficients b0 and b1, the minimization of the error function has a unique solution which can be found in closed form. This is called the least-squares solution:

  {β̂0, β̂1} = arg min_{b0,b1} SSEemp(b0, b1) = arg min_{b0,b1} Σ_{i=1}^N (yi − b0 − b1 xi)²

Empirical risk

From SSEemp we can define the term

  MISEemp = min_{b0,b1} SSEemp(b0, b1) / N = SSEemp(β̂0, β̂1) / N = Σ_{i=1}^N (yi − β̂0 − β̂1 xi)² / N

which is called the empirical risk or training error.

Note that the term SSEemp is a function of the training set and as such it can be considered as a realization of a random variable.
Univariate least-squares solution

It can be shown that the least-squares solution is

  β̂1 = Sxy / Sxx,    β̂0 = ȳ − β̂1 x̄

where

  x̄ = Σ_{i=1}^N xi / N,    ȳ = Σ_{i=1}^N yi / N

  Sxy = Σ_{i=1}^N (xi − x̄) yi

  Sxx = Σ_{i=1}^N (xi − x̄)² = Σ_{i=1}^N (xi − x̄) xi

Properties of the least-squares estimators

If the dependency underlying the data is linear then the estimators are unbiased. Since x is non-random and Σ_{i=1}^N (xi − x̄) = 0,

  E_DN[β̂1] = E_DN[Sxy / Sxx] = Σ_{i=1}^N (xi − x̄) E[yi] / Sxx
           = (1/Sxx) ( Σ_{i=1}^N (xi − x̄) β0 + Σ_{i=1}^N (xi − x̄) β1 xi ) = β1 Sxx / Sxx = β1

Also it can be shown that

  Var[β̂1] = σ²w / Sxx,    E[β̂0] = β0,    Var[β̂0] = σ²w (1/N + x̄²/Sxx)
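A minimal R check of the closed-form solution and of Var[β̂1] = σ²w / Sxx by Monte Carlo; the data-generating values are the same illustrative ones used in the earlier sketches:

  ## Closed-form univariate least squares and a Monte Carlo check of Var[beta1hat]
  set.seed(4)
  N <- 50; beta0 <- 1; beta1 <- 0.5; sigw <- 0.3
  x   <- runif(N, 0, 10)                       # kept fixed across replications
  Sxx <- sum((x - mean(x))^2)
  one.fit <- function() {
    y   <- beta0 + beta1 * x + rnorm(N, 0, sigw)
    Sxy <- sum((x - mean(x)) * y)
    b1  <- Sxy / Sxx
    c(b0 = mean(y) - b1 * mean(x), b1 = b1)
  }
  B <- replicate(5000, one.fit())
  rowMeans(B)                                  # approx (beta0, beta1): unbiasedness
  var(B["b1", ])                               # approx sigw^2 / Sxx
  sigw^2 / Sxx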

Properties of the least-squares estimators (II)

◮ It can be shown that the error mean square

    σ̂²w = Σ_{i=1}^N (yi − ŷi)² / (N − 2)

  is an unbiased estimator of σ²w under the (strong) assumption that the linear model is correct.
◮ The denominator is often referred to as the residual degrees of freedom, also denoted by df.
◮ The degrees of freedom can be seen as the number N of samples reduced by the number p of parameters estimated (slope and intercept).
◮ The estimate of the variance σ²w allows the estimation of the variance of the intercept and slope, respectively.

Sample correlation coefficient

The usual estimator of the correlation

  ρ(x, y) = Cov[x, y] / sqrt(Var[x] Var[y])

between two r.v. x and y is

  ρ̂ = Sxy / sqrt(Sxx Syy)

Note that since β̂1 = Sxy / Sxx, the following relation holds:

  ρ̂² = β̂1 Sxy / Syy
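A quick R check of the relation ρ̂² = β̂1 Sxy / Syy on arbitrary simulated data (the numbers below are illustrative; any pair of numeric vectors of equal length would do):

  ## Verify rhohat^2 = beta1hat * Sxy / Syy
  set.seed(5)
  x <- runif(50, 0, 10); y <- 1 + 0.5 * x + rnorm(50, 0, 0.3)
  Sxy <- sum((x - mean(x)) * y)
  Sxx <- sum((x - mean(x))^2)
  Syy <- sum((y - mean(y)) * y)                # same centering identity as for Sxx
  cor(x, y)^2                                  # sample squared correlation
  (Sxy / Sxx) * Sxy / Syy                      # beta1hat * Sxy / Syy, same value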
Variance of the response

◮ Since β̂0 = ȳ − β̂1 x̄,

    ŷ = β̂0 + β̂1 x = ȳ − β̂1 x̄ + β̂1 x = ȳ + β̂1 (x − x̄)

  is the estimator of the conditional expectation in x.
◮ Under the linear hypothesis, we have for a specific x = x0

    E[ŷ|x0] = E[β̂0] + E[β̂1] x0 = β0 + β1 x0 = E[y|x0]

◮ Since Var[β̂1] = σ²w / Sxx and Cov[ȳ, β̂1] = 0, the variation of ŷ at x0, if repeated data collection and consequent regressions were conducted, is

    Var[ŷ|x0] = Var[ȳ + β̂1 (x0 − x̄)] = σ²w (1/N + (x0 − x̄)²/Sxx)

  where x̄ = Σ_{i=1}^N xi / N.

Multiple linear dependency

◮ Consider a linear relation between an independent variable x ∈ X ⊂ R^n and a dependent random variable y ∈ Y ⊂ R:

    y = β0 + β1 x·1 + β2 x·2 + · · · + βn x·n + w

  where w represents a random variable with mean zero and constant variance σ²w.
◮ In matrix notation the equation can be written as

    y = x^T β + w

  where x stands for the [p × 1] vector x = [1, x·1, x·2, ..., x·n]^T, β = [β0, ..., βn]^T is the vector of parameters and p = n + 1 is the total number of model parameters.
◮ NB: in the following x·j (and xj) will denote the jth (j = 1, ..., n) variable of the vector x, while xi (i = 1, ..., N) will denote the ith observation of the vector x.

The multiple linear regression model with n = 2

[Figure: linear regression fit with two inputs (n = 2), excerpt from "The Elements of Statistical Learning" book.]

The multiple linear regression model

Consider N observations DN = {⟨xi, yi⟩ : i = 1, ..., N}, where xi = (1, xi1, ..., xin), generated according to the previous model. We suppose that the following multiple linear relation holds:

  Y = Xβ + W

where Y is the [N × 1] response vector, X is the [N × p] data matrix whose (j+1)th column contains the readings on the jth regressor, and β is the [p × 1] vector of parameters:

  Y = [y1, y2, ..., yN]^T

  X = [ 1  x11  x12  ...  x1n
        1  x21  x22  ...  x2n
        ...
        1  xN1  xN2  ...  xNn ]  =  [x1^T; x2^T; ...; xN^T]

  β = [β0, β1, ..., βn]^T,    W = [w1, w2, ..., wN]^T

where the wi are assumed uncorrelated, with mean zero and constant variance σ²w (homogeneous variance). Then Var[w1, ..., wN] = σ²w I_N.
The least-squares solution

The least-squares estimator β̂ is

  β̂ = arg min_b Σ_{i=1}^N (yi − xi^T b)² = arg min_b (Y − Xb)^T (Y − Xb)

Given β̂ we obtain

  SSEemp = (Y − X β̂)^T (Y − X β̂) = e^T e

where SSEemp represents the residual sum of squares for linear models and e is the [N × 1] vector of residuals. We define also the empirical (or training) error quantity

  MISEemp = SSEemp / N

The vector β̂ must satisfy

  ∂/∂β̂ [(Y − X β̂)^T (Y − X β̂)] = 0  ⇔  −2 X^T (Y − X β̂) = 0

Normal equations

Differentiating the residual sum of squares we obtain the least-squares normal equations

  (X^T X) β̂ = X^T Y

As a result, assuming X is of full column rank,

  β̂ = (X^T X)^{-1} X^T Y

where the X^T X matrix is a positive definite symmetric [p × p] matrix which plays an important role in multiple linear regression. The predicted values for the training set are

  Ŷ = X β̂ = X (X^T X)^{-1} X^T Y

where H = X (X^T X)^{-1} X^T is the Hat matrix.

In R notation:

  betahat=solve(t(X)%*%X) %*%t(X)%*%Y
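A minimal R sketch around the betahat line above; the design matrix and the "true" parameter vector below are simulated for illustration, only the normal-equations formula itself comes from the slides:

  ## Normal equations, hat matrix and comparison with lm() on simulated data
  set.seed(6)
  N <- 100; n <- 3                              # n regressors, p = n + 1 parameters
  X <- cbind(1, matrix(rnorm(N * n), N, n))     # [N x p] data matrix with intercept column
  beta <- c(1, 0.5, -0.2, 0.8)                  # assumed "true" parameters (illustrative)
  Y <- drop(X %*% beta + rnorm(N, 0, 0.3))
  betahat <- solve(t(X) %*% X) %*% t(X) %*% Y   # least-squares solution
  H    <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
  Yhat <- H %*% Y                               # fitted values, equal to X %*% betahat
  cbind(betahat, coef(lm(Y ~ X - 1)))           # same estimates as lm (X already has the intercept)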

R function lm

  summary(lm(Y~X))

  Call:
  lm(formula = Y ~ X)

  Residuals:
       Min       1Q   Median       3Q      Max
  -0.40141 -0.14760 -0.02202  0.03001  0.43490

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  1.09781    0.11748   9.345 6.26e-09 ***
  X            0.02196    0.01045   2.101   0.0479 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 0.2167 on 21 degrees of freedom
  Multiple R-Squared: 0.1737, Adjusted R-squared: 0.1343
  F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479

◮ R script Linear/bv_mult.R.

Analysis of the LS estimate

If the linear dependency assumption holds:
◮ If E[w] = 0 then β̂ is an unbiased estimator of β.
◮ The residual mean square estimator

    σ̂² = (Y − X β̂)^T (Y − X β̂) / (N − p)

  is an unbiased estimator of the error variance σ²w.
◮ If the wi are uncorrelated and have common variance, the variance-covariance matrix of β̂ is given by

    Var[β̂] = σ²w (X^T X)^{-1}
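A short R check of the last two bullets (simulated data again; σ̂² replaces the unknown σ²w in the covariance formula, exactly as lm does internally):

  ## Residual mean square and variance-covariance matrix of betahat
  set.seed(6)
  N <- 100; X <- cbind(1, matrix(rnorm(N * 3), N, 3)); p <- ncol(X)
  Y <- drop(X %*% c(1, 0.5, -0.2, 0.8) + rnorm(N, 0, 0.3))   # illustrative data
  fit     <- lm(Y ~ X - 1)                     # X already contains the intercept column
  betahat <- solve(t(X) %*% X) %*% t(X) %*% Y
  e       <- Y - X %*% betahat                 # residual vector
  sigma2  <- drop(t(e) %*% e) / (N - p)        # unbiased estimate of sigma_w^2
  c(sigma2, summary(fit)$sigma^2)              # same value
  max(abs(sigma2 * solve(t(X) %*% X) - vcov(fit)))   # estimated Var[betahat] matches vcov(lm)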
Variance of the prediction

◮ The prediction ŷ for a generic input value x = x0 is unbiased:

    E[ŷ|x0] = x0^T β

◮ The variance of the prediction ŷ for a generic input value x = x0 is given by

    Var[ŷ|x0] = σ²w x0^T (X^T X)^{-1} x0

◮ Assuming a normal error w, the 100(1 − α)% confidence bound for the regression value E[y|x = x0] is given by

    ŷ(x0) ± t_{α/2, N−p} σ̂w sqrt(x0^T (X^T X)^{-1} x0)

  where t_{α/2, N−p} is the upper α/2 percent point of the t-distribution with N − p degrees of freedom and the quantity σ̂w sqrt(x0^T (X^T X)^{-1} x0) is the standard error of prediction for multiple regression. (An R sketch of this bound follows after the next slide.)

Generalization error of the linear model

◮ The linear predictor ŷ = x^T β̂ has been estimated by using the training dataset DN = {⟨xi, yi⟩ : i = 1, ..., N}. Hence β̂ is a r.v.
◮ Now we want to use it to predict, for a test input x, the future output y(x).
◮ The test output y(x) is independent of the training set DN.
◮ Which precision can we expect from ŷ(x) = x^T β̂ on average?
◮ A measure of error is the MSE

    MSE(x) = E_{DN,y}[(y(x) − x^T β̂)²] = σ²w + E_DN[(x^T β − x^T β̂)²]

  where y(x) is independent of DN, and its integrated version

    MISE = ∫_X MSE(x) p(x) dx

◮ How can we estimate this quantity?
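A minimal R sketch of the confidence bound, checked against predict(); the univariate data, the query point x0 = 5 and the level 95% are illustrative assumptions:

  ## 95% confidence bound for the regression value at x0, by formula and by predict()
  set.seed(7)
  N <- 30; x <- runif(N, 0, 10); y <- 1 + 0.5 * x + rnorm(N, 0, 0.3)
  fit <- lm(y ~ x)
  X  <- cbind(1, x); p <- ncol(X)
  x0 <- c(1, 5)                                   # query point, with intercept term
  yhat0  <- sum(x0 * coef(fit))
  sigmaw <- summary(fit)$sigma                    # estimate of sigma_w
  se0    <- sigmaw * sqrt(drop(t(x0) %*% solve(t(X) %*% X) %*% x0))
  yhat0 + c(-1, 1) * qt(0.975, N - p) * se0       # bound from the formula above
  predict(fit, newdata = data.frame(x = 5), interval = "confidence", level = 0.95)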

The expected empirical error

◮ Is the empirical risk a good estimate of the MISE generalization error?
◮ The expectation of the residual sum of squares can be written as¹

    E_DN[MISEemp] = E_DN[ Σ_{i=1}^N (yi − xi^T β̂)² / N ]
                  = (N − p)/N · E_DN[ Σ_{i=1}^N (yi − xi^T β̂)² / (N − p) ] = (N − p)/N σ²w

◮ This is the expectation of the error made by a linear model trained on DN to predict the value of the output in DN.

¹ derivation in the handbook

MISE error

◮ Let us compute now the expected prediction error of a linear model trained on DN when this is used to predict a set of test outputs distributed according to the same linear dependency but independent of the training set.
◮ It can be shown² that in case of linear dependency

    MISE = E_{DN,y}[ Σ_{i=1}^N (yi − xi^T β̂)² / N ] = (N + p)/N σ²w

  Note that in the MISE formula the y distribution is independent of DN and then of β̂.

² derivation in the handbook
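A Monte Carlo sketch of the two expectations, assuming a simulated fixed design reused for training and test outputs (only the (N − p)/N and (N + p)/N factors come from the slides; all numbers are illustrative):

  ## Monte Carlo check: E[MISEemp] ~ (N-p)/N * sigw^2, MISE ~ (N+p)/N * sigw^2
  set.seed(8)
  N <- 30; p <- 2; sigw <- 0.5
  x <- runif(N, 0, 10); X <- cbind(1, x)               # fixed design for train and test
  one.rep <- function() {
    ytr <- drop(X %*% c(1, 0.5) + rnorm(N, 0, sigw))   # training outputs
    yts <- drop(X %*% c(1, 0.5) + rnorm(N, 0, sigw))   # independent test outputs, same inputs
    betahat <- solve(t(X) %*% X, t(X) %*% ytr)
    yhat <- drop(X %*% betahat)
    c(emp = mean((ytr - yhat)^2), test = mean((yts - yhat)^2))
  }
  R <- replicate(10000, one.rep())
  rowMeans(R)                                          # compare with the theoretical values:
  c((N - p)/N, (N + p)/N) * sigw^2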
MISE error

Then it follows that the empirical error returns a biased estimate of MISE, that is

  E_DN[MISEemp] = (N − p)/N σ²w  ≠  MISE = (N + p)/N σ²w

If we replace MISEemp with

  MISEemp + 2 (p/N) σ²w

we obtain an unbiased estimator of the quantity MISE (see the R file Linear/ee.R).

Nevertheless, this estimator requires an estimate of the noise variance.

The PSE and the FPE

◮ Given the estimate σ̂²w we have the Predicted Square Error (PSE) criterion

    PSE = MISEemp + 2 σ̂²w p/N

◮ Taking as estimate of σ²w

    σ̂²w = SSEemp / (N − p)

  we have the Final Prediction Error (FPE)

    FPE = (1 + p/N)/(1 − p/N) MISEemp

◮ See the R script Linear/fpe.R.
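A small R sketch computing the two criteria for a fitted lm model (an illustration, not the Linear/fpe.R script; data simulated as before):

  ## PSE and FPE criteria for a fitted linear model
  set.seed(9)
  N <- 30; x <- runif(N, 0, 10); y <- 1 + 0.5 * x + rnorm(N, 0, 0.3)
  fit <- lm(y ~ x)
  p <- length(coef(fit))
  SSEemp  <- sum(residuals(fit)^2)
  MISEemp <- SSEemp / N                       # empirical risk (training error)
  sig2hat <- SSEemp / (N - p)                 # noise variance estimate
  PSE <- MISEemp + 2 * sig2hat * p / N
  FPE <- (1 + p/N) / (1 - p/N) * MISEemp      # with this sig2hat, PSE and FPE coincide
  c(MISEemp = MISEemp, PSE = PSE, FPE = FPE)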
