Chapter 7: Regression
Heuristics of the linear regression (1)
Consider a cloud of i.i.d. random points $(X_i, Y_i)$, $i = 1, \ldots, n$.
Heuristics of the linear regression (2)
Goal: fit a line $y = a + bx$ through the cloud, for (unknown) $a, b \in \mathbb{R}$.
The estimated coefficients are denoted $\hat a, \hat b$.
Heuristics of the linear regression (3)
Examples:
$D_i \approx a + b\,p_i$, $\quad i = 1, \ldots, n$.
Linear regression of a r.v. Y on a r.v. X (1)
Linear regression of a r.v. Y on a r.v. X (2)
$Y = a + bX + \varepsilon$, where $\varepsilon$ is a noise term.
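As a reminder of where such a line comes from (a standard computation, not reproduced from the slide), the population coefficients can be taken as the minimizers of the mean squared error, assuming $\operatorname{Var}(X) > 0$:
$$(a, b) = \operatorname*{argmin}_{(a,b)\in\mathbb{R}^2} \mathbb{E}\big[(Y - a - bX)^2\big] \quad\Longrightarrow\quad b = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}, \qquad a = \mathbb{E}[Y] - b\,\mathbb{E}[X].$$
Setting the two partial derivatives to zero yields $\mathbb{E}[Y - a - bX] = 0$ and $\mathbb{E}\big[X(Y - a - bX)\big] = 0$, from which both expressions follow.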
Linear regression of a r.v. Y on a r.v. X (3)
A sample of $n$ i.i.d. random pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as $(X, Y)$ is available.
Linear regression of a r.v. Y on a r.v. X (4)
Definition
The least squared error (LSE) estimator of (a, b) is the minimizer
of the sum of squared errors:
$$\sum_{i=1}^{n} (Y_i - a - bX_i)^2.$$
The minimizers are
$$\hat b = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat a = \bar Y - \hat b\,\bar X.$$
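As an illustration only (not part of the original slides), here is a minimal NumPy sketch of these closed-form estimators; the helper name simple_lse and the sample arrays x, y are my own choices:

```python
import numpy as np

def simple_lse(x, y):
    """Closed-form LSE (a_hat, b_hat) for the model y ~ a + b * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a_hat = y.mean() - b_hat * x.mean()
    return a_hat, b_hat

# Quick check on simulated data from the line y = 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)
print(simple_lse(x, y))  # approximately (1, 2)
```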
Linear regression of a r.v. Y on a r.v. X (5)
Multivariate case (1)
$Y_i = X_i^\top \beta + \varepsilon_i$, $\quad i = 1, \ldots, n$.
Dependent variable: $Y_i$; covariates (explanatory variables): $X_i \in \mathbb{R}^p$.
Definition
The least squared error (LSE) estimator of $\beta$ is the minimizer of the sum of squared errors:
$$\hat\beta = \operatorname*{argmin}_{t \in \mathbb{R}^p} \sum_{i=1}^{n} (Y_i - X_i^\top t)^2.$$
Multivariate case (2)
In matrix form, with $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$ and $\varepsilon \in \mathbb{R}^n$:
$$Y = X\beta + \varepsilon.$$
$$\hat\beta = \operatorname*{argmin}_{t \in \mathbb{R}^p} \|Y - Xt\|_2^2.$$
Multivariate case (3)
Assume $\operatorname{rank}(X) = p$. Then
$$\hat\beta = (X^\top X)^{-1} X^\top Y.$$
The fitted values satisfy $X\hat\beta = PY$, where $P = X(X^\top X)^{-1}X^\top$ is the orthogonal projector onto the column span of $X$.
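A minimal sketch of this closed form in NumPy (my illustration, assuming a full-rank design matrix X and response vector Y); solving the normal equations avoids forming the inverse explicitly, and the commented np.linalg.lstsq call is an equivalent, numerically stabler alternative:

```python
import numpy as np

def lse(X, Y):
    """LSE beta_hat = (X^T X)^{-1} X^T Y, assuming rank(X) = p."""
    # Solve the normal equations (X^T X) beta = X^T Y instead of inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # equivalent, more stable
# Fitted values: X @ beta_hat (= P Y, the projection of Y onto the column span of X).
```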
Linear regression with deterministic design and Gaussian noise (1)
Assumptions:
" ⇠ Nn (0, σ 2 In ),
17/43
Linear regression with deterministic design and Gaussian noise (2)
LSE = MLE: $\hat\beta \sim \mathcal{N}_p\!\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$.
Quadratic risk of $\hat\beta$: $\mathbb{E}\big[\|\hat\beta - \beta\|_2^2\big] = \sigma^2 \operatorname{tr}\big((X^\top X)^{-1}\big)$.
Prediction error: $\mathbb{E}\big[\|Y - X\hat\beta\|_2^2\big] = \sigma^2 (n - p)$.
Unbiased estimator of $\sigma^2$: $\hat\sigma^2 = \dfrac{1}{n - p}\,\|Y - X\hat\beta\|_2^2$.
Theorem
$(n - p)\,\dfrac{\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p}$.
$\hat\beta \perp\!\!\!\perp \hat\sigma^2$.
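A short sketch (my illustration, reusing the lse helper above) of the unbiased variance estimator and the resulting estimated covariance of $\hat\beta$:

```python
import numpy as np

def sigma2_hat(X, Y, beta_hat):
    """Unbiased estimator of sigma^2: ||Y - X beta_hat||_2^2 / (n - p)."""
    n, p = X.shape
    residuals = Y - X @ beta_hat
    return residuals @ residuals / (n - p)

def beta_hat_covariance(X, Y, beta_hat):
    """Plug-in estimate of Cov(beta_hat): sigma2_hat * (X^T X)^{-1}."""
    return sigma2_hat(X, Y, beta_hat) * np.linalg.inv(X.T @ X)
```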
Significance tests (1)
Test whether the $j$-th explanatory variable is significant in the linear regression ($1 \le j \le p$).
$H_0: \beta_j = 0$ vs. $H_1: \beta_j \neq 0$.
Let $\gamma_j := \big((X^\top X)^{-1}\big)_{jj}$. Then
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2\,\gamma_j}} \sim t_{n-p}.$$
Let $T_n^{(j)} = \dfrac{\hat\beta_j}{\sqrt{\hat\sigma^2\,\gamma_j}}$, and let $\delta_\alpha^{(j)}$ be the test that rejects $H_0$ when $|T_n^{(j)}| > q_{\alpha/2}(t_{n-p})$.
$H_0: \beta_j = 0,\ \forall j \in S$ vs. $H_1: \exists j \in S,\ \beta_j \neq 0$, where $S \subseteq \{1, \ldots, p\}$.
Bonferroni's test: $\delta_\alpha^B = \max_{j \in S} \delta_{\alpha/k}^{(j)}$, where $k = |S|$.
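A sketch of the coordinate-wise tests (again my illustration, assuming X, Y, beta_hat and the variance estimate s2 are computed as above); the two-sided p-values use SciPy's Student $t_{n-p}$ distribution:

```python
import numpy as np
from scipy import stats

def t_tests(X, Y, beta_hat, s2):
    """t-statistics T_n^(j) and two-sided p-values for H0: beta_j = 0, j = 1..p."""
    n, p = X.shape
    gamma = np.diag(np.linalg.inv(X.T @ X))        # gamma_j = [(X^T X)^{-1}]_{jj}
    t_stats = beta_hat / np.sqrt(s2 * gamma)
    p_values = 2.0 * stats.t.sf(np.abs(t_stats), df=n - p)
    return t_stats, p_values

# Bonferroni over a subset S of size k: reject H0 as soon as some p-value
# for j in S falls below alpha / k.
```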
More tests (1)
$H_0: G\beta = \lambda$ vs. $H_1: G\beta \neq \lambda$, where $G \in \mathbb{R}^{k \times p}$ has rank $k$ and $\lambda \in \mathbb{R}^k$.
If H0 is true, then:
$$G\hat\beta - \lambda \sim \mathcal{N}_k\!\big(0,\ \sigma^2\, G(X^\top X)^{-1}G^\top\big),$$
and
$$\sigma^{-2}\,(G\hat\beta - \lambda)^\top \big(G(X^\top X)^{-1}G^\top\big)^{-1} (G\hat\beta - \lambda) \sim \chi^2_k.$$
More tests (2)
Let
$$S_n = \frac{1}{\hat\sigma^2}\,\frac{(G\hat\beta - \lambda)^\top \big(G(X^\top X)^{-1}G^\top\big)^{-1} (G\hat\beta - \lambda)}{k}.$$
Definition
The Fisher distribution with $p$ and $q$ degrees of freedom, denoted by $F_{p,q}$, is the distribution of $\dfrac{U/p}{V/q}$, where:
$U \sim \chi^2_p$, $V \sim \chi^2_q$,
$U \perp\!\!\!\perp V$.
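Under $H_0$ the statistic $S_n$ defined above follows the Fisher distribution $F_{k,\,n-p}$, which suggests the following test sketch (my illustration, under the same assumed variables; quantiles and tail probabilities come from scipy.stats.f):

```python
import numpy as np
from scipy import stats

def fisher_test(X, Y, beta_hat, s2, G, lam, alpha=0.05):
    """Test H0: G beta = lam using S_n, which is F_{k, n-p}-distributed under H0."""
    n, p = X.shape
    k = G.shape[0]
    diff = G @ beta_hat - lam
    M = G @ np.linalg.inv(X.T @ X) @ G.T
    S_n = (diff @ np.linalg.solve(M, diff)) / (k * s2)
    p_value = stats.f.sf(S_n, dfn=k, dfd=n - p)
    reject = S_n > stats.f.ppf(1.0 - alpha, dfn=k, dfd=n - p)
    return S_n, p_value, reject
```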
Concluding remarks
Linear regression and lack of identifiability (1)
Consider the following model:
$Y = X\beta + \varepsilon$,
with:
1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
2. $\beta \in \mathbb{R}^p$, unknown;
3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.
Let $k = \operatorname{rank}(X)$ (possibly $k < p$, in which case $\beta$ is not identifiable) and let $\hat Y = X\hat\beta$ be the orthogonal projection of $Y$ onto the column span of $X$. Then
$$\frac{\|\hat Y - Y\|_2^2}{\sigma^2} \sim \chi^2_{n-k}.$$
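When $\operatorname{rank}(X) = k < p$, $(X^\top X)^{-1}$ no longer exists and $\beta$ is not identifiable, but the fitted values $\hat Y$ are still well defined. A sketch (my illustration, not the slide's) using the Moore–Penrose pseudo-inverse, which returns the minimum-norm least-squares solution:

```python
import numpy as np

def fitted_values(X, Y):
    """Projection of Y onto the column span of X, valid even when rank(X) < p."""
    beta_hat = np.linalg.pinv(X) @ Y   # minimum-norm least-squares solution
    return X @ beta_hat                # = P Y, the same for every least-squares solution
```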
Linear regression and lack of identifiability (3)
In particular:
Linear regression in high dimension (1)
Consider again the following model:
$Y = X\beta + \varepsilon$,
with:
1. $Y \in \mathbb{R}^n$ (dependent variables), $X \in \mathbb{R}^{n \times p}$ (deterministic design);
2. $\beta \in \mathbb{R}^p$, unknown: to be estimated;
3. $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$.
For each $i$, $X_i \in \mathbb{R}^p$ is the vector of covariates of the $i$-th individual.
$$\hat\beta^S \in \operatorname*{argmin}_{t \in \mathbb{R}^S} \|Y - X_S t\|_2^2,$$
where $S \subseteq \{1, \ldots, p\}$ and $X_S$ is the submatrix of $X$ whose columns are indexed by $S$.
Linear regression in high dimension (3)
Linear regression in high dimension (4)
Each of these criteria is equivalent to finding $b \in \mathbb{R}^p$ that minimizes
$$\|Y - Xb\|_2^2 + \lambda \|b\|_0,$$
where $\|b\|_0$ is the number of nonzero coefficients of $b$.
Lasso estimator: replace
$$\|b\|_0 = \sum_{j=1}^{p} \mathbb{1}\{b_j \neq 0\} \quad\text{with}\quad \|b\|_1 = \sum_{j=1}^{p} |b_j|.$$
How to choose $\lambda$?
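A hedged sketch using scikit-learn, a library the slides do not mention: Lasso solves the $\ell_1$-penalized problem for a fixed penalty level, and LassoCV selects it by cross-validation, which is one common answer to the question above. Note that scikit-learn's alpha corresponds to $\lambda$ only up to the $1/(2n)$ scaling it applies to the squared-error term.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# scikit-learn minimizes ||Y - X b||_2^2 / (2 n) + alpha * ||b||_1,
# so `alpha` plays the role of lambda up to the 1/(2n) rescaling.
rng = np.random.default_rng(0)
n, p = 50, 200                          # high-dimensional setting: p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                          # sparse ground truth
Y = X @ beta + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, Y)      # fixed penalty level
lasso_cv = LassoCV(cv=5).fit(X, Y)      # penalty level chosen by cross-validation
print(np.count_nonzero(lasso_cv.coef_), lasso_cv.alpha_)
```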
Linear regression in high dimension (6)
Nonparametric regression (1)
Nonparametric regression (2)
LSE, MLE, and all the previous theory for the linear case can be adapted.
Nonparametric regression (3)
If $X_i$ is close to a given point $x$, then
$$Y_i \approx f(x) + \varepsilon_i.$$
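One natural estimator in this spirit averages the $Y_i$ whose $X_i$ fall within distance $h$ of $x$; this matches the $\hat f_{n,h}$ and the bandwidth $h$ appearing in the figures below. A minimal sketch, where the boxcar window and the name f_hat are my own choices:

```python
import numpy as np

def f_hat(x, X, Y, h):
    """Local average of the Y_i with |X_i - x| <= h; NaN if the window is empty."""
    mask = np.abs(X - x) <= h
    return Y[mask].mean() if mask.any() else float("nan")
```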
Nonparametric regression (4)
Nonparametric regression (5)
[Figure: scatter plot of the sample; vertical axis $Y$.]
Nonparametric regression (6)
[Figure: the same scatter plot, with $x = 0.6$, $h = 0.1$ and the resulting estimate $\hat f(x) = 0.27$ highlighted.]
Nonparametric regression (7)
How to choose h ?
Nonparametric regression (8)
Example: $n = 100$, $f(x) = x(1 - x)$, $h = 0.005$.
[Figure: data and the estimator $\hat f_{n,h}$ plotted against $X$.]
Nonparametric regression (9)
Example: $n = 100$, $f(x) = x(1 - x)$, $h = 1$.
[Figure: data and the estimator $\hat f_{n,h}$ plotted against $X$.]
Nonparametric regression (10)
Example: $n = 100$, $f(x) = x(1 - x)$, $h = 0.2$.
[Figure: data and the estimator $\hat f_{n,h}$ plotted against $X$.]
Nonparametric regression (11)
Choice of h ?
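The surviving slides do not say how; one standard approach (an assumption on my part) is to select $h$ by leave-one-out cross-validation over a grid of bandwidths, reusing the local-averaging rule of f_hat above:

```python
import numpy as np

def loo_score(X, Y, h):
    """Leave-one-out squared prediction error of the local-averaging estimator."""
    errors = []
    for i in range(len(X)):
        mask = np.abs(X - X[i]) <= h
        mask[i] = False                  # leave the i-th observation out
        if mask.any():
            errors.append((Y[i] - Y[mask].mean()) ** 2)
    return np.mean(errors) if errors else np.inf

# best_h = min(np.linspace(0.01, 1.0, 50), key=lambda h: loo_score(X, Y, h))
```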
MIT OpenCourseWare
https://ocw.mit.edu
18.650 Statistics for Applications
Fall 2016
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.