
Trimmed sample means for robust uniform mean estimation and regression

Roberto Imbuzeiro Oliveira

Columbia University
Oct 6th 2023
Estimating expected values

Quantities of interest in Statistics often are expected values or can be defined in terms of them.

Example of the latter: M-estimators.

$$\theta_{\mathrm{best}} = \arg\min_{\theta\in\Theta}\ \mathbb{E}_{Z\sim P}\,\ell(\theta, Z).$$

How well can one estimate expected values?


1d data and sample means

Scalar-valued data
$X_1, X_2, \dots, X_n$: i.i.d. with mean $\mu\in\mathbb{R}$ and variance $\sigma^2\in\mathbb{R}_+$.

$$\bar X_n := \frac{1}{n}\sum_{i=1}^{n} X_i$$

Mean squared error
$$\mathbb{E}\big(\bar X_n - \mu\big)^2 = \frac{\sigma^2}{n}$$
under no further assumptions.
High-confidence finite sample bounds

$$\mathbb{P}\big(|\bar X_n - \mu| \le r(\alpha,n)\big) \ge 1-\alpha$$

Asymptotic (CLT): $r(\alpha,n) \sim \sigma\sqrt{\tfrac{2\log(2/\alpha)}{n}}$.
Worst case (Chebyshev): $r(\alpha,n) \approx \tfrac{\sigma}{\sqrt{\alpha n}}$.

Catoni (AnIHP’12) + Lee/Valiant (STOC’21)

$$\mathbb{P}\big(|\hat\mu_n - \mu| \le r(\alpha,n)\big) \ge 1-\alpha \quad\text{with the best possible estimator.}$$

Asymptotic (CLT): $r(\alpha,n) \sim \sigma\sqrt{\tfrac{2\log(2/\alpha)}{n}}$.
Worst distribution, best estimator: $r(\alpha,n) \sim \sigma\sqrt{\tfrac{2\log(2/\alpha)}{n}}$.
The main point

When it comes to finite-sample bounds for deviations, the sample mean is exponentially worse in $\alpha$ than the best possible estimator!

See also experiments by Catoni.


Adversarial sample contamination

Uncontaminated sample
$X_1, \dots, X_n \sim P$ i.i.d.

Contaminated sample
$X'_1, \dots, X'_n$ such that $\#\{i \le n : X'_i \ne X_i\} \le \varepsilon n$.

The sample mean cannot tolerate contamination for any $\varepsilon > 0$.
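A tiny illustration of the last point (my own example, not from the talk): editing a single point out of $n$, i.e. $\varepsilon = 1/n$, already moves the sample mean arbitrarily far.

```python
# Illustration (not from the talk): one corrupted point out of n = 1000
# drags the sample mean arbitrarily far from the true mean 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)           # clean i.i.d. N(0, 1) sample
x_corrupt = x.copy()
x_corrupt[0] = 1e9                  # adversary edits a single point
print(x.mean())                     # close to 0
print(x_corrupt.mean())             # about 1e6: ruined by one outlier
```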


General problem

Given a statistical task, find the estimator with smallest possible error in terms of the sample size $n$, the desired confidence $1-\alpha$, and the contamination level $\varepsilon$.

“Robustness” against heavy tails and contamination.




Lots of other work with this perspective

Study of 1d mean estimation: Devroye, Lerasle, Lugosi, O. (AnnStat’16).

Higher dimensions/covariances: Minsker (Bernoulli’15, AnnStat’18); Lugosi & Mendelson (AnnStat’19&’21, FoCM’19, PTRF’19); Hopkins (AnnStat’20); Depersin & Lecué (AnnStat’21, PTRF’21); Diakonikolas, Kamath, Kane, Li, Moitra, Stewart (FOCS’16); Cherapanamjeri, Flammarion, Bartlett (COLT’19); Lei, Lu, Venkat, Zhang (COLT’20); Hopkins, Li, Zhang (NeurIPS’20); Abdalla & Zhivotovskiy (arXiv’22); Rico & O. (arXiv’22)...

Regression/M-estimation: Lerasle & O. (arXiv’11); Brownlees, Joly & Lugosi (AnnStat’15); Diakonikolas, Kamath, Kane, Li, Steinhardt, Stewart (ICML’19); Jambulapati, Li, Schramm, Tian (NeurIPS’21); book by Diakonikolas and Kane (Cambridge; forthcoming)...
Trimmed means

Trimmed mean

Trimmed mean for scalar data
$X_1, X_2, \dots, X_n$: i.i.d. random variables
$X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$: order statistics

$$\bar X_{n,k} := \frac{1}{n-2k}\sum_{i=k+1}^{n-k} X_{(i)}$$
How to compute a trimmed mean

Example: $n = 8$ and $k = 2$.

[Figure: the eight points $X_1,\dots,X_8$ are sorted into the order statistics $X_{(1)} \le \dots \le X_{(8)}$; the trimmed mean averages the middle four points $X_{(3)},\dots,X_{(6)}$, discarding the $k = 2$ smallest and $k = 2$ largest.]

Minimax optimal bounds

Theorem (Z. F. Rico’s thesis, 2022 – assuming $\varepsilon < 1/4$)
Set $\nu_p := \big(\mathbb{E}_{X\sim P}|X-\mu|^p\big)^{1/p}$, $k = \varepsilon n + \lceil \varepsilon n \vee 8\ln(2/\alpha)\rceil$, and

$$r(\alpha,n,\varepsilon) := C\inf_{1<p\le 2}\nu_p\left(\frac{\log(2/\alpha)}{n}\right)^{1-\frac1p} + C\inf_{q>1}\nu_q\,\varepsilon^{1-\frac1q}.$$

Then $\mathbb{P}\big(|\bar X_{n,k} - \mu| \le r(\alpha,n,\varepsilon)\big) \ge 1-\alpha$.
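For instance, taking $p = q = 2$ in both infima (finite variance, $\nu_2 = \sigma$) recovers the familiar sub-Gaussian-plus-contamination rate:

$$r(\alpha,n,\varepsilon) \le C\,\sigma\left(\sqrt{\frac{\log(2/\alpha)}{n}} + \sqrt{\varepsilon}\right).$$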
Better than your MoM

Median of Means for 1d data
Break data into $k$ blocks, take averages of blocks, and then take the median of these averages.

Requires $k \gtrsim \varepsilon n \vee \log(1/\alpha)$ for robustness + high probability, and achieves

$$r_{\mathrm{MoM}}(\alpha,n,\varepsilon) := C\inf_{1<p\le 2}\nu_p\left(\varepsilon + \frac{\log(2/\alpha)}{n}\right)^{1-\frac1p}.$$
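For comparison, a minimal median-of-means sketch (again my own illustration; the talk does not prescribe an implementation). Points beyond a whole number of blocks are simply dropped:

```python
import numpy as np

def median_of_means(x, k):
    """Split x into k equal blocks, average each, return the median."""
    x = np.asarray(x, dtype=float)
    m = len(x) // k                       # block size
    blocks = x[:m * k].reshape(k, m)      # k blocks of m points each
    return np.median(blocks.mean(axis=1))

rng = np.random.default_rng(0)
sample = rng.standard_t(df=2, size=1000)
print(median_of_means(sample, k=20))      # k of order eps*n or log(1/alpha)
```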
Can trimmed means lead to improvements more generally?

Joint with Lucas Resende – IMPA
arXiv:2302.06710
What we did

Trimmed means give nearly optimal results for 2 problems: uniform mean estimation + regression with mean squared error.

Experiments and heuristics for linear regression that improve previous results in a number of settings.

Zoraida’s talk: covariance estimation via trimmed means.

Uniform mean estimation (Minsker’18)

Given
I.i.d. (possibly corrupted) sample from $P$ over some set $\mathbb{X}$
Family $\mathcal{F}$ of functions from $\mathbb{X}$ to $\mathbb{R}$.

Goal
Estimate $Pf = \mathbb{E}_{X\sim P}\,f(X)$ for each $f\in\mathcal{F}$ with small worst-case error:

$$\mathrm{Loss}\big((\widehat{Ef})_{f\in\mathcal{F}}\big) := \sup_{f\in\mathcal{F}}\big|Pf - \widehat{Ef}\big|.$$
Applications

M-estimation/regression
We’ll see an example soon.

Vector mean estimation under general norms
(Lugosi & Mendelson PTRF’19, Depersin & Lecué PTRF’21)
If $X\sim P$ takes values in $\mathbb{R}^d$, estimating the mean $\mu := \mathbb{E}_{X\sim P} X$ with error measured by a norm $\|\cdot\|$ is equivalent to estimating $Pf$ over $\mathcal{F} :=$ the dual unit ball of $\|\cdot\|$.
Towards uniform mean estimation

Setup
$(\mathbb{X}, \mathcal{X}, P)$ a probability space
$X_{1:n} := (X_1,\dots,X_n) \sim P$ i.i.d.; contaminated $X'_{1:n}$ with $\#\{i\in[n] : X'_i \ne X_i\} \le \varepsilon n$.
$\mathcal{F} :=$ a family of $P$-integrable functions $f:\mathbb{X}\to\mathbb{R}$.

$$\widehat T'_{n,k} f := \frac{1}{n-2k}\sum_{i=k+1}^{n-k} f\big(X'_{(i),f}\big)$$

Here $f(X'_{(1),f}) \le f(X'_{(2),f}) \le \dots \le f(X'_{(n),f})$: for each $f$, the sample is reordered by its $f$-values before trimming.
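A minimal sketch of $\widehat T'_{n,k}$ over a finite family, here the linear functionals $f_u(x) = \langle u, x\rangle$ from the dual-ball application above (all names are mine). The point the code makes explicit is that the trimming order is recomputed separately for each $f$:

```python
import numpy as np

def trimmed_process(X, directions, k):
    """Trimmed mean of <u, X_i> for each direction u, trimming per u."""
    n = len(X)
    vals = X @ directions.T               # value of each f_u at each point
    vals = np.sort(vals, axis=0)          # order statistics, per direction
    return vals[k:n - k].mean(axis=0)     # one trimmed mean per f_u

rng = np.random.default_rng(0)
X = rng.standard_t(df=2, size=(500, 3))   # heavy-tailed sample in R^3
U = np.eye(3)                             # toy family: coordinate directions
print(trimmed_process(X, U, k=10))        # estimates of P f_u for each u
```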


Relevant parameters of function class

Global “complexity” parameter
$$\mathrm{Emp}(\mathcal{F}, P) := \mathbb{E}_{X_{1:n}\sim P \text{ i.i.d.}}\ \sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^{n}\big(f(X_i) - Pf\big)\Big|$$

Minimax error for worst-case function
$$r_{\mathcal{F}}(\alpha,n,\varepsilon) := C\inf_{1<p\le 2}\nu_p(\mathcal{F})\left(\frac{\log(2/\alpha)}{n}\right)^{1-\frac1p} + C\inf_{q>1}\nu_q(\mathcal{F})\,\varepsilon^{1-\frac1q},$$
where $\nu_p(\mathcal{F}) := \sup\big\{\big(\mathbb{E}_{X\sim P}|f(X) - Pf|^p\big)^{1/p} : f\in\mathcal{F}\big\}$.
Uniform performance of the trimmed mean

Theorem (O. & Resende 2023)
With a choice of $k \approx \varepsilon n + \log\frac{1}{\alpha}$,

$$\mathbb{P}\Big(\sup\big\{\big|\widehat T'_{n,k} f - Pf\big| : f\in\mathcal{F}\big\} \le R(\alpha,n,\varepsilon)\Big) \ge 1-\alpha,$$
where
$$R(\alpha,n,\varepsilon) := C\,\mathrm{Emp}(\mathcal{F}, P) + C\,r_{\mathcal{F}}(\alpha,n,\varepsilon).$$

Both terms in $R$ are needed in general.

Main proof ideas

Counting Lemma (simplified)
$\mathcal{F}$ a family of functions $f:\mathbb{X}\to\mathbb{R}$, $M_0, M_1 > 0$ such that:
$$\sup_{f\in\mathcal{F}} P\big(|f(X) - Pf| > M_0\big) \le \frac{t}{100n}$$
and
$$\mathbb{E}\,\sup_{f\in\mathcal{F}}\big|(P_n - P)f\big| \le \frac{tM_1}{n}.$$
Then with probability $\ge 1 - e^{-t}$,
$$\forall f\in\mathcal{F} : \#\{i\le n : |f(X_i) - Pf| > M_0\vee M_1\} \le t.$$

Generalization of Lugosi and Mendelson (AnnStat’21).

Main proof ideas

Bounding Lemma (simplified)
$\mathcal{F}$ a family of functions $f:\mathbb{X}\to\mathbb{R}$, $M > 0$ such that:

$$\forall f\in\mathcal{F} : \#\{i\le n : |f(X_i) - Pf| > M\} \le k - \varepsilon n.$$

Let $\tau_M(f - Pf) := \max\{-M, \min\{f - Pf, M\}\}$. Then $\forall f\in\mathcal{F}$:

$$\big|\widehat T'_{n,k} f - Pf - P_n\,\tau_M(f - Pf)\big| \le \frac{CMk}{n}.$$
Improved vector mean estimation

Theorem (O. & Resende 2023)
If $\mathbb{X} = \mathbb{R}^d$, there exists an estimator $\hat\mu_{n,k}$ of the mean $\mu$ such that, with probability $\ge 1-\alpha$:

$$\|\hat\mu_{n,k} - \mu\| \le C\,\mathbb{E}_{X_{1:n}\sim P \text{ i.i.d.}}\|\bar X_n - \mu\| + C\inf_{1<p\le 2}\nu_p\left(\frac{\log(2/\alpha)}{n}\right)^{1-\frac1p} + C\inf_{q>1}\nu_q\,\varepsilon^{1-\frac1q},$$

where $\nu_p := \sup\big\{\big(\mathbb{E}_{X\sim P}|\langle X-\mu, f\rangle|^p\big)^{1/p} : f\in\mathbb{R}^d,\ \|f\|_*\le 1\big\}$.
Regression with squared loss

Given
I.i.d. (corrupted) sample of pairs $(X,Y)\in\mathbb{X}\times\mathbb{R}$ with law $P$
Family $\mathcal{F}$ of functions from $\mathbb{X}$ to $\mathbb{R}$.

Goal
Estimate the best fit of $Y$ from $f(X)$,
$$f_{\mathrm{best}} := \arg\min_{f\in\mathcal{F}}\ \mathbb{E}_{(X,Y)\sim P}\big(Y - f(X)\big)^2,$$
by some $\hat f\in\mathcal{F}$, with
$$\mathrm{Loss}(f_{\mathrm{best}}, \hat f) = \mathbb{E}_{(X,Y)\sim P}\big(\hat f(X) - f_{\mathrm{best}}(X)\big)^2.$$
Results on regression

Setup
$(\mathbb{X}\times\mathbb{R},\ \mathcal{X}\otimes\mathcal{B}(\mathbb{R}),\ P)$ a probability space
$Z_{1:n} := (Z_1,\dots,Z_n)\sim P$ i.i.d. with each $Z_i = (X_i, Y_i)\in\mathbb{X}\times\mathbb{R}$.
Contaminated $Z'_{1:n}$ satisfying $\#\{i\in[n] : Z'_i \ne Z_i\}\le\varepsilon n$.
$\mathcal{F} :=$ some functions $f:\mathbb{X}\to\mathbb{R}$; set $\ell_f(x,y) := (y - f(x))^2$.

$$\widehat T'_{n,k}(\ell_f - \ell_g) := \frac{1}{n-2k}\sum_{i=k+1}^{n-k}(\ell_f - \ell_g)\big(X'_{(i),\ell_f-\ell_g},\ Y'_{(i),\ell_f-\ell_g}\big)$$

Here $(\ell_f-\ell_g)\big(X'_{(1),\ell_f-\ell_g}, Y'_{(1),\ell_f-\ell_g}\big) \le \dots \le (\ell_f-\ell_g)\big(X'_{(n),\ell_f-\ell_g}, Y'_{(n),\ell_f-\ell_g}\big)$: the pairs are reordered by the values of $\ell_f - \ell_g$ before trimming.


Regression with squared loss

$$f_{\mathrm{best}} := \arg\min_{f\in\mathcal{F}} P\ell_f = \arg\min_{f\in\mathcal{F}}\big(\sup\{P(\ell_f - \ell_g) : g\in\mathcal{F}\}\big)$$

$$\hat f := \arg\min_{f\in\mathcal{F}}\big(\sup\{\widehat T'_{n,k}(\ell_f - \ell_g) : g\in\mathcal{F}\}\big)$$

Theorem (O. & Resende 2023)
If $\mathcal{F}\subset L^2(P)$ is closed and convex, and a small-ball condition is satisfied, then
$$\mathbb{P}\big(\|\hat f - f_{\mathrm{best}}\|_{L^2(P)} \le r^\star(n,\alpha,\varepsilon)\big) \ge 1-\alpha, \quad\text{where...}$$
Localized bound – informal!

For each $r > 0$,

$$\mathcal{F}_-(r) := \{f - f_{\mathrm{best}} : f\in\mathcal{F},\ \|f - f_{\mathrm{best}}\|_{L^2(P)} = r\},$$

$$\mathcal{F}_\times(r) := \{(Y - f_{\mathrm{best}})\,(f - f_{\mathrm{best}}) : f\in\mathcal{F},\ \|f - f_{\mathrm{best}}\|_{L^2(P)} \le r\}.$$

$r^\star = r^\star(n,\alpha,\varepsilon)$ basically solves an equation of the form

$$\mathrm{Emp}\big(\mathcal{F}_-(r^\star), P\big) + \mathrm{Emp}\big(\mathcal{F}_\times(r^\star), P\big) + \text{noise} \le c\,(r^\star)^2$$

(à la Mendelson JACM 2014, Lecué & Lerasle AnnStat 2020).
Linear regression with random design

Model
Covariates in $\mathbb{R}^d$, linear model with mean-0 noise:
$$Y_i = \langle\beta_{\mathrm{true}}, X_i\rangle + \xi_i$$

Assumptions
2nd moment + small-ball condition on $X$
$p$-th moment bound on $\xi_i$ ($1 < p \le 2$)

$$\|\hat\beta_{n,k} - \beta_{\mathrm{true}}\|_{L^2(P)}^2 \le C_p\left(\frac{d + \log(2/\alpha)}{n}\right)^{2-\frac2p}.$$
Linear regression with random design

Heuristic – alternating minimization/maximization
Performs quite well in experiments; a code sketch follows below.

Set initial $\hat\beta_0, \hat\beta_1 \in\mathbb{R}^d$ arbitrarily.
Repeat until convergence:
Trim $\ell_{\hat\beta_0}(X'_i, Y'_i) - \ell_{\hat\beta_1}(X'_i, Y'_i)$.
Choose one of $\hat\beta_0$ or $\hat\beta_1$ to update.
Perform OLS on the trimmed sample to obtain the new $\hat\beta_0$ or $\hat\beta_1$.
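A minimal sketch of this heuristic, filling in details the slide leaves open (which iterate to refit at each step, OLS via least squares, a simple stopping rule); an illustration under my own assumptions, not the authors' code:

```python
import numpy as np

def sq_loss(beta, X, y):
    """Pointwise squared losses l_beta(X_i, Y_i)."""
    return (y - X @ beta) ** 2

def alternating_trimmed_ols(X, y, k, n_iter=50, tol=1e-8):
    """Alternately trim loss differences and refit one iterate by OLS."""
    n, d = X.shape
    beta = [np.zeros(d), np.linalg.lstsq(X, y, rcond=None)[0]]  # initial pair
    for it in range(n_iter):
        # Trim the k smallest and k largest values of l_beta0 - l_beta1.
        diff = sq_loss(beta[0], X, y) - sq_loss(beta[1], X, y)
        keep = np.argsort(diff)[k:n - k]
        # Alternate which of the two iterates is updated.
        j = it % 2
        new = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        done = np.linalg.norm(new - beta[j]) < tol
        beta[j] = new
        if done:
            break
    return beta[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_t(df=1, size=500)  # heavy-tailed noise
print(alternating_trimmed_ols(X, y, k=25))          # should land near beta_true
```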
Experiments vs. Median-of-means

[Figure: $\|\hat\beta_n - \beta_{\mathrm{true}}\|_2$ on a log scale vs. contamination level, for the trimmed-mean heuristic (TM), median-of-means (MOM), and OLS. Linear regression with normal errors and contamination.]
Experiments vs. Median-of-means

[Figure: $\|\hat\beta_n - \beta_{\mathrm{true}}\|_2$ on a log scale vs. contamination level, for TM, MOM, and OLS. Linear regression with Student(1) errors and contamination.]
Conclusion

Theory says trimmed means give the best-known estimators for the problems we consider. The dependence on $\varepsilon$ is optimal.
Gaussian approximation of the trimmed process: upcoming work by L. Resende.

The heuristic seems to perform very well in practice, but there is no theory to back this up yet.
Work in progress with Philip Thompson, Zoraida Fernández-Rico, Damien Vilcocq...
Thank you!
