01 Intro Densities
Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
April 18, 2024
Pattern Recognition Recap and Unsupervised Learning
[Diagram: an input x is mapped by the model f(x) to an output y, i.e., x → f(x) → y]
Further Aspects of Interest: Parameters and Hyperparameters
• Fewer parameters make the model more robust; more parameters make the model more flexible
• To continue the example, consider linear regression on a basis expansion of a scalar unknown x, e.g., expanding x into the vector (1, x, x², . . . , x^d) and fitting a degree-d polynomial: larger d enables more complex polynomials (see the sketch below)
• The degree d is a hyperparameter, i.e., a parameter that governs the choice (and number) of the actual model parameters
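A minimal sketch of this example in Python/NumPy, assuming noisy 1-D toy data; the function names and the data are illustrative, not part of the lecture:

import numpy as np

def fit_poly(x, y, d):
    """Least-squares fit of a degree-d polynomial using the basis (1, x, ..., x^d)."""
    X = np.vander(x, N=d + 1, increasing=True)   # design matrix: one column per basis function
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # solve min_w ||X w - y||^2
    return w                                     # d + 1 model parameters

def predict_poly(w, x):
    return np.vander(x, N=len(w), increasing=True) @ w

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)                  # toy data: noisy samples of a smooth function
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
w = fit_poly(x, y, d=5)                          # the degree d is the hyperparameter
y_hat = predict_poly(w, x)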
Further Aspects of Interest: Local Operators and High-Dimensional Spaces
• Thinking about model flexibility: more “local” models are more flexible, but require more parameters and are less robust
• How can we find a good trade-off? This is the model selection problem
• Another issue: all local models perform poorly in high-dimensional spaces (see the numeric illustration below)
• A perhaps surprising consequence is that high-dimensional methods must be non-local along some direction
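A short numeric illustration of this point (the standard hypercube argument, cf. Hastie/Tibshirani/Friedman; the numbers below are only a toy computation): to capture a fraction f of uniformly distributed data in the D-dimensional unit cube, a local cubical neighborhood must have edge length f^(1/D), which approaches 1 as D grows, i.e., the neighborhood stops being local.

f = 0.01                          # we only want to average over 1% of the data
for D in (1, 2, 10, 100):
    print(D, f ** (1.0 / D))      # required edge length: 0.01, 0.1, ~0.63, ~0.95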
A Study of Distributions
Recap on Probability Vocabulary
• Please browse the book by Bishop, Sec. 1.2.3, to refresh your mind if
necessary!
Sampling from a PDF
• The key idea is to use the cumulative distribution function (CDF) P(z) of p(X),

  P(z) = \int_{-\infty}^{z} p(X) \, dX   (2)
[Figure: the density p(X) over X and its CDF P(z), which increases monotonically from 0 to 1]
Sampling Algorithm
  z^* = \operatorname{argmin}_{z} \, \{ P(z) \geq u \}   (3)

• Here, u is drawn uniformly from [0, 1], and z^* is the smallest z whose CDF value reaches u (a code sketch follows after this list)
• This method is not used in high-dim. spaces. Can you find the reason?
• We will later look at more advanced sampling strategies
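A minimal 1-D sketch of this procedure (inverse transform sampling via a tabulated CDF); the grid, the toy density, and the function name are assumptions for illustration:

import numpy as np

def sample_from_pdf(grid, pdf, n, rng=None):
    """Draw n samples by inverting the discretized CDF P(z), cf. Eqs. (2) and (3)."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]                     # normalize so that P(z_max) = 1
    u = rng.uniform(size=n)            # u drawn uniformly from [0, 1)
    idx = np.searchsorted(cdf, u)      # smallest index with P(z) >= u
    return grid[idx]

grid = np.linspace(-5, 5, 1000)
pdf = np.exp(-0.5 * grid**2)           # unnormalized standard normal as a toy density
samples = sample_from_pdf(grid, pdf, n=10_000)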
Part 02: Non-Parametric Density Estimation
Introduction
² See Bishop Sec. 2.5.1
  p(x) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2\pi h^2} \right)^{D/2} \exp\left( -\frac{\| x - x_i \|_2^2}{2 h^2} \right)   (9)
  k(u) \geq 0   (10)

  \int k(u) \, du = 1   (11)
  p(x) = \frac{K}{N \cdot V} = \frac{\#\,\text{points in } R}{\text{total}\ \#\,\text{of points} \cdot \text{volume of } R}   (12)
• Note that both the Parzen window estimator and the k-NN estimator are “non-parametric”, but they are not free of parameters
• The kernel scaling h and the number of neighbors k are hyperparameters, i.e., some form of prior knowledge that guides the model creation
• The model parameters are the samples themselves. Both estimators need to store all samples, which is why they are also called memory-based methods (see the sketch below)
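A minimal sketch of both estimators for 1-D data (D = 1), following Eqs. (9) and (12); the toy data and function names are assumptions for illustration:

import numpy as np

def parzen_density(x, samples, h):
    """Gaussian Parzen window estimate of p(x), cf. Eq. (9), for scalar data."""
    diff = x[:, None] - samples[None, :]                      # pairwise differences
    kern = np.exp(-diff**2 / (2 * h**2)) / np.sqrt(2 * np.pi * h**2)
    return kern.mean(axis=1)                                  # average over the N stored samples

def knn_density(x, samples, k):
    """k-NN estimate p(x) = K / (N * V), cf. Eq. (12); V is the length of the
    smallest interval around x that contains k samples."""
    dist = np.sort(np.abs(x[:, None] - samples[None, :]), axis=1)
    volume = 2 * dist[:, k - 1]                               # interval length in 1-D
    return k / (len(samples) * volume)

rng = np.random.default_rng(0)
samples = rng.normal(size=200)                                # the stored samples are the "model parameters"
x = np.linspace(-3, 3, 100)
p_parzen = parzen_density(x, samples, h=0.3)                  # h is a hyperparameter
p_knn = knn_density(x, samples, k=10)                         # k is a hyperparameter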
³ See Bishop Sec. 5.2.2
• When training data is limited, it can be better exploited with cross validation
• In this case, the data is subdivided into k folds (partitions). Do k training/evaluation runs (using each fold once for validation and the rest for training), and select the h* with the maximum likelihood across all folds (a code sketch follows below)
• The choice of k is a hyper-hyperparameter that affects the quality of the predicted error (check Hastie/Tibshirani/Friedman Chap. 7 if curious)
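A sketch of this fold-wise bandwidth selection, assuming a 1-D Gaussian Parzen estimator; the candidate grid for h and the toy data are illustrative assumptions:

import numpy as np

def parzen_loglik(val, train, h):
    """Held-out log-likelihood of a 1-D Gaussian Parzen estimate built from `train`."""
    diff = val[:, None] - train[None, :]
    p = (np.exp(-diff**2 / (2 * h**2)) / np.sqrt(2 * np.pi * h**2)).mean(axis=1)
    return np.log(p + 1e-300).sum()                           # guard against log(0)

def cv_bandwidth(samples, h_candidates, k=5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(samples), k)
    scores = []
    for h in h_candidates:
        score = 0.0
        for i in range(k):                                    # each fold is used once for validation
            val = folds[i]
            train = np.concatenate(folds[:i] + folds[i + 1:])
            score += parzen_loglik(val, train, h)
        scores.append(score / k)
    return h_candidates[int(np.argmax(scores))]               # h* with maximum likelihood across folds

samples = np.random.default_rng(0).normal(size=200)
h_star = cv_bandwidth(samples, h_candidates=np.linspace(0.05, 1.0, 20))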
Cross Validation (CV) for Unsupervised Methods?
Introduction
¹ This may sound as if there were a unique minimum, maybe even of a convex function; in practice, there is no single best solution, so read this as a somewhat hypothetical statement
² See the PR lecture or Hastie/Tibshirani/Friedman Sec. 7-7.3 if more details are desired
• Bias is the square of the average deviation of an estimator from the ground truth
• Variance is the variance of the estimates, i.e., the expected squared deviation from the estimator's mean³
• Informal interpretation:
  • High bias indicates model undercomplexity: we obtain a poor fit to the data
  • High variance indicates model overcomplexity: the fit models not just the structure of the data, but also its noise
• Higher model complexity (= more model parameters) tends toward lower bias and higher variance
• We will usually not be able to get bias and variance simultaneously to 0 (see the decomposition below)
• Regularization increases bias and lowers variance
³ See Hastie/Tibshirani/Friedman Sec. 7.3, Eqn. (7.9) for a detailed derivation
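For reference, a standard form of that decomposition under squared-error loss at a query point x_0 (with y = f(x_0) + \varepsilon and noise variance \sigma_\varepsilon^2), as given in Hastie/Tibshirani/Friedman, Eqn. (7.9):

  \operatorname{Err}(x_0)
    = \mathbb{E}\left[ (y - \hat{f}(x_0))^2 \right]
    = \sigma_\varepsilon^2
    + \underbrace{\left( \mathbb{E}[\hat{f}(x_0)] - f(x_0) \right)^2}_{\text{Bias}^2}
    + \underbrace{\mathbb{E}\left[ \left( \hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)] \right)^2 \right]}_{\text{Variance}}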