
Lecture Pattern Analysis

Part 01: Introduction and First Sampling

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
April 18, 2024
Pattern Recognition Recap and Unsupervised Learning

• Remember the steps of the classical pattern recognition pipeline:

  (Data) → Sampling → Preprocessing → Feature Extraction → Classification → (Class)
  where feature extraction produces the features x, and the classifier f(x) outputs the class y

• Fundamental ML assumption: good feature representations map similar objects to similar features
• Classifier training is almost always supervised,
  i.e., a training sample is a tuple (xi, yi) (cf. lecture “Pattern Recognition”)
• Unsupervised ML works without labels, i.e., it only operates on inputs (xi )
• Unsup. ML can be seen as representation or summary of a distribution
• So, “classification versus representation” could be a jingle to further distinguish
PR from PA (cf. our discussion in the joint meeting)

Further Aspects of Interest: Parameters and Hyperparameters

• Every machine learning model has parameters


• For example, linear regression predicts y from a d-dimensional input x̃ = (1, x1, . . . , xd−1)⊤ using d parameters βi,

  y = β⊤x̃ = Σ_{i=0}^{d−1} βi · x̃i    (1)

• Fewer parameters make the model more robust; more parameters make the
  model more flexible
• To continue the example, consider linear regression on a basis expansion of
  a scalar input x, e.g., fitting a degree-d polynomial to the vector
  (1, x, x², . . . , x^d): larger d enables more complex polynomials
• The degree d is a hyperparameter, i.e., a parameter that somehow
  parameterizes the choice of parameters (see the sketch below)
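To make this concrete, here is a minimal numpy sketch (the data and the chosen degree are made up for illustration) of linear regression on a polynomial basis expansion, where the degree d acts as the hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.1 * rng.normal(size=x.size)   # noisy 1-D toy data

d = 3                                         # hyperparameter: polynomial degree
X = np.vander(x, d + 1, increasing=True)      # basis expansion (1, x, x^2, ..., x^d)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimate of the parameters beta_i
y_hat = X @ beta                              # predictions of the degree-d polynomial
print(np.round(beta, 3))
```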
Further Aspects of Interest: Local Operators and High-Dimensional Spaces

• Thinking about model flexibility: more “local” models are more flexible, but
require more parameters and are less robust
• How can we find a good trade-off? This is the model selection problem

• Another issue: all local models perform poorly in high-dimensional spaces
• A perhaps surprising consequence is that high-dimensional methods must
  be non-local along some direction

• Summarization methods (e.g., clustering) also perform poorly in high-dimensional spaces

• All these points motivate us to also look into dimensionality reduction

A Study of Distributions

• In PA, we look at data in feature spaces


• To understand and manipulate these data points, they are mathematically
  commonly represented as probability density functions (PDFs)
• Additionally, inference allows us to draw conclusions from distributions

• Common operations on distributions:


• Fitting a distribution model to the data (parametric or non-parametric)
represents the data as a distribution
• Sampling from a distribution creates new data points that follow the
distribution (i.e., they are plausible)
• Factorizing a distribution is a key technique for reducing the complexity

Recap on Probability Vocabulary

• Let X , Y denote two random variables


• Important vocabulary and equations are:
  Joint distribution                         p(X, Y)
  Conditional distribution of X given Y      p(X|Y)
  Sum rule / marginalization over Y          p(X) = Σ_Y p(X, Y)
  Product rule                               p(X, Y) = p(Y|X) · p(X)
  Bayes' rule                                p(Y|X) = p(X|Y) · p(Y) / p(X)
  Bayes' rule in the language of ML          posterior = likelihood · prior / evidence

• Please browse the book by Bishop, Sec. 1.2.3, to refresh your mind if
necessary!
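To make the rules above concrete, here is a minimal numpy sketch with a made-up 2×2 joint distribution that checks the sum rule, the product rule, and Bayes' rule numerically:

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) of two binary variables
# (rows index X, columns index Y); the numbers are made up for illustration
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)                 # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)                 # sum rule: p(Y) = sum_X p(X, Y)
p_y_given_x = p_xy / p_x[:, None]      # product rule rearranged: p(Y|X) = p(X, Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]      # p(X|Y) = p(X, Y) / p(Y)

# Bayes' rule: p(Y|X) = p(X|Y) * p(Y) / p(X)
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
assert np.allclose(bayes, p_y_given_x)
```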

Sampling from a PDF

• Oftentimes, it is necessary to draw samples from a PDF


• Example:
• Logistic Regression fits a single regression curve to the data (cf. PR)
• Bayesian Logistic Regression fits a distribution of curves

The distribution is narrow near observed data points and wider elsewhere


• Sample curves from the distribution to obtain its spread (“uncertainty”)

• Special PDFs like Gaussians have closed-form solutions for sampling


• We look now at a sampling method that works on arbitrary PDFs
Idea of the Sampling Algorithm

• The key idea is to use the cumulative distribution function (CDF) P(z) of p(X),

  P(z) = ∫_{−∞}^{z} p(X) dX    (2)

• A value u drawn uniformly from the CDF's vertical axis (i.e., from [0, 1]) intersects P(z) at a location z
• This position z is our random draw from p(X):

[Figure: a PDF p(X) and its CDF P(z); a uniform draw on the vertical axis of the CDF maps to a sample z on the horizontal axis]

Sampling Algorithm

• Discretize the domain of the PDF p(X )


• Linearize p(X ) if it is multi-dimensional
• Calculate the cumulative distribution function P(z) of p(X);
  its range must span 0 to 1 (i.e., p(X) must be normalized)
• Draw a uniformly distributed number u between 0 and 1
• The sample from the PDF is

  z∗ = min { z : P(z) ≥ u },    (3)

  i.e., the smallest z whose cumulative probability reaches u (see the sketch below)

• This method is not used in high-dim. spaces. Can you find the reason?
• We will later look at more advanced sampling strategies
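As an illustration, here is a minimal numpy sketch of this discretized inverse-CDF sampling (function and variable names are my own, and a standard Gaussian serves as the example PDF):

```python
import numpy as np

def sample_from_pdf(pdf_values, grid, n_samples=1000, rng=None):
    """Draw samples from a PDF given by its values on a discretized 1-D grid."""
    rng = np.random.default_rng() if rng is None else rng
    probs = pdf_values / pdf_values.sum()   # normalize to per-bin probability masses
    cdf = np.cumsum(probs)                  # CDF over the grid, ends at 1
    u = rng.random(n_samples)               # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u)           # smallest index with cdf[idx] >= u, cf. Eq. (3)
    return grid[idx]

grid = np.linspace(-5, 5, 2001)
pdf = np.exp(-0.5 * grid**2)                # unnormalized standard Gaussian
samples = sample_from_pdf(pdf, grid, n_samples=10_000)
print(samples.mean(), samples.std())        # approximately 0 and 1
```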

Lecture Pattern Analysis

Part 02: Non-Parametric Density Estimation

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
April 18, 2024
Introduction

• Density Estimation = create a PDF from a set of samples


• The lecture Pattern Recognition introduces parametric density estimation:
• There, a parametric model (e.g., a Gaussian) is fitted to the data
• Maximum Likelihood (ML) estimator:

  θ∗ = argmax_θ p(x1, . . . , xN | θ)    (1)

• Maximum a Posteriori (MAP) estimator:

  θ∗ = argmax_θ p(θ | x1, . . . , xN), where by Bayes' rule
  p(θ | x1, . . . , xN) = p(x1, . . . , xN | θ) · p(θ) / p(x1, . . . , xN)    (2)

• Browse the PR slides if you would like to know more

• Parametric density estimators require a good function representation


• Non-parametric density estimators can operate on arbitrary distributions
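As a small illustration of the ML estimator (1), here is a numpy sketch that fits a 1-D Gaussian to synthetic data; for this model the ML solution is the sample mean and the (biased) sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data from a known Gaussian

mu_ml = x.mean()                     # ML estimate of the mean
var_ml = ((x - mu_ml) ** 2).mean()   # ML estimate of the variance (biased, divides by N)
print(mu_ml, np.sqrt(var_ml))        # approximately 2.0 and 1.5
```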
Non-Parametric Density Estimation: Histograms

• Non-parametric estimators do not use functions with a limited set of parameters
• A simple non-parametric baseline is to create a histogram of samples1
• The number of bins is important to obtain a good fit

• Pro: Good for a quick visualization


• Pro: “Cheap” for many samples in low-dimensional space
• Con: Discontinuities at bin boundaries
• Con: Scales poorly to high dimensions (cf. curse of dimensionality later)
1 See introduction of Bishop Sec. 2.5
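A minimal numpy sketch of histogram density estimation (data and bin count are made up; density=True normalizes the bin heights so that the histogram integrates to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)

counts, edges = np.histogram(x, bins=20, density=True)   # bins is the crucial hyperparameter

# Evaluate the estimated density at a query point x0
x0 = 0.3
bin_idx = np.clip(np.digitize(x0, edges) - 1, 0, len(counts) - 1)
print(counts[bin_idx])   # roughly the true N(0,1) density of about 0.38
```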



Improving on the Histogram Approach

• A kernel-based method and a nearest-neighbor method are slightly better


• Both variants share their mathematical framework:
• Let p(x) be a PDF in D-dim. space, and R a small region around x
  → The probability mass in R is p = ∫_R p(x) dx
• Assumption 1: R contains many points → p is a relative frequency,

  p = K / N = (# points in R) / (total # of points)    (3)

• Assumption 2: R is small enough s.t. p(x) is approximately constant,

  p = ∫_R p(x′) dx′ ≈ p(x) · ∫_R dx′ = p(x) · V    (4)

• Both assumptions together are slightly contradictory, but they yield

  p(x) = K / (N · V) = (# points in R) / (total # of points · Volume of R)    (5)
Kernel-based DE: Parzen Window Estimator (1/2)

• The Parzen window estimator fixes V and leaves K /N variable2


• D-dimensional Parzen window kernel function (a.k.a. “box kernel”):

  k(u) = 1 if |ui| ≤ 1/2 for all i = 1, . . . , D, and 0 otherwise    (6)

• Calculate K with this kernel function:


  K(x) = Σ_{i=1}^{N} k( (x − xi) / h )    (7)

where h is a scaling factor that adjusts the box size


• Hence, the whole density is
  p(x) = (1/N) · Σ_{i=1}^{N} (1/h^D) · k( (x − xi) / h )    (8)

2 See Bishop Sec. 2.5.1



Kernel-based DE: Parzen Window Estimator (2/2)

• The kernel removes much of the discretization error of the fixed-distance histogram bins, but it still leads to blocky estimates
• Replacing the box kernel by a Gaussian kernel further smooths the result,

  p(x) = (1/N) · Σ_{i=1}^{N} ( 1 / (2π h²) )^{D/2} · exp( − ∥x − xi∥₂² / (2h²) ),    (9)

where h is the standard deviation of the Gaussian


• Mathematically, any other kernel is also possible if these conditions hold (see the sketch below):

  k(u) ≥ 0    (10)

  ∫ k(u) du = 1    (11)
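As an illustration of Eq. (9), here is a minimal numpy sketch of a Gaussian Parzen window estimator (function and variable names are my own):

```python
import numpy as np

def parzen_gaussian(x_query, samples, h):
    """Gaussian kernel density estimate p(x) at the query points, cf. Eq. (9).

    x_query: (M, D) points at which to evaluate the density
    samples: (N, D) training samples x_i
    h:       bandwidth (standard deviation of the Gaussian kernel)
    """
    M, D = x_query.shape
    diffs = x_query[:, None, :] - samples[None, :, :]    # (M, N, D) differences x - x_i
    sq_dists = (diffs ** 2).sum(axis=-1)                 # squared Euclidean distances
    norm = (2 * np.pi * h ** 2) ** (-D / 2)              # Gaussian normalization constant
    return norm * np.exp(-sq_dists / (2 * h ** 2)).mean(axis=1)   # average over all N kernels

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 1))                      # 1-D standard normal samples
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(parzen_gaussian(grid, samples, h=0.3))             # roughly matches the N(0, 1) density
```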



K-Nearest Neighbors (k-NN) Density Estimation

• Recall our derived equation for estimating the density

  p(x) = K / (N · V) = (# points in R) / (total # of points · Volume of R)    (12)

• The Parzen window estimator fixes V , and K varies


• The k-Nearest Neighbors estimator fixes K , and V varies
• k-NN calculates V from the distance of the K nearest neighbors3

• Note that both the Parzen window estimator and the k-NN estimator are
“non-parametric”, but they are not free of parameters
• The kernel scaling h and the number of neighbors k are hyper-parameters,
i.e., some form of prior knowledge to guide the model creation
• The model parameters are the samples themselves. Both estimators need to
store all samples, which is why they are also called memory methods
3 See Bishop Sec. 5.2.2
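A minimal numpy sketch of the k-NN density estimate in Eq. (12) for 1-D data (names are my own; a D-dimensional version would use the volume of a D-ball instead of an interval length):

```python
import numpy as np

def knn_density_1d(x_query, samples, k):
    """k-NN density estimate p(x) = K / (N * V) for 1-D samples."""
    samples = np.asarray(samples)
    n = len(samples)
    p = np.empty(len(x_query))
    for j, x in enumerate(x_query):
        dists = np.sort(np.abs(samples - x))
        r_k = dists[k - 1]        # distance to the k-th nearest neighbor
        volume = 2 * r_k          # "volume" of the interval [x - r_k, x + r_k]
        p[j] = k / (n * volume)
    return p

rng = np.random.default_rng(0)
samples = rng.normal(size=500)
print(knn_density_1d(np.array([0.0, 1.0, 2.0]), samples, k=20))   # roughly the N(0, 1) density
```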



First Glance at the Model Selection Problem
• Optimizing the hyperparameters is also called the model selection problem
• Hyperparameters must be optimized on a held-out part of the training data,
the validation set:
train on training data with different hyperparameter sets hi , evaluate on
validation data to get the best performing set h∗ via maximum likelihood (ML)
• What if hyperparameters are optimized directly on the training data?
Then the most complex (largest, most flexible) model wins, because it
achieves the lowest training error

• When training data is limited, it can be better exploited with cross validation
• In this case, the data is subdivided into k folds (partitions). Do k training/eval.
  runs (using each fold once for validation and the rest for training), and select
  the h∗ with the best ML score across all folds
• The choice of k is a hyper-hyperparameter that affects the quality of the
predicted error (check Hastie/Tibshirani/Friedman Chap. 7 if curious)
Cross Validation (CV) for Unsupervised Methods?

• CV requires an objective function for ML, hence it is almost exclusively used on supervised tasks, where labels make performance measurement trivial
• Density estimation is unsupervised, hence we need an additional trick to
measure its performance
• The trick is to optimize the DE hyperparameters by using the prediction of
held-out samples as objective function:
• Split the data into J folds:

  S_train^j = S \ { x_{⌊N/J⌋·j}, . . . , x_{⌊N/J⌋·(j+1)−1} },   S_test^j = S \ S_train^j

• Let α be the unknown hyperparameters, and
  let p_j(x|α) be the density estimate for samples S_train^j with hyperparameters α
• Then, the ML estimate is

  α∗ = argmax_α ∏_{j=0}^{J−1} ∏_{x ∈ S_test^j} p_j(x|α)    (13)

• In practice, take the logarithm ("log likelihood") to mitigate numerical issues
  → the product becomes a sum (see the sketch below)
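A minimal numpy sketch of this procedure for a 1-D Gaussian kernel density estimator, selecting the bandwidth h by held-out log-likelihood (the fold layout and names are my own simplification of Eq. (13)):

```python
import numpy as np

def gauss_kde_logpdf(x_query, samples, h):
    """Log of a 1-D Gaussian kernel density estimate at the query points."""
    diffs = x_query[:, None] - samples[None, :]
    log_k = -0.5 * (diffs / h) ** 2 - np.log(h * np.sqrt(2 * np.pi))
    m = log_k.max(axis=1, keepdims=True)                 # log-sum-exp for numerical stability
    log_sum = m.squeeze(1) + np.log(np.exp(log_k - m).sum(axis=1))
    return log_sum - np.log(len(samples))                # log( (1/N) sum_i k_h(x - x_i) )

rng = np.random.default_rng(0)
data = rng.normal(size=300)
J = 5
folds = np.array_split(rng.permutation(data), J)

candidates = [0.05, 0.1, 0.2, 0.4, 0.8]                  # hyperparameter grid for h
scores = []
for h in candidates:
    loglik = 0.0
    for j in range(J):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(J) if i != j])
        loglik += gauss_kde_logpdf(test, train, h).sum() # held-out log-likelihood of fold j
    scores.append(loglik)

best_h = candidates[int(np.argmax(scores))]
print(best_h)                                            # bandwidth with highest held-out log-likelihood
```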
Lecture Pattern Analysis

Part 03: Bias and Variance

Christian Riess
IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg
April 18, 2024
Introduction

• The motivation behind the hyperparameter optimization is to aim for generalization to new data
• For kernel density estimation, the pitfalls are:
• Too large kernel: covers all space with some probability mass, but the density
is too uniform (does not represent the structure)
• Too small kernel: closely represents the training data, but might assign too low
probabilities in areas without training data
• In contrast, the “optimal”1 kernel size: represents the structure of the training
data and also covers unobserved areas to some extent
• This is an instance of the bias-variance tradeoff2

1 This may sound as if there were a unique minimum, maybe even of a convex function; in practice, there is not one single best solution, so read this as a somewhat hypothetical statement
2 See the PR lecture or Hastie/Tibshirani/Friedman Sec. 7-7.3 if more details are desired



Bias and Variance in Regression

• Bias is the square of the average deviation of an estimator from the ground
truth
• Variance is the variance of the estimates, i.e., the expected squared
  deviation from the estimator's mean3
• Informal interpretation:
• High bias indicates model undercomplexity: we obtain a poor fit to the data
• High variance indicates model overcomplexity: the fit models not just
  the structure of the data, but also its noise
• Higher model complexity (= more model parameters) tends to lower the bias and
  increase the variance
• We will usually not be able to get bias and variance simultaneously to 0
• Regularization increases bias and lowers variance

3 See Hastie/Tibshirani/Friedman Sec. 7.3 Eqn. (7.9) for a detailed derivation



Sketches for Model Undercomplexity and Overcomplexity

• Note that this example implicitly contains a smoothness assumption


• It does not claim that there is a universally best fit on arbitrary input
distributions (because of the No-Free-Lunch Theorem)
Transferring Bias and Variance to our Density Estimators

• Our kernel framework can directly replicate these investigations by retargeting our kernels to regression or classification:
• Regression:
  • Estimate f(x) at position x as a kernel-weighted average of the neighboring values (see the sketch below), or
  • as the mean over the k nearest neighbors
• Classification:
• Estimate for classes c1 and c2 individual densities, evaluate pc1 (x) and pc2 (x),
and select the class with higher probability or
• Select the majority class within k nearest neighbors
• We will then observe that
• Larger kernel support / larger k increases bias and lowers variance
• Smaller kernel support / smaller k lowers bias and increases variance

• Analogously, we can use the notion of bias/variance also on our initial unsupervised density estimation task
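As an illustration (function names and data are my own), a short numpy sketch of kernel-weighted regression; the bandwidth h plays the same role as the kernel support above, trading bias against variance:

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, h):
    """Estimate f(x) as a Gaussian-kernel-weighted average of the training targets."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / h) ** 2)
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 60))
y_train = np.sin(x_train) + 0.3 * rng.normal(size=x_train.size)
x_query = np.linspace(0, 2 * np.pi, 5)

# Large h: smooth, high-bias fit; small h: flexible, high-variance fit
print(kernel_regression(x_query, x_train, y_train, h=2.0))
print(kernel_regression(x_query, x_train, y_train, h=0.1))
print(np.sin(x_query))   # ground truth for comparison
```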

