TEAA - Memory-Based Techniques


Memory-Based Techniques: Kernel Density Estimation & K-Nearest Neighbours

Introduction: Memory-Based Techniques

This slide introduces the concept of memory-based techniques, also known as non-parametric
techniques, which differ from parametric approaches typically studied in previous units. These two types of
methods are compared as follows:

1. Parametric Models:
○ These models involve defining a fixed number of parameters and training them to adjust to
the data. Examples include linear regression and logistic regression.
○ The model assumes a specific probability distribution or functional form, making it
dependent on predefined parameters and assumptions.
2. Non-Parametric Models:
○ These models do not assume a specific form for the underlying data distribution. Instead,
they rely on the data itself (or subsets of it) to make predictions or classifications.
○ The prediction for a new sample x is determined based on its distance or similarity to the
training samples {x_n}_{n=1}^N.

○ Non-parametric models are "memory-based," meaning the entire dataset (or a strategically
chosen subset) must be retained for predictions.

In this unit, two key non-parametric techniques are highlighted:

● Kernel Density Estimation (KDE): A method for estimating the probability density function of
data without assuming a parametric distribution.
● K-Nearest Neighbors (KNN): A simple classification and regression technique that assigns a label
or value based on the closest neighbors to the query point.

Additionally, the slide notes that these methods are particularly suited for distributed environments when
the training set is manageable or strategically reduced, allowing highly flexible classification tasks.

Kernel Density Estimation (KDE): Histograms as a Starting Point

This slide introduces histograms as a simple yet naive strategy to estimate the density of a dataset.
Histograms serve as an entry point to understanding Kernel Density Estimation (KDE), highlighting the
limitations of binning approaches.

1. How Histograms Work:

○ The data is divided into a predefined number of bins. Each bin represents a range of
values, and the number of data points falling into a bin determines the height of the bar in
the histogram.
○ The normalized histogram, where bin heights are scaled to sum to 1, provides a rough
approximation of the probability density function (PDF) of the data.
2. Naive Assumptions:
○ Histograms make rigid assumptions about the bin width and boundaries. For instance, all
points in a bin are treated as equally likely, leading to abrupt changes in the density
estimate between bins.
○ The placement of bin edges can significantly affect the results. Small shifts in the bins may
lead to different histograms (as shown in the slide with the shifted bins).
3. Limitations:
○ Discontinuity: Histograms are piecewise constant, failing to represent smooth transitions
in the underlying data.
○ Bias-Variance Tradeoff: Choosing a smaller bin size captures more detail (low bias) but
can lead to noisy, high-variance estimates. Conversely, larger bins smooth the estimate (low
variance) but may oversimplify the structure of the data.

Transition to KDE: Kernel Density Estimation addresses these limitations by replacing bins with smooth
kernel functions, providing a more continuous and adaptable estimate of the data's density. KDE will
adjust the influence of each data point over the entire range, overcoming the rigid boundaries of histograms.
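
The contrast can be sketched in a few lines of NumPy (the synthetic 1-D sample and the bandwidth h = 0.3 below are illustrative choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)                      # 1-D training sample

# Normalized histogram: piecewise-constant, sensitive to bin placement.
counts, edges = np.histogram(data, bins=20, density=True)

# Gaussian KDE: every sample contributes a smooth bump of bandwidth h.
def kde(x, samples, h=0.3):
    diffs = (x[:, None] - samples[None, :]) / h  # (grid, N) standardized distances
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 801)
density = kde(grid, data)
print(round(density.sum() * (grid[1] - grid[0]), 3))  # total mass ≈ 1
```

The histogram changes whenever the bin edges move, while the KDE depends only on the data and the bandwidth.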

Kernel Density Estimation: Available Kernels

This slide visualizes various kernel functions that can be used in KDE, providing a clearer understanding of
how each kernel influences the density estimate:

1. Gaussian Kernel:
○ The most commonly used kernel due to its smooth and continuous shape.
○ Mathematically:

K(x) = (1/√(2π)) · e^(−x² / 2)

○ Characteristics:
■ Assigns the highest weight to points closest to x, with influence decreasing
exponentially as distance increases.
■ Ideal for smooth and general-purpose density estimation.
2. Tophat Kernel:
○ A uniform kernel with constant weight within a fixed range and zero weight outside:

𝐾(𝑥) = {1 𝑖𝑓 |𝑥| ≤ 1; 0 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒}

○ Characteristics:
■ Provides equal influence for all points within the range.
■ Results in a blocky, less smooth density estimate compared to the Gaussian kernel.
3. Epanechnikov Kernel:
○ A parabolic-shaped kernel, optimal in the sense of minimizing mean integrated squared
error:

K(x) = (3/4)·(1 − x²) if |x| ≤ 1, else 0.

○ Characteristics:
■ Balances computational efficiency with smoothness.
■ Suitable for applications requiring faster calculations.
4. Exponential Kernel:
○ Assigns exponentially decaying weights as distance increases.
○ Characteristics:
■ Similar to Gaussian, but with sharper emphasis on nearby points.
■ Useful for datasets where local influence is critical.
5. Linear and Cosine Kernels:
○ Linear kernel: Provides a triangular influence with decreasing weight as distance increases.
○ Cosine kernel: Assigns weights based on a cosine function, creating a smooth but periodic
influence.
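
These six kernels can be written down directly. A minimal NumPy sketch follows; note that the slide states the tophat kernel unnormalized (height 1), so a factor of 1/2 is added here so that every kernel integrates to 1:

```python
import numpy as np

# The six kernels above, each normalized to unit total mass.
kernels = {
    "gaussian":     lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi),
    "tophat":       lambda x: np.where(np.abs(x) <= 1, 0.5, 0.0),
    "epanechnikov": lambda x: np.where(np.abs(x) <= 1, 0.75 * (1 - x**2), 0.0),
    "exponential":  lambda x: 0.5 * np.exp(-np.abs(x)),
    "linear":       lambda x: np.where(np.abs(x) <= 1, 1 - np.abs(x), 0.0),
    "cosine":       lambda x: np.where(np.abs(x) <= 1, (np.pi / 4) * np.cos(np.pi * x / 2), 0.0),
}

grid = np.linspace(-10, 10, 20001)
dx = grid[1] - grid[0]
masses = {name: K(grid).sum() * dx for name, K in kernels.items()}
print(masses)  # every kernel has total mass ≈ 1
```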

Kernel Density Estimation: Slots and Probability Density Calculation

This slide focuses on the mathematical formulation of Kernel Density Estimation (KDE) in
one-dimensional space. The probability density pi of a slot is calculated based on the number of data points
within that slot and its width.

1. Mathematical Representation:
○ Let Δi represent the width of the i-th slot, and ni the number of samples falling into that
slot. The probability density of the slot is given by:

p_i = n_i / (N·Δ_i)

where N is the total number of data points in the dataset. This formula captures how
density depends on both the number of points in the slot and the slot’s width.

2. Simplifying the Slots:


○ To standardize the computation, the width of all slots is set to a constant value Δ, resulting
in:

p_i = n_i / (N·Δ) ∝ n_i / N

○ This simplifies the process but introduces the critical role of Δ, which must be carefully
chosen.
3. Adjusting Slot Width (Δ):
○ The choice of Δ directly impacts the density estimation:
■ Too narrow slots lead to overfitting, where many slots might have no data points,
resulting in a jagged and noisy density estimate.
■ Too wide slots lead to oversmoothing, obscuring important details in the data
distribution.

The slide emphasizes that selecting an appropriate Δ is a task for the machine learning expert, balancing
detail and smoothness.
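
A short numerical check of the slot formula (the synthetic sample and the choice Δ = 0.5 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=1000)

# Fixed slot width Δ: p_i = n_i / (N * Δ) for each slot.
delta = 0.5
edges = np.arange(-4.0, 4.0 + delta, delta)
n_i, _ = np.histogram(data, bins=edges)     # samples per slot
p_i = n_i / (len(data) * delta)             # density of each slot

# p_i * Δ = n_i / N, so the slot densities integrate to the
# fraction of samples covered by the slots (≈ 1 here).
print(round((p_i * delta).sum(), 3))
```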

Kernel Density Estimation: Visualizing the Role of Bandwidth

This slide provides a visual explanation of the impact of slot width h (equivalent to Δ in the previous slide)
using histograms and kernel-based density estimates. Key observations include:

1. Histograms:
○ The top-left plot shows a standard histogram with fixed bin widths, where the placement
of bin edges impacts the density estimation.

○ The top-right plot illustrates the effect of shifting the bin edges, demonstrating how the
histogram can change based on bin alignment.
2. Kernel-Based Density Estimation:
○ The bottom-left plot uses a Tophat kernel, similar to a histogram, but provides finer
granularity by treating each point as a mini histogram.
○ The bottom-right plot uses a Gaussian kernel, resulting in a smooth and continuous
density estimate that overcomes the discontinuities of histograms.
3. Role of Bandwidth (h):
○ Bandwidth controls the width of the kernels and determines the level of smoothing.
■ Small h: Captures more details but risks overfitting, resulting in a bumpy
estimate.
■ Large h: Provides a smoother estimate but risks losing important structure.

This slide visually demonstrates why KDE provides a more flexible and adaptive approach compared to
histograms, especially when using smooth kernels like Gaussian.

Kernel Density Estimation: Extending to Higher Dimensions

This slide explains the challenges of scaling KDE to higher-dimensional spaces and how kernel functions
address these issues:

1. Challenges with Histograms in Higher Dimensions:


○ In 1D, histograms divide the space into bins. In 2D, the number of bins increases
quadratically, and in d-dimensions, it increases exponentially.
○ This rapid growth in the number of bins makes histograms impractical for
high-dimensional data because most bins become empty, and the density estimate becomes
unreliable.
2. Kernel Functions in Higher Dimensions:

○ KDE overcomes these challenges by replacing fixed bins with smooth kernel functions.
○ Each kernel measures the influence of a training point xn on a new point x based on their
distance.
3. Distance Metrics:
○ The most common distance metric is Euclidean distance:

||x − x_n||² = Σ_{i=1}^{d} (x_i − x_{n,i})²

○ In some cases, Mahalanobis distance is used, as it accounts for correlations between


dimensions.
4. Gaussian Kernel in KDE:
○ Using the Gaussian kernel, the probability density function at a point x is given by:

p(x | X) = (1/N) Σ_{n=1}^{N} (1/(2πh²)^(d/2)) · exp(−||x − x_n||² / (2h²))

Here:

■ h is the bandwidth, controlling the smoothness of the estimate.


■ N is the number of samples.
■ Each term contributes to the density based on the distance between x and xn.
5. Interpretation of Bandwidth:
○ In higher dimensions, h functions as a smoothing factor that determines how far each
point xn influences the density estimate at x.

This slide highlights the flexibility of KDE in high-dimensional spaces, making it a practical alternative to
histograms when combined with appropriate kernel functions and distance metrics. The Gaussian kernel is
particularly well-suited for producing smooth and continuous density estimates in these scenarios.
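
The density formula translates almost line for line into NumPy. A minimal sketch, assuming a 2-D standard-normal sample and an illustrative bandwidth h = 0.2:

```python
import numpy as np

def gaussian_kde(x, X, h):
    """p(x | X) with a d-dimensional Gaussian kernel of bandwidth h."""
    N, d = X.shape
    sq_dists = ((X - x) ** 2).sum(axis=1)        # ||x - x_n||^2 for every sample
    norm = (2 * np.pi * h**2) ** (d / 2)         # Gaussian normalizing constant
    return np.exp(-sq_dists / (2 * h**2)).sum() / (N * norm)

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2))                   # 2-D standard-normal sample
p = gaussian_kde(np.zeros(2), X, h=0.2)
print(round(p, 3))  # the true density at the origin is 1/(2π) ≈ 0.159
```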

Figure 1: Approaching the True Distribution with Sufficient Samples

This figure demonstrates how Kernel Density Estimation (KDE) can approximate the true distribution
when there are enough training points (here, N = 2000). The plot shows the performance of three
different kernels: Gaussian, Tophat, and Epanechnikov.

1. Key Observations:
○ The input distribution (shaded in gray) represents the true underlying density.
○ All three kernels provide reasonably close approximations to the true distribution, with
slight differences in their smoothness.
○ The Gaussian kernel (blue line) is the smoothest and most continuous, while the Tophat
kernel (green line) is less smooth due to its uniform weighting. The Epanechnikov kernel
(red line) balances between the two.
2. Noise and Data Representation:
○ The black dots below the plot represent the training samples. These are generated from the
true distribution but include added noise.
○ With a sufficient number of samples, KDE can average out the noise and produce an
accurate estimate of the true density.

This figure illustrates that KDE performs well when the sample size is large, regardless of the kernel type.

Figure 2: Weak Approximation with Few Training Samples

This figure highlights the limitations of KDE when fewer training samples are available. The same kernels
are used (Gaussian, Tophat, Epanechnikov), but the reduced number of samples leads to significant
deviations from the true distribution.

1. Key Issues:
○ The density estimates show considerable fluctuations, especially for the Tophat kernel
(green line), which lacks smooth transitions.
○ The Epanechnikov kernel (red line) and Gaussian kernel (blue line) attempt to smooth
the distribution, but with fewer data points, the estimates remain noisy and inaccurate.
2. Impact of Sample Size:
○ The lack of samples means KDE struggles to represent the true density, and the added
noise from the data becomes more prominent.
○ This demonstrates the reliance of KDE on having sufficient data to produce reliable
density estimates.

In summary, while KDE is flexible, its performance degrades significantly with fewer data points.

Figure 3: Poor Approximation with Very Few Samples

This figure further emphasizes the challenges of KDE with extremely limited training samples (e.g., only
5 points). The density estimates provided by all three kernels are practically unusable.

1. Observations:
○ Each kernel produces wildly varying estimates that fail to resemble the true distribution
(gray shading).
○ The Tophat kernel (green) results in blocky and disconnected density estimates, while the
Gaussian (blue) and Epanechnikov (red) kernels attempt smoothing but are still highly
inaccurate.
2. Limitations:
○ With only a handful of points, the KDE process cannot capture the underlying structure
of the data, as the kernels heavily depend on the few available samples.
○ The figure demonstrates that KDE is unsuitable for very small datasets, where the lack
of information leads to extreme inaccuracies.

This underscores the importance of having a sufficient number of training samples for KDE to function
effectively.

Figure 4: Approximating the True Distribution with Moderate Samples

This figure provides a middle ground, showing KDE performance with a moderate number of training
samples (around 150). While the results are better than in Figures 2 and 3, the estimates still show some
deviations from the true distribution.

1. Improvements:
○ The estimates are smoother and closer to the true distribution compared to when fewer
samples were used.
○ The Gaussian kernel (blue line) continues to provide the most consistent and smooth
estimate, while the Tophat kernel (green line) shows more abrupt changes due to its
uniform weighting.
2. Remaining Challenges:
○ The reduced sample size still introduces noticeable noise in the estimates.
○ KDE struggles to fully capture finer details of the true distribution, especially in regions
with fewer samples.

This figure highlights the gradual improvement in KDE performance as the sample size increases. With 150
samples, the results are usable but still not as robust as when more data is available (as seen in Figure 1).

Figure 5: The Effect of Bandwidth on KDE with Sufficient Samples

This figure illustrates how bandwidth (h) influences the density estimation in Kernel Density Estimation
(KDE), using 2000 data points and a Gaussian kernel.

1. Different Bandwidths:
○ Blue Line (h=0.2):
■ A very small bandwidth creates a highly detailed but noisy estimate. It overfits to
individual data points, capturing even small variations in the dataset.
○ Green Line (h=0.5):
■ A moderate bandwidth balances smoothness and detail, closely approximating the
true input distribution (shaded in gray).
○ Red Line (h=1.0):
■ A large bandwidth oversmooths the density estimate, obscuring finer details of the
true distribution.
2. Key Insights:
○ A small bandwidth (h=0.2) causes the density estimate to become too sensitive to noise,
leading to spiky and irregular behavior.
○ A large bandwidth (h=1.0) reduces the impact of noise but fails to capture important
structure, such as distinct peaks in the data.

This figure demonstrates the critical role of bandwidth in KDE and the need to carefully tune h to achieve
accurate and meaningful density estimates.
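
The slides leave the choice of h to the practitioner; one common data-driven approach, shown here as a hedged sketch, is to maximize the leave-one-out log-likelihood over a grid of candidate bandwidths:

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 300)])

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE with bandwidth h."""
    N = len(samples)
    diffs = (samples[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * diffs**2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(K, 0.0)              # each point is scored without itself
    p_loo = K.sum(axis=1) / (N - 1)
    return np.log(p_loo + 1e-300).sum()   # epsilon guards against underflow

candidates = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
scores = {h: loo_log_likelihood(data, h) for h in candidates}
best_h = max(scores, key=scores.get)
print(best_h)
```

On this bimodal sample, the intermediate candidates tend to win, mirroring the figure: both the spiky and the oversmoothed extremes score poorly.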

Figure 6: Poor KDE Approximation with Few Samples

This figure explores the effect of bandwidth when KDE is applied to a much smaller dataset (N=9) using a
Gaussian kernel. It highlights the challenges of density estimation with limited data.

1. Bandwidth Impact:
○ Blue Line (h=0.2):
■ The small bandwidth amplifies the effect of individual points, resulting in an
extremely noisy and spiky estimate.
○ Green Line (h=0.5):
■ A moderate bandwidth provides some smoothing, but the estimate remains
inconsistent and does not approximate the true distribution well.
○ Red Line (h=1.0):
■ The large bandwidth oversmooths the estimate, effectively masking the distinct
features of the input distribution.
2. Key Observations:
○ With only 9 points, the density estimate is heavily influenced by the limited data, regardless
of bandwidth.
○ A small dataset makes KDE unreliable, as the true structure of the data cannot be captured
effectively.

This figure demonstrates the limitations of KDE when applied to very small datasets, even with bandwidth
tuning.

Figure 7: KDE with Moderate Sample Size (N=83)

This figure examines KDE with 83 samples and different bandwidths, showing some improvement
compared to Figure 6 but still far from optimal.

1. Bandwidth Effects:
○ Blue Line (h=0.2):
■ The small bandwidth continues to overfit, capturing noise in the data and
producing a jagged density estimate.
○ Green Line (h=0.5):
■ A moderate bandwidth provides a smoother estimate that starts to approximate
the true distribution more accurately.
○ Red Line (h=1.0):
■ The large bandwidth oversmooths the data, merging distinct peaks and masking
finer details of the distribution.
2. Improved Approximation:
○ With 83 samples, KDE begins to stabilize, especially with a moderate bandwidth.
○ However, the estimate is still sensitive to bandwidth choice, and the performance remains
limited compared to larger datasets.

This figure highlights the gradual improvement of KDE as the sample size increases, though it also
underscores the need for sufficient data to achieve accurate results. Moderate bandwidth (h=0.5) continues
to strike a balance between overfitting and oversmoothing.

Kernel Density Estimation for Classification (1)

This slide explains how Kernel Density Estimation (KDE) can be applied to classification problems
using Bayes' theorem. KDE is a non-parametric method, meaning it does not assume that the data samples
come from a known probability distribution, making it highly flexible for real-world applications.

1. Training Set Splitting:


Let D = {(x_n, y_n)}_{n=1}^{N} be the complete training dataset, where x_n is a feature vector and y_n is its corresponding label.
○ The dataset is divided into K subsets, one for each target class ωk:

𝐷𝑘 = {𝑥𝑛 | 𝑦𝑛 = 𝑘}

where Dk contains only the samples belonging to class ωk.

2. Bayes' Rule:
○ The probability of a new sample x belonging to class ωk is given by Bayes' theorem:

Pr(ω_k | x) = π(ω_k)·p(x | ω_k) / Σ_j π(ω_j)·p(x | ω_j)

where:

■ π(ωk): The prior probability of class ωk.


■ p(x ∣ ωk): The likelihood of x given class ωk.
3. Likelihood Estimation Using KDE:
○ The likelihood p(x ∣ ωk) is estimated using KDE:

p(x | ω_k) = (1/|D_k|) Σ_{x_n ∈ D_k} kernel(x, x_n)

■ ∣Dk|: The number of samples in class ωk.


■ kernel(x, xn): A function that measures the similarity between x and xn, often
based on a Gaussian kernel.
4. Final Form:

○ Substituting this KDE-based likelihood into Bayes' theorem gives:

Pr(ω_k | x) = [π(ω_k) · (1/|D_k|) Σ_{x_n ∈ D_k} kernel(x, x_n)] / [Σ_j π(ω_j) · (1/|D_j|) Σ_{x_i ∈ D_j} kernel(x, x_i)]

This formulation demonstrates how KDE can be combined with Bayes' rule to classify samples without
assuming any parametric form of the data distribution.
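
Putting the pieces together, a minimal 1-D sketch of the KDE-based Bayes classifier (the Gaussian kernel and the priors π(ω_k) = |D_k|/N are the assumptions here):

```python
import numpy as np

def kde_classify(x, D, h=0.5):
    """Posterior Pr(ω_k | x) from class-conditional Gaussian KDEs (1-D sketch).
    D maps each class label k to its training samples D_k."""
    N = sum(len(Dk) for Dk in D.values())
    post = {}
    for k, Dk in D.items():
        prior = len(Dk) / N                          # π(ω_k) estimated as |D_k| / N
        lik = np.exp(-((x - Dk) ** 2) / (2 * h**2)).mean() / np.sqrt(2 * np.pi * h**2)
        post[k] = prior * lik                        # numerator of Bayes' rule
    Z = sum(post.values())                           # evidence (denominator)
    return {k: v / Z for k, v in post.items()}

rng = np.random.default_rng(4)
D = {1: rng.normal(-2, 1, 200), 2: rng.normal(2, 1, 200)}
post = kde_classify(-2.0, D)
print(max(post, key=post.get))  # class 1 is far more likely at x = -2
```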

Kernel Density Estimation for Classification (2)

This slide builds on the previous formulation, refining the likelihood expression using a Gaussian kernel
and showing how KDE can also be applied to regression tasks.

1. Likelihood with Gaussian Kernel:


○ The likelihood p(x ∣ ωk) can be written using the Gaussian kernel function:

p(x | ω_k) = (1/|D_k|) Σ_{x_n ∈ D_k} (1/(2πh²)^(d/2)) · exp(−||x − x_n||² / (2h²))

■ h: The bandwidth parameter controlling the smoothness of the kernel.


■ ||x − x_n||²: The squared distance between the sample x and the training sample x_n.
2. Bayesian Classification:
○ Substituting this Gaussian kernel-based likelihood into Bayes' rule gives:

Pr(ω_k | x) = [π(ω_k) · (1/|D_k|) Σ_{x_n ∈ D_k} (1/(2πh²)^(d/2)) e^(−||x − x_n||²/(2h²))] / [Σ_j π(ω_j) · (1/|D_j|) Σ_{x_i ∈ D_j} (1/(2πh²)^(d/2)) e^(−||x − x_i||²/(2h²))]

3. KDE for Regression:


○ In regression, the predicted value ŷ is calculated as a weighted average of the training labels:

ŷ = Σ_{(x_n, y_n) ∈ D} y_n · kernel(x, x_n) / Σ_{(x_n, y_n) ∈ D} kernel(x, x_n)

Each training sample yn contributes to the prediction based on its similarity (kernel value)
to the input x.
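
The regression rule (often called Nadaraya-Watson regression) fits in a few lines; the sin-curve data and h = 0.3 below are illustrative:

```python
import numpy as np

def kde_regress(x, X, y, h=0.3):
    """Kernel-weighted average: ŷ = Σ y_n·kernel(x, x_n) / Σ kernel(x, x_n)."""
    w = np.exp(-((x - X) ** 2) / (2 * h**2))   # Gaussian similarity to each x_n
    return (w * y).sum() / w.sum()

rng = np.random.default_rng(5)
X = rng.uniform(0, 2 * np.pi, 400)
y = np.sin(X) + rng.normal(0, 0.1, 400)        # noisy samples of sin(x)

y_hat = kde_regress(np.pi / 2, X, y)
print(round(y_hat, 2))  # should be close to sin(π/2) = 1
```

The unnormalized kernel suffices here: any constant factor cancels between the numerator and the denominator.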

First Set of Images (Bandwidth h=5.0, K=100)

In the first set of images, the bandwidth h is relatively large, creating very smooth decision boundaries.

1. Top Images: The scatter plots of two classes, cyan for Class 1 and orange for Class 2, show the
original dataset and its reduced representation using 100 centroids obtained by KMeans. The
centroid-based KDE approximation preserves the general structure of the data while reducing
complexity.
2. Bottom Images: The decision regions created by KDE are very smooth due to the large bandwidth.
However, this smoothness may oversimplify the boundaries, potentially blending regions where the
two classes overlap.

Second Set of Images (Bandwidth h=2.0, K=100)

Reducing the bandwidth to h=2.0 introduces finer details to the KDE decision boundaries.

1. Top Images: Similar to the first set, the data is reduced using KMeans centroids. However, the
reduced bandwidth captures more granularity in the distributions of the two classes.
2. Bottom Images: The decision regions now show sharper transitions between the classes. While this
provides better boundary accuracy, it could result in minor overfitting, particularly in regions where
classes overlap.

Third Set of Images (Bandwidth h=1.0, K=100)

With h=1.0, the decision boundaries become significantly more detailed, capturing subtle variations in the
data distribution.

1. Top Images: The scatter plots remain unchanged, but the KDE’s sensitivity to finer features
increases.
2. Bottom Images: The decision regions exhibit much sharper transitions. While this improves
accuracy for complex boundaries, it may start to overfit noise or minor fluctuations in the dataset.

Fourth Set of Images (Bandwidth h=0.5, K=100)

A further reduction in bandwidth to h = 0.5 leads to very localized decision boundaries.

1. Top Images: The scatter plots show the same structure, but the KDE heavily focuses on local
variations due to the smaller bandwidth.
2. Bottom Images: The decision regions are now highly irregular, reflecting individual centroids
rather than a smoothed distribution. This results in overfitting and poor generalization for unseen
samples.

Fifth and Sixth Sets of Images (K=100)

Bandwidth h=0.2

● Top Row: The scatter plots show the original data and KMeans-reduced centroids. Centroids
preserve the dataset’s structure.
● Bottom Row: Decision boundaries are smooth but detailed enough to differentiate regions.
Overlapping areas (x = 8) still blend slightly, offering a balance between generalization and accuracy.

Bandwidth h=0.1

● Top Row: Scatter plots are unchanged, showing data and centroids.
● Bottom Row: Boundaries are highly detailed and closely fit the centroids, capturing subtle
distinctions in overlapping regions. However, they risk overfitting, reflecting noise rather than
overall patterns.

Seventh and Eighth Sets of Images (Fewer Centroids: K=2)

For the final sets, the dataset is represented by only 2 centroids per class. This drastically reduces the
complexity of the KDE approximation.

1. Bandwidth h=0.2:
○ The reduced centroids create extremely localized decision boundaries that fail to generalize
to the overall structure of the data. The decision regions reflect individual points rather
than a cohesive separation.
2. Bandwidth h=1.0:
○ The increased bandwidth smooths the decision regions, but the lack of centroids makes it
impossible to capture the original data’s structure. The decision boundaries remain too
simple to accurately classify samples in regions of overlap.
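
The pipeline behind these figures (reduce each class to K centroids with KMeans, then run KDE on the centroids) can be sketched end to end; the tiny Lloyd's-algorithm implementation and the well-separated synthetic classes are illustrative simplifications:

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """A minimal Lloyd's-algorithm sketch: reduce X to K centroids."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([X[labels == k].mean(0) if (labels == k).any() else C[k]
                      for k in range(K)])
    return C

def kde_score(x, C, h):
    """Class score: Gaussian-KDE density of x over the class centroids."""
    return np.exp(-((x - C) ** 2).sum(-1) / (2 * h**2)).mean()

rng = np.random.default_rng(6)
class1 = rng.normal([-3, 0], 1, size=(500, 2))
class2 = rng.normal([3, 0], 1, size=(500, 2))
C1, C2 = kmeans(class1, 100), kmeans(class2, 100)   # reduced representation

x = np.array([-3.0, 0.0])
pred = 1 if kde_score(x, C1, h=1.0) > kde_score(x, C2, h=1.0) else 2
print(pred)
```

Replacing 500 samples per class with 100 centroids keeps the decision regions similar while cutting the per-query cost, which is the trade-off the figures explore.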

Challenges in Bandwidth Adjustment and KNN Overview

The slide highlights a fundamental limitation of Kernel Density Estimation (KDE): its dependency on a
single global parameter, the bandwidth (h). Bandwidth significantly impacts KDE's performance because it
dictates the degree of smoothness in the estimated probability density. When the same bandwidth is applied
to all classes and new samples, it can lead to poor estimation—too narrow a bandwidth can overfit the data,
capturing noise, while too wide a bandwidth can oversmooth the distribution, losing critical distinctions
between classes. These issues are particularly prominent in high-dimensional spaces, where data sparsity
exacerbates the "curse of dimensionality." This makes KDE less robust in certain scenarios, especially for
imbalanced or complex datasets.

KNN provides an alternative approach that mitigates the challenges of KDE. Instead of fixing a global
parameter, KNN dynamically adjusts the region of consideration by fixing the number of nearest neighbors
(k) and allowing the radius of the hypersphere around the query point to expand until k neighbors are
included. This flexibility ensures that the classification is inherently local and adapts to the data distribution.
Additionally, KNN eliminates the need to estimate a global bandwidth, making it less sensitive to the
high-dimensional input space. The algorithm assigns a class label to a new sample x by majority voting
among its k-nearest neighbors, which provides robustness against data imbalance and noise.

Conditional Probability in KNN

This slide delves into the mathematical foundation of KNN by describing how the conditional probability
of a sample x belonging to a particular class ωc​is computed. The formula:

p(x | ω_c) = K_c / (N_c · V)

provides a probabilistic interpretation of KNN. Here:

● Kc is the number of samples of class ωc within the hypersphere centered at x.
● Nc​represents the total number of samples in class ωc​in the training set.
● V is the volume of the hypersphere, dynamically adjusted to include at least k samples.

The total density of x, considering all classes, is calculated as p(x) = K / (N·V), where K is the total number of
neighbors in the hypersphere, and N is the total number of training samples. This formula accounts for the
overall density of the data and ensures the probabilities are normalized across all classes.

Additionally, the prior probability of each class, π(ωc), is defined as Nc / N​​, which reflects the relative
frequency of each class in the training set. Using Bayes' rule, the posterior probability P(ωc ∣ x) is calculated,
representing the probability that x belongs to class ωc given its neighborhood distribution. The
simplification to K_c / K follows directly when the priors are estimated as π(ω_c) = N_c / N, which makes
KNN computationally efficient and intuitive.

This probabilistic formulation highlights KNN’s flexibility in accommodating varying class distributions
and its reliance on local density rather than global assumptions, making it suitable for non-parametric
classification tasks.

Pr(ω_c | x) = p(x | ω_c) · π(ω_c) / p(x) = (K_c / (N_c·V)) · (N_c / N) · (N·V / K) = K_c / K

Classification with KNN

The final slide focuses on the practical aspect of KNN classification. Once the k-nearest neighbors of a query
sample x are identified, the class assignment is straightforward. The sample is assigned to the class with the
highest Kc​, the number of neighbors belonging to class ωc​within the hypersphere centered at x. This process
is encapsulated by the formula:

Pr(ω_c | x) = K_c / K

Here, the posterior probability is directly proportional to the representation of class ωc​within the k-nearest
neighbors. This method ensures that the decision boundary adapts to the local density of the data, making
KNN highly effective for imbalanced datasets and overlapping class distributions.

By dynamically adjusting the hypersphere radius and relying on local neighbor density, KNN offers a robust,
adaptable approach to classification, particularly in scenarios where global models like KDE may struggle.
However, the choice of k and distance metric still plays a critical role in determining KNN’s effectiveness.
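
A minimal sketch of this voting rule, using brute-force distances (the ball trees introduced below exist precisely to avoid this full O(N) scan):

```python
import numpy as np

def knn_classify(x, X, y, k=5):
    """Pr(ω_c | x) = K_c / k over the k nearest training samples."""
    dists = ((X - x) ** 2).sum(axis=1)          # squared Euclidean distances
    nearest = y[np.argsort(dists)[:k]]          # labels of the k nearest neighbours
    classes, counts = np.unique(nearest, return_counts=True)
    posts = {c: Kc / k for c, Kc in zip(classes, counts)}
    return max(posts, key=posts.get), posts

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([1] * 100 + [2] * 100)

label, posts = knn_classify(np.array([2.0, 2.0]), X, y, k=7)
print(label, posts)
```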

Ball Trees - Trivial Example

This slide illustrates a basic example of how ball trees operate by dividing the space hierarchically based on a specified parameter. This parameter indicates the minimum number of samples that a region, or "ball" (hypersphere), must contain to avoid further splitting. The process begins with a root node representing the center of mass (or mean vector) of all data points in the training dataset. A large radius is used initially to encompass all samples. The recursive algorithm then subdivides the space by creating smaller overlapping or non-overlapping regions, forming a hierarchical structure. This visual provides an intuitive understanding of how ball trees progressively refine regions of interest in the dataset.

Recursive Structure and KDE Application

This slide explains the recursive nature of ball tree construction. Unlike the trivial example, here it is emphasized that the balls at the same level in the tree may overlap. The primary objective is to distribute training samples hierarchically into subsets, which facilitates efficient search operations such as branch-and-bound methods for identifying the k-nearest neighbors. Beyond KNN, ball trees are also employed in Kernel Density Estimation (KDE) to approximate the density at a given point in the space. This application is particularly useful in higher-dimensional spaces where large k values are often preferred for smoother approximations. The illustration further demonstrates how ball trees adaptively cover the data space.

Overlapping Regions in Ball Trees

This slide presents a side-by-side view of ball tree coverage in two scenarios. The left diagram shows a
broader distribution of data points, with overlapping spheres that progressively refine the coverage of dense
regions. The right diagram provides a zoomed-in perspective, emphasizing how smaller, overlapping balls
capture finer details of the data distribution. Together, these visuals highlight the flexibility of ball trees in
accommodating various densities and spatial arrangements.

The recursive division ensures that even complex datasets are efficiently represented, which is critical for
tasks like KNN search or density estimation in high-dimensional data.

BallTrees – Tree Generator Algorithm (with formulas)

This algorithm generates a Ball Tree from a dataset Dn​ recursively. Each node represents a "ball," or
hypersphere, containing a subset of the data points. The key steps are as follows:

1. Stopping Criterion: If the size of the current dataset ∣Dn∣ is smaller than the parameter
min_samples, the algorithm terminates for this branch, returning a null node:

if |Dn| < min_samples, then return null.

2. Calculate Node Center and Radius:


○ The center of the ball is the mean vector of the data points in Dn​:

node.centre ← (1/|D_n|) Σ_{x ∈ D_n} x

○ The radius of the ball is the maximum distance of any point in Dn from the center:

node.radius ← max_{x ∈ D_n} ||x − node.centre||

3. Principal Axis Projection:


○ The algorithm identifies two points, x1 and x2, in Dn that maximize the distance between
them:

x1 ← argmax_{x ∈ D_n} ||x − node.centre||,  x2 ← argmax_{x ∈ D_n} ||x − x1||

○ All data points are projected onto the axis defined by (x1 − x2):

z = { x^T · (x1 − x2) | x ∈ D_n }

4. Split the Data:


○ Using the median of the projections, the dataset Dn is divided into two subsets, DL​and
DR​:

m ← median(z),  D_L = {x_i ∈ D_n | z_i ≤ m},  D_R = {x_i ∈ D_n | z_i > m}

5. Recursive Calls:
○ The left and right child nodes are constructed recursively:

𝑛𝑜𝑑𝑒. 𝑙𝑒𝑓𝑡 ← 𝐵𝑎𝑙𝑙𝑇𝑟𝑒𝑒(𝐷𝐿, 𝑚𝑖𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠), 𝑛𝑜𝑑𝑒. 𝑟𝑖𝑔ℎ𝑡 ← 𝐵𝑎𝑙𝑙𝑇𝑟𝑒𝑒(𝐷𝑅, 𝑚𝑖𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠)

○ If both child nodes are non-null, the current node does not store data directly to save
memory.

The algorithm initializes with:

𝑟𝑜𝑜𝑡. 𝑛𝑜𝑑𝑒 ← 𝐵𝑎𝑙𝑙𝑇𝑟𝑒𝑒(𝐷, 𝑚𝑖𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠)

where D is the full dataset.
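
A compact Python rendering of this generator (one simplification of step 5: when either child would be null, the node is turned into a leaf that stores its points, so leaves partition the data):

```python
import numpy as np

class Node:
    """A ball: centre, radius, two children, and (for leaves) the stored points."""
    def __init__(self):
        self.centre = self.radius = self.left = self.right = self.data = None

def ball_tree(Dn, min_samples):
    if len(Dn) < min_samples:                        # stopping criterion
        return None
    node = Node()
    node.centre = Dn.mean(axis=0)                    # mean vector of the ball
    dists = np.linalg.norm(Dn - node.centre, axis=1)
    node.radius = dists.max()                        # farthest point from the centre
    x1 = Dn[dists.argmax()]                          # principal-axis endpoints
    x2 = Dn[np.linalg.norm(Dn - x1, axis=1).argmax()]
    z = Dn @ (x1 - x2)                               # projection on the axis
    m = np.median(z)
    node.left = ball_tree(Dn[z <= m], min_samples)
    node.right = ball_tree(Dn[z > m], min_samples)
    if node.left is None or node.right is None:      # too few points to split further:
        node.left = node.right = None                # make this node a leaf
        node.data = Dn
    return node

rng = np.random.default_rng(8)
root = ball_tree(rng.normal(size=(200, 2)), min_samples=10)
print(root.radius > 0, root.data is None)  # the root splits, so it stores no data
```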

BallTrees – Finding the k-Nearest Neighbors

This algorithm retrieves the k-nearest neighbors of a point x using the Ball Tree. The recursive process
leverages the hierarchical structure for efficient searches:

1. Initial Check: The algorithm checks if the point x lies within the current node’s ball:

||𝑥 − 𝑛𝑜𝑑𝑒. 𝑐𝑒𝑛𝑡𝑟𝑒|| ≤ 𝑛𝑜𝑑𝑒. 𝑟𝑎𝑑𝑖𝑢𝑠

If this condition holds, the algorithm proceeds to examine the data or children of the node.

2. Leaf Node Data: If the node contains data (i.e., it’s a leaf), the algorithm iterates through all points
xi in the node and calculates their distance from x:

||𝑥 − 𝑥𝑖||

It maintains a max heap, knn, of size k to store the k-nearest neighbors. For each point:

○ If the distance is smaller than the current maximum in knn, the point is added:

𝑘𝑛𝑛. 𝑝𝑢𝑠ℎ(𝑥𝑖)

○ If knn exceeds k, the farthest neighbor is removed:

𝑘𝑛𝑛. 𝑝𝑜𝑝()

3. Recursive Search: If the node has children, the algorithm recursively checks both the left and right
child nodes:

if node.left is not null: FindKNN(node.left, knn, x)

Similarly, the right child is processed if it exists:

if node.right is not null: FindKNN(node.right, knn, x)

4. Final Neighbors: Once all relevant branches have been searched, the knn heap contains the
k-nearest neighbors of x.

The procedure is initialized as:

knn ← FindKNN(root_node, {}, x)

where knn starts as an empty heap.

By combining the hierarchical partitioning of Ball Trees and the recursive search strategy, these algorithms
significantly reduce the computational complexity of nearest-neighbor searches, especially in
high-dimensional spaces. The formulas ensure a precise definition of distances and subsets at each step.
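
Both procedures can be combined into a self-contained sketch; the tuple node layout, the heapq-based max-heap of negated distances, and the pruning test (skip a ball when ||x − node.centre|| − node.radius exceeds the current k-th distance) are implementation choices of this sketch:

```python
import heapq
import numpy as np

def build(Dn, min_samples=5):
    """Compact ball-tree node as a tuple: (centre, radius, left, right, data)."""
    if len(Dn) < min_samples:
        return None
    centre = Dn.mean(0)
    d = np.linalg.norm(Dn - centre, axis=1)
    x1 = Dn[d.argmax()]
    x2 = Dn[np.linalg.norm(Dn - x1, axis=1).argmax()]
    z = Dn @ (x1 - x2)
    m = np.median(z)
    left, right = build(Dn[z <= m], min_samples), build(Dn[z > m], min_samples)
    if left is None or right is None:                 # leaf: keep the raw points
        return (centre, d.max(), None, None, Dn)
    return (centre, d.max(), left, right, None)

def find_knn(node, x, k, heap):
    """heap is a max-heap of (-distance, point); balls that cannot improve it are pruned."""
    if node is None:
        return heap
    centre, radius, left, right, data = node
    # No point inside this ball can be closer than ||x - centre|| - radius.
    if len(heap) == k and np.linalg.norm(x - centre) - radius > -heap[0][0]:
        return heap
    if data is not None:                              # leaf: scan the stored points
        for xi in data:
            dist = np.linalg.norm(x - xi)
            if len(heap) < k:
                heapq.heappush(heap, (-dist, tuple(xi)))
            elif dist < -heap[0][0]:                  # closer than the current worst
                heapq.heapreplace(heap, (-dist, tuple(xi)))
    find_knn(left, x, k, heap)
    find_knn(right, x, k, heap)
    return heap

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 2))
x = np.array([0.5, 0.5])
knn = find_knn(build(X), x, 5, [])
brute = sorted(np.linalg.norm(X - x, axis=1))[:5]
print(np.allclose(sorted(-d for d, _ in knn), brute))  # matches the brute-force answer
```

The pruning step is what makes the search sub-linear in practice: a ball whose closest possible point is farther than the current k-th neighbor is skipped entirely.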
