TEAA - Memory Based Techniques
This slide introduces the concept of memory-based techniques, also known as non-parametric
techniques, which differ from the parametric approaches studied in previous units. The two families of
methods are compared as follows:
1. Parametric Models:
○ These models involve defining a fixed number of parameters and training them to adjust to
the data. Examples include linear regression and logistic regression.
○ The model assumes a specific probability distribution or functional form, making it
dependent on predefined parameters and assumptions.
2. Non-Parametric Models:
○ These models do not assume a specific form for the underlying data distribution. Instead,
they rely on the data itself (or subsets of it) to make predictions or classifications.
○ The prediction for a new sample x is determined based on its distance or similarity to the
training samples $\{x_n\}_{n=1}^{N}$.
○ Non-parametric models are "memory-based," meaning the entire dataset (or a strategically
chosen subset) must be retained for predictions.
● Kernel Density Estimation (KDE): A method for estimating the probability density function of
data without assuming a parametric distribution.
● K-Nearest Neighbors (KNN): A simple classification and regression technique that assigns a label
or value based on the closest neighbors to the query point.
Additionally, the slide notes that these methods are particularly well suited to distributed environments when
the training set is manageable or strategically reduced, enabling highly flexible classification.
This slide introduces histograms as a simple yet naive strategy to estimate the density of a dataset.
Histograms serve as an entry point to understanding Kernel Density Estimation (KDE), highlighting the
limitations of binning approaches.
1. Binning:
○ The data is divided into a predefined number of bins. Each bin represents a range of
values, and the number of data points falling into a bin determines the height of the bar in
the histogram.
○ The normalized histogram, where bin heights are scaled to sum to 1, provides a rough
approximation of the probability density function (PDF) of the data.
2. Naive Assumptions:
○ Histograms make rigid assumptions about the bin width and boundaries. For instance, all
points in a bin are treated as equally likely, leading to abrupt changes in the density
estimate between bins.
○ The placement of bin edges can significantly affect the results. Small shifts in the bins may
lead to different histograms (as shown in the slide with the shifted bins).
3. Limitations:
○ Discontinuity: Histograms are piecewise constant, failing to represent smooth transitions
in the underlying data.
○ Bias-Variance Tradeoff: Choosing a smaller bin size captures more detail (low bias) but
can lead to noisy, high-variance estimates. Conversely, larger bins smooth the estimate (low
variance) but may oversimplify the structure of the data.
Transition to KDE: Kernel Density Estimation addresses these limitations by replacing bins with smooth
kernel functions, providing a more continuous and adaptable estimate of the data's density. KDE will
adjust the influence of each data point over the entire range, overcoming the rigid boundaries of histograms.
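To make the contrast concrete, here is a minimal Python sketch (using NumPy and scikit-learn; the synthetic bimodal sample and the bandwidth of 0.4 are illustrative assumptions) that places a normalized histogram next to a Gaussian KDE fitted on the same data:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic 1-D data drawn from a bimodal distribution (illustrative assumption).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.6, 300), rng.normal(1.5, 1.0, 200)])

# Normalized histogram: bin heights approximate the PDF, but are piecewise constant.
hist, edges = np.histogram(data, bins=20, density=True)

# Gaussian KDE: every sample spreads its mass through a smooth kernel.
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(data.reshape(-1, 1))
grid = np.linspace(data.min() - 1, data.max() + 1, 200).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))   # score_samples returns log-density

print(hist[:5])       # abrupt, bin-dependent estimate
print(density[:5])    # smooth, continuous estimate
```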
This slide visualizes various kernel functions that can be used in KDE, providing a clearer understanding of
how each kernel influences the density estimate:
1. Gaussian Kernel:
○ The most commonly used kernel due to its smooth and continuous shape.
○ Mathematically:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}$$
○ Characteristics:
■ Assigns the highest weight to points closest to x, with influence decreasing
exponentially as distance increases.
■ Ideal for smooth and general-purpose density estimation.
2. Tophat Kernel:
○ A uniform kernel with constant weight within a fixed range and zero weight outside, e.g. $K(x) = \tfrac{1}{2}$ if $|x| \le 1$, else $0$.
○ Characteristics:
■ Provides equal influence for all points within the range.
■ Results in a blocky, less smooth density estimate compared to the Gaussian kernel.
3. Epanechnikov Kernel:
○ A parabolic-shaped kernel, optimal in the sense of minimizing mean integrated squared
error:
$$K(x) = \tfrac{3}{4}\,(1 - x^{2}) \;\;\text{if } |x| \le 1, \;\;\text{else } 0.$$
○ Characteristics:
■ Balances computational efficiency with smoothness.
■ Suitable for applications requiring faster calculations.
4. Exponential Kernel:
○ Assigns exponentially decaying weights as distance increases.
○ Characteristics:
■ Similar to Gaussian, but with sharper emphasis on nearby points.
■ Useful for datasets where local influence is critical.
5. Linear and Cosine Kernels:
○ Linear kernel: Provides a triangular influence with decreasing weight as distance increases.
○ Cosine kernel: Assigns weights based on a cosine function, creating a smooth but periodic
influence.
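As a rough illustration of these kernel shapes, scikit-learn's KernelDensity supports all six of them by name, so their effect can be compared on the same sample; the data and the fixed bandwidth below are arbitrary assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 1))          # assumed 1-D sample
grid = np.linspace(-4, 4, 100).reshape(-1, 1)

# Fit one KDE per kernel shape and compare the resulting density estimates.
for kernel in ["gaussian", "tophat", "epanechnikov", "exponential", "linear", "cosine"]:
    kde = KernelDensity(kernel=kernel, bandwidth=0.5).fit(X)
    density = np.exp(kde.score_samples(grid))
    print(kernel, density.max().round(3))
```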
Kernel Density Estimation: Slots and Probability Density Calculation
This slide focuses on the mathematical formulation of Kernel Density Estimation (KDE) in
one-dimensional space. The probability density pi of a slot is calculated based on the number of data points
within that slot and its width.
1. Mathematical Representation:
○ Let Δi represent the width of the i-th slot, and ni the number of samples falling into that
slot. The probability density of the slot is given by:
$$p_i = \frac{n_i}{N\,\Delta_i}$$
where N is the total number of data points in the dataset. This formula captures how
density depends on both the number of points in the slot and the slot’s width.
2. Fixed Slot Width:
○ When all slots share the same width Δ, the expression simplifies to
$$p_i = \frac{n_i}{N\Delta} = \frac{1}{\Delta}\cdot\frac{n_i}{N}$$
○ This simplifies the process but introduces the critical role of Δ, which must be carefully
chosen.
3. Adjusting Slot Width (Δ):
○ The choice of Δ directly impacts the density estimation:
■ Too narrow slots lead to overfitting, where many slots might have no data points,
resulting in a jagged and noisy density estimate.
■ Too wide slots lead to oversmoothing, obscuring important details in the data
distribution.
The slide emphasizes that selecting an appropriate Δ is a task for the machine learning expert, balancing
detail and smoothness.
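The formula can be verified numerically. The short sketch below (synthetic data and 10 equal-width slots are assumptions) computes pi = ni / (N·Δi) by hand and checks that it matches NumPy's density-normalized histogram:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=1000)                    # N = 1000 samples (assumption)

counts, edges = np.histogram(data, bins=10)     # n_i per slot
widths = np.diff(edges)                         # Δ_i per slot
p = counts / (len(data) * widths)               # p_i = n_i / (N · Δ_i)

# np.histogram with density=True applies the same normalization.
p_np, _ = np.histogram(data, bins=10, density=True)
assert np.allclose(p, p_np)
```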
This slide provides a visual explanation of the impact of slot width h (equivalent to Δ in the previous slide)
using histograms and kernel-based density estimates. Key observations include:
1. Histograms:
○ The top-left plot shows a standard histogram with fixed bin widths, where the placement
of bin edges impacts the density estimation.
○ The top-right plot illustrates the effect of shifting the bin edges, demonstrating how the
histogram can change based on bin alignment.
2. Kernel-Based Density Estimation:
○ The bottom-left plot uses a Tophat kernel, similar to a histogram, but provides finer
granularity by treating each point as a mini histogram.
○ The bottom-right plot uses a Gaussian kernel, resulting in a smooth and continuous
density estimate that overcomes the discontinuities of histograms.
3. Role of Bandwidth (h):
○ Bandwidth controls the width of the kernels and determines the level of smoothing.
■ Small h: Captures more details but risks overfitting, resulting in a bumpy
estimate.
■ Large h: Provides a smoother estimate but risks losing important structure.
This slide visually demonstrates why KDE provides a more flexible and adaptive approach compared to
histograms, especially when using smooth kernels like Gaussian.
This slide explains the challenges of scaling KDE to higher-dimensional spaces and how kernel functions
address these issues:
1. Challenges in Higher Dimensions:
○ Binning strategies such as histograms do not scale well, because the number of bins grows
exponentially with the number of dimensions.
2. Kernel Functions:
○ KDE overcomes these challenges by replacing fixed bins with smooth kernel functions.
○ Each kernel measures the influence of a training point xn on a new point x based on their
distance.
3. Distance Metrics:
○ The most common distance metric is Euclidean distance:
$$\|x - x_n\|^{2} = \sum_{i=1}^{d} (x_i - x_{n,i})^{2}$$
○ With a Gaussian kernel, the density estimate for a new point x becomes:
$$p(x \mid X) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^{2})^{d/2}} \exp\!\left(-\frac{\|x - x_n\|^{2}}{2h^{2}}\right)$$
Here, h is the bandwidth that controls the amount of smoothing, N is the number of training samples, and d is the dimensionality of the input space.
This slide highlights the flexibility of KDE in high-dimensional spaces, making it a practical alternative to
histograms when combined with appropriate kernel functions and distance metrics. The Gaussian kernel is
particularly well-suited for producing smooth and continuous density estimates in these scenarios.
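A direct implementation of this estimator is straightforward; the sketch below assumes the d-dimensional Gaussian normalization reconstructed above and uses made-up 2-D data:

```python
import numpy as np

def gaussian_kde(x, X, h):
    """Evaluate p(x | X) with an isotropic Gaussian kernel of bandwidth h."""
    N, d = X.shape
    sq_dists = np.sum((X - x) ** 2, axis=1)           # ||x - x_n||^2 for every x_n
    norm = (2.0 * np.pi * h ** 2) ** (d / 2)          # assumed d-dimensional normalization
    return np.mean(np.exp(-sq_dists / (2.0 * h ** 2)) / norm)

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))                        # assumed 2-D training sample
print(gaussian_kde(np.zeros(2), X, h=0.3))            # density estimate at the origin
```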
Figure 1: Approaching the True Distribution with Sufficient Samples
This figure demonstrates how Kernel Density Estimation (KDE) can approximate the true distribution
when there are enough training points (here, N = 2000). The plot shows the performance of three
different kernels: Gaussian, Tophat, and Epanechnikov.
1. Key Observations:
○ The input distribution (shaded in gray) represents the true underlying density.
○ All three kernels provide reasonably close approximations to the true distribution, with
slight differences in their smoothness.
○ The Gaussian kernel (blue line) is the smoothest and most continuous, while the Tophat
kernel (green line) is less smooth due to its uniform weighting. The Epanechnikov kernel
(red line) balances between the two.
2. Noise and Data Representation:
○ The black dots below the plot represent the training samples. These are generated from the
true distribution but include added noise.
○ With a sufficient number of samples, KDE can average out the noise and produce an
accurate estimate of the true density.
This figure illustrates that KDE performs well when the sample size is large, regardless of the kernel type.
Figure 2: Weak Approximation with Few Training Samples
This figure highlights the limitations of KDE when fewer training samples are available. The same kernels
are used (Gaussian, Tophat, Epanechnikov), but the reduced number of samples leads to significant
deviations from the true distribution.
1. Key Issues:
○ The density estimates show considerable fluctuations, especially for the Tophat kernel
(green line), which lacks smooth transitions.
○ The Epanechnikov kernel (red line) and Gaussian kernel (blue line) attempt to smooth
the distribution, but with fewer data points, the estimates remain noisy and inaccurate.
2. Impact of Sample Size:
○ The lack of samples means KDE struggles to represent the true density, and the added
noise from the data becomes more prominent.
○ This demonstrates the reliance of KDE on having sufficient data to produce reliable
density estimates.
In summary, while KDE is flexible, its performance degrades significantly with fewer data points.
Figure 3: Poor Approximation with Very Few Samples
This figure further emphasizes the challenges of KDE with extremely limited training samples (e.g., only
5 points). The density estimates provided by all three kernels are practically unusable.
1. Observations:
○ Each kernel produces wildly varying estimates that fail to resemble the true distribution
(gray shading).
○ The Tophat kernel (green) results in blocky and disconnected density estimates, while the
Gaussian (blue) and Epanechnikov (red) kernels attempt smoothing but are still highly
inaccurate.
2. Limitations:
○ With only a handful of points, the KDE process cannot capture the underlying structure
of the data, as the kernels heavily depend on the few available samples.
○ The figure demonstrates that KDE is unsuitable for very small datasets, where the lack
of information leads to extreme inaccuracies.
This underscores the importance of having a sufficient number of training samples for KDE to function
effectively.
Figure 4: Approximating the True Distribution with Moderate Samples
This figure provides a middle ground, showing KDE performance with a moderate number of training
samples (around 150). While the results are better than in Figures 2 and 3, the estimates still show some
deviations from the true distribution.
1. Improvements:
○ The estimates are smoother and closer to the true distribution compared to when fewer
samples were used.
○ The Gaussian kernel (blue line) continues to provide the most consistent and smooth
estimate, while the Tophat kernel (green line) shows more abrupt changes due to its
uniform weighting.
2. Remaining Challenges:
○ The reduced sample size still introduces noticeable noise in the estimates.
○ KDE struggles to fully capture finer details of the true distribution, especially in regions
with fewer samples.
This figure highlights the gradual improvement in KDE performance as the sample size increases. With 150
samples, the results are usable but still not as robust as when more data is available (as seen in Figure 1).
Figure 5: The Effect of Bandwidth on KDE with Sufficient Samples
This figure illustrates how bandwidth (h) influences the density estimation in Kernel Density Estimation
(KDE), using 2000 data points and a Gaussian kernel.
1. Different Bandwidths:
○ Blue Line (h=0.2):
■ A very small bandwidth creates a highly detailed but noisy estimate. It overfits to
individual data points, capturing even small variations in the dataset.
○ Green Line (h=0.5):
■ A moderate bandwidth balances smoothness and detail, closely approximating the
true input distribution (shaded in gray).
○ Red Line (h=1.0):
■ A large bandwidth oversmooths the density estimate, obscuring finer details of the
true distribution.
2. Key Insights:
○ A small bandwidth (h=0.2) causes the density estimate to become too sensitive to noise,
leading to spiky and irregular behavior.
○ A large bandwidth (h=1.0) reduces the impact of noise but fails to capture important
structure, such as distinct peaks in the data.
This figure demonstrates the critical role of bandwidth in KDE and the need to carefully tune h to achieve
accurate and meaningful density estimates.
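One common, though not the only, way to tune h is to maximize the cross-validated log-likelihood of held-out data. The sketch below (synthetic data and an assumed bandwidth grid) uses scikit-learn's GridSearchCV for this purpose:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 300)]).reshape(-1, 1)

# Pick the bandwidth that maximizes the held-out log-likelihood (5-fold CV).
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.linspace(0.1, 1.0, 10)},
    cv=5,
)
search.fit(X)
print(search.best_params_)   # a bandwidth between the under- and over-smoothed extremes
```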
Figure 6: Poor KDE Approximation with Few Samples
This figure explores the effect of bandwidth when KDE is applied to a much smaller dataset (N=9) using a
Gaussian kernel. It highlights the challenges of density estimation with limited data.
1. Bandwidth Impact:
○ Blue Line (h=0.2):
■ The small bandwidth amplifies the effect of individual points, resulting in an
extremely noisy and spiky estimate.
○ Green Line (h=0.5):
■ A moderate bandwidth provides some smoothing, but the estimate remains
inconsistent and does not approximate the true distribution well.
○ Red Line (h=1.0):
■ The large bandwidth oversmooths the estimate, effectively masking the distinct
features of the input distribution.
2. Key Observations:
○ With only 9 points, the density estimate is heavily influenced by the limited data, regardless
of bandwidth.
○ A small dataset makes KDE unreliable, as the true structure of the data cannot be captured
effectively.
This figure demonstrates the limitations of KDE when applied to very small datasets, even with bandwidth
tuning.
Figure 7: KDE with Moderate Sample Size (N=83)
This figure examines KDE with 83 samples and different bandwidths, showing some improvement
compared to Figure 6 but still far from optimal.
1. Bandwidth Effects:
○ Blue Line (h=0.2):
■ The small bandwidth continues to overfit, capturing noise in the data and
producing a jagged density estimate.
○ Green Line (h=0.5):
■ A moderate bandwidth provides a smoother estimate that starts to approximate
the true distribution more accurately.
○ Red Line (h=1.0):
■ The large bandwidth oversmooths the data, merging distinct peaks and masking
finer details of the distribution.
2. Improved Approximation:
○ With 83 samples, KDE begins to stabilize, especially with a moderate bandwidth.
○ However, the estimate is still sensitive to bandwidth choice, and the performance remains
limited compared to larger datasets.
This figure highlights the gradual improvement of KDE as the sample size increases, though it also
underscores the need for sufficient data to achieve accurate results. Moderate bandwidth (h=0.5) continues
to strike a balance between overfitting and oversmoothing.
Kernel Density Estimation for Classification (1)
This slide explains how Kernel Density Estimation (KDE) can be applied to classification problems
using Bayes' theorem. KDE is a non-parametric method, meaning it does not assume that the data samples
come from a known probability distribution, making it highly flexible for real-world applications.
1. Class-Conditional Datasets:
○ The training samples are split per class, so that $D_k = \{x_n \mid y_n = k\}$ contains exactly the
samples labelled with class k.
2. Bayes' Rule:
○ The probability of a new sample x belonging to class ωk is given by Bayes' theorem:
$$\Pr(\omega_k \mid x) = \frac{\pi(\omega_k)\, p(x \mid \omega_k)}{\sum_j \pi(\omega_j)\, p(x \mid \omega_j)}$$
where the class-conditional likelihood is estimated non-parametrically with KDE:
$$p(x \mid \omega_k) = \frac{1}{|D_k|} \sum_{x_n \in D_k} \mathrm{kernel}(x, x_n)$$
○ Substituting the KDE estimate into Bayes' rule gives:
$$\Pr(\omega_k \mid x) = \frac{\pi(\omega_k) \cdot \frac{1}{|D_k|} \sum_{x_n \in D_k} \mathrm{kernel}(x, x_n)}{\sum_j \pi(\omega_j) \cdot \frac{1}{|D_j|} \sum_{x_i \in D_j} \mathrm{kernel}(x, x_i)}$$
This formulation demonstrates how KDE can be combined with Bayes' rule to classify samples without
assuming any parametric form of the data distribution.
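A minimal sketch of such a classifier is shown below, assuming scikit-learn's KernelDensity for the per-class likelihoods and class frequencies as priors; the class name KDEClassifier and the test data are illustrative, not part of the slides:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

class KDEClassifier:
    """Bayes classifier with per-class KDE likelihoods (minimal sketch)."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One KDE per class D_k and the prior pi(w_k) = |D_k| / N.
        self.models_ = [KernelDensity(bandwidth=self.bandwidth).fit(X[y == k])
                        for k in self.classes_]
        self.priors_ = [np.mean(y == k) for k in self.classes_]
        return self

    def predict(self, X):
        # log p(x | w_k) + log pi(w_k); the shared denominator cancels in the argmax.
        log_post = np.stack([m.score_samples(X) + np.log(p)
                             for m, p in zip(self.models_, self.priors_)], axis=1)
        return self.classes_[np.argmax(log_post, axis=1)]

# Tiny usage example with made-up 2-D data.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(KDEClassifier(bandwidth=0.8).fit(X, y).predict(np.array([[0.0, 0.0], [3.0, 3.0]])))
```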
Kernel Density Estimation for Classification (2)
This slide builds on the previous formulation, refining the likelihood expression using a Gaussian kernel
and showing how KDE can also be applied to regression tasks.
○ With a Gaussian kernel, the class-conditional likelihood becomes:
$$p(x \mid \omega_k) = \frac{1}{|D_k|} \sum_{x_n \in D_k} \frac{1}{(2\pi h^{2})^{d/2}}\, e^{-\frac{\|x - x_n\|^{2}}{2h^{2}}}$$
○ Substituting into Bayes' rule yields the posterior:
$$\Pr(\omega_k \mid x) = \frac{\pi(\omega_k) \cdot \frac{1}{|D_k|} \sum_{x_n \in D_k} \frac{1}{(2\pi h^{2})^{d/2}}\, e^{-\frac{\|x - x_n\|^{2}}{2h^{2}}}}{\sum_j \pi(\omega_j) \cdot \frac{1}{|D_j|} \sum_{x_i \in D_j} \frac{1}{(2\pi h^{2})^{d/2}}\, e^{-\frac{\|x - x_i\|^{2}}{2h^{2}}}}$$
○ For regression, the prediction is a kernel-weighted average of the training targets (the Nadaraya–Watson estimator):
$$\hat{y} = \frac{\sum_{(x_n,\, y_n) \in D} y_n \cdot \mathrm{kernel}(x, x_n)}{\sum_{(x_n,\, y_n) \in D} \mathrm{kernel}(x, x_n)}$$
Each training sample yn contributes to the prediction based on its similarity (kernel value)
to the input x.
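A compact sketch of this kernel (Nadaraya–Watson) regressor with a Gaussian kernel follows; the bandwidth and the noisy sine data are assumptions made for illustration:

```python
import numpy as np

def kernel_regression(x_query, X, y, h):
    """Nadaraya-Watson estimate: kernel-weighted average of the training targets."""
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dists / (2.0 * h ** 2))        # Gaussian kernel values
    return np.sum(weights * y) / np.sum(weights)

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)           # noisy targets (assumption)
print(kernel_regression(np.array([5.0]), X, y, h=0.5))  # close to sin(5)
```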
First Set of Images (Bandwidth h=5.0, K=100)
In the first set of images, the bandwidth h is relatively large, creating very smooth decision boundaries.
1. Top Images: The scatter plots of two classes, cyan for Class 1 and orange for Class 2, show the
original dataset and its reduced representation using 100 centroids obtained by KMeans. The
centroid-based KDE approximation preserves the general structure of the data while reducing
complexity.
2. Bottom Images: The decision regions created by KDE are very smooth due to the large bandwidth.
However, this smoothness may oversimplify the boundaries, potentially blending regions where the
two classes overlap.
Second Set of Images (Bandwidth h=2.0, K=100)
Reducing the bandwidth to h=2.0 introduces finer details to the KDE decision boundaries.
1. Top Images: Similar to the first set, the data is reduced using KMeans centroids. However, the
reduced bandwidth captures more granularity in the distributions of the two classes.
2. Bottom Images: The decision regions now show sharper transitions between the classes. While this
provides better boundary accuracy, it could result in minor overfitting, particularly in regions where
classes overlap.
Third Set of Images (Bandwidth h=1.0, K=100)
With h=1.0, the decision boundaries become significantly more detailed, capturing subtle variations in the
data distribution.
1. Top Images: The scatter plots remain unchanged, but the KDE’s sensitivity to finer features
increases.
2. Bottom Images: The decision regions exhibit much sharper transitions. While this improves
accuracy for complex boundaries, it may start to overfit noise or minor fluctuations in the dataset.
Fourth Set of Images (Bandwidth h=0.5, K=100)
A further reduction in bandwidth to h=0.5 leads to very localized decision boundaries.
1. Top Images: The scatter plots show the same structure, but the KDE heavily focuses on local
variations due to the smaller bandwidth.
2. Bottom Images: The decision regions are now highly irregular, reflecting individual centroids
rather than a smoothed distribution. This results in overfitting and poor generalization for unseen
samples.
Fifth Set of Images (Bandwidth h=0.2)
● Top Row: The scatter plots show the original data and KMeans-reduced centroids. Centroids
preserve the dataset’s structure.
● Bottom Row: Decision boundaries are smooth but detailed enough to differentiate regions.
Overlapping areas (x = 8) still blend slightly, offering a balance between generalization and accuracy.
Sixth Set of Images (Bandwidth h=0.1)
● Top Row: Scatter plots are unchanged, showing data and centroids.
● Bottom Row: Boundaries are highly detailed and closely fit the centroids, capturing subtle
distinctions in overlapping regions. However, they risk overfitting, reflecting noise rather than
overall patterns.
Seventh and Eighth Sets of Images (Fewer Centroids: K=2)
For the final sets, the dataset is represented by only 2 centroids per class. This drastically reduces the
complexity of the KDE approximation.
1. Bandwidth h=0.2:
○ The reduced centroids create extremely localized decision boundaries that fail to generalize
to the overall structure of the data. The decision regions reflect individual points rather
than a cohesive separation.
2. Bandwidth h=1.0:
○ The increased bandwidth smooths the decision regions, but the lack of centroids makes it
impossible to capture the original data’s structure. The decision boundaries remain too
simple to accurately classify samples in regions of overlap.
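The centroid-reduction strategy used throughout these image sets can be sketched as follows, assuming KMeans from scikit-learn to compress each class to K centroids before fitting the per-class KDE (the helper name and the data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def reduced_kde_models(X, y, n_centroids=100, bandwidth=1.0):
    """Fit one KDE per class on KMeans centroids instead of the raw samples (sketch)."""
    models = {}
    for k in np.unique(y):
        Xk = X[y == k]
        centroids = KMeans(n_clusters=min(n_centroids, len(Xk)), n_init=10,
                           random_state=0).fit(Xk).cluster_centers_
        models[k] = KernelDensity(bandwidth=bandwidth).fit(centroids)
    return models

# Scoring a query point against each class density (priors omitted for brevity).
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(4, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
models = reduced_kde_models(X, y, n_centroids=100, bandwidth=1.0)
query = np.array([[4.0, 4.0]])
print({k: m.score_samples(query)[0] for k, m in models.items()})
```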
The slide highlights a fundamental limitation of Kernel Density Estimation (KDE): its dependency on a
single global parameter, the bandwidth (h). Bandwidth significantly impacts KDE's performance because it
dictates the degree of smoothness in the estimated probability density. When the same bandwidth is applied
to all classes and new samples, it can lead to poor estimation—too narrow a bandwidth can overfit the data,
capturing noise, while too wide a bandwidth can oversmooth the distribution, losing critical distinctions
between classes. These issues are particularly prominent in high-dimensional spaces, where data sparsity
exacerbates the "curse of dimensionality." This makes KDE less robust in certain scenarios, especially for
imbalanced or complex datasets.
KNN provides an alternative approach that mitigates the challenges of KDE. Instead of fixing a global
parameter, KNN dynamically adjusts the region of consideration by fixing the number of nearest neighbors
(k) and allowing the radius of the hypersphere around the query point to expand until k neighbors are
included. This flexibility ensures that the classification is inherently local and adapts to the data distribution.
Additionally, KNN eliminates the need to estimate a global bandwidth, making it less sensitive to the
high-dimensional input space. The algorithm assigns a class label to a new sample x by majority voting
among its k-nearest neighbors, which provides robustness against data imbalance and noise.
This slide delves into the mathematical foundation of KNN by describing how the conditional probability
of a sample x belonging to a particular class ωc is computed. The formula:
$$p(x \mid \omega_c) = \frac{K_c}{N_c V}$$
● Kc is the number of samples of class ωc within the hypersphere centered at x.
● Nc represents the total number of samples in class ωc in the training set.
● V is the volume of the hypersphere, dynamically adjusted to include at least k samples.
The total density of x, considering all classes, is calculated as p(x) = K / (N·V), where K is the total number of
neighbors in the hypersphere, and N is the total number of training samples. This formula accounts for the
overall density of the data and ensures the probabilities are normalized across all classes.
Additionally, the prior probability of each class, π(ωc), is defined as Nc / N, which reflects the relative
frequency of each class in the training set. Using Bayes' rule, the posterior probability P(ωc ∣ x) is calculated,
representing the probability that x belongs to class ωc given its neighborhood distribution. The
simplification of the posterior to Kc / K follows directly from substituting these estimates into Bayes' rule,
which makes KNN computationally efficient and intuitive.
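The cancellation can be written out explicitly by substituting the three estimates above into Bayes' rule:
$$\Pr(\omega_c \mid x) = \frac{\pi(\omega_c)\, p(x \mid \omega_c)}{p(x)} = \frac{\frac{N_c}{N} \cdot \frac{K_c}{N_c V}}{\frac{K}{N V}} = \frac{K_c}{K}$$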
This probabilistic formulation highlights KNN’s flexibility in accommodating varying class distributions
and its reliance on local density rather than global assumptions, making it suitable for non-parametric
classification tasks.
The final slide focuses on the practical aspect of KNN classification. Once the k-nearest neighbors of a query
sample x are identified, the class assignment is straightforward. The sample is assigned to the class with the
highest Kc, the number of neighbors belonging to class ωc within the hypersphere centered at x. This process
is encapsulated by the formula:
$$p(\omega_c \mid x) = \frac{K_c}{K}$$
Here, the posterior probability is directly proportional to the representation of class ωc within the k-nearest
neighbors. This method ensures that the decision boundary adapts to the local density of the data, making
KNN highly effective for imbalanced datasets and overlapping class distributions.
By dynamically adjusting the hypersphere radius and relying on local neighbor density, KNN offers a robust,
adaptable approach to classification, particularly in scenarios where global models like KDE may struggle.
However, the choice of k and distance metric still plays a critical role in determining KNN’s effectiveness.
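In practice this voting rule is what off-the-shelf implementations provide; the sketch below uses scikit-learn's KNeighborsClassifier on made-up data, where predict_proba returns exactly the neighbor proportions Kc / K:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# k and the distance metric are the two main design choices.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)

query = np.array([[2.5, 2.5]])
print(knn.predict(query))          # majority vote among the 5 nearest neighbors
print(knn.predict_proba(query))    # proportions K_c / K for each class
```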
Ball Trees - Trivial Example
This slide presents a side-by-side view of ball tree coverage in two scenarios. The left diagram shows a
broader distribution of data points, with overlapping spheres that progressively refine the coverage of dense
regions. The right diagram provides a zoomed-in perspective, emphasizing how smaller, overlapping balls
capture finer details of the data distribution. Together, these visuals highlight the flexibility of ball trees in
accommodating various densities and spatial arrangements.
The recursive division ensures that even complex datasets are efficiently represented, which is critical for
tasks like KNN search or density estimation in high-dimensional data.
This algorithm generates a Ball Tree from a dataset Dn recursively. Each node represents a "ball," or
hypersphere, containing a subset of the data points. The key steps are as follows:
1. Stopping Criterion: If the size of the current dataset |Dn| is smaller than the parameter
min_samples, the algorithm terminates for this branch, returning a null node.
2. Centre and Radius:
○ The centre of the ball is the mean of the points it contains:
$$\mathrm{node.centre} \leftarrow \frac{1}{|D_n|} \sum_{x \in D_n} x$$
○ The radius of the ball is the maximum distance of any point in Dn from the center:
$$\mathrm{node.radius} \leftarrow \max_{x \in D_n} \|x - \mathrm{node.centre}\|$$
3. Split Points:
○ Two extreme points are chosen to define the split direction:
$$x_1 \leftarrow \arg\max_{x \in D_n} \|x - \mathrm{node.centre}\|, \qquad x_2 \leftarrow \arg\max_{x \in D_n} \|x - x_1\|$$
4. Projection:
○ All data points are projected onto the axis defined by (x1 − x2):
$$z = \{\, x^{T} (x_1 - x_2) \mid x \in D_n \,\}$$
○ The projections z are used to partition Dn into two subsets, which define the left and right children.
5. Recursive Calls:
○ The left and right child nodes are constructed recursively:
○ If both child nodes are non-null, the current node does not store data directly to save
memory.
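A minimal recursive sketch of this construction is given below; the class and function names, the median split of the projections, and min_samples = 10 are assumptions used to make the pseudocode concrete:

```python
import numpy as np

class BallNode:
    """One 'ball' of the tree: centre, radius, optional data, optional children."""
    def __init__(self, centre, radius, data=None, left=None, right=None):
        self.centre, self.radius = centre, radius
        self.data, self.left, self.right = data, left, right

def build_ball_tree(D, min_samples=10):
    if len(D) < min_samples:                            # stopping criterion: return a null node
        return None
    centre = D.mean(axis=0)                             # centre = mean of the points
    dists = np.linalg.norm(D - centre, axis=1)
    radius = dists.max()                                # radius = farthest point from the centre
    x1 = D[np.argmax(dists)]                            # x1: farthest from the centre
    x2 = D[np.argmax(np.linalg.norm(D - x1, axis=1))]   # x2: farthest from x1
    z = D @ (x1 - x2)                                   # project points onto the (x1 - x2) axis
    split = z <= np.median(z)                           # split at the median projection (assumption)
    if split.all() or (~split).all():                   # degenerate split: keep the points here
        return BallNode(centre, radius, data=D)
    left = build_ball_tree(D[split], min_samples)
    right = build_ball_tree(D[~split], min_samples)
    # If both children exist, the node does not store data directly (as in the slides).
    data = None if (left is not None and right is not None) else D
    return BallNode(centre, radius, data, left, right)

tree = build_ball_tree(np.random.default_rng(9).normal(size=(500, 2)))
print(tree.radius)
```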
This algorithm retrieves the k-nearest neighbors of a point x using the Ball Tree. The recursive process
leverages the hierarchical structure for efficient searches:
1. Initial Check: The algorithm checks whether the point x lies within the current node's ball, i.e. whether
$\|x - \mathrm{node.centre}\| \le \mathrm{node.radius}$. If this condition holds, the algorithm proceeds to examine the data or children of the node.
2. Leaf Node Data: If the node contains data (i.e., it’s a leaf), the algorithm iterates through all points
xi in the node and calculates their distance from x:
$$\|x - x_i\|$$
It maintains a max heap, knn, of size k to store the k-nearest neighbors. For each point:
○ If the distance is smaller than the current maximum in knn, the point is added and the farthest
stored neighbor is removed, keeping the heap at size k:
knn.push(xi); knn.pop()
3. Recursive Search: If the node has children, the algorithm recursively checks both the left and right
child nodes:
4. Final Neighbors: Once all relevant branches have been searched, the knn heap contains the
k-nearest neighbors of x.
By combining the hierarchical partitioning of Ball Trees and the recursive search strategy, these algorithms
significantly reduce the computational complexity of nearest-neighbor searches, especially in
high-dimensional spaces. The formulas ensure a precise definition of distances and subsets at each step.
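In practice, a ready-made implementation such as scikit-learn's BallTree can be used directly; a small usage sketch (with arbitrary synthetic data and parameters) is:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(10)
X = rng.normal(size=(10000, 20))            # 10k points in 20 dimensions (assumption)

tree = BallTree(X, leaf_size=40)            # leaf_size plays the role of min_samples
query = rng.normal(size=(1, 20))
dist, idx = tree.query(query, k=5)          # distances and indices of the 5 nearest neighbors
print(idx[0], dist[0])
```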