
Som New

The document discusses the differences between supervised and unsupervised learning, emphasizing that supervised learning uses labeled data for prediction while unsupervised learning identifies patterns in unlabeled data. It details the K-Means algorithm for clustering data, including its steps, challenges, and potential improvements, such as using median instead of mean to handle noise. Additionally, it explores the connection between K-Means and neural networks, highlighting the importance of normalization and weight updates in the learning process.

Unit-4 UNSUPERVISED LEARNING

Dr. John Babu


October 21, 2024

1 Supervised Learning vs Unsupervised Learning


Supervised Learning
Supervised learning is a type of machine learning where the algorithm is trained on labeled data, meaning the
input data is paired with the correct output. The goal is to learn a mapping from inputs to outputs, so that
when given new input data, the algorithm can predict the corresponding output.

Example: Consider a dataset of house prices. Each house is represented by features such as the number of
bedrooms, square footage, and location. The price of the house is the output (or label). In supervised learning,
the algorithm is trained on this data to predict the price of a new house based on its features.

• Input (Features): Number of bedrooms, square footage, location

• Output (Label): House price

Common algorithms in supervised learning include:


• Linear regression

• Decision trees

• Support vector machines (SVM)

• Neural networks

Unsupervised Learning
Unsupervised learning is a type of machine learning where the algorithm is given data that is not labeled. The
algorithm must learn patterns, relationships, or structures in the data without explicit guidance on what the
output should be. It is mainly used for clustering or dimensionality reduction.

Example: Consider a dataset of customer data at a retail store with features like age, purchasing history, and
browsing time. There are no labels to indicate what group each customer belongs to. In unsupervised learning,
the algorithm can group similar customers together, which can be useful for targeting marketing campaigns.

• Input (Features): Age, purchasing history, browsing time

• Output: No predefined label; the algorithm finds patterns like customer segments.

Common algorithms in unsupervised learning include:

• K-means clustering

• Hierarchical clustering

• Principal component analysis (PCA)

• Autoencoders

Differences Between Supervised and Unsupervised Learning

Supervised Learning | Unsupervised Learning
Requires labeled data (input-output pairs). | Works with unlabeled data (only input features).
The goal is to learn a mapping from input to output. | The goal is to discover hidden patterns in the data.
Used for tasks like classification and regression. | Used for tasks like clustering and dimensionality reduction.
Example: Predicting house prices. | Example: Grouping customers based on purchasing behavior.
Algorithms: Linear regression, SVMs, etc. | Algorithms: K-means, PCA, etc.
Feedback provided in the form of labels. | No feedback provided during training.
Can be used for prediction. | Mainly used for exploratory data analysis.

Table 1: Comparison between Supervised and Unsupervised Learning

2 K-Means Algorithm
The K-Means algorithm is used to divide a given dataset into k distinct groups or clusters. Below is an
easy-to-follow explanation of the algorithm step by step:

Objective The goal of the K-Means algorithm is to group the data points into k clusters such that points in
the same cluster are more similar to each other than to points in other clusters.

Initial Setup You start by choosing the number of clusters (k), which you need to decide beforehand. Initially,
the algorithm assigns k random points in the input space as the centers of the clusters (called centroids).

Distance Calculation To group data points, we need to measure how close a point is to a cluster center.
This is done using a distance measure like Euclidean distance (the straight-line distance between two points).

Assigning Points to Clusters For each data point, calculate its distance from all the cluster centers. Assign
each data point to the cluster whose center is closest to it. This means each point is grouped based on the
minimum distance to a cluster center.

Updating Cluster Centers After assigning all data points to clusters, compute the mean of the points
in each cluster. The new cluster center is placed at this mean point (the average position of all points in the
cluster). This step ensures that the cluster centers move towards the center of the group of points they are
responsible for.

Iterative Process Repeat the steps of reassigning points to clusters and updating the cluster centers. This
process continues until the cluster centers stop moving or change very little, meaning the algorithm has converged
and the clusters are stable.

Stopping Criteria The algorithm stops when the cluster centers no longer change their positions significantly.
The resulting centers are a stable solution for the data, though not necessarily the best possible one.

Example Imagine a group of people (data points) being grouped around k tour guides (cluster centers).
Initially, the tour guides stand randomly, and people move toward the guide closest to them. As people gather,
the tour guides move to the center of their respective groups. This process repeats until the tour guides no
longer need to move.

K-Means Algorithm
1. Start by selecting k (number of clusters).
2. Randomly initialize k cluster centers.
3. For each point, calculate the distance to each cluster center and assign it to the closest one.
4. Recalculate the cluster center as the mean of the assigned points.
5. Repeat the process until cluster centers stop moving.
6. The output is k clusters where each point belongs to its nearest cluster.

The k-Means Algorithm
• Initialisation
  – Choose a value for k.
  – Choose k random positions in the input space.
  – Assign the cluster centres $\mu_j$ to those positions.
• Learning
  – Repeat
    * For each datapoint $x_i$:
      · Compute the distance to each cluster centre.
      · Assign the datapoint to the nearest cluster centre with distance $d_i = \min_j d(x_i, \mu_j)$.
    * For each cluster centre:
      · Move the position of the centre to the mean of the points in that cluster:
        $$\mu_j = \frac{1}{N_j} \sum_{i=1}^{N_j} x_i$$
        where $N_j$ is the number of points in cluster j.
  – Until the cluster centres stop moving.
• Usage
  – For each test point:
    * Compute the distance to each cluster centre.
    * Assign the datapoint to the nearest cluster centre with distance $d_i = \min_j d(x_i, \mu_j)$.
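
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `kmeans` and `predict` are our own, and the centres are initialised at k randomly chosen datapoints (an assumption; the notes only ask for random positions in the input space).

```python
import numpy as np

def predict(points, centres):
    """Usage step: index of the nearest centre for each point."""
    distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(data, k, n_iterations=100, seed=0):
    """Batch k-means; returns (centres, assignments)."""
    rng = np.random.default_rng(seed)
    # Assumption: start the centres at k distinct, randomly chosen datapoints.
    centres = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iterations):
        assignments = predict(data, centres)
        new_centres = centres.copy()
        for j in range(k):
            members = data[assignments == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centres[j] = members.mean(axis=0)
        if np.allclose(new_centres, centres):  # centres have stopped moving
            break
        centres = new_centres
    return centres, predict(data, centres)

data = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centres, labels = kmeans(data, k=2)
```

On this small, well-separated dataset the two centres settle at the means of the two groups; which group receives index 0 depends on the random initialisation.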

3 Issues in the k-Means Algorithm


1. Clustering Data:
• The k-means algorithm is used to cluster data points into different groups.
• The positions of the initial cluster centres (means) can greatly affect the clustering outcome.
2. Local Minima Issue:
• The algorithm can get stuck in local minima, leading to different and sometimes unexpected clustering results based on where we start.
3. Choosing the Number of Clusters:
• The k-means algorithm struggles when we do not know the optimal number of clusters in advance.
4. Improving Results with Multiple Runs:
• To overcome the issues of local minima and cluster count, we can run the k-means algorithm multiple times.
• By starting with different initial positions for the cluster centres, we increase the chances of finding a good solution.
• The best solution is typically the one that minimizes the overall sum-of-squares error.
5. Evaluating Different Values of k:
• We should experiment with various values of k to determine which one yields the best clustering result.
• Caution is needed when measuring the sum-of-squares error:
  – If k is set equal to the number of data points, each point can become its own cluster.
  – This would yield a sum-of-squares error of zero, but the solution would be overly complex and specific to the data; this is known as overfitting.
6. Using a Validation Set:
• To avoid overfitting, we can calculate the error using a separate validation set.
• By multiplying the error by k, we can assess the benefits of adding extra cluster centres and make more informed decisions about the optimal number of clusters.
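
To make the overfitting caution in item 5 concrete, the sketch below computes the sum-of-squares error for two extreme choices of k on a tiny made-up dataset. The helper name `sum_of_squares_error` and the data values are illustrative assumptions.

```python
import numpy as np

def sum_of_squares_error(data, centres):
    """Sum of squared distances from each point to its nearest centre."""
    distances = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    return float((distances.min(axis=1) ** 2).sum())

data = np.array([[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]])  # made-up points

one_centre = data.mean(axis=0, keepdims=True)  # k = 1: a single centre at the mean
one_per_point = data.copy()                    # k = number of points

error_k1 = sum_of_squares_error(data, one_centre)     # 99.5
error_kN = sum_of_squares_error(data, one_per_point)  # 0.0
```

The error of zero for k equal to the number of points says nothing about structure in the data, which is why the training error alone cannot be used to choose k.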

4 Dealing with Noise
Clustering is often used to handle noisy data. Noisy data can occur for several reasons; it might be slightly
corrupted or even completely wrong. If we can group the data points into the right clusters, we can effectively
reduce the noise. This is because we can replace each noisy data point with the average of the points in its
cluster.
However, the k-means algorithm uses the average (mean) to find the cluster centers, and the mean can be greatly
affected by outliers, which are extreme values that don’t fit with the rest of the data. For example, for
the numbers (1, 2, 1, 2, 100), the mean is 21.2, but the middle value, known as the median, is 2. The median
is a robust statistic because it is largely unaffected by outliers.
To improve our clustering method, we can replace the mean with the median when calculating the cluster
centers. This change makes the algorithm more resistant to noise. The trade-off is that the median generally
takes longer to compute than the mean.
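
The numbers in the example can be checked directly with Python's standard `statistics` module; the variable names are our own.

```python
import statistics

values = [1, 2, 1, 2, 100]   # the outlier 100 dominates the mean

mean_value = statistics.mean(values)      # 21.2
median_value = statistics.median(values)  # 2

# Without the outlier, the mean and the median nearly agree.
clean_mean = statistics.mean(values[:-1])      # 1.5
clean_median = statistics.median(values[:-1])  # 1.5
```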

5 The k-Means Neural Network


The k-means algorithm is effective despite challenges like handling noise and deciding the number of clusters.
Interestingly, it has a connection to neural networks. If we think of the cluster centers we optimize as locations
in weight space, we can position neurons at these points and apply neural network training. In the k-means
algorithm, each input identifies the closest cluster center by calculating the distance to all centers. We can
mimic this in a neural network. The position of each neuron corresponds to its weights in weight space. For
each input, we calculate the distance between the input and each neuron’s weight. During training, we adjust
the position of the neuron by changing its weights.
To implement the k-means algorithm using neurons, we will use one layer of neurons and some input nodes,
without a bias node. The first layer will consist of the input nodes that don’t perform any computations, while
the second layer will contain competitive neurons. Only one neuron will “fire” for each input, meaning that
only one cluster center can represent a particular input vector. The neuron that is closest to the input will be
chosen to fire. This approach is called winner-takes-all activation, which is a form of competitive learning. In
competitive learning, neurons compete to fire based on their closeness to the input. Each neuron may learn
to recognize a specific feature, firing only when it detects that feature. For example, you could have a neuron
trained to recognize your grandmother.
We will select k neurons and connect the inputs to the neurons fully, as usual. The activation of the neurons
will be calculated using a linear transfer function as follows:
$$h_i = \sum_j w_{ij} x_j.$$

When the inputs are normalized to have the same scale, this effectively measures the distance between the
input vector and the cluster center represented by that neuron. Higher activation values indicate that the input
and the cluster center are closer together.
The winning neuron is the one closest to the current input. The challenge is updating the position of that
neuron in weight space, or how to adjust its weights. In the original k-means algorithm, this was straightforward;
we simply set the cluster center to be the mean of all data points assigned to that center. However, in neural
network training, we often input just one vector at a time and change the weights accordingly (using online
learning rather than batch learning). This means we don’t know the mean because we only have information
about the current input.

To address this, we approximate the mean by moving the winning neuron closer to the current input. This
makes it more likely to be the best match for the next occurrence of that input. The adjustment is given by:

$$\Delta w_{ij} = \eta x_j.$$
However, this method might not be sufficient. We will explore why that is when we discuss normalization
further.

6 Normalization and Weight Updates


When working with neural networks, it’s essential to ensure that the weights of all neurons are on the same
scale. For example, if most neuron weights are small (less than 1) but one neuron’s weights are much larger
(let’s say 10), this can create problems.
Imagine we have an input vector with values (0.2, 0.2, −0.1). If this input perfectly matches a neuron with
the same small weights, that neuron's activation is 0.2 × 0.2 + 0.2 × 0.2 + (−0.1) × (−0.1) = 0.09. For a
neuron whose weights are all 10, the activation is 10 × 0.2 + 10 × 0.2 + 10 × (−0.1) = 3, which is much higher
purely because of the larger weights. This means the neuron with larger weights would be incorrectly chosen
as the best match, even though it is not the most appropriate one.
To compare activations fairly among different neurons, we need to normalize the weights. Normalization
ensures that all neurons are positioned on what is known as a unit hypersphere. This means that the distance
of each neuron from the origin (the point where all values are zero) is set to one.
The unit hypersphere is similar to a circle in two dimensions and a sphere in three dimensions, allowing us
to accurately compare the activation values of different neurons.
After normalizing, the activation of a neuron can be expressed in terms of its weights and the input vector.
This expression helps us find the similarity between the input and the neuron weights.
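
The effect of normalization can be illustrated with a short NumPy sketch. The weight value 10 for the second neuron follows the "let's say 10" example above; the helper `normalise` and the variable names are our own.

```python
import numpy as np

x = np.array([0.2, 0.2, -0.1])
weights = np.array([
    [0.2, 0.2, -0.1],    # neuron 0: matches the input exactly
    [10.0, 10.0, 10.0],  # neuron 1: much larger weights (illustrative)
])

def normalise(v):
    """Scale each vector to unit length, placing it on the unit hypersphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

raw_activations = weights @ x                  # approximately [0.09, 3.0]
raw_winner = int(np.argmax(raw_activations))   # neuron 1 wins, incorrectly

unit_activations = normalise(weights) @ normalise(x)
unit_winner = int(np.argmax(unit_activations)) # neuron 0 wins after normalisation
```

After normalisation the matching neuron has activation 1 (the maximum possible for unit vectors), while the large-weight neuron drops to about 0.58.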

7 Neuronal Activation
The neuronal activation can be expressed as:

$$h_i = W_i^T \cdot x,$$
where · denotes the inner product (or scalar product) between two vectors, and $W_i^T$ is the transpose of the
i-th row of the weight matrix W.
The inner product calculates the product $\|W_i\| \|x\| \cos\theta$, where θ is the angle between the two vectors, and
$\|\cdot\|$ represents the magnitude of the vector. If the magnitudes of all vectors are normalized to one, then only
the angle θ influences the size of the dot product.
This means that the closer the two vectors point in the same direction, the larger the activation will be.
Therefore, the angle between the vectors is crucial in determining how effectively they match each other.

8 A Better Weight Update Rule


The weight update rule previously mentioned allows weights to grow indefinitely, which can lead them to move
away from the unit hypersphere. To address this, we can normalize the inputs and use the following weight
update rule:

$$\Delta w_{ij} = \eta (x_j - w_{ij}),$$

which moves the weight $w_{ij}$ directly toward the current input. It’s important to note that only the
weights of the winning unit (the neuron that best matches the input) are updated.
For example, in our training process, we determine the activation for each neuron, find the winning neuron,
and then update its weights based on the current input.
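
The difference between the two update rules can be seen by applying each of them repeatedly to the same input. This is a small sketch with made-up values for the input, the starting weights, and the learning rate.

```python
import numpy as np

x = np.array([0.6, 0.8])          # a unit-length input (made-up values)
eta = 0.1
w_simple = np.array([1.0, 0.0])   # updated with  dw = eta * x
w_better = np.array([1.0, 0.0])   # updated with  dw = eta * (x - w)

for _ in range(200):
    w_simple = w_simple + eta * x
    w_better = w_better + eta * (x - w_better)

norm_simple = np.linalg.norm(w_simple)  # keeps growing with every update
norm_better = np.linalg.norm(w_better)  # settles at the length of x, i.e. 1
```

With the first rule the weight vector ends at (13, 16), far from the unit hypersphere; with the second it converges to the input itself.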
In many supervised learning algorithms, we minimize the sum of squares difference between the output
and the target, affecting all weights together. However, in our current approach, each weight is minimized
independently. This change makes the analysis of the algorithm’s behavior more complex, which is a common
challenge for competitive learning algorithms. Despite this complexity, competitive learning algorithms tend to
perform well.
Now that we have a better weight update rule, we can look into the overall algorithm for the online k-means
network.
The important point is that we only update the weights of the neuron that is the closest match to the
input. In this method, unlike in supervised learning where we minimize the overall error across all weights, we
minimize the error for each weight independently. This makes the analysis of the algorithm more complex, but
competitive learning algorithms like this tend to work well in practice.
Key Points:

Normalization: Ensures that all weights are on the same scale.

Unit Hypersphere: Helps in comparing activations fairly.

Weight Update Rule: Moves weights closer to inputs based on their distance.

Independence: Each weight is updated independently, which makes the analysis complex, although the method works well in practice.

9 The On-Line k-Means Algorithm


• Initialization
  – Choose a value for k, which represents the number of output nodes.
  – Initialize the weights with small random values.
• Learning
  – Normalize the data so that all points lie on the unit sphere.
  – Repeat the following steps:
    * For each datapoint:
      · Compute the activations of all the nodes.
      · Pick the winner as the node with the highest activation.
      · Update the weights using the weight update rule.
  – Continue until the number of iterations exceeds a predefined threshold.
• Usage
  – For each test point:
    * Compute the activations of all the nodes.
    * Pick the winner as the node with the highest activation.
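
A minimal NumPy sketch of the on-line algorithm follows. The function names, the learning rate, the number of epochs, and the range of the initial weights are all assumptions made for illustration; the update used is the rule Δw = η(x − w) applied to the winner only.

```python
import numpy as np

def normalise_rows(data):
    """Scale every row to unit length so all points lie on the unit sphere."""
    return data / np.linalg.norm(data, axis=1, keepdims=True)

def train_online_kmeans(data, k, eta=0.1, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    data = normalise_rows(data)
    # Small random initial weights, one row per output node.
    weights = rng.uniform(-0.1, 0.1, size=(k, data.shape[1]))
    for _ in range(n_epochs):              # fixed iteration budget
        for x in data:
            activations = weights @ x      # h_i = sum_j w_ij x_j
            winner = np.argmax(activations)
            weights[winner] += eta * (x - weights[winner])  # move winner toward x
    return weights

def winners(data, weights):
    """Usage step: index of the node with the highest activation per point."""
    return np.argmax(normalise_rows(data) @ weights.T, axis=1)

data = np.array([[1.0, 0.1], [2.0, 0.1], [0.1, 1.0],
                 [0.2, 2.5], [1.0, 2.0], [2.0, 4.0]])   # made-up points
weights = train_online_kmeans(data, k=2)
cluster_ids = winners(data, weights)
```

A node that never wins keeps its initial weights, so the outcome depends on the random initialisation; running the sketch with several seeds is a simple way to see this.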

10 Implementation of K-means with Iris Dataset


Now that we have a method to train the k-means algorithm, we can learn about data. However, we need to
consider how to interpret the results. If the data doesn’t have labels, it can be challenging to analyze the results
since we have nothing to compare them with.
We can apply unsupervised learning methods to cluster data where we know some of the labels. For instance,
we can use the k-means algorithm on the Iris dataset we studied earlier. This dataset contains three types of
iris flowers, which were previously classified using the MLP.
To use the algorithm, we only need to provide some of the data for training and then test it with additional
data. However, the output of the k-means algorithm is not straightforward since we do not utilize the labels
from the dataset in this unsupervised learning context.
To address this, we need a way to convert the algorithm’s results, which indicate the index of the best-
matching cluster, into classification outputs that we can compare with the actual labels. This conversion is
relatively simple if we use three clusters in the algorithm, as there should ideally be a one-to-one correspondence
between them. However, using more clusters might yield better results, albeit complicating the analysis.
If the number of data points is small, you can do this by hand. Alternatively, you can use a supervised
learning algorithm to automate this process, which we will discuss next.
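
One simple way to do the conversion by hand is a majority vote: each cluster is given the most common true label among its members. The sketch below uses hypothetical cluster indices and labels (not real Iris results), and the function name `map_clusters_to_labels` is our own.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels, n_clusters):
    """For each cluster, return the most common true label among its members."""
    mapping = np.full(n_clusters, -1)   # -1 marks a cluster with no members
    for c in range(n_clusters):
        members = true_labels[cluster_ids == c]
        if len(members) > 0:
            mapping[c] = np.bincount(members).argmax()
    return mapping

# Hypothetical output of a 3-cluster run and the matching class labels.
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1, 0])
true_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

mapping = map_clusters_to_labels(cluster_ids, true_labels, n_clusters=3)  # [1, 2, 0]
predicted = mapping[cluster_ids]
accuracy = float((predicted == true_labels).mean())  # 8 of 9 correct
```

Note that the cluster indices (2, 0, 1) do not follow the order of the class labels (0, 1, 2), which is exactly why the mapping step is needed.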
To illustrate how the k-means algorithm is applied, let’s explore its use on the Iris dataset:

11 Key Points on Competitive Learning for Clustering
• Cluster Assignment: After training, a new datapoint can be classified by presenting it to the trained
algorithm, which determines the activated cluster.
• Class Label Interpretation: When target data is available, the best-matching cluster can be interpreted
as a class label. However, careful consideration is needed since the order of nodes may not align with the
order of target classes.
• Matching Output Classes: To ensure accurate classification, it is essential to match output classes to
target data carefully. Misalignment can lead to misleading results.
• Combining K-Means and Perceptrons: The k-means algorithm can position Radial Basis Function
(RBF) nodes in the input space, which accurately represents the input data. A Perceptron can be trained
on top of these RBFs to match output classes to the target data categories.
• Utilizing More Clusters: Using more clusters in the k-means network allows for better representation
of the input data without needing to manually assign datapoints to clusters since the Perceptron handles
this classification.

12 Difference between Normal K-Means and Competitive K-Means


Normal K-Means is the standard clustering algorithm where the assignment of data points to centroids is
purely based on the distance (usually Euclidean distance). The algorithm repeatedly computes the mean of
all points assigned to a cluster and updates the cluster centroids.
Competitive K-Means introduces the concept of competitive learning into K-Means, where centroids
“compete” for each data point. The key difference is how centroids are updated. In Competitive K-Means:
• The centroids are updated not based on the average of the assigned points but in a way that the “winning”
centroid (the closest one) adjusts itself incrementally toward the input data point.
• This is similar to competitive learning in neural networks, where only the winning neuron (centroid) is
adjusted during each iteration.

Noise Removal with Competitive Learning


Competitive learning can also be used for noise removal during the clustering process. In datasets with noise,
outlier points can disrupt the clustering process in normal K-Means, as these points can heavily affect the
mean of the clusters. However, in Competitive K-Means, the incremental update strategy reduces the impact
of noisy points. Since the centroids are updated gradually towards each data point, they adapt more slowly to
outliers. Additionally, noisy points may not win the competition frequently, so their influence on the centroids
is minimized, effectively filtering out noise over time.

Intuition with Example


Let us consider a simple numerical example to highlight the differences.
Data Points: We have four data points:

(2, 2), (3, 4), (5, 6), (8, 8)

and we want to cluster them into 2 clusters (k = 2).


Initial Centroids: Assume initial centroids are:

Centroid 1 : (2, 2), Centroid 2 : (7, 7)

1. Normal K-Means
In normal K-Means, we follow these steps:
Step 1: Assignment
For each point, compute the distance to both centroids and assign the point to the nearest centroid.
Distances:
• Point (2,2) → Distances to Centroid 1: 0, to Centroid 2: 7.07 → Assigned to Centroid 1.
• Point (3,4) → Distances to Centroid 1: 2.24, to Centroid 2: 5 → Assigned to Centroid 1.
• Point (5,6) → Distances to Centroid 1: 5, to Centroid 2: 2.24 → Assigned to Centroid 2.
• Point (8,8) → Distances to Centroid 1: 8.49, to Centroid 2: 1.41 → Assigned to Centroid 2.

Step 2: Update

• New Centroid 1: Mean of (2,2) and (3,4) → (2.5, 3)

• New Centroid 2: Mean of (5,6) and (8,8) → (6.5, 7)

Step 3: Repeat
Re-assign and update until convergence. This iterative process continues until the centroids stop moving
significantly.
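
The assignment and update steps of this example can be reproduced with a short NumPy sketch (variable names are our own). It also runs the assignment step a second time to check whether anything changes.

```python
import numpy as np

points = np.array([[2, 2], [3, 4], [5, 6], [8, 8]], dtype=float)
centroids = np.array([[2, 2], [7, 7]], dtype=float)

# Step 1: Euclidean distance from every point to every centroid, shape (4, 2).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = distances.argmin(axis=1)   # [0, 0, 1, 1]

# Step 2: each centroid moves to the mean of its assigned points.
new_centroids = np.array([points[assignments == j].mean(axis=0) for j in range(2)])

# Step 3: re-assign with the new centroids.
new_distances = np.linalg.norm(points[:, None, :] - new_centroids[None, :, :], axis=2)
converged = bool((new_distances.argmin(axis=1) == assignments).all())
```

Here `new_centroids` is [[2.5, 3.0], [6.5, 7.0]], and `converged` is True because the second assignment step leaves every point in the same cluster.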

2. Competitive K-Means
In Competitive K-Means, the centroids compete for each data point and only the “winning” centroid is updated
incrementally towards the input data point, rather than updating based on the average of all assigned points.
Step 1: Assignment (Competition)
Like normal K-Means, we assign points to the nearest centroid based on distance.
Step 2: Incremental Update
Instead of calculating the mean, the winning centroid is incrementally adjusted towards each assigned data
point. This adjustment is often controlled by a learning rate (η).
Set learning rate η = 0.5. Let us go through each point:

• Point (2, 2): Closest to Centroid 1.

New Centroid 1 = (2, 2) + 0.5 × ((2, 2) − (2, 2)) = (2, 2) (no change)

• Point (3, 4): Closest to Centroid 1.

New Centroid 1 = (2, 2) + 0.5 × ((3, 4) − (2, 2)) = (2.5, 3)

• Point (5, 6): Closest to Centroid 2.

New Centroid 2 = (7, 7) + 0.5 × ((5, 6) − (7, 7)) = (6, 6.5)

• Point (8, 8): Closest to Centroid 2.

New Centroid 2 = (6, 6.5) + 0.5 × ((8, 8) − (6, 6.5)) = (7, 7.25)

Repeat
The process repeats for each new point with incremental updates to the centroids.
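
One pass of the incremental updates above can be sketched as a loop over the points in the order given. The variable names are our own; the learning rate 0.5 and the starting centroids are those of the example.

```python
import numpy as np

points = np.array([[2, 2], [3, 4], [5, 6], [8, 8]], dtype=float)
centroids = np.array([[2, 2], [7, 7]], dtype=float)
eta = 0.5

history = []
for x in points:
    # Competition: the closest centroid wins.
    winner = int(np.linalg.norm(centroids - x, axis=1).argmin())
    # Only the winner moves, by a fraction eta of the way toward x.
    centroids[winner] += eta * (x - centroids[winner])
    history.append((winner, centroids[winner].copy()))
```

After the pass, `centroids` is [[2.5, 3.0], [7.0, 7.25]]. Unlike the batch update, the result depends on the order in which the points are presented.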

Key Differences
• Centroid Update Method:
  – Normal K-Means: Centroids are updated by averaging all assigned points.
  – Competitive K-Means: The winning centroid is updated incrementally toward each data point
    (using a learning rate).
• Convergence Speed:
  – Normal K-Means may converge faster as centroids move directly to the mean.
  – Competitive K-Means may require more iterations due to incremental updates but could offer
    better adaptability, especially for online data where new points arrive sequentially.
• Suitability:
  – Normal K-Means works well with batch data and when all data points are available.
  – Competitive K-Means is more suitable for streaming or dynamic data, where centroids continuously
    evolve based on incoming data points.

Basics for Choosing k in k-Means Clustering
Definition of k in k-Means:
• In k-means clustering, k represents the number of clusters you aim to partition your dataset into. Each
data point is assigned to one of these k clusters based on proximity to the cluster centroids.
Elbow Method:
• The Elbow Method is a popular technique for selecting k. It involves plotting the sum of squared distances
(SSD) between data points and their assigned cluster centroids for different values of k.
• As k increases, SSD decreases because the clusters better represent the data. However, after a certain
point, the improvement becomes marginal, and the plot starts to “bend” or form an elbow. The value of
k at this elbow is often a good choice, as it balances model accuracy and simplicity.
Silhouette Score:
• The silhouette score measures how similar a data point is to its own cluster compared to other clusters.
A high silhouette score indicates that data points are well-clustered.
• By calculating the silhouette score for different values of k, the k with the highest average score is
considered the best choice.
Domain Knowledge:
• Often, the choice of k depends on prior knowledge about the data. For example, if you are clustering
customer groups in a retail setting, business objectives or existing customer segmentation models may
guide you in choosing a suitable k.
Cross-Validation:
• In cases where you can validate the clustering performance with some labeled data, cross-validation
techniques can help determine the optimal k. You can assess the accuracy of clustering for different
values of k based on external validation metrics like purity or adjusted Rand index.
Gap Statistic:
• The Gap Statistic compares the total within-cluster variation for different values of k with the expected
variation under a random distribution. The best k is the one with the largest gap, indicating that the
clustering structure is far from random.
Hierarchical Clustering for Pre-analysis:
• Performing hierarchical clustering before running k-means can help estimate a good value for k. Dendrograms
from hierarchical clustering can visually suggest a natural number of clusters in the data.
Practical Considerations:
• Data Size: For large datasets, too many clusters can lead to overfitting, while too few clusters may not
capture the true underlying structure.
• Computational Resources: The time complexity of k-means increases with k. Choosing a large k for
big datasets may be computationally expensive, so practical constraints should be considered.
Conclusion: Choosing k in k-means clustering is not straightforward, and multiple techniques should be
used in conjunction to select the most appropriate value. Methods like the Elbow Method, Silhouette Score,
and Gap Statistic, combined with domain knowledge and practical considerations, provide a robust approach
to determining k.
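
As an illustration of the Elbow Method, the sketch below computes the SSD for k = 1 to 6 on synthetic data drawn around three centres. Everything here is an assumption made for the example: the synthetic data, the function name `kmeans_ssd`, and the use of 30 random restarts per k to reduce the effect of local minima.

```python
import numpy as np

def kmeans_ssd(data, k, n_restarts=30, n_iterations=20, seed=0):
    """Lowest sum of squared distances found over several random restarts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_restarts):
        centres = data[rng.choice(len(data), size=k, replace=False)].copy()
        for _ in range(n_iterations):
            d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):   # skip clusters with no members
                    centres[j] = data[labels == j].mean(axis=0)
        d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        best = min(best, float((d.min(axis=1) ** 2).sum()))
    return best

# Synthetic data: 30 points around each of three well-separated centres.
rng = np.random.default_rng(1)
blob_centres = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centres])

ssd = {k: kmeans_ssd(data, k) for k in range(1, 7)}
drops = {k: ssd[k - 1] - ssd[k] for k in range(2, 7)}  # improvement from adding a centre
```

On data like this, the SSD falls sharply up to k = 3 and only marginally afterwards, so the elbow sits at the number of centres used to generate the data. Plotting `ssd` against k makes the bend visible.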

13 Vector Quantization
In Vector Quantization (VQ), we use competitive learning to compress data, a concept closely related to
noise removal. Data compression is essential in many applications, such as storing data or transmitting speech
and image data. The main idea is to reduce the amount of data by replacing each data point with the cluster center
or “prototype vector” of the cluster that the data point belongs to.
In noise reduction, the goal is to replace a noisy input with a cleaner, representative cluster center. In
data compression, the purpose is to minimize the amount of data being transmitted or stored. The method
for both involves using competitive learning to identify the closest cluster center to each data point and replacing
the input with this center.

Data Compression as Communication
Let us consider a scenario where we need to send data but aim to minimize the amount of information transmitted
to save resources. If there are many repeated data points, we can create a codebook of representative points
(prototype vectors). Instead of transmitting each data point, we can send only the index of the data point in
the codebook, which is shorter and more efficient.
The receiver would use the indices to retrieve the actual data from the codebook. To further optimize,
more frequent data points can be assigned shorter indices. This method is widely used in sound and image
compression algorithms.

Lossy Compression and Vector Quantization


One challenge arises when a data point is not in the codebook. In that case, we send the index of the closest
available prototype vector. This is the core idea behind vector quantization, which leads to lossy compression.
In lossy compression, the data received may not be identical to the original, but it is close enough for
practical use.

Voronoi Tessellation and Delaunay Triangulation


To understand Voronoi Tessellation and Delaunay Triangulation, we can think of the following:
In Voronoi Tessellation, we divide the space into regions around each prototype vector (cluster center).
Imagine a group of cities on a map, and we want to assign every location on the map to its nearest city. Each
city will “own” a region where all locations inside it are closer to that city than any other. This region around
the city is called a Voronoi cell. All the data points that lie within the same cell are considered to be closer
to that city (or prototype vector) than any other.
Each prototype vector acts like the center of its own region. These regions form a pattern across the
space, and together they create what we call the Voronoi Tessellation, a way of dividing up space based on
distance to the prototype vectors. Every data point inside a region is represented by the same prototype vector,
which simplifies how we store or transmit the data.
Delaunay Triangulation is another way to understand the relationships between prototype vectors. If we
connect the prototype vectors (or “cities”) that share a boundary in the Voronoi Tessellation, we create a set
of triangles. These triangles form what is known as the Delaunay Triangulation, which shows the optimal way
to organize the space for certain tasks, like function approximation.
In simpler terms, Delaunay Triangulation is like drawing lines between neighboring cities that share a
common boundary, creating a web of connections between them. This structure helps us in efficiently organizing
and processing data, especially in algorithms that need to approximate or interpolate between points.

Choosing the Prototype Vectors


The key challenge in vector quantization is choosing prototype vectors that are close to all possible inputs. This
is where competitive learning comes into play. The algorithm that selects the best prototype vectors is called
Learning Vector Quantization (LVQ). The k-means algorithm is a common approach, especially when we
know the size of the codebook. Another useful algorithm is the Self-Organizing Feature Map (SOM),
which we will discuss next.

Simple Example for Compression
Let us consider the following data points: (1, 2), (2, 3), (8, 9), and (9, 10). Instead of sending each data point,
we can group similar points into clusters. Let’s say we create two prototype vectors:

ˆ Prototype 1: (1.5, 2.5)

ˆ Prototype 2: (8.5, 9.5)

Now, instead of sending the full data points, we only send indices referring to these prototypes:
ˆ (1, 2) → Prototype 1

ˆ (2, 3) → Prototype 1

ˆ (8, 9) → Prototype 2

ˆ (9, 10) → Prototype 2

By sending indices instead of full data points, the transmission becomes much more efficient.
This process is the basis of vector quantization and is used in various compression algorithms in modern
technology.
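
The compression example above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `encode` and `decode` are hypothetical, and the indices are 0-based, so Prototype 1 is index 0 and Prototype 2 is index 1.

```python
import numpy as np

# Data points and prototype vectors taken from the example above.
points = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)
codebook = np.array([[1.5, 2.5], [8.5, 9.5]])

def encode(points, codebook):
    """Return, for each point, the index of its nearest prototype."""
    # distances[i, j] = Euclidean distance from point i to prototype j
    distances = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2)
    return distances.argmin(axis=1)

def decode(indices, codebook):
    """Rebuild an approximation of the data from the transmitted indices."""
    return codebook[indices]

indices = encode(points, codebook)           # what the sender transmits
reconstruction = decode(indices, codebook)   # what the receiver recovers
```

The receiver recovers the prototype, not the original point, so the compression is lossy: each point is replaced by the center of its Voronoi cell.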

14 Self-Organizing Feature Map (SOM)


The Self-Organizing Feature Map (SOM) is a competitive learning algorithm widely used in machine
learning, especially for tasks such as pattern recognition and data visualization. This algorithm was introduced
by Teuvo Kohonen in 1982, and it is modeled on how sensory inputs are organized in the brain, particularly in
areas like the auditory cortex. In the brain, neurons that respond to similar inputs are placed closer to each
other. SOM attempts to mimic this organization in a computational model.

Key Concepts
ˆ Feature Mapping: In SOM, the neurons are arranged in a grid, typically in a 2D grid (but can also be
1D). Neurons that are close to each other in this grid will respond to similar input patterns, while neurons
that are far apart will respond to very different inputs. This is known as topology preservation.
ˆ Topology Preservation: In simpler terms, topology preservation means that the structure of the data
(the relationship between different inputs) is maintained on the grid. If two inputs are similar, the neurons
that represent them will be near each other. If the inputs are very different, the neurons will be far apart.
However, since most input spaces have higher dimensions than the grid (which is usually 2D), it’s not
always possible to perfectly preserve the structure of the data.

Example
Let us say we have different sounds, such as:

ˆ Sound A: A low-pitch sound.

ˆ Sound B: A medium-pitch sound.

ˆ Sound C: A high-pitch sound.

In SOM, the neurons responding to Sound A and Sound B will be close together on the map because they are
similar in pitch. Neurons responding to Sound C will be farther away, as it is very different from Sound A and
Sound B.

15 How Self-Organizing Maps (SOM) Work


Self-Organizing Maps (SOM) are a type of artificial neural network used for unsupervised learning. They are
particularly effective for clustering and visualizing high-dimensional data. Here is a step-by-step explanation of
how SOM works:

ˆ Initialization
The process starts by initializing a grid of neurons. This grid can be one-dimensional (a line) or two-
dimensional (a rectangular array). Each neuron is assigned a weight vector, which is usually initialized
randomly. The weight vector has the same dimensionality as the input data.
ˆ Input Data Presentation
The SOM is trained using a dataset of input patterns. In each iteration, a single input pattern (a data
point) is presented to the SOM.
ˆ Best Matching Unit (BMU) Identification
For the presented input pattern, the SOM identifies the neuron whose weight vector is most similar to
the input. This neuron is called the Best Matching Unit (BMU). The similarity is typically measured
using a distance metric, such as the Euclidean distance. The BMU is the neuron with the smallest distance
to the input vector.
ˆ Weight Adjustment
Once the BMU is identified, its weights are updated to become more similar to the input pattern. The
adjustment is based on the distance from the BMU to other neurons in the grid. The update rule can be
described as follows:

$$W_i(t+1) = W_i(t) + \alpha(t)\, h_{ci}(t)\, \bigl(X(t) - W_i(t)\bigr) \tag{1}$$

where:
– $W_i(t)$ is the weight vector of neuron $i$ at time $t$.
– $\alpha(t)$ is the learning rate, which decreases over time.
– $h_{ci}(t)$ is the neighborhood function, which determines the influence of the BMU on its neighboring
neurons.
– $X(t)$ is the input vector at time $t$.
Neurons that are close to the BMU also have their weights adjusted, but to a lesser extent, based on their
distance from the BMU. This graded update is controlled by the neighborhood function; the Mexican Hat
function is one model of this lateral interaction.
ˆ Iteration and Convergence
The process of presenting input patterns, identifying the BMU, and adjusting weights is repeated for a
number of iterations. As training progresses, the learning rate and the size of the neighborhood decrease.
This allows the SOM to stabilize, clustering similar input patterns together while preserving the topological
relationships.
ˆ Result
After training, the SOM produces a 2D representation of the input space, where similar input patterns
are mapped to nearby neurons. This makes it easier to visualize and analyze complex data.
ˆ Example
Consider a dataset containing different types of fruits. Each fruit can be represented by features such as
color, size, and sweetness. When this dataset is fed into a SOM, fruits with similar characteristics will be
grouped together on the map. For example, apples and pears might be located close to each other, while
oranges are placed farther away, reflecting their different features.

SOM Process
1. Initialization: We start by initializing a grid of neurons, each with a random weight vector.
2. Input Presentation: For each input data point, we find the neuron whose weight vector is closest to
the input. This neuron is called the Best Matching Unit (BMU).
3. Update Weights: After finding the BMU, we adjust the weights of the BMU and its neighbors so that
they become closer to the input data point. The amount of adjustment depends on the distance from the
BMU: closer neurons are adjusted more, and farther neurons are adjusted less.
4. Repeat: This process is repeated for many input data points until the grid of neurons has organized itself
into a meaningful pattern that represents the structure of the data.
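
One pass through steps 2 and 3 can be sketched as follows, using the update rule of Equation (1). This is a minimal sketch under stated assumptions: the function names `find_bmu` and `som_step` are hypothetical, and the Gaussian neighborhood with width `sigma` is one common choice that the notes do not prescribe.

```python
import numpy as np

def find_bmu(weights, x):
    """Return the (row, col) grid position of the Best Matching Unit.

    weights has shape (rows, cols, dim); x has shape (dim,).
    """
    distances = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(distances.argmin(), distances.shape)

def som_step(weights, x, alpha, sigma):
    """Apply W_i <- W_i + alpha * h_ci * (x - W_i) to every neuron, in place."""
    rows, cols, _ = weights.shape
    bmu_row, bmu_col = find_bmu(weights, x)
    # Squared distance on the grid from every neuron to the BMU.
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist_sq = (r - bmu_row) ** 2 + (c - bmu_col) ** 2
    # Assumed Gaussian neighborhood: 1 at the BMU, smaller farther away.
    h = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
    weights += alpha * h[:, :, None] * (x - weights)
    return (bmu_row, bmu_col)

rng = np.random.default_rng(0)
weights = rng.random((5, 5, 3))          # step 1: random initialization
x = np.array([0.2, 0.7, 0.1])            # one input pattern
before = np.linalg.norm(weights - x, axis=2)
bmu = som_step(weights, x, alpha=0.5, sigma=1.0)
after = np.linalg.norm(weights - x, axis=2)
```

Every neuron moves toward the input, but the BMU moves the most (here it covers half of its distance to `x`), and neurons far away on the grid barely move.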

Voronoi Tessellation in SOM
SOM also forms Voronoi tessellations, similar to what we have seen in competitive learning. Each neuron
represents a region of the input space, and the input points that are closest to a particular neuron belong to
that neuron’s region.

Real-Time Example
Let us consider the example of classifying flower species (e.g., iris flowers) based on features like petal length,
petal width, sepal length, and sepal width. After training the SOM on flower data, the neurons that represent
similar species will be close to each other on the grid, while neurons representing very different species will be
farther apart. This helps us visualize the relationships between different species on a 2D map.

16 Mexican Hat Function in Self-Organizing Maps (SOM)


The Mexican Hat function, also known as the Ricker wavelet, is a mathematical function commonly used
in various fields, including signal processing and neural networks. In the context of Self-Organizing Maps
(SOM), it serves as a metaphor for the strength of lateral connections between neurons, representing how
nearby neurons interact with one another.

Intuition Behind the Mexican Hat Function in SOM


1. Shape and Characteristics:
ˆ The Mexican Hat function resembles a sombrero: a positive peak in the center, a shallow negative
trough surrounding it, and a return toward zero farther out.
ˆ Mathematically, it is (up to sign and scale) the second derivative of a Gaussian function, which means
it captures the local structure around a point.
2. Lateral Connections:
ˆ In SOM, when a neuron (let’s call it the Best Matching Unit, or BMU) is activated by an input,
it influences its neighboring neurons.
ˆ The strength of this influence diminishes with distance from the BMU. Neurons that are close to the
BMU receive a stronger pull, while those that are farther away experience weaker interactions.
3. Pulling and Pushing:

ˆ The Mexican Hat function can be visualized as follows:


– Positive Influence (Pull): Neurons close to the BMU are attracted, meaning their weights
will be adjusted to be more similar to the input pattern. This creates a clustering effect where
similar features are grouped together.
– Negative Influence (Push): Neurons further away from the BMU experience a repelling effect.
Their weights are adjusted to be less similar to the input pattern, which helps prevent them from
clustering with the BMU.

4. Topological Preservation:
ˆ The Mexican Hat function supports the concept of topological preservation. It ensures that
similar input patterns (which are close together in input space) will result in neurons that are also
close together in the SOM grid.
ˆ This characteristic allows the SOM to represent complex input spaces in a reduced dimensional form
while maintaining the structure of the original data.
5. Learning Dynamics:
ˆ During training, as the SOM processes input patterns, the Mexican Hat function guides the learning
process.
ˆ Neurons that respond to similar features will adjust their weights closer together, while those that
respond to different features will move apart, effectively learning the underlying structure of the
input data.
6. Example to Illustrate:
ˆ Consider a scenario where the SOM is trained on images of animals, such as cats, dogs, and birds.

ˆ When an image of a cat is presented:
– The neuron that best matches the image (BMU) is activated.
– Nearby neurons, which may represent similar features (like whiskers or fur patterns), are pulled
closer in weight space, reinforcing their similarity to the cat image.
– Neurons further away, representing animals with very different features (like a bird), are pushed
away, preventing them from being wrongly associated with the cat.
7. Summary:
ˆ The Mexican Hat function in SOM embodies the principle of competitive learning and the importance
of local structure in data.
ˆ By establishing a mechanism for neurons to pull together similar features while pushing apart dis-
similar ones, it effectively allows the SOM to learn and represent complex relationships in high-
dimensional data spaces in a lower-dimensional grid.
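
The pull and push regions described above can be seen by evaluating the function at a few distances. The sketch below uses the unnormalized Ricker profile $(1 - r^2/\sigma^2)\,e^{-r^2/(2\sigma^2)}$; the function name `mexican_hat` and the width parameter `sigma` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def mexican_hat(distance, sigma=1.0):
    """Unnormalized Mexican Hat (Ricker) profile as a function of distance."""
    scaled_sq = (np.asarray(distance, dtype=float) / sigma) ** 2
    return (1.0 - scaled_sq) * np.exp(-scaled_sq / 2.0)

# Distances from the BMU, measured in units of sigma.
distances = np.array([0.0, 0.5, 1.0, 2.0, 5.0])
influence = mexican_hat(distances)
```

The influence is positive (pull) for distances below `sigma`, crosses zero at `sigma`, is negative (push) just beyond it, and fades to almost nothing far away.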

Conclusion
The Self-Organizing Feature Map is a powerful tool for dimensionality reduction and pattern recognition. By
organizing neurons into a grid and preserving the topology of the input data, SOM allows us to visualize and
understand the relationships between different data points.
In summary:
ˆ SOM arranges neurons in a grid, where nearby neurons respond to similar inputs.
ˆ Topology preservation means that the grid tries to reflect the structure of the data.
ˆ Neurons interact through lateral connections, where winning neurons pull neighbors closer in weight space.
ˆ SOM is useful for tasks like data visualization and pattern recognition.

17 The SOM Algorithm


Using the full Mexican Hat lateral interactions between neurons is fine, but it isn’t essential. In Kohonen’s
SOM algorithm, the weight update rule is modified to include information about neighboring neurons, making
the algorithm simpler. The algorithm is a competitive learning algorithm, where one neuron is chosen as the
winner. When its weights are updated, the weights of its neighbors are updated as well, although to a lesser
extent. Neurons that are not within the neighborhood are ignored, not repelled.
We will now look at the Self-Organising Feature Map Algorithm.

Self-Organising Feature Map Algorithm


ˆ Initialization
– Choose a size (number of neurons) and number of dimensions d for the map.
– Either:
* Choose random values for the weight vectors so that they are all different, OR
* Set the weight values to increase in the direction of the first d principal components of the
dataset.
ˆ Learning
– Repeat:
* For each datapoint:
· Select the best-matching neuron nb using the minimum Euclidean distance between the
weights and the input:
$$n_b = \arg\min_j \lVert x - w_j^T \rVert$$
· Update the weight vector of the best-matching node using:

$$w_j^T \leftarrow w_j^T + \eta(t)\,(x - w_j^T)$$

where η(t) is the learning rate.


· Update the weight vector of all other neurons using:

$$w_j^T \leftarrow w_j^T + \eta_n(t)\, h(n_b, t)\,(x - w_j^T)$$

where $\eta_n(t)$ is the learning rate for neighborhood nodes, and $h(n_b, t)$ is the neighborhood
function that decides whether each neuron should be included in the neighborhood of the
winning neuron (so $h = 1$ for neighbors and $h = 0$ for non-neighbors).
· Reduce the learning rates and adjust the neighborhood function, typically by:
$$\eta(t+1) = \alpha\, \eta(t)^{k/k_{\max}}$$

where $0 \le \alpha \le 1$ decides how fast the size decreases, $k$ is the number of iterations the
algorithm has been running for, and $k_{\max}$ is when learning is intended to stop. The same
equation is used for both learning rates ($\eta$, $\eta_n$) and the neighborhood function $h(n_b, t)$.
– Until the map stops changing or some maximum number of iterations is exceeded.
ˆ Usage

– For each test point:


* Select the best-matching neuron nb using the minimum Euclidean distance between the weights
and the input:
$$n_b = \arg\min_j \lVert x - w_j^T \rVert$$
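
The three phases of the algorithm (initialization, learning, usage) might be put together as in the sketch below. Several details are assumptions made for illustration and are labeled in the comments: the function names `train_som` and `best_matching_neuron`, all default parameter values, and the decay schedule, which is a plain exponential interpolation from a start value to an end value rather than the exact formula given above.

```python
import numpy as np

def train_som(data, rows=5, cols=5, n_iterations=100,
              eta_start=0.3, eta_end=0.01,        # winner learning rate (assumed values)
              eta_n_start=0.1, eta_n_end=0.01,    # neighbor learning rate (assumed values)
              radius_start=2.0, radius_end=0.5,   # neighborhood radius (assumed values)
              seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    # Initialization: random weights inside the range of the data.
    weights = rng.uniform(data.min(), data.max(), size=(rows * cols, dim))
    # Fixed (row, col) position of every neuron on the map.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

    for k in range(n_iterations):
        frac = k / n_iterations
        # Assumed schedule: exponential decay from the start to the end value.
        eta = eta_start * (eta_end / eta_start) ** frac
        eta_n = eta_n_start * (eta_n_end / eta_n_start) ** frac
        radius = radius_start * (radius_end / radius_start) ** frac

        for x in data[rng.permutation(len(data))]:
            # Best-matching neuron: minimum Euclidean distance to the input.
            best = np.linalg.norm(weights - x, axis=1).argmin()
            # h = 1 for neurons within the radius of the winner, h = 0 otherwise.
            h = np.linalg.norm(grid - grid[best], axis=1) <= radius
            h[best] = False  # the winner gets its own update below
            weights[best] += eta * (x - weights[best])
            weights[h] += eta_n * (x - weights[h])
    return weights.reshape(rows, cols, dim)

def best_matching_neuron(weights, x):
    """Usage phase: grid position of the neuron closest to a test point."""
    distances = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(distances.argmin(), distances.shape)

rng = np.random.default_rng(1)
data = rng.uniform(-1, 1, size=(200, 2))
som = train_som(data)
position = best_matching_neuron(som, np.array([0.5, -0.5]))
```

Neurons outside the neighborhood are simply left unchanged, matching the point made earlier that they are ignored rather than repelled.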

Neighborhood Connections
The size of the neighborhood is another parameter that we need to control. How large should the neighborhood
of a neuron be? If we start our network off with random weights, as we did for the Multi-Layer Perceptron
(MLP), then at the beginning of learning, the network is pretty well unordered (as the weights are random).
Two nodes that are very close in weight space could be on opposite sides of the map, and vice versa. Therefore,
it makes sense that the neighborhoods should be large to ensure we achieve a rough ordering of the network
correctly.
However, once the network has been learning for a while, the rough ordering has already been established,
and the algorithm starts to fine-tune the individual local regions of the network. At this stage, the neighborhoods
should be smaller, as shown in Figure.

Neighborhood Connections in Self-Organizing Maps
Neighborhood connections play a crucial role in the functioning of Self-Organizing Maps (SOMs) by facili-
tating the competitive learning process and enhancing the mapping of input data to the neuron grid. The size
and dynamics of these neighborhood connections are critical for effective learning and topology preservation.

Importance of Neighborhood Connections

1. Rough Ordering of the Network:


ˆ At the beginning of the learning process, when the network weights are initialized randomly, the
network lacks any structured organization. During this initial phase, it is essential to have large
neighborhood sizes.
ˆ Large neighborhoods help ensure that even if neurons are poorly organized in weight space, neighbor-
ing neurons still influence each other significantly. This facilitates the rough ordering of the neurons
in the map, allowing similar inputs to begin clustering together.
2. Fine-Tuning the Network:

ˆ As learning progresses and the network starts to develop a rough order, the focus shifts to refining
local regions of the map. During this phase, it is beneficial to reduce the neighborhood size.
ˆ Smaller neighborhoods allow for more localized adjustments, enabling the network to fine-tune the
weights of neurons that are close to the Best Matching Unit (BMU) without excessively affecting
distant neurons. This fine-tuning is critical for improving the accuracy and effectiveness of the
mapping.
3. Phases of Learning:
ˆ The two phases of learning in SOMs are typically referred to as ordering and convergence. Dur-
ing the ordering phase, large neighborhoods help establish the basic structure of the map. In the
convergence phase, smaller neighborhoods promote detailed adjustments to enhance the map’s rep-
resentation of the data.
4. Dynamic Adjustment of Neighborhood Size:
ˆ The size of the neighborhood is not static; it changes dynamically throughout the training process.
This is often implemented by reducing the neighborhood size gradually at each iteration of the
algorithm.
ˆ This dynamic adjustment mirrors the control of the learning rate, which also starts large and decreases
over time, helping the network stabilize and fine-tune the representations.
5. Implementation Considerations:
ˆ As the size of the neighborhood changes during learning, implementing actual connections between
nodes becomes impractical. Instead, a distance matrix is typically maintained to measure the dis-
tances between neurons in the network.
ˆ The nodes considered part of a neuron’s neighborhood are defined as those within a neighborhood
radius that shrinks over time, adapting to the learning phases.
6. Initialization of Weights:
ˆ One effective way to initialize the weights in the SOM is through Principal Components Analysis
(PCA). This method identifies the two largest directions of variation in the data and initializes the
weights to align along these directions.

ˆ When weights are initialized in this manner, the rough ordering of the network is established from
the start, allowing the training process to focus on smaller neighborhood sizes right away.
7. Limitations of Batch Learning:

ˆ While SOMs are typically designed for batch learning, there may be scenarios where online learning is
preferred. However, the SOM’s architecture and dynamics make it less suited for online applications.
ˆ In such cases, alternatives like Fritzke’s Growing Neural Gas and Marsland’s Grow When Re-
quired Network may be considered, as these networks are specifically designed to handle online
learning situations.

Neighborhood connections in Self-Organizing Maps are fundamental for establishing the structure of the map
and ensuring effective learning. By dynamically adjusting the neighborhood size throughout the training process,
SOMs can effectively cluster similar input patterns and refine their representations, leading to meaningful data
organization. Understanding the role of these connections helps improve the performance and adaptability of
SOMs in various applications.
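
The distance matrix and shrinking radius mentioned under Implementation Considerations could look like the following sketch. The function names `grid_distance_matrix` and `neighborhood` and the radius values are hypothetical, chosen only to show the idea on a small 3 × 3 map.

```python
import numpy as np

def grid_distance_matrix(rows, cols):
    """Pairwise Euclidean distances between neuron positions on the map."""
    positions = np.array([(r, c) for r in range(rows) for c in range(cols)],
                         dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    return np.linalg.norm(diff, axis=2)

def neighborhood(dist_matrix, winner, radius):
    """Indices of all neurons within `radius` of the winner (winner included)."""
    return np.flatnonzero(dist_matrix[winner] <= radius)

dist = grid_distance_matrix(3, 3)                     # computed once, before training
early = neighborhood(dist, winner=4, radius=1.5)      # ordering phase: large radius
middle = neighborhood(dist, winner=4, radius=1.0)     # radius partly reduced
late = neighborhood(dist, winner=4, radius=0.5)       # convergence phase: small radius
```

Because the matrix is computed once, changing the neighborhood during training only means comparing one row of it against a smaller radius; no connections have to be created or removed.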

Weight Initialization
There is another way to initialize the weights in the network, which is to use Principal Components Analysis
(PCA) to find the two (assuming that the map is two-dimensional) largest directions of variation in the data.
The weights are then initialized so that they increase along these two directions. This means that the ordering
part of the training has already been done during initialization, allowing the algorithm to be trained with a
small neighborhood size from the start.
However, this is only possible if the training of the algorithm is in batch mode, so that all of the data is
available for training from the beginning. This should be true for the SOM; it is not designed for online learning.
This can be a limitation, as there are many cases where we would like to perform unsupervised online learning.
One option is to ignore that constraint and use the SOM anyway, which is fairly common. However, the
size of the map becomes crucial, and there is no guarantee that the SOM will converge to a solution unless
batch learning is applied. Alternatively, we can use a variety of networks designed to address this situation.
Fritzke's "Growing Neural Gas" and Marsland's "Grow When Required" Network are two of the more common
approaches.
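
As a rough sketch of the PCA-based initialization for a two-dimensional map, the weights can be laid out on a plane spanned by the first two principal directions. The function name `pca_init` and the choice to spread the grid over one standard deviation on either side of the mean are assumptions made for illustration.

```python
import numpy as np

def pca_init(data, rows, cols):
    """Initialize SOM weights on the plane of the first two principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Standard deviation of the data along the first two directions.
    std = s[:2] / np.sqrt(len(data) - 1)
    # Grid coordinates spread evenly over [-1, 1] along each map axis (assumed scaling).
    a = np.linspace(-1.0, 1.0, rows)
    b = np.linspace(-1.0, 1.0, cols)
    weights = (mean
               + a[:, None, None] * std[0] * vt[0]
               + b[None, :, None] * std[1] * vt[1])
    return weights  # shape (rows, cols, dim)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
weights = pca_init(data, rows=5, cols=5)
```

Moving along a row or column of the map changes the weights by a fixed step along one principal direction, so neighboring neurons start out with similar weights and the map is already ordered before training begins. All of the data is needed up front to compute the principal directions, which is why this only works in batch mode.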

Self-Organisation in Self-Organising Feature Maps (SOM)


Self-organisation refers to how a system can develop a structure or order without being directed by an outside
force. In the context of Self-Organising Feature Maps (SOM), this means that the arrangement of neurons (the
building blocks of the network) can achieve a global ordering through local interactions.
To understand self-organisation better, let’s consider the example of birds flying in formation. Each bird
does not need to know the exact position of every other bird. Instead, if each bird simply tries to stay a certain
distance behind the bird to its right and maintains the same speed, they can form a perfect flock.
This is similar to how SOM works. Each neuron interacts with its nearby neurons (just like each bird
interacts with the one next to it). Through these local interactions, the network organizes itself globally, even
if the neurons are far apart.

18 Self-Organization
The term self-organize in the context of Self-Organizing Maps (SOMs) refers to the ability of the algorithm to
automatically arrange and categorize input data based on its inherent structure, without the need for supervision
or explicit guidance. This self-organizing characteristic is a fundamental aspect of SOMs and contributes
significantly to their functionality and effectiveness in various applications.

What “Self-Organize” Indicates


1. Autonomous Learning:
ˆ Self-organization implies that the learning process occurs without external labels or pre-defined cat-
egories. The algorithm learns from the data itself, identifying patterns and structures inherent in the
input.
2. Adaptive Representation:

ˆ In a self-organizing system, the arrangement of neurons or nodes adapts based on the input data. As
data is presented to the network, the neurons adjust their weights to represent the features of the
input space effectively.
3. Topological Mapping:

ˆ Self-organization involves creating a topological map where similar input patterns are mapped to
neighboring neurons. This means that the organization reflects the similarities in the input data,
preserving the relationships between different data points.
4. Emergent Behavior:

ˆ The term also encompasses the emergence of complex behaviors from simple rules and interactions
among the neurons. Self-organization results in the emergence of patterns that represent the under-
lying structure of the data.

Contributions of Self-Organization to SOMs


1. Data Clustering:
ˆ SOMs excel at clustering similar data points. By self-organizing, they can group related inputs
together, making them valuable for tasks such as data visualization and clustering analysis. For
example, in image processing, SOMs can organize similar images or features into clusters.

2. Dimensionality Reduction:
ˆ Through self-organization, SOMs effectively reduce the dimensionality of complex datasets while
preserving the topology of the original data. This is particularly beneficial when dealing with high-
dimensional data, as it helps to visualize and understand complex relationships.

3. Feature Mapping:
ˆ The self-organizing property allows SOMs to learn meaningful representations of the input features.
Neurons that are activated by similar inputs become organized in close proximity on the map, making
it easier to interpret and analyze the relationships between different features.
4. Unsupervised Learning:

ˆ Self-organization enables SOMs to operate in an unsupervised learning framework. This is particularly
advantageous when labeled data is scarce or unavailable. SOMs can discover underlying
structures and patterns in the data autonomously, which can then be used for further analysis or
modeling.

5. Robustness to Noise:
ˆ The self-organizing nature of SOMs makes them robust to noise and outliers in the data. By focus-
ing on the overall structure and relationships between data points, SOMs can effectively filter out
irrelevant variations and capture the essential patterns.
6. Flexibility and Adaptability:

ˆ Self-organization allows SOMs to be flexible and adaptable to different types of input data. Whether
dealing with images, sound, or any other type of data, SOMs can learn and adjust their representation
accordingly.

The concept of self-organization is central to the operation of Self-Organizing Maps. It indicates a form of
autonomous, adaptive learning that enables the algorithm to effectively capture and represent the structure of
complex data. By self-organizing, SOMs can cluster, reduce dimensions, and provide meaningful insights into
the data, making them valuable tools in various fields such as data analysis, image recognition, and pattern
recognition. Understanding the implications of self-organization enhances our ability to leverage SOMs for
practical applications and contributes to the advancement of unsupervised learning methodologies.

19 Network Dimensionality and Boundary Conditions
When we use the Self-Organising Feature Map (SOM) algorithm, we often visualize it as a 2D rectangular grid
of neurons. However, SOM can also work with different shapes and dimensions, such as:

ˆ 1D Neurons: Sometimes, a line of neurons may work better.

ˆ 3D Neurons: In other cases, we may need a three-dimensional arrangement.

The choice of dimensions depends on the intrinsic dimensionality of the input data. For instance, if you
have data points located on a flat surface in a room (like the wall in front of you), they are essentially 2D even
though they exist in a 3D space.
Understanding the intrinsic dimensionality can help reduce noise and improve learning by focusing on the
relevant features.

Boundary Conditions
When designing the SOM, we also need to think about the boundaries of the network. Here are two scenarios
regarding boundaries:

ˆ Defined Boundaries: For example, when mapping sounds from low to high pitch, we have clear end-
points (the lowest and highest pitches).
ˆ Removing Boundaries: In some cases, we might want to treat the edges of the map as connected. This
can be done by tying the ends together:
– In 1D, this means turning a line into a circle.
– In 2D, this transforms a rectangle into a torus (like bending a piece of paper to create a tube and
then connecting the ends).

Distance Calculation
With boundary conditions, calculating distances becomes more complex. For a toroidal map, we can visualize
multiple copies of the original map surrounding it. The distance between two neurons is then the shortest
distance, considering both the original and its copies.
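
Instead of building the surrounding copies explicitly, the same shortest distance can be obtained by letting each coordinate difference wrap around. The sketch below assumes integer (row, column) positions on a rows × cols map; the function name `toroidal_distance` is hypothetical.

```python
import math

def toroidal_distance(p, q, rows, cols):
    """Shortest grid distance between positions p and q on a wrap-around map."""
    dr = abs(p[0] - q[0])
    dc = abs(p[1] - q[1])
    # Going the other way around the torus may be shorter.
    dr = min(dr, rows - dr)
    dc = min(dc, cols - dc)
    return math.hypot(dr, dc)

# On a 10 x 10 torus, opposite edges are direct neighbors.
edge_to_edge = toroidal_distance((0, 0), (0, 9), rows=10, cols=10)
corner_to_corner = toroidal_distance((0, 0), (9, 9), rows=10, cols=10)
interior = toroidal_distance((2, 3), (5, 7), rows=10, cols=10)
```

Taking the minimum of the direct and the wrapped difference along each axis is equivalent to choosing the nearest of the nine copies of the map.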

Choosing the Size of the Network
Before starting the learning process, we must decide the size of the SOM (number of neurons).

ˆ If the network is too small, it can only find broad generalizations from the data.

ˆ If it’s too large, it may memorize every detail without generalizing, leading to overfitting.

Thus, it’s essential to experiment with different network sizes (like 5 × 5 or 10 × 10) to find the optimal
configuration for learning.

20 Examples of Using the SOM


Self-Organizing Maps (SOM) can be used in different scenarios. Here, we look at three examples: one with
random data and two with real-world datasets.

Example 1: Random Two-Dimensional Data


Let us first consider a simple example where the SOM is trained on two-dimensional data that is randomly
chosen from a uniform distribution in the range [−1, 1]. In the beginning, the network weights are initialized
randomly. So, the network is disordered, meaning that no specific structure is present in the nodes.
However, after training for 10 iterations, the network becomes ordered, and nearby neurons map to similar
data points. This means that the network starts organizing itself, grouping similar data together.
Now, if we initialize the network using Principal Component Analysis (PCA), the SOM converges even faster.
In this case, after only five iterations, we see a similar result where the neurons are well-organized. While PCA
helps speed up the training, it is not always necessary for random data.

Example 2: Iris Dataset


Next, let’s consider a well-known dataset, the Iris dataset, which contains three different types of flowers. We
train a 5 × 5 SOM on this dataset. After 100 iterations, the SOM organizes the data into clusters, and each of
the three different types of flowers forms its own group in the map. Even though the SOM doesn’t know the
flower types during training, it manages to group similar flowers together.
The three flower types are shown as different shapes (such as squares, plus signs, and triangles). These
clusters indicate that the SOM can learn patterns in the data and separate the flowers into distinct groups,
even without knowing the class labels.

Example 3: Ecoli Dataset


For a more challenging example, we look at the Ecoli dataset from the UCI Machine Learning repository. In
this case, the task is to identify the location of proteins based on various measurements. This is a more difficult
problem compared to the Iris dataset.
When training the SOM on the Ecoli dataset, the clusters formed are not as clearly defined as in the Iris
dataset, but some patterns can still be observed. It’s important to note that even the MLP (Multi-Layer
Perceptron), which uses the target data, only achieves about 50% accuracy on this dataset, while the SOM is
working without any target labels.
One interesting point here is that boundary conditions can make things more complex. Since the SOM maps
the data onto a grid, clusters near the edges of the map can sometimes behave differently than those in the
center.

21 Key Takeaways
ˆ The SOM organizes itself from random weights to structured groups after training, even on random data.

ˆ For real datasets like the Iris dataset, the SOM can separate different classes into distinct clusters, even
without knowing the class labels.

ˆ The Ecoli dataset presents a more challenging problem, but the SOM still manages to group data into
clusters, though less clearly than with simpler datasets.
