9 SOM



Unsupervised Learning in ANN


- from the book Neural Networks by Simon Haykin
- another book: Fundamentals of Neural Networks by Laurene Fausett

SOM

Unsupervised Learning with ANN / Self organizing


Neural Networks
• Unsupervised or self-organized learning does not require an
external teacher or target labels.
• During the training session, the neural network receives a number of
different input patterns (without the outputs), discovers
significant features in these patterns, and learns how to classify
similar input data into appropriate categories.
• Unsupervised learning tends to follow the neuro-biological
organisation of the brain.
• Unsupervised learning algorithms aim to learn rapidly and can
be used in real time.


Unsupervised Learning
• Some problems require an algorithm to cluster or to
partition a given data set into disjoint subsets
("clusters"), such that patterns in the same cluster are
as alike as possible, and patterns in different clusters
are as dissimilar as possible.
• The application of a clustering procedure results in a
partition (function) that assigns each data point to a
unique cluster.
• A partition may be evaluated by measuring the
average squared distance between each input pattern
and the centroid of the cluster in which it is placed.

Sum of Squared Error (SSE)

[Figure: two clusters C1 and C2, illustrating the total sum of squared errors (SSE) of the resulting clusters]
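A rough sketch for computing this quantity (assuming the points, their cluster labels, and the cluster centroids are NumPy arrays; the function name is illustrative):

```python
import numpy as np

def sse(points, labels, centroids):
    """Total sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for k, c in enumerate(centroids):
        members = points[labels == k]           # points assigned to cluster k
        total += np.sum((members - c) ** 2)     # squared Euclidean distances to the centroid
    return total
```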


Applications of Unsupervised Learning


• Information retrieval:
– web search engines are the most visible IR applications
– the information retrieval process begins when a user enters a query
into the system
• Other widely known applications of unsupervised learning include:
– market segmentation for targeting appropriate customers
– anomaly/fraud detection in the banking sector
– image segmentation
– gene clustering for grouping genes with similar expression levels
– deriving climate indices based on clustering of earth science data
– document clustering based on content, etc.

Where to use clustering?


• Text mining
• Web analysis
• Marketing
• Market segmentation
• Medical diagnostics
• Social network analysis
• Astronomical data analysis
• Organizing computer clusters


Clustering
• Clustering is alternatively called “grouping”.
• Intuitively, we want to assign the same label to data
points that are close to each other.
• Thus, clustering algorithms rely on a distance
metric between data points.

What is Cluster Analysis?


• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms


Issues
• Is the desired number of clusters given?
• Finding the “best” clusters
• Are the clusters semantically meaningful?

Cluster Analysis
• Finding groups of objects such that
– the objects in a group will be similar (or related) to one
another and
– different from (or unrelated to) the objects in other groups


Typical Clustering Methods

• Distance based methods
a. Partitioning algorithms
 K-means, K-medians, K-medoids
 They partition data into multiple clusters
b. Hierarchical algorithms
 Agglomerative (bottom-up)
 Divisive methods (top-down)
• Density/grid based methods
 DBSCAN (dense small regions are merged into bigger regions)
 In grid based methods, individual regions of the data space are formed into a
grid-like structure; each grid cell is a summary of the characteristics of the data
that fall into that cell.
• Probabilistic and generative models
 They model data as coming from a generative process, assuming the data follow
certain distributions, e.g. a mixture of Gaussians
 The observed points are assumed to be generated from these underlying generative models
 Expectation maximization is used for a maximum likelihood fit

K-means Algorithm
• Given the cluster count K, the K-means algorithm is carried out in
three steps after initialisation:
Initialisation: set seed points (randomly selected as means
of clusters)
1)Assign each object to the cluster of the nearest seed
point measured with a specific distance metric
2)Compute new seed points as the centroids of the clusters
of the current partition (the centroid is the centre, i.e.,
mean point, of the cluster)
3)Go back to Step 1), stop when no more new assignment
(i.e., membership in each cluster no longer changes)
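A minimal NumPy sketch of these steps (random seed points, nearest-centroid assignment, centroid recomputation, stop when membership no longer changes); the function name and defaults are illustrative, and the sketch assumes no cluster becomes empty:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k data points at random as the initial cluster means
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the cluster of the nearest seed point / centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when membership in each cluster no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: compute new seed points as the centroids (means) of the current clusters
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```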


Class Problem
Training Examples    x1    x2
A                     1     1
B                     1     0
C                     0     2
D                     2     4
E                     3     5

• Let k = 2, meaning we are interested in two clusters.
• Let A and C be randomly selected as the means of the 2 clusters.

Class Problem
Mean/Center    Distance from center 1    Distance from center 2
A (C1)         0 (C1)                    1.4
B              1 (C1)                    2.2
C (C2)         1.4                       0 (C2)
D              3.2                       2.8 (C2)
E              4.47                      4.2 (C2)

(Data: A = (1,1), B = (1,0), C = (0,2), D = (2,4), E = (3,5))

• Find the distance between each observation and all centers/cluster means.
• Assign each observation to the cluster having the closest mean.
• Recalculate the cluster means.


Class Problem
Mean/Center    Distance from center 1    Distance from center 2
A (C1)         0 (C1)                    1.4
B              1 (C1)                    2.2
C (C2)         1.4                       0 (C2)
D              3.2                       2.8 (C2)
E              4.47                      4.2 (C2)

• Recalculate the cluster means.
• C1 = {A, B} and C2 = {C, D, E}
• New mean of cluster C1 = {(1+1)/2, (1+0)/2} = {1, 0.5}
• New mean of cluster C2 = {(0+2+3)/3, (2+4+5)/3} = {1.7, 3.7}

Class Problem
Mean/Center    Distance from center 1 {1, 0.5}    Distance from center 2 {1.7, 3.7}
A              0.5 (C1)                           2.7
B              0.5 (C1)                           3.7
C              1.8 (C1)                           2.4
D              3.6                                0.5 (C2)
E              4.9                                1.9 (C2)

(Data: A = (1,1), B = (1,0), C = (0,2), D = (2,4), E = (3,5))

• Recalculate the cluster means.
• C1 = {A, B, C} and C2 = {D, E}
• New mean of cluster C1 = {(1+1+0)/3, (1+0+2)/3} = {0.7, 1}
• New mean of cluster C2 = {(2+3)/2, (4+5)/2} = {2.5, 4.5}
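These iterations can be reproduced numerically; a small NumPy sketch of the same worked example (rows A–E, initial means A and C):

```python
import numpy as np

X = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)   # A, B, C, D, E
means = X[[0, 2]].copy()                                              # initial means: A and C

for step in range(3):
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)     # distances to both means
    labels = d.argmin(axis=1)                                         # closest cluster for each point
    print(f"iteration {step}: assignments = {labels}, means = {means.round(2).tolist()}")
    means = np.array([X[labels == j].mean(axis=0) for j in range(2)]) # recalculate cluster means
```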


Summary
• The K-means algorithm is a simple yet popular method for clustering
analysis, but it fails for non-linear or complex data.
• The k-means algorithm is sensitive to outliers!
– An object with an extremely large value may substantially distort the
distribution of the data.
• There are other limitations – there is still a need to reduce the cost of
calculating distances to centroids.
• Its performance is determined by initialisation and an appropriate
distance measure.
• There are several variants of K-means to overcome its weaknesses
– Kernel K-means
– K-Medoids or PAM (partitioning around medoids): resistance to noise
and/or outliers
– K-Modes: extension to categorical data clustering analysis
– CLARA: extension to deal with large data sets
– Mixture models (EM algorithm): handling uncertainty of clusters

SOM

Books:
Neural Networks by Simon Haykin
Fundamentals of NN by Laurene Fausett


SOM model
• The Self-Organizing Map (SOM) was introduced by Teuvo
Kohonen in 1982.
• The SOM (also known as the Kohonen feature map)
algorithm is one of the best known artificial neural
network algorithms.
• In contrast to many other neural networks using
supervised learning, the SOM is based on unsupervised
learning.
• Teuvo Kohonen, a professor of the Academy of Finland,
provided a way of representing multidimensional data in
much lower dimensional spaces - usually one or two
dimensions - with the SOM algorithm.

SOM
• The SOM has been proven useful in many applications.

• SOMs map multidimensional data onto lower dimensional


subspaces where geometric relationships between points
indicate their similarity.

• The SOM can be used to detect features inherent to the


problem and thus has also been called SOFM, the Self-
Organizing Feature Map.


SOM
• It provides a topology preserving mapping
from the high dimensional space to map units.
• The property of topology preservation means
that the mapping preserves the relative
distance between the points.
– Points that are near each other in the input space
are mapped to nearby map units in the SOM.
– The SOM can thus serve as a cluster analyzing tool
of high-dimensional data.
– Also, the SOM has the capability to generalize.

SOM
• Generalization capability means that the
network can recognize or characterize inputs it
has never encountered before.

• A new input is assimilated with the map unit it


is mapped to.


Competitive learning
• With backpropagation, when we applied a net that was trained
to classify the input signal into one of the output categories
A, B, C, …, Z, the net sometimes responded that the signal was
both C and K, or both E and K.
• In such situations, where we know only one of several neurons should
respond, we can include additional structure in the network so
that the net is forced to make a decision as to which one unit
will respond.
• The mechanism by which this is achieved is called competition.
• The most extreme form of competition among a group of
neurons is called Winner-Take-All.

Competitive learning
• In competitive learning, neurons compete among
themselves to be activated.
• While in Hebbian learning, several output neurons
can be activated simultaneously, in competitive
learning, only a single output neuron is active at
any time.
• The output neuron that wins the competition is
called the winner-takes-all neuron.


SOM Overview
• SOM is based on three principles:
– Competition: each neuron calculates a discriminant
function. The neuron with the highest value is declared the
winner.
– Cooperation: neurons near the winner on the lattice
get a chance to adapt.
– Adaptation: The winner and its neighbors increase their
discriminant function value relative to the current input.
• Subsequent presentation of the current input should
result in enhanced function value.
• Redundancy in the input is needed!

Architecture of Kohonen Network


• Two layers of units
– Input: n units (length of training vectors)
– Output: m units (number of categories)
• Input units are fully connected with weights to output
units
• Intra-layer (“lateral”) connections
– Within the output layer
– Defined according to some topology
– No weights are attached to these lateral connections, but they are
used in the algorithm for updating weights


Architecture example

Feature Map / Lattice


• This simulated cortex map, on the one hand can
– perform a self-organized search for important features
among the inputs, and on the other hand can
– arrange these features in a topographically meaningful
order.

• This is why the map is also sometimes termed the


‘self-organizing feature map’, or SOFM.
• Often SOM’s are used with 2D topographies
connecting the output units
• In this way, the final output can be interpreted
spatially, i.e., as a map


Example of feature Map

• It is a 4x4 SOM network (4 nodes down, 4 nodes across).
• It is easy to overlook this structure as trivial, but
there are a few key things to notice.
• First, each map node is connected to each input node.
• For this small 4x4 node network with 3 input nodes
(e.g. RGB), that is 4x4x3 = 48 connections.

Feature map
• Consider 3D input data with red, blue, green
(RGB) values for a particular color.
• For this dataset, a good mapping would
group red, green, blue colors far away from
one another and place the intermediate
colors between their base colors.
– E.g. Yellow should get mapped close to red and
green
– E.g. Teal should get mapped close to green and
blue.


2-D Lattice of Neurons

Feature map
• Map nodes are not connected to each other.
• The nodes are organized in this manner because a 2-D grid makes it
easy to visualize the results.
• In this configuration, each map node has a unique (i, j)
coordinate.
• This makes it easy to reference a node in the network and to
calculate the distances between nodes.
• Because the connections run only to the input nodes, the map
nodes are oblivious to what values their neighbors have.
• A map node will only update its weights based on what the
input vector tells it.


Network Architecture

Note: there is one weight vector of length n associated with each
output / map unit.

The SOM weight matrices

• The SOM uses a set of neurons, often arranged in a 2-D
rectangular or hexagonal grid, to form a discrete topological
mapping of an input space, X ∈ Rn.
• At the start of learning, all the weights
{w1, w2, ..., wM} are initialized to small random numbers.
– wi is the weight vector associated with neuron i in the grid
and has the same dimension n as the input,
– M is the total number of neurons, and ri is the location
vector of neuron i on the grid.


Algorithm
• Initialize weights
• For 0 to X number of training epochs
– Select a sample from the input data set
– Find the "winning" neuron for the sample input
– Adjust the weights of the winning neuron and nearby neurons
• End for loop
The model moves the weight vector of the winning unit (and of its
nearby units) in the feature map towards input vectors that are
similar to it.
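A minimal NumPy sketch of this loop for a small 2-D map, using the Gaussian neighbourhood and the exponentially decaying width and learning rate described in the following slides; all function and variable names, and the default parameter values, are illustrative:

```python
import numpy as np

def train_som(X, rows=4, cols=4, epochs=1000, eta0=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((rows, cols, n))                    # small random initial weights
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    sigma0 = max(rows, cols) / 2.0                     # initial neighbourhood radius (assumed > 1)
    tau1 = epochs / np.log(sigma0)                     # time constant for the width decay
    for t in range(epochs):
        x = X[rng.integers(len(X))]                    # select a sample from the input data set
        # Competition: find the "winning" (best-matching) unit
        d = np.linalg.norm(W - x, axis=2)
        winner = np.unravel_index(d.argmin(), d.shape)
        # Cooperation: Gaussian neighbourhood around the winner, shrinking over time
        sigma = sigma0 * np.exp(-t / tau1)
        dist2 = np.sum((grid - np.array(winner)) ** 2, axis=2)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))
        # Adaptation: move the weights of the winner and its neighbours towards the input (eq. 9.13)
        eta = eta0 * np.exp(-t / epochs)
        W += eta * h[..., None] * (x - W)
    return W
```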

Algorithm


Neighborhoods for rectangular grids and 1-D grids

Neighborhoods for Hexagonal grids


Cooperative Process
• The winning neuron locates the center of a topological
neighborhood of cooperating neurons.

• We are interested in a topological neighborhood that is


neurobiologically correct.
• Take an analogy from the neurobiological evidence for lateral
interaction among a set of excited neurons.
– A neuron that is firing tends to excite the neurons in its
immediate neighborhood more than those farther away
from it.

• Topological neighborhood should be centered around the


winning neuron ‘i’ and should decay smoothly with lateral
distance.
• And one such natural choice is the Gaussian function.

Topographic Map
• This form of map, known as a topographic map , has two
important properties:
– At each stage of representation, or processing, each piece of incoming
information is kept in its proper context (neighbourhood).
– Neurons dealing with closely related pieces of information in input
space are kept close together in topographic map so that they can
interact via short synaptic connections.
• “The spatial location of an output neuron in a topographic
map corresponds to a particular domain or feature drawn
from the input space”.
wj(n+1) = wj(n) + η(n) hj,i(x)(n) [x − wj(n)]    (eq. 9.13)
• This weight update is applied to all neurons in the lattice that lie
inside the topological neighborhood of the winning neuron i.

Refer Simon Haykin book, Chapter 9 on SOM for more details.


Finding Neighbors
• The neurons close to the winning neuron are called its
neighbors.
• Determining a neuron's neighbors can be achieved with
– concentric squares, hexagons, and other polygonal shapes
as well as Gaussian functions, Mexican Hat functions, etc.
• Generally, the neighborhood function is designed to have a
global maximum at the winning neuron and to decrease as it gets
further away from it.
• This makes it so that
– neurons close to the "winning" neuron get scaled towards
the sample input the most,
– while neurons far away get scaled the least, which creates
groupings of similar neurons in the final map; this is
called the cooperative process.

Topological neighbourhood
• Let dj,i denote the lateral distance (in output space rather than
some distance measure in original input space) between
winning neuron i and excited neuron j

– here discrete vector rj defines position of excited neuron j


– ri defines discrete position of winning neuron i
– both ri and rj are measured in discrete output space
• Let hj,i denote topological neighbourhood centred on winning
neuron i, and encompassing a set of excited (cooperating )
neurons, a typical one of which is denoted by j.


Gaussian Function

• A natural choice for the topological neighbourhood is the Gaussian function:
hj,i(x)(n) = exp( −d²j,i / (2σ(n)²) ),  with  σ(n) = σ0 exp(−n / τ1),  for n = 0, 1, 2, 3, …
• Here i is the winning neuron.
• τ1 is the time constant.
• n is the number of iterations.
• In the case of a 1-D lattice, dj,i = |j − i|.
• In the case of a 2-D lattice it is defined by d²j,i = ||rj − ri||².

[Figures: 3-D Gaussian function; Gaussian distribution illustrated by a bean machine (balls dropped through pins); 1-D and 2-D Gaussians. From Wikipedia and http://home.dei.polimi.it]
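A small sketch of this neighbourhood computation for a 2-D lattice (grid coordinates rj and ri passed as tuples; the parameter values are illustrative):

```python
import numpy as np

def neighbourhood(r_j, r_i, n, sigma0=2.0, tau1=1000.0):
    """Gaussian topological neighbourhood h_{j,i}(n) centred on winning neuron i."""
    sigma_n = sigma0 * np.exp(-n / tau1)                      # effective width shrinks with iteration n
    d2 = np.sum((np.asarray(r_j) - np.asarray(r_i)) ** 2.0)   # squared lateral distance on the grid
    return np.exp(-d2 / (2.0 * sigma_n ** 2))

# The winner itself gets h = 1.0, a diagonal neighbour gets a smaller value:
print(neighbourhood((2, 2), (2, 2), n=0), neighbourhood((3, 3), (2, 2), n=0))
```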


Finding Neighbors: other alternatives

In the case of the Mexican hat function, neurons far away get scaled
away from the sample input, which creates groupings of similar
neurons in the final map.

[Figure: 3-D Mexican hat function]

Adaptation in SOM
• For the network to be self organizing or
adapting, the synaptic weight vector wj of
neuron j in the network is required to change
in relation to the input vector x.

• It is needed to fine tune the feature map and


therefore provide an accurate statistical
quantification of input space.

wj(n+1) = wj(n) + η(n) hj,i(x)(n) [x − wj(n)]    (eq. 9.13)


Adaptation in SOM
• Equation 9.13 has the effect of moving the synaptic weight
vector wi of the winning neuron i towards the input vector x.
• Upon repeated presentations of the training data, the
synaptic weight vectors tend to follow the distribution of the
input vectors due to the neighborhood updating.
• The algorithm therefore leads to a topological ordering of the
feature map in the input space, in the sense that neurons
that are adjacent in the lattice will tend to have similar
weight vectors.

Two phases of Adaptive process


• There are 2 phases of adaptive process in SOM:
– Self-organizing or ordering phase
– Convergence phase

• Careful choice of learning rate and neighborhood


function results in good self organization/ordering of
the feature map.


Adaptation with choice of neighborhood function

• The neighbourhood function hj,i(x) should initially include almost
all neurons in the network centered on the winning neuron i, and
then shrink slowly with time / number of iterations.
• Set the initial value σ0 in the hj,i(x) function equal to the radius of
the lattice.
• Set the time constant τ1 = 1000 / log(σ0) in the equation
σ(n) = σ0 exp(−n / τ1),   for n = 0, 1, 2, 3, …
• This is a popular choice for the dependence of σ on the discrete time
n: exponential decay.

Neighborhood function and role of the standard deviation (σ)

• σ is the effective width of the topological neighbourhood; it measures
the degree to which excited neurons in the vicinity of the winning
neuron participate in the learning process.
• The size of the topological neighbourhood shrinks with time.
• Correspondingly, the topological neighborhood assumes a time-varying
form given by:
hj,i(x)(n) = exp( −d²j,i / (2σ(n)²) )
• As compared to the fixed-width form:
hj,i(x) = exp( −d²j,i / (2σ²) )


Adaptation in SOM with learning rate

• We consider two things for self-organization: the learning rate and
the neighbourhood.
• The learning rate should begin with a high value (say 0.1).
– The learning rate parameter is η(n) = η0 exp(−n / τ2).
– Desirable values of the parameters are η0 = 0.1 and τ2 = 1000.
– i.e. it should begin with a value of e.g. 0.1,
– then decrease gradually but remain above 0.01,
– and never go to zero;
– otherwise a metastable state of the SOM will result.

wj(n+1) = wj(n) + η(n) hj,i(x)(n) [x − wj(n)]    (eq. 9.13)
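As a small numerical illustration of these schedules (η0 = 0.1 and τ2 = 1000 from this slide; σ0 and τ1 are illustrative values chosen as in the earlier neighbourhood slides):

```python
import numpy as np

eta0, tau2 = 0.1, 1000.0            # learning-rate schedule parameters
sigma0 = 3.0                        # illustrative initial neighbourhood radius
tau1 = 1000.0 / np.log(sigma0)      # time constant for the width schedule

for n in (0, 250, 500, 1000, 2000):
    eta = eta0 * np.exp(-n / tau2)          # eta(n) = eta0 * exp(-n / tau2)
    sigma = sigma0 * np.exp(-n / tau1)      # sigma(n) = sigma0 * exp(-n / tau1)
    print(f"n={n:5d}  eta={eta:.3f}  sigma={sigma:.2f}")
```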

Convergence in SOM
• As a general rule, the no. of iterations constituting the
convergence phase must be at least 500 times the no. of
neurons in the network.

• hj,i(x) (preferably defined by Gaussian/Mexican hat function)


should contain only the nearest neighbours of a winning
neuron i, which may eventually reduce to one/zero
neighbouring neurons.


Quality: What Is Good Clustering?


• A good clustering method will produce high quality clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– Its ability to discover some or all of the hidden patterns

Source: Chapter 10, Han and Kamber, 3rd Edition

Determine the Number of Clusters


• Empirical method
– # of clusters: k ≈ √(n/2) for a dataset of n points,
e.g., n = 200 gives k = 10
• Elbow method
– Use the turning point in the curve of the sum of
within-cluster variance w.r.t. the # of clusters (see the sketch below)
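A minimal sketch of the elbow method using scikit-learn's KMeans (its inertia_ attribute holds the within-cluster sum of squares); the data below is only a random placeholder:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))      # placeholder data set of n = 200 points
sse_per_k = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse_per_k.append(km.inertia_)                   # total within-cluster sum of squares
# Plot range(1, 11) against sse_per_k and look for the turning point ("elbow").
```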


Measuring Clustering Quality


• 3 kinds of measures: External, internal and relative
• External (Supervised): employ criteria not inherent to the dataset
– Compare a clustering against prior or expert-specified knowledge
(i.e., the ground truth) using certain clustering quality measure
• Internal (Unsupervised): criteria derived from data itself
– Evaluate the goodness of a clustering by considering how well the
clusters are separated, and how compact the clusters are, e.g.,
Silhouette coefficient
• Relative: directly compare different clusterings,
– usually those obtained via different parameter settings for the
same algorithm

Measuring Clustering Quality: External Methods


• Clustering quality measure: Q(C, T), for a clustering C given
the ground truth T
• Q is good if it satisfies the following 4 essential criteria
– Cluster homogeneity: the purer, the better
– Cluster completeness: should assign objects belonging to the
same category in the ground truth to the same cluster
– Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a rag
bag (i.e., “miscellaneous” or “other” category)
– Small cluster preservation: splitting a small category into
pieces is more harmful than splitting a large category into
pieces


Cross-validation
• Cross validation method
– Divide a given data set into m parts
– Use m – 1 parts to obtain a clustering model i.e. hold out
1 part
– Use the remaining part to test the quality of the clustering
• E.g.
I. For each point in the test set, find the closest
centroid, and
II. use the sum of squared distance between all points
in the test set and the closest centroids to measure
how well the model fits the test set
– For any k > 0, repeat it m times, compare the overall
quality measure w.r.t. different k’s, and find the # of clusters
that fits the data best (see the sketch below).
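A sketch of this m-fold procedure (hold out one part, fit centroids on the rest, and score the held-out part by squared distance to the closest centroid), using scikit-learn's KFold and KMeans; the function name is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_score(X, k, m=5):
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=m, shuffle=True, random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        # squared distance from each held-out point to its closest centroid
        d = np.linalg.norm(X[test_idx][:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        total += np.sum(d.min(axis=1) ** 2)
    return total

# Compare cv_score(X, k) over several values of k and pick the k that fits the data best.
```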

BetaCV


Silhouette Coefficient
• The Silhouette Coefficient or silhouette score is a metric used to
evaluate the goodness of a clustering technique. Its value
ranges from -1 to 1.
 1: means clusters are well apart from each other and clearly
distinguished.
 0: means clusters are indifferent, or we can say that the distance
between clusters is not significant.
 -1: means points have been assigned to the wrong clusters.
• Silhouette score = (b − a) / max(a, b)
– where a = average intra-cluster distance, i.e. the average distance
between each point and the other points within its cluster,
– b = average inter-cluster distance, i.e. the average distance between
each point and the points in the nearest neighbouring cluster.
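For example, scikit-learn's silhouette_score can be applied to a K-means result on toy data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)             # toy data with 3 blobs
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to 1 means well-separated, clearly distinguished clusters
```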

Library for SOM


• http://scikit-learn.org/stable/modules/clustering.html
• 2.3. Clustering — scikit-learn 1.3.1 documentation
• https://chi3x10.wordpress.com/2008/05/08/som-self-organizing-map-code-in-matlab/
• http://docs.unigrafia.fi/publications/kohonen_teuvo/MATLAB_implementations_and_applications_of_the_self_organizing_map.pdf
• https://in.mathworks.com/help/nnet/examples/iris-clustering.html


Applications of SOM
• The most important practical applications of SOMs are in:
– exploratory data analysis,
– pattern recognition,
– speech analysis,
– robotics,
– industrial and medical diagnostics,
– instrumentation and control,
– Environment protection

• The SOM can also be applied to hundreds of other tasks
where large amounts of unclassified data are available.

SOM
• The major disadvantage of a SOM is that it requires necessary
and sufficient data in order to develop meaningful clusters.

• The weight vectors must be based on data that can successfully


group and distinguish inputs.

• Lack of data or extraneous data in the weight vectors will add


randomness to the groupings.

• Finding the correct data involves determining which


factors/features are relevant and can be a difficult or even
impossible task in several problems.

• The ability to determine a good data set is a deciding factor in


determining whether to use a SOM or not.


History of Artificial Neural Networks

-Adaptive Resonance Theory (ART)


Networks
-Radial Basis Function (RBF) Networks
-Self-Organizing Map (SOM)
-LVQ (Learning Vector Quantization)
Network

Application Areas of ANNs


• Although certain types of ANN have been engineered to
address certain kinds of problems,
• there exist no definite rules as to what the exact application
domains of ANNs are.

