UNIT 3 Data Mining
2 MARKS
1.What is mutation in genetic algorithm?
The mutation operator introduces new genetic structures in the population by randomly changing
some of its building blocks. The mutation is implemented by occasionally altering a random bit
of a chromosome (string).
15.Define noise.
Let C1, C2, …, Ck be the clusters of D with respect to ε and MinPts. Then, we define the noise as the
set of objects in D which do not belong to any cluster Ci: Noise = {O ∈ D | ∀i, O ∉ Ci}.
17.Expand STIRR.
Sieving Through Iterated Relational Reinforcement
There are two approaches here:
Agglomerative Approach
Divisive Approach
Agglomerative Approach: This approach is also known as the bottom-up approach. In this, we start
with each object forming a separate group. It keeps on merging the objects or groups that are close
to one another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Algorithm:
given a dataset (d1, d2, d3, ..., dn) of size N
# compute the distance matrix; since it is symmetric about the primary
# diagonal, we compute only the lower part of the primary diagonal
for i = 1 to N:
   for j = 1 to i:
      dis_mat[i][j] = distance(di, dj)
each data point starts as a singleton cluster
repeat
   merge the two clusters having the minimum distance
   update the distance matrix
until a single cluster remains (or the termination condition holds)
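The merging loop above can be sketched in Python. This is an illustrative sketch only, assuming 1-D data, single-linkage distance between clusters, and a target number of clusters k as the termination condition (the names `agglomerative` and `k` are not from the notes):

```python
# Naive agglomerative (bottom-up) clustering sketch: start with singleton
# clusters and repeatedly merge the two closest clusters (single linkage)
# until only k clusters remain.
def agglomerative(points, k):
    clusters = [[p] for p in points]          # each point starts as its own cluster
    dist = lambda a, b: abs(a - b)            # 1-D distance for simplicity
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i):                # lower triangle only: matrix is symmetric
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[j] = clusters[j] + clusters[i]   # merge cluster i into cluster j
        del clusters[i]
    return clusters

print(agglomerative([1, 2, 10, 11, 50], 3))   # -> [[1, 2], [10, 11], [50]]
```

A real implementation would cache the distance matrix instead of recomputing pairwise distances on every merge.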
Divisive Approach: This approach is also known as the top-down approach. In this, we start with all
of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters. This is done until each object is in its own cluster or the termination condition holds. This
method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Algorithm:
given a dataset (d1, d2, d3, ..., dn) of size N
at the top we have all the data in one cluster
repeat
   split a cluster using a flat clustering method, e.g. k-means
until each data point is in its own singleton cluster (or the termination condition holds)
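A minimal sketch of the top-down splitting, assuming 1-D data and using a largest-gap split as a simple stand-in for the flat clustering step (the name `divisive` is illustrative, not from the notes):

```python
# Divisive (top-down) sketch: start with everything in one cluster and
# repeatedly split the widest cluster at its largest internal gap
# (standing in for the flat clustering step, e.g. 2-means).
def divisive(points, k):
    clusters = [sorted(points)]
    while len(clusters) < k:
        # pick the cluster with the largest spread to split next
        c = max(clusters, key=lambda c: c[-1] - c[0])
        clusters.remove(c)
        gaps = [c[i + 1] - c[i] for i in range(len(c) - 1)]
        cut = gaps.index(max(gaps)) + 1       # split at the largest gap
        clusters += [c[:cut], c[cut:]]
    return clusters

print(divisive([1, 2, 10, 11, 50], 3))
```

Splitting continues here until k clusters exist, mirroring the "repeat until the termination condition holds" step above.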
1. The partition clustering techniques partition the database into a predefined number of clusters.
2. Partitional clustering is faster than hierarchical clustering.
3. Partitional clustering requires stronger assumptions, such as the number of clusters and the initial centres.
4. Partitional clustering algorithms require the number of clusters to start running.
5. The partition clustering algorithms are of two types: k-means algorithms and k-medoid algorithms.
Hierarchical clustering
Noise: Let C1, C2, …, Ck be the clusters of D with respect to ε and MinPts. Then, we define the noise
as the set of objects in D which do not belong to any cluster Ci: Noise = {O ∈ D | ∀i, O ∉ Ci}.
For a given non-negative value ε, the ε-neighbourhood of an object Oi, denoted by Nε(Oi), is defined
by Nε(Oi) = {Oj ∈ D | d(Oi, Oj) ≤ ε}.
CLUSTER: A cluster C with respect to ε and MinPts is a non-empty subset of D satisfying the
following conditions:
– For all Oi, Oj ∈ D, if Oi ∈ C and Oj is density-reachable from Oi with respect to ε and MinPts, then Oj ∈ C.
– For all Oi, Oj ∈ C, Oi is density-connected to Oj with respect to ε and MinPts.
PARTITIONING ALGORITHMS
Partitioning algorithms construct partitions of a database of N objects into a set of k clusters. The
construction involves determining the optimal partition with respect to an objective function. There
are approximately k^N/k! ways of partitioning a set of N data points into k subsets. An exhaustive
enumeration method that can find the globally optimal partition is practically infeasible except when
N and k are very small. The partitioning clustering algorithm usually adopts the iterative
optimization paradigm. It starts with an initial partition and uses an iterative control strategy. It tries
swapping data points to see if such a swap improves the quality of clustering. When swapping no
longer yields any improvement, it has found a locally optimal partition. The quality of clustering is
very sensitive to the initially selected partition.
(i) k-means algorithms, where each cluster is represented by the center of gravity of the cluster.
(ii) k-medoid algorithms, where each cluster is represented by one of the objects of the cluster
located near the centre.
Most of the specialized clustering algorithms designed for data mining are k-medoid algorithms.
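The k-means idea above, where each cluster is represented by its centre of gravity, can be sketched as follows. This is an illustrative 1-D sketch assuming user-supplied initial centres; the names `kmeans`, `centres`, and `max_iter` are not from the notes:

```python
# Minimal 1-D k-means sketch: each cluster is represented by its centre of
# gravity (the mean), recomputed until the assignments stop changing.
def kmeans(points, centres, max_iter=100):
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centre
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            groups[nearest].append(p)
        # update step: move each centre to the mean of its group
        new_centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, groups

centres, groups = kmeans([1, 2, 10, 11], centres=[0.0, 12.0])
print(centres)   # -> [1.5, 10.5]
```

Note how the sensitivity to the initial partition mentioned above shows up here as sensitivity to the starting `centres`.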
PAM uses a k-medoid method to identify the clusters. PAM selects k objects arbitrarily from the data as
medoids. Each of these k objects is a representative of one of k classes. Other objects in the database are
classified based on their distances to these k medoids (we say that the database is partitioned with
respect to the selected set of k medoids). The algorithm starts with arbitrarily selected k medoids
and iteratively improves upon this selection. In each step, a swap between a selected object Oi and a
non-selected object Oh is made, as long as such a swap results in an improvement in the quality of
clustering. To calculate the effect of such a swap between Oi and Oh, a cost Cih is computed, which is
related to the quality of partitioning the non-selected objects into k clusters represented by the
medoids. The algorithm has two important modules—the partitioning of the database for a given set
of medoids and the iterative selection of medoids.
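The swap step of PAM can be sketched as below. This is an illustrative 1-D sketch, not the full PAM cost bookkeeping: the swap cost is approximated here simply as the change in total distance of every object to its nearest medoid, and the names `cost` and `pam` are hypothetical:

```python
# Sketch of the PAM swap step: the clustering quality is the total distance
# of every object to its nearest medoid; a swap between a selected medoid
# and a non-selected object is kept only if it lowers that total.
def cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, medoids):
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for o in points:
                if o in medoids:
                    continue                   # only swap in non-selected objects
                trial = medoids[:i] + [o] + medoids[i + 1:]
                if cost(points, trial) < cost(points, medoids):
                    medoids, improved = trial, True
    return sorted(medoids)

print(pam([1, 2, 10, 11, 12], medoids=[1, 2]))   # -> [2, 11]
```

Starting from the poor medoid pair (1, 2), the swaps move one medoid into the second group, illustrating the iterative improvement described above.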
Tournament Selection: In this approach a tournament of size k is run among a few individuals chosen at
random from the population, and the one with the best fitness is selected as the winner.
Assume k=2: two entities are picked out of the pool, their fitness values are compared, and the better one is
permitted to reproduce. See below figure to get an idea of how it happens.
Selection pressure can be easily adjusted by changing the tournament size (a higher k increases
selection pressure). Tournament selection is independent of the fitness function.
Merits: decreases computing time; works on parallel architectures.
Tournament selection is also extremely popular in the literature, as it can even work with negative
fitness values.
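A minimal sketch of tournament selection, assuming a fitness function passed in as a callable (the name `tournament_select` is illustrative):

```python
import random

# Tournament selection sketch: draw k individuals at random and keep the
# fittest; note it works even when fitness values are negative, since only
# comparisons between contestants matter.
def tournament_select(population, fitness, k=2, rng=random):
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)

pop = [-3, 7, 2, 9]
winner = tournament_select(pop, fitness=lambda x: x, k=4)
print(winner)  # -> 9 (with k = len(pop) the fittest always wins)
```

With k=2 the selection pressure is mild; raising k makes it ever more likely that only the fittest individuals reproduce, matching the note on selection pressure above.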
DBSCAN.
• Density-Based Spatial Clustering of Applications with Noise.
• For each object of a cluster, the neighbourhood of a given radius has to contain at least a minimum
number of objects (MinPts).
• The DBSCAN algorithm should be used to find associations and structures in data that are hard to
find manually but that can be relevant and useful for finding patterns and predicting trends.
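The core-point rule above can be sketched in Python. This is an illustrative 1-D sketch (the function name `dbscan` and the label -1 for noise are conventions of this sketch, not of the notes):

```python
# Minimal DBSCAN sketch on 1-D data: a point is a core point if its
# eps-neighbourhood holds at least min_pts objects (itself included);
# clusters grow outward from core points, and anything unreachable
# from a core point is labelled noise (-1).
def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    neighbours = lambda i: [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                    # provisionally noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:          # j is itself a core point: expand
                queue += [n for n in nbrs if labels[n] is None]
        cluster += 1
    return labels

print(dbscan([1, 2, 3, 10, 11, 50], eps=1.5, min_pts=2))  # -> [0, 0, 0, 1, 1, -1]
```

The isolated point 50 has too few neighbours within eps, so it ends up as noise, exactly as in the Noise definition earlier in this unit.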
15. Explain the ways in which the mutation operation can be effected.
The mutation operator introduces new genetic structures in the population by randomly changing
some of its building blocks. Since the modification is totally random, and thus not related to
any previous genetic structures, as shown in fig (9), the mutation is implemented by occasionally
altering a random bit of a chromosome. The figure shows the operator being applied to the fifth
element of the chromosome.
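Bit-flip mutation can be sketched as follows, assuming a chromosome encoded as a list of bits and a per-bit mutation probability `p_mut` (both names are illustrative):

```python
import random

def mutate(chromosome, p_mut=0.01, rng=random):
    # Flip each bit independently with probability p_mut.
    return [1 - bit if rng.random() < p_mut else bit for bit in chromosome]

# Forcing p_mut=1.0 flips every bit, which makes the effect easy to see;
# in practice p_mut is kept very small so mutation stays occasional.
print(mutate([0, 1, 1, 0, 1], p_mut=1.0))  # -> [1, 0, 0, 1, 0]
```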
16.Explain categorical clustering algorithm
Categorical Clustering Algorithms:
ROCK (Robust hierarchical-clustering with Links)
STIRR (Sieving Through Iterated Relational Reinforcement)
CACTUS (Clustering Categorical Data Using Summaries)
STIRR:
• STIRR (Sieving Through Iterated Relational Reinforcement).
• An iterative algorithm based on a non-linear dynamical system.
The salient features of STIRR are:
• The database is represented as a graph,
• where each distinct value in the domain of each attribute is represented by a weighted node.
• If there are N attributes and the domain size of the i-th attribute is di,
• then the number of nodes in the graph is Σi di.
• For each tuple in the database, an edge represents the set of nodes which participate in that tuple.
• Thus, a tuple is represented as a collection of nodes, one from each attribute type.
• We assign a weight to each node.
• The set of weights of all the nodes defines the configuration of this structure.
The selected individuals are then arranged in pairs so that they can reproduce.
Types:
Roulette Wheel Selection: In roulette wheel selection, the probability of choosing an individual
for breeding the next generation is proportional to its fitness: the better the fitness, the higher
the chance for that individual to be chosen. Choosing individuals can be depicted as spinning a
roulette wheel that has as many pockets as there are individuals in the current generation, with
sizes depending on their selection probabilities. The probability pi of selecting individual i is
pi = fi / (f1 + f2 + … + fN), where fi is the fitness of i and N is the size of the
current generation (note that in this method one individual can be drawn multiple times).
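The spin of the wheel can be sketched as below, assuming non-negative fitness values (the name `roulette_select` is illustrative):

```python
import random

# Roulette-wheel (fitness-proportionate) selection sketch: each individual
# occupies a slice of the wheel proportional to its fitness, and a random
# spin lands in one of the slices.
def roulette_select(population, fitnesses, rng=random):
    total = sum(fitnesses)
    spin = rng.uniform(0, total)
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if spin <= running:
            return individual
    return population[-1]                     # guard against float rounding

pop = ["a", "b", "c"]
fit = [1.0, 3.0, 6.0]
# "c" holds 60% of the wheel, so over many spins it is drawn most often,
# and the same individual can of course be drawn multiple times.
counts = {p: 0 for p in pop}
random.seed(0)
for _ in range(10_000):
    counts[roulette_select(pop, fit)] += 1
print(counts)
```

Note that this scheme breaks down for negative fitness values, which is exactly why the notes single out rank and tournament selection as alternatives.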
Rank Selection: Rank selection also works with negative fitness values and is mostly used when the
individuals in the population have very close fitness values (this usually happens at the end of the
run). With close fitness values, fitness proportionate selection gives each individual an almost equal
share of the pie, and hence each individual, no matter how fit relative to the others, has
approximately the same probability of being selected as a parent. This in turn leads to a loss of
selection pressure towards fitter individuals, causing the GA to make poor parent selections in such
situations. Rank selection avoids this by basing selection probabilities on the individuals' fitness
ranks instead of their raw fitness values.
Tournament Selection
Tournament selection is a method of choosing an individual from a set of individuals. The winner
of each tournament is selected to perform crossover.