UNIT 3 Data Mining

The document provides information about various genetic algorithm and clustering concepts: 1. Mutation in genetic algorithms introduces new genetic structures by randomly altering bits in chromosomes. Two applications of genetic algorithms in data mining are hypothesis testing and refinement. 2. Hierarchical clustering techniques can be agglomerative, starting with each object as a cluster and merging the closest, or divisive, starting with all objects in one cluster and splitting it recursively. 3. STIRR represents a database as a graph with nodes for attribute values and edges for tuples, assigning weights to represent the configuration.

UNIT-3

2 MARKS
1. What is mutation in genetic algorithm?
The mutation operator introduces new genetic structures into the population by randomly changing
some of its building blocks. Mutation is implemented by occasionally altering a random bit
in a chromosome (string).
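The bit-flip idea above can be sketched in Python (an illustrative sketch, not from the text; the function name and the mutation rate p_mut are assumptions):

```python
import random

def mutate(chromosome, p_mut=0.01):
    """Flip each bit of a bit-string chromosome with a small probability p_mut."""
    return [1 - bit if random.random() < p_mut else bit for bit in chromosome]

random.seed(0)
parent = [1, 0, 1, 1, 0, 0, 1, 0]
child = mutate(parent, p_mut=0.25)   # occasionally alters random bits of the string
```

With p_mut = 0 the chromosome is returned unchanged; with p_mut = 1 every bit is inverted.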

2. Mention two applications of genetic algorithm in data mining?


*Hypothesis testing
*Refinement

3. What is text mining?


Text mining is the extension of the data mining approach to textual data. It is concerned with
various tasks, such as the extraction of information implicitly contained in a collection of documents,
or similarity-based structuring.

4.What is tournament selection?


Tournament selection is a method of selecting an individual from a population of individuals in a
genetic algorithm.

5. What is rank based selection?


In rank-based selection, the raw fitness value is not used directly when selecting a parent. Instead,
every individual in the population is ranked according to its fitness, and the selection of parents
depends on the rank of each individual rather than on its fitness.

6. What is proportionate selection?


Fitness proportionate selection is one of the most popular ways of parent selection. In this method,
every individual can become a parent with a probability proportional to its fitness. Therefore,
fitter individuals have a higher chance of mating and propagating their features to the next
generation.

7.What is Roulette wheel selection?


The roulette wheel selection method is used for selecting all the individuals for the next generation.
It is a popular selection method used in genetic algorithms. A roulette wheel is constructed from
the relative fitness (the ratio of an individual's fitness to the total fitness) of each individual.

8.What is Elitist selection?


Elitist selection is a selection strategy in which a limited number of individuals with the best fitness
values are passed unchanged to the next generation, bypassing crossover and mutation.

9.What is replacement selection?


Replacement selection is based on the premise that when we select the smallest record from the
sort buffer in main memory, we can replace it with the incoming record.

10.What is bitwise inversion?


It flips zeros into ones and ones into zeros. Bitwise inversion is performed with the bitwise NOT
(one's complement) operator, which replaces each bit of the operand with its opposite value.
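In Python, for example, a fixed-width bit pattern can be inverted by XOR-ing with an all-ones mask (a minimal illustration; the helper name invert_bits is hypothetical):

```python
def invert_bits(value, width=8):
    """Flip every bit of a width-bit value (one's complement within the width)."""
    mask = (1 << width) - 1          # e.g. 0b11111111 for width=8
    return value ^ mask              # XOR with all ones flips each bit

print(format(invert_bits(0b10110010), "08b"))  # prints 01001101
```

Inverting twice returns the original value, since XOR with the same mask is its own inverse.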

11.What are the 2 modules of PAM algorithm


The PAM algorithm has two important modules—the partitioning of the database for a given set of
medoids and the iterative selection of medoids.

12. What is cross over in genetic algorithm?


Crossover is one of the genetic operators used to recombine the population’s genetic material. It
takes two chromosomes and swaps part of their genetic information to produce new chromosomes.
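Single-point crossover, the simplest version of this operator, can be sketched in Python (an illustrative sketch; the function name is an assumption):

```python
import random

def crossover(parent1, parent2):
    """Single-point crossover: swap the tails of two equal-length chromosomes."""
    point = random.randint(1, len(parent1) - 1)   # crossover point, never at an end
    child1 = parent1[:point] + parent2[point:]    # head of parent1, tail of parent2
    child2 = parent2[:point] + parent1[point:]    # head of parent2, tail of parent1
    return child1, child2

random.seed(1)
c1, c2 = crossover([1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0])
```

At every position, the two children together carry exactly the genes of the two parents.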

13. Name the two ways of hierarchical decomposition


The hierarchical decomposition can be represented as a dendrogram in two ways:
(i) Bottom-up (Agglomerative) approach, and
(ii) Top-down (Divisive) approach.

14. What is DBSCAN?


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a density-based notion of
clusters to discover clusters of arbitrary shapes. The key idea of DBSCAN is that, for each object of a
cluster, the neighbourhood of a given radius has to contain at least a minimum number of data
objects. In other words, the density of the neighbourhood must exceed a threshold. The critical
parameter here is the distance function for the data objects.

15. Define noise.
Let C1, C2, …, Ck be the clusters of D with respect to ε and MinPts. Then, we define the noise as the
set of objects in D which do not belong to any cluster Ci: Noise = {o ∈ D | ∀i: o ∉ Ci}.

16. Define cluster


A cluster C with respect to ε and MinPts is a non-empty subset of D satisfying the following
conditions:
– For all Oi, Oj ∈ D: if Oi ∈ C and Oj is density-reachable from Oi with respect to ε and MinPts, then
Oj ∈ C.
– For all Oi, Oj ∈ C: Oi is density-connected to Oj with respect to ε and MinPts.

17.Expand STIRR.
Sieving Through Iterated Relational Reinforcement

18. What is neighbourhood of an object?


The surroundings within a radius ε of a given object are known as the ε-neighbourhood of the object.
Data points lying in the low-density region separating two clusters are considered noise.

19. Define core object


An object O is said to be a core object if |Nε(O)| ≥ MinPts. A core object is an object which has a
neighbourhood of user-specified minimum density.

20.Define Directly density reachable object?


An object (or instance) q is directly density-reachable from an object p if q is within the
ε-neighbourhood of p and p is a core object. Direct density-reachability is not symmetric:
object p is not directly density-reachable from object q if q is not a core object.
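The definition and its asymmetry can be checked with a small sketch (illustrative Python on 1-D points with absolute-difference distance; the names are hypothetical):

```python
def eps_neighbourhood(D, o, eps):
    """N_eps(o): all objects of D within distance eps of o (1-D points for brevity)."""
    return [p for p in D if abs(p - o) <= eps]

def directly_density_reachable(D, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_eps(p) and p is a core object."""
    in_neighbourhood = q in eps_neighbourhood(D, p, eps)
    p_is_core = len(eps_neighbourhood(D, p, eps)) >= min_pts
    return in_neighbourhood and p_is_core

D = [1.0, 1.1, 1.2, 2.0]
r_forward = directly_density_reachable(D, 2.0, 1.2, eps=0.8, min_pts=3)  # 1.2 is a core object
r_back = directly_density_reachable(D, 1.2, 2.0, eps=0.8, min_pts=3)     # 2.0 is not a core object
```

Here r_forward holds but r_back does not, which is exactly the asymmetry described above.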
5 MARKS
1. Write a note on CLARA
It can be observed that the major computational effort of PAM is to determine k medoids
through an iterative optimization. Though CLARA follows the same principle, it attempts to reduce
the computational effort. CLARA (Clustering Large Applications), a sampling-based method, uses a
random sample from the data set as the candidate set of medoids instead of taking the whole set of
data into consideration. The algorithm PAM is applied to compute the best medoids from the
sample. If the sample is good enough, it should closely represent the original data set. In many cases,
a large enough sample drawn by an equal-probability-of-selection design (that is, each object has the
same probability of being chosen into the sample) works well. The representative objects (medoids)
chosen will likely be similar to those that would have been chosen from the whole data set. CLARA
tries multiple random samples and returns the best clustering as the output. The complexity of
computing the medoids on a random sample is O(ks² + k(n − k)), where s is the size of the sample,
k is the number of clusters, and n is the total number of objects. CLARA can deal with larger data
sets than PAM, but its effectiveness depends on the sample size. CLARA cannot find a good
clustering if any of the best sampled medoids is far from the best k medoids: if an object is one of
the best k medoids but is never selected during sampling, CLARA will never find the best clustering.
PAM examines every object in the data set against every current medoid, whereas CLARA confines
the candidate medoids to a random sample of the data set. A randomized algorithm called
CLARANS (Clustering Large Applications based upon Randomized Search) presents a trade-off
between the cost and the effectiveness of using samples to obtain a clustering.

2.Differentiate agglomerative and divisive clustering


*The hierarchical techniques are of two types: agglomerative and divisive clustering techniques.
Agglomerative clustering techniques start with as many clusters as there are records, each
cluster having only one record. Pairs of clusters are then successively merged until the number of
clusters reduces to k. At each stage, the pair of clusters merged are the ones nearest to
each other. If the merging is continued, it terminates in a hierarchy of clusters with a single
cluster, containing all the records, at the top of the hierarchy.
*Divisive clustering techniques take the opposite approach from agglomerative techniques. They
start with all the records in one cluster, and then try to split that cluster into smaller pieces.

3. Write a note on STIRR


• STIRR (Sieving Through Iterated Relational Reinforcement) is an iterative algorithm based on a
non-linear dynamical system.

The salient features of STIRR are:

• The database is represented as a graph, where each distinct value in the domain of each attribute
is represented by a weighted node.
• If there are N attributes and the domain size of the ith attribute is di, the number of nodes in the
graph is Σi di.
• For each tuple in the database, an edge represents the set of nodes which participate in that tuple.
Thus, a tuple is represented as a collection of nodes, one from each attribute type.
• We assign a weight to each node.
• The set of weights of all the nodes defines the configuration of this structure.
4.What is Hierarchical clustering? List and explain different types of hierarchical clustering
techniques?
Hierarchical Clustering: This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here:

Agglomerative Approach

Divisive Approach

Agglomerative Approach: This approach is also known as the bottom-up approach. In this, we start
with each object forming a separate group. It keeps on merging the objects or groups that are close
to one another, and it keeps doing so until all of the groups are merged into one or until the
termination condition holds.

Algorithm:

given a dataset (d1, d2, d3, …, dn) of size N, compute the distance matrix
for i = 1 to N:
    for j = 1 to i:
        dis_mat[i][j] = distance[di, dj]
(the distance matrix is symmetric about the primary diagonal, so we compute only its lower part)
treat each data point as a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
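The steps above can be sketched in Python for 1-D points with single-linkage distances (an illustrative sketch, not from the text; the helper name and sample data are assumptions):

```python
def single_link_agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]            # each point starts as a singleton cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i):                  # lower triangle of the distance matrix
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[j] += clusters[i]              # merge the closest pair of clusters
        del clusters[i]
    return clusters

clusters = single_link_agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], k=2)
```

Stopping at k = 1 instead would produce the full dendrogram of merges.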

Divisive Approach: This approach is also known as the top-down approach. In this, we start with all
of the objects in the same cluster. In each iteration, a cluster is split up into smaller
clusters. This is done until each object is in its own cluster or the termination condition holds. This
method is rigid: once a merging or splitting is done, it can never be undone.

Algorithm:

given a dataset (d1, d2, d3, …, dn) of size N, start with all data in one cluster at the top
split the cluster using a flat clustering method, e.g. k-means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
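A minimal sketch of this top-down loop, using a tiny 2-means as the flat clustering step on 1-D points (illustrative assumptions: the "best cluster to split" is taken to be the widest one, and the data has enough distinct points):

```python
def two_means_split(cluster, iters=10):
    """Split one cluster into two with a tiny 2-means (the 'flat' step above)."""
    c1, c2 = min(cluster), max(cluster)          # crude initial centres
    for _ in range(iters):
        a = [p for p in cluster if abs(p - c1) <= abs(p - c2)]
        b = [p for p in cluster if abs(p - c1) > abs(p - c2)]
        if a: c1 = sum(a) / len(a)               # recompute the centres
        if b: c2 = sum(b) / len(b)
    return a, b

def divisive(points, k):
    """Top-down: repeatedly split a cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        widest = max(clusters, key=lambda c: max(c) - min(c))  # cluster chosen to split
        clusters.remove(widest)
        a, b = two_means_split(widest)
        clusters += [a, b]
    return clusters

result = divisive([1.0, 1.2, 5.0, 5.1, 9.0], k=2)
```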

5. Write a note on PAM.


PAM (Partitioning Around Medoids) uses a k-medoid method to identify the clusters. PAM selects k
objects arbitrarily from the data as medoids. Each of these k objects is a representative of one of k
classes. Other objects in the database are classified based on their distances to these k medoids (we
say that the database is partitioned with respect to the selected set of k medoids). The algorithm
starts with arbitrarily selected k medoids and iteratively improves upon this selection. In each step, a
swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap
results in an improvement in the quality of clustering. To calculate the effect of such a swap between
Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected
objects into k clusters represented by the medoids. The algorithm has two important modules: the
partitioning of the database for a given set of medoids, and the iterative selection of medoids.
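The swap step can be sketched for 1-D points (an illustrative sketch; the cost here is the total distance of each point to its nearest medoid, and a full PAM would repeat the step until no improving swap exists):

```python
def total_cost(points, medoids):
    """Cost of a medoid set: each point is assigned to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam_swap_step(points, medoids):
    """Try every (medoid, non-medoid) swap; keep the first one that lowers the cost."""
    best = total_cost(points, medoids)
    for m in list(medoids):
        for o in points:
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(points, candidate) < best:
                return candidate            # an improving swap was found
    return medoids                          # local optimum: no swap improves the cost

points = [1.0, 1.2, 5.0, 5.1, 9.0]
medoids = pam_swap_step(points, [1.0, 1.2])   # arbitrary initial medoids
```

Starting from the poor medoid set [1.0, 1.2], the first improving swap moves one medoid into the distant group of points.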

6. Differentiate hierarchical and partition clustering.


Partition clustering

1. Partition clustering techniques partition the database into a predefined number of clusters.
2. Partitional clustering is faster than hierarchical clustering.
3. Partitional clustering requires stronger assumptions, such as the number of clusters and the
initial centres.
4. Partitional clustering algorithms require the number of clusters to start running.
5. The partition clustering algorithms are of two types: k-means algorithms and k-medoid
algorithms.

Hierarchical clustering

1. Hierarchical clustering techniques do a sequence of partitions, in which each partition is
nested into the next partition in the sequence.
2. Hierarchical clustering is slower compared to partitional clustering.
3. Hierarchical clustering requires only a similarity measure.
4. Hierarchical clustering does not require any input parameters.
5. The hierarchical techniques are of two types: agglomerative and divisive clustering techniques.

7.Describe the technique of Genetic algorithm?


Crossover: Crossover is one of the genetic operators used to recombine the population’s genetic
material. It takes two chromosomes and swaps part of their genetic information to produce new
chromosomes. As fig (8) shows, after the crossover point has been randomly chosen, portions of the
parent chromosomes (strings), Parent 1 and Parent 2, are combined to produce the new offspring,
the child.
Mutation: The mutation operator introduces new genetic structures into the population by randomly
changing some of its building blocks. Since the modification is totally random, and thus not related
to any previous genetic structures present in the population, it creates different structures, as shown
in fig (9). Mutation is implemented by occasionally altering a random bit in a chromosome (string).
The figure shows the operator being applied to the fifth element of the chromosome.

8.Explain mutation in genetic algorithm.


The mutation operator introduces new genetic structures into the population by randomly changing
some of its building blocks. Since the modification is totally random, and thus not related to any
previous genetic structures present in the population, it creates different structures, as shown in
fig (9). Mutation is implemented by occasionally altering a random bit in a chromosome (string).
The figure shows the operator being applied to the fifth element of the chromosome.
9) Explain core object and noise related to DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a density-based notion of
clusters to discover clusters of arbitrary shapes. The key idea of DBSCAN is that, for each object of a
cluster, the neighbourhood of a given radius has to contain at least a minimum number of data
objects. In other words, the density of the neighbourhood must exceed a threshold. The critical
parameter here is the distance function for the data objects. The following concepts are introduced
here in the context of DBSCAN.
• Density-Based Spatial Clustering of Applications with Noise.
• Uses a density-based notion of clusters to discover clusters of arbitrary shapes.
• For each object of a cluster, the neighbourhood of a given radius has to contain at least a minimum
number of data objects.
• The density of the neighbourhood must exceed a threshold.
• The DBSCAN algorithm should be used to find associations and structures in data that are hard to
find manually but that can be relevant and useful for finding patterns and predicting trends.

Core object: An object O is said to be a core object if |Nε(O)| ≥ MinPts. A core object is an object
which has a neighbourhood of user-specified minimum density.

Noise: Let C1, C2, …, Ck be the clusters of D with respect to ε and MinPts. Then, we define the noise
as the set of objects in D which do not belong to any cluster Ci: Noise = {o ∈ D | ∀i: o ∉ Ci}.
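These two definitions translate directly into code (illustrative Python on 1-D points with absolute-difference distance; the names and sample data are assumptions):

```python
def eps_neighbourhood(D, o, eps):
    """N_eps(o): all objects of D within distance eps of o (1-D points for brevity)."""
    return [p for p in D if abs(p - o) <= eps]

def is_core(D, o, eps, min_pts):
    """o is a core object iff its eps-neighbourhood holds at least min_pts objects."""
    return len(eps_neighbourhood(D, o, eps)) >= min_pts

D = [1.0, 1.1, 1.2, 5.0]
core = [o for o in D if is_core(D, o, eps=0.5, min_pts=3)]   # the dense points
```

The isolated point 5.0 fails the core-object test; in a full DBSCAN run, a point like it that is also not density-reachable from any core object would be labelled noise.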

10.Explain Cluster and noise related to DBSCAN


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a density-based notion of
clusters to discover clusters of arbitrary shapes. The key idea of DBSCAN is that, for each object of a
cluster, the neighbourhood of a given radius has to contain at least a minimum number of data
objects. In other words, the density of the neighbourhood must exceed a threshold. The critical
parameter here is the distance function for the data objects. The following concepts are introduced
here in the context of DBSCAN.

For a given non-negative value ε, the ε-neighbourhood of an object O, denoted by Nε(O), is defined
by Nε(O) = {O′ ∈ D | d(O, O′) ≤ ε}.

Cluster: A cluster C with respect to ε and MinPts is a non-empty subset of D satisfying the
following conditions:
– For all Oi, Oj ∈ D: if Oi ∈ C and Oj is density-reachable from Oi with respect to ε and MinPts,
then Oj ∈ C.
– For all Oi, Oj ∈ C: Oi is density-connected to Oj with respect to ε and MinPts.

Noise: Let C1, C2, …, Ck be the clusters of D with respect to ε and MinPts. Then, we define the noise
as the set of objects in D which do not belong to any cluster Ci: Noise = {o ∈ D | ∀i: o ∉ Ci}.

11. Explain partitioning algorithms

PARTITIONING ALGORITHMS
Partitioning algorithms construct partitions of a database of N objects into a set of k clusters. The
construction involves determining the optimal partition with respect to an objective function. There
are approximately k^N / k! ways of partitioning a set of N data points into k subsets. An exhaustive
enumeration method that can find the global optimal partition is practically infeasible except when
N and k are very small. Partitioning clustering algorithms usually adopt the iterative optimization
paradigm: starting with an initial partition and using an iterative control strategy, the algorithm tries
swapping data points to see if such a swap improves the quality of the clustering. When swapping no
longer yields any improvement, a locally optimal partition has been found. The quality of the
clustering is very sensitive to the initially selected partition.

There are the two main categories of partitioning algorithms.

(i) k-means algorithms, where each cluster is represented by the center of gravity of the cluster.

(ii) k-medoid algorithms, where each cluster is represented by one of the objects of the cluster
located near the centre.

Most of the specialized clustering algorithms designed for data mining are k-medoid algorithms.

K-MEDOID ALGORITHMS PAM

PAM (Partitioning Around Medoids)

PAM uses a k-medoid method to identify the clusters. PAM selects k objects arbitrarily from the data
as medoids. Each of these k objects is a representative of one of k classes. Other objects in the
database are classified based on their distances to these k medoids (we say that the database is
partitioned with respect to the selected set of k medoids). The algorithm starts with arbitrarily
selected k medoids and iteratively improves upon this selection. In each step, a swap between a
selected object Oi and a non-selected object Oh is made, as long as such a swap results in an
improvement in the quality of clustering. To calculate the effect of such a swap between Oi and Oh,
a cost Cih is computed, which is related to the quality of partitioning the non-selected objects into k
clusters represented by the medoids. The algorithm has two important modules: the partitioning of
the database for a given set of medoids, and the iterative selection of medoids.

12. Explain random selection and tournament selection in genetic algorithm

Tournament Selection: In this approach a "tournament" is run among k individuals chosen at random
from the population, and the one with the best fitness is selected as the winner.
Assume k = 2; then two entities are picked out of the pool, their fitness is compared, and the better
one is permitted to reproduce. See the figure below to get an idea of how this happens.
Selection pressure can be easily adjusted by changing the tournament size (a higher k increases
selection pressure). Tournament selection is independent of the fitness function.
Merits: decreases computing time; works on parallel architectures.
Tournament selection is also extremely popular in the literature, as it can even work with negative
fitness values.
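The tournament idea can be sketched in a few lines (illustrative Python; the toy population and the fitness counting 1-bits are assumptions):

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick k individuals at random; the fittest of them wins the tournament."""
    contestants = random.sample(population, k)
    return max(contestants, key=fitness)

random.seed(0)
pop = ["0001", "0110", "1011", "1111"]
fitness = lambda s: s.count("1")              # toy fitness: number of 1-bits
parent = tournament_select(pop, fitness, k=3)
```

Note that only comparisons of fitness values are used, which is why negative fitness values pose no problem.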

13. Explain roulette wheel selection and rank-based selection


Roulette Wheel Selection: In this method, selection is proportionate to the fitness of an individual.
The higher the fitness of an individual (the better the chromosome), the higher its chances of being
selected.
The principle of roulette selection follows a linear search through a roulette wheel whose slots are
weighted in proportion to the individual chromosomes' fitness values. A marble is then thrown onto
the wheel and selects a chromosome; a chromosome with bigger fitness will be selected more often.
It is clear that a fitter individual has a greater slice of the wheel and therefore a greater chance of
landing in front of the fixed pointer when the wheel is rotated. Therefore, the probability of
choosing an individual depends directly on its fitness.
Rank-based selection: The population is sorted according to the objective values.
The fitness assigned to each individual depends only on its position in the ranking, not on the raw
objective value.
Ranking introduces a uniform scaling across the population and provides a simple and effective way
of controlling selective pressure.
Merits: Rank-based fitness assignment behaves in a more robust manner than proportional fitness
assignment, and every chromosome has a chance of being selected.
Demerits: It is unfair to individuals with very high fitness, and can lead to slower convergence
because the best chromosomes do not differ so much from the other ones.
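The roulette wheel can be sketched as a linear search over cumulative fitness (illustrative Python; the toy population and fitness are assumptions):

```python
import random

def roulette_select(population, fitness):
    """Spin the wheel: selection probability is fitness / total fitness."""
    weights = [fitness(ind) for ind in population]
    spin = random.uniform(0, sum(weights))     # where the 'marble' lands
    running = 0.0
    for ind, w in zip(population, weights):
        running += w                           # walk the wheel's slots in order
        if spin <= running:
            return ind
    return population[-1]                      # guard against float rounding

random.seed(0)
pop = ["01", "11", "10", "00"]                 # fitness (count of 1s): 1, 2, 1, 0
fitness = lambda s: s.count("1")
picks = [roulette_select(pop, fitness) for _ in range(1000)]
```

Over many spins, "11" is picked about twice as often as "01", and "00" (fitness 0) is never picked, which is exactly the proportionate behaviour described above.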
14. What is DBSCAN?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a density-based notion of
clusters to discover clusters of arbitrary shapes. The key idea of DBSCAN is that, for each object of a
cluster, the neighbourhood of a given radius has to contain at least a minimum number of data
objects. In other words, the density of the neighbourhood must exceed a threshold. The critical
parameter here is the distance function for the data objects. The following concepts are introduced
here in the context of DBSCAN.
• Density-Based Spatial Clustering of Applications with Noise.
• Uses a density-based notion of clusters to discover clusters of arbitrary shapes.
• For each object of a cluster, the neighbourhood of a given radius has to contain at least a minimum
number of data objects.
• The density of the neighbourhood must exceed a threshold.
• The DBSCAN algorithm should be used to find associations and structures in data that are hard to
find manually but that can be relevant and useful for finding patterns and predicting trends.
15. Explain the ways in which the mutation operation can be effected.
The mutation operator introduces new genetic structures in the population by randomly changing
some of its building blocks. Since the modification is totally random, and thus not related to any
previous genetic structures, it creates different structures, as shown in fig (9). Mutation is
implemented by occasionally altering a random bit in a chromosome. The figure shows the operator
being applied to the fifth element of the chromosome.
16.Explain categorical clustering algorithm
Categorical Clustering Algorithms:
ROCK (Robust Hierarchical Clustering with Links)
STIRR (Sieving Through Iterated Relational Reinforcement)
CACTUS (Clustering Categorical Data Using Summaries)

STIRR:
• STIRR (Sieving Through Iterated Relational Reinforcement) is an iterative algorithm based on a
non-linear dynamical system.
The salient features of STIRR are:
• The database is represented as a graph, where each distinct value in the domain of each attribute
is represented by a weighted node.
• If there are N attributes and the domain size of the ith attribute is di, the number of nodes in the
graph is ∑i di.
• For each tuple in the database, an edge represents the set of nodes which participate in that tuple.
• Thus, a tuple is represented as a collection of nodes, one from each attribute type.
• We assign a weight to each node.
• The set of weights of all the nodes defines the configuration of this structure.

17.Explain any two types of selection in genetic algorithm?


The selection phase involves the selection of individuals for the reproduction of offspring. All the
selected individuals are then arranged in pairs of two for reproduction, and these individuals
transfer their genes to the next generation.

Types:

Roulette Wheel Selection: In roulette wheel selection, the probability of choosing an individual for
breeding the next generation is proportional to its fitness: the better the fitness, the higher the
chance for that individual to be chosen. Choosing individuals can be depicted as spinning a roulette
wheel that has as many pockets as there are individuals in the current generation, with sizes
depending on their probability. The probability of choosing individual i is
p_i = f_i / Σ_{j=1}^{N} f_j, where f_i is the fitness of i and N is the size of the current
generation (note that in this method one individual can be drawn multiple times).

Rank Selection: Rank selection also works with negative fitness values and is mostly used when the
individuals in the population have very close fitness values (this usually happens at the end of the
run). Under fitness proportionate selection, close fitness values give each individual an almost equal
share of the pie, so every individual, no matter how fit relative to the others, has approximately the
same probability of being selected as a parent. This loss of selection pressure towards fitter
individuals causes the GA to make poor parent selections in such situations; rank selection avoids it
by selecting on rank instead of raw fitness.
Tournament Selection

Tournament selection is a method of choosing an individual from a set of individuals. The winner
of each tournament is selected to perform crossover.
