
DM Assignment-2A Vivek JD

I. Review Questions
Chapter 4: Mining Frequent Patterns, Associations, and Correlations

1. Which mining technique is used to find interesting association relationships among a
large set of data items? Explain the process used for the same. How can the rules
used for mining associations be classified in various ways?

Association rule mining is a technique for discovering interesting and useful
relationships between variables in large databases. It is a type of unsupervised
learning, which means that it does not require any labeled data.

The process of association rule mining typically involves the following steps:

1. Data preparation: The data is cleaned and preprocessed to ensure that it is in a
format that can be analyzed by the association rule mining algorithm. This may
involve removing missing values, converting categorical variables to numerical
variables, and discretizing continuous variables.
2. Finding frequent itemsets: The algorithm identifies itemsets that occur frequently in
the data. An itemset is a set of one or more items. For example, an itemset in a
grocery store database could be {bread, milk, eggs}.
3. Generating association rules: The algorithm generates association rules from the
frequent itemsets. An association rule is an implication of the form X => Y, where X
and Y are itemsets. The rule indicates that if X occurs in a transaction, then Y is also
likely to occur in that transaction.
4. Evaluating association rules: The algorithm evaluates the association rules to identify
the most interesting and useful rules. This is typically done using two measures:
support and confidence. Support is the percentage of transactions that contain X and
Y. Confidence is the percentage of transactions that contain Y given that they also
contain X.
Example:

Consider the following grocery store database:

Transaction ID | Items
---------- | --------
1 | {bread, milk, eggs}
2 | {bread, milk}
3 | {bread, eggs}
4 | {bread, milk, eggs}
5 | {milk, eggs}

The algorithm would first identify the frequent itemsets in the data. The following
itemsets occur in at least two transactions:

 {bread}
 {milk}
 {eggs}
 {bread, milk}

 {bread, eggs}
 {milk, eggs}
 {bread, milk, eggs}

The algorithm would then generate association rules from the frequent itemsets. For
example, the following association rules would be generated:

 {bread} => {milk} (support = 60%, confidence = 75%)
 {milk} => {bread} (support = 60%, confidence = 75%)
 {bread} => {eggs} (support = 60%, confidence = 75%)
 {eggs} => {bread} (support = 60%, confidence = 75%)
 {milk} => {eggs} (support = 60%, confidence = 75%)
 {eggs} => {milk} (support = 60%, confidence = 75%)
 {bread, milk} => {eggs} (support = 40%, confidence = 67%)
 {bread, eggs} => {milk} (support = 40%, confidence = 67%)
 {milk, eggs} => {bread} (support = 40%, confidence = 67%)

The algorithm would then evaluate the association rules to identify the most
interesting and useful rules. This is typically done using the support and confidence
measures. For example, the rule {bread} => {milk} has a support of 60% and a
confidence of 75%, which makes it a relatively strong rule. This rule could be used by
the grocery store to make recommendations to customers. For example, the store could
place bread and milk next to each other on the shelves to encourage customers to buy
both items together.
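To make the support and confidence arithmetic above concrete, here is a minimal Python sketch (not part of the original answer) that recomputes both measures for the five transactions listed above; the helper function names are illustrative.

# The five grocery transactions from the example above.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X union Y) / support(X) for the rule X => Y."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "milk"}))               # 0.6
print(confidence({"bread"}, {"milk"}))          # 0.75
print(confidence({"bread", "milk"}, {"eggs"}))  # about 0.667

Running this reproduces the corrected figures given in the rule list above.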

Classification of association rules:

Association rules can be classified in a variety of ways, including:

 Direction: Association rules can be either uni-directional or bi-directional. Uni-
directional rules have the form X => Y, while bi-directional rules have the form X <=>
Y, where X and Y are itemsets. Bi-directional rules indicate that there is a strong
relationship between X and Y, and that the direction of the relationship is not
important.
 Strength: Association rules can be classified by their strength, which is typically
measured by the support and confidence measures. Strong rules have a high support
and confidence, while weak rules have a low support and confidence.
 Actionability: Association rules can also be classified by their actionability, which is a
measure of how useful the rule is for making decisions. Actionable rules are rules that
can be used to take concrete actions, such as making recommendations to
customers or adjusting product placement.

Association rule mining is a powerful technique for uncovering hidden relationships
between items in large transactional datasets.


2. Illustrate the algorithm used for finding frequent itemsets using the candidate
generation concept. How can the efficiency of that algorithm be improved?
The Apriori algorithm is a popular algorithm for finding frequent itemsets using the
candidate generation concept. It works by iteratively generating candidate itemsets
and pruning those that do not meet the minimum support threshold.

Here is a step-by-step illustration of the Apriori algorithm:

1. Initialize: Set the minimum support threshold (min_sup) and build the initial frequent
itemset list (L1) from the single items in the dataset whose support is at least min_sup.
2. Generate candidates: Generate the candidate itemset list (C2) of size 2 by joining the
items in L1.
3. Count support: Count the support for each candidate itemset in C2.
4. Prune candidates: Remove any candidate itemsets in C2 that have a support less
than min_sup.
5. Update frequent itemset list: Set the current frequent itemset list (L2) to the set of all
candidate itemsets in C2 that have a support greater than or equal to min_sup.
6. Repeat steps 2-5: Repeat steps 2-5 for k = 3, 4, ..., n, where n is the maximum size of
the frequent itemsets that you want to find.

The Apriori algorithm is efficient because it prunes candidate itemsets that are
unlikely to be frequent. This is based on the following property:

 Apriori property: If an itemset is not frequent, then none of its supersets can be
frequent.

This property allows the algorithm to avoid generating and counting many candidate
itemsets that are unlikely to be frequent.
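As an illustration of the join-and-prune loop described in the steps above, here is a hedged Python sketch of Apriori; the function name and structure are one possible implementation, not the only one.

from itertools import combinations

def apriori(transactions, min_sup):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # L1: frequent 1-itemsets.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {c: s for c, s in count(singles).items() if s / n >= min_sup}
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c: s for c, s in count(candidates).items() if s / n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

With the five grocery transactions from Question 1 and min_sup = 0.6, this sketch would return the three single items and the three pairs listed earlier (the triple {bread, milk, eggs} falls below the threshold at 40%).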

However, the Apriori algorithm can still be inefficient for large datasets, as it can
generate a large number of candidate itemsets. There are a number of techniques
that can be used to improve the efficiency of the Apriori algorithm, such as:

 Hash tables: Hash tables can be used to quickly count the support for candidate
itemsets.
 Transaction reduction: Transaction reduction techniques can be used to reduce the
size of the dataset without affecting the results of the algorithm.
 Parallel mining: Parallel mining techniques can be used to distribute the computation
of the algorithm across multiple processors.

In addition to the above techniques, there are a number of other algorithms that have
been proposed for finding frequent itemsets. Some of these algorithms, such as the
FP-Growth algorithm, can be more efficient than the Apriori algorithm for large
datasets.

Overall, the Apriori algorithm is a simple and efficient algorithm for finding frequent
itemsets. However, there are a number of techniques that can be used to improve the
efficiency of the algorithm for large datasets.


(a)
Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
Using Apriori:
1. Initialize: Set the minimum support threshold (min_sup) to 60%.
2. Generate candidates: Generate the candidate itemset list (C1) of size 1 by finding the
set of all unique items in the dataset.
3. Count support: Count the support for each candidate itemset in C1.
4. Prune candidates: Remove any candidate itemsets in C1 that have a support less
than min_sup.
5. Update frequent itemset list: Set the current frequent itemset list (L1) to the set of all
candidate itemsets in C1 that have a support greater than or equal to min_sup.
6. Repeat steps 2-5: Repeat steps 2-5 for k = 2, 3, ..., n, where n is the maximum size of
the frequent itemsets that you want to find.

The following is the list of all frequent itemsets found using Apriori:

{M} (support = 40%)
{O} (support = 40%)
{N} (support = 40%)
{K} (support = 40%)
{E} (support = 40%)
{Y} (support = 40%)
{M, O} (support = 60%)
{M, N} (support = 60%)
{M, K} (support = 60%)
{M, E} (support = 60%)
{M, Y} (support = 60%)
{O, N} (support = 60%)
{O, K} (support = 60%)
{O, E} (support = 60%)
{O, Y} (support = 60%)
{N, K} (support = 60%)
{N, E} (support = 60%)
{N, Y} (support = 60%)
{K, E} (support = 60%)
{K, Y} (support = 60%)
{E, Y} (support = 60%)
Using FP-Growth:
1. Construct the FP-tree: The FP-tree is constructed by sorting the items in each
transaction by their support in decreasing order. The items are then added to the FP-
tree, starting with the most frequent item. The count stored at each node of the FP-tree
is the number of transactions that share the prefix path ending at that node.
2. Mine the FP-tree: The FP-tree is mined recursively to find all frequent itemsets. The
algorithm starts at the root of the FP-tree and mines each branch of the tree. To mine
a branch, the algorithm first finds all frequent itemsets that contain the item at the
head of the branch. The algorithm then recursively mines each child of the node at
the head of the branch.

The following is the list of all frequent itemsets found using FP-Growth:

{M} (support = 40%)
{O} (support = 40%)
{N} (support = 40%)
{K} (support = 40%)
{E} (support = 40%)
{Y} (support = 40%)
{M, O} (support = 60%)
{M, N} (support = 60%)
{M, K} (support = 60%)
{M, E} (support = 60%)
{M, Y} (support = 60%)
{O, N} (support = 60%)
{O, K} (support = 60%)
{O, E} (support = 60%)
{O, Y} (support = 60%)
{N, K} (support = 60%)
{N, E} (support = 60%)
{N, Y} (support = 60%)
{K, E} (support = 60%)
{K, Y} (support = 60%)
{E, Y} (support = 60%)
Comparison of the efficiency of the two mining processes:

The FP-Growth algorithm is generally more efficient than the Apriori algorithm for
large datasets. This is because the FP-Growth algorithm constructs an FP-tree, which
allows it to avoid generating and counting many candidate itemsets that are unlikely
to be frequent.
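Below is a minimal Python sketch of FP-tree construction only (the recursive mining of conditional pattern bases is omitted for brevity); the class and variable names are illustrative assumptions, not part of the original answer.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Build an FP-tree; return the root and a header table mapping item -> nodes."""
    # Pass 1: count item frequencies and keep only the frequent items.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {item: c for item, c in freq.items() if c >= min_count}

    root = FPNode(None, None)
    header = defaultdict(list)

    # Pass 2: insert each transaction with its items sorted by descending frequency.
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1   # node count = number of transactions sharing this prefix path
    return root, header

Because transactions sharing frequent prefixes are compressed onto shared paths, the two dataset scans above replace the repeated candidate counting that Apriori performs.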


b)

The following are all the strong association rules (with support s and confidence c)
matching the given metarule:

 {M, O} => {Y} (s = 60%, c = 100%)
 {M, N} => {Y} (s = 60%, c = 100%)
 {M, K} => {Y} (s = 60%, c = 100%)
 {M, E} => {Y} (s = 60%, c = 100%)
 {O, N} => {Y} (s = 60%, c = 100%)
 {O, K} => {Y} (s = 60%, c = 100%)
 {O, E} => {Y} (s = 60%, c = 100%)
 {N, K} => {Y} (s = 60%, c = 100%)
 {N, E} => {Y} (s = 60%, c = 100%)
 {K, E} => {Y} (s = 60%, c = 100%)

These rules are all strong because they have a high support and confidence, which
means that they are both frequent and reliable. For example, the rule {M, O} => {Y}
has a support of 60%, which means that 60% of all the transactions contain the items
{M}, {O}, and {Y} together. The rule also has a confidence of 100%, which means that
every transaction that contains the items {M} and {O} also contains the item {Y}.

These rules can be used by businesses to make recommendations to customers. For
example, a business could recommend that customers who purchase items {M} and
{O} also purchase item {Y}.


II. Review Questions
Chapter 5: Classification
1. What is classification and prediction? Illustrate the classification process.
Classification and prediction are two important tasks in data mining.

Classification is the process of assigning a new data point to a predefined category.
For example, we could classify a new email as spam or not spam, or we could
classify a new customer as likely to churn or not likely to churn.

Prediction is the process of forecasting a future value based on past data. For
example, we could predict the sales of a new product, or we could predict the
probability of a customer clicking on an ad.
Classification process in data mining:

The classification process in data mining typically involves the following steps:

1. Data collection and preparation: The first step is to collect and prepare the data. This
may involve cleaning the data, removing outliers, and converting the data into a
format that can be used by the classification algorithm.
2. Feature selection: The next step is to select the features that will be used to classify
the data. Features are the variables that are used to describe the data points. For
example, the features for classifying emails as spam or not spam could be the
sender's email address, the subject line, and the body of the email.
3. Model training: The next step is to train a classification model. This involves feeding
the training data to the algorithm and allowing it to learn the relationships between the
features and the target classes.
4. Model evaluation: Once the model is trained, it needs to be evaluated to see how well
it performs on unseen data. This is done by feeding the model a test set of data and
comparing the predicted classes to the actual classes.
5. Model deployment: Once the model is evaluated and found to be performing well, it
can be deployed to production. This means that the model can be used to classify
new data points.

There are a variety of classification algorithms that can be used, such as decision
trees, logistic regression, and support vector machines. The best algorithm to use will
depend on the specific problem that you are trying to solve.
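As a minimal illustration of the training, evaluation, and deployment steps above, here is a hedged sketch using scikit-learn's decision tree classifier (assuming scikit-learn is installed); the iris dataset simply stands in for any labeled dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set so that evaluation uses unseen data (step 4).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train the classification model on the training data (step 3).
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Evaluate predicted classes against actual classes (step 4).
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# "Deployment" here is simply classifying a new data point (step 5).
print("predicted class:", model.predict(X_test[:1]))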

Classification is a powerful tool that can be used to solve a variety of problems. For
example, classification can be used to:

 Identify fraudulent transactions
 Target marketing campaigns
 Predict customer churn
 Diagnose diseases

Classification is an important part of data mining and is used by businesses in a
variety of industries.

2. Why is classification needed, and how is it different from clustering?
Explain the issues to be addressed regarding classification and prediction.
Classification is needed because it allows us to make predictions about new data
points based on our knowledge of past data. This can be useful for a variety of tasks,
such as:

 Fraud detection: Classification can be used to identify fraudulent transactions based
on past patterns of fraudulent activity.
 Targeted marketing: Classification can be used to identify customers who are most
likely to be interested in a particular product or service.
 Medical diagnosis: Classification can be used to diagnose diseases based on
symptoms and other medical data.
 Risk assessment: Classification can be used to assess the risk of a customer
defaulting on a loan or committing a crime.

Classification is different from clustering in that classification involves assigning data
points to predefined categories, while clustering involves grouping data points based
on their similarity.

Issues to be addressed regarding classification and prediction:

 Overfitting: Overfitting occurs when a classification model learns the training data too
well and is unable to generalize to new data. This can be addressed by using
regularization techniques, such as L1 and L2 regularization.
 Imbalanced classes: Imbalanced classes occur when one class of data is much more
frequent than another class. This can be addressed by using oversampling or
undersampling techniques.
 Bias: Bias can occur when a classification model is trained on data that is not
representative of the population that it will be used to classify. This can be addressed
by using techniques such as debiasing and fairness-aware machine learning.

In addition to the above issues, there are a number of other challenges that can arise
when developing and deploying classification and prediction models. These
challenges include:

 Data quality: The quality of the training data is essential for training an accurate
classification or prediction model. If the training data is noisy or incomplete, the model
will not be able to learn the correct relationships between the features and the target
classes.
 Model selection: There are a variety of different classification and prediction
algorithms available. The best algorithm to use will depend on the specific problem
that you are trying to solve. It is important to carefully evaluate the different algorithms
before selecting one to use.
 Model interpretation: Once a classification or prediction model has been trained, it is
important to be able to interpret the results. This can be difficult, as machine learning
models are often complex and opaque. It is important to be able to understand how
the model makes its predictions in order to trust the results.


Despite the challenges, classification and prediction are powerful tools that can be
used to solve a variety of problems. By carefully addressing the issues discussed
above, you can develop and deploy classification and prediction models that are
accurate, reliable, and fair.

3. Illustrate classification by decision tree induction approach.
Same question as Question 9 of the 2nd assignment.

4. What are Bayesian classifiers? Illustrate predicting class label using naïve
Bayesian classification.
Bayesian classifiers are a type of machine learning algorithm that uses Bayes'
theorem to predict the probability of a data point belonging to a particular class.
Bayes' theorem is a mathematical formula that allows us to update our beliefs about a
hypothesis in light of new evidence.

Naïve Bayesian classification is a type of Bayesian classifier that is based on the
assumption that the features of a data point are independent of each other. This
assumption is often not true in practice, but it can still be effective for many problems.
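The following is a small hand-rolled sketch of naïve Bayesian prediction under the independence assumption; the tiny weather-style dataset and feature names are invented purely for illustration, and zero-frequency (Laplace) smoothing is omitted for brevity.

from collections import Counter, defaultdict

# Invented dataset: (outlook, windy) -> play?
data = [
    (("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
    (("rain",  "no"), "yes"), (("rain",  "yes"), "no"),
    (("sunny", "no"), "yes"), (("rain",  "no"), "yes"),
]

class_counts = Counter(label for _, label in data)
# Frequency counts used to estimate P(feature value | class).
cond = defaultdict(Counter)
for features, label in data:
    for i, value in enumerate(features):
        cond[(i, label)][value] += 1

def predict(features):
    """Pick the class maximizing P(class) multiplied by each P(x_i | class)."""
    best_label, best_score = None, -1.0
    for label, label_count in class_counts.items():
        score = label_count / len(data)                 # prior P(class)
        for i, value in enumerate(features):
            # Independence assumption: multiply per-feature likelihoods.
            score *= cond[(i, label)][value] / label_count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(("sunny", "no")))   # -> "yes"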

Same question as Question 2 of the 2nd assignment.

5. Illustrate the regression technique used for the prediction of continuous
values.
There are many regression techniques used for the prediction of continuous values in
data mining. One of the most popular techniques is linear regression.

Linear regression is a statistical method that is used to model the relationship
between a dependent variable and one or more independent variables. It is a type of
supervised learning, which means that it is trained on a set of data where the
dependent variable is known.

To predict the continuous value of a new data point using linear regression, we can
use the following steps:

1. Train a linear regression model on a set of data where the dependent variable is
known.
2. Use the trained model to predict the continuous value of the new data point.

Here is an example of how to predict the continuous value of a new data point using
linear regression:

Suppose we have a dataset of house prices. We want to use this dataset to train a
linear regression model to predict the price of a new house.

The dependent variable in this dataset is the price of the house, and the independent
variables could be the square footage of the house, the number of bedrooms, and the
location of the house.


To train the model, we would first need to fit the model to the training data. This can
be done using a variety of statistical software packages.

Once the model is trained, we can use it to predict the price of the new house. To do
this, we would simply input the values of the independent variables for the new house
into the model and the model would output a predicted price.
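Here is a minimal sketch of these two steps using scikit-learn's LinearRegression; the tiny house-price table is invented purely for illustration and would be replaced by a real training set.

import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [square footage, number of bedrooms]; targets: known prices.
X_train = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]])
y_train = np.array([200_000, 270_000, 330_000, 400_000])

# Step 1: fit the model on data where the dependent variable (price) is known.
model = LinearRegression().fit(X_train, y_train)

# Step 2: predict the price of a new house from its feature values.
new_house = np.array([[1800, 3]])
print("predicted price:", model.predict(new_house)[0])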

Linear regression is a powerful tool that can be used to predict continuous values in a
variety of applications. It is a relatively simple technique to understand and
implement, and it can be used to model linear relationships between variables.

However, linear regression is not always the best technique to use for predicting
continuous values. If the relationship between the dependent variable and the
independent variables is non-linear, then linear regression will not be able to
accurately model the relationship.

In these cases, it is important to use a non-linear regression technique, such as
polynomial regression or support vector regression.

In addition to linear regression, there are a number of other regression and related
supervised learning techniques that can be used for prediction. These techniques include:

 Logistic regression: Logistic regression is a type of regression that is used to predict
binary outcomes. For example, logistic regression could be used to predict whether or
not a customer is likely to churn or whether or not a patient is likely to have a
particular disease.
 Decision trees: Decision trees are a type of non-linear regression technique that can
be used to model complex relationships between variables. Decision trees are often
used to predict continuous values, but they can also be used to predict binary
outcomes.
 Random forests: Random forests are a type of ensemble learning technique that
combines the predictions of multiple decision trees to produce a more accurate
prediction. Random forests are often used to predict continuous values, but they can
also be used to predict binary outcomes.

The best regression technique to use for a particular problem will depend on the
nature of the data and the desired outcome.

6. What is the need of estimating classifier accuracy? Explain techniques used
for assessing classifier accuracy.
Estimating classifier accuracy is important because it allows us to evaluate the
performance of a classifier and to identify areas where the classifier can be improved.

Holdout, random subsampling, cross-validation, and bootstrap methods are common
techniques for assessing accuracy, based on randomly sampled partitions of the given data.



There are a number of different techniques that can be used to assess classifier
accuracy. One common technique is to use a holdout set.

A holdout set is a set of data that is not used to train the classifier and is only used
to evaluate the classifier's performance.

To assess classifier accuracy using a holdout set, we would first train the classifier on
the training set. Once the classifier is trained, we would classify the holdout set and
compare the predicted labels to the actual labels. The accuracy of the classifier is
then calculated as the percentage of data points in the holdout set that were correctly
classified.

Another common technique for assessing classifier accuracy is to use cross-
validation. Cross-validation is a technique that involves splitting the training data into
multiple sets and then training and evaluating the classifier on each set.

To assess classifier accuracy using cross-validation, we would first split the training
data into k folds. We would then train the classifier on k - 1 folds and evaluate the
classifier on the remaining fold. We would then repeat this process k times, training
and evaluating the classifier on a different fold each time.

The accuracy of the classifier is then calculated as the average of the accuracies on
the k folds.

Random subsampling is a technique that involves randomly splitting the training
data into multiple sets and then training and evaluating the classifier on each set. The
accuracy of the classifier is then calculated as the average of the accuracies on the
multiple sets.

Bootstrap methods are a class of techniques that involve repeatedly sampling the
training data with replacement and then training and evaluating the classifier on each
sample. The accuracy of the classifier is then estimated from the distribution of
accuracies over the multiple samples.
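A minimal sketch of the holdout and k-fold cross-validation estimates described above, using scikit-learn; the choice of classifier and dataset is illustrative, and any classifier could be substituted.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: train on one partition, measure accuracy on the held-out partition.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = clf.fit(X_train, y_train).score(X_test, y_test)

# 10-fold cross-validation: average accuracy over 10 train/test splits.
cv_acc = cross_val_score(clf, X, y, cv=10).mean()

print(f"holdout accuracy: {holdout_acc:.3f}, 10-fold CV accuracy: {cv_acc:.3f}")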


III. Review Questions
Chapter 7: Cluster Analysis
1. What is cluster analysis? Explain typical requirements of clustering in data
mining.
Cluster analysis is a data mining technique that groups similar data points together.
The goal of cluster analysis is to identify patterns and relationships within the data
that may not be immediately obvious.

Cluster analysis or simply clustering is the process of partitioning a set of data objects
(or observations) into subsets. Each subset is a cluster, such that objects in a cluster
are similar to one another, yet dissimilar to objects in other clusters. The set of
clusters resulting from a cluster analysis can be referred to as a clustering.

Typical requirements of clustering in data mining: same as Question 7 of the 2nd assignment.

2. Explain various types of data used in cluster analysis.
The following are various types of data used in cluster analysis:

 Numerical data: Numerical data is data that can be represented by numbers, such as
height, weight, and age. Numerical data is the most common type of data used in
cluster analysis, as it is easy to measure and compare.
 Categorical data: Categorical data is data that can be classified into categories, such
as gender, hair color, and occupation. Categorical data can be used in cluster
analysis, but it is more difficult to measure and compare than numerical data.
 Binary data: Binary data is data that can take on two values, such as yes or no, true
or false, and 1 or 0. Binary data is often used in cluster analysis to identify groups of
data points that share a common characteristic.
 Text data: Text data is data that is represented in a text format, such as emails,
documents, and social media posts. Text data can be used in cluster analysis, but it is
more difficult to measure and compare than other types of data.

3. Explain categorization of major clustering methods.
The major clustering methods can be organized into the following categories:
partitioning methods (Section 10.2), hierarchical methods (Section 10.3),
density-based methods (Section 10.4), and grid-based methods (Section 10.5).


Partitioning methods: Partitioning methods divide the data into a predefined number
of clusters. The goal of partitioning methods is to find clusters that are compact and
well-separated. Some common partitioning methods include k-means clustering and
k-medoids clustering.

Density-based methods: Density-based methods group together data points that are
close to each other in density. The goal of density-based methods is to find clusters
that are dense and well-defined. Some common density-based methods include
DBSCAN and OPTICS.

Hierarchical methods: Hierarchical methods create a hierarchy of clusters by
merging or splitting them iteratively. The goal of hierarchical methods is to find
clusters that are nested and represent different levels of abstraction. Some common
hierarchical methods include agglomerative clustering and divisive clustering.

Grid-based methods are a type of clustering method that divides the data space into
a grid and then groups together data points that are located in the same grid cell.
Grid-based methods are particularly useful for clustering high-dimensional data, as
they can reduce the dimensionality of the data by quantizing it into a grid structure.

4. What is the need of partitioning method? Illustrate the K-means algorithm for
the same.
Partitioning methods are a type of clustering algorithm that divide the data into a
predefined number of clusters. The goal of partitioning methods is to find clusters that
are compact and well-separated.

Partitioning methods are needed because they are simple to implement and
understand, and they are efficient and scalable to large datasets. Partitioning
methods are also well-suited for clustering data where the clusters are non-
overlapping and well-defined.

K-means clustering is a popular partitioning algorithm that works by iteratively
assigning data points to the cluster with the closest centroid. The centroid of a cluster
is the average of all the data points in the cluster.

The k-means clustering algorithm works as follows:

1. Randomly initialize k centroids.
2. Assign each data point to the cluster with the closest centroid.
3. Update the centroids of the clusters to be the average of all the data points in the
cluster.
4. Repeat steps 2 and 3 until the centroids converge or the maximum number of
iterations is reached.

The k-means clustering algorithm is a simple and effective algorithm for partitioning
data into a predefined number of clusters. However, it is important to note that the
k-means clustering algorithm can be sensitive to the choice of initial centroids and the
number of clusters.

Here is an example of how the k-means clustering algorithm can be used to cluster
customer data:

Suppose we have a dataset of customer data that includes the customer's age,
gender, and income. We want to use the k-means clustering algorithm to cluster the
customers into three groups.

We would first need to choose the number of clusters. In this case, we would choose
three clusters.

We would then randomly initialize three centroids. The centroids could be initialized
by randomly selecting three data points from the dataset.

Next, we would assign each data point to the cluster with the closest centroid. We can
do this by calculating the distance between each data point and each centroid and
then assigning the data point to the cluster with the closest centroid.

Once we have assigned all of the data points to clusters, we would update the
centroids of the clusters to be the average of all the data points in the cluster.

We would then repeat the process of assigning data points to clusters and updating
the centroids until the centroids converge or the maximum number of iterations is
reached.

Once the algorithm has converged, we will have three clusters of customers. We can
then analyze the different clusters to identify patterns and trends. For example, we
might find that one cluster of customers is made up of younger customers with lower
incomes, while another cluster of customers is made up of older customers with
higher incomes.

This information could then be used to target marketing campaigns and develop
products and services that are tailored to the needs of each customer cluster.

K-means clustering is a powerful tool for partitioning data into a predefined number of
clusters. It is simple to implement and understand, and it is efficient and scalable to
large datasets. However, it is important to note that the k-means clustering algorithm
can be sensitive to the choice of initial centroids and the number of clusters.
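Here is a minimal sketch of the k-means steps above using scikit-learn's KMeans; the small age/income customer table is invented purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Features: [age, annual income]
customers = np.array([
    [22, 25_000], [25, 28_000], [29, 32_000],   # younger, lower income
    [45, 70_000], [50, 80_000], [55, 75_000],   # older, higher income
    [35, 48_000], [38, 52_000], [40, 50_000],   # middle group
])

# Steps 1-4: initialize centroids, assign points, update centroids, repeat until convergence.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

print("cluster labels:", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)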

(See textbook Section 10.2.1, k-Means: A Centroid-Based Technique.)

5. Justify the need of hierarchical methods. Explain different methods.
Hierarchical clustering methods are needed because they can be used to cluster data
without having to specify the number of clusters in advance. This can be useful for
problems where the number of clusters is unknown or where the clusters are not well-
defined. Hierarchical clustering methods are also well-suited for clustering data where
there are hierarchical relationships between the clusters.

There are two main types of hierarchical clustering methods:

 Agglomerative clustering: Agglomerative clustering methods start with each data point
in its own cluster and then iteratively merge the two closest clusters until a single
cluster remains.

 Divisive clustering: Divisive clustering methods start with all of the data points in a
single cluster and then iteratively split the cluster into two smaller clusters until each
cluster contains a single data point.

Both agglomerative and divisive clustering methods can be used to produce a
dendrogram, which is a tree-like diagram that shows the hierarchical relationships
between the clusters. The dendrogram can be used to identify the optimal number of
clusters by cutting the tree at a certain height.

Here are some examples of when hierarchical clustering methods might be used:

 Clustering gene data: Hierarchical clustering methods can be used to cluster gene
data based on the expression levels of the genes. This information can then be used
to identify groups of genes that are co-expressed and to better understand the
biological processes that they are involved in.
 Clustering customer data: Hierarchical clustering methods can be used to cluster
customer data based on their purchase history, demographics, and other factors. This
information can then be used to identify groups of customers with similar needs and
preferences.
 Clustering image data: Hierarchical clustering methods can be used to cluster image
data based on the color and texture of the pixels. This information can then be used
to identify objects in images and to segment images into different regions.

Hierarchical clustering methods are a powerful tool for clustering data where the
number of clusters is unknown or where there are hierarchical relationships between
the clusters. However, hierarchical clustering methods can be computationally
expensive for large datasets.

Here is an example of how agglomerative clustering can be used to cluster customer
data:

Suppose we have a dataset of customer data that includes the customer's age,
gender, and income. We want to use agglomerative clustering to cluster the
customers into groups.

We would first start with each data point in its own cluster.

Next, we would calculate the distance between each pair of clusters. We can use any
distance metric, but the Euclidean distance is commonly used.

We would then find the two closest clusters and merge them into a single cluster.

We would repeat the process of finding the two closest clusters and merging them
until there is only one cluster left.


Once the algorithm has finished, we have a complete dendrogram. By cutting the
dendrogram at a chosen height, we obtain clusters of customers that we can analyze
to identify patterns and trends. For example, we might find that one cluster is made up
of younger customers with lower incomes.

This information could then be used to target marketing campaigns and develop
products and services that are tailored to the needs of this customer group.
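As a companion to the worked example above, here is a minimal sketch of agglomerative clustering using SciPy's linkage and fcluster utilities; the customer values are invented for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Features: [age, annual income]
customers = np.array([
    [22, 25_000], [25, 28_000], [29, 32_000],
    [45, 70_000], [50, 80_000], [55, 75_000],
])

# Repeatedly merge the two closest clusters (Euclidean distance, average linkage).
Z = linkage(customers, method="average", metric="euclidean")

# Cut the resulting dendrogram to obtain, say, two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]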

6. Write short notes on density-based methods, grid-based methods, and
model-based clustering methods.
Density-based methods

Density-based clustering methods group together data points that are close to each
other in density. The goal of density-based methods is to find clusters that are dense
and well-defined. Some common density-based methods include:

 DBSCAN (Density Based Spatial Clustering of Applications with Noise): DBSCAN is a
popular density-based method that uses a density threshold and a minimum number
of points to identify clusters (see the sketch after this list).
 OPTICS (Ordering Points To Identify the Clustering Structure): OPTICS is a density-
based method that orders the data points based on their density and their distance to
other data points. This ordering can then be used to identify clusters.
 DENCLUE (DENsity-based CLUstEring): DENCLUE is a density-based method that
identifies clusters by finding maximal sets of dense units.
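A minimal sketch of DBSCAN using scikit-learn, where eps and min_samples play the role of the density threshold and minimum number of points mentioned above; the 2-D points are invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [1.0, 1.0], [1.1, 1.2], [0.9, 1.1],     # dense group A
    [8.0, 8.0], [8.2, 8.1], [7.9, 8.3],     # dense group B
    [4.5, 0.5],                             # isolated point -> noise
])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks a noise point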
Grid-based methods

Grid-based clustering methods divide the data space into a grid and then group
together data points that are located in the same grid cell. Grid-based methods are
particularly useful for clustering high-dimensional data, as they can reduce the
dimensionality of the data by quantizing it into a grid structure.

Some common grid-based methods include:

 STING (Statistical Information Grid): STING is a grid-based method that uses
statistical information about the data to group together similar data points.
 WaveCluster: WaveCluster is a grid-based method that uses a wavelet transform to
group together similar data points.
 CLIQUE (CLustering In QUEst): CLIQUE is a grid-based method that uses a
density-based approach to group together similar data points.
Model-based methods

Model-based clustering methods assume that the data is generated by a specific
statistical model. The goal of model-based methods is to find the parameters of the
model that best fit the data and to use these parameters to identify clusters. Some
common model-based methods include:

 Gaussian mixture models: Gaussian mixture models assume that the data is
generated by a mixture of Gaussian distributions. The parameters of the model can
be estimated using the expectation-maximization (EM) algorithm (see the sketch after this list).

 Hidden Markov models: Hidden Markov models assume that the data is generated by
a sequence of hidden states. The parameters of the model can be estimated using
the Baum-Welch algorithm.
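As a minimal sketch of model-based clustering, the following fits a two-component Gaussian mixture with the EM algorithm via scikit-learn's GaussianMixture; the synthetic data is invented for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian "clusters" of 2-D points.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# EM estimates the mixture weights, means, and covariances from the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("estimated component means:\n", gmm.means_)
print("first point's cluster:", gmm.predict(data[:1])[0])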
Advantages and disadvantages of each method
Density-based methods
 Advantages:
o Can identify clusters of arbitrary shape and size.
o Can identify clusters in the presence of noise.
 Disadvantages:
o Sensitive to the choice of density threshold and minimum number of points.
o Can be computationally expensive for large datasets.
Grid-based methods
 Advantages:
o Simple to implement and understand.
o Efficient and scalable to large datasets.
o Can be used to cluster high-dimensional data.
 Disadvantages:
o Sensitive to the choice of grid resolution.
o Sensitive to the presence of noise and outliers in the data.
o May not be able to identify clusters that are non-linear or that span multiple grid cells.
Model-based methods
 Advantages:
o Can identify clusters of arbitrary shape and size.
o Can be used to cluster high-dimensional data.
 Disadvantages:
o Sensitive to the choice of statistical model.
o Can be computationally expensive for large datasets.
Conclusion

The best clustering method to use will depend on the specific problem and the nature
of the data. Density-based methods are well-suited for clustering data where the
clusters are dense and well-defined. Grid-based methods are well-suited for
clustering high-dimensional data and for problems where the number of clusters is
unknown. Model-based methods are well-suited for clustering data where the clusters
are generated by a known statistical model.

