DM Unit 4

Classification is a data analysis technique aimed at categorizing data into distinct classes, using known class labels from training data to build predictive models. Decision trees are a popular method for classification, utilizing a flowchart-like structure to represent decisions based on attribute tests, while Bayesian classifiers apply Bayes' Theorem to predict class probabilities. Various attribute selection measures, such as Information Gain and Gini Index, help optimize the decision tree construction process, while rule-based classifiers use IF-THEN rules for predictions.


What is Classification?

• Classification is a type of data analysis where the goal is to categorize data into different classes or groups.

• For example, predicting whether a loan application is "safe" or "risky," or determining the appropriate medical treatment.

Difference from Numeric Prediction:

• Classification deals with categorical labels (like "safe" or "risky"), while numeric prediction deals with continuous values (like predicting how much money a customer will spend).

• Classification is a major type of prediction, along with numeric prediction (regression).

General Approach to Classification (Two Steps):

• Step 1: Build the Model (Learning Phase):
  o In this step, a classifier is built using a set of data (called training data) that includes both the input features and the correct class labels.
  o The classifier "learns" from this data to make predictions.
  o This is called supervised learning because the class labels are already known.

• Step 2: Evaluate and Use the Model (Classification Phase):
  o Once the classifier is built, it is tested on new data (called test data) that wasn't used during training.
  o The classifier's accuracy is evaluated by comparing its predictions to the correct class labels of the test data.
  o If the accuracy is acceptable, the classifier can be used to predict the class labels of future, unknown data.

Key Concepts:

• Training Data: Data with known class labels used to build the model.

• Test Data: Data with known class labels used to evaluate the model.

• Accuracy: The percentage of correct predictions made by the model on the test data.

• If the model is accurate, it can be used to classify new, unknown data.

What is Decision Tree Induction?

1. Definition:
  o Decision tree induction is a method used to create decision trees from labeled data (training tuples).
  o A decision tree is a flowchart-like structure with:
    ▪ Internal nodes that represent tests on attributes.
    ▪ Branches that show the possible outcomes of these tests.
    ▪ Leaf nodes that give the final class label (the prediction).

2. How Decision Trees Work:
  o Given a new data point (tuple), the tree tests its attributes from the root to a leaf node.
  o Each internal node tests an attribute, and the path taken depends on the test outcome.
  o Eventually, the path leads to a leaf node, which provides the class label for the data point.

3. Why Are Decision Trees Popular?:
  o They don't require domain-specific knowledge or complicated parameter settings.
  o They are easy to understand because they visually represent decisions.
  o They can handle complex, multi-dimensional data.
  o They are fast and efficient for both learning and classification.
  o They are widely used in fields like medicine, finance, and marketing.

Key Concepts in Decision Tree Construction:

1. Building a Decision Tree:
  o Attribute Selection: The algorithm picks the best attribute (test) to split the data at each internal node.
  o Greedy Approach: The tree is built in a top-down manner, recursively dividing data into smaller subsets until certain conditions are met.
  o Recursive Process: The algorithm keeps splitting the data until:
    ▪ All data in a partition belong to the same class.
    ▪ There are no attributes left to split.
    ▪ A partition is empty.

2. Steps in Decision Tree Algorithm (see the sketch after this section):
  o Start with all training data.
  o If all data in a partition belong to the same class, create a leaf node.
  o If no attributes are left, use majority voting to assign the most common class to the node.
  o Use a selection method (e.g., Information Gain or Gini Index) to pick the best attribute to split the data.
  o Repeat this process for each branch created.

3. Types of Attributes and Splits:
  o Discrete-valued attributes: A separate branch is created for each possible value.
  o Continuous-valued attributes: The data is split into two branches based on a split-point (e.g., A ≤ x or A > x).
  o Binary trees: Some algorithms enforce binary trees, meaning each internal node only has two branches.
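The recursive procedure and split handling described above can be sketched as follows. This is a minimal illustration, not the exact algorithm from the text: it assumes discrete-valued attributes stored as Python dicts, uses information gain as the selection measure, and applies majority voting at the leaves. The function names and the tiny training set are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split gives the highest information gain."""
    def gain(attr):
        split_info = 0.0
        for value in set(r[attr] for r in rows):
            subset = [l for r, l in zip(rows, labels) if r[attr] == value]
            split_info += len(subset) / len(rows) * entropy(subset)
        return entropy(labels) - split_info
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Recursive decision tree induction with majority voting at the leaves."""
    if len(set(labels)) == 1:              # all tuples in the partition agree
        return labels[0]
    if not attributes:                     # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    for value in set(r[attr] for r in rows):   # one branch per attribute value
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        tree[attr][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != attr])
    return tree

# Tiny illustrative training set (attribute values and labels are made up).
rows = [{"age": "youth", "student": "yes"}, {"age": "youth", "student": "no"},
        {"age": "senior", "student": "no"}, {"age": "senior", "student": "yes"}]
labels = ["yes", "no", "no", "yes"]
print(build_tree(rows, labels, ["age", "student"]))
```

For a continuous-valued attribute, the loop over values would instead evaluate candidate split-points of the form A ≤ x versus A > x and keep the best one.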
4. Stopping Conditions:
  o Stop if all data in a partition belong to the same class.
  o Stop if no more attributes are available for splitting.
  o Stop if a partition is empty (then assign the majority class).

5. Computational Complexity:
  o The time it takes to build the tree depends on the number of attributes and training examples.
  o It grows at a rate of O(n × |D| × log(|D|)), where n is the number of attributes and |D| is the number of training examples.

6. Incremental Decision Trees:
  o For new data, some algorithms update the existing tree instead of building a new one from scratch.

7. Scalability Issues:
  o Decision tree induction can be slow and memory-intensive for large datasets. There are improvements to address this issue.

8. Tree Pruning:
  o After building the tree, pruning can be done to remove branches that are based on noise or outliers, which can improve the accuracy.

9. Applications:
  o Decision trees are used in various fields, such as medical diagnosis, financial analysis, and machine learning systems for rule induction.

Attribute Selection Measures:

• These measures help choose the best attribute (feature) to split a set of training data into different groups in decision trees.

• A good attribute is one that divides the data into subsets where each subset is as pure as possible, meaning all items in that subset belong to the same class.

How it Works:

1. Goal: We aim to find the attribute that best splits the data to reduce the uncertainty or "impurity" of the subsets.

2. Best Attribute: The attribute with the highest score (after applying an attribute selection measure) is chosen as the splitting criterion.

3. Split Method:
  o For continuous attributes, the splitting point is selected (e.g., a threshold).
  o For discrete attributes, we use the distinct values of the attribute to split the data into subsets.

Common Measures:

1. Information Gain:
  o It is based on the concept of "entropy" (the level of uncertainty in the data).
  o We calculate how much uncertainty (information) is reduced by splitting the data based on a specific attribute.
  o Higher information gain means a better split because it reduces more uncertainty.

2. Gain Ratio:
  o This improves on information gain by accounting for attributes that have many distinct values (which can result in overfitting).
  o It normalizes the information gain to avoid favoring attributes with many possible values (e.g., unique identifiers like product IDs).
  o The attribute with the highest gain ratio is selected.

3. Gini Index:
  o This measure calculates the "impurity" of the data.
  o A lower Gini index means the data is purer (fewer mixed classes).
  o It's used in methods like CART (Classification and Regression Trees).
  o It looks for binary splits (dividing the data into two groups) that result in the lowest impurity.

Examples:

1. Information Gain Example:
  o The data is split based on different attributes (like age or income).
  o The attribute with the highest information gain is selected to split the data at the root node of the tree.

2. Gain Ratio Example:
  o Attributes like income are evaluated to see which split provides the best gain ratio. This avoids choosing attributes with many values that create overly specific splits.

3. Gini Index Example:
  o For attributes like income, the Gini index is calculated for all possible splits.
  o The attribute and its best split (e.g., income split into "low" and "medium") that minimize the Gini index are chosen for the decision tree.

Summary:

• These measures help create decision trees by selecting the most informative attributes for splitting data.

• Information gain is used in ID3, gain ratio in C4.5, and the Gini index in CART.

• The goal is to find the best attribute and split that reduces uncertainty or impurity in the data (a small computational example follows).
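To make the three measures concrete, here is a small sketch (with made-up class counts) that computes entropy-based information gain, a gain ratio, and the reduction in Gini index for one candidate binary split; the helper names are illustrative, not a library API.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gini(counts):
    """Gini impurity of a class distribution: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Made-up example: 14 tuples (9 "yes", 5 "no") split by a test on income into a
# left partition (10 tuples: 7 yes / 3 no) and a right one (4 tuples: 2 yes / 2 no).
parent = [9, 5]
left, right = [7, 3], [2, 2]
n = sum(parent)
w_left, w_right = sum(left) / n, sum(right) / n

info_gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))
split_info = entropy([sum(left), sum(right)])   # penalizes many-valued splits
gain_ratio = info_gain / split_info
gini_reduction = gini(parent) - (w_left * gini(left) + w_right * gini(right))

print(round(info_gain, 3), round(gain_ratio, 3), round(gini_reduction, 3))
```

Whichever attribute and split-point maximize the chosen score (information gain for ID3, gain ratio for C4.5, Gini reduction for CART) would be selected as the splitting criterion.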
Attribute Selection Measures (continued):

• Attribute selection measures help decide which features are important when building decision trees.

• Three commonly used measures are:
  o Information Gain: Prefers attributes with many possible values but can be biased.
  o Gain Ratio: Adjusts for bias in Information Gain but favors unbalanced splits.
  o Gini Index: Prefers partitions with equal-sized groups but struggles with many classes.

• There are other measures, such as CHAID, C-SEP, and the G-statistic, each with their strengths and weaknesses.

• Some measures, like MDL (Minimum Description Length), reduce bias and prefer simpler solutions.

• Other techniques include multivariate splits, where multiple attributes are combined for splitting.

Which Attribute Selection Measure is Best?

• All measures have biases, and no one measure is perfect.

• Decision trees can be shallow or deep depending on the measure used. Shallow trees may have more leaves and higher error rates.

• There's no universal best measure; they all work well in practice.

Tree Pruning:

• Tree pruning helps reduce overfitting (when the tree is too specific to the training data) by removing unnecessary branches.

• There are two main pruning methods:
  o Prepruning: Stops the tree from growing further if a split doesn't meet certain criteria.
  o Postpruning: After the tree is fully grown, subtrees are removed and replaced with a leaf node.

• Pruned trees are smaller, simpler, and more accurate on test data.

• Examples of pruning algorithms include:
  o Cost Complexity Pruning (used in CART): Balances the size of the tree and its error rate.
  o Pessimistic Pruning (used in C4.5): Adjusts error rates to avoid over-optimism from the training set.
  o MDL-based pruning: Prefers simpler trees by considering the number of bits needed to encode the tree.

Scalability and Decision Tree Induction:

• Traditional decision tree algorithms work well on small datasets but struggle with very large ones.

• When the data doesn't fit in memory, techniques like discretizing continuous values and sampling help, but these still require some data to fit in memory.

• Scalable methods like RainForest and BOAT help handle larger datasets:
  o RainForest: Uses a technique called the AVC-set to keep memory usage low.
  o BOAT: Creates multiple smaller samples of data, fits them in memory, and combines them to create a final tree.

• BOAT is faster and can update the tree incrementally when new data is added or deleted.
Interactive Tree Construction:

• Perception-Based Classification (PBC) allows users to interactively visualize and build decision trees.

• It helps visualize data, making it easier to build smaller, more understandable trees while still maintaining accuracy.

• The system shows two windows:
  o Data Interaction Window: Displays the data visually.
  o Knowledge Interaction Window: Shows the current decision tree.

• Users can:
  o Select the attribute and split points for decision making.
  o Visualize data for a specific node.
  o Assign class labels to nodes.

• PBC builds smaller trees than traditional algorithms like CART, C4.5, and SPRINT, and the trees are easier to understand.

Bayesian Classifiers Overview:

1. Bayesian classifiers are statistical tools used to predict the likelihood (probability) that a given data point belongs to a certain class (category).

2. The Naïve Bayesian Classifier is a simpler version that assumes the features (attributes) of the data are independent of each other, given the class.

3. These classifiers are based on Bayes' Theorem, which helps to calculate the probability of a class given certain evidence (data attributes).

Key Concepts of Bayes' Theorem:

1. P(H|X): The probability that a hypothesis (class) is true, given the evidence (data tuple).

2. P(H): The prior probability of the class (before we know anything about the data).

3. P(X|H): The probability of the evidence (data tuple) occurring, given the class.

4. P(X): The probability of the evidence (data tuple) itself.

Bayes' Theorem formula:

• P(H|X) = P(X|H) * P(H) / P(X)
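A tiny worked use of the formula, with hypothetical probabilities chosen purely for illustration:

```python
# Hypothetical values: H = "buys computer", X = the observed attribute values.
p_h = 0.5            # P(H): prior probability of the class
p_x_given_h = 0.4    # P(X|H): probability of the evidence given the class
p_x = 0.25           # P(X): probability of the evidence itself

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' Theorem
print(p_h_given_x)                      # 0.8, the posterior probability P(H|X)
```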
Naïve Bayesian Classification Steps:

1. Training: Use a training set of data with known class labels.

2. Prediction: For a new data point, calculate the posterior probability for each class.
  o Predict the class that has the highest probability (based on Bayes' Theorem).

3. Independence Assumption: To simplify the calculation, the naïve classifier assumes that all attributes are independent given the class.
  o This allows us to break down the complex probability into the product of simpler probabilities for each attribute.

4. Class Probability Calculation:
  o For each class, compute the probability using:
    ▪ P(X|Ci) = P(x1|Ci) * P(x2|Ci) * … * P(xn|Ci)
    ▪ where xk is the value of each attribute, and Ci is the class.
  o Multiply this by the class prior probability to get the final value.

5. Classification: Choose the class with the maximum value as the prediction.

Handling Zero Probability Issue:

• Sometimes, in the calculation, we may encounter a zero probability for a certain attribute's value (e.g., if no data point has a certain combination of values).

• To avoid the problem of zero probabilities (which would make the entire product of probabilities zero), we use a Laplace Correction (also called Laplacian Smoothing).
  o Add 1 to each count in the training data to avoid zero probabilities.
  o This method slightly adjusts the probabilities but helps ensure no attribute has a zero probability.

Example:

• If you have 1000 training tuples and 0 tuples with a specific attribute value, using the Laplace correction would assume 1 extra tuple to avoid a zero probability (see the sketch below).
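The steps above, including the Laplacian correction, can be sketched as follows for discrete attributes. The toy training tuples, attribute names, and class labels are invented for illustration, and this is a minimal sketch rather than a full implementation.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate class priors and per-attribute conditional probabilities,
    applying a Laplacian correction (add 1 to every value count)."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    counts = defaultdict(Counter)          # (class, attribute) -> value counts
    values = defaultdict(set)              # attribute -> set of observed values
    for row, c in zip(rows, labels):
        for attr, val in row.items():
            counts[(c, attr)][val] += 1
            values[attr].add(val)

    def cond_prob(c, attr, val):
        class_total = sum(counts[(c, attr)].values())
        # Laplacian smoothing: +1 per count, + number of distinct values below.
        return (counts[(c, attr)][val] + 1) / (class_total + len(values[attr]))

    return priors, cond_prob

def predict(x, priors, cond_prob):
    """Choose the class with the maximum P(C) * prod_k P(x_k | C)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, val in x.items():
            score *= cond_prob(c, attr, val)
        scores[c] = score
    return max(scores, key=scores.get)

rows = [{"age": "youth", "student": "yes"}, {"age": "youth", "student": "no"},
        {"age": "senior", "student": "no"}, {"age": "senior", "student": "yes"}]
labels = ["yes", "no", "no", "yes"]
priors, cond_prob = train_naive_bayes(rows, labels)
print(predict({"age": "youth", "student": "yes"}, priors, cond_prob))
```

Because the correction adds 1 to every count (and the number of distinct attribute values to the denominator), no conditional probability can be exactly zero, so a single unseen value can no longer wipe out the whole product.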
Rule-Based Classification (Overview)

• Rule-Based Classifiers: These classifiers use a set of rules to predict the class of a data point.

• IF-THEN Rules: The rules are in the form "IF condition THEN conclusion." For example:
  o Rule: IF age = youth AND student = yes THEN buys computer = yes.

• Rule Components:
  o Antecedent (IF part): Contains conditions (e.g., age = youth, student = yes).
  o Consequent (THEN part): Contains the prediction (e.g., buys computer = yes).

• Coverage: The number of data points (tuples) a rule applies to.

• Accuracy: How many of those data points are correctly predicted by the rule.

Key Definitions:

• Coverage of a Rule: Percentage of data points that the rule applies to.

• Accuracy of a Rule: Percentage of correct predictions made by the rule.

Example: Rule Evaluation

• Example 1: If a rule covers 2 of 14 data points and both are correctly classified, the coverage is 14.28% (2/14) and the accuracy is 100% (2/2).

8.4.1 Using IF-THEN Rules for Classification

• Triggering the Rule: A rule is "triggered" when the conditions of the rule (IF part) match a data point.

• Conflict Resolution: If multiple rules apply to the same data point, we need a method to decide which rule to use.
  o Size Ordering: Choose the rule with the most conditions (the toughest rule).
  o Rule Ordering: Pre-arrange rules in order of importance (e.g., by class frequency).

• Default Rule: If no rule matches, a default rule can be applied (e.g., predicting the majority class).

8.4.2 Rule Extraction from a Decision Tree

• Extracting Rules: You can turn a decision tree into rules:
  o Each path from the root to a leaf node forms a rule.
  o The conditions on the path become the rule's "IF" part.
  o The leaf node gives the prediction, which forms the "THEN" part.

• Example:
  o Path from root to leaf: age = youth AND student = yes → buys computer = yes
  o This path becomes the rule: "IF age = youth AND student = yes THEN buys computer = yes."

• Pruning: After extracting the rules, unnecessary or redundant conditions can be removed to make the rules simpler.

8.4.3 Rule Induction Using a Sequential Covering Algorithm

• Sequential Covering: This is a method where rules are learned one at a time:
  o For each class, a rule is created that covers as many data points as possible for that class.
  o Once a rule is learned, the data points it covered are removed, and the process repeats until the stopping condition is met (e.g., no more data or rule quality is low).

• Learning Process:
  o Start with an empty rule.
  o Gradually add conditions (attribute tests) to the rule.
  o Choose the best test at each step (greedy approach).

• Greedy Approach: At each step, the best attribute test is added to the rule, without looking back to fix mistakes made earlier. This is like a search for the best rule, but it doesn't reconsider previous decisions (a sketch follows).
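A greatly simplified sketch of sequential covering with greedy rule growing, assuming equality tests on discrete attributes and a plain accuracy score for candidate tests; the data and helper names are made up.

```python
def covers(rule, row):
    """A rule is a dict of attribute -> required value; an empty rule covers everything."""
    return all(row.get(a) == v for a, v in rule.items())

def grow_rule(rows, labels, target):
    """Greedily add the test that most improves the rule's accuracy on `target`."""
    rule = {}                                   # start with an empty rule
    while True:
        covered = [(r, l) for r, l in zip(rows, labels) if covers(rule, r)]
        if all(l == target for _, l in covered):
            return rule                         # rule is pure enough; stop growing
        best, best_acc = None, 0.0
        for r, _ in covered:                    # candidate tests from covered tuples
            for attr, val in r.items():
                if attr in rule:
                    continue
                cand = dict(rule, **{attr: val})
                cov = [ll for rr, ll in covered if covers(cand, rr)]
                acc = cov.count(target) / len(cov) if cov else 0.0
                if acc > best_acc:
                    best, best_acc = cand, acc
        if best is None:
            return rule                         # no test improves the rule
        rule = best

def sequential_covering(rows, labels, target):
    """Learn rules one at a time, removing the tuples each new rule covers."""
    remaining = list(zip(rows, labels))
    rules = []
    while any(l == target for _, l in remaining):
        rule = grow_rule([r for r, _ in remaining], [l for _, l in remaining], target)
        if not rule:
            break                               # could not find a useful rule
        rules.append(rule)
        remaining = [(r, l) for r, l in remaining if not covers(rule, r)]
    return rules

rows = [{"age": "youth", "student": "yes"}, {"age": "youth", "student": "no"},
        {"age": "senior", "student": "no"}, {"age": "senior", "student": "yes"}]
labels = ["yes", "no", "no", "yes"]
print(sequential_covering(rows, labels, target="yes"))
```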
Key Steps in Rule Learning:

1. Start with an empty rule: No conditions initially.

2. Add conditions: Keep adding attributes to the rule to improve its quality.

3. Repeat: Keep adding conditions until the rule is good enough.

4. Remove covered data points: Once a rule is learned, remove the data points it correctly classifies and repeat the process.

Rule Quality Measures:

1. Rule Quality Evaluation:
  o When creating a rule (e.g., "IF condition THEN class = c"), it's important to evaluate whether adding a new attribute test improves the rule.
  o Accuracy might seem like the obvious measure, but it's not always the best indicator of a rule's quality.

2. Accuracy Example (Example 8.8):
  o Two rules (R1 and R2) both aim to classify loans as "accept."
  o R1 correctly classifies 38 out of 40 examples (95% accuracy), while R2 classifies only 2 examples, but it does so perfectly (100% accuracy).
  o Even though R2 has a higher accuracy, R1 is a better rule because it covers more examples.

3. Entropy Measure:
  o Entropy measures how mixed or uncertain the class distribution is in the tuples a rule covers.
  o Lower entropy means the rule is classifying most tuples to one class (a better rule).
  o It's used to prefer conditions that focus on one class with fewer examples from other classes.

4. Information Gain (FOIL):
  o Information Gain is used in a method called FOIL (First-Order Inductive Learner), which learns first-order logic rules.
  o It considers how many positive and negative tuples the rule covers, and it prefers rules that have high accuracy and cover many positive tuples.
  o FOIL Gain is calculated as the difference in the entropy before and after adding a condition to the rule.

5. Likelihood Ratio Test:
  o This test checks if the rule's performance is due to chance or if it actually reflects a meaningful relationship.
  o It compares the observed distribution of classes in the tuples covered by the rule with the expected distribution if the rule made random predictions.
  o A higher likelihood ratio means the rule is likely making valid predictions and not just guessing.
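A sketch of the FOIL-style measures, following the usual textbook formulas: FOIL_Gain compares the concentration of positive tuples before and after a new test is added, and FOIL_Prune (used in the pruning step discussed next) is (pos − neg) / (pos + neg). The counts below are made up.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg))),
    where (pos, neg) are the counts covered by the rule before the new test
    and (pos', neg') the counts covered after adding it."""
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune = (pos - neg) / (pos + neg); a condition is dropped if
    removing it increases this value on the pruning set."""
    return (pos - neg) / (pos + neg)

# Made-up counts: the rule covers 38 positive and 2 negative tuples;
# adding one more test narrows coverage to 20 positives and 0 negatives.
print(round(foil_gain(pos=38, neg=2, pos_new=20, neg_new=0), 3))
print(foil_prune(pos=38, neg=2))   # 0.9
```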
Rule Pruning:

1. Pruning to Avoid Overfitting:
  o Overfitting happens when a rule performs well on training data but poorly on new data.
  o Pruning is the process of removing conditions (tests) from a rule to improve its performance on unseen data.
  o Pruning is done by evaluating the rule on a separate pruning set (not the original training data).

2. FOIL Pruning:
  o FOIL_Prune calculates a value based on the number of positive and negative examples the rule covers.
  o The formula is (pos − neg) / (pos + neg), where "pos" is the number of positive examples and "neg" is the number of negative examples covered by the rule.
  o If pruning the rule increases this value, the rule is pruned.

3. RIPPER Pruning:
  o RIPPER prunes the most recently added condition (test) first.
  o It keeps pruning until no improvement is made.

Conclusion:

• Rule Quality Measures include entropy, information gain, and the likelihood ratio, which assess how well the rule classifies data.

• Pruning is done to improve rule performance and reduce overfitting by simplifying the rule to what works best on new data.

What Is Cluster Analysis?

1. Definition:
  o Cluster analysis (or clustering) is a technique for grouping data into clusters based on similarities.
  o Data in the same cluster are similar to each other, while data in different clusters are dissimilar.

2. How It Works:
  o The clustering process is automated through an algorithm, which partitions the data.
  o Different algorithms may generate different groupings from the same data.

3. Applications:
  o Business Intelligence:
    ▪ Clustering helps in organizing customers into groups based on similar characteristics, which helps in creating targeted business strategies.
    ▪ For example, grouping projects in a consultancy to improve management and outcomes.
  o Image Recognition:
    ▪ In handwriting recognition, clustering can group variations of the same digit written differently by people.
  o Web Search:
    ▪ Clustering is used to organize large search results into manageable groups.
    ▪ It can also be used to group documents into topics, improving information retrieval.
  o Data Mining:
    ▪ Clustering can help understand the distribution of data and can also be a preprocessing step for other techniques, like classification.
    ▪ It can also be used for feature selection.

4. Advantages:
  o Automatic Classification: Clustering can automatically identify groupings of data, unlike manual classification.
  o Data Segmentation: It can break large datasets into smaller, similar groups, making it easier to analyze.
  o Outlier Detection: Clustering can help detect unusual or "outlier" data points (e.g., fraud detection in credit card transactions).

5. Research and Development:
  o Cluster analysis is actively researched, especially for handling large datasets and improving techniques for complex data like images, graphs, and text.
  o It's a growing area of study in fields like statistics, machine learning, and data mining.

6. Clustering vs. Classification:
  o Clustering is unsupervised learning, meaning there's no prior knowledge of class labels (the grouping is discovered by the algorithm itself).
  o Classification is supervised learning, where class labels are given, and the algorithm learns from these labeled examples.

7. Tools and Methods:
  o There are many tools and methods for clustering, including k-means and k-medoids.
  o These techniques are often built into statistical software like SPSS, S-Plus, and SAS.

8. Current Challenges:
  o Research focuses on making clustering methods more scalable, effective at handling complex shapes (like non-convex shapes), and capable of working with high-dimensional and mixed data types.

Summary:

• Clustering is a useful technique for grouping similar data together, which can be applied in areas like business, image recognition, and web search.

• It is an unsupervised learning technique, where the algorithm discovers patterns in the data without needing labeled examples.

Partitioning Methods in Clustering:

1. Partitioning: This is a type of cluster analysis where objects are grouped into exclusive clusters (groups). The number of clusters (k) is provided beforehand.

2. Goal: The aim is to make objects within the same cluster similar to each other and different from objects in other clusters.

3. Centroid-Based Partitioning: This method uses the "center" (centroid) of a cluster to represent it. The centroid is the average of the objects in the cluster.
k-Means Algorithm (Centroid-Based):

1. Objective: Minimize the difference between each object in a cluster and the cluster's centroid (center).

2. Steps (see the sketch at the end of this section):
  o Choose k objects as initial centroids.
  o Assign each object to the closest centroid based on distance.
  o Recalculate the centroids based on the assigned objects.
  o Repeat until the assignments do not change.

3. Limitation: It can stop at a local optimum (not the best solution) and might depend on the initial centroids.

4. Efficiency: It's computationally efficient and scales well for large data sets.

5. Issue with Outliers: Sensitive to outliers because extreme values can heavily influence the mean.

k-Medoids Algorithm (Representative-Based):

1. Objective: Instead of using centroids, this method uses real objects as the center of each cluster (called medoids).

2. Steps:
  o Choose k objects as initial medoids (representative objects).
  o Assign each object to the nearest medoid.
  o Iteratively swap a medoid with another object to minimize total dissimilarity.
  o Repeat until no improvement can be made.

3. Benefit: It's more robust to outliers because medoids are not affected by extreme values.

4. Limitation: More computationally expensive than k-means, especially for large data sets.

Scaling Up k-Medoids:

1. CLARA (Clustering Large Applications): A variation that uses random samples of the data instead of the full dataset. It runs the k-medoids algorithm on these samples to improve scalability.

2. CLARANS (Clustering Large Applications based on Randomized Search): This method randomly picks objects and iterates through swaps between medoids to find a better solution, improving efficiency while still maintaining quality.

Conclusion:

• k-Means: Good for large datasets but sensitive to outliers.

• k-Medoids: More robust to outliers but computationally more expensive.

• CLARA and CLARANS: Methods to improve the scalability of k-medoids for large datasets.
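A minimal sketch of the k-means loop summarized above, using Euclidean distance on plain Python tuples; the sample points and the choice of k = 2 are illustrative only.

```python
import random
from math import dist

def k_means(points, k, max_iter=100, seed=0):
    """Basic k-means: assign points to the nearest centroid, then recompute
    each centroid as the mean of its assigned points, until nothing changes."""
    random.seed(seed)
    centroids = random.sample(points, k)          # choose k objects as initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:            # assignments stable: stop
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0), (1.0, 0.5), (8.5, 9.5)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```

A k-medoids variant would instead keep an actual data point as each cluster representative and accept a swap only when it lowers the total dissimilarity, which is what makes it more robust to outliers but more expensive.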
What is Hierarchical Clustering?

• Hierarchical clustering organizes data into a "tree" or hierarchy of clusters.

• It allows you to break data down into groups at different levels, showing relationships between them.

Real-Life Example:

• Example 1: In a company like AllElectronics, employees can be grouped into major categories (executives, managers, staff). Staff can be further divided into smaller subgroups (e.g., senior officers, officers, trainees).

• Example 2: For handwritten character recognition, characters can be grouped first, and then each group can be further split into subgroups if the characters are written in different styles.

Hierarchical Clustering for Data Summarization:

• This hierarchy helps to summarize and visualize data. For instance, you can easily calculate the average salary for managers or officers in a company.

When Hierarchical Clustering Discovers Structure:

• In some cases, the data might naturally have a hierarchical structure (e.g., grouping animals by biological features for studying evolution).

• This can help uncover patterns or structures that already exist in the data, such as finding evolutionary paths for species.

Two Types of Hierarchical Clustering Methods:

• Agglomerative Clustering:
  o Starts with each data point as its own cluster.
  o These clusters are merged over time into larger clusters.

• Divisive Clustering:
  o Starts with all data points in one cluster.
  o The cluster is gradually split into smaller groups.

Challenges in Hierarchical Clustering:

• Deciding when to merge or split clusters can be difficult and crucial to the quality of the results.

• Once merged or split, decisions can't be undone, and this may result in poor-quality clusters.

• Hierarchical methods don't scale well to large datasets, as each decision involves evaluating many data points.

Improving Hierarchical Clustering:

• BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
  o Uses tree structures to hierarchically group data, and then applies other algorithms for further clustering of the smaller "microclusters".

• Chameleon:
  o Uses dynamic modeling to improve the hierarchical clustering process.

Types of Hierarchical Clustering Methods:

• Algorithmic Methods:
  o These methods treat data as fixed and use deterministic distances between data points to form clusters.
• Probabilistic Methods:
  o These use probabilistic models to define clusters and assess cluster quality based on how well the model fits the data.

• Bayesian Methods:
  o These methods return multiple possible clusterings along with their probabilities, instead of providing a single deterministic result.

Conclusion:

• Hierarchical clustering is useful for organizing and summarizing data into hierarchical groups.

• However, it can be challenging to make correct decisions about merging or splitting clusters, especially with large datasets.

• New methods like BIRCH and Chameleon help improve the quality of clustering by using additional techniques.

1. Two Types of Hierarchical Clustering:

• Agglomerative Clustering (Bottom-Up):
  o Starts by giving each data point its own cluster.
  o The method then merges the closest clusters step by step until everything is in one big cluster.
  o It requires a maximum of "n" steps, where "n" is the number of objects.

• Divisive Clustering (Top-Down):
  o Starts by grouping all data points into one big cluster.
  o The method then splits this big cluster into smaller subclusters and keeps splitting them until the clusters are small enough (either one data point per cluster or the clusters are very similar).

2. When to Stop Clustering:

• In both agglomerative and divisive methods, the process can stop when the desired number of clusters is reached.

3. Example of Both Methods:

• Agglomerative Example (AGNES):
  o Starts with each object as its own cluster.
  o Merges clusters based on the closest objects (using a similarity measure like Euclidean distance).
  o This merging process continues until there's just one big cluster.

• Divisive Example (DIANA):
  o Starts with all objects in one cluster.
  o Splits the cluster into smaller ones based on a criterion (like the largest distance between objects in the cluster).
  o The splitting continues until each cluster contains just one object.

4. Dendrograms:
• Both methods can be visualized using a dendrogram, which is a tree diagram showing how objects are grouped (agglomerative) or split (divisive) step by step.

• The dendrogram shows the level of similarity at each step. For example, objects that are very similar will be merged or split at a lower level (closer to the base of the tree).

5. Challenges with Divisive Methods:

• Partitioning Complexity:
  o Divisive methods need to figure out how to split large clusters into smaller ones.
  o There are many ways to do this (2^(n-1) − 1 possible ways to split "n" objects into two subsets), and checking all options is too time-consuming for large datasets.

• Heuristics for Efficiency:
  o To save time, divisive methods use heuristics (rules of thumb) to make splitting decisions quickly, but these might not always give the best results.
  o Once a decision is made to split a cluster, the method doesn't reconsider it, even if it could have made a better decision later.

6. Why Agglomerative Methods Are More Common:

• Due to the challenges in divisive methods, agglomerative methods are more popular because they are easier to compute and require fewer complex decisions.

7. Summary:

• Agglomerative: Builds the hierarchy from the bottom (merging clusters).

• Divisive: Builds the hierarchy from the top (splitting clusters).

• Dendrogram: Used to show how clusters are grouped or split at each step.

• Divisive methods face computational difficulties with large datasets, making them less common than agglomerative methods.

1. Need for Distance Measures:

• When clustering data, you need to measure the distance between clusters (groups of objects).

• There are several ways to define the distance between two clusters, called "linkage measures."

2. Four Main Distance Measures:

• Minimum Distance (dist_min):
  o Measures the smallest distance between any two objects, one from each cluster.
  o Example: If you have two clusters, this measure picks the closest pair of objects from each cluster.

• Maximum Distance (dist_max):
  o Measures the largest distance between any two objects, one from each cluster.
  o Example: This picks the farthest pair of objects between the two clusters.
• Mean Distance (dist_mean):
  o Measures the distance between the averages (means) of each cluster.
  o Example: Calculates the average of all objects in each cluster and measures the distance between those two average points.

• Average Distance (dist_avg):
  o Calculates the average distance between every pair of objects, where one object is from each of the two clusters.
  o Example: It averages all the distances between all object pairs between two clusters.

3. Types of Algorithms Based on Distance Measures:

• Nearest-Neighbor (Single-Linkage) Algorithm:
  o Uses minimum distance to merge clusters.
  o Merges the closest clusters, leading to a tree structure where clusters are linked by the closest objects.
  o Can be visualized as a minimal spanning tree.

• Farthest-Neighbor (Complete-Linkage) Algorithm:
  o Uses maximum distance to merge clusters.
  o Merges the clusters by considering the farthest objects from each cluster.
  o Tends to create compact clusters and can help minimize the increase in cluster size.

4. Challenges with Minimum and Maximum Measures:

• Sensitivity to Outliers:
  o Both minimum and maximum distances are sensitive to outliers (or noisy data), which can skew the results.

5. Compromise Measures:

• Mean and Average Distance measures are a middle ground between minimum and maximum distances.

• Mean distance is easier to compute but works best for numeric data.

• Average distance is useful for both numeric and categorical data, although computing the "mean" for categorical data can be difficult or impossible.

6. Example of Single vs. Complete Linkage (see the sketch below):

• Single Linkage:
  o Finds clusters based on local proximity—it merges the closest objects, leading to clusters that are close to each other.

• Complete Linkage:
  o Finds clusters based on global closeness—it merges clusters where the farthest objects are still relatively close.
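The four linkage measures can be sketched as follows for two small clusters of numeric points (Euclidean distance; the clusters are made up):

```python
from math import dist
from itertools import product

def dist_min(c1, c2):
    """Minimum (single-linkage) distance: closest pair, one object from each cluster."""
    return min(dist(p, q) for p, q in product(c1, c2))

def dist_max(c1, c2):
    """Maximum (complete-linkage) distance: farthest pair between the clusters."""
    return max(dist(p, q) for p, q in product(c1, c2))

def dist_mean(c1, c2):
    """Mean distance: distance between the two cluster means (centroids)."""
    mean = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return dist(mean(c1), mean(c2))

def dist_avg(c1, c2):
    """Average distance: average over all pairs of objects, one from each cluster."""
    pairs = list(product(c1, c2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
for f in (dist_min, dist_max, dist_mean, dist_avg):
    print(f.__name__, f(c1, c2))
```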

7. Centroid-Based Distance Measure:

• A variation of the distance measures is to calculate the distance between the centroids (central objects) of two clusters.

• This approach measures the central point of each cluster and calculates the distance between these two central points.

Summary:

• Four main distance measures: minimum, maximum, mean, and average.

• Agglomerative methods use these measures to merge clusters.

• Single-linkage uses minimum distance and tends to find clusters based on local proximity, while complete-linkage uses maximum distance and focuses on global proximity.

• Mean and average distance help reduce sensitivity to outliers.

1. BIRCH Overview:

• BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a clustering algorithm designed to handle large data sets efficiently.

• It combines hierarchical clustering (in the beginning stage) with iterative partitioning (in the later stage) to improve scalability and overcome two main issues in traditional agglomerative clustering: scalability and the inability to undo previous steps.

2. Key Concepts in BIRCH:

• Clustering Feature (CF):
  o A summary of a cluster's properties, represented as a 3-D vector: n (number of objects), LS (linear sum of the data points), and SS (square sum of the data points).
  o This feature helps describe the cluster without needing to store all individual data points, saving space.

• Clustering Feature Tree (CF-tree):
  o A hierarchical tree that stores the CFs of clusters.
  o Non-leaf nodes summarize the information of their child clusters (subclusters), while leaf nodes store individual clusters.
  o It is designed to be memory efficient and scalable.
  o The tree is height-balanced and controlled by two parameters:
    ▪ Branching factor (B): Maximum number of children per nonleaf node.
    ▪ Threshold (T): Maximum allowable size (diameter) of subclusters in the leaf nodes.

3. BIRCH's Phases:

• Phase 1: Initial CF-tree Construction
  o BIRCH scans the data once to build a basic CF-tree.
  o During this phase, objects are inserted into the tree incrementally. Each insertion might lead to splitting of subclusters if the threshold is exceeded.
  o This phase aims to compress the data while preserving its inherent clustering structure.
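A sketch of the clustering feature idea, assuming small numeric vectors: it shows that CFs are additive and that a centroid and an (approximate) radius can be computed from the summary alone, without storing the raw points. The class and method names are illustrative, not BIRCH's actual implementation.

```python
from math import sqrt

class CF:
    """Clustering feature: (n, LS, SS), where n is the number of points,
    LS the per-dimension linear sum, and SS the sum of squared norms."""
    def __init__(self, point=None):
        self.n, self.ls, self.ss = 0, None, 0.0
        if point is not None:
            self.n = 1
            self.ls = list(point)
            self.ss = sum(x * x for x in point)

    def merge(self, other):
        """CFs are additive: merging two subclusters just adds the summaries."""
        out = CF()
        out.n = self.n + other.n
        out.ls = [a + b for a, b in zip(self.ls, other.ls)]
        out.ss = self.ss + other.ss
        return out

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Average distance from the centroid, computed from the summary only."""
        c = self.centroid()
        return sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

# Made-up 2-D points; a leaf entry absorbs them one by one.
cf = CF((1.0, 2.0))
for p in [(2.0, 2.0), (1.0, 3.0)]:
    cf = cf.merge(CF(p))
print(cf.n, cf.centroid(), round(cf.radius(), 3))
```

In BIRCH, a new object is absorbed into a leaf entry only if the resulting subcluster still satisfies the threshold T; otherwise a new entry (and possibly a node split) is created.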
• Phase 2: Refining Clusters
  o In this phase, BIRCH applies a standard clustering algorithm to the leaf nodes of the CF-tree to refine clusters.
  o This step helps to remove outliers (sparse clusters) and merge dense clusters.

4. Advantages of BIRCH:

• Scalability: The algorithm can handle large datasets with linear time complexity, O(n), where n is the number of data points.

• Space Efficiency: The use of CFs allows BIRCH to store much less data than traditional methods while still summarizing cluster information accurately.

• Incremental: The algorithm can process data dynamically and incrementally, which is useful for streaming or real-time data.

5. Handling Outliers and Tree Rebuilding:

• If the CF-tree becomes too large to fit into memory, a larger threshold can be used, and the tree can be rebuilt from the leaf nodes without re-reading all the data.

• This makes the process efficient and avoids the need to start from scratch every time.

6. Limitations of BIRCH:

• Cluster Shape: BIRCH works best when clusters are spherical (ball-shaped). It might not perform well with non-spherical clusters, as it uses radius or diameter to define cluster boundaries.

• Node Limitations: Each node in the CF-tree can only hold a limited number of entries, meaning a node may not always correspond to what a user sees as a "natural" cluster.

7. BIRCH's Broader Impact:

• The ideas of clustering features and CF-trees have been widely used in other methods for streaming data and dynamic data clustering.

Summary:

• BIRCH is an efficient algorithm for large-scale clustering that combines hierarchical clustering and partitioning.

• It works by summarizing clusters with clustering features (CFs), organizing them in a CF-tree, and refining the clusters in two phases.

• BIRCH is scalable, efficient in memory, and works well for large datasets, but it has limitations with non-spherical clusters.

Chameleon: Hierarchical Clustering with Dynamic Modeling

1. Cluster Similarity: Chameleon uses two key factors to assess if clusters should merge:
  o Interconnectivity: How well-connected the objects within a cluster are.
  o Proximity: How close the clusters are to each other.
2. Adaptability: Chameleon doesn't require a fixed model. It adapts based on the clusters' characteristics to merge them in a way that reflects the data's true structure.

3. K-Nearest-Neighbor Graph:
  o A graph is created where each data point is a node.
  o An edge (connection) exists between two nodes if one point is among the k most similar points to the other.
  o The graph's edges are weighted based on similarity.

4. Graph Partitioning: Chameleon uses graph partitioning to divide the graph into small subclusters. This reduces the edge cuts (or disconnections) between clusters.

5. Agglomerative Clustering:
  o Chameleon merges subclusters iteratively based on how similar they are in terms of interconnectivity and proximity.
  o It uses two key measures for similarity:
    ▪ Relative Interconnectivity (RI): The interconnectivity between two clusters normalized by their internal interconnectivity.
    ▪ Relative Closeness (RC): The closeness between two clusters normalized by their internal closeness.

6. Benefits: Chameleon is effective at identifying clusters with arbitrary shapes and is better than some other algorithms like BIRCH and DBSCAN.

7. Drawbacks: For high-dimensional data, Chameleon may be slower (O(n²) time complexity).

Probabilistic Hierarchical Clustering

1. Challenges with Traditional Clustering:
  o It's tough to choose a good distance measure.
  o Cannot handle missing attribute values.
  o The merging/splitting decisions may not always optimize the final cluster hierarchy.

2. Probabilistic Clustering:
  o It uses probabilistic models to measure distances between clusters, overcoming some limitations of traditional methods.
  o It assumes the data objects are samples from an underlying data generation model (e.g., a Gaussian distribution).

3. Generative Models:
  o Clustering is seen as estimating a probability distribution that generated the data.
  o For example, a set of data points may be modeled as samples from a Gaussian distribution.
  o The task is to learn the parameters (mean and variance) that best fit the data, as in the small sketch below.
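As a tiny illustration of "learning the parameters that best fit the data," the maximum-likelihood estimates for a one-dimensional Gaussian are simply the sample mean and variance; the numbers below are made up.

```python
data = [2.1, 1.9, 2.4, 2.0, 2.6]   # made-up 1-D sample assumed to be Gaussian

mean = sum(data) / len(data)                                 # ML estimate of the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)    # ML estimate of the variance
print(round(mean, 3), round(variance, 3))
```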
4. Cluster Quality:
  o The quality of clusters is measured by how well the generative model fits the data.
  o If two clusters are merged, the quality change is calculated based on how well the merged cluster fits the generative model compared to the individual clusters.

5. Merging Clusters:
  o Clusters are merged if it improves the fit of the model (in terms of likelihood).
  o The merging process stops when no improvement is made.

6. Benefits:
  o Can handle partially observed (incomplete) data, such as missing attribute values.
  o More interpretable due to the use of probabilistic models.

7. Drawbacks:
  o Only generates one possible hierarchy based on the chosen probabilistic model.
  o Doesn't handle uncertainty in cluster hierarchies as well as Bayesian methods might.

Unit 5

Data Mining Applications Overview:

• Data Mining is a growing field used to find patterns in large datasets across various industries.

• Despite its potential, there's still a gap between general data mining principles and specialized tools for specific applications.

• This section discusses several application domains for data mining, including financial data, retail, telecommunications, and more.

1. Data Mining for Financial Data Analysis:

• Financial Institutions: Banks and financial institutions use data mining for services like loans, investments, insurance, and credit.

• Advantages of Financial Data: The data is often reliable and complete, making it easier to analyze.

• Examples of Financial Data Mining:
  1. Data Warehouses: Used for analyzing financial data over time (e.g., monthly debt and revenue).
  2. Loan Payment Prediction: Analyzing customer data (like income, debt ratio, credit history) to predict loan payment behavior and adjust loan policies.
  3. Customer Classification: Grouping customers based on behavior (like loan payments) for targeted marketing.
  4. Detecting Financial Crimes: Using data analysis tools to find unusual patterns that may indicate fraud or money laundering.

2. Data Mining for Retail and Telecommunications:

• Retail Industry: Retail businesses collect large amounts of data on sales, customers, and products. Data mining helps in various ways:
  1. Designing Data Warehouses: Creating systems to organize and analyze retail data.
  2. Multidimensional Analysis: Analyzing sales, customer behavior, and trends using advanced data cubes (e.g., comparing sales in different regions and times).
  3. Sales Campaign Effectiveness: Analyzing sales before and after campaigns to measure impact and identify popular items bought together.
  4. Customer Retention: Using loyalty card data to track customer purchases and suggest product recommendations based on their previous behavior.
  5. Fraud Detection: Identifying fraudulent activities by analyzing unusual patterns in customer behavior.

• Telecommunications: Like retail, the telecommunications industry uses data mining for customer patterns, fraud detection, and improving services.

3. Data Mining in Science and Engineering:

• Scientific Data Mining: In science, massive amounts of high-dimensional, temporal, and spatial data are generated. Data mining helps analyze this data, moving away from the traditional "hypothesize and test" approach.

• Challenges: Analyzing data from fields like meteorology, biology, and climate modeling requires specialized data preprocessing and integration of diverse data sources.

• Mining Complex Data: Scientific data often involves complex data types (e.g., biological data, chemical structures) that need advanced mining techniques.

• Graph-based Mining: Using graphs to model relationships between objects (e.g., chemical structures or biological processes) is crucial for mining complex scientific data.

• Real-time Mining: In engineering, mining may need to happen in real time, especially in systems requiring immediate analysis or response.

4. Data Mining in Social Science and Computer Science:

• Social Science: Analyzing communication data (e.g., social media, blogs, news) helps understand public opinions and predict trends.

• Computer Science: Data mining is used to analyze system performance, detect software bugs, identify network intrusions, and improve computer systems. It also helps analyze massive datasets generated by computer networks or sensor networks.

Summary:

• Data mining plays an essential role in various industries like finance, retail, telecommunications, and science.

• In each domain, specific tools and techniques are developed to handle industry-specific data, find patterns, predict outcomes, and improve decision-making.

Ubiquitous and Invisible Data Mining:
• Data mining is everywhere in our daily lives, even if we don't notice it. It affects the way we shop, work, search for information, and even our health and leisure.

• Invisible data mining happens through smart software like search engines, online stores, and spam filters, which use data mining without us realizing it.

• Shopping Influence: Stores like Wal-Mart and Amazon use data mining to track customer behavior, predict what to stock, and make personalized recommendations for what to buy.

• Online Shopping: Recommender systems (like on Amazon) suggest products based on previous customer behavior.

• Fraud Prevention: Credit card companies use data mining to spot unusual spending and prevent fraud.

• Customized Services: Companies use data mining for Customer Relationship Management (CRM) to offer personalized services instead of mass marketing.

• Email and Search: Email spam filters and search engines like Google use data mining algorithms to organize results and show relevant ads.

• Personalized Ads: Ads are tailored based on your interests and browsing habits, which makes them more relevant and less annoying.

13.4.2 Privacy, Security, and Social Impacts of Data Mining:

• Privacy Concerns: As more personal data is collected online, there are concerns about how data mining could invade our privacy or compromise data security.

• Not All Data Mining Involves Personal Data: Many data mining applications focus on general data (e.g., predicting floods or studying geology) and don't involve personal information.

• Personal Data Risks: Data mining can pose a risk when it involves sensitive personal data like credit card details, health records, or criminal records.

• Protecting Privacy:
  o Data Security Techniques: Measures like encryption, intrusion detection, and multilevel security can help protect sensitive data.
  o Privacy-Preserving Data Mining: New methods are being developed to mine data while keeping personal information private.
  o Methods for Privacy Preservation:
    ▪ Randomization: Adding noise to data to protect sensitive information.
    ▪ K-anonymity: Ensuring data records can't be uniquely identified.
    ▪ L-diversity: Ensuring groups of data have diversity to prevent revealing sensitive values.
    ▪ Distributed Privacy: Splitting data across multiple locations to limit access to sensitive information.
    ▪ Downgrading Data Mining Effectiveness: Modifying the results of data mining to prevent privacy violations.
• Differential Privacy: A new method that ensures small changes in data won't affect the overall results of data mining, offering stronger privacy protection.

• Misuse of Data Mining: While data mining can be misused, its benefits include helping businesses understand customer needs and improving areas like health and science.

• Collaborative Efforts for Privacy: Experts from different fields will continue working together to ensure data mining is used responsibly and securely, so we can keep benefiting from it.

Trends in Data Mining:

Expanding Applications:

• Data mining is increasingly used in various fields like business, finance, government, healthcare, science, and even counterterrorism.

• New application areas include mobile data mining and analyzing web and text data.

Scalable and Interactive Methods:

• Data mining needs to handle large amounts of data efficiently and interactively.

• Scalable algorithms are needed to process big data, and "constraint-based mining" allows users to guide the mining process with specific goals.

Integration with Other Systems:

• Data mining needs to be integrated into search engines, databases, and cloud computing systems for seamless performance and high-quality data analysis.

• This integration helps improve data mining's scalability and effectiveness.

Social and Information Network Mining:

• Mining data from social media, networks, and connections is important to uncover patterns and knowledge.

• This requires scalable methods to handle complex and large network data.

Mining Spatiotemporal and Cyber-Physical Systems:

• Data from moving objects like mobile phones, GPS systems, and sensors is growing rapidly.

• Data mining for real-time analysis of this data is a big challenge.

Mining Multimedia, Text, and Web Data:

• There is a focus on extracting useful information from multimedia (audio, images, video) and web data.

• Despite progress, challenges in mining this type of data remain.

Mining Biological and Biomedical Data:

• Biological data, such as DNA sequences or medical records, needs special attention.

• This includes analyzing DNA, proteins, microarray data, and biomedical literature.
Software and System Engineering:

• Data mining techniques can be used to improve software and system engineering by finding patterns in software errors, making programs more robust.

Visual and Audio Data Mining:

• Data mining is being used to analyze visual and audio data, helping humans better understand large data sets.

Distributed and Real-Time Data Mining:

• Many current data mining methods don't work well in distributed systems like the internet or cloud computing.

• Real-time data mining, such as for stock analysis or mobile data, is growing, requiring new techniques.

Privacy Protection and Security:

• With more personal data being collected, there are concerns about privacy and security.

• Research is focused on developing privacy-preserving data mining methods to keep personal information safe while still gaining insights.

Future of Data Mining:

• Data mining technology will continue to evolve, and collaboration among different experts will ensure privacy protection and improve the benefits of data mining.
