Unit 2: Scs5623 - Data Mining and Warehousing
UNIT 2
How it is done
• Collect the task-relevant data (initial relation) using a relational database query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts
• Interaction with users for knowledge presentation
1. Data collection
target class: graduate student
contrasting class: undergraduate student
2. Analytical generalization using Ui
attribute removal
remove name and phone#
attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country, age_range and gpa
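The steps above (attribute removal, attribute generalization, and merging identical generalized tuples while accumulating counts) can be sketched in Python. The relation, concept hierarchies, and generalization functions below are illustrative assumptions, not data from the course.

```python
from collections import Counter

# Toy task-relevant tuples: (gender, major, birth_place, gpa).
# name and phone# are assumed to have been removed already.
initial_relation = [
    ("M", "CS", "Vancouver, Canada", 3.7),
    ("F", "Biology", "Toronto, Canada", 3.9),
    ("M", "Physics", "Seattle, USA", 3.6),
    ("F", "CS", "Ottawa, Canada", 3.8),
]

# Hypothetical generalization functions, each climbing one level
# of an assumed concept hierarchy.
def generalize_major(m):        # specific major -> discipline
    return "Science" if m in ("CS", "Physics", "Biology") else "Other"

def generalize_birth_place(p):  # "city, country" -> country
    return p.split(", ")[-1]

def generalize_gpa(g):          # numeric gpa -> range
    return "3.6-4.0" if g >= 3.6 else "below 3.6"

# Generalize every tuple, then merge identical generalized tuples
# and accumulate their counts.
counts = Counter(
    (gender, generalize_major(major),
     generalize_birth_place(bp), generalize_gpa(gpa))
    for gender, major, bp, gpa in initial_relation
)
for tup, count in counts.items():
    print(tup, "count =", count)
```

The two female CS/Biology students from Canada collapse into a single generalized tuple with count 2, which is exactly the aggregation step of attribute-oriented induction.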
Mining Class Comparison
• Comparison: Comparing two or more classes
• Method:
o Partition the set of relevant data into the target class and the contrasting class(es)
o Generalize both classes to the same high level concepts
o Compare tuples with the same high level descriptions
o Present for every tuple its description and two measures
support - distribution within single class
comparison - distribution between classes
o Highlight the tuples with strong discriminant features
• Relevance Analysis:
o Find attributes (features) which best distinguish different classes
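The two measures attached to each generalized tuple can be computed directly. The counts below are hypothetical, chosen only to illustrate the arithmetic: support is the tuple's share of its own class, while comparison is the share of the tuple's total count contributed by the target class.

```python
# Hypothetical counts of one generalized description in each class:
target_count = 90        # e.g. graduate students matching the tuple
contrast_count = 10      # undergraduates matching the same tuple
target_total = 300       # total count of generalized graduate tuples

# support: distribution within the single (target) class
support_in_target = target_count / target_total

# comparison: distribution between classes for this tuple
comparison = target_count / (target_count + contrast_count)

print(f"support = {support_in_target:.2f}, comparison = {comparison:.2f}")
# A comparison value near 1 marks a strongly discriminant description.
```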
Association Rules
“An association algorithm creates rules that describe how often events have occurred together.”
Example: when a customer buys a hammer, 90% of the time they also buy nails.
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set.
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and
association rule mining
• Motivation: Finding inherent regularities in data
o What products were often purchased together?— Beer and diapers?!
o What are the subsequent purchases after buying a PC?
o What kinds of DNA are sensitive to this new drug?
o Can we automatically classify web documents?
• Applications: Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Support: “is a measure of what fraction of the population satisfies both the antecedent and the
consequent of the rule”.
• Example:
o People who buy hotdog buns also buy hotdog sausages in 99% of cases. = High
Support
o People who buy hotdog buns buy hangers in 0.005% of cases. = Low support
• Situations where there is high support for the antecedent are worth careful attention
o E.g. hotdog sausages should be placed near hotdog buns in supermarkets if
there is also high confidence.
Confidence: “is a measure of how often the consequent is true when the antecedent is true.”
• Example:
o 90% of Hotdog bun purchases are accompanied by hotdog sausages.
o High confidence is meaningful as we can derive rules.
• Rule notation: Hotdog sausage → Hotdog bun
• Two rules may have different confidence levels and yet the same support.
• E.g. the rule Hotdog bun → Hotdog sausage may have a much lower confidence than
Hotdog sausage → Hotdog bun, yet both have the same support, because the support
of the itemset {Hotdog bun, Hotdog sausage} does not depend on the rule's direction.
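Both measures can be counted directly from a transaction database. The toy baskets below are illustrative; note how the two rule directions share one support value but differ in confidence.

```python
# Toy basket data (item names are illustrative).
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage", "mustard"},
    {"bun", "sausage"},
    {"bun"},
    {"hanger"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """How often the consequent holds when the antecedent holds."""
    return support(antecedent | consequent) / support(antecedent)

# Both rule directions share the same support...
print(support({"bun", "sausage"}))        # 0.6
# ...but confidence depends on the direction of the rule.
print(confidence({"sausage"}, {"bun"}))   # 1.0
print(confidence({"bun"}, {"sausage"}))   # 0.75
```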
Apriori Algorithm
It is a frequent pattern mining algorithm that finds the frequent itemsets by generating
candidates.
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
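The self-join and pruning steps above can be sketched as follows. Itemsets are represented as sorted tuples; the function name `apriori_gen` follows the literature's convention but the implementation details are a sketch.

```python
from itertools import combinations

# Frequent 3-itemsets from the example above.
L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]

def apriori_gen(Lk):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets."""
    k = len(Lk[0])
    Lk_set = set(Lk)
    candidates = []
    # Self-join: merge itemsets that agree on their first k-1 items.
    for p, q in combinations(sorted(Lk), 2):
        if p[:k - 1] == q[:k - 1]:
            c = p + (q[-1],)
            # Prune: every k-subset of the candidate must itself be frequent.
            if all(s in Lk_set for s in combinations(c, k)):
                candidates.append(c)
    return candidates

print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')]  i.e. C4 = {abcd}
```

The join produces abcd and acde; acde is then pruned because its subset ade is not in L3, leaving C4 = {abcd} as in the example.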
Frequent Pattern Growth Tree Algorithm
(Mining Frequent Patterns Without Candidate Generation)
It grows long patterns from short ones using locally frequent items.
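The pattern-growth idea can be sketched without building a full FP-tree, using projected (conditional) transaction lists instead; the recursion is the same: extend each short pattern with the items that are frequent in its conditional database. This is a simplified sketch, not the tree-based algorithm itself.

```python
from collections import Counter

def pattern_growth(transactions, min_support, suffix=()):
    """Recursively grow frequent patterns from projected databases."""
    counts = Counter(item for t in transactions for item in t)
    patterns = {}
    for item, count in counts.items():
        if count >= min_support:
            pattern = (item,) + suffix
            patterns[pattern] = count
            # Conditional database: transactions containing the item,
            # restricted to locally frequent items ordered before it.
            projected = [
                [i for i in t if counts[i] >= min_support and i < item]
                for t in transactions if item in t
            ]
            patterns.update(pattern_growth(projected, min_support, pattern))
    return patterns

db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = pattern_growth(db, min_support=3)
print(result)
```

With min_support = 3, the sketch finds all six frequent patterns of this toy database ({a}, {b}, {c}, {a,b}, {a,c}, {b,c}) and correctly excludes {a,b,c}, which occurs only twice, without ever generating candidate sets.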