
SCS5623 - DATA MINING AND WAREHOUSING

UNIT 2

CONCEPT DESCRIPTION AND ASSOCIATION RULES

Attribute Oriented Induction


• Data focusing: collect the task-relevant data, including the relevant dimensions; the result
is the initial working relation
• Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1)
there is no generalization operator on A, or (2) A’s higher level concepts are expressed in
terms of other attributes
• Attribute-generalization: If there is a large set of distinct values for A, and there exists a
set of generalization operators on A, then select an operator and generalize A
• Attribute-threshold control: typically 2-8, user-specified or default
• Generalized relation threshold control: control the final relation/rule size

How it is done
• Collect the task-relevant data (initial relation) using a relational database query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts
• Interaction with users for knowledge presentation

Example: Describe general characteristics of graduate students in the University database

Step 1. Fetch the relevant set of data using an SQL statement, e.g.,

    Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa)
    from student
    where student_status in {“Msc”, “MBA”, “PhD”}

Step 2. Perform attribute-oriented induction

Step 3. Present results in generalized relation, cross-tab, or rule forms

Basic Algorithm for Attribute-Oriented Induction


• InitialRel: Query processing of task-relevant data, deriving the initial relation.
• PreGen: Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute: removal? or how high to generalize?
• PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a
“prime generalized relation”, accumulating the counts (a small sketch of these steps follows this list).
• Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into
rules, cross tabs, visualization presentations.
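A minimal Python sketch of the PreGen/PrimeGen steps, assuming a small in-memory student table and hand-written concept hierarchies (the hierarchy mappings, attribute names and data below are illustrative, not taken from the notes):

    # Attribute-oriented induction sketch: remove or generalize attributes, then
    # merge identical generalized tuples while accumulating their counts.
    from collections import Counter

    # Illustrative concept hierarchies: raw value -> higher-level concept
    hierarchies = {
        "birth_place": lambda city: "Canada" if city in {"Vancouver", "Toronto"} else "foreign",
        "gpa": lambda g: "excellent" if g >= 3.5 else "good",
    }

    def induce(rows, remove=("name", "phone")):
        prime = Counter()
        for row in rows:
            t = {}
            for attr, value in row.items():
                if attr in remove:               # attribute removal
                    continue
                if attr in hierarchies:          # attribute generalization
                    value = hierarchies[attr](value)
                t[attr] = value
            prime[tuple(sorted(t.items()))] += 1     # merge identical tuples, keep count
        return prime

    students = [
        {"name": "Ann", "phone": "111", "gender": "F", "birth_place": "Vancouver", "gpa": 3.7},
        {"name": "Bob", "phone": "222", "gender": "M", "birth_place": "Mumbai",    "gpa": 3.6},
        {"name": "Cam", "phone": "333", "gender": "M", "birth_place": "Toronto",   "gpa": 3.1},
    ]
    for tup, count in induce(students).items():
        print(dict(tup), "count =", count)

In a full implementation, the decision to remove or to generalize each attribute would be driven by comparing its number of distinct values with the attribute threshold, as described above.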

Class Characterization: An Example


Analytical Characterization

1. Data collection
target class: graduate student
contrasting class: undergraduate student
2. Analytical generalization using Ui
attribute removal
remove name and phone#
attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country, age_range and gpa

Mining Class Comparison
• Comparison: Comparing two or more classes
• Method:
o Partition the set of relevant data into the target class and the contrasting class(es)
o Generalize both classes to the same high level concepts
o Compare tuples with the same high level descriptions
o Present, for every tuple, its description and two measures (a small sketch of both follows this list):
support – distribution within a single class
comparison – distribution between classes
o Highlight the tuples with strong discriminant features
• Relevance Analysis:
o Find attributes (features) which best distinguish different classes
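A rough sketch of the two measures with made-up counts; it assumes the target and contrasting classes have already been generalized to the same level (the tuples and counts are illustrative):

    # For each generalized tuple, compute
    #   support    = count within the target class / total count of that class
    #   comparison = count in the target class / count of the tuple across all classes
    target   = {("Canada", "M"): 60, ("foreign", "M"): 40}     # graduate students
    contrast = {("Canada", "M"): 300, ("foreign", "M"): 100}   # undergraduate students

    total_target = sum(target.values())
    for tup, cnt in target.items():
        support = cnt / total_target
        comparison = cnt / (cnt + contrast.get(tup, 0))
        print(tup, f"support={support:.0%}", f"comparison={comparison:.0%}")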

Presentation of Generalized Results


• Generalized relation:
o Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.
• Cross tabulation:
o Mapping results into cross tabulation form (similar to contingency tables).
• Visualization techniques:
o Pie charts, bar charts, curves, cubes, and other visual forms.
• Quantitative characteristic rules:
o Mapping generalized result into characteristic rules with quantitative information
associated with it, e.g.,
• t-weight:
o An interestingness measure that describes the typicality of each disjunct in the rule, i.e.,
of each tuple in the corresponding generalized relation:

    t_weight(qa) = count(qa) / (count(q1) + ... + count(qn))

o where q1, ..., qn are the tuples of the target class in the generalized relation, n is the
number of such tuples, and qa is one of q1, ..., qn
• Example of a quantitative characteristic rule:

grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
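For instance, with made-up counts consistent with the rule above: if the generalized relation for male graduate students contains two tuples with counts 90 (birth_region = “Canada”) and 80 (birth_region = “foreign”), then t(Canada) = 90 / (90 + 80) ≈ 53% and t(foreign) = 80 / 170 ≈ 47%.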

Association Rules
“An association algorithm creates rules that describe how often events have occurred together.”

Example: When a customer buys a hammer, 90% of the time they will also buy nails.
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set.
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and
association rule mining
• Motivation: Finding inherent regularities in data
o What products were often purchased together?— Beer and diapers?!
o What are the subsequent purchases after buying a PC?
o What kinds of DNA are sensitive to this new drug?
o Can we automatically classify web documents?
• Applications: Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Support: “is a measure of what fraction of the population satisfies both the antecedent and the
consequent of the rule”.
• Example:
o People who buy hotdog buns also buy hotdog sausages in 99% of cases. = High
Support
o People who buy hotdog buns buy hangers in 0.005% of cases. = Low support
• Situations where there is high support for the antecedent are worth careful attention
o E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there
is also high confidence.

Confidence: “is a measure of how often the consequent is true when the antecedent is true.”
• Example:
o 90% of Hotdog bun purchases are accompanied by hotdog sausages.
o High confidence is meaningful as we can derive rules.
• Consider the two rule directions: hotdog sausage -> hotdog bun and hotdog bun -> hotdog sausage.
• Two rules may have different confidence levels and yet have the same support.
• E.g. hotdog bun -> hotdog sausage may have a much lower confidence than hotdog sausage ->
hotdog bun, yet both rules have the same support: the fraction of transactions containing both
items (a short computation sketch follows).
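A short Python sketch of both measures on a toy transaction database (the transactions are invented to show that the two rule directions share the same support but differ in confidence):

    transactions = [
        {"bun", "sausage"},
        {"bun", "sausage"},
        {"bun"},
        {"bun", "hanger"},
        {"sausage", "cola"},
    ]

    def support(itemset):
        # fraction of all transactions containing every item of the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # how often the consequent holds when the antecedent holds
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bun", "sausage"}))        # 0.4  -- same for both rule directions
    print(confidence({"bun"}, {"sausage"}))   # 0.5
    print(confidence({"sausage"}, {"bun"}))   # 0.666...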

Apriori Algorithm
It is a frequent pattern mining algorithm that finds the frequent itemsets by generating candidate
itemsets and testing them against the data.

• How to generate candidates?


Step 1: self-joining Lk
Step 2: pruning (a code sketch of both steps follows the example below)
• How to count supports of candidates?
- By counting how many times each candidate occurs in the transaction database.

Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
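A small Python sketch of the two candidate-generation steps, reproducing the L3 example above (the function name apriori_gen is my own label, not from the notes):

    from itertools import combinations

    def apriori_gen(Lk, k):
        """Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk."""
        Lk = {tuple(sorted(s)) for s in Lk}
        candidates = set()
        for a in Lk:
            for b in Lk:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:          # Step 1: self-join
                    c = a + (b[-1],)
                    # Step 2: prune -- every k-subset of the candidate must be in Lk
                    if all(sub in Lk for sub in combinations(c, k)):
                        candidates.add(c)
        return candidates

    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
    print(apriori_gen(L3, 3))   # {('a','b','c','d')}; 'acde' is pruned since 'ade' is not in L3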

Frequent Pattern Growth (FP-Growth) Algorithm
(Mining Frequent Patterns Without Candidate Generation)

It grows long patterns from short ones using local frequent items

• “abc” is a frequent pattern


• Get all transactions having “abc”: DB|abc
• If “d” is a local frequent item in DB|abc, then “abcd” is a frequent pattern (a simplified sketch follows)
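A simplified pattern-growth sketch in Python; it projects plain transaction lists instead of building an FP-tree, which is enough to show how longer patterns are grown from shorter ones via conditional databases (the data and min_count are illustrative):

    from collections import Counter

    def pattern_growth(db, min_count, suffix=()):
        """Recursively grow frequent patterns from projected (conditional) databases."""
        patterns = {}
        counts = Counter(item for trans in db for item in trans)
        frequent = sorted(item for item, c in counts.items() if c >= min_count)
        for i, item in enumerate(frequent):
            pattern = suffix + (item,)
            patterns[pattern] = counts[item]
            # DB | pattern: transactions containing the item, restricted to later
            # items so that each pattern is generated exactly once
            projected = [{j for j in trans if j in frequent[i + 1:]}
                         for trans in db if item in trans]
            patterns.update(pattern_growth(projected, min_count, pattern))
        return patterns

    db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "e"}, {"b", "c", "d"}]
    print(pattern_growth(db, min_count=2))   # every frequent itemset with its count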
Mining Multi-Level Associations
• A top-down, progressive deepening approach:
o First find high-level strong rules:
• milk -> bread [20%, 60%].
o Then find their lower-level “weaker” rules:
2% milk -> wheat bread [6%, 50%].
• Variations in mining multiple-level association rules:
o Level-crossed association rules:
2% milk -> Wonder wheat bread
o Association rules with multiple, alternative hierarchies:
2% milk -> Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
• Uniform Support: the same minimum support for all levels
o + One minimum support threshold. No need to examine itemsets containing any
item whose ancestors do not have minimum support.
o – Lower level items do not occur as frequently. If support threshold
too high ⇒ miss low level associations
too low ⇒ generate too many high level associations
• Reduced Support: reduced minimum support at lower levels
o There are 4 search strategies:
Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item

Mining Quantitative Association Rules


• Determine the number of partitions for each quantitative attribute
• Map values/ranges to consecutive integer values such that the order is preserved
• Find the support of each value of each attribute, and combine adjacent values/ranges when
their support is less than MaxSup; then find the frequent itemsets whose support is larger
than MinSup (a small sketch of the partition-and-merge step follows this list)
• Use the frequent itemsets to generate association rules
• Prune out uninteresting rules
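A rough Python sketch of the partition-and-merge step for one quantitative attribute (the attribute, the number of partitions and the MaxSup threshold are illustrative assumptions):

    from collections import Counter

    ages = [21, 22, 22, 23, 25, 26, 29, 33, 34, 41, 47, 52]   # raw attribute values
    n_partitions = 4
    lo, hi = min(ages), max(ages)
    width = (hi - lo) / n_partitions

    def interval_id(v):
        # map a value to a consecutive integer interval id, preserving order
        return min(int((v - lo) / width), n_partitions - 1)

    support = Counter(interval_id(v) for v in ages)   # records per interval

    # merge an interval with the next one while the combined support stays below MaxSup
    max_sup = 0.4 * len(ages)
    groups, current = [], [0]
    for i in range(1, n_partitions):
        if sum(support[j] for j in current) < max_sup:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    print(groups)   # each group of interval ids becomes one "item" for frequent-itemset mining

Frequent itemsets over these interval items can then be mined with Apriori as above, and rules generated and pruned from them.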
Partial Completeness
• R : rules obtained before partition
• R’: rules obtained after partition
• Partial completeness measures the maximum distance between a rule in R and its closest
generalization in R’
• An itemset X̂ is a generalization of itemset X if it has the same attributes as X and, for each
quantitative attribute, the interval in X̂ contains the corresponding interval in X
• The distance between X and its generalization is defined by the ratio of their supports
K-Complete
• C : the set of frequent itemsets
• For any K ≥ 1, P is K-complete w.r.t. C if:
1. P ⊆ C
2. For any itemset X (or its subset) in C, there exists a generalization of X (or of its
subset) in P whose support is no more than K times that of X (or its subset)
• The smaller K is, the less information is lost

Constraint based Association Mining


• Interactive, exploratory mining of gigabytes of data?
o Could it be real? — Making good use of constraints!
• What kinds of constraints can be used in mining?
o Knowledge type constraint: classification, association, etc.
o Data constraint: SQL-like queries
Find product pairs sold together in Vancouver in Dec.’98.
o Dimension/level constraints:
in relevance to region, price, brand, customer category.
o Rule constraints
small sales (price < $10) triggers big sales (sum > $200).
o Interestingness constraints:
strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
• Pattern space pruning constraints
o Anti-monotonic: if constraint c is violated by an itemset, further mining of its supersets
can be terminated (a small sketch follows this section)
o Monotonic: if c is satisfied by an itemset, there is no need to check c again for its supersets
o Succinct: c must be satisfied, so one can start directly with the data sets satisfying c
o Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one of
them if the items in each transaction can be properly ordered
• Data space pruning constraint
o Data succinct: Data space can be pruned at the initial pattern mining process
o Data anti-monotonic: If a transaction t does not satisfy c, t can be pruned from its
further mining
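A small sketch of anti-monotone pruning plugged into candidate checking; it uses the classic constraint sum(price) <= v, which is anti-monotone for non-negative prices (the prices and budget are illustrative):

    prices = {"milk": 3, "bread": 2, "tv": 400, "camera": 250}
    budget = 200   # constraint c: total price of the itemset must not exceed the budget

    def satisfies_c(itemset):
        return sum(prices[i] for i in itemset) <= budget

    def prune(candidates):
        # anti-monotone: once c is violated, every superset also violates it,
        # so violating candidates are dropped and never extended further
        return [c for c in candidates if satisfies_c(c)]

    print(prune([{"milk", "bread"}, {"milk", "tv"}]))   # only {milk, bread} survives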
