Data Mining: Concepts and Techniques - Chapter 4
Task-relevant data
Background knowledge
Characterization
Discrimination
Association
Classification/prediction
Clustering
Outlier analysis
Schema hierarchy
E.g., street < city < province_or_state < country
Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
email address: login-name < department < university < country
Rule-based hierarchy
low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
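The schema and set-grouping hierarchies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SCHEMA_LEVELS`, `roll_up`, `AGE_GROUPS`, and `age_group`, and the sample address record, are hypothetical and do not come from the slides.

```python
# Illustrative sketch only: all names and sample data are assumptions.

# Schema hierarchy: attribute levels ordered from most specific to most general.
SCHEMA_LEVELS = ["street", "city", "province_or_state", "country"]


def roll_up(record, level):
    """Keep only the attributes at `level` or more general than it."""
    start = SCHEMA_LEVELS.index(level)
    return {name: record[name] for name in SCHEMA_LEVELS[start:]}


# Set-grouping hierarchy: ranges of values mapped to a higher-level concept.
AGE_GROUPS = {"young": range(20, 40), "middle_aged": range(40, 60)}


def age_group(age):
    """Return the group whose range contains `age`, or None if no group covers it."""
    for label, ages in AGE_GROUPS.items():
        if age in ages:
            return label
    return None


address = {
    "street": "123 Main St",
    "city": "Vancouver",
    "province_or_state": "BC",
    "country": "Canada",
}
print(roll_up(address, "city"))
print(age_group(25), age_group(45), age_group(70))
```

Rolling up to `city` drops the street and keeps the more general levels; ages outside the two listed groups map to `None` because the slide defines no group for them.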
Measurements of Pattern Interestingness
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold (description)
Novelty
not previously known, surprising (used to remove redundant rules, e.g., Canada vs. Vancouver rule implication support ratio)
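The certainty and utility measures above can be made concrete by counting over a small transaction set. The sketch below is a minimal illustration; the toy transactions and the helper names `count`, `support`, and `conditional` are assumptions introduced here. It follows the slide's formula P(A|B) = n(A and B) / n(B) literally.

```python
# Hypothetical toy data: each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def count(itemset, transactions):
    """n(itemset): number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(a, b, transactions):
    """Fraction of all transactions that contain both A and B."""
    return count(a | b, transactions) / len(transactions)


def conditional(a, b, transactions):
    """P(A|B) = n(A and B) / n(B); returns 0.0 when B never occurs."""
    n_b = count(b, transactions)
    return count(a | b, transactions) / n_b if n_b else 0.0


A, B = {"milk"}, {"bread"}
print("support:", support(A, B, transactions))
print("P(A|B):", conditional(A, B, transactions))
```

Here two of the four transactions contain both items and three contain bread, so the support is 2/4 and P(A|B) is 2/3.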
Motivation
Design
task-relevant data
interestingness measure
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition
Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
Discrimination
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
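To see how the grammar for a discrimination task expands into a concrete statement, the following sketch assembles a `mine comparison` clause from its parts. It is a plain string builder, not a DMQL parser or engine; the function name `mine_comparison`, the class names, the conditions, and the comma used to join several measures are all assumptions made for illustration.

```python
def mine_comparison(target, contrasts, measures, pattern_name=None):
    """Render a 'mine comparison' clause following the grammar above.

    target    -- (class_name, condition) for the target class
    contrasts -- list of (class_name, condition), one per contrast class
    measures  -- list of measure names (comma-joined here by assumption)
    """
    head = "mine comparison"
    if pattern_name:  # the [as pattern_name] part is optional
        head += f" as {pattern_name}"
    lines = [head, f"for {target[0]} where {target[1]}"]
    for cls, cond in contrasts:  # {versus ...} may repeat zero or more times
        lines.append(f"versus {cls} where {cond}")
    lines.append("analyze " + ", ".join(measures))
    return "\n".join(lines)


# Hypothetical classes and conditions, chosen only to exercise every clause.
query = mine_comparison(
    target=("big_spenders", "avg(I.price) >= 100"),
    contrasts=[("budget_spenders", "avg(I.price) < 100")],
    measures=["count"],
    pattern_name="purchaseGroups",
)
print(query)
```

Keeping each clause on its own line mirrors the layout of the grammar, which makes it easy to check the output against the rule it was generated from.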
operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} :=
cluster(default, age, 5) < all(age)
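The `cluster(default, age, 5)` call above derives five age categories from the data, but the slides do not say which algorithm `default` selects. As a stand-in only, the sketch below uses equal-width binning; the function name `equal_width_bins` and the sample ages are assumptions.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width categories, numbered 1..k.

    Equal-width binning is a stand-in here; it is not claimed to be the
    clustering method that DMQL's `default` refers to.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k

    def category(v):
        if width == 0:  # all values identical: a single category
            return 1
        # The maximum value would land in bin k + 1, so clamp it to k.
        return min(int((v - lo) / width) + 1, k)

    return {v: category(v) for v in values}


ages = [18, 25, 33, 47, 52, 61, 78]  # hypothetical customer ages
age_category = equal_width_bins(ages, 5)
for age in ages:
    print(age, "-> age_category(%d)" % age_category[age])
```

Each resulting category sits one level below `all(age)` in the hierarchy, matching the `< all(age)` part of the definition.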
rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost)< $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) >= $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
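The rule-based hierarchy above amounts to a small classification function over the margin price - cost. A minimal sketch follows; the function name `profit_margin_level` is hypothetical, and placing a margin of exactly $50 in the medium level is an assumption about the boundary.

```python
def profit_margin_level(price, cost):
    """Map an item to its level_1 concept in profit_margin_hierarchy."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:  # assumption: a margin of exactly 50 counts as medium
        return "medium_profit_margin"
    return "high_profit_margin"


# Hypothetical (price, cost) pairs covering each level and both boundaries.
for price, cost in [(100, 60), (100, 50), (300, 50), (400, 100)]:
    print(price, cost, profit_margin_level(price, cost))
```

Checking the rules in ascending order of margin means each branch only needs its upper bound, and every item falls into exactly one level.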
Example:
with support threshold = 0.05
with confidence threshold = 0.7
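Thresholds like these act as a filter on the rules a mining task returns. The sketch below applies the two values from the example to a hand-made rule list; the `Rule` class, the `interesting` function, the sample rules, and the use of "greater than or equal" for both comparisons are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: str
    consequent: str
    support: float
    confidence: float


def interesting(rules, min_support=0.05, min_confidence=0.7):
    """Keep rules that meet both thresholds (inclusive by assumption)."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Hypothetical rules with made-up measures, for illustration only.
rules = [
    Rule("bread", "milk", support=0.10, confidence=0.80),
    Rule("bread", "butter", support=0.02, confidence=0.90),  # support too low
    Rule("milk", "eggs", support=0.20, confidence=0.50),     # confidence too low
]
kept = interesting(rules)
for r in kept:
    print(f"{r.antecedent} => {r.consequent}")
```

Only the first rule passes both thresholds; the other two each fail one of them.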