Unit 2: Scs5623 - Data Mining and Warehousing
UNIT 2
How it is done
• Collect the task-relevant data (initial relation) using a relational database query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating their
respective counts
• Interaction with users for knowledge presentation
1. Data collection
target class: graduate student
contrasting class: undergraduate student
2. Analytical generalization using Ui
attribute removal
remove name and phone#
attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country, age_range and gpa
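The steps above (attribute removal, attribute generalization, and merging identical generalized tuples while accumulating counts) can be sketched in Python. The relation, concept hierarchies, and generalization functions below are illustrative assumptions, not data from the course.

```python
from collections import Counter

# Toy task-relevant tuples: (gender, major, birth_place, gpa).
# name and phone# are assumed to have been removed already.
initial_relation = [
    ("M", "CS", "Vancouver, Canada", 3.7),
    ("F", "Biology", "Toronto, Canada", 3.9),
    ("M", "Physics", "Seattle, USA", 3.6),
    ("F", "CS", "Ottawa, Canada", 3.8),
]

# Hypothetical generalization functions, each climbing one level
# of an assumed concept hierarchy.
def generalize_major(m):        # specific major -> discipline
    return "Science" if m in ("CS", "Physics", "Biology") else "Other"

def generalize_birth_place(p):  # "city, country" -> country
    return p.split(", ")[-1]

def generalize_gpa(g):          # numeric gpa -> range
    return "3.6-4.0" if g >= 3.6 else "below 3.6"

# Generalize every tuple, then merge identical generalized tuples
# and accumulate their counts.
counts = Counter(
    (gender, generalize_major(major),
     generalize_birth_place(bp), generalize_gpa(gpa))
    for gender, major, bp, gpa in initial_relation
)
for tup, count in counts.items():
    print(tup, "count =", count)
```

The two female CS/Biology students from Canada collapse into a single generalized tuple with count 2, which is exactly the aggregation step of attribute-oriented induction.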
Mining Class Comparison
• Comparison: Comparing two or more classes
• Method:
o Partition the set of relevant data into the target class and the contrasting class(es)
o Generalize both classes to the same high level concepts
o Compare tuples with the same high level descriptions
o Present for every tuple its description and two measures
support - distribution within single class
comparison - distribution between classes
o Highlight the tuples with strong discriminant features
• Relevance Analysis:
o Find attributes (features) which best distinguish different classes
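The two measures attached to each generalized tuple can be computed directly. The counts below are hypothetical, chosen only to illustrate the arithmetic: support is the tuple's share of its own class, while comparison is the share of the tuple's total count contributed by the target class.

```python
# Hypothetical counts of one generalized description in each class:
target_count = 90        # e.g. graduate students matching the tuple
contrast_count = 10      # undergraduates matching the same tuple
target_total = 300       # total count of generalized graduate tuples

# support: distribution within the single (target) class
support_in_target = target_count / target_total

# comparison: distribution between classes for this tuple
comparison = target_count / (target_count + contrast_count)

print(f"support = {support_in_target:.2f}, comparison = {comparison:.2f}")
# A comparison value near 1 marks a strongly discriminant description.
```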
Association Rules
“An association algorithm creates rules that describe how often events have occurred together.”
Example: when a customer buys a hammer, 90% of the time they also buy nails.
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set.
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and
association rule mining
• Motivation: Finding inherent regularities in data
o What products were often purchased together?— Beer and diapers?!
o What are the subsequent purchases after buying a PC?
o What kinds of DNA are sensitive to this new drug?
o Can we automatically classify web documents?
• Applications: Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Support: “is a measure of what fraction of the population satisfies both the antecedent and the
consequent of the rule”.
• Example:
o People who buy hotdog buns also buy hotdog sausages in 99% of cases. = High
Support
o People who buy hotdog buns buy hangers in 0.005% of cases. = Low support
• Situations where there is high support for the antecedent are worth careful attention
o E.g. hotdog sausages should be placed near hotdog buns in supermarkets if
there is also high confidence.
Confidence: “is a measure of how often the consequent is true when the antecedent is true.”
• Example:
o 90% of Hotdog bun purchases are accompanied by hotdog sausages.
o High confidence is meaningful as we can derive rules.
• Rule notation: Hotdog sausage → Hotdog bun
• Two rules may have different confidence levels and yet the same support.
• E.g. the rule Hotdog bun → Hotdog sausage may have a much lower confidence than
Hotdog sausage → Hotdog bun, yet both have the same support, because the support
of the itemset {Hotdog bun, Hotdog sausage} does not depend on the rule's direction.
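Both measures can be counted directly from a transaction database. The toy baskets below are illustrative; note how the two rule directions share one support value but differ in confidence.

```python
# Toy basket data (item names are illustrative).
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage", "mustard"},
    {"bun", "sausage"},
    {"bun"},
    {"hanger"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """How often the consequent holds when the antecedent holds."""
    return support(antecedent | consequent) / support(antecedent)

# Both rule directions share the same support...
print(support({"bun", "sausage"}))        # 0.6
# ...but confidence depends on the direction of the rule.
print(confidence({"sausage"}, {"bun"}))   # 1.0
print(confidence({"bun"}, {"sausage"}))   # 0.75
```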
Apriori Algorithm
It is a frequent pattern mining algorithm that finds the frequent itemsets by generating
candidates.
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
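The self-join and pruning steps above can be sketched as follows. Itemsets are represented as sorted tuples; the function name `apriori_gen` follows the literature's convention but the implementation details are a sketch.

```python
from itertools import combinations

# Frequent 3-itemsets from the example above.
L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]

def apriori_gen(Lk):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets."""
    k = len(Lk[0])
    Lk_set = set(Lk)
    candidates = []
    # Self-join: merge itemsets that agree on their first k-1 items.
    for p, q in combinations(sorted(Lk), 2):
        if p[:k - 1] == q[:k - 1]:
            c = p + (q[-1],)
            # Prune: every k-subset of the candidate must itself be frequent.
            if all(s in Lk_set for s in combinations(c, k)):
                candidates.append(c)
    return candidates

print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')]  i.e. C4 = {abcd}
```

The join produces abcd and acde; acde is then pruned because its subset ade is not in L3, leaving C4 = {abcd} as in the example.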
Frequent Pattern Growth Tree Algorithm
(Mining Frequent Patterns Without Candidate Generation)
It grows long patterns from short ones using locally frequent items.
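The pattern-growth idea can be sketched without building a full FP-tree, using projected (conditional) transaction lists instead; the recursion is the same: extend each short pattern with the items that are frequent in its conditional database. This is a simplified sketch, not the tree-based algorithm itself.

```python
from collections import Counter

def pattern_growth(transactions, min_support, suffix=()):
    """Recursively grow frequent patterns from projected databases."""
    counts = Counter(item for t in transactions for item in t)
    patterns = {}
    for item, count in counts.items():
        if count >= min_support:
            pattern = (item,) + suffix
            patterns[pattern] = count
            # Conditional database: transactions containing the item,
            # restricted to locally frequent items ordered before it.
            projected = [
                [i for i in t if counts[i] >= min_support and i < item]
                for t in transactions if item in t
            ]
            patterns.update(pattern_growth(projected, min_support, pattern))
    return patterns

db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = pattern_growth(db, min_support=3)
print(result)
```

With min_support = 3, the sketch finds all six frequent patterns of this toy database ({a}, {b}, {c}, {a,b}, {a,c}, {b,c}) and correctly excludes {a,b,c}, which occurs only twice, without ever generating candidate sets.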