BCA Semester VI Data Mining Module 3 (Presentation Kind of Notes)
An association rule has an antecedent (the "if" part) and a consequent (the "then" part), written
X ⇒ Y
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
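A minimal brute-force sketch of this two-step approach on a hypothetical toy transaction set (Apriori-style pruning is omitted for clarity; the items and thresholds are assumptions for illustration):

```python
from itertools import combinations

# Hypothetical toy transactions and thresholds, for illustration only
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "butter"}]
minsup, minconf = 0.5, 0.6

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)

# Step 1: Frequent itemset generation (brute-force enumeration)
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if support(frozenset(c)) >= minsup]

# Step 2: Rule generation - each rule is a binary partition X => Y of a frequent itemset
for itemset in frequent:
    for k in range(1, len(itemset)):
        for X in map(frozenset, combinations(itemset, k)):
            Y = itemset - X
            conf = support(itemset) / support(X)
            if conf >= minconf:
                print(f"{set(X)} => {set(Y)}  sup={support(itemset):.2f}, conf={conf:.2f}")
```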
E.g., "The average sales of Sony Digital Camera increase by over 16% when sold together with Sony Laptop Computer": both Sony Digital Camera and Sony Laptop Computer are siblings in the concept hierarchy, with Sony as their parent.
6. Based on the kinds of patterns to be mined: sequential patterns / structured patterns, e.g., customers may tend to first buy a PC, followed by a digital camera, and then a memory card.
Scalable Methods for Mining Frequent Patterns
Apriori method: generate length-(k+1) candidate itemsets from length-k frequent itemsets and test the candidates against the database, scanning the DB once per level.
Challenges of Apriori:
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Minimize number of candidates
Facilitate support counting of candidates
Hashing: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the support threshold cannot be a frequent itemset.
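A minimal sketch of this hashing idea: while scanning transactions, every candidate 2-itemset is hashed into a small bucket table, and pairs whose bucket count cannot reach the threshold are pruned before candidate generation. The transactions, bucket count, and toy hash function are assumptions for illustration:

```python
from itertools import combinations

# Hypothetical transactions and an absolute support threshold, for illustration only
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_count = 3
NUM_BUCKETS = 7
buckets = [0] * NUM_BUCKETS

def bucket_of(pair):
    # Toy deterministic hash: sum of character codes of the joined item names
    return sum(map(ord, "".join(sorted(pair)))) % NUM_BUCKETS

# Single pass: hash every 2-itemset occurring in a transaction into a bucket
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket_of(pair)] += 1

# Collisions can only overcount, so a pair in a bucket whose total is below
# min_count can never be frequent and is dropped from the candidate set.
def may_be_frequent(pair):
    return buckets[bucket_of(pair)] >= min_count

print(may_be_frequent(("a", "b")))
```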
Dynamic Itemset Counting: Reduce Number of Scans
Method (FP-growth): for each frequent item p, construct its conditional pattern base and then its conditional FP-tree, and repeat the process recursively.
Completeness: the FP-tree preserves complete information for frequent pattern mining.
Compactness: the FP-tree reduces irrelevant information and is typically much smaller than the original database.
Concept hierarchy (figure): Food at the top level, with items such as milk and bread below it, and store sections such as Freezer and Chilled.
Multilevel Associations
Uniform Support: the same minimum support threshold is used at every level of the hierarchy.
Level 2 (min_sup = 5%): Skim Milk [support = 4%], 2% Milk [support = 6%]; with a uniform 5% threshold, only 2% Milk is frequent at level 2.
Multilevel Association: Uniform Support vs. Reduced Support
Reduced Support: the minimum support threshold is lowered at lower levels of the hierarchy.
Reduced Support (figure):
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): Skim Milk [support = 4%], 2% Milk [support = 6%]; with the reduced 3% threshold at level 2, both items are frequent (see the sketch below).
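A minimal sketch contrasting the two strategies, using the support values, hierarchy levels, and thresholds from the figures above:

```python
# Support values and hierarchy levels taken from the figures above
supports = {"Milk": 0.10, "Skim Milk": 0.04, "2% Milk": 0.06}
level = {"Milk": 1, "Skim Milk": 2, "2% Milk": 2}

def frequent_items(min_sup_per_level):
    # Keep an item only if it meets the threshold of its own level
    return [i for i, s in supports.items() if s >= min_sup_per_level[level[i]]]

print("Uniform support:", frequent_items({1: 0.05, 2: 0.05}))  # Milk and 2% Milk only
print("Reduced support:", frequent_items({1: 0.05, 2: 0.03}))  # all three items
```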
Multilevel Association: Redundancy Filtering
Some rules may be redundant due to ancestor relationships between items: a rule is redundant if its support and confidence are close to their expected values based on an ancestor of the rule.
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”) ∧ occupation(X,“student”) ⇒ buys(X,
“coke”)
Hybrid-dimension assoc. rules (repeated predicates)
age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
Categorical attributes: finite number of possible values, no ordering among the values, e.g., occupation, brand, colour.
Quantitative attributes: numeric values, e.g., age, salary, height, weight, etc.
3. Mining Quantitative Associations, e.g., rules involving the attributes (age, income, buys)
Quantitative Association Rules
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid.
Example: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
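A minimal sketch of dynamic discretization followed by 2-D rule formation; the records and bin boundaries below are hypothetical:

```python
from collections import Counter

# Hypothetical records: (age, income in thousands, bought a high-resolution TV?)
records = [(34, 45, True), (35, 38, True), (34, 31, True), (52, 40, False), (23, 20, False)]

def bin_age(age):
    # Illustrative discretization into 2-year age intervals, e.g. "34-35"
    lo = age - (age % 2)
    return f"{lo}-{lo + 1}"

def bin_income(income):
    # Illustrative income intervals (in thousands)
    if income < 30:
        return "<30K"
    return "30-50K" if income <= 50 else ">50K"

antecedent, both = Counter(), Counter()
for age, income, buys in records:
    key = (bin_age(age), bin_income(income))
    antecedent[key] += 1
    both[key] += buys  # transactions satisfying antecedent AND consequent

# Each grid cell (age interval, income interval) yields a candidate 2-D quantitative rule
for (a, i), n in antecedent.items():
    print(f'age(X, "{a}") ∧ income(X, "{i}") ⇒ buys(X, "high resolution TV")'
          f'  [sup = {both[(a, i)] / len(records):.2f}, conf = {both[(a, i)] / n:.2f}]')
```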
Association Rule Mining to Correlation Analysis
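One correlation measure commonly used at this point is lift; as a worked formula (not written out in the notes above):
lift(A, B) = P(A ∪ B) / ( P(A) × P(B) )
lift > 1: A and B are positively correlated; lift = 1: A and B are independent; lift < 1: A and B are negatively correlated.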
Learning step (figure): a classification algorithm is applied to the training data to produce a classifier (model); the model learned here can be expressed as the rule:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
(Figure): the classifier is first applied to the testing data to estimate accuracy, and then to unseen data, e.g., (Jeff, Professor, 4) → Tenured?
Data preparation issues: data cleaning, data transformation.
Accuracy: the ability of the classifier to correctly predict the class label of new or previously unseen data.
Decision tree (figure): the root node splits on age? with branches <=30, 31..40, and >40; the leaf nodes carry the class labels no / yes.
Decision trees are easy to comprehend.
The split information normalizes the information gain:
SplitInfo_A(D) = - Σ_j ( |D_j| / |D| ) × log2( |D_j| / |D| )
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex.
gain_ratio(income) = 0.029/0.926 = 0.031
The attribute with the maximum gain ratio is selected as
the splitting attribute
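A minimal sketch of the entropy, split-information, and gain-ratio computations; the rows and class labels below are hypothetical and do not reproduce the 0.029/0.926 example above:

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Info(D) = -sum p_i * log2(p_i) over the class proportions
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    n = len(labels)
    # Partition the class labels by the value of the chosen attribute
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    info_attr = sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    gain = entropy(labels) - info_attr
    return gain / split_info if split_info else 0.0

# Hypothetical rows: (income,) and class labels buys_computer
rows = [("low",), ("low",), ("medium",), ("medium",), ("medium",), ("high",), ("high",)]
labels = ["yes", "no", "yes", "yes", "no", "no", "yes"]
print(round(gain_ratio(rows, labels, 0), 3))
```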
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ p_j²   (sum over the n classes), where p_j is the relative frequency of class j in D.
If D is split on attribute A into two subsets D1 and D2, the gini index of the split is
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity: Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute).
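A minimal sketch of these gini computations; the class labels and the candidate binary partition below are made up for illustration:

```python
from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    # Weighted gini of a binary split D -> D1, D2
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Hypothetical class labels before and after one candidate split
D = ["yes"] * 9 + ["no"] * 5
D1, D2 = D[:10], D[10:]
print(gini(D), gini_split(D1, D2))  # reduction in impurity = difference of the two
```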
Comparing Attribute Selection Measures
Bayesian Theorem: Basics
The Naive Bayes classifier assumes that all the features are independent of each other: the presence or absence of a feature does not influence the presence or absence of any other feature.
In the example below, all the features are categorical variables taking one of two values: T (True) or F (False).
Parrots: 50 out of 500 (10%) can swim according to our data, 500 out of 500 (100%) have wings, 400 out of 500 (80%) are green, and 0 (0%) have dangerous teeth.
Dogs: 450 out of 500 (90%) can swim, 0 (0%) have wings, 0 (0%) are green, and 500 out of 500 (100%) have dangerous teeth.
Fishes: 500 out of 500 (100%) can swim, 0 (0%) have wings, 100 out of 500 (20%) are green, and 50 out of 500 (10%) have dangerous teeth (see the sketch below).
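A minimal sketch of how a Naive Bayes classifier scores a new animal against the counts above (500 animals per class, so equal priors); the small smoothing constant EPS is an added assumption to avoid zero probabilities:

```python
# Per-class feature counts out of 500 animals each, taken from the text above
counts = {
    "Parrot": {"Swim": 50,  "Wings": 500, "Green": 400, "DangerousTeeth": 0},
    "Dog":    {"Swim": 450, "Wings": 0,   "Green": 0,   "DangerousTeeth": 500},
    "Fish":   {"Swim": 500, "Wings": 0,   "Green": 100, "DangerousTeeth": 50},
}
N = 500       # animals per class
EPS = 1e-6    # smoothing assumption to avoid zero probabilities

def posterior(observation):
    # P(class | features) ∝ P(class) * Π P(feature value | class), assuming independence
    scores = {}
    for cls, feats in counts.items():
        p = 1.0 / len(counts)  # equal priors: 500 animals of each class
        for feat, value in observation.items():
            p_true = max(feats[feat] / N, EPS)
            p *= p_true if value else max(1 - p_true, EPS)
        scores[cls] = p
    total = sum(scores.values())
    return {cls: p / total for cls, p in scores.items()}

# Example: an animal that can swim, is green, has no wings and no dangerous teeth
print(posterior({"Swim": True, "Wings": False, "Green": True, "DangerousTeeth": False}))
```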
Instance-based learning:
Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean space.
Locally weighted regression
Constructs local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based inference
The nearest neighbors are defined in terms of Euclidean distance: dist(X1, X2) = sqrt( Σ_i (x1i - x2i)² )
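A minimal k-nearest-neighbor sketch under this distance definition; the 2-D training points and query are hypothetical:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical labeled training points in a 2-D Euclidean space
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((4.5, 3.9), "B")]

def knn_classify(x, k=3):
    # Lazy evaluation: all work happens at query time
    neighbors = sorted(training, key=lambda pt: dist(x, pt[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))  # majority vote among the 3 nearest neighbors -> "A"
```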
Linear regression involves a response variable y and a single predictor variable x:
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are the regression coefficients.
Method of least squares: estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Nonlinear (polynomial) regression: y = w0 + w1 x + w2 x² + w3 x³
By introducing new variables x2 = x² and x3 = x³, this is convertible to the linear form y = w0 + w1 x + w2 x2 + w3 x3, which can then be solved by the method of least squares.
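A minimal sketch of the least-squares estimates for the simple linear case, using the closed-form formulas for w1 and w0 on hypothetical observations:

```python
# Hypothetical (x, y) observations for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least squares: w1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2),  w0 = ȳ - w1 * x̄
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
w0 = mean_y - w1 * mean_x

print(f"y = {w0:.2f} + {w1:.2f} x")  # estimated regression line
```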