Unit-1
Outline
⚫ Types of Data
⚫ Data Quality
⚫ Data Preprocessing
What is Data?
⚫ A data set is a collection of data objects and their attributes
⚫ An attribute is a property or characteristic of an object
– Attribute is also known as variable, field, characteristic, dimension, or feature
⚫ A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
Attribute Values
Properties of Attribute Values
Attribute Type: Nominal (Categorical / Qualitative)
Description: Nominal attribute values only distinguish one object from another (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test
⚫ Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
⚫ Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
⚫ Only presence (a non-zero attribute value) is regarded as important. For example, consider a data set where each object is a student and each attribute
records whether or not a student took a particular course at a university. For a specific
student, an attribute has a value of 1 if the student took the course associated with that
attribute and a value of 0 otherwise. Because students take only a small fraction of all
available courses, most of the values in such a data set would be 0. Therefore, it is
more meaningful and more efficient to focus on the non-zero values.
Critiques of the attribute categorization
⚫ Incomplete
– Asymmetric binary: Binary attribute where only non-zero value is
important.
– Cyclical : A cyclic attribute has values that repeat in a period of
time. Ex. hour, week, year.
– Multivariate : multivalued attribute
– Resolution
◆ Patterns depend on the scale
– Size
◆ Type of analysis may depend on size of data
Types of data sets
⚫ Record
– Data Matrix
– Document Data
– Transaction Data
⚫ Graph
– World Wide Web
– Molecular Structures
⚫ Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
⚫ Document Data: each document is represented as a term vector; each term is an attribute and each value is the number of times the term appears in the document

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6     0     2      0       2
Document 2     0     7      0     2     1      0     0     3      0       0
Document 3     0     1      0     0     1      2     2     0      3       0
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
Ordered Data
⚫ Sequences of transactions
⚫ Genomic sequence data, e.g.:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
⚫ Spatio-Temporal Data
– Example: average monthly temperature of land and ocean
Major Tasks in Data Preprocessing
⚫ Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
⚫ Data integration
– Integration of multiple databases, data cubes, or files
⚫ Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
⚫ Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Quality
⚫ Missing values
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
⚫ Handling missing values
– Eliminate data objects or tuples
– Fill in the missing value manually
– Estimate missing values
◆ By attribute mean / median / mode
◆ Assign a global constant
– Ignore the missing value during analysis

Worked example (the actual values for S.No 2 and 8 are missing and are filled in by each strategy):

S.No   Actual Value   Mean   Median   Mode
1          67          67      67      67
2           -          51      58      67
3          67          67      67      67
4          56          56      56      56
5          58          58      58      58
6          48          48      48      48
7          89          89      89      89
8           -          51      58      67
9          74          74      74      74

Mean   = (67+67+56+58+48+89+74)/9 = 51
Median = 48, 56, 58, 67, 74, 89 = 58
Mode   = most frequently occurring value = 67
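A small pandas sketch of the fill strategies listed above (pandas is assumed to be available). Note that pandas computes the mean and median over the observed values only, so the filled numbers can differ from a hand calculation that divides by the total number of rows:

import pandas as pd

values = pd.Series([67, None, 67, 56, 58, 48, 89, None, 74])   # S.No 2 and 8 are missing

mean_filled   = values.fillna(values.mean())      # mean of the observed values
median_filled = values.fillna(values.median())    # median of the observed values
mode_filled   = values.fillna(values.mode()[0])   # most frequently occurring value (67)
constant_fill = values.fillna(-1)                 # assign a global (sentinel) constant, e.g. -1
dropped       = values.dropna()                   # or eliminate the incomplete objects

print(mean_filled.tolist())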
Duplicate Data
Similarity and Dissimilarity Measures
⚫ Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
⚫ Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
⚫ Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
⚫ Euclidean Distance
– dist(x, y) = ( Σk (xk − yk)² )^(1/2), where k ranges over the attributes of the objects x and y

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix:
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
Minkowski Distance
⚫ dist(x, y) = ( Σk |xk − yk|^r )^(1/r), a generalization of Euclidean distance with parameter r
– r = 1: city-block (Manhattan, L1) distance
– r = 2: Euclidean (L2) distance
– r → ∞: supremum (L∞) distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0
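A minimal NumPy sketch (NumPy assumed available) that reproduces the L1, L2 and L∞ matrices above for the points p1-p4:

import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)   # p1..p4 from the table

def minkowski_matrix(pts, r):
    """Pairwise Minkowski distances; r=1 gives L1, r=2 gives L2, r=np.inf gives L-infinity."""
    diff = np.abs(pts[:, None, :] - pts[None, :, :])
    if np.isinf(r):
        return diff.max(axis=-1)
    return (diff ** r).sum(axis=-1) ** (1.0 / r)

for r in (1, 2, np.inf):
    print("r =", r)
    print(np.round(minkowski_matrix(points, r), 3))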
Common Properties of a Distance
⚫ d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y (positive definiteness)
⚫ d(x, y) = d(y, x) for all x and y (symmetry)
⚫ d(x, z) ≤ d(x, y) + d(y, z) for all x, y and z (triangle inequality)
⚫ A distance that satisfies these properties is called a metric
Mahalanobis Distance
MD(x) = [ (x − mean)' S⁻¹ (x − mean) ]^(1/2), where S is the covariance matrix of the data

Worked example: three attributes (A, B, C), five observations each, and a query point x = (4, 500, 40)

A:    1     2     4     2     5
B:  100   300   200   600   100
C:   10    15    20    10    30

Step 1. Calculate the mean of each attribute: mean = (2.8, 260, 17)

Step 2. Find the difference (x − mean) and its transpose (x − mean)':
(x − mean) = (1.2, 240, 23)

Step 3. Calculate the covariance matrix S:
       2.7    -110     13
S = ( -110   43000   -900 )
        13    -900     70

Step 4. MD = (106.7)^(1/2) = 10.33
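The same computation in NumPy (assumed available); np.cov uses the sample covariance with an n−1 denominator, which matches the worked example:

import numpy as np

data = np.array([
    [1,   2,   4,   2,   5],    # attribute A
    [100, 300, 200, 600, 100],  # attribute B
    [10,  15,  20,  10,  30],   # attribute C
], dtype=float)
x = np.array([4, 500, 40], dtype=float)   # query point

mean = data.mean(axis=1)                  # [2.8, 260, 17]
cov = np.cov(data)                        # rows are variables, columns are observations
diff = x - mean                           # [1.2, 240, 23]
md = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
print(round(float(md), 2))                # 10.33, as in the worked example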
Similarity Between Binary Vectors
⚫ Common situation is that objects, x and y, have only binary attributes
⚫ Compute similarities using the following quantities
  f01 = the number of attributes where x is 0 and y is 1
  f10 = the number of attributes where x is 1 and y is 0
  f00 = the number of attributes where x is 0 and y is 0
  f11 = the number of attributes where x is 1 and y is 1
⚫ Simple Matching Coefficient: SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
⚫ Jaccard Coefficient (for asymmetric binary attributes): J = f11 / (f01 + f10 + f11)
⚫ Example:
  x = 1 0 0 0 0 0 0 0 0 0
  y = 0 0 0 0 0 0 1 0 0 1
  f01 = 2, f10 = 1, f00 = 7, f11 = 0
  SMC = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  J   = 0 / (2 + 1 + 0) = 0
Cosine Similarity
⚫ cos(x, y) = (x ⋅ y) / (||x|| ||y||), where ⋅ is the vector dot product and ||x|| is the length of vector x
⚫ Example:
  x = 3 2 0 5 0 0 0 2 0 0
  y = 1 0 0 0 0 0 0 1 0 2
  x ⋅ y = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||x|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||y|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
  cos(x, y) = 5 / (6.481 × 2.449) = 0.3150
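The same calculation in a short NumPy sketch (NumPy assumed available):

import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(float(cos), 4))   # 0.315, matching the hand calculation above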
Extended Jaccard Coefficient (Tanimoto coefficient)
⚫ EJ(x, y) = (x ⋅ y) / ( ||x||² + ||y||² − x ⋅ y )
⚫ Example:
  x = (1, 0, 1, 0, 1)
  y = (1, 1, 1, 0, 1)
  x ⋅ y  = 1*1 + 0*1 + 1*1 + 0*0 + 1*1 = 3
  ||x||² = 1*1 + 0*0 + 1*1 + 0*0 + 1*1 = 3
  ||y||² = 1*1 + 1*1 + 1*1 + 0*0 + 1*1 = 4
  EJ(x, y) = 3 / (3 + 4 − 3) = 3/4 = 0.75
Correlation measures the linear relationship between objects
corr(x, y) = covariance(x, y) / (std_dev(x) × std_dev(y)) = Sxy / (Sx Sy)

Example:
x = (-3, 6, 0, 3, -6)    y = (1, -2, 0, -1, 2)
Mean of x = 0, mean of y = 0, n = 5
Sxy = Σ (xk − x̄)(yk − ȳ) / (n − 1) = (−3 − 12 + 0 − 3 − 12) / 4 = −7.5
Sx  = [ Σ (xk − x̄)² / (n − 1) ]^(1/2) = [ (9 + 36 + 0 + 9 + 36) / 4 ]^(1/2) = 4.743
Sy  = [ Σ (yk − ȳ)² / (n − 1) ]^(1/2) = [ (1 + 4 + 0 + 1 + 4) / 4 ]^(1/2) = 1.581
corr(x, y) = −7.5 / (4.743 × 1.581) = −1
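A quick NumPy check (NumPy assumed available); the correlation is exactly −1 because y is a negative multiple of x:

import numpy as np

x = np.array([-3, 6, 0, 3, -6], dtype=float)
y = np.array([1, -2, 0, -1, 2], dtype=float)

corr = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(round(float(corr), 4))     # -1.0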
Entropy
⚫ For
– a variable (event) X,
– with n possible values (outcomes) x1, x2, …, xn,
– each outcome having probability p1, p2, …, pn,
– the entropy of X, H(X), is given by

  H(X) = − Σ (i = 1 to n) pi log2 pi

⚫ log2(x) = ln(x) / ln(2)
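A tiny Python sketch of the entropy formula:

import math

def entropy(probs):
    """Shannon entropy H(X) = -sum(p_i * log2(p_i)); zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0 bit for a fair coin
print(round(entropy([0.9, 0.1]), 4))    # 0.469 bits for a biased coin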
Entropy Examples
Mutual Information
⚫ Information one variable provides about another:
  MI(X, Y) = H(X) + H(Y) − H(X, Y)
⚫ Example: mutual information of Student Status and Grade = 0.9928 + 1.4406 − 2.2710 = 0.1624
Using Weights to Combine Similarities
⚫ Aggregation
⚫ Sampling
⚫ Discretization and Binarization
⚫ Attribute Transformation
⚫ Dimensionality Reduction
⚫ Feature subset selection
⚫ Feature creation
Aggregation
Sampling: With or without Replacement
Sampling: Cluster or Stratified Sampling
Sample Size
Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
Data reduction strategies
– Dimensionality reduction, e.g., remove unimportant attributes
  - Wavelet transforms
  - Principal Components Analysis (PCA)
  - Feature subset selection, feature creation
– Numerosity reduction (replace the original data volume by an alternative, smaller form of data representation)
  - Parametric, e.g., regression and log-linear models
  - Non-parametric, e.g., histograms, clustering, sampling, data cube aggregation
– Data compression: transformations are applied to obtain a reduced or compressed representation of the data
  - Types: lossy, lossless
Curse of Dimensionality
⚫ When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
⚫ Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
What Is Wavelet Transform?
After a wavelet transform, the transformed data can be truncated, which is helpful for data reduction. If we store only a small fraction of the strongest wavelet coefficients, a compressed approximation of the original data is obtained; for example, only the wavelet coefficients larger than some determined threshold can be retained.
Wavelet Transformation
Haar2 Daubechie4
Discrete wavelet transform (DWT) for linear signal processing,
multi-resolution analysis
Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
Method:
– Length, L, must be an integer power of 2 (padding with 0s when necessary)
– Each transform has 2 functions: smoothing, difference
– Applies to pairs of data, resulting in two sets of data of length L/2
– Applies the two functions recursively, until reaching the desired length
Wavelet Decomposition
Wavelet Decomposition & regeneration of Signal
Dimensionality Reduction: PCA
https://www.youtube.com/watch?v=ZtS6sQUAh0c
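A minimal PCA sketch with scikit-learn (assumed available); the data matrix here is randomly generated purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 objects described by 5 attributes

pca = PCA(n_components=2)                # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)         # project the data onto those components
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured by each component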
Feature Subset Selection
Discretization in Supervised Settings

Normalization
⚫ Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
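A short sketch of decimal scaling, together with two other common normalizations (min-max and z-score) that are included here for comparison and are not spelled out on the slide; the values are illustrative:

import numpy as np

values = np.array([200, -150, 730, 45, -980], dtype=float)

# Decimal scaling: smallest integer j such that max(|v|) / 10^j < 1.
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
decimal_scaled = values / (10 ** j)                                  # j = 3 here

min_max = (values - values.min()) / (values.max() - values.min())    # rescale to [0, 1]
z_score = (values - values.mean()) / values.std(ddof=1)              # zero mean, unit variance

print(decimal_scaled)   # [ 0.2   -0.15   0.73   0.045 -0.98 ]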
Data Mining Classification
Unit-2
Type of classifiers
⚫ Linear Vs Non-Linear
A linear classifier uses a linear separating hyperplane to discriminate
instances from different classes whereas a non-linear classifier enables
the construction of more complex, non-linear decision surface.
⚫ Global Vs Local
A global classifier fits a single model to the entire data set. In contrast,
a local classifier partitions the input space into smaller regions and fits a
distinct model to the training instances in each region.
⚫ Generative Vs Discriminative
Classifiers that learn a generative model of every class in the process of
predicting class labels are known as generative classifiers. In contrast,
discriminative classifiers directly predict the class labels without
explicitly describing the distribution of every class label.
Rule-Based Classifier
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Coverage and Accuracy of a Rule
⚫ Coverage of a rule: fraction of records that satisfy the antecedent of the rule
⚫ Accuracy of a rule: fraction of records covered by the rule whose class label equals the rule's consequent
⚫ Example: for the rule (Status=Single) → No on a data set of 10 records,
  Coverage = 4/10 = 0.4 = 40%
  Accuracy = 2/4 = 0.5 = 50%
How does Rule-based Classifier Work?
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
⚫ Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
– Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Rule Ordering Schemes
⚫ Rule-based ordering
– Individual rules are ranked based on their quality
⚫ Class-based ordering
– Rules that belong to the same class appear together
Building Classification Rules
⚫ Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte’s 1R
⚫ Indirect Method:
◆ Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
◆ Examples: C4.5rules
Direct Method: Sequential Covering
(Figure: rules R1, R2 are extracted one at a time; each new rule covers some of the examples left uncovered by the previous rules.)

Rule Growing
⚫ General-to-specific: start from the empty rule {} (Yes: 3, No: 4) and evaluate candidate conjuncts, e.g.
  Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1), Status=Divorced (Yes: 1, No: 0),
  Status=Married (Yes: 0, No: 3), Income>80K (Yes: 3, No: 1)
⚫ Specific-to-general: start from specific rules such as (Refund=No, Status=Single, Income=85K) → Yes
  and (Refund=No, Status=Single, Income=90K) → Yes, and generalize them to (Refund=No, Status=Single) → Yes
– FOIL's information gain, where rule R0 covers p0 positive and n0 negative examples and the extended rule R1 covers p1 and n1:
  Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]
⚫ Growing a rule:
– Start from empty rule
– Add conjuncts as long as they improve FOIL’s
information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced
error pruning
– Measure for pruning: v = (p-n)/(p+n)
◆ p: number of positive examples covered by the rule in
the validation set
◆ n: number of negative examples covered by the
rule in the validation set
– Pruning method: delete any final sequence of
conditions that maximizes v
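A small Python sketch of FOIL's information gain and the pruning measure v; the counts in the example call are illustrative, not taken from the slides:

import math

def foil_gain(p0, n0, p1, n1):
    """Gain when extending rule R0 (covers p0 pos, n0 neg) to R1 (covers p1 pos, n1 neg)."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def pruning_metric(p, n):
    """Incremental reduced-error pruning measure v = (p - n) / (p + n) on the validation set."""
    return (p - n) / (p + n)

print(round(foil_gain(p0=100, n0=400, p1=30, n1=10), 3))   # positive gain: the conjunct helps
print(pruning_metric(p=30, n=10))                          # 0.5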
Build a Rule Set (Indirect Method)
(Figure: a decision tree with root P and internal test nodes Q and R; each path from the root to a leaf (labelled + or -) is converted into one rule of the rule set.)
Example
Name Give Birth Lay Eggs Can Fly Live in Have Legs Class
Water
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
Advantages of Rule-Based Classifiers
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class     P                       N
P                                  True Positives (TP)     False Negatives (FN)
N                                  False Positives (FP)    True Negatives (TN)

◼ Accuracy (recognition rate) = (TP + TN) / All
◼ Sensitivity (recall, true positive rate) = TP / P
◼ Specificity (true negative rate) = TN / N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
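A short sketch of these metrics computed from confusion-matrix counts (the counts passed in are illustrative, not from the slides):

def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity/recall, specificity, precision and F1 from confusion-matrix counts."""
    p, n = tp + fn, fp + tn
    accuracy    = (tp + tn) / (p + n)
    sensitivity = tp / p                    # recall, true positive rate
    specificity = tn / n                    # true negative rate
    precision   = tp / (tp + fp)
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

print(evaluation_metrics(tp=90, fn=210, fp=140, tn=9560))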
Classifier Evaluation Metrics: Example
Confusion matrix for Multiclass
Evaluating Classifier Accuracy: Holdout
Holdout method
Evaluating Classifier Accuracy: Cross Validation
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Estimating Confidence Intervals: Null Hypothesis
Estimating Confidence Intervals: t-test
Estimating Confidence Intervals: Table for t -distribution
Symmetric
Significance level, e.g.,
sig = 0.05 or 5% means
M1 & M2 are
significantly different
for 95% of population
Confidence limit, z =
sig/2
Estimating Confidence Intervals: Statistical Significance
Numerical example: paired t-test

S.No   Pretest   Post test   Difference   Difference²
1        23         35          -12           144
2        25         40          -15           225
3        28         30           -2             4
4        30         35           -5            25
5        25         40          -15           225
6        25         45          -20           400
7        26         30           -4            16
8        25         30           -5            25
9        22         35          -13           169
10       30         40          -10           100
11       35         40           -5            25
12       40         35            5            25
13       35         38           -3             9
14       30         41          -11           121
Sum                             -115          1513

t = -115 / sqrt( (14×1513 − (−115)×(−115)) / 13 ) = −4.648
α = 0.05, critical value (df = 13, two-tailed) ≈ ±2.16
Since |t| = 4.648 > 2.16, the statistic lies in the rejection region, so we reject the null hypothesis: the pretest and post-test means differ significantly.
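The same paired t-test with SciPy (assumed available):

from scipy import stats

pretest  = [23, 25, 28, 30, 25, 25, 26, 25, 22, 30, 35, 40, 35, 30]
posttest = [35, 40, 30, 35, 40, 45, 30, 30, 35, 40, 40, 35, 38, 41]

t_stat, p_value = stats.ttest_rel(pretest, posttest)
print(round(t_stat, 4))    # about -4.648, matching the hand calculation
print(p_value < 0.05)      # True -> reject the null hypothesis at the 5% level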
Model Selection: ROC Curves
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Neural Network
Neural Networks are computational models that mimic the complex functions of
the human brain. The neural networks consist of interconnected nodes or neurons
that process and learn from data, enabling tasks such as pattern recognition and
decision making in machine learning.
Input Layer: This layer accepts input features. It provides information from the
outside world to the network, no computation is performed at this layer, nodes
here just pass on the information(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are
part of the abstraction provided by any neural network. The hidden layer performs
all sorts of computation on the features entered through the input layer and
transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network out to the
outer world.
Activation Function
❑ The activation function applies a non-linear transformation to the input, making the
network capable of learning and performing more complex tasks.
Variants of Activation Function
Sigmoid:
• Value Range : 0 to 1
•Input Features: The perceptron takes multiple input features; each input
feature represents a characteristic or attribute of the input data.
•Bias: A bias term is often included in the perceptron model. The bias allows
the model to make adjustments that are independent of the input. It is an
additional parameter that is learned during training.
Δ𝜔𝑖𝑗 = 𝜂𝛿𝑗 𝑂𝑖
Multilayer Perceptron Example
Δ𝜔𝑖𝑗 = 𝜂𝛿𝑗 𝑂𝑖
Multilayer Perceptron Example
Semi-supervised Learning
Active learning is an iterative type of supervised learning that is suitable for situations
where data are abundant, yet the class labels are scarce or expensive to obtain. The
learning algorithm is active in that it can purposefully query a user (e.g., a human
oracle) for labels. The number of tuples used to learn a concept this way is often much
smaller than the number required in typical supervised learning.
Ensemble Learning/ Classification combination method
bias, measures the average distance between the target position and the location where
the projectile hits the floor
variance, measures the deviation between x and the average position x̄ where the
projectile hits the floor.
noise component associated with variability in the target position.
Bias –Variance tradeoff –Machine learning
The bias is the difference between the values predicted by the machine learning model and the correct values. High bias gives a large error on both training and testing data. An algorithm should be low-biased to avoid the problem of underfitting. With high bias, the predictions follow an overly simple (e.g., straight-line) form that does not fit the data set accurately; such fitting is known as underfitting of the data.
Voting / Averaging approach
Boosting
Boosting is an ensemble modeling technique
that attempts to build a strong classifier
from the number of weak classifiers. It is
done by building a model by using weak
models in series. Firstly, a model is built
from the training data. Then the second
model is built which tries to correct the
errors present in the first model. This
procedure is continued, and models are
added until either the complete training data
set is predicted correctly, or the maximum
number of models is added.
Decision Tree
Decision Tree
Random Forest
Gradient Boosting
• Constructs a series of models
Models can be any predictive model
that has a differentiable loss function
Commonly, trees are the chosen model
• Boosting can be viewed as optimizing the
loss function by iterative functional
gradient descent.
• The predictions of the new model are
then added to the ensemble, and the
process is repeated until a stopping
criterion is met.
• Cross Entropy is used as loss function
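A minimal gradient-boosting sketch with scikit-learn (assumed available), run on a synthetic dataset purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit on the gradient of the loss of the current ensemble.
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))   # held-out accuracy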
XGB (Extreme Gradient Boosting)
At a basic level, the algorithm still follows a sequential strategy to improve the
next model based on gradient descent.
Difference between XGB and GBM
• Regularization is a technique in machine learning to avoid overfitting
• GBM tends to have a slower training time than the XGBoost because the latter
algorithm implements parallelization during the training process.
• XGBoost has its own in-built missing data handler, whereas GBM
doesn’t.
Stacked Generalization(Blending)
Class Imbalance Problem
If the numbers of positive and negative instances are approximately equal, the data set is known as a balanced data set.
In many data sets there is a disproportionate number of instances belonging to different classes, a property known as skew or class imbalance.
Example: Rare disease, Card fraud detection
• A correct classification of the rare class often has greater value than a correct classification
of the majority class.
Challenges
1. It can be difficult to find sufficiently many labelled samples of a rare class. A classifier
trained over an imbalanced data set shows a bias towards improving its performance over
the majority class, which is often not the desired behaviour.
2. Accuracy is not well-suited for evaluating models in the presence of class imbalance in the test data. We need to use alternative evaluation metrics that are sensitive to the skew and can capture different criteria of performance than accuracy.
Evaluating Performance with Class Imbalance
Evaluating Performance with Class Imbalance
Multi-class Problem
Approaches for extending the binary classifiers to handle multiclass problems
Multiclass problem
One Vs Rest
One Vs Rest
One Vs One
One Vs One
Unit- 3
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis
Basic Concepts: Frequent Patterns
◼ The downward closure property: any subset of a frequent itemset must also be frequent, e.g., if {beer, diaper, nuts} is frequent, so is {beer, diaper}
◼ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori: A Candidate Generation & Test Approach
Implementation of Apriori
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
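A compact, unoptimized Python sketch of this pseudocode; run on the TDB example that follows (min_support = 2) it produces the same L1, L2 and L3:

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}
    Lk = {c for c in {frozenset([i]) for t in transactions for i in t}
          if support(c) >= min_support}
    k = 1
    while Lk:
        frequent.update({c: support(c) for c in Lk})
        # Candidate generation: join Lk with itself to get (k+1)-itemsets,
        # then prune candidates that have an infrequent k-subset.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)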
8
Candidate Generation: An SQL Implementation
◼ SQL Implementation of candidate generation
◼ Suppose the items in Lk-1 are listed in an order
◼ Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
q.itemk-1
◼ Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (after 1st scan):
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (candidates, counted in the 2nd scan):
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2
Rule Generation
How to Count Supports of Candidates?
Brute –force approach
◼ Scan the database of transactions to determine the
support of each candidate itemset
◼ Must match every candidate itemset against every transaction,
which is an expensive operation
Support Counting using Enumeration
(Figure: the subsets of each transaction are enumerated in lexicographic ordering and matched against the candidate itemsets.)
Support Counting Using a Hash Tree
(Figure: candidate 3-itemsets {1,2,4}, {1,2,5}, {1,3,6}, {1,4,5}, {1,5,9}, {2,3,4}, {3,4,5}, {3,5,6}, {3,5,7}, {3,6,7}, {3,6,8}, {4,5,7}, {4,5,8}, {5,6,7}, {6,8,9} are stored in the leaves of a hash tree. At each level the hash function sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch.)

(Figure: to count supports for the transaction {1, 2, 3, 5, 6}, the transaction is split recursively (1 + {2,3,5,6}, 2 + {3,5,6}, 3 + {5,6}, then 1 2 + {3,5,6}, 1 3 + {5,6}, 1 5 + {6}, and so on), and each prefix is hashed down the tree so that the transaction is compared only against the candidates stored in the visited leaves.)
Improvement of the Apriori Method
◼ Hash-based techniques: hash itemsets into buckets; a k-itemset whose bucket count is below the support threshold cannot be frequent
◼ Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans and can be removed
◼ Partitioning: any itemset that is potentially frequent in the database must be frequent in at least one of the partitions of the database
◼ Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
◼ Sampling: mine on a random sample of the given data, at the cost of possibly missing some frequent itemsets
Closed Itemset and Frequent Itemset
◼ An itemset X is frequent if the support of X is no less than a min_support threshold
◼ An itemset X is closed if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X

Closed Patterns and Max-Patterns
◼ An itemset X is a max-pattern (maximal frequent itemset) if X is frequent and there exists no frequent super-itemset Y ⊃ X
◼ Closed patterns are a lossless compression of the frequent patterns (support information is preserved); max-patterns are a lossy compression
FP-Growth Algorithm
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. Building the candidate sets requires repeated scans of the database.
The Frequent Pattern Growth Mining Method
◼ Idea: frequent pattern growth, i.e., recursively grow frequent patterns by pattern and database partition
◼ Method
  ◼ For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  ◼ Repeat the process on each newly created conditional FP-tree
  ◼ Until the resulting FP-tree is empty, or it contains only a single path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
FP-Growth Algorithm
FP-Growth Algorithm
Step 4: Construct Trie Data Structure.
a) Inserting the set {K, E, M, O, Y} b) Inserting the set {K, E, O, Y} c) Inserting the set {K, E, M}
d) Inserting the set {K, M, Y} e) Inserting the set {K, E, O}
FP-Growth Algorithm
Step 5: Compute Conditional Pattern Base. It is path labels of all the paths
which lead to any node of the given item in the frequent-pattern tree.
Step6: Compute Conditional Frequent Pattern Tree. It is done by taking the set of
elements that is common in all the paths in the Conditional Pattern Base of that item and
calculating its support count by summing the support counts of all the paths in the
Conditional Pattern Base.
Advantages of the Pattern Growth Approach
◼ Divide-and-conquer:
◼ Decompose both the mining task and DB according to the
frequent patterns obtained so far
◼ Lead to focused search of smaller databases
◼ Other factors
◼ No candidate generation, no candidate test
◼ Compressed database: FP-tree structure
◼ No repeated scan of entire database
◼ Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
ECLAT: Mining by Exploring Vertical Data
Format
◼ The ECLAT algorithm stands for Equivalence Class
Clustering and bottom-up Lattice Traversal.
◼ Vertical format: t(AB) = {T11, T25, …}
◼ tid-list: list of trans.-ids containing an itemset
◼ Deriving frequent patterns based on vertical intersections
◼ t(X) = t(Y): X and Y always happen together
◼ t(X) ⊆ t(Y): transaction having X always has Y
◼ Using diffset to accelerate mining
◼ Only keep track of differences of tids
◼ t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
◼ Diffset (XY, X) = {T2}
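A small sketch of the vertical (tid-list) representation and intersection used by ECLAT, on an illustrative four-transaction database with minimum support 2:

from itertools import combinations

transactions = {1: {"A", "C", "D"}, 2: {"B", "C", "E"}, 3: {"A", "B", "C", "E"}, 4: {"B", "E"}}
min_support = 2

# Vertical format: item -> set of transaction ids (tid-list) containing it.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support of a 2-itemset = size of the intersection of the two tid-lists.
for a, b in combinations(sorted(tidlists), 2):
    common = tidlists[a] & tidlists[b]
    if len(common) >= min_support:
        print({a, b}, "support", len(common), "tids", sorted(common))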
ECLAT Algorithm
Consider the following transaction records.
The given data is a boolean matrix where, for each cell (i, j), the value denotes whether the j-th item is included in the i-th transaction: 1 means true while 0 means false.
minimum support = 2
K=1
Eclat algorithm
Which pattern are interesting: pattern evaluation method
Sifting through the patterns to identify the most interesting ones is not a trivial task because "one person's trash might be another person's treasure." It is therefore important to establish a set of well-accepted criteria for evaluating the quality of association patterns.
The first set of criteria can be established through statistical arguments. Patterns that involve a set of mutually independent items or cover very few transactions are considered uninteresting because they may capture spurious relationships in the data. Such patterns can be eliminated by applying an objective interestingness measure, which uses statistics derived from the data to determine whether a pattern is interesting. Examples of objective interestingness measures include support, confidence, and correlation.
Lift
◼ Lift(A → B) = P(A and B) / (P(A) × P(B)) = confidence(A → B) / support(B)
◼ If some rule had a lift of 1, it would imply that the probability of
occurrence of the antecedent and that of the consequent are
independent of each other. When two events are independent of each
other, no rule can be drawn involving those two events.
◼ If the lift is > 1, like it is here for Rules 1 and 2, that lets us know the
degree to which those two occurrences are dependent on one another,
and makes those rules potentially useful for predicting the consequent
in future data sets.
◼ If the lift is <1, then the occurrence of A is negatively correlated with
the occurrence of B, meaning that the occurrence of one likely leads to
absence of the other one.
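A tiny sketch of the lift calculation from (relative) supports; the numbers are illustrative:

def lift(support_ab, support_a, support_b):
    """lift(A -> B) = P(A and B) / (P(A) * P(B)); arguments are relative supports."""
    return support_ab / (support_a * support_b)

print(round(lift(support_ab=0.4, support_a=0.6, support_b=0.5), 3))   # 1.333 > 1: positively correlated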
Null transaction (m = milk, c = coffee)
A null-transaction is a transaction that does not contain any of the itemsets being examined.
Are lift and χ² Good Measures of Correlation?
◼ Over 20 interestingness measures have been proposed; which are the good ones?
Which Null-Invariant Measure Is Better?
◼ IR (Imbalance Ratio): measure the imbalance of two
itemsets A and B in rule implications
Handling Categorical Attributes
◼ Some attribute values may have very low support
  ◼ Potential solution: Aggregate the low-support attribute values
◼ Distribution of attribute values can be highly skewed
◼ Example: 85% of survey participants own a computer at home
◼ Most records have Computer at home = Yes
◼ Computation becomes expensive; many frequent itemsets involving
the binary item (Computer at home = Yes)
◼ Potential solution:
◼ discard the highly frequent items
◼ Computational Complexity
◼ Binarizing the data increases the number of items
◼ But the width of the “transactions” remain the same as the
number of original (non-binarized) attributes
◼ Produce more frequent itemsets but maximum size of frequent
itemset is limited to the number of original attributes
Handling Continuous Attributes
◼ Different methods:
◼ Discretization-based
◼ Statistics-based
◼ Non-discretization based
◼ minApriori
Discretization-based Methods

Continuous attribute, v     1     2     3     4     5     6     7     8     9
Chat Online = Yes           0     0    20    10    20     0     0     0     0
Chat Online = No          150   100     0     0     0   100   100   150   100

◼ Equal-width binning:  <1 2>  <3 4 5 6 7>  <8 9>
◼ Equal-depth binning
◼ Cluster-based
◼ Supervised discretization
◼ Execution time
◼ If the range is partitioned into k intervals, there are O(k2) new items
◼ If an interval [a,b) is frequent, then all intervals that subsume [a,b)
must also be frequent
◼ E.g.: if {Age ∈ [21,25), Chat Online=Yes} is frequent,
then {Age ∈ [10,50), Chat Online=Yes} is also frequent
◼ Improve efficiency:
◼ Use maximum support to avoid intervals that are too wide
Statistics-based Methods
◼ Example:
{Income > 100K, Online Banking=Yes} → Age: mean = 34
◼ Rule consequent consists of a continuous variable,
characterized by their statistics
◼ mean, median, standard deviation, etc.
◼ Approach:
◼ Withhold the target attribute from the rest of the data
◼ Extract frequent itemsets from the rest of the attributes
◼ Binarize the continuous attributes (except for the target attribute)
◼ For each frequent itemset, compute the corresponding descriptive
statistics of the target attribute
◼ Frequent itemset becomes a rule by introducing the target variable
as rule consequent
◼ Apply statistical test to determine interestingness of the rule
Statistics-based Methods
TID W1 W2 W3 W4 W5
D1 2 2 0 0 1
D2 0 0 1 2 2
D3 2 3 0 0 0
D4 0 0 1 0 1
D5 1 1 1 0 2
Example: W1 and W2 tend to appear together in the same document
Min-Apriori
◼ Data contains only continuous attributes of the same
“type”
◼ e.g., frequency of words in a document
TID W1 W2 W3 W4 W5
D1 2 2 0 0 1
D2 0 0 1 2 2
D3 2 3 0 0 0
D4 0 0 1 0 1
D5 1 1 1 0 2
◼ Potential solution:
◼ Convert into 0/1 matrix and then apply existing algorithms
◼ lose word frequency information
◼ Discretization does not apply as users want association among words
based on how frequently they co-occur, not if they occur with similar
frequencies
Min-Apriori
◼ Normalize the word vectors, e.g., using L1 norms (each word column sums to 1)
◼ Support of an itemset C: the sum over documents of the minimum normalized frequency of the words in C

TID    W1     W2     W3     W4     W5
D1    0.40   0.33   0.00   0.00   0.17
D2    0.00   0.00   0.33   1.00   0.33
D3    0.40   0.50   0.00   0.00   0.00
D4    0.00   0.00   0.33   0.00   0.17
D5    0.20   0.17   0.33   0.00   0.33

Example: Sup(W1, W2) = 0.33 + 0 + 0.4 + 0 + 0.17 = 0.9
Anti-monotone property of Support

TID    W1     W2     W3     W4     W5
D1    0.40   0.33   0.00   0.00   0.17
D2    0.00   0.00   0.33   1.00   0.33
D3    0.40   0.50   0.00   0.00   0.00
D4    0.00   0.00   0.33   0.00   0.17
D5    0.20   0.17   0.33   0.00   0.33

Example:
Sup(W1)         = 0.4 + 0 + 0.4 + 0 + 0.2 = 1
Sup(W1, W2)     = 0.33 + 0 + 0.4 + 0 + 0.17 = 0.9
Sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17
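A small sketch of this min-based support (each itemset's support is the sum over documents of the minimum normalized frequency of its words):

doc_term = {
    "D1": {"W1": 0.40, "W2": 0.33, "W3": 0.00, "W4": 0.00, "W5": 0.17},
    "D2": {"W1": 0.00, "W2": 0.00, "W3": 0.33, "W4": 1.00, "W5": 0.33},
    "D3": {"W1": 0.40, "W2": 0.50, "W3": 0.00, "W4": 0.00, "W5": 0.00},
    "D4": {"W1": 0.00, "W2": 0.00, "W3": 0.33, "W4": 0.00, "W5": 0.17},
    "D5": {"W1": 0.20, "W2": 0.17, "W3": 0.33, "W4": 0.00, "W5": 0.33},
}

def min_support(itemset):
    return sum(min(row[w] for w in itemset) for row in doc_term.values())

print(round(min_support({"W1"}), 2))              # 1.0
print(round(min_support({"W1", "W2"}), 2))        # 0.9
print(round(min_support({"W1", "W2", "W3"}), 2))  # 0.17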
Sequential Patterns: Examples of Sequences
◼ Sequence of different transactions by a customer
at an online store:
< {Digital Camera,iPad} {memory card} {headphone,iPad cover} >
(Figure: a sequence is an ordered list of elements (transactions); each element contains a collection of events (items).)
Sequence Data

Sequence Database:
Sequence ID   Timestamp   Events
A             10          2, 3, 5
A             20          6, 1
A             23          1
B             11          4, 5, 6
B             17          2
B             21          7, 8, 1, 2
B             28          1, 6
C             14          1, 8, 7

Sequence A: < {2, 3, 5} {6, 1} {1} >
Sequence B: < {4, 5, 6} {2} {7, 8, 1, 2} {1, 6} >
Sequence C: < {1, 8, 7} >
Sequence Data vs. Market-basket Data
Sequence Database: Market- basket Data
◼ Given:
  ◼ a database of sequences
  ◼ a user-specified minimum support threshold, minsup
◼ Task:
  ◼ Find all subsequences with support ≥ minsup
Sequential Pattern Mining: Example
◼ Candidate 2-subsequences:
<{i1, i2}>, <{i1, i3}>, …,
<{i1} {i1}>, <{i1} {i2}>, …, <{in} {in}>
◼ Candidate 3-subsequences:
<{i1, i2 , i3}>, <{i1, i2 , i4}>, …,
<{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
<{i1} {i1 , i2}>, <{i1} {i1 , i3}>, …,
<{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …
Extracting Sequential Patterns: Simple example
◼ Given 2 events: a, b
◼ Candidate 1-subsequences: <{a}>, <{b}>
◼ Candidate 2-subsequences:
  <{a} {a}>, <{a} {b}>, <{b} {a}>, <{b} {b}>, <{a, b}>
◼ Candidate 3-subsequences:
<{a} {a} {a}>, <{a} {a} {b}>, <{a} {b}
{a}>, <{a} {b} {b}>,
<{b} {b} {b}>, <{b} {b} {a}>, <{b} {a}
{b}>, <{b} {a} {a}>
Generalized Sequential Pattern (GSP)
◼ Step 1:
◼ Make the first pass over the sequence database D to yield all the 1-
element frequent sequences
◼ Step 2:
Repeat until no new frequent sequences are found
◼ Candidate Generation:
◼ Merge pairs of frequent subsequences found in the (k-1)th pass to generate
candidate sequences that contain k items
◼ Candidate Pruning:
◼ Prune candidate k-sequences that contain infrequent (k-1)-subsequences
◼ Support Counting:
◼ Make a new pass over the sequence database D to find the support for these
candidate sequences
◼ Candidate Elimination:
◼ Eliminate candidate k-sequences whose actual support is less than minsup
Candidate Generation
◼ Base case (k=2):
◼ Merging two frequent 1-sequences <{i1}> and <{i2}> will produce the
following candidate 2-sequences: <{i1} {i1}>, <{i1} {i2}>, <{i2} {i2}>,
<{i2} {i1}> and <{i1, i2}>. (Note: <{i1}> can be merged with itself to
produce: <{i1} {i1}>)
(Figure: candidate generation merges pairs of frequent 3-sequences into candidate 4-sequences, which are then pruned.)

Timing constraints, examples (xg = max-gap = 2, ng = min-gap = 0, ms = maximum span = 4):

Data sequence                                  Subsequence        Contain?
< {1} {2} {3} {4} {5} >                        < {1} {4} >        No
< {1} {2,3} {3,4} {4,5} >                      < {2} {3} {5} >    Yes
< {1,2} {3} {2,3} {3,4} {2,4} {4,5} >          < {1,2} {5} >      No
Mining Sequential Patterns with Timing Constraints
◼ Approach 1:
◼ Mine sequential patterns without timing
constraints
◼ Postprocess the discovered patterns
◼ Approach 2:
◼ Modify GSP to directly prune candidates that violate the timing constraints
◼ Example: xg = 2, ng = 0, ws = 1, ms = 5
<… {a c} … >,
<… {a} … {c}…> ( where time({c}) –
time({a}) ≤ ws)
<…{c} … {a} …> (where time({a}) –
time({c}) ≤ ws)
will contribute to the support count of
candidate pattern
Spade algorithm
https://www.youtube.com/watch?v=ny7Cn1Ttncc&
ab_channel=GRIETCSEPROJECTS
Unit 4: Cluster detection
Prepared by:
Dr. Nivedita Palia
What is Cluster Analysis?
■ Cluster: A collection of data objects
■ similar (or related) to one another within the same group
■ dissimilar (or unrelated) to the objects in other groups
Clustering for Data Understanding and
Applications
■ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
■ Information retrieval: document clustering
■ Land use: Identification of areas of similar land use in an earth
observation database
■ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
■ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
■ Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
■ Climate: understanding earth climate, find patterns of atmospheric
and ocean
■ Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
■ Summarization:
■ Preprocessing for regression, PCA, classification, and
association analysis
■ Compression:
■ Image processing: vector quantization
■ Finding K-nearest Neighbors
■ Localizing search to one or a small number of clusters
■ Outlier detection
■ Outliers are often viewed as those “far away” from any
cluster
Quality: What Is Good Clustering?
Measure the Quality of Clustering
■ Dissimilarity/Similarity metric
■ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
■ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
■ Weights should be associated with different variables
based on applications and data semantics
■ Quality of clustering:
■ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
■ It is hard to define “similar enough” or “good enough”
■ The answer is typically highly subjective
Considerations for Cluster Analysis
■ Partitioning criteria
■ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
■ Separation of clusters
■ Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
■ Similarity measure
■ Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
■ Clustering space
■ Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Requirements and Challenges
■ Scalability
■ Clustering all the data instead of only on samples
■ Ability to deal with different types of attributes
■ Numerical, binary, categorical, ordinal, linked, and mixture of
these
■ Constraint-based clustering
■ User may give inputs on constraints
■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
■ Discovery of clusters with arbitrary shape
■ Ability to deal with noisy data
■ Incremental clustering and insensitivity to input order
■ High dimensionality
Major Clustering Approaches
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects)
using some criterion
■ Agglomerative approach(bottom-up) or divisive approach(top-down)
■ Typical methods: Diana, Agnes, BIRCH, CAMELEON
■ Density-based approach:
■ Based on connectivity and density functions
■ Typical methods: DBSCAN, OPTICS, DenClue
■ Grid-based approach:
■ based on a multiple-level granularity structure
■ Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
The K-Means Clustering Method
An Example of K-Means Clustering
K=2
K-means Numerical
•The new cluster center is computed by taking mean of all the points contained in that cluster.
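A minimal k-means sketch with scikit-learn (assumed available) on illustrative 2-D points; each new cluster center is the mean of the points assigned to it, as described above:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # each center is the mean of its cluster's points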
Variations of the K-Means Method
■ Most of the variants of the k-means which differ in
■ Selection of the initial k means
■ Dissimilarity calculations
■ Strategies to calculate cluster means
■ Handling categorical data: k-modes
Hierarchical Clustering
■ Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
(Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) traverses the same hierarchy in reverse, from Step 4 back to Step 0.)
AGNES (Agglomerative Nesting)
■ Introduced in Kaufmann and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster
Dendrogram: Shows How Clusters are Merged
DIANA (Divisive Analysis)
Distance between Clusters
■ Single link: smallest distance between an element in one cluster and an element in the other
■ Complete link: largest distance between an element in one cluster and an element in the other
■ Average: average distance between an element in one cluster and an element in the other
■ Centroid: distance between the centroids of two clusters
■ Medoid: distance between the medoids of two clusters
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
■ Centroid: the "middle" of a cluster (the mean of its points)
■ Radius: square root of the average squared distance from any point of the cluster to its centroid
■ Diameter: square root of the average mean squared distance between all pairs of points in the cluster
Extensions to Hierarchical Clustering
■ Major weaknesses of agglomerative clustering methods
  ■ Can never undo what was done previously
  ■ Do not scale well: time complexity of at least O(n²), where n is the number of objects
Density-Based Clustering: Basic Concepts
■ Two parameters:
■ Eps: Maximum radius of the neighbourhood
■ MinPts: Minimum number of points in an
Eps-neighbourhood of that point
■ NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
■ Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  ■ p belongs to NEps(q)
  ■ core point condition: |NEps(q)| ≥ MinPts
(Figure: q is a core point with MinPts = 5 and Eps = 1 cm; p lies within its Eps-neighbourhood.)
Density-Reachable and Density-Connected
■ Density-reachable:
  ■ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
■ Density-connected:
  ■ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
■ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases
with noise
(Figure: core, border, and outlier (noise) points, for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
■ Arbitrary select a point p
■ Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
■ If p is a core point, a cluster is formed
■ If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
■ Continue the process until all of the points have been
processed
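A minimal DBSCAN sketch with scikit-learn (assumed available); eps and min_samples play the roles of Eps and MinPts, and the data are generated purely for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense_blob = rng.normal(loc=0.0, scale=0.3, size=(50, 2))   # one dense cluster
outliers = np.array([[5.0, 5.0], [-5.0, 4.0]])              # far-away noise points
X = np.vstack([dense_blob, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))    # cluster ids; the label -1 marks noise/outlier points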
DBSCAN: Sensitive to Parameters
Assessing Clustering Tendency
■ Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
■ Test spatial randomness by statistic test: Hopkins Static
■ Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in the
data space
■ Sample n points, p1, …, pn, uniformly from D. For each pi, find its
nearest neighbor in D: xi = min{dist (pi, v)} where v in D
■ Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D and
v ≠ qi
■ Calculate the Hopkins Statistic: H = Σ yi / (Σ xi + Σ yi)
■ If D were uniformly distributed, Σ xi and Σ yi would be close to each other and H would be about 0.5; if D were highly skewed (clustered), H would be close to 0

Determining the Number of Clusters
■ Use the sum of squared distance between all points in the test set and the closest centroids to measure how well the model fits the test set
■ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the number of clusters that fits the data the best
Measuring Clustering Quality
Measuring Clustering Quality: Extrinsic Methods
■ Compare the clustering against the ground truth (known groups of objects)
■ Need to have the background knowledge on the relationship between objects and groups

Outlier Detection: Supervised Methods
■ Model normal objects and report those not matching the model as outliers, or
■ Model outliers and treat those not matching the model as normal
■ Challenges
  ■ Imbalanced classes, i.e., outliers are rare: Boost the outlier class
Outlier Detection III: Semi-Supervised Methods
■ Situation: In many applications, the number of labeled data is often
small: Labels could be on outliers only, normal objects only, or both
■ Semi-supervised outlier detection: Regarded as applications of
semi-supervised learning
■ If some labeled normal objects are available
■ Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
■ Those not fitting the model of normal objects are detected as outliers
■ If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
■ To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
Outlier Detection (1): Statistical Methods
■ Statistical methods (also known as model-based methods) assume that
the normal data follow some statistical model (a stochastic model)
■ The data not following the model are outliers.
■ Example (right figure): First use Gaussian distribution to
model the normal data
■ For each object y in region R, estimate gD(y), the
probability of y fits the Gaussian distribution
■ If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier
■ Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
■ There are rich alternatives to use various statistical models
■ E.g., parametric vs. non-parametric
Outlier Detection (2): Proximity-Based Methods
■ An object is an outlier if the nearest neighbors of the object are far
away, i.e., the proximity of the object deviates significantly from
the proximity of most of the other objects in the same data set
■ Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
■ Objects in region R are substantially different
from other objects in the data set.
■ Thus the objects in R are outliers
■ The effectiveness of proximity-based methods highly relies on the
proximity measure.
■ In some applications, proximity or distance measures cannot be
obtained easily.
■ Often have a difficulty in finding a group of outliers which stay close to
each other
■ Two major types of proximity-based outlier detection
■ Distance-based vs. density-based
Outlier Detection (3): Clustering-Based Methods
■ Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
■ Example (right figure): two clusters
■ All points not in R form a large cluster
■ The two points in R form a tiny cluster,
thus are outliers
■ Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
■ Clustering is expensive: straightforward adaption of a
clustering method for outlier detection can be costly and
does not scale up well for large data sets
Avoiding False Discoveries
■ Statistical Background
■ Significance Testing
■ Hypothesis Testing
■ Data scientists need to help ensure that results of data analysis are
not false discoveries, i.e., not meaningful or reproducible
Statistical Testing
■ Statistical approaches are used to help avoid many of
these problems
■ Statistics has well-developed procedures for
evaluating the results of data analysis
■ Significance testing
■ Hypothesis testing
■ Examples:
k P(R= k)
0 0.001
1 0.01
2 0.044
3 0.117
4 0.205
5 0.246
6 0.205
7 0.117
8 0.044
9 0.01
10 0.001
Probability and Distributions
Gaussian Distribution
Statistical Testing
Examples of Null Hypotheses
■ A coin or a die is a fair coin or die.
■ Effect size measures the magnitude of the effect
or characteristic being evaluated, and is often the
magnitude of the test statistic.
■ Brings in domain considerations
SOM Clusters of LA Times Document Data
Issues with SOM
Comparison of DBSCAN and K-means