
Data Mining and Business Intelligence

Overview
Introduction – Technologies – Applications

By
Dr. Nora Shoaip
Lecture 1

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Introduction
• Why Data Mining
• What is Data Mining
• Data Mining Applications
• Categories of Mining Techniques
Why Data Mining?
 The era of Explosive Growth of Data: in the petabytes!
 Automated data collection and availability: tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, transactions, stocks, …
 Science: Remote sensing, bioinformatics, …
 Society and everyone: news, digital cameras, social feeds
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally accessible
 Linear growth of data management tasks with data volumes
 Massive data volumes, but still little insight!
 Solution! Data mining—The automated analysis of massive data sets

3
What is Data Mining?
Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data
o Data mining: a misnomer?
• Alternative names
o Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, information
harvesting, business intelligence, etc.
• Is everything “data mining”?
o Simple search and query processing
o (Deductive) expert systems

4
Knowledge Discovery Process
Selection:
• Finding data relevant to
the task
Preprocessing:
• Cleaning and putting data in a format suitable for mining
Transformation
• Performing summaries,
aggregations or
consolidation
Data Mining
• Applying the data
mining algorithms to
extract knowledge
Evaluation
• Locating useful
knowledge

5
What Kinds of Data Can Be Mined?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web

6
Data Mining Applications

 Understanding Customer Behavior


o Market basket analysis.
o Market segmentation.
o Targeted Advertisement.
o Market Forecasting.
o Recommender Systems

7
Data Mining Applications cont…

 Social Networks Mining


o Community detection.
o Friends recommendation.
o Trend Analysis.
o Event detection.
o Personality prediction.

8
Data Mining Applications cont…

 Web and Text Mining


o Web usage mining.
o Web structure mining.
o Search engines.
o Email categorization.
o Fact checking.

9
Categories of Mining Techniques

 Descriptive Data Mining.

 Predictive Data Mining.

10
Frequent Patterns Mining

 Descriptive data mining technique.


 Finds commonly occurring patterns in data.
What items are frequently purchased together in your supermarket basket?
 Applied on:
 Transactional data (Market Basket Analysis)
 Sequential Data
 Graph Data

11
Clustering

 Descriptive data mining.


 Unsupervised learning.
 Divide data into groups.
 Applications:
 Market segmentation
 community detection.

12
Classification
 Predictive data mining.
 Supervised learning.
Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
 e.g., classify countries based on (climate),
 or classify cars based on (gas mileage)
 Predict some unknown class labels

Typical methods
 Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification, …

13
Know Your Data
• Data Objects & Attribute Types
• Basic Statistical Descriptions of
Data
Objects and Attributes

 A data object represents an entity


 Also called sample, example, instance, data point, or object (in a DB: data tuple)
 e.g. customers, students, patients, books
 An attribute is a data field, representing a characteristic or feature of a data
object
 The terms attribute (DB and DM), dimension (DWs), feature (ML), and variable (Statistics) are used interchangeably
 e.g. name, age, salary, gender, grade, …
 Attribute (feature) vector  A set of attributes that describe an object

15
Attribute Types: Nominal Attributes

 Symbol or names of things


 Each value represents category, code, or state
 also referred to as categorical
 e.g. hair color, marital status, customer ID
 Possible to be represented as numbers (coding)
 Qualitative

16
Attribute Types: Binary Attributes
 Nominal with only two values representing two states or
categories: 0 or 1 (absent or present)
 Also Boolean (true or false)
 Qualitative
 Symmetric: both states are equally valuable and have the same
weight
 e.g. gender
 Asymmetric: states are not equally important
 e.g. medical test outcomes

17
Attribute Types: Ordinal Attributes

 Qualitative
 Values have a meaningful order or ranking, but magnitude
between successive values is not known
 e.g. professional rank, grade, customer satisfaction
 Useful for data reduction of numerical attributes

18
Attribute Types: Numeric Attributes

 Quantitative
 Interval-scaled: measured on a scale of equal-size units
 e.g. temperature, year
 Do not have a true zero point
 Not possible to be expressed as multiples
 Ratio-scaled: have a true zero point
 A value can be expressed as a multiple of another
 e.g. years of experience, weight, salary

19
Discrete vs. Continuous Attributes

 Discrete Attribute: has a finite or countably infinite set of


values, integers or otherwise
 e.g. hair color, smoker, medical test, Customer_ID
 Customer_ID is countably infinite  infinite values but
one-to-one correspondence with natural numbers
 If an attribute is not discrete, it’s continuous
 e.g. height, weight, age

20
Outline
 Data Objects & Attribute Types
• What is an Object?
• What is an Attribute?
• Attribute Types
• Continuous vs. Discrete
 Basic Statistical Descriptions of Data
• Measuring central tendency
• Measuring Data dispersion
• Basic Graphic displays
 Measuring Data similarity & dissimilarity
• Data matrix & dissimilarity matrix
• Proximity Measures for (Nominal, Binary) attributes
• Dissimilarity of Numerical Data

21
Measuring Central Tendency

22
Measuring Central Tendency

 Median: middle value in set of ordered values


 N is odd  median is middle value of ordered set
 N is even  median is not unique  average of two middlemost
values
 Expensive to compute for large # of observations
 Mode: value that occurs most frequently in the attribute values
 Works for both qualitative and quantitative attributes
 Data can be unimodal, bimodal, or trimodal – no mode?

23
Measuring Central Tendency
Example
 Salary (in thousands of dollars), shown in increasing
order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
 Mean = ?
 Median = ?
 Mode = ?

24
Measuring Central
Tendency
Example
Salary (in thousands of dollars), shown in
increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110

• Mean = 58,000
• Median = 54,000
• Mode = 52,000 and 70,000 – bimodal

25
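These values are easy to verify programmatically. A minimal sketch using only Python's standard library (statistics.multimode needs Python 3.8 or later):

```python
from statistics import mean, median, multimode

# Salaries in thousands of dollars, already sorted
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))       # 58   -> mean salary of 58,000
print(median(salaries))     # 54.0 -> average of the two middle values (52 and 56)
print(multimode(salaries))  # [52, 70] -> the data are bimodal
```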
Measuring dispersion of Data

26
Measuring dispersion
of Data

27
Measuring dispersion
of Data
 Five-Number Summary:
 Median (Q2), quartiles Q1 and Q3, &
smallest and largest individual
observations – in order
 Boxplots: visualization technique for the five-
number summary
 Whiskers terminate at min & max OR the
most extreme observations within
1.5 × IQR of the quartiles – with
remainder points (outliers) plotted
individually
28
Ex:
Suppose that a hospital tested the age and body fat data for 18
randomly selected adults with the following results:
• Calculate the mean, median, and standard deviation of age and %fat.
• Draw the boxplots for age and %fat.
• Calculate the correlation coefficient. Are these two attributes positively or negatively correlated? Compute their covariance.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

29
Solution

Age  23   23   27   27   39   41   47   49   50   52   54   54   56   57   58   58   60   61
%fat 9.5  26.5 7.8  17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7

The mean, median, and standard deviation of each attribute are computed in the sketch following this example.

Boxplots for age and %fat:

For Age
 Q1 = 39, median = 51, Q3 = 57, min = 23, max = 61
 IQR = 57 - 39 = 18, 1.5 × IQR = 27
 Whisker limits: 39 - 27 = 12 and 57 + 27 = 84

For %fat (sorted: 7.8, 9.5, 17.8, 25.9, 26.5, 27.2, 27.4, 28.8, 30.2, 31.2, 31.4, 32.9, 33.4, 34.1, 34.6, 35.7, 41.2, 42.5)
 Q1 = 26.5, median = 30.7, Q3 = 34.1, min = 7.8, max = 42.5
 IQR = 34.1 - 26.5 = 7.6, 1.5 × IQR = 11.4
 Whisker limits: 26.5 - 11.4 = 15.1 and 34.1 + 11.4 = 45.5

30-31
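The quantities asked for in the exercise can be checked with NumPy. This is only a sketch: NumPy's default quartile interpolation differs slightly from the convention used on the slide (which takes Q1 and Q3 directly from the data values), and np.std / np.cov are used here with the population (1/N) formulas.

```python
import numpy as np

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61])
fat = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

for name, x in [("age", age), ("%fat", fat)]:
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    print(f"{name}: mean={x.mean():.2f} median={np.median(x)} std={x.std():.2f} "
          f"Q1={q1} Q3={q3} whiskers=[{q1 - 1.5 * iqr:.1f}, {q3 + 1.5 * iqr:.1f}]")

# Correlation coefficient (~0.82 > 0, so age and %fat are positively correlated)
print(np.corrcoef(age, fat)[0, 1])
# Covariance using the population formula (bias=True); also positive
print(np.cov(age, fat, bias=True)[0, 1])
```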
Visual Representations
of Data Distributions
 Histograms
 Scatter Plots: each pair of values is treated
as a pair of coordinates and plotted as
points in plane
 X and Y are correlated if one attribute
implies the other
 positive, negative, or null
(uncorrelated)
 For more attributes, we use a scatter
plot matrix
32
Visual Representations
of Data Distributions

Uncorrelated data

33
Data Mining and Business Intelligence

Know Your Data & Preprocessing
similarity – data quality

By
Dr. Nora Shoaip
Lecture 2

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Measuring Data similarity &
dissimilarity
• Data matrix & dissimilarity matrix
• Proximity Measures for (Nominal, Binary) attributes
• Dissimilarity of Numerical Data
Measuring Data similarity & dissimilarity

25
Data matrix & dissimilarity matrix

21
Proximity Measures for Nominal
attributes

21
Proximity Measures for Binary
attributes

21
Proximity Measures for Binary
attributes- Example

21
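The formulas behind this example did not survive the slide export, so the sketch below only illustrates the standard contingency-count approach for binary attributes; the attribute meanings and vectors are made up for illustration.

```python
def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two 0/1 vectors from the 2x2 contingency counts:
       q: x=1,y=1   r: x=1,y=0   s: x=0,y=1   t: x=0,y=0
       symmetric binary:  d = (r + s) / (q + r + s + t)
       asymmetric binary: d = (r + s) / (q + r + s)   (0/0 matches are ignored)"""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return (r + s) / ((q + r + s) if asymmetric else (q + r + s + t))

# Hypothetical patients described by six asymmetric binary attributes
# (1 = positive test result / symptom present, 0 = absent)
patient_1 = [1, 0, 1, 0, 0, 0]
patient_2 = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(patient_1, patient_2, asymmetric=True))  # 1/3 ~ 0.33
```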
Dissimilarity of Numerical Data

21
Dissimilarity of Numerical Data
Minkowski Distance

21
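The Minkowski distance formula itself was lost with the slide graphics; as a reference, for two numeric vectors x and y it is d(x, y) = (Σ|x_i − y_i|^h)^(1/h), which reduces to the Manhattan distance for h = 1 and the Euclidean distance for h = 2. A small sketch (the data points are illustrative):

```python
import numpy as np

def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

p1, p2 = [1, 2], [3, 5]
print(minkowski(p1, p2, 1))   # Manhattan: |1-3| + |2-5| = 5
print(minkowski(p1, p2, 2))   # Euclidean: sqrt(4 + 9) ~ 3.61
```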
Why preprocess data?
Major tasks
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Overview

Databases are highly susceptible to noisy, missing, and inconsistent data.
Low-quality data will lead to low-quality mining results.

“How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”

12
Why Preprocess Data?

To satisfy the requirements of the intended use


 Factors of data quality:
◦Accuracy  may be lacking due to faulty instruments; human, computer, or transmission errors; deliberate errors …
◦Completeness  may be lacking due to different design phases, optional attributes
◦Consistency  may be lacking due to differing semantics, data types, field formats …
◦Timeliness
◦Believability  how much the data are trusted by users
◦Interpretability  how easily the data are understood

13
Major Preprocessing Tasks
That Improve Quality of Data

 Data cleaning  filling in missing values, smoothing noisy data, identifying or


removing outliers, and resolving inconsistencies
 Data integration  include data from multiple sources in your analysis, map
semantic concepts, infer attributes …
 Data reduction  obtain a reduced representation of the data set that is much
smaller in volume, while producing almost the same analytical results
 Discretization  raw data values for attributes are replaced by ranges or higher
conceptual levels
 Data transformation  normalization
14
Data Cleaning
 Data in the Real World Is Dirty!
◦incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
e.g., Occupation=“ ” (missing data)
◦noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
◦inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
◦Intentional  Jan. 1 as everyone’s birthday?
15
Data Cleaning

… fill in missing values, smooth out noise while identifying outliers,


and correct inconsistencies in the data

A missing value may not imply an error in the data!


◦e.g. driver’s license number

16
Data Cleaning
Missing Values

 Ignore the tuple  not very effective, unless the tuple contains
several attributes with missing values
 Fill in the missing value manually  time consuming, not
feasible for large data sets
 Use a global constant  replace all missing attribute values by
same value (e.g. unknown)
 may mistakenly think that “unknown” is an interesting concept

17
Data Cleaning
Missing Values

 Use mean or median  For normal (symmetric) data


distributions, the mean is used, while skewed data distribution
should employ the median
 Use mean or median for all samples belonging to the same
class as the given tuple  e.g. mean or median of customers in
a certain age group
 Use the most probable value  using regression, inference-
based tools such as Bayesian formula or decision tree
 Most popular

18
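A small pandas sketch of the imputation strategies above; the customer data and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age_group": ["youth", "youth", "adult", "adult", "adult"],
    "income":    [30.0,    None,    55.0,    60.0,    None],
})

# Global mean / median imputation
df["income_mean"] = df["income"].fillna(df["income"].mean())
df["income_median"] = df["income"].fillna(df["income"].median())

# Mean of all samples belonging to the same class (here, the same age group)
df["income_by_group"] = df["income"].fillna(
    df.groupby("age_group")["income"].transform("mean"))
print(df)
```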
Data Cleaning
Noisy Data

Noise is a random error or variance in a measured


variable

Data smoothing techniques:


1. Binning
2. Regression
3. Outlier Analysis

19
Data Cleaning
Noisy Data

1. Binning  smooth a sorted data value by consulting its


“neighborhood”
◦sorted values are partitioned into a # of “buckets,” or bins  local
smoothing
◦equal-frequency bins  each bin has same # of values
◦equal-width bins  interval range of values per bin is constant
 Smoothing by bin means  each bin value is replaced by the bin mean
 Smoothing by bin medians  each bin value is replaced by the bin median
 Smoothing by bin boundaries  each bin value is replaced by the closest
boundary value (min & max in a bin are bin boundaries)
20
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
21
Data Cleaning
Noisy Data

Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-width) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
22
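A sketch reproducing the equal-frequency example above; the tie-breaking rule in smoothing by boundaries (prefer the lower boundary when a value is equidistant) is an assumption that happens to match the slide's numbers.

```python
import numpy as np

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (roughly) equal size."""
    return [list(chunk) for chunk in np.array_split(values, n_bins)]

def smooth_by_means(bins):
    return [[float(np.mean(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, ...], [29.0, ...]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```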
Data Cleaning
Noisy Data

2. Regression  Conform data values to a function


◦Linear regression  find “best” line to fit two attributes so that one
attribute can be used to predict the other
3. Outlier Analysis
 Potter’s Wheel  Automated interactive data
cleaning tool

23
Data Mining and Business Intelligence

Data Pre-processing
Integration – Reduction – Transformation

By
Dr. Nora Shoaip
Lecture 3

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Data Integration
• Entity Identification Problem
• Redundancy and correlation analysis
• Tuple duplication
Data Integration

Merging data from multiple data stores


Helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
 Semantic heterogeneity  entity identification problem
 Structure of data  functional dependencies and referential constraints
 Redundancy

3
Data Integration
Entity Identification Problem

 Schema integration and object matching


 Metadata  name, meaning, data type, and range of values
permitted, null rules for handling blank, zero, or null values

 can help avoid errors in schema integration and data


transformation

4
Data Integration
Redundancy and Correlation Analysis

5
Data Integration
Redundancy and Correlation Analysis

Example: contingency table of preferred reading vs. gender

                               gender
                        male      female     Total
Preferred   Fiction      250         200       450
reading     Non-fiction   50        1000      1050
            Total        300        1200      1500

6
Data Integration
Redundancy and Correlation Analysis

Observed counts with expected frequencies shown in parentheses:

                               gender
                        male          female         Total
Preferred   Fiction      250 (90)      200 (360)       450
reading     Non-fiction   50 (210)    1000 (840)      1050
            Total        300          1200            1500

7-8
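The χ² (chi-square) computation the slides apply to this table can be reproduced directly from the observed and expected counts; the 10.828 threshold quoted in the comment is the standard χ² critical value for 1 degree of freedom at the 0.001 significance level.

```python
import numpy as np

# Observed counts (rows: fiction / non-fiction, columns: male / female)
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # [450, 1050]
col_totals = observed.sum(axis=0, keepdims=True)   # [300, 1200]
n = observed.sum()                                 # 1500

expected = row_totals @ col_totals / n             # [[90, 360], [210, 840]]
chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(chi2)   # ~507.93 >> 10.828 (chi-square threshold for 1 degree of freedom at
              # the 0.001 level), so preferred reading and gender are strongly correlated
```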
Data Integration
Redundancy and Correlation Analysis

9
Data Integration
Redundancy and Correlation Analysis

10
Data Integration
Redundancy and Correlation Analysis

11
Data Integration
Redundancy and Correlation Analysis

Time point   AllElectronics   HighTech
T1                 6              20
T2                 5              10
T3                 4              14
T4                 3               5
T5                 2               5

12
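A sketch of the covariance and correlation computation for these two stock series, using the population (1/n) covariance formula E[AB] − E[A]E[B]:

```python
import numpy as np

all_electronics = np.array([6, 5, 4, 3, 2], dtype=float)
high_tech = np.array([20, 10, 14, 5, 5], dtype=float)

cov = (all_electronics * high_tech).mean() - all_electronics.mean() * high_tech.mean()
print(cov)                                             # 7.0 -> the prices rise together
print(np.corrcoef(all_electronics, high_tech)[0, 1])   # ~0.87, positively correlated
```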
Data Integration
More Issues

Tuple duplication
The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy.
e.g. purchaser name and address, and purchases
Data value conflict
e.g. grading system in two different institutes  A, B, … versus 90%,
80% …

13
Data Reduction
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling
Data Reduction
Strategies

 Dimensionality reduction  reduce number of attributes


◦Wavelet transforms, PCA, Attribute subset selection
 Numerosity reduction  replace original data volume by smaller data representation
◦Parametric  a model is used to estimate the data - only the data parameters are
stored
Regression
◦Nonparametric  store reduced representations of the data
Histograms, clustering, sampling
 Compression  transformations applied to obtain a “compressed” representation of
original data
◦Lossless, Lossy

15
Data Reduction
Attribute Subset Selection

 find a min set of attributes such that the resulting probability distribution of data is as
close as possible to the original distribution using all attributes
 An exhaustive search can be prohibitively expensive
 Heuristic (Greedy) search
◦Stepwise forward selection: start with empty set of attributes as reduced set. The best of the
attributes is determined and added to the reduced set. At each subsequent iteration, the best of
the remaining attributes is added to the set
◦Stepwise backward elimination: start with the full set of attributes. At each step, remove the
worst attribute remaining in the set
◦Combination of forward selection and backward elimination
◦Decision tree induction
 Attribute construction  e.g. area attribute based on height and width attributes
16
Data Reduction
Attribute Subset
Selection

17
Data Reduction- Numerosity reduction
Regression

 Data is modeled to fit a straight line


 A random variable y (response variable), can be modeled
as a linear function of another random variable x
(predictor variable)
Regression line equation  y = wx + b
 w and b are regression coefficients  they specify the
slope of the line and y-intercept
 Solved for by the method of least squares  minimizes the error between the actual data points and the estimated (best-fitting) line

18
Data Reduction
Regression

X Y

1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25

19
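The least-squares coefficients for this small data set can be computed directly from the formulas above; np.polyfit gives the same best-fitting line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.00, 2.00, 1.30, 3.75, 2.25])

# Least-squares slope w and intercept b for y = w*x + b
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
print(w, b)                 # 0.425 0.785

print(np.polyfit(x, y, 1))  # [0.425 0.785] -- same fit
```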
Data Reduction
Histograms

 A histogram for an attribute, A, partitions the data distribution of A into disjoint


subsets, referred to as buckets or bins.

 buckets holding a single attribute–value/frequency pair are called singleton buckets

 Often, buckets represent continuous ranges for the given attribute.

 Equal-width: the width of each bucket range is uniform (e.g., the width of $10 for the
buckets).

 Equal-frequency (or equal-depth): roughly, the frequency of each bucket is constant


(i.e., each bucket contains roughly the same number of contiguous data samples).

21
Data Reduction
Histograms

The following data are a list of AllElectronics


prices for commonly sold items (rounded to the
nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15,
18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30.

22
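A minimal sketch that groups these prices into the equal-width buckets of $10 mentioned above (1–10, 11–20, 21–30) and counts the items per bucket:

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

buckets = Counter((p - 1) // 10 for p in prices)   # bucket 0: 1-10, 1: 11-20, ...
for b in sorted(buckets):
    print(f"${b * 10 + 1}-{(b + 1) * 10}: {buckets[b]} items")
# $1-10: 13 items, $11-20: 25 items, $21-30: 14 items
```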
Data Reduction
Sampling

 A large data set represented by a smaller random data sample


 Simple random sample without replacement (SRSWOR) of size s  draw s of the N
tuples (s < N)
◦all tuples are equally likely to be sampled
 Simple random sample with replacement (SRSWR) of size s  similar to SRSWOR,
but each time a tuple is drawn, it’s recorded then placed back so it may be drawn again
 Cluster sample  If tuples are grouped into M “clusters,” an SRS of s clusters can be
obtained
 Stratified sample  If tuples are divided into strata, a stratified sample is generated by
obtaining an SRS at each stratum
◦e.g. stratum is created for each customer age group
23
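A sketch of the simplest sampling schemes using Python's random module; the data set and strata are made up for illustration.

```python
import random

data = list(range(1, 101))    # illustrative data set of N = 100 tuples
s = 10

srswor = random.sample(data, s)                    # SRS without replacement
srswr = [random.choice(data) for _ in range(s)]    # SRS with replacement

# Stratified sample: an SRS drawn within each stratum (here, two age groups)
strata = {"youth": list(range(1, 41)), "senior": list(range(41, 101))}
stratified = {name: random.sample(tuples, 5) for name, tuples in strata.items()}
print(srswor, srswr, stratified, sep="\n")
```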
Data Reduction
Sampling

24
Transformation and Discretization
Transformation Strategies

 Smoothing  binning, regression

 Attribute construction

 Aggregation

 Normalization  attribute values scaled so as to fall within a smaller, common range (e.g. 0.0 to 1.0)

 Discretization  raw values of a numeric attribute (e.g. age) replaced by interval labels (e.g. 0–10, 11–20) or conceptual labels (e.g., youth, adult, senior)

 Concept hierarchy  e.g. street generalized to higher-level concepts (city or country)

25
Transformation and Discretization
Transformation by Normalization

To help avoid dependence on the choice of measurement units


Give all attributes equal weight
Methods:
min-max normalization
z-score normalization

26
Transformation and Discretization
Transformation by Normalization

27
Transformation and Discretization
Transformation by Normalization

28
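The two normalization methods named above (whose formulas were lost with the slide graphics) can be sketched as follows; the income values are illustrative.

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min(x), max(x)] linearly onto [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score normalization: (value - mean) / standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

income = [12000, 16000, 54000, 73600, 98000]
print(min_max(income))   # e.g. 73,600 maps to ~0.716 when 12,000 -> 0 and 98,000 -> 1
print(z_score(income))
```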
Transformation and Discretization
Concept Hierarchy

 Concept hierarchy organizes concepts (i.e., attribute values) hierarchically


 Concept hierarchies facilitate drilling and rolling to view data in multiple
granularity
 Concept hierarchy formation: Recursively reduce data by collecting and
replacing low level concepts (e.g. age values) by higher level concepts (e.g.
age groups: youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
 Concept hierarchy can be automatically formed for both numeric and nominal
data  discretization
29
Transformation and Discretization
Concept Hierarchy

For nominal data:


Specification of a partial ordering of attributes explicitly at the schema level by
users or experts
street, city, province or state, country  street < city < province or state < country
Specification of a set of attributes, but not of their partial ordering  order
automatically generated by system
e.g. Location  country contains a smaller #distinct values than street
automatically generate concept hierarchy based on # distinct values per attribute in the
given attribute set
Not for all concepts! Time  year (20), month (12), day of week (7)

30
Summary
Cleaning: Binning, Regression, Outlier analysis
Integration: Correlation analysis
Reduction: Regression, Histograms, Clustering, Attribute construction, Wavelet transforms, PCA, Attribute subset selection, Sampling
Transformation/Discretization: Binning, Regression, Correlation analysis, Histogram analysis, Clustering, Attribute construction, Aggregation, Normalization, Concept hierarchy
31
Summary

21
Data Mining and Business Intelligence

Mining Frequent Patterns, Associations, & Correlations
Apriori – FP-Growth – Evaluation Methods

By
Dr. Nora Shoaip

Lecture 4

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Outline
 The Basics
• Market Basket Analysis
• Frequent Item sets
• Association Rules

 Frequent Item set Mining Methods


• Apriori Algorithm
• Generating Association Rules from Frequent Item sets
• FP-Growth

 Pattern Evaluation Methods

2
The Basics: What Is Frequent Pattern Analysis?

• Frequent pattern: a pattern (a set of items, subsequences, substructures,


etc.) that occurs frequently in a data set

• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the


context of frequent itemsets and association rule mining

3
The Basics

21
The Basics

Motivation: Finding inherent regularities in data


What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis

5
The Basics

6
The Basics : Frequent Itemsets
Itemset X = {x1, …, xk} ex: X={A, B, C, D, E, F}
Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

7
The Basics : Association Rules
Ex: Let sup_min = 50%, conf_min = 50%

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (support 60%, confidence 100%)
D ⇒ A (support 60%, confidence 75%)

9
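A sketch that recomputes the support and confidence figures above from the five transactions:

```python
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions.values()) / n

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> {A, D} is frequent at sup_min = 50%
print(confidence({"A"}, {"D"}))   # 1.0  -> A => D (60%, 100%)
print(confidence({"D"}, {"A"}))   # 0.75 -> D => A (60%, 75%)
```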
The Basics : Association Rules

 If frequency of itemset I satisfies min_support count then I is a frequent


itemset
 If a rule satisfies min_support and min_confidence thresholds, it is said
to be strong
 problem of mining association rules reduced to mining frequent itemsets
 Association rules mining becomes a two-step process:
 Find all frequent itemsets that occur at least as frequently as a
predetermined min_support count
 Generate strong association rules from the frequent itemsets that satisfy
min_support and min_confidence

10
Outline
 The Basics
• Market Basket Analysis
• Frequent Item sets
• Association Rules

 Frequent Item set Mining Methods


• Apriori Algorithm
• Generating Association Rules from Frequent Item sets
• FP-Growth

 Pattern Evaluation Methods

11
Mining Frequent Itemsets: Apriori
Goes as follows:
 Find frequent 1-itemsets  L1
 Use L1 to find frequent 2-itemsets  L2
 … until no more frequent k-itemsets can be found

Computing each Lk requires a full scan of the dataset

To improve efficiency, use the Apriori property:


 “All nonempty subsets of a frequent itemset must also be frequent” –
if a set cannot pass a test, all of its supersets will fail the same test as
well – if P(I) < min_support then P(I ∪ A) < min_support

12
Mining Frequent Itemsets: Apriori

Transactional data example – N = 9, min_supp count = 2

TID    List of items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan the dataset for the count of each candidate (C1), then compare each candidate's support with min_support (L1):

C1 = L1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
13
Mining Frequent Itemsets: Apriori
Generate C2 candidates from L1 by joining L1 ⋈ L1, scan the dataset for the count of each candidate, then compare each candidate's support with min_supp:

C2 counts: {I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4,
           {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0

L2: {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2
14
Mining Frequent Itemsets: Apriori
C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}

Two joining (lexicographically ordered) k-itemsets must share their first k-1 items, so {I1, I2} is not joined with {I2, I4}.

Not all subsets of these candidates are frequent, so prune using the Apriori property, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}. Scan the dataset for the count of each remaining candidate and compare its support with min_supp:

L3: {I1, I2, I3}: 2, {I1, I2, I5}: 2
15
Mining Frequent Itemsets: Apriori

L3: {I1, I2, I3}: 2, {I1, I2, I5}: 2

Joining L3 ⋈ L3 yields the candidate {I1, I2, I3, I5}, but not all of its subsets are frequent, so it is pruned.

C4 = ∅  ⇒  Terminate

16
Mining Frequent Itemsets: Apriori

17
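A compact, illustrative Apriori implementation that reproduces L1–L3 for the nine transactions above. It is only a sketch: candidates are generated by taking unions of pairs of frequent (k−1)-itemsets instead of the lexicographic join shown on the slides, with the prune step enforcing the same Apriori property.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frequent itemset: support count}."""
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= min_sup]
    frequent = {c: count(c) for c in level}
    k = 2
    while level:
        # Join: build k-itemset candidates from frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune (Apriori property): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = [c for c in candidates if count(c) >= min_sup]
        frequent.update((c, count(c)) for c in level)
        k += 1
    return frequent

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
result = apriori(transactions, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
# The 3-itemsets that survive are {I1, I2, I3}: 2 and {I1, I2, I5}: 2, as on the slides
```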
Apriori
Algorithm
Generate Ck using Lk-1 to find Lk

Join

Prune

18
Mining Frequent Itemsets:
Generating Association Rules from Frequent Itemsets

19
Mining Frequent Itemsets:
Generating Association Rules from Frequent Itemsets
Frequent 3-itemsets: {I1, I2, I3}: 2, {I1, I2, I5}: 2.
For X = {I1, I2, I5} (support count 2), the nonempty subsets and the candidate rules are:

Nonempty subset    Association rule       Confidence
{I1, I2}           {I1, I2} ⇒ I5          2/4 = 50%
{I1, I5}           {I1, I5} ⇒ I2          2/2 = 100%
{I2, I5}           {I2, I5} ⇒ I1          2/2 = 100%
{I1}               I1 ⇒ {I2, I5}          2/6 = 33%
{I2}               I2 ⇒ {I1, I5}          2/7 = 29%
{I5}               I5 ⇒ {I1, I2}          2/2 = 100%

For a min_confidence of 70%, only the rules with 100% confidence are output as strong.


20
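The confidence column above can be recomputed from the support counts; a minimal sketch:

```python
from itertools import combinations

# Support counts taken from the Apriori example
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

itemset = frozenset({"I1", "I2", "I5"})
min_conf = 0.70

for r in range(1, len(itemset)):
    for lhs in map(frozenset, combinations(itemset, r)):
        rhs = itemset - lhs
        conf = support[itemset] / support[lhs]
        status = "strong" if conf >= min_conf else "rejected"
        print(sorted(lhs), "=>", sorted(rhs), f"confidence = {conf:.0%} ({status})")
```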
Mining Frequent Itemsets:
FP-Growth

 To avoid costly candidate generation


 Divide-and-conquer strategy:
 Compress database representing frequent items into a frequent
pattern tree (FP-tree) – 2 passes over dataset
 Divide compressed database (FP-tree) into conditional databases,
then mine each for frequent itemsets – traverse through the FP-tree

21
Mining Frequent Itemsets:
FP-Growth

Transactional data example – N = 9, min_supp count = 2 (the same nine transactions used for Apriori above). Scan the dataset for the count of each candidate (C1), compare each candidate's support with min_supp, and re-order the frequent items by descending support count:

C1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
L1 – Reordered: {I2}: 7, {I1}: 6, {I3}: 6, {I4}: 2, {I5}: 2
22
Mining Frequent Itemsets:
FP-Growth – FP-tree Construction

Each transaction is inserted into the FP-tree with its items in L1-reordered (descending support) order: I2, I1, I3, I4, I5. The order of items is kept throughout path construction, with common prefixes shared whenever applicable:

 T100 (I2, I1, I5): creates the path null { } – I2:1 – I1:1 – I5:1
 T200 (I2, I4): increments I2 to 2 and adds a new branch I4:1 under I2
 T300 (I2, I3): increments I2 to 3 and adds a new branch I3:1 under I2
 … and so on for T400 through T900

FP-tree after all nine transactions (each node shown as item:count), with node links from the header table (I2, I1, I3, I4, I5) used for tree traversal:

null { }
├─ I2:7
│   ├─ I1:4
│   │   ├─ I5:1
│   │   ├─ I3:2
│   │   │   └─ I5:1
│   │   └─ I4:1
│   ├─ I3:2
│   └─ I4:1
└─ I1:2
    └─ I3:2

Mining the FP-tree is a bottom-up algorithm – start from the leaves (following each item's node links) and go up to the root.

23-30
Mining Frequent Itemsets:
FP-Growth – Conditional FP-tree Construction

For each item (starting from the least frequent, I5), collect the item's prefix paths from the FP-tree – its conditional pattern base – by keeping only the transactions (paths) that include the item and then eliminating the item itself:

 For I5: the prefix paths are {I2, I1: 1} and {I2, I1, I3: 1}, giving the conditional FP-tree <I2:2, I1:2> (I3:1 is dropped because it does not meet min_supp)
 For I4: the prefix paths are {I2, I1: 1} and {I2: 1}, giving <I2:2>
 For I3: the prefix paths are {I2, I1: 2}, {I2: 2}, and {I1: 2}, giving <I2:4, I1:2> and <I1:2>

Each conditional FP-tree is then mined for frequent itemsets; the results are summarized below.

31-35
Mining Frequent Itemsets:
FP-Growth

Item Conditional Pattern Base Conditional FP-tree Frequent Patterns Generated

I5 {{I2, I1: 1}, {I2, I1, I3: 1}} <I2:2, I1:2> {I2, I5: 2}, {I1, I5: 2},
{I2, I1, I5: 2}
I4 {{I2, I1: 1}, {I2: 1}} <I2:2> {I2, I4: 2}
I3 {{I2, I1: 2}, {I2: 2}, {I1: 2}} <I2:4, I1:2>, <I1:2> {I2, I3: 4}, {I1, I3: 4},
{I2, I1, I3: 2}
I1 {{I2: 4}} <I2:4> {I2, I1: 4}

Paths ending with item

36
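For comparison with the hand-built tree, the same frequent patterns can be obtained with the optional third-party mlxtend library (this assumes mlxtend and pandas are installed; it is a convenience check, not the algorithm as presented on the slides).

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Minimum support count of 2 out of 9 transactions -> min_support = 2/9
frequent = fpgrowth(onehot, min_support=2 / 9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```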
Outline
 The Basics
• Market Basket Analysis
• Frequent Item sets
• Association Rules

 Frequent Item set Mining Methods


• Apriori Algorithm
• Generating Association Rules from Frequent Item sets
• FP-Growth

 Pattern Evaluation Methods

37
Pattern Evaluation Methods

38
Pattern Evaluation Methods

39
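The formulas for these evaluation measures were lost with the slide graphics. As one standard example of such a measure, the lift of two itemsets can be computed as follows (the transaction counts are illustrative, not taken from the slides):

```python
def lift(sup_a, sup_b, sup_ab):
    """lift(A, B) = P(A and B) / (P(A) * P(B));
    > 1: positively correlated, = 1: independent, < 1: negatively correlated."""
    return sup_ab / (sup_a * sup_b)

# Illustrative: of 10,000 transactions, 6,000 contain computer games,
# 7,500 contain videos, and 4,000 contain both.
print(lift(0.60, 0.75, 0.40))   # ~0.89 < 1 -> the two items are negatively correlated
```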
