
Chapter 3

Data Preprocessing

Data Preprocessing

• Data Preprocessing: An Overview


– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation
Data Quality Measures
• Before feeding data to a data mining (DM) algorithm, we have to
ensure the quality of the data
• Well-accepted multidimensional data quality measures include
the following:
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not
– Timeliness: is the data updated in a timely manner?
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily can the data be understood?

Data Preprocessing
• Today’s real-world databases are highly susceptible to
noisy, missing, and inconsistent data due to their typically
huge size and their likely origin from multiple,
heterogeneous sources
– Low-quality data will lead to low-quality mining results
• How can the data be pre-processed in order to help
improve the quality of the data and, consequently, of the
mining results?
• How can the data be pre-processed so as to improve the
efficiency and ease of the mining process?

Data Preprocessing…
The major tasks in data preprocessing include:
• Data cleaning: helps to get rid of bad data
– Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
– Integration of data from multiple sources (databases, data
warehouses, or files) into a coherent data store
• Data reduction: can reduce the data size by
– Dimensionality reduction
– Numerosity/size reduction
– Data compression
• Data transformation
– Normalization
– Discretization
Data Preprocessing…
Data is often of low quality:
• Data mining requires collecting a great amount of data
(available in warehouses or databases) to achieve the
intended objective
• In addition to being heterogeneous and distributed, real-world
data is dirty and of low quality
• Why?
– You didn’t collect it yourself!
– It was probably created for some other use, and then you
came along wanting to integrate it
– People make mistakes (typos)
– People are too busy (“this is good enough”) to systematically
organize the data carefully using structured formats
Data Preprocessing…
• Data preprocessing techniques, when applied before mining, can
substantially improve the overall quality of the patterns mined
and/or the time required for the actual mining
• Incomplete, noisy, and inconsistent data are commonplace
properties of large real world databases and data warehouses
• There are many possible reasons for noisy data (having incorrect
attribute values)
– The data collection instruments used may be faulty
– There may have been human or computer errors occurring at data entry
– Errors in data transmission can also occur
– Incorrect data may also result from inconsistencies in naming
conventions or data codes used or inconsistent formats for input fields,
such as date.

Types of problems with data
• Some data have problems on their own that need to be
cleaned:
– Outliers: misleading data that do not fit most of the data/facts
– Missing data: attribute values might be absent and need to be
replaced with estimates
– Irrelevant data: attributes in the database that might not be of
interest to the DM task being developed
– Noisy data: attribute values that might be invalid or incorrect, e.g.
typographical errors
– Inconsistent data, duplicate data, etc.
• Other data become problematic only when we want to integrate them
– everyone had their own way of structuring and formatting data,
based on what was convenient for them
– How can we integrate data organized in different formats following
different conventions?
Case study: Government Agency Data

• What we want:

ID Name City State

1 Ministry of Transportation Addis Ababa Addis Ababa

2 Ministry of Finance Addis Ababa Addis Ababa

3 Office of Foreign Affairs Addis Ababa Addis Ababa

Data Cleaning: Redundancy
• Duplicate or redundant data are data problems that require data
cleaning
• Having a large amount of redundant data may slow down or
confuse the knowledge discovery process
• What’s wrong here?

ID Name City State


1 Ministry of Transportation Addis Ababa Addis Ababa
2 Ministry of Finance Addis Ababa Addis Ababa
3 Ministry of Finance Addis Ababa Addis Ababa

Data Cleaning: Incomplete (Missing) Data
• Incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– e.g., Occupation=“ ” (missing data)
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
ID Name City State

1 Ministry of Transportation Addis Ababa Addis Ababa

2 Ministry of Finance Addis Ababa

3 Office of Foreign Affairs Addis Ababa Addis Ababa

• What’s wrong here? A missing required field


Data Cleaning: Incomplete (Missing) Data
• Missing data may be due to
– values that were inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding, or not considered
important at the time of entry
– history or changes of the data not being registered
• How to handle Missing data? Missing data may need to be
inferred
– Ignore the missing value:
• This method is not very effective, unless the tuple contains several
attributes with missing values.
• not effective when the percentage of missing values per attribute
varies considerably
– Fill in the missing value manually: tedious + infeasible?
Data Cleaning: Incomplete (Missing) Data
How to handle missing data? cont’d…

– Use a global constant to fill in the missing value:
• Replace all missing attribute values by the same constant,
e.g. “unknown”
• Although this method is simple, it is not foolproof
– Fill in the missing value automatically with the most probable
value: calculated, for example, using the Expectation
Maximization (EM) algorithm

Predict missing value using EM
• Solves estimation with incomplete data
– Obtain initial estimates for parameters
– Iteratively use estimates for missing data and continue until convergence
• EM example: given six data items of which the known values are
{1, 5, 10, 4}, estimate the two missing data items. Suppose that we
initially guess mean = 3
– Filling both missing items with the current mean and re-estimating
the mean gives successive estimates of 4.33, 4.78, 4.93, and 4.97
– The algorithm stops here since the last two estimates are only
about 0.05 apart; thus, our estimate for the two missing items is 4.97
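A minimal Python sketch of this iterative estimation, assuming the simple
“replace the missing items with the current mean, then re-estimate” scheme
used in the slide example (not a full EM implementation):

def em_impute(known, n_missing, init_mean, tol=0.05):
    # Iteratively refine the estimate of the missing values.
    estimate = init_mean
    while True:
        # E-step: fill the missing items with the current estimate
        filled = known + [estimate] * n_missing
        # M-step: re-estimate the mean from the completed data
        new_estimate = sum(filled) / len(filled)
        if abs(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate

print(em_impute([1, 5, 10, 4], n_missing=2, init_mean=3))  # ~4.975, i.e. the slide's 4.97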
Data Cleaning: Noisy Data
• Noisy: containing noise, errors, or outliers
– e.g., Salary=“−10” (an error)
• Typographical errors are errors that corrupt data
• Let’s say ‘green’ is written as ‘rgeen’

• Incorrect attribute values may be due to


– faulty data collection instruments (e.g.: OCR)
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
Data Cleaning: How to catch Noisy Data
• Pay someone to manually check all data
• Sort data by frequency
– ‘green’ is more frequent than ‘rgeen’
– Works well for categorical data
• Use constraints to catch corrupt data
– E.g. numerical constraints
• Weight can’t be negative
• People can’t have more than 2 parents
• Salary can’t be less than Birr 300
• Use statistical techniques to catch corrupt data
– Check for outliers (the case of the 8-meter-tall man)
– Check for correlated outliers (“pregnant males”)
• People can be male
• People can be pregnant
• People can’t be male AND pregnant
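A minimal Python sketch of the constraint and frequency checks above; the
record fields and thresholds are assumptions for illustration:

from collections import Counter

records = [
    {"weight": 70, "salary": 2500, "sex": "male", "pregnant": False, "color": "green"},
    {"weight": -5, "salary": 150, "sex": "male", "pregnant": True, "color": "rgeen"},
]

# 1) Constraint checks: flag values that violate simple domain rules
for i, r in enumerate(records):
    if r["weight"] < 0:
        print(f"record {i}: weight can't be negative")
    if r["salary"] < 300:
        print(f"record {i}: salary can't be less than Birr 300")
    if r["sex"] == "male" and r["pregnant"]:
        print(f"record {i}: correlated outlier ('pregnant male')")

# 2) Frequency sort: rare categorical values (e.g. 'rgeen') are likely typos
counts = Counter(r["color"] for r in records)
print(counts.most_common())  # 'green' should dominate 'rgeen' in real data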
Data Integration
• Data integration combines data from multiple
sources (database, data warehouse, files and
sometimes from non-electronic sources) into a
coherent store
• Because of the use of different sources, data that is
fine on its own may become problematic when we
want to integrate it
• Some of the issues are:
– Different formats and structures
– Conflicting and redundant data
Data Integration: Formats
• Not everyone uses the same format
• Dates are especially problematic:
– 12/19/13
– 19/12/13
– 19/12/2013
– 19-12-13
– Jan 19, 2013
– 19 January 2013
– 19th Jan. 2013

• Are you frequently writing money as:


– Birr 200, Br. 200, 200 Birr, …

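A minimal Python sketch of normalizing such mixed date formats into one ISO
form; the candidate format list is an assumption, and ambiguous day-first vs.
month-first strings (12/19/13 vs. 19/12/13) still need a policy decision:

from datetime import datetime

# Day-first two-digit years are tried before four-digit years on purpose
CANDIDATE_FORMATS = ["%d/%m/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def normalize_date(raw):
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # flag the value for manual cleaning

for raw in ["19/12/13", "19/12/2013", "19-12-13", "Jan 19, 2013", "19 January 2013"]:
    print(raw, "->", normalize_date(raw))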
Data Integration: Inconsistent
• Inconsistent data: containing discrepancies in codes or names,
which is also the problem of Lack of standardization / naming
conventions. e.g.,
– Age=“42” vs. Birthday=“13/03/2013”
– Some use “1,2,3” for rating; others “A, B, C”
• Discrepancy between records

ID   Name                         City          State
1    Ministry of Transportation   Addis Ababa   Addis Ababa region
2    Ministry of Finance          Addis Ababa   Addis Ababa administration
3    Office of Foreign Affairs    Addis Ababa   Addis Ababa regional administration

Data Integration: different structure
What’s wrong here?

ID    Name                         City          State
1     Ministry of Transportation   Addis Ababa   AA

ID    Name                  City          State
Two   Ministry of Finance   Addis Ababa   AA

Name                        ID   City          State
Office of Foreign Affairs   3    Addis Ababa   AA
Data Integration: Conflicting Data

• Detecting and resolving data value conflicts


–For the same real world entity, attribute values from
different sources are different
–Possible reasons: different representations, different
scales, e.g., metric vs. British units
• weight measurement: kg or pound
• height measurement: meter or inch

Handling Redundancy in Data Integration
• Redundant data occurs often when integrating multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue, age
• Redundant attributes may be detected by correlation
analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality

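A minimal Python sketch of correlation-based redundancy detection; the
attribute names and the 0.95 threshold are assumptions:

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric attributes
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [120, 144, 108, 180, 132]   # derivable: 12 * monthly_revenue

r = pearson(monthly_revenue, annual_revenue)
if abs(r) > 0.95:   # threshold is an assumption
    print(f"correlation {r:.2f}: one of the two attributes is likely redundant")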
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
• Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction,
• Select best attributes or remove unimportant attributes
– Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of
data representation
– Data compression
• A technique that reduces the size of large files so that the
smaller files take less memory space and are faster to transfer
over a network or the Internet
Data Reduction: Dimensionality Reduction
Dimensionality reduction
– Help eliminate irrelevant features and reduce noise
– Reduce time and space required in data mining
– Allow easier visualization
• Method: attribute subset selection
– One of the methods to reduce the dimensionality of data is selecting
the best attributes
– Helps to avoid redundant attributes: attributes that duplicate
information contained in one or more other attributes
– Helps to avoid irrelevant attributes: attributes that contain no
information that is useful for the data mining task at hand
» E.g., is students' ID relevant to predict students' GPA?

Dimensionality reduction Con’t…
• Commonly used heuristic attribute selection methods:
– Best step-wise feature selection (see the sketch after this list):
• The best single attribute is picked first
• Then the next best attribute conditioned on the first, and so on

– Step-wise attribute elimination:


• Repeatedly eliminate the worst attribute

– Best combined attribute selection and elimination

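A minimal Python sketch of the forward (best step-wise) selection loop; the
evaluate() scoring function and the toy attribute weights are assumptions,
standing in for, e.g., a cross-validated model score:

def forward_selection(all_attributes, evaluate, max_attributes=None):
    selected, remaining = [], list(all_attributes)
    best_score = float("-inf")
    while remaining and (max_attributes is None or len(selected) < max_attributes):
        # Try adding each remaining attribute and keep the best single addition
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        score, attr = max(scored)
        if score <= best_score:   # no attribute improves the score: stop
            break
        best_score = score
        selected.append(attr)
        remaining.remove(attr)
    return selected

# Toy scoring function (an assumption, not a real evaluator)
weights = {"age": 0.6, "income": 0.3, "student_id": 0.0}
print(forward_selection(weights, lambda attrs: sum(weights[a] for a in attrs)))
# ['age', 'income'] -- 'student_id' adds nothing, so it is never selected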
Data Reduction: Numerosity Reduction
• Different methods can be used, including Clustering
and sampling
• Clustering
– Partition data set into clusters based on similarity, and
store cluster representation
• Sampling
– obtaining a small sample s to represent the whole data set N
– Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
– Key principle: Choose a representative subset of the data
using suitable sampling technique

Types of Sampling
• Simple random sampling
– There is an equal probability of selecting any particular item
– Simple random sampling may have very poor performance in the
presence of skew
• Stratified sampling:
– Stratified sampling partitions the data set and draws samples
from each partition (proportionally, i.e., approximately the same
percentage of the data from each partition)
• Sampling without replacement
– Once an object is selected, it is removed from the population
• Sampling with replacement
– A selected object is not removed from the population

Sampling: Stratified Sampling

[Figure: the raw data on the left is partitioned into strata and a
proportional sample is drawn from each stratum, giving the stratified
sample on the right]
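A minimal Python sketch contrasting simple random and stratified sampling
with the standard library; the 'stratum' field and the 10% rate are assumptions:

import random
from collections import defaultdict

data = [{"id": i, "stratum": "young" if i % 3 else "senior"} for i in range(300)]

# Simple random sampling without replacement
srs = random.sample(data, k=30)

# Stratified sampling: draw ~10% from each stratum separately
by_stratum = defaultdict(list)
for rec in data:
    by_stratum[rec["stratum"]].append(rec)

stratified = []
for stratum, records in by_stratum.items():
    k = max(1, round(0.10 * len(records)))
    stratified.extend(random.sample(records, k))

print(len(srs), len(stratified))  # both samples hold about 10% of the data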
Data Reduction: Data Compression
• Two types of data compression:
 Lossless compression is a compression technique that does not
lose any data in the compression process; it can typically reduce a
file to about half its original size
 Lossy compression: to reduce a file significantly beyond 50%, one
must use lossy compression, which strips some of the less important
data from the file. Because of this data loss, only certain kinds of
content are fit for lossy compression, such as images, audio, and video.

Original Data  --lossless-->  Compressed Data
Original Data  --lossy-->     Approximated Original Data
Data Reduction: Data Compression
• Lossless and lossy compression have become part of our
everyday vocabulary largely due to the popularity of
– MP3 music files
• Compare them with WAV files
– JPEG image files
• Compare them with GIF files
– MP4 or MPEG video files
• Compare it with AVI files
• These formats makes the resulting file much smaller so
that several dozen music, image and/or video files can fit,
for example, on a single compact disk, mobile phones, etc

Data Transformation
• A function that maps the entire set of values of a given attribute to
a new set of replacement values such that each old value can be
identified with one of the new values
• Methods for data transformation
– Normalization: Scaled to fall within a smaller, specified range
of values
• min-max normalization
• z-score normalization

– Discretization: Reduce data size by dividing the range of a


continuous attribute into intervals. Interval labels can then be
used to replace actual data values
• Discretization can be performed recursively on an attribute
using methods such as
– Binning: divide values into intervals
– Concept hierarchy generation: organizes concepts (i.e.,
attribute values) hierarchically
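A minimal Python sketch of the two normalization methods named above; the
income values are illustrative:

def min_max(values, new_min=0.0, new_max=1.0):
    # Scale each value into the range [new_min, new_max]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # Express each value in standard deviations from the mean
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 16000, 54000, 73600, 98000]
print(min_max(incomes))
print(z_score(incomes))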
Simple Discretization: Binning

• Equal-depth (frequency) partitioning


– Divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky

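A minimal Python sketch of equal-depth (equal-frequency) binning; the price
list is illustrative:

def equal_depth_bins(values, n_bins):
    # Sort the values and cut them into n_bins groups of roughly equal size
    ordered = sorted(values)
    size, rem = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < rem else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
for i, b in enumerate(equal_depth_bins(prices, 3)):
    print(f"bin {i + 1}: {b}")
# bin 1: [4, 8, 15]   bin 2: [21, 21, 24]   bin 3: [25, 28, 34]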
Chapter 4

Mining Association Rules

Association Rule Mining/ARM/
• ARM was introduced by Agrawal et al. (1993).
• Given a set of records, each of which contains some number
of items from a given collection,
– ARM produces dependency rules that predict the occurrence
of an item based on the occurrences of other items.

• Motivation of ARM: finding inherent regularities in data
– What products were often purchased together? Pasta & Tea?
– What are the subsequent purchases after buying a PC?
• ARM aims to extract interesting correlations, frequent
patterns, associations among sets of items in the
transaction databases or other data repositories
Cont’d…
• Association Rule Mining is a technique in data
mining used to discover interesting relationships,
patterns, or associations between items in large
datasets.
• It is widely used in market basket analysis, where
the goal is to find associations between items
purchased together.
Key Concepts in Association Rule Mining
• Transaction Database: A collection of transactions,
where each transaction is a set of items.
Association Rule Mining Con’t…
• Association rules are widely used in various areas such as
– market and risk management
– inventory control
– medical diagnosis
– Web usage mining
– intrusion detection
– catalog design and
– customer shopping behavior analysis, etc.
• The goal of ARM is to find association rules that satisfy
predefined minimum support and confidence thresholds in a given
database
Applications
• Market Basket Analysis: Find products
frequently bought together.
• Fraud Detection: Identify unusual transaction
patterns.
• Healthcare: Discover associations between
symptoms and diseases.

Association Rule Mining Con’t…
• Based on the concept of strong rules, Agrawal et al.
(1993) introduced association rules for discovering
regularities between products in large scale transaction
data recorded by point-of-sale (POS) systems in
supermarkets
• For example, the rule {onion, potatoes} → Burger
found in the sales data of a supermarket would indicate
that if a customer buys onions and potatoes together, he or
she is likely to also buy hamburger meat
– Such information can be used as the basis for decisions about
marketing activities such as, e.g., promotional pricing or product
placements
Association Rule Mining Con’t…
• In general, ARM can be viewed as a two-step
process
– Finding frequent patterns from large itemsets
• find those itemsets whose occurrences exceed a
predefined threshold in the database; those itemsets
are called frequent or large itemsets.
– Generating association rules from these itemsets
• generate association rules from those large itemsets
with the constraints of minimal confidence

Association Rule Mining Con’t…
• The problem of ARM is defined as follows: let I = {i1, i2, …, in} be
a set of n attributes called items. Let D = {t1, t2, …, tm} be a
set of transactions called the database. Each transaction in
D has a unique transaction ID and contains a subset of the
items in I. A rule is defined as an implication of the form
X → Y (which means that Y may be present in the transaction
if X is in the transaction), where X, Y ⊆ I and X ∩ Y = Ø
• The sets of items (for short itemsets) X and Y are called
antecedent (left-hand-side or LHS) and consequent (right-
hand-side or RHS) of the rule respectively

Frequent Patterns
• are patterns (such as itemsets) that appear in a data set
frequently
– For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset
• Mining frequent patterns leads to the discovery of
associations and correlations among items in large
transactional or relational data sets
• can help in many business decision-making processes,
such as
– catalog design,
– Store layout
– and customer shopping behavior analysis
Frequent Patterns Con’t…
• A typical example of frequent itemset mining is market
basket analysis
– analyzes customer buying habits by finding associations between
the different items that customers place in their “shopping
baskets”
• For example, if customers are buying milk, how likely are
they to also buy bread on the same trip to the supermarket?
– Such information can lead to increased sales by helping retailers
do selective marketing and plan their shelf space
• Support and confidence are the two measures of
association rule interestingness
– They respectively reflect the usefulness and certainty of
discovered rules
Frequent Patterns Con’t…
• Since the database is large and users are concerned only with
those frequently purchased items, thresholds on support and
confidence are usually predefined by users to drop those rules
that are not interesting or useful
• The two thresholds are called minimal support and
minimal confidence respectively
– Support (s) of an association rule XY is defined as the
percentage/fraction of records that contain X∪Y to the total
number of records in the database
– Confidence of an association rule XY is defined as the
percentage/fraction of the number of transactions that contain
X∪Y to the total number of records that contain X

Frequent Pattern Analysis
• Basic concepts:
– itemset: A set of one or more items
– k-itemset X = {x1, …, xk}: An itemset that contains k items
– support, s, is the fraction of transactions that contain X (i.e., the
probability that a transaction contains X)
• support of X and Y greater than a user-defined threshold s, i.e. the
probability that a transaction contains X ∪ Y is at least s

Support(X → Y) = P(X ∪ Y)
Support(X → Y) = Support_count(X ∪ Y) / |D|, where |D| is the total
number of transactions

• P(X ∪ Y) indicates the probability that a transaction contains
the union of set X and set Y
• Support_count of an itemset is the number of transactions
that contain the itemset
Frequent Pattern Analysis
• Confidence is the probability of finding Y in a transaction containing
all of X1, X2, …, Xn
– confidence, c, is the conditional probability that a transaction having X
also contains Y; i.e. the conditional probability (confidence) of Y given
X exceeds a user threshold c

Confidence(X → Y) = P(Y|X)
Confidence(X → Y) = P(Y|X) = Support(X ∪ Y) / Support(X)
                           = Support_count(X ∪ Y) / Support_count(X)

– P(Y|X) indicates the conditional probability that a transaction contains
Y given X
• An itemset X is frequent if X’s support is no less than a minsup
threshold
• Rules that satisfy both a minimum support threshold (min sup)
and a minimum confidence threshold (min conf) are called strong
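A minimal Python sketch of these two measures, computed over the small
Coke/Tea transaction set used in the worked example a few slides below:

transactions = [
    {"Coke", "Nuts", "Tea"},
    {"Coke", "Coffee", "Tea"},
    {"Coke", "Tea", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Coffee", "Tea", "Eggs", "Milk"},
]

def support_count(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    return support_count(x | y) / len(transactions)

def confidence(x, y):
    return support_count(x | y) / support_count(x)

x, y = {"Coke"}, {"Tea"}
print(f"support    = {support(x, y):.0%}")     # 3/5 = 60%
print(f"confidence = {confidence(x, y):.0%}")  # 3/3 = 100%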
Example: Finding frequent itemsets
• Given a support threshold (S), sets of X items that
appear in greater than or equal to S baskets are called
frequent itemsets
• Example: Frequent Itemsets
– Items bought = {milk, coke, pepsi, biscuit, juice}
– Support threshold = 4 baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {b, c}
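A minimal Python sketch that brute-forces the frequent itemsets of this
basket example (checking itemsets of size 1 and 2, which is enough here);
the level-wise Apriori approach shown later does this more cleverly:

from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
S = 4  # support threshold in baskets
items = sorted(set().union(*baskets))

frequent = []
for size in (1, 2):
    for candidate in combinations(items, size):
        count = sum(1 for basket in baskets if set(candidate) <= basket)
        if count >= S:
            frequent.append((set(candidate), count))

print(frequent)
# frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {b, c}, matching the slide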
Association Rules
• Find all rules on itemsets of the form XY with minimum
support and confidence
– If-then rules about the contents of baskets.
• {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then
it is likely to contain j.”
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.” Note: “support” of an association rule is the
support of the set of items it mentions.
– Confidence of this association rule is the probability of j given
i1,…,ik: the fraction of the transactions containing i1,…,ik that
also contain j
– Example: confidence of the rule {m, b} → c
B1 = {m, c, b}   B2 = {m, p, j}   B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}   B6 = {m, c, b, j}   B7 = {c, b, j}   B8 = {b, c}
• An association rule: {m, b} → c (with confidence = 2/4 = 50%)
Example: Association Rules
• Let say min_support = 50%, min_confidence = 50%, identify
frequent item pairs and define association rules
Tid   Items bought
10    Coke, Nuts, Tea
20    Coke, Coffee, Tea
30    Coke, Tea, Eggs
40    Nuts, Eggs, Milk
50    Coffee, Tea, Eggs, Milk

[Venn diagram: customers who buy Coke, customers who buy Tea, and
customers who buy both]

• Frequent Pattern:
– Coke:3, Tea:4, Eggs:3, {Coke, Tea}:3
• Association rules:
– Coke → Tea (60%, 100%)
– Tea → Coke (60%, 75%)
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
–A two-pass approach called a-priori limits the need for main
memory
–Key idea: if a set of items appears at least s times, so does
every subset.
• Contrapositive for pairs: if item i does not appear in s baskets,
then no pair including i can appear in s baskets.
• FPGrowth: A Frequent Pattern-Growth Approach
–Mining Frequent Patterns Without Candidate Generation
–Uses the Apriori Pruning Principle
–Scan DB only twice!
• Once to find frequent 1-itemset (single item pattern)
• The other to construct the FP-tree, the data structure of
FPGrowth
A-Priori Algorithm
• Apriori is a seminal algorithm proposed by R. Agrawal and
R. Srikant in 1994 for mining frequent itemsets for Boolean
association rules
– The name of the algorithm is based on the fact that the algorithm
uses prior knowledge of frequent itemset properties
• Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k + 1)-
itemsets
– First, the set of frequent 1-itemsets is found by scanning the
database to accumulate the count for each item, and collecting
those items that satisfy minimum support. The resulting set is
denoted L1

A-Priori Algorithm Con’t….
– Next, L1 is used to find L2 , the set of frequent 2-itemsets, which
is used to find L3 , and so on, until no more frequent k-itemsets
can be found
– The finding of each Lk requires one full scan of the database
• To improve the efficiency of the level-wise generation of
frequent itemsets, an important property called the Apriori
property is used to reduce the search space
• Apriori property: All nonempty subsets of a frequent
itemset must also be frequent
– E.g. If {Coke, Tea, nuts} is frequent, so is {Coke, Tea} i.e.,
every transaction having {Coke, Tea, nuts} also contains {Coke,
Tea}
Apriori: A Candidate Generation & Test
Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length-(k+1) candidate itemsets from the length-k
frequent itemsets. For each k, we construct two sets of k-tuples:
• Ck = candidate k-tuples = those that might be frequent sets
(support ≥ s) based on information from the pass for k–1
• Lk = the set of truly frequent k-tuples
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be generated
A-Priori for All Frequent Itemsets
• C1 = all items; L1 = the items counted on the first pass to be
frequent; C2 = all pairs both of whose items are in L1; L2 = the pairs
in C2 with support ≥ s; in general, Ck = k-tuples each of whose
(k–1)-subsets is in Lk–1, and Lk = the members of Ck with support ≥ s
[Diagram: C1 (all items) --count the items, filter--> L1 --construct (all
pairs of items from L1)--> C2 --count the pairs, filter--> L2 --construct-->
C3; the first pass counts the items, the second pass counts the pairs]
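A minimal Python sketch of the level-wise candidate-generation-and-test loop
described above, applied to the earlier basket example; support is counted
as an absolute number of baskets here:

from itertools import combinations

def apriori(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if sum(i in t for t in transactions) >= min_support_count}]
    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Candidate generation: join L(k-1) with itself, keep size-k unions
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count the surviving candidates against the DB and filter into Lk
        frequent.append({c for c in candidates
                         if sum(c <= t for t in transactions) >= min_support_count})
        k += 1
    return [itemset for level in frequent for itemset in level]

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(apriori(baskets, min_support_count=4))
# frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {b, c}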


Bottlenecks of the Apriori Approach
• The Apriori algorithm reduces the size of candidate
frequent itemsets by using “Apriori property.” However, it
still requires two nontrivial computationally expensive
processes
i. It may generate a huge number of candidate sets that will
be discarded later in the test stage
ii. It requires as many database scans as the size of the
largest frequent itemsets
- In order to find frequent k-itemsets, the Apriori
algorithm needs to scan database k times
- It is costly to go over each transaction in the database
to determine the support of the candidate itemsets
Pattern-Growth Approach
• The FP-Growth Approach
– FP-growth was first proposed by Han et al. (2000)
– Depth-first search: search depth wise by identifying
different set of combinations with a given single or pair
of items
– Avoid explicit candidate generation
• FP-growth adopts a divide-and-conquer strategy
– First, it compresses the database representing frequent
items into a frequent-pattern tree, or FP-tree, which
retains the itemset association information
Pattern-Growth Approach
– It then divides the compressed database into a set of
conditional databases (a special kind of projected
database), each associated with one frequent item or
“pattern fragment,” and mines each such database
separately
• An FP-Tree is constructed by first creating the
root of the tree, labeled with “null.”
• This algorithm generates frequent itemsets from
FP-tree by traversing in bottom-up fashion

Construct FP-tree from a Transaction Database
Assume min_support = 3 and Min_confidence = 80%
TID   Items bought                  (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single item patterns)
2. Sort the frequent items in frequency descending order: the f-list
   F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

[FP-tree: the root is labeled {}; one branch is f:4 – c:3 – a:3 – m:2 – p:2,
with side branches a:3 – b:1 – m:1 and f:4 – b:1; a second branch from the
root is c:1 – b:1 – p:1]
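A minimal Python sketch of the two-scan FP-tree construction shown above; it
builds only the tree and header table (the later mining of conditional
pattern bases is not shown), and ties among equally frequent items may be
ordered differently than the slide's f-list:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: frequent 1-itemsets, sorted by descending frequency (the f-list)
    counts = Counter(i for t in transactions for i in t)
    f_list = [i for i, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(f_list)}

    root = FPNode(None)
    header = defaultdict(list)   # item -> list of tree nodes holding that item
    # Scan 2: insert each transaction's frequent items, in f-list order
    for t in transactions:
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, f_list

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
root, header, f_list = build_fp_tree(transactions, min_support=3)
print(f_list)  # the items f, c, a, b, m, p (tie order among counts of 3 may differ)
print({item: sum(n.count for n in header[item]) for item in f_list})  # f:4, c:4, a:3, b:3, m:3, p:3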
Benefits of the FP-tree Structure
• Completeness
– Preserve complete information for frequent pattern
mining
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Never be larger than the original database

Benefits of FP-growth over Apriori

• FP-growth is faster than Apriori because:


– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building (no
pattern matching)
