Data Mining and Business Intelligence
Overview: Introduction, Technologies, Applications
By
Dr. Nora Shoaip
Lecture 1
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Introduction
• Why Data Mining?
• What is Data Mining?
• Data Mining Applications
• Categories of Mining Techniques
Why Data Mining?
The era of Explosive Growth of Data: in the petabytes!
Automated data collection and availability: tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, transactions, stocks, …
Science: Remote sensing, bioinformatics, …
Society and everyone: news, digital cameras, social feeds
The ability to economically store and manage petabytes of data online
The Internet and computing grids that make all these archives universally accessible
Data-management tasks grow with data volumes
Massive data volumes, but still little insight!
Solution: data mining, the automated analysis of massive data sets
What is Data Mining?
Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data
o Data mining: a misnomer?
• Alternative names
o Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, information
harvesting, business intelligence, etc.
• Is everything “data mining”?
o Simple search and query processing
o (Deductive) expert systems
Knowledge Discovery Process
Selection:
• Finding data relevant to
the task
Preprocessing:
• Cleaning and putting data in a format suitable
for mining
Transformation
• Performing summaries,
aggregations or
consolidation
Data Mining
• Applying the data
mining algorithms to
extract knowledge
Evaluation
• Locating useful
knowledge
What Kinds of Data Can Be Mined?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Applications
Categories of Mining Techniques
Frequent Patterns Mining
Clustering
Classification
Predictive data mining.
Supervised learning.
Construct models (functions) based on training examples
Describe and distinguish classes or concepts for future prediction
e.g., classify countries based on climate,
or classify cars based on gas mileage
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification, …
Know Your Data
• Data Objects & Attribute Types
• Basic Statistical Descriptions of Data
Objects and Attributes
Attribute Types: Nominal Attributes
Attribute Types: Binary Attributes
Nominal with only two values representing two states or
categories: 0 or 1 (absent or present)
Also Boolean (true or false)
Qualitative
Symmetric: both states are equally valuable and have the same
weight
e.g. gender
Asymmetric: states are not equally important
e.g. medical test outcomes
Attribute Types: Ordinal Attributes
Qualitative
Values have a meaningful order or ranking, but magnitude
between successive values is not known
e.g. professional rank, grade, customer satisfaction
Useful for data reduction of numerical attributes
Attribute Types: Numeric Attributes
Quantitative
Interval-scaled: measured on a scale of equal-size units
e.g. temperature, year
Do not have a true zero point, so values
cannot be expressed as multiples of one another
Ratio-scaled: have a true zero point
A value can be expressed as a multiple of another
e.g. years of experience, weight, salary
Discrete vs. Continuous Attributes
Outline
Data Objects & Attribute Types
• What is an Object?
• What is an Attribute?
• Attribute Types
• Continuous vs. Discrete
Basic Statistical Descriptions of Data
• Measuring central tendency
• Measuring Data dispersion
• Basic Graphic displays
Measuring Data similarity & dissimilarity
• Data matrix & dissimilarity matrix
• Proximity Measures for (Nominal, Binary) attributes
• Dissimilarity of Numerical Data
Measuring Central Tendency
Example
Salary (in thousands of dollars), shown in increasing
order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
• Mean = 58,000
• Median = 54,000
• Mode = 52,000 and 70,000 – bimodal
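The worked answers above can be reproduced with Python's standard `statistics` module:

```python
from statistics import mean, median, multimode

# Salaries from the slide, in thousands of dollars
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))       # 58
print(median(salaries))     # 54 (average of the two middle values, 52 and 56)
print(multimode(salaries))  # [52, 70] -- the data set is bimodal
```

`multimode` (Python 3.8+) returns every value with the highest frequency, which is what makes the bimodal case visible.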
Measuring Dispersion of Data
Five-Number Summary:
Median (Q2), quartiles Q1 and Q3, and the
smallest and largest individual
observations, in order
Boxplots: visualization technique for the five-number summary
Whiskers terminate at min and max, OR at the
most extreme observations within
1.5 × IQR of the quartiles, with the
remaining points (outliers) plotted
individually
Example:
Suppose that a hospital tested the age and body fat data for 18
randomly selected adults with the following results:
•Calculate the mean, median, and standard deviation of age
and %fat.
•Draw the boxplots for age and %fat.
•Calculate the correlation coefficient. Are these two attributes
positively or negatively correlated? Compute their covariance.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
Solution
Age: 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat: 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
%fat sorted: 7.8 9.5 17.8 25.9 26.5 27.2 27.4 28.8 30.2 31.2 31.4 32.9 33.4 34.1 34.6 35.7 41.2 42.5

Draw the boxplots for age and %fat:
For Age
Q1 = 39, median = 51, Q3 = 57, min = 23, max = 61
IQR = 57 - 39 = 18, 1.5 × IQR = 27
Whisker limits: 39 - 27 = 12 and 57 + 27 = 84
For %fat
Q1 = 26.5, median = 30.7, Q3 = 34.1, min = 7.8, max = 42.5
IQR = 34.1 - 26.5 = 7.6, 1.5 × IQR = 11.4
Whisker limits: 26.5 - 11.4 = 15.1 and 34.1 + 11.4 = 45.5
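The quartiles above follow the "median of each half" convention, which can be coded directly (note that NumPy's default percentile interpolation would give slightly different values):

```python
from statistics import median

def five_number_summary(values):
    """Q1/Q3 computed as medians of the lower/upper halves, the convention used on the slide."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower, upper = s[:half], s[n - half:]
    return min(s), median(lower), median(s), median(upper), max(s)

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
mn, q1, med, q3, mx = five_number_summary(age)
iqr = q3 - q1
print(mn, q1, med, q3, mx)                # 23 39 51.0 57 61
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)     # whisker limits: 12.0 and 84.0
```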
Visual Representations
of Data Distributions
Histograms
Scatter Plots: each pair of values is treated
as a pair of coordinates and plotted as
points in the plane
X and Y are correlated if one attribute
implies the other:
positively, negatively, or not at all
(uncorrelated)
For more attributes, we use a scatter-plot
matrix
Visual Representations
of Data Distributions
Uncorrelated data
Data Mining and Business Intelligence
Know Your Data: Similarity, Data Quality, Preprocessing
By
Dr. Nora Shoaip
Lecture 2
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Measuring Data similarity &
dissimilarity
• Data matrix & dissimilarity matrix
• Proximity Measures for (Nominal, Binary) attributes
• Dissimilarity of Numerical Data
Measuring Data similarity & dissimilarity
Data matrix & dissimilarity matrix
Proximity Measures for Nominal Attributes
Proximity Measures for Binary Attributes
Proximity Measures for Binary Attributes: Example
Dissimilarity of Numerical Data
Minkowski Distance
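The Minkowski distance formula itself is not reproduced in the extracted text; assuming the standard definition d(i, j) = (Σ|x_ik − x_jk|^h)^(1/h), a minimal sketch:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h; h=1 is Manhattan, h=2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

p, q = (1, 2), (3, 5)
print(minkowski(p, q, 1))  # 5.0, Manhattan: |1-3| + |2-5|
print(minkowski(p, q, 2))  # ~3.606, Euclidean: sqrt(2**2 + 3**2)
```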
Overview
• Why preprocess data?
• Major tasks:
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Why Preprocess Data?
Major Preprocessing Tasks That Improve Quality of Data
Data Cleaning
Missing Values
Ignore the tuple: not very effective, unless the tuple contains
several attributes with missing values
Fill in the missing value manually: time-consuming, not
feasible for large data sets
Use a global constant: replace all missing attribute values by the
same value (e.g. "unknown")
the mining program may mistakenly think that "unknown" is an interesting concept
Data Cleaning
Noisy Data
Example: Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-depth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Partition into (equal-width) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24, 25, 28
Bin 3: 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24, 24, 24
Bin 3: 34
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
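The equal-depth case above can be sketched in a few lines (helper names are illustrative, not from the slides):

```python
def equal_depth_bins(sorted_data, n_bins):
    """Split already-sorted data into bins of (roughly) equal size."""
    size = len(sorted_data) // n_bins
    return [sorted_data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace each value with its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the nearer of its bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

price = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(price, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```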
Data Mining and Business Intelligence
Data Pre-processing: Integration, Reduction, Transformation
By
Dr. Nora Shoaip
Lecture 3
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Data Integration
• Entity Identification Problem
• Redundancy and correlation analysis
• Tuple duplication
Data Integration
Entity Identification Problem
Redundancy and Correlation Analysis
Data Integration
Redundancy and Correlation Analysis

Contingency table: observed counts, with expected counts in parentheses

Preferred reading   male       female       Total
Fiction             250 (90)   200 (360)    450
Non-fiction         50 (210)   1000 (840)   1050
Total               300        1200         1500
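The expected counts and the chi-square statistic for this table can be checked directly; a minimal sketch (expected counts derived from row and column totals, e.g. 450 × 300 / 1500 = 90):

```python
# Observed counts from the contingency table (fiction/non-fiction vs male/female)
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]        # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]  # [300, 1200]
n = sum(row_totals)                                # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n      # expected count, e.g. 450*300/1500 = 90
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 507.94: far above the critical value, so gender and
                       # preferred reading are strongly correlated
```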
Data Integration
Redundancy and Correlation Analysis
Data Integration
Redundancy and Correlation Analysis

Example: two attributes (A, B) observed at five time points
      A    B
T1    6    20
T2    5    10
T3    4    14
T4    3    5
T5    2    5
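Covariance and the correlation coefficient for these two columns can be computed as a quick check (population formulas, dividing by n):

```python
# Values for the two attributes across five time points (from the slide)
A = [6, 5, 4, 3, 2]
B = [20, 10, 14, 5, 5]
n = len(A)

mean_a, mean_b = sum(A) / n, sum(B) / n  # 4.0 and 10.8
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n
std_a = (sum((a - mean_a) ** 2 for a in A) / n) ** 0.5
std_b = (sum((b - mean_b) ** 2 for b in B) / n) ** 0.5
r = cov / (std_a * std_b)

print(cov)  # ~7.0: positive covariance, the attributes rise and fall together
print(r)    # ~0.87: positively correlated
```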
Data Integration
More Issues
Tuple duplication
The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy.
e.g. purchaser name and address, and purchases
Data value conflict
e.g. grading system in two different institutes A, B, … versus 90%,
80% …
Data Reduction
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling
Data Reduction Strategies
Data Reduction
Attribute Subset Selection
Find a minimal set of attributes such that the resulting probability distribution of the data is as
close as possible to the original distribution using all attributes
An exhaustive search can be prohibitively expensive
Heuristic (greedy) search:
◦ Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the
attributes is determined and added to the reduced set. At each subsequent iteration, the best of
the remaining attributes is added to the set.
◦ Stepwise backward elimination: start with the full set of attributes. At each step, remove the
worst attribute remaining in the set.
◦ Combination of forward selection and backward elimination
◦ Decision tree induction
Attribute construction: e.g. an area attribute based on height and width attributes
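Stepwise forward selection can be sketched generically; the scoring function below is a hypothetical stand-in (a real one would measure, e.g., model accuracy or how well the reduced set preserves the class distribution):

```python
def forward_select(candidates, score, k):
    """Greedy stepwise forward selection: start empty, repeatedly add the
    attribute that most improves the score of the selected set."""
    selected = []
    for _ in range(k):
        best = max((a for a in candidates if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# Toy scoring function (hypothetical): the usefulness of a set is just the
# sum of per-attribute weights, so the greedy choice is easy to follow.
weights = {"A1": 3, "A2": 1, "A3": 2}
score = lambda attrs: sum(weights[a] for a in attrs)

print(forward_select(list(weights), score, 2))  # ['A1', 'A3']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.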
Data Reduction - Numerosity Reduction
Regression
Data Reduction
Regression

X       Y
1.00    1.00
2.00    2.00
3.00    1.30
4.00    3.75
5.00    2.25
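Fitting the least-squares line y = wx + b to these points replaces the ten stored values with just two parameters, which is the numerosity-reduction idea:

```python
X = [1.00, 2.00, 3.00, 4.00, 5.00]
Y = [1.00, 2.00, 1.30, 3.75, 2.25]
n = len(X)

mx, my = sum(X) / n, sum(Y) / n
# Ordinary least squares: slope = S_xy / S_xx, intercept = mean_y - slope * mean_x
slope = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
intercept = my - slope * mx
print(round(slope, 3), round(intercept, 3))  # 0.425 0.785
```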
Data Reduction
Histograms
Equal-width: the width of each bucket range is uniform (e.g., a width of $10 per bucket)
Data Reduction
Sampling
Transformation and Discretization
Transformation Strategies
Attribute construction
Aggregation
Discretization: raw values of a numeric attribute are replaced by interval
labels (e.g. 0–10, 11–20) or conceptual labels (e.g., youth, adult, senior)
Transformation and Discretization
Transformation by Normalization
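The normalization formulas themselves are not reproduced in the extracted text; assuming the standard definitions, a minimal sketch (the income figures are the common textbook illustration, used here only as an assumed example):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map [vmin, vmax] onto [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std

# Assumed example: income with min $12,000, max $98,000, mean $54,000, std $16,000
print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(z_score(73600, 54000, 16000))            # 1.225
```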
Transformation and Discretization
Concept Hierarchy
Summary
Cleaning: binning, regression, clustering, outlier analysis
Integration: correlation analysis
Reduction: regression, histograms, clustering, attribute construction, wavelet transforms, PCA, attribute subset selection, sampling
Transformation/Discretization: binning, histogram analysis, clustering, correlation analysis, attribute construction, aggregation, normalization, concept hierarchy
Data Mining and Business Intelligence
Apriori
Lecture 4
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Outline
The Basics
• Market Basket Analysis
• Frequent Item sets
• Association Rules
The Basics: What Is Frequent Pattern Analysis?
The Basics: Frequent Itemsets
Itemset X = {x1, …, xk}, e.g. X = {A, B, C, D, E, F}
Find all the rules X ⇒ Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also
contains Y
The Basics: Association Rules
Example: Let supmin = 50%, confmin = 50%

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (support 60%, confidence 100%)
D ⇒ A (support 60%, confidence 75%)
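The support and confidence figures in this example can be verified by direct counting (integer counts avoid floating-point surprises in the ratios):

```python
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions.values())

def support(itemset):
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    return count(lhs | rhs) / count(lhs)

print(support({"A", "D"}))       # 0.6  -> 60% support for A union D
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D holds with 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A holds with 75% confidence
```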
Mining Frequent Itemsets: Apriori
Proceeds level by level:
Find frequent 1-itemsets L1
Use L1 to find frequent 2-itemsets L2
… until no more frequent k-itemsets can be found
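A minimal level-wise Apriori sketch in Python, using absolute support counts (function and variable names are illustrative, not from the slides):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: frequent k-itemsets generate (k+1)-candidates,
    which are pruned (all subsets must be frequent) and then counted."""
    items = sorted({i for t in transactions for i in t})
    L = [frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_sup]
    frequent = {s: None for s in L}  # dict keeps insertion order
    k = 2
    while L:
        # Join step: combine frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(L) for s in combinations(c, k - 1))}
        L = [c for c in candidates
             if sum(c <= set(t) for t in transactions) >= min_sup]
        for c in L:
            frequent[c] = None
        k += 1
    return list(frequent)

# Transactions from the earlier association-rule example; min support count 3
tx = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
print(apriori(tx, 3))  # 1-itemsets A, B, D, E plus the pair {A, D}
```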
Mining Frequent Itemsets: Apriori
L3: {I1, I2, I3}: 2 and {I1, I2, I5}: 2
The candidate {I1, I2, I3, I5} is pruned: not all of its subsets are frequent
C4 is empty, so the algorithm terminates
Mining Frequent Itemsets: Apriori
Apriori Algorithm
Generate Ck using Lk-1 to find Lk:
Join step
Prune step
Mining Frequent Itemsets:
Generating Association Rules from Frequent Itemsets
Frequent itemsets: {I1, I2, I3}: 2 and {I1, I2, I5}: 2
Nonempty subsets give candidate rules; confidence = support count(whole) / support count(antecedent):
{I1, I2} ⇒ I5, confidence 2/4 = 50%
{I1, I5} ⇒ I2, confidence 2/2 = 100%
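Rule confidence is just a ratio of support counts, as a quick check shows (counts taken from the slide's Apriori run):

```python
# Support counts from the Apriori run on the slides
support_count = {
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def confidence(lhs, rhs):
    """confidence(lhs => rhs) = support_count(lhs ∪ rhs) / support_count(lhs)"""
    return support_count[lhs | rhs] / support_count[lhs]

print(confidence(frozenset({"I1", "I2"}), frozenset({"I5"})))  # 0.5 -> 50%
print(confidence(frozenset({"I1", "I5"}), frozenset({"I2"})))  # 1.0 -> 100%
```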
Mining Frequent Itemsets: FP-Growth

L1, reordered by descending support count (with node links for tree traversal):
Itemset   Support count
{I2}      7
{I1}      6
{I3}      6
{I4}      2
{I5}      2

Transaction database:
TID     List of items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

FP-tree construction: starting from the root (null), each transaction is inserted with its items reordered by descending support count; transactions sharing a prefix share nodes, and the counts of shared nodes are incremented. After all nine transactions, the tree contains the branches:
null → I2:7 → I1:4 → I5:1
null → I2:7 → I1:4 → I4:1
null → I2:7 → I1:4 → I3:2 → I5:1
null → I2:7 → I3:2
null → I2:7 → I4:1
null → I1:2 → I3:2
Mining Frequent Itemsets:
FP-Growth – Conditional FP-tree Construction

For each item, keep only the transactions containing it, take the prefix paths leading to it in the FP-tree (its conditional pattern base), and build a conditional FP-tree from them:

For I5: eliminate transactions not including I5; the prefix paths {I2, I1: 1} and {I2, I1, I3: 1} yield the conditional FP-tree <I2:2, I1:2>
For I4: eliminate transactions not including I4; the prefix paths {I2, I1: 1} and {I2: 1} yield <I2:2>
For I3: eliminate transactions not including I3; the prefix paths {I2, I1: 2}, {I2: 2} and {I1: 2} yield <I2:4, I1:2> and <I1:2>
Mining Frequent Itemsets: FP-Growth

Item  Conditional Pattern Base          Conditional FP-tree    Frequent Patterns Generated
I5    {{I2, I1: 1}, {I2, I1, I3: 1}}    <I2:2, I1:2>           {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4    {{I2, I1: 1}, {I2: 1}}            <I2:2>                 {I2, I4: 2}
I3    {{I2, I1: 2}, {I2: 2}, {I1: 2}}   <I2:4, I1:2>, <I1:2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1    {{I2: 4}}                         <I2:4>                 {I2, I1: 4}
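The patterns in this table can be double-checked against the original transaction database by brute-force counting:

```python
# The nine transactions from the FP-growth example
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def count(itemset):
    """Support count: number of transactions containing every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)

# Patterns mined via the conditional FP-trees, confirmed by direct counting
assert count({"I2", "I5"}) == 2 and count({"I1", "I5"}) == 2
assert count({"I2", "I1", "I5"}) == 2
assert count({"I2", "I4"}) == 2
assert count({"I2", "I3"}) == 4 and count({"I1", "I3"}) == 4
assert count({"I2", "I1", "I3"}) == 2
assert count({"I2", "I1"}) == 4
print("all FP-growth patterns confirmed")
```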
Pattern Evaluation Methods