Data Mining Project
(HONOUR)
DATA MINING (MTM3223)
GROUP PROJECT
NAME STUDENT ID
QUESTION A
TASK 1
ITEMSET FREQUENCY/SUPPORT
Bread 3/10 = 30%
Milk 7/10 = 70%
Cheese 5/10 = 50%
Beer 6/10 = 60%
Umbrella 5/10 = 50%
Diaper 4/10 = 40%
Water 7/10 = 70%
Detergent 3/10 = 30%
ITEMSET FREQUENCY/SUPPORT
Milk, Water 60%
Cheese, Umbrella 50%
ITEMSET FREQUENCY/SUPPORT
Milk, Water, Cheese, Umbrella 3/10 = 30%
Explanation: The four-item set generated in Step 5 falls below the 50% minimum support threshold, which makes it invalid. The frequent two-item sets from Step 4 are therefore used to generate the association rules.
ASSOCIATION RULES
Milk => Water [support 60%, confidence (6/7 × 100%) = 85.71%]
Cheese => Umbrella [support 50%, confidence (5/5 × 100%) = 100%]
Water => Milk [support 60%, confidence (6/7 × 100%) = 85.71%]
Umbrella => Cheese [support 50%, confidence (5/5 × 100%) = 100%]
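As a cross-check, the confidences above can be recomputed from the support values in the tables (a minimal sketch; the support figures are taken directly from the frequent-itemset tables above):

```python
# Supports from the frequent-itemset tables above (fractions of 10 transactions)
support = {
    frozenset(["Milk"]): 0.7,
    frozenset(["Water"]): 0.7,
    frozenset(["Cheese"]): 0.5,
    frozenset(["Umbrella"]): 0.5,
    frozenset(["Milk", "Water"]): 0.6,
    frozenset(["Cheese", "Umbrella"]): 0.5,
}

def confidence(antecedent, consequent):
    """confidence(A => B) = support(A union B) / support(A)."""
    a = frozenset(antecedent)
    return support[a | frozenset(consequent)] / support[a]

print(round(confidence(["Milk"], ["Water"]) * 100, 2))       # 85.71
print(round(confidence(["Cheese"], ["Umbrella"]) * 100, 2))  # 100.0
```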
QUESTION B
TID Items
T1 MILK, WATER, BEER
T2 MILK, WATER
T3 MILK, WATER, BEER
T4 MILK, BEER
T5 WATER, BEER, CHEESE, UMBRELLA
T6 MILK, WATER, CHEESE, UMBRELLA
T7 BEER, CHEESE, UMBRELLA
T8 MILK, WATER, CHEESE, UMBRELLA
T9 MILK, WATER, CHEESE, UMBRELLA
T10 BEER
FP-TREE:
[FP-tree diagram: root {} with node chains for MILK, WATER, BEER, CHEESE, and UMBRELLA; branch counts in the source include WATER: 3, CHEESE: 3, BEER: 2, and UMBRELLA: 3, but the tree layout was not preserved.]
Header table: MILK = 7, WATER = 7, BEER = 6, CHEESE = 5, UMBRELLA = 5
FREQUENT PATTERNS (FP-TREE) = MILK: 7, WATER: 7, BEER: 6, CHEESE: 5, UMBRELLA: 5
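The header-table counts can be verified, and each transaction ordered by descending support (the first step of FP-tree construction), with a short sketch using the transactions from the table above:

```python
from collections import Counter

# Transactions T1-T10 from the table above
transactions = [
    {"MILK", "WATER", "BEER"},
    {"MILK", "WATER"},
    {"MILK", "WATER", "BEER"},
    {"MILK", "BEER"},
    {"WATER", "BEER", "CHEESE", "UMBRELLA"},
    {"MILK", "WATER", "CHEESE", "UMBRELLA"},
    {"BEER", "CHEESE", "UMBRELLA"},
    {"MILK", "WATER", "CHEESE", "UMBRELLA"},
    {"MILK", "WATER", "CHEESE", "UMBRELLA"},
    {"BEER"},
]

# Support count of each single item (the FP-tree header table)
counts = Counter(item for t in transactions for item in t)
for item in ("MILK", "WATER", "BEER", "CHEESE", "UMBRELLA"):
    print(item, counts[item])

# Before insertion into the FP-tree, each transaction is sorted by
# descending global support (ties broken alphabetically)
ordered = [sorted(t, key=lambda i: (-counts[i], i)) for t in transactions]
print(ordered[0])  # ['MILK', 'WATER', 'BEER']
```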
TASK 2
ZeroR Model (always predicts the majority class, Yes)

                 Actual Yes   Actual No
Predicted Yes        6            4        0.60
Predicted No         0            0        0.00
                    1.00         0.00      0.60 (accuracy)
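ZeroR ignores all attributes and predicts the majority class for every instance; a minimal sketch using the class counts from the table above (6 Yes, 4 No):

```python
from collections import Counter

# Class labels from the table above: 6 Yes, 4 No
labels = ["Yes"] * 6 + ["No"] * 4

# ZeroR: always predict the most frequent class
majority, count = Counter(labels).most_common(1)[0]
accuracy = count / len(labels)
print(majority, accuracy)  # Yes 0.6
```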
OneR Model

Gender     Purchase Yes   Purchase No   Majority class/Rule   Errors
Female          3              2               Yes             2/5
Male            3              2               Yes             2/5
Location        Purchase Yes   Purchase No   Majority class/Rule   Errors
New York             0              2               No             0/2
Los Angeles          1              0               Yes            0/1
Chicago              0              1               No             0/1
Houston              1              0               Yes            0/1
Miami                0              1               No             0/1
San Francisco        1              0               Yes            0/1
Boston               1              0               Yes            0/1
Dallas               1              0               Yes            0/1
Seattle              1              0               Yes            0/1
Time Spent Browsing (min)   Purchase Yes   Purchase No   Majority class/Rule   Errors
1 - 20                           2              4               No             2/6
21 - 30                          3              0               Yes            0/3
41 - 60                          1              0               Yes            0/1
Number of Pages Viewed   Purchase Yes   Purchase No   Majority class/Rule   Errors
1 - 10                        3              3               Yes             3/6
11 - 20                       3              1               Yes             1/4
Time Spent Browsing (chosen OneR model)

                 Actual Yes   Actual No
Predicted Yes        4            0        1.00
Predicted No         2            4        0.33
                    0.67         1.00      0.80 (accuracy)
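OneR picks the single attribute whose one-level rules make the fewest errors, where the errors for each attribute value are the size of its minority class. A sketch over the counts above (Location, which has nearly one distinct value per record, is left out of this comparison, matching the chosen model):

```python
# (yes, no) counts per attribute value, from the OneR tables above
counts = {
    "Gender": {"Female": (3, 2), "Male": (3, 2)},
    "Time Spent Browsing": {"1-20": (2, 4), "21-30": (3, 0), "41-60": (1, 0)},
    "Number of Pages Viewed": {"1-10": (3, 3), "11-20": (3, 1)},
}

def oner_errors(table):
    # Each value predicts its majority class; errors = size of the minority class
    return sum(min(yes, no) for yes, no in table.values())

best = min(counts, key=lambda attr: oner_errors(counts[attr]))
print(best, oner_errors(counts[best]))  # Time Spent Browsing 2
```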
Likelihood Tables
[Five likelihood tables, one per attribute, appeared here; their contents were not preserved in the source.]
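As a sketch of how a likelihood table is built, using the Gender counts from the OneR table above (6 Yes and 4 No overall): each entry is the conditional probability of an attribute value given the class.

```python
# Gender counts per class, from the OneR table above
counts = {"Female": {"Yes": 3, "No": 2}, "Male": {"Yes": 3, "No": 2}}
class_totals = {"Yes": 6, "No": 4}

# P(value | class) = count(value, class) / count(class)
likelihood = {
    value: {c: counts[value][c] / class_totals[c] for c in class_totals}
    for value in counts
}
print(likelihood["Female"])  # {'Yes': 0.5, 'No': 0.5}
```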
Entropy (Purchase, Age) = P(20 - 30)*E(1, 2) + P(31 - 40)*E(2, 1) + P(41 - 50)*E(2, 0) + P(51 - 60)*E(1, 1)
= (3/10)(0.92) + (3/10)(0.92) + (2/10)(0) + (2/10)(1)
= 0.75
Gain (Age) = 0.97 - 0.75 = 0.22
Entropy (Purchase, Gender) = P(Female)*E(3, 2) + P(Male)*E(3, 2)
= (5/10)(0.97) + (5/10)(0.97)
= 0.485 + 0.485 = 0.97
Gain (Gender) = 0.97 - 0.97 = 0
Entropy (Purchase, Location) = P(New York)*E(0, 2) + P(Los Angeles)*E(1, 0) + P(Chicago)*E(0, 1) + P(Houston)*E(1, 0) + P(Miami)*E(0, 1) + P(San Francisco)*E(1, 0) + P(Boston)*E(1, 0) + P(Dallas)*E(1, 0) + P(Seattle)*E(1, 0)
= 0 (every location contains only one class)
Gain (Location) = 0.97 - 0 = 0.97
Entropy (Purchase, Time Spent Browsing) = P(1 - 20)*E(2, 4) + P(21 - 30)*E(3, 0) + P(41 - 60)*E(1, 0)
= (6/10)(0.92) + 0 + 0
= 0.55
Gain (Time Spent Browsing) = 0.97 - 0.55 = 0.42
Entropy (Purchase, Number of Pages Viewed) = P(1 - 10)*E(3, 3) + P(11 - 20)*E(3, 1)
= (6/10)(1) + (4/10)(0.81)
= 0.92
Gain (Number of Pages Viewed) = 0.97 - 0.92 = 0.05
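The entropy and gain arithmetic above can be reproduced with a short helper (the class counts are taken from the tables above):

```python
from math import log2

def entropy(*counts):
    """Shannon entropy of a class distribution, e.g. entropy(6, 4) ~ 0.97."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(parent, splits):
    """Information gain of a split; `splits` lists (yes, no) counts per value."""
    total = sum(parent)
    weighted = sum(sum(s) / total * entropy(*s) for s in splits)
    return entropy(*parent) - weighted

# Age split: 20-30 -> (1, 2), 31-40 -> (2, 1), 41-50 -> (2, 0), 51-60 -> (1, 1)
print(round(gain((6, 4), [(1, 2), (2, 1), (2, 0), (1, 1)]), 2))  # 0.22
# Location: every city is pure, so the gain equals the parent entropy
print(round(gain((6, 4), [(0, 2), (1, 0), (0, 1), (1, 0), (0, 1),
                          (1, 0), (1, 0), (1, 0), (1, 0)]), 2))  # 0.97
```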
Decision Tree Model (records grouped by Age)

Age       Gender   Location   Time Spent Browsing   No. of Pages Viewed   Purchase
20 - 30   Male     Chicago    1 - 20                1 - 10                No
20 - 30   Male     Miami      1 - 20                1 - 10                No
20 - 30   Male     Boston     1 - 20                1 - 10                Yes
51 - 60   Female   New York   1 - 20                11 - 20               No
For 20 - 30 (subset entropy E(1, 2) = 0.92)
Entropy (Purchase, Gender) = P(Male)*E(1, 2) = (3/3)(0.92) = 0.92
Gain = 0.92 - 0.92 = 0
Entropy (Purchase, Location) = P(Chicago)*E(0, 1) + P(Miami)*E(0, 1) + P(Boston)*E(1, 0) = 0
Gain = 0.92 - 0 = 0.92
Entropy (Purchase, Time Spent Browsing) = P(1 - 20)*E(1, 2) = (3/3)(0.92) = 0.92
Gain = 0.92 - 0.92 = 0
For 31 - 40 (subset entropy E(2, 1) = 0.92)
Entropy (Purchase, Gender) = P(Female)*E(2, 1) = (3/3)(0.92) = 0.92
Gain = 0.92 - 0.92 = 0
Entropy (Purchase, Location) = … + P(Seattle)*E(1, 0) = 0
Gain = 0.92 - 0 = 0.92
Entropy (Purchase, Time Spent Browsing) = (3/3)(0.92) = 0.92
Gain = 0.92 - 0.92 = 0
Entropy (Purchase, Number of Pages Viewed) = P(1 - 10)*E(1, 1) + P(11 - 20)*E(1, 0)
= (2/3)(1) + (1/3)(0) = 0.67
Gain = 0.92 - 0.67 = 0.25
For 51 - 60 (subset entropy E(1, 1) = 1)
Entropy (Purchase, Location) = 0 (each location is pure)
Gain = 1 - 0 = 1
Entropy (Purchase, Time Spent Browsing) = P(2/2)*E(1, 1) = 1
Gain = 1 - 1 = 0
[Final decision tree: Age at the root with branches 20 - 30, 31 - 40, 41 - 50, and 51 - 60; leaf labels include Boston, New York, San Francisco, Seattle, and Houston, but the branch layout was not preserved.]
TASK 3
Scatter Plots for 2, 3, and 4 Clusters of the Data Set:
[Scatter plot 1: 2 clusters (Cluster 1, Cluster 2); x-axis Magnesium (65 - 115), y-axis Colour Intensity]
[Scatter plot 2: 3 clusters; x-axis Magnesium (60 - 180), y-axis Colour Intensity]
[Scatter plot 3: 4 clusters; x-axis Magnesium (60 - 180), y-axis Colour Intensity]
Explanation:
Clustering is the process of grouping objects so that objects within the same cluster are highly similar to one another (high intra-class similarity) while objects in different clusters are dissimilar (low inter-class similarity). The three scatter plots above contain 2, 3, and 4 clusters respectively. The first diagram shows two clusters: cluster 1 has low intra-class similarity, while cluster 2 has high intra-class similarity. In the second diagram, all three clusters show high inter-class similarity; clusters 1 and 2 show high intra-class similarity, while cluster 3 has low intra-class similarity. The third diagram consists of 4 clusters. All clusters except cluster 3 have high inter-class similarity, and there is low inter-class similarity between clusters 3 and 4. All four clusters show strong intra-class similarity. Possible outliers can also be identified in each scatter diagram.
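Clusters like those in the plots above are typically produced with k-means. A self-contained sketch on a few illustrative (magnesium, colour-intensity) points — these values are made up for demonstration, not taken from the actual data set:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centre, then
    move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            clusters[i].append((x, y))
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated illustrative groups of (magnesium, colour intensity) points
points = [(70, 2.0), (72, 3.0), (75, 2.5), (110, 9.0), (112, 10.0), (115, 8.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```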