Data Mining - Sem 3 - Assignment - 2
20.11.2023
ASSIGNMENT - 2
Questions 50 Marks
Q1) Consider the market basket transactions shown in the following table. Use the Apriori
algorithm to answer the questions that follow.
TID  Items Bought
1    oregano, chocolate, milk, cheese, french fries
2    milk, french fries, cheese, ketchup
3    chocolate, cheese, oregano, ketchup
4    chocolate, cheese, french fries
5    french fries, cheese, oregano, chocolate
6    chocolate, ketchup
7    oregano, french fries, ketchup
8    oregano, french fries, chocolate
9    ketchup, oregano, milk
10   french fries, chocolate
• Assuming the minimum support threshold is fixed at 40%, list the frequent 1-itemsets
(L1) together with their respective supports.
• List the itemsets in the set of candidate 2-itemsets (C2) and calculate their supports.
• Generate all association rules from the itemsets in L2 and compute the confidence of
each rule.
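The three steps above can be checked with a minimal sketch in plain Python. It assumes that C2 is formed by pairing the frequent items in L1 and that each frequent pair yields two rules, one in each direction. The names count, MIN_COUNT and rules are illustrative and not part of the assignment.

```python
from itertools import combinations

# Transactions copied from the table in Q1.
transactions = [
    {"oregano", "chocolate", "milk", "cheese", "french fries"},
    {"milk", "french fries", "cheese", "ketchup"},
    {"chocolate", "cheese", "oregano", "ketchup"},
    {"chocolate", "cheese", "french fries"},
    {"french fries", "cheese", "oregano", "chocolate"},
    {"chocolate", "ketchup"},
    {"oregano", "french fries", "ketchup"},
    {"oregano", "french fries", "chocolate"},
    {"ketchup", "oregano", "milk"},
    {"french fries", "chocolate"},
]
MIN_COUNT = 4  # 40% of the 10 transactions


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)


# L1: frequent 1-itemsets with their support counts.
items = sorted(set().union(*transactions))
L1 = {item: count([item]) for item in items if count([item]) >= MIN_COUNT}

# C2: every pair of frequent items; L2 keeps the pairs that meet the threshold.
C2 = {pair: count(pair) for pair in combinations(sorted(L1), 2)}
L2 = {pair: c for pair, c in C2.items() if c >= MIN_COUNT}

# Rules a -> b from each frequent pair; confidence = count(a, b) / count(a).
rules = {}
for (a, b), c in L2.items():
    rules[(a, b)] = c / L1[a]
    rules[(b, a)] = c / L1[b]

n = len(transactions)
for item, c in L1.items():
    print(f"L1 {item}: support {c / n:.0%}")
for pair, c in C2.items():
    print(f"C2 {pair}: support {c / n:.0%}")
for (a, b), conf in rules.items():
    print(f"{a} -> {b}: confidence {conf:.2f}")
```

Working with support counts instead of fractions avoids floating-point comparisons against the 40% threshold.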
Q2) Compute all class-conditional and class prior probabilities. Use the Naive Bayes classifier to
predict the class of the following tuple:
X = (age = young, income = medium, employed = yes, credit rating = good)
Q3) Consider the following set of points:
{44, 28, 48, 26, 32, 14, 52, 50}
Assuming that k=2, and initial cluster centres for k-means clustering are 5 and 38, compute
the sum of squared errors (SSE) and cluster assignment for each iteration.
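A hand computation for Q3 can be cross-checked with the following sketch of one-dimensional k-means. It rests on two assumptions that the question does not fix: the SSE of an iteration is measured against the centres used for that iteration's assignment (before they are updated), and the loop stops once the assignment no longer changes. The function name kmeans_1d is illustrative.

```python
def kmeans_1d(points, centres, max_iter=100):
    """Return a list of (centres, labels, sse) tuples, one per iteration."""
    history = []
    previous = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for each point.
        labels = [
            min(range(len(centres)), key=lambda j: (p - centres[j]) ** 2)
            for p in points
        ]
        # SSE against the centres used for this assignment (assumed convention).
        sse = sum((p - centres[l]) ** 2 for p, l in zip(points, labels))
        history.append((list(centres), labels, sse))
        if labels == previous:
            break  # assignments are stable, so the centres will not move
        previous = labels
        # Update step: each centre becomes the mean of its members.
        updated = []
        for j in range(len(centres)):
            members = [p for p, l in zip(points, labels) if l == j]
            updated.append(sum(members) / len(members) if members else centres[j])
        centres = updated
    return history


points = [44, 28, 48, 26, 32, 14, 52, 50]
history = kmeans_1d(points, [5, 38])
for step, (centres, labels, sse) in enumerate(history, start=1):
    clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(2)]
    print(f"iteration {step}: centres={centres} clusters={clusters} SSE={sse}")
```

If the course convention is to report the SSE after the centres are updated, compute it from the centres of the next history entry instead.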
Q4) Consider the following dataset with points in a two-dimensional space:
Point  Coordinates  Class
A      (2, 3)       1
B      (4, 5)       1
C      (6, 1)       2
D      (8, 4)       2
E      (3, 6)       1
F      (1, 1)       1
G      (5, 2)       2
H      (7, 3)       2
I      (9, 5)       2
J      (4, 3)       1
You are given a test point P = (5, 3) and need to determine its class using the
k-nearest neighbors (KNN) algorithm.
• For k = 3, calculate the Euclidean distance between point P and all other points in the
dataset. Identify the k nearest neighbors of point P.
• Determine the class of point P.
• Discuss how the choice of k might impact the classification result.
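The distance table and the vote for Q4 can be reproduced with a short sketch. It assumes an unweighted majority vote and breaks distance ties by the order of the rows in the table; neither choice is stated in the question. The function name knn_predict is illustrative.

```python
import math
from collections import Counter

# Training points from the table in Q4: name -> (coordinates, class).
points = {
    "A": ((2, 3), 1), "B": ((4, 5), 1), "C": ((6, 1), 2), "D": ((8, 4), 2),
    "E": ((3, 6), 1), "F": ((1, 1), 1), "G": ((5, 2), 2), "H": ((7, 3), 2),
    "I": ((9, 5), 2), "J": ((4, 3), 1),
}
P = (5, 3)


def knn_predict(query, k):
    """Return the names of the k nearest points and the majority class."""
    # sorted() is stable, so equal distances keep the table order (assumption).
    ranked = sorted(points.items(), key=lambda kv: math.dist(query, kv[1][0]))
    neighbours = ranked[:k]
    votes = Counter(label for _, (_, label) in neighbours)
    return [name for name, _ in neighbours], votes.most_common(1)[0][0]


for name, (xy, label) in points.items():
    print(f"{name}: distance {math.dist(P, xy):.3f}, class {label}")

names, predicted = knn_predict(P, k=3)
print(f"k=3 neighbours: {names}, predicted class: {predicted}")
```

Calling knn_predict(P, k) with other values of k is a quick way to explore the last bullet. With k = 1, for example, two points are equally close to P, so the result depends entirely on the tie-breaking rule.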
Q5) Consider the training examples shown in the following table for a classification problem.
Student ID  Admission Category  Admission List  Gender  Predicted Class  Actual Class
1           Sports              First           M       0                0
2           Arts                Second          F       0                1
3           Arts                Third           F       0                0
4           Sports              Fourth          M       0                1
5           Academics           First           F       0                0
6           Arts                Second          F       1                1
7           Arts                Third           M       1                1
8           Sports              Fourth          M       1                0
9           Sports              Third           F       1                1
10          Arts                First           F       1                0
11          Sports              Third           M       1                1
12          Sports              Second          F       1                1
13          Sports              First           M       1                0
• Which attribute gives the better split according to the Gini index: Admission Category or
Admission List? Why?
• Create a confusion matrix for the above data set and compute the false positive rate, accuracy,
recall and precision.
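Both parts of Q5 can be verified with the sketch below. It makes three assumptions the question leaves open: the Gini index is computed on the Actual Class column, each attribute is split multiway (one branch per distinct value), and class 1 is treated as the positive class. The helper names gini and weighted_gini are illustrative.

```python
from collections import Counter, defaultdict

# Rows from the table in Q5:
# (admission category, admission list, gender, predicted class, actual class)
rows = [
    ("Sports", "First", "M", 0, 0), ("Arts", "Second", "F", 0, 1),
    ("Arts", "Third", "F", 0, 0), ("Sports", "Fourth", "M", 0, 1),
    ("Academics", "First", "F", 0, 0), ("Arts", "Second", "F", 1, 1),
    ("Arts", "Third", "M", 1, 1), ("Sports", "Fourth", "M", 1, 0),
    ("Sports", "Third", "F", 1, 1), ("Arts", "First", "F", 1, 0),
    ("Sports", "Third", "M", 1, 1), ("Sports", "Second", "F", 1, 1),
    ("Sports", "First", "M", 1, 0),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(col):
    """Size-weighted Gini of a multiway split on column col (assumption)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[4])  # group the actual class by attribute value
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


gini_category = weighted_gini(0)
gini_list = weighted_gini(1)
print(f"Gini(Admission Category) = {gini_category:.4f}")
print(f"Gini(Admission List)     = {gini_list:.4f}")

# Confusion-matrix cells with class 1 as the positive class (assumption).
tp = sum(p == 1 and a == 1 for *_, p, a in rows)
fp = sum(p == 1 and a == 0 for *_, p, a in rows)
fn = sum(p == 0 and a == 1 for *_, p, a in rows)
tn = sum(p == 0 and a == 0 for *_, p, a in rows)

accuracy = (tp + tn) / len(rows)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} FPR={false_positive_rate:.3f}")
```

The attribute with the lower weighted Gini gives the purer split. If the course expects binary splits instead of multiway ones, the grouping step in weighted_gini has to be adapted.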