CSIT5210 Data Mining and Knowledge Discovery (Fall Semester 2023)

Homework 1
Deadline: 10 Oct, 2023 3pm
(Please hand in during lecture.)
Full Mark: 100 Marks

Coupon Instructions:
1. You can use a coupon to waive any one question and obtain full marks for that question.
2. You can waive at most one question in each assignment.
3. You may still answer the question you waive. We will mark it, but you will receive full marks for that
question regardless.
4. The coupon is non-transferable. That is, a coupon with a unique ID can be used only by the student
who obtained it in class.
5. Please staple the coupon to the submitted assignment.
6. Please write the number of the question you want to waive on the coupon.

Q1 [20 Marks]

(a) In general, we have a number of customers. For illustration, we are given two customers, namely X and
Y. The following shows 5 transactions for these two customers. Each transaction contains three kinds of
information: (1) customer ID (e.g., X and Y), (2) the time that this transaction occurred, and (3) all the
items involved in this transaction.

Customer X, time 1, items A, B, C

Customer Y, time 2, items A, F
Customer X, time 3, items D, E
Customer X, time 4, item G
Customer Y, time 5, items D, E, G

For example, the first transaction means that customer X bought items A, B and C at time 1, while the
last transaction means that customer Y bought items D, E and G at time 5.

A sequence is defined to be a series of itemsets in the form <S1, S2, S3, …, Sm>, where Si is an itemset for
i = 1, 2, …, m. The above transactions can be transformed into two sequences as follows.

X: <{A, B, C}, {D, E}, {G}>

Y: <{A, F}, {D, E, G}>

After this transformation, each customer is associated with a sequence.
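The transformation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout (customer, time, items) and the function name to_sequences are hypothetical choices, not part of the assignment.

```python
from collections import defaultdict

# The five example transactions as (customer ID, time, items).
transactions = [
    ("X", 1, {"A", "B", "C"}),
    ("Y", 2, {"A", "F"}),
    ("X", 3, {"D", "E"}),
    ("X", 4, {"G"}),
    ("Y", 5, {"D", "E", "G"}),
]

def to_sequences(transactions):
    """Group transactions by customer and order each group by time."""
    by_customer = defaultdict(list)
    for customer, time, items in transactions:
        by_customer[customer].append((time, frozenset(items)))
    return {
        customer: [items for _, items in sorted(entries, key=lambda e: e[0])]
        for customer, entries in by_customer.items()
    }

sequences = to_sequences(transactions)
# sequences["X"] is the list of itemsets {A, B, C}, {D, E}, {G}
# sequences["Y"] is the list of itemsets {A, F}, {D, E, G}
```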

Given a sequence S in the form <S1, S2, S3, …, Sm> and another sequence S’ in the form <S1’, S2’, S3’, …,
Sn’>, S is said to be a subsequence of S’ if m ≤ n and there exist m integers, namely i1, i2, …, im, such
that (1) 1 ≤ i1 < i2 < … < im ≤ n, and (2) Sj ⊆ Sij’ (the itemset of S’ at position ij) for j = 1, 2, …, m.
If S is a subsequence of S’, then S’ is defined to be a super-sequence of S.
The support of a sequence S is defined to be the total number of customers whose sequences are
super-sequences of S.
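The subsequence and support definitions can be checked mechanically. The sketch below is one possible way to do it, assuming sequences are stored as Python lists of sets; the names is_subsequence and support are illustrative and do not come from the assignment. It matches each itemset of S to the earliest remaining itemset of S’ that contains it, which is sufficient to decide whether a valid index list i1 < i2 < … < im exists.

```python
def is_subsequence(s, t):
    """Return True if sequence s is a subsequence of sequence t."""
    pos = 0
    for itemset in s:
        # Advance to the earliest remaining itemset of t that contains this one.
        while pos < len(t) and not set(itemset) <= set(t[pos]):
            pos += 1
        if pos == len(t):
            return False
        pos += 1  # the next itemset of s must match strictly later in t
    return True

def support(s, customer_sequences):
    """Count the customers whose sequences are super-sequences of s."""
    return sum(1 for t in customer_sequences.values() if is_subsequence(s, t))

# The two example customer sequences.
customer_sequences = {
    "X": [{"A", "B", "C"}, {"D", "E"}, {"G"}],
    "Y": [{"A", "F"}, {"D", "E", "G"}],
}

print(support([{"A"}, {"D", "E"}], customer_sequences))  # both X and Y contain it
print(support([{"D"}, {"G"}], customer_sequences))       # only X: in Y, D and G share one itemset
```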
Given a positive integer k, a sequence in the form <S1, S2, S3, …, Sm> is said to be a k-sequence if
Σ_{i=1}^{m} |Si| = k, that is, |S1| + |S2| + … + |Sm| = k.
Can the Apriori algorithm be adapted to mine all k-sequences with support at least 2, where k = 2, 3,
4, …? If yes, please write down the proposed method using the concept of the Apriori algorithm and
illustrate your algorithm with the above example. If no, please explain the reason.

(b) We want to study the same problem setting described in (a). However, each customer is associated with
one binary attribute called “Rich” to indicate whether this customer is rich or not. There are only 2
possible values of this attribute, namely “Yes” and “No”. In our example, customer X could have “Yes”
in attribute “Rich” and customer Y could have “No” in attribute “Rich”.

Given a k-sequence S and a value v in attribute “Rich”, the support of S with respect to value v is
defined to be the total number of customers whose sequences are super-sequences of S and who are
associated with value v in attribute “Rich”. The important ratio of S is defined to be the support of S
with respect to value “Yes” divided by the support of S with respect to value “No”.

Can the Apriori algorithm be adapted to mine all k-sequences with important ratio at least 2 and
support at least 1, where k = 2, 3, 4, …? If yes, please write down the proposed method using the
concept of the Apriori algorithm and illustrate your algorithm with the above example. If no, please
explain the reason.

Note that when we compute the important ratio, if we encounter a division of a non-zero number by zero,
we could regard it as a positive infinity value.
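To make the important ratio concrete, here is a minimal sketch. It assumes the labelling suggested in the text (X is “Yes”, Y is “No”) and lists of sets for sequences; the names support_wrt and important_ratio are hypothetical. The text only defines the case of a non-zero number divided by zero, so returning None for 0/0 is an assumption of this sketch.

```python
import math

def is_subsequence(s, t):
    """Return True if sequence s is a subsequence of sequence t."""
    pos = 0
    for itemset in s:
        while pos < len(t) and not set(itemset) <= set(t[pos]):
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

customer_sequences = {
    "X": [{"A", "B", "C"}, {"D", "E"}, {"G"}],
    "Y": [{"A", "F"}, {"D", "E", "G"}],
}
rich = {"X": "Yes", "Y": "No"}  # assumed labels, as suggested in the text

def support_wrt(s, value):
    """Support of s counted only over customers whose "Rich" attribute equals value."""
    return sum(
        1
        for customer, t in customer_sequences.items()
        if rich[customer] == value and is_subsequence(s, t)
    )

def important_ratio(s):
    yes = support_wrt(s, "Yes")
    no = support_wrt(s, "No")
    if no == 0:
        # Non-zero divided by zero is treated as positive infinity;
        # 0/0 is not defined in the text, so None is returned (assumption).
        return math.inf if yes > 0 else None
    return yes / no

print(important_ratio([{"A"}, {"G"}]))  # supported by both X and Y
print(important_ratio([{"D"}, {"G"}]))  # supported by X only
```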

Q2 [20 Marks]

Given a positive integer K, we denote SK to be a set of K-itemsets with support at least 1.

Given a positive integer K and a positive integer l, we define a set SK, l which is a subset of SK such that
each K-itemset in SK, l has its support at least sl where sl is the l-th greatest value in the multi-set of the
supports of all K-itemsets in SK. For example, the second greatest value in a multi-set of {4, 4, 3, 2} is 4
while the second greatest value of another multi-set of {4, 3, 3, 2} is 3.
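The l-th greatest value of a multi-set, and the selection it induces, can be sketched as follows. The function names and the itemset labels P, Q, R, T are hypothetical; the support values are the two multi-sets from the example above.

```python
def lth_greatest(values, l):
    """Return the l-th greatest value of a multi-set (duplicates count separately)."""
    return sorted(values, reverse=True)[l - 1]

def select_top(supports, l):
    """Keep every itemset whose support is at least the l-th greatest support."""
    threshold = lth_greatest(supports.values(), l)
    return {itemset for itemset, sup in supports.items() if sup >= threshold}

print(lth_greatest([4, 4, 3, 2], 2))  # 4
print(lth_greatest([4, 3, 3, 2], 2))  # 3

# Hypothetical itemsets P, Q, R, T carrying the supports of the second multi-set.
print(select_top({"P": 4, "Q": 3, "R": 3, "T": 2}, 2))  # P, Q and R all reach the threshold 3
```

Note that ties at the threshold mean the selected set can contain more than l itemsets.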

We are given six items, namely A, B, C, D, E and F.

Suppose l is fixed and is set to 2.

We want to find SK, l for K = 1, 2 and 3.

The following shows four transactions with six items. Each row corresponds to a transaction where 1
corresponds to a presence of an item and 0 corresponds to an absence.
A B C D E F
0 0 1 1 0 0
0 1 0 0 1 1
1 0 1 1 0 0
1 0 1 1 0 0
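For readers who want to check supports by program, the table can be loaded as below. This is a sketch only; the variable names and the support helper are illustrative and not part of the assignment.

```python
items = ["A", "B", "C", "D", "E", "F"]
rows = [
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
]

# Turn each binary row into the set of items present in that transaction.
transactions = [{item for item, bit in zip(items, row) if bit} for row in rows]

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

print(support({"A"}))       # 2
print(support({"C", "D"}))  # 3
print(support({"A", "B"}))  # 0
```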

(a) (i) What is S1, 2?

(ii) What is S2, 2?
(iii) What is S3, 2?
(b) Can algorithm FP-growth be adapted to find S1, 2, S2, 2 and S3, 2? If yes, please write down how to
adapt algorithm FP-growth and illustrate the adapted algorithm with the above example. If no,
please explain the reason.
(c) The problem of finding SK, l has two parameters, K and l. In the traditional problem of finding
frequent itemsets, we need to provide only one parameter, a support threshold.

Setting one more parameter may seem troublesome (compared with the traditional frequent itemset
mining you learnt). What are the advantages of the problem of finding SK, l compared with the
traditional problem?

Q3 [20 Marks]

Consider the following eight two-dimensional data points:

x1:(17, 12), x2: (5, 12), x3: (17, 14), x4: (5, 16), x5: (20, 15), x6: (3, 9), x7: (12, 3), x8: (12, 32)

Consider algorithm k-means.

(a) Please answer the following questions. You are required to show the information about each final cluster
(including the mean of the cluster and all data points in this cluster). You can consider writing a program
for this part but you are not required to submit the program.
(i) If k = 2 and the initial means are (12, 3) and (12, 32), what is the output of the algorithm?
(ii) If k = 2 and the initial means are (5, 12) and (17, 12), what is the output of the algorithm?
(iii) If k = 3 and the initial means are (12, 3), (12, 32) and (5, 12), what is the output of the algorithm?
(iv) If k = 4 and the initial means are (12, 3), (12, 32), (5, 12) and (17, 12), what is the output of the
algorithm?
(b) What are the advantages and the disadvantages of algorithm k-means?
For each disadvantage, please also give a suggestion to enhance algorithm k-means.
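Since part (a) allows writing a program, the following is a minimal k-means sketch, not a reference solution. It assumes Euclidean distance, breaks ties in favour of the lower-indexed mean, and keeps the old mean when a cluster becomes empty; all three are assumptions of this sketch rather than rules stated in the question. The toy points are hypothetical and are not the homework data.

```python
import math

def kmeans(points, initial_means, max_iter=100):
    """Run k-means from the given initial means; return (means, clusters)."""
    means = [tuple(float(v) for v in m) for m in initial_means]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda j: math.dist(p, means[j]))
            clusters[nearest].append(p)
        # Update step: recompute each mean; an empty cluster keeps its old mean.
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else mean
            for cluster, mean in zip(clusters, means)
        ]
        if new_means == means:  # no mean moved, so the assignment is stable
            break
        means = new_means
    return means, clusters

# Hypothetical toy data, not the homework points.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
means, clusters = kmeans(toy_points, [(0, 0), (10, 0)])
print(means)     # [(0.0, 1.0), (10.0, 1.0)]
print(clusters)  # [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
```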

Q4 [20 Marks]

Consider eight data points.

The following matrix shows the pairwise distances between any two points.
      1    2    3    4    5    6    7    8
1     0
2    11    0
3     5   13    0
4    12    2   14    0
5     7   17    1   18    0
6    13    4   15    5   20    0
7     9   15   12   16   15   19    0
8    11   20   12   21   17   22   30    0


(a) Please use the agglomerative approach to group these points, using group average linkage as the
distance between clusters. Draw the corresponding dendrogram for the clustering. You are required to
specify the distance metric in the dendrogram.
(b) Suppose that we want to find 4 clusters. According to the dendrogram in (a), please state the 4 clusters.
For each cluster, please include all data points involved.
(c) (i) What is the greatest possible number of data points in a cluster containing data 1 and data 5?
(ii) What is the smallest possible number of data points in a cluster containing data 1 and data 5?
(d) Suppose that data points satisfy the triangle inequality. That is, for any three data points, a, b and c, we
have |a, b| + |b, c| ≥ |a, c| where |a, b| denotes the pairwise distance between a and b, and |b, c| and |a, c|
have similar meanings. Does the triangle inequality enhance the agglomerative approach? If yes, please
elaborate it. If no, please give the reason.
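As a reminder of what group average linkage computes in part (a), the sketch below evaluates the average pairwise distance between two clusters using the distance matrix of this question. The dictionary layout and the names dist and group_average are assumptions of this sketch, and the two clusters passed in are arbitrary examples rather than steps of the expected answer.

```python
# Lower triangle of the pairwise distance matrix, keyed by the larger point number.
# lower[p][q - 1] is the distance between point p and point q, for q < p.
lower = {
    2: [11],
    3: [5, 13],
    4: [12, 2, 14],
    5: [7, 17, 1, 18],
    6: [13, 4, 15, 5, 20],
    7: [9, 15, 12, 16, 15, 19],
    8: [11, 20, 12, 21, 17, 22, 30],
}

def dist(a, b):
    """Symmetric lookup of the distance between points a and b."""
    if a == b:
        return 0
    hi, lo = max(a, b), min(a, b)
    return lower[hi][lo - 1]

def group_average(c1, c2):
    """Average of all pairwise distances between a point of c1 and a point of c2."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

print(group_average({3, 5}, {1}))  # (5 + 7) / 2 = 6.0
print(group_average({2, 4}, {6}))  # (4 + 5) / 2 = 4.5
```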

Q5 [20 Marks]
(a) In class, we learnt k-means, where the center of each cluster is the “average” or “mean” of all data points
assigned to this cluster. k-medoids is exactly the same as k-means except that the center of each cluster is
defined to be the data point assigned to this cluster that has the minimum average distance to the other
points assigned to this cluster.
One disadvantage of method k-medoids is that all data points given are based on numeric attributes only.
The major principle of k-medoids is to find the closest center for each data point and update the center of
each cluster with all points assigned to this cluster iteratively.
Consider that the data points have only categorical attributes. We would like to adopt the major principle of
k-medoids for clustering on these data points. Please write down how you adapt this k-medoids method for
this purpose.
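The medoid definition in (a) can be illustrated with a short sketch. It assumes numeric points and Euclidean distance by default, with the distance function passed as a parameter; the name medoid and the toy cluster are hypothetical, and the sketch does not address the categorical adaptation the question asks for.

```python
import math

def medoid(cluster, distance=math.dist):
    """Return the point of the cluster with the minimum average distance to the others."""
    if len(cluster) == 1:
        return cluster[0]

    def average_distance(i):
        others = (q for j, q in enumerate(cluster) if j != i)
        return sum(distance(cluster[i], q) for q in others) / (len(cluster) - 1)

    best = min(range(len(cluster)), key=average_distance)
    return cluster[best]

# Hypothetical toy cluster: the mean is (2.0, 0.0), which is not a data point,
# while the medoid is always one of the given points.
toy_cluster = [(0, 0), (1, 0), (5, 0)]
print(medoid(toy_cluster))  # (1, 0)
```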

(b) Another disadvantage of method k-means is that k (i.e., the number of clusters) should be pre-
determined.
One may suggest the following method to determine parameter k.
• Step 1: Set variable e_0 to ∞
• Step 2: Set k to 1 initially
• Step 3: Run the original k-means method and obtain k cluster centers (or means)
• Step 4: Set variable e_k to the sum of the distances between points and their closest cluster centers
(according to the k cluster centers found).
• Step 5: If e_k converges (i.e., (e_{k-1} − e_k) is equal to 0 or an extremely small number), then return k.
Otherwise, increment k by 1 and repeat Step 3 to Step 4.
Can the above method determine a good value for k (i.e., the number of clusters)? Please explain.
If your answer is no, please also give an algorithm to determine a good value for k and explain why it is
better than the above method.
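The quantity computed in Step 4 can be written down directly. The sketch below assumes Euclidean distance and takes the cluster centers as given, however they were obtained; the function name and the toy data are hypothetical.

```python
import math

def sum_of_distances(points, centers):
    """Sum, over all points, of the distance to the closest center (Step 4)."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

# Hypothetical toy data on a line.
toy_points = [(0, 0), (2, 0), (10, 0)]
print(sum_of_distances(toy_points, [(1, 0)]))           # 1 + 1 + 9 = 11.0
print(sum_of_distances(toy_points, [(1, 0), (10, 0)]))  # 1 + 1 + 0 = 2.0
```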
