IS328 Final Exam
Name: ID Number:
USP
THE UNIVERSITY OF THE
SOUTH PACIFIC
IS328: Data Mining
Faculty of Science, Technology and Environment
School of Computing Information and Mathematical Sciences
Final Examination
Semester 2, 2016
Mode (Face to Face)
Duration of Exam: 3 hours + 10 minutes
Reading Time: 10 minutes
Writing Time: 3 hours
Total marks: 100
Question and Answer Booklet
Instructions
This exam has three sections: A, B, and C.
Answer ALL questions in Sections A and B.
Answer ONLY ONE question from Section C.
Answer multiple choice questions on the multiple choice grid provided on page 6.
Write your answers to Sections B and C in the answer script provided.
This exam paper has 11 pages including this cover page.
You may use non-programmable calculators.
This exam is worth 80% of your overall mark. Minimum pass mark for this exam is 40/100.
Hand this examination booklet to your supervisor when you complete the
examination.
Multiple Choice Questions
(01) Which of the following is not an attribute type in data mining?
A Interval
B Ratio
C Random
D Ordinal
E Nominal
Consider two objects represented by the tuples
P(22, 1, 42, 10) and Q(20, 0, 36, 8) and answer the questions 2 to 4.
(02) The Euclidean distance between the two objects P and Q.
A 11.6
B 7.6
C 6.71
D 6.17
(03) The Manhattan distance between the two objects P and Q.
A 9
B 11
C 13
D 5
(04) The Minkowski distance between the two objects P and Q, using
A 5.16
B 6.3
C 5.61
D 6.15
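The three distances in questions 2 to 4 are all instances of the Minkowski distance. The following is a minimal sketch, assuming the tuples read P = (22, 1, 42, 10) and Q = (20, 0, 36, 8); the helper name `minkowski` is illustrative, and the order 3 used for the last value is an assumption because the order is not legible in question 4.

```python
def minkowski(p, q, order):
    """Minkowski distance of the given order between two equal-length tuples."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)


P = (22, 1, 42, 10)
Q = (20, 0, 36, 8)

manhattan = minkowski(P, Q, 1)  # order 1: sum of absolute differences
euclidean = minkowski(P, Q, 2)  # order 2: square root of summed squares
order3 = minkowski(P, Q, 3)     # order 3 is an assumed value, not stated in the question

print(round(manhattan, 2), round(euclidean, 2), round(order3, 2))
```

Using one function for all three cases makes it easy to check how the result changes with the order.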
(05) Assume that we have the following frequent itemsets for a given transaction database:
{{A,B}, {A,C}, {B,C}, {A,D}, {B,D}, {A,B,C} and {A,B,D}}
How many different candidate rules are there?
2
1%
16
2
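One way to count candidate rules is to enumerate, for each frequent itemset, every split into a non-empty antecedent and a non-empty consequent. The sketch below assumes that counting convention and assumes the seven itemsets read as {A,B}, {A,C}, {B,C}, {A,D}, {B,D}, {A,B,C} and {A,B,D}; the function name `candidate_rules` is illustrative.

```python
from itertools import combinations

frequent_itemsets = [
    {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "D"}, {"B", "D"},
    {"A", "B", "C"}, {"A", "B", "D"},
]


def candidate_rules(itemset):
    """Return all (antecedent, consequent) splits of one itemset."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):  # antecedent must be a non-empty proper subset
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


all_rules = [rule for s in frequent_itemsets for rule in candidate_rules(s)]
print(len(all_rules))
```

Under this convention an itemset of size k contributes 2^k - 2 rules, so each pair gives 2 rules and each triple gives 6.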
(06) A common weakness of association rule mining is that
A It is too inefficient
B It produces too many rules
C It produces not enough interesting rules
D All of the above
Use the three-class confusion matrix below to answer questions (07) through (10)
Computed Decision
Class 1 | Class 2 | Class 3
Class1 | 10 5
Class2 | 5 16
Class3 | 2 4 "
(07) How many instances were correctly classified?
A
B36
cs
DoT
(08) How many instances were incorrectly classified with class 2?
A 6
B 7
C 5
D 9
(09) Which class instances were classified with the least error rate?
A Class 1
B Class 2
C Class 3
D Class 1 and Class 2
(10) What is the misclassification error rate for the model?
A 40%
B 50%
C 60%
D 70%
(11) Given the following two objects.
Attribute1 Attribute2 Attribute3 Attribute4
Object 1 10 0
Object 2 1 o 4 °
What is the distance between the objects if all variables are symmetric?
What is the distance between the objects if all variables are asymmetric?
A 0.5, 0.5
B 0.5, 0.67
C 0.67, 0.5
D 0.67, 0.67
(12) Which of the following is not a normalisation method?
A min-max normalisation
B decimal scaling
C z-score normalisation
D logarithmic normalisation
(13) Assume the APRIORI algorithm identified the following seven 4-itemsets that satisfy a user
given support threshold:
acde, acdf, adfg, bcde, bcef, bcdf, cdef.
What initial candidate 5-itemsets are created by the APRIORI algorithm?
A acdef, boot
B acdef, bcdef
C bcdef, acd
D bcdef, abed
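The join step of Apriori candidate generation can be sketched as follows: two sorted k-itemsets are merged when they agree on their first k-1 items. This is a minimal illustration, assuming the seven 4-itemsets read acde, acdf, adfg, bcde, bcdf, bcef and cdef (several are garbled in the scan); `apriori_join` is an illustrative name, and the subset-based pruning step is deliberately left out.

```python
def apriori_join(frequent_k):
    """Join step: merge pairs of sorted k-itemsets that share their first k-1 items."""
    sorted_sets = sorted(tuple(sorted(s)) for s in frequent_k)
    candidates = []
    for i, a in enumerate(sorted_sets):
        for b in sorted_sets[i + 1:]:
            if a[:-1] == b[:-1]:
                candidates.append(a + (b[-1],))
    return candidates


# Assumed reading of the seven frequent 4-itemsets from the question.
frequent_4 = ["acde", "acdf", "adfg", "bcde", "bcdf", "bcef", "cdef"]
candidates_5 = ["".join(c) for c in apriori_join(frequent_4)]
print(candidates_5)
```

Only the join is shown because the question asks for the initial candidates; a full implementation would then discard any candidate that has an infrequent 4-item subset.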
(14) Which of the following is not a data reduction technique?
A Data cube aggregation
B Dimensionality Reduction
C Data Compression
D Data Transformation
E Numerosity Reduction
(15) Suppose that doc1 and doc2 are two vectors as follows:
doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc2 = (2, 0, 1, 0, 5, 1, 0, 5, 0, 2)
The cosine similarity between the two vectors is
A 0.798
B 0.756
C 0.765
D 0.759
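Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The sketch below uses the vector entries as they can be read from the scan, which is an assumption: some entries may differ from the original paper, so the printed value should not be taken as the intended answer.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc2 = (2, 0, 1, 0, 5, 1, 0, 5, 0, 2)  # as read from the scan; entries may be mis-transcribed

print(round(cosine_similarity(doc1, doc2), 3))
```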
(16) Suppose a group of 12 students with the test scores listed as follows:
19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95
By partitioning them into four bins using the equal-width method,
how many numbers are there in the third bin?
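Equal-width binning splits the range from the minimum to the maximum value into intervals of the same width. The following is a minimal sketch, assuming left-closed intervals with the maximum value placed in the last bin; other boundary conventions are possible and could move values that sit exactly on a boundary.

```python
scores = [19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95]
num_bins = 4

low, high = min(scores), max(scores)
width = (high - low) / num_bins  # (99 - 19) / 4 = 20

bins = [[] for _ in range(num_bins)]
for s in scores:
    # The maximum value falls on the last boundary, so clamp it into the last bin.
    index = min(int((s - low) // width), num_bins - 1)
    bins[index].append(s)

for i, b in enumerate(bins, start=1):
    print(f"bin {i}: {sorted(b)}")
```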
Consider the transactions below and answer the questions 17, 18 and 19.
Transaction-id   Items
1                A, B, C, E
2                A, B, D, E
3                B, C, D, E
4                B, D, E
5                A, B, D
6                B, E, C
7                B, A, E
8                C, B, E
9                B, E
10               C, E
(17) What is the support of the itemset {B, C, E}?
A 20%
B 30%
C 40%
D 50%
E 60%
(18) The length of the largest possible frequent itemset is
A 2
B 3
C 4
D 5
E 6
(19) Which of the following rules has the highest confidence?
A R1: A → B, D
B R2: B → D, E
C R3: C → B, E
D R4: D → A, E
E R5: E → A, B
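Support and confidence for the transaction table above can be computed directly from their definitions. The sketch below assumes the ten transactions read as transcribed in the table; the rule C → {B, E} is used purely as an illustration, and the function names are hypothetical.

```python
transactions = [
    {"A", "B", "C", "E"}, {"A", "B", "D", "E"}, {"B", "C", "D", "E"},
    {"B", "D", "E"}, {"A", "B", "D"}, {"B", "E", "C"}, {"B", "A", "E"},
    {"C", "B", "E"}, {"B", "E"}, {"C", "E"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


print(support({"B", "C", "E"}))
print(confidence({"C"}, {"B", "E"}))  # illustrative rule C -> {B, E}
```

The same two functions can be applied to any other rule to compare confidences.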
(20) Which of the following are strategies for data transformation?
A Smoothing
B Attribute Construction
C Aggregation
D All of the above
Section A - Multiple Choice Questions (Each question has only one answer)
1) (A) (8) (©) (©)
2 (A) (8) () 6) OE)
3) A) (8) () 6) OE)
4 ~) () (©) ©
5) (A) (8) (C) (OE)
6 (A) (8) (Cc) (OE)
n A (@) (©) © €
8) (A) (8) (Cc) (0) (BE)
9) (A) (8) (C) (BD) (E)
10) (A) (B) (Cc) (BD) (E)
11) (A) (8) (C) (DE)
12) (A) (8) (C) (DB) (ED
13) (A) (B) (C) (D) (ED
14) (A) (8) (C) (D) ED
15) (A) (8) (C) (DB) (ED
16) (A) (8) (C) (DB) (EY
17) (4) (8) (C) OE)
18) (A) (B) (Cc) (0) (E)
19) (A) (B) (Cc) (0) (E)
20) (A) (B) (C) (0) (E)Name: ID Number:
Section B
Short Answers and Calculations (60 Marks)
Question 21: Frequent Itemset and Sequence Mining (20 marks)
(a) Explain the difference between the following: [2 marks]
(i) Frequent itemset
(ii) Candidate itemset
(b) Given are the following five transactions on items {A, B, C, D, K}:
TID Items
100 (A.B.K]
200 (A.B
300, (A.D.
40 (C.D)
00 ick
‘0 (A.D.
Use the Apriori algorithm to compute all frequent itemsets, and their support, with minimum
support of 33.34%. It is important that you clearly indicate the steps of the algorithm. [8
marks]
(c) Which of the itemsets from (b) are closed? Which of the itemsets from (b) are maximal?
[2 marks]
Consider the following frequent 3-sequences and answer the questions (d) to (f).
<{1, 2, 3}>, <{1, 2}{3}>, <{1}{2, 3}>, <{1, 2}{4}>, <{1, 3}{4}>,
<{1, 2, 4}>, <{2, 3}{3}>, <{2, 3}{4}>, <{2}{3}{3}>, and <{2}{3}{4}>
(d) List all the candidate 4-sequences produced by the candidate generation step of the
Generalized Sequential Pattern (GSP) algorithm. [3 marks]
(e) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP
algorithm (assuming no timing constraints). [2 marks]
(f) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP
algorithm (assuming maxgap = 1). [3 marks]
Question 22: Classification Techniques (20 marks)
(a) Describe the principles and ideas of decision tree-based classification. [4 marks]
(b) Derive all possible rules from the decision tree below and write down a set of classification
rules. [4 marks]
[Figure: decision tree with the nodes Refund (branches Yes / No), Marital Status (branches {Single, Divorced} / {Married}) and Taxable Income]
(c) What is a confusion matrix? [1 mark]
Using the following test set, evaluate the above model. Create the confusion matrix and
calculate the classification accuracy and error rate. [7 marks]
Refund | Marital Status | Taxable Income
No     | Divorced       | 75000
Yes    | Single         | $90.
No     | Divorced       | 100000
Yes    | Married        | 000
No     | Single         | 35000
No     | Married        | $5000
(d) A dataset of 1000 cases was partitioned into a training set of 600 cases and a validation set of
400 cases. A k-Nearest Neighbours model with k = 1 had a misclassification error rate of 8%
on the validation data. It was subsequently found that the partitioning had been done
incorrectly and that 100 cases from the training data set had been accidentally duplicated and
had overwritten 100 cases in the validation dataset. What is the misclassification error rate for
the 300 cases that were truly part of the validation data? [4 marks]
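The arithmetic behind part (d) can be sketched as below. It rests on one stated assumption: with k = 1, each duplicated case finds its own copy in the training set as nearest neighbour and is therefore classified correctly, so all observed errors come from the genuine validation cases. Variable names are illustrative.

```python
validation_size = 400
reported_error_rate = 0.08
duplicated_cases = 100  # training cases that overwrote validation cases

total_errors = reported_error_rate * validation_size  # misclassified cases overall

# Assumption: with k = 1 each duplicated case matches its own copy in the
# training set, so the duplicates contribute no errors.
errors_on_duplicates = 0

true_validation_size = validation_size - duplicated_cases
true_error_rate = (total_errors - errors_on_duplicates) / true_validation_size
print(f"{true_error_rate:.4f}")
```

If two training cases with identical attributes carried different class labels, the assumption would not hold exactly.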
Question 23: Cluster Analysis (20 marks)
(a) List and briefly describe the following three approaches for clustering. [3 marks]
(i) Partitioning Methods
(ii) Hierarchical Methods
(iii) Density-Based Methods
(b) List at least six requirements of clustering in data mining. [3 marks]
(c) K-Means Clustering - 8 marks
Suppose you want to cluster the eight points shown below using K-means.
      A1   A2
x1    2    10
x2    2    5
x3    8    4
x4    5    8
x5    7    5
x6    6    4
x7    1    2
x8    4    9
Assume that k = 3 and that initially the points are assigned to clusters as
follows: C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}. Apply the k-means
algorithm until convergence (i.e., until the clusters do not change),
using the Manhattan distance. Make sure you clearly identify the final clustering
and show your steps. Give the value of the k-means error function after
convergence. [8 marks]
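The iteration described above (recompute centroids, reassign each point to the nearest centroid, stop when nothing changes) can be sketched as follows. The coordinates are an assumption, since the table in the scan is only partly legible, and the sketch uses the arithmetic mean as the centroid with Manhattan distance for assignment. It does not handle a cluster becoming empty, and it does not compute the error function.

```python
# Assumed coordinates; several rows of the table are not legible in the scan.
points = {
    "x1": (2, 10), "x2": (2, 5), "x3": (8, 4), "x4": (5, 8),
    "x5": (7, 5), "x6": (6, 4), "x7": (1, 2), "x8": (4, 9),
}

# Initial assignment given in the question.
clusters = [["x1", "x2", "x3"], ["x4", "x5", "x6"], ["x7", "x8"]]


def centroid(names):
    """Arithmetic mean of the named points (fails on an empty cluster)."""
    xs = [points[n][0] for n in names]
    ys = [points[n][1] for n in names]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


while True:
    centres = [centroid(c) for c in clusters]
    new_clusters = [[] for _ in clusters]
    for name, p in points.items():
        nearest = min(range(len(centres)), key=lambda i: manhattan(p, centres[i]))
        new_clusters[nearest].append(name)
    if new_clusters == clusters:  # no point changed cluster: converged
        break
    clusters = new_clusters

print(clusters)
```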
(d) Hierarchical Clustering - 6 marks
Describe the principles and ideas regarding Agglomerative Hierarchical
Clustering. Show the different steps of the algorithm using the dissimilarity matrix
below and complete link clustering. Give partial results after each step. [6 marks]
Section C
Answer only ONE Question (20 Marks)
Question 24: General Data Mining Issues (20 marks)
a) Why do we pre-process data? Briefly describe the processes involved in data pre-processing.
[4 marks]
b) Explain the difference between classification and prediction. Illustrate the difference using
examples. [2 marks]
c) Briefly outline how to compute the dissimilarity between objects described by the following
types of variables:
(i) Numerical (interval-scaled) variables [2 marks]
(ii) Asymmetric binary variables [2 marks]
(iii) Categorical variables [2 marks]
d) Given the following measurements for the variable age:
18; 22; 25; 42; 28; 43; 33; 35; 56; 28;
standardize the variable by the following:
(i) Compute the mean absolute deviation of age. [4 marks]
(ii) Compute the z-score for the first four measurements. [4 marks]
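The two steps in part d) can be sketched as below. Two assumptions are made: the fused entry in the scan is read as the two values 42 and 28, and the z-score is scaled by the mean absolute deviation from step (i) rather than by the standard deviation, which is the other common choice.

```python
# Assumed reading of the ten measurements (the scan fuses 42 and 28).
ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]

mean_age = sum(ages) / len(ages)
mean_abs_dev = sum(abs(a - mean_age) for a in ages) / len(ages)

# z-scores for the first four measurements, scaled by the mean absolute deviation
z_scores = [(a - mean_age) / mean_abs_dev for a in ages[:4]]

print(mean_age, mean_abs_dev, [round(z, 2) for z in z_scores])
```

Scaling by the mean absolute deviation makes the standardized values less sensitive to outliers than scaling by the standard deviation.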
Question 25: Data Mining Applications and Big Data (20 marks)
a) Data Mining Applications - 6 marks
Data mining applications can be found in five areas, namely: financial data analysis (FDA);
retail and telecommunication industries (RTI); science and engineering (SE); intrusion
detection and prevention (IDP); and recommender systems (RS).
Choose ONLY one from the five areas named above and describe how the application will
work and the benefit(s) of such an application. [6 marks]
b) Privacy and Data Mining - 6 marks
Vinay banks with the Bank of the South Pacific (BSP), which has a robust data mining
system. The bank's data mining team has been studying Vinay's bank card usage patterns.
They notice that recently he has made numerous payments at Vinod Patel Hardware Stores.
Based on their data analysis, the bank then decided to contact him to discuss their special
loans package for home renovations.
(i) Discuss how this may conflict with your right to privacy. [2 marks]
(ii) Describe a privacy-preserving data mining method that may allow the bank to
perform customer pattern analysis without infringing on customers' right to privacy.
[2 marks]
(iii) Describe an example where data mining could be used to help society. [1 mark]
(iv) Explain how data mining may be detrimental to society. [1 mark]
c) Big Data Analytics - 8 marks
(i) What is Big Data in simple terms? [1 mark]
(ii) How can big data be described? [1 mark]
(iii) Discuss some key enablers for big data. [2 marks]
(iv) Explain the difference between structured data and unstructured data. Give examples. [2
marks]
(v) What are the special requirements for data mining procedures when
handling big data? [2 marks]
End of Paper
[Hand this examination booklet and the answer script to your supervisor when you complete
the examination]