IS328 Final Exam

IS328

Uploaded by

Tetz
Name:                ID Number:

THE UNIVERSITY OF THE SOUTH PACIFIC
IS328: Data Mining
Faculty of Science, Technology and Environment
School of Computing, Information and Mathematical Sciences
Final Examination, Semester 2, 2016
Mode: Face to Face
Duration of Exam: 3 hours + 10 minutes
Reading Time: 10 minutes
Writing Time: 3 hours
Total mark: 100

Question and Answer Booklet

Instructions
This exam has three sections: A, B, and C.
Answer ALL questions in Sections A and B.
Answer ONLY ONE question from Section C.
Answer multiple choice questions on the multiple choice grid provided on page 6.
Write your answers to Sections B and C in the answer script provided.
This exam paper has 11 pages including this cover page.
You may use non-programmable calculators.
This exam is worth 80% of your overall mark.
Minimum pass mark for this exam is 40/100.
Hand this examination booklet to your supervisor when you complete the examination.

Section A - Multiple Choice Questions

(01) Which of the following is not an attribute type in data mining?
     A Interval
     B Ratio
     C Random
     D Ordinal
     E Nominal

Consider two objects represented by the tuples P(22, 1, 42, 10) and Q(20, 0, 36, 8) and answer questions 2 to 4.

(02) The Euclidean distance between the two objects P and Q.
     A 116   B 76   C 6m   D 617   [option values garbled in source]

(03) The Manhattan distance between the two objects P and Q.
     9   U   1B   5   [option letters and values garbled in source]

(04) The Minkowski distance between the two objects P and Q, using [parameter missing in source]
     A 516   B 63   C Set   D Gis   [option values garbled in source]

(05) Assume that we have the following frequent itemsets for a given transaction database:
     {{A, B}, {A, C}, {B, C}, {A, D}, {B, D}, {A, B, C} and {A, B, D}}
     How many different candidate rules are there?
     2   1%   16   2   [option letters and values garbled in source]

(06) A common weakness of association rule mining is that
     A It is too inefficient
     B It produces too many rules
     C It produces not enough interesting rules
     D All of the above

Use the three-class confusion matrix below to answer questions (07) through (10).

                 Computed Decision
          | Class 1 | Class 2 | Class 3
  Class 1 |   10    |    5    |    ?
  Class 2 |    5    |   16    |    ?
  Class 3 |    2    |    4    |    ?
  [values shown in the order they appear in the source; cells marked ? are not legible]

(07) How many instances were correctly classified?
     A [illegible]   B 36   C [illegible]   D [illegible]

(08) How many instances were incorrectly classified with Class 2?
     6   7   5   9   [option letters garbled in source]

(09) Which class instances were classified with the least error rate?
     A Class 1
     B Class 2
     C Class 3
     D Class 1 and Class 2

(10) What is the misclassification error rate for the model?
     A 40%
     B 50%
     C 60%
     D 70%

11) Given the following two objects:

               Attribute1  Attribute2  Attribute3  Attribute4
     Object 1  [values garbled in source: 10 0]
     Object 2  [values garbled in source: 1 o 4 °]

    What is the distance between the objects if all variables are symmetric?
    What is the distance between the objects if all variables are asymmetric?
    [options garbled in source: 0s, 0s / 03, 067 / 067, 08 / 067, 067]

12) Which of the following is not a normalisation method?
    A min-max normalisation
    B decimal scaling
    C z-score normalisation
    D logarithmic normalisation

13) Assume the APRIORI algorithm identified the following seven 4-item sets that satisfy a user-given support threshold:
    acde, acdt, adfg, bede, beef, bed, cdef   [item sets partly garbled in source]
    What initial candidate 5-itemsets are created by the APRIORI algorithm?
    A acdef, boot
    B acdef, bedet
    C bodef, acd
    D bodef, abed
    [options partly garbled in source]

14) Which of the following is not a data reduction technique?
    A Data Cube Aggregation
    B Dimensionality Reduction
    C Data Compression
    D Data Transformation
    E Numerosity Reduction

15) Suppose that doc1 and doc2
are two vectors as follows:
    doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    doc2 = (2, 0, 1, 0, 5, 1, 0, 5, 0, 2)
    The cosine similarity between the two vectors is
    A 0.798
    B 0.756
    C 0.765
    D 0.759

16) Suppose a group of 12 students has the test scores listed as follows:
    19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95
    By partitioning them into four bins using the equal-width method, how many numbers are there in the third bin?
    [options not legible in source]

Consider the transactions below and answer questions 17, 18 and 19.

    Transaction-id   Items
    1                ABCE
    2                ABDE
    3                BCDE
    4                BDE
    5                ABD
    6                BEC
    7                BAE
    8                CBE
    9                BE
    10               CE

17) What is the support of the itemset {B, C, E}?
    A 20%
    B 30%
    C 40%
    D 50%
    E 60%

18) The length of the largest possible frequent itemset is
    A 2
    B 3
    C 4
    D 5
    E 6

19) Which of the following rules has the highest confidence?
    [options garbled in source: AREAS BD / RUB DE / RECS BE / DRED SAE / BREED AB]

20) Which of the following are strategies for data transformation?
    A Smoothing
    B Attribute Construction
    C Aggregation
    D All of the above

Section A - Multiple Choice Questions (Each question has only one answer)

     1) (A) (B) (C) (D) (E)
     2) (A) (B) (C) (D) (E)
     3) (A) (B) (C) (D) (E)
     4) (A) (B) (C) (D) (E)
     5) (A) (B) (C) (D) (E)
     6) (A) (B) (C) (D) (E)
     7) (A) (B) (C) (D) (E)
     8) (A) (B) (C) (D) (E)
     9) (A) (B) (C) (D) (E)
    10) (A) (B) (C) (D) (E)
    11) (A) (B) (C) (D) (E)
    12) (A) (B) (C) (D) (E)
    13) (A) (B) (C) (D) (E)
    14) (A) (B) (C) (D) (E)
    15) (A) (B) (C) (D) (E)
    16) (A) (B) (C) (D) (E)
    17) (A) (B) (C) (D) (E)
    18) (A) (B) (C) (D) (E)
    19) (A) (B) (C) (D) (E)
    20) (A) (B) (C) (D) (E)

Section B Short Answers and Calculations (60 Marks)

Question 21: Frequent Itemset and Sequence Mining (20 marks)

(a) Explain the difference between the following: [2 marks]
    (i) Frequent itemset
    (ii) Candidate itemset

(b) Given are the following five transactions on items {A, B, C, D, K}:

    TID   Items
    100   {A, B, K}
    200   [garbled in source: (A.B]
    300   [garbled in source: (A.D.]
    400   {C, D}
    500   [garbled in source: ick]
    600   [garbled in source: (A.D.]

    Use the Apriori algorithm to compute all frequent itemsets, and their support, with a minimum support of 33.34%. It is important that you clearly indicate the steps of the algorithm. [8 marks]

(c) Which of the itemsets from (b) are closed? Which of the itemsets from (b) are maximal? [2 marks]

Consider the following frequent 3-sequences and answer questions (d) - (f):

    <{1, 2, 3}>, <{1, 2}{3}>, <{1}{2, 3}>, <{1, 2}{4}>, <{1, 3}{4}>, <{1, 2, 4}>, <{2, 3}{3}>, <{2, 3}{4}>, <{2}{3}{3}>, and <{2}{3}{4}>

(d) List all the candidate 4-sequences produced by the candidate generation step of the Generalized Sequential Pattern (GSP) algorithm. [3 marks]

(e) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints). [2 marks]

(f) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming maxgap = 1). [3 marks]

Question 22: Classification Techniques (20 marks)

(a) Describe the principles and ideas of decision tree-based classification. [4 marks]

(b) Derive all possible rules from the decision tree below and write down a set of classification rules. [4 marks]

    [Figure: decision tree. Root node Refund (branches Yes, No); node Marital Status (branches Single/Divorced, Married); node Taxable Income. Leaf labels and the income split are not legible in the source.]

(c) What is a confusion matrix? [1 mark]
    Using the following test set, evaluate the above model. Create the confusion matrix and calculate the classification accuracy and error rate. [7 marks]

    Refund   Marital Status   Taxable Income
    No       Divorced         75000
    Yes      Single           [garbled in source: $90.]
    No       Divorced         [garbled in source: ro0000]
    Yes      Married          [garbled in source: 000]
    No       Single           35000
    No       Married          [garbled in source: $5000]

(d) A dataset of 1000 cases was partitioned into a training set of 600 cases and a validation set of 400 cases. A K-Nearest Neighbours model with k = 1 had a misclassification error rate of 8% on the validation data.
    It was subsequently found that the partitioning had been done incorrectly and that 100 cases from the training data set had been accidentally duplicated and had overwritten 100 cases in the validation dataset. What is the misclassification error rate for the 300 cases that were truly part of the validation data? [4 marks]

Question 23: Cluster Analysis (20 marks)

(a) List and briefly describe the following three approaches for clustering. [3 marks]
    (i) Partitioning Methods
    (ii) Hierarchical Methods
    (iii) Density-Based Methods

(b) List at least six requirements of clustering in data mining. [3 marks]

(c) K-Means Clustering - 8 marks
    Suppose you want to cluster the eight points shown below using K-means.

    [Table of the eight points x1 to x8 with attributes A1 and A2; values garbled in source: 2 [0 2s sie sie 715 o4 1? rm]

    Assume that k = 3 and that initially the points are assigned to clusters as follows: C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}. Apply the k-means algorithm until convergence (i.e., until the clusters do not change), using the Manhattan distance. Make sure you clearly identify the final clustering and show your steps. Give the value of the k-means error function after convergence. [8 marks]

(d) Hierarchical Clustering - 6 marks
    Describe the principles and ideas regarding Agglomerative Hierarchical Clustering. Show the different steps of the algorithm using the dissimilarity matrix below and complete link clustering. Give partial results after each step. [6 marks]

    [dissimilarity matrix not present in source]

Section C Answer only ONE Question (20 Marks)

Question 24: General Data Mining Issues (20 marks)

a) Why do we pre-process data? Briefly describe the processes involved in data pre-processing. [4 marks]

b) Explain the difference between classification and prediction.
   Illustrate the difference using examples. [2 marks]

c) Briefly outline how to compute the dissimilarity between objects described by the following types of variables:
   (i) Numerical (interval-scaled) variables [2 marks]
   (ii) Asymmetric binary variables [2 marks]
   (iii) Categorical variables [2 marks]

d) Given the following measurements for the variable age: 18; 22; 25; 42; 28; 43; 33; 35; 56; 28; standardize the variable by the following:
   (i) Compute the mean absolute deviation of age. [4 marks]
   (ii) Compute the z-score for the first four measurements. [4 marks]

Question 25: Data Mining Applications and Big Data (20 marks)

a) Data Mining Applications - 6 marks
   Data mining applications can be found in five areas, namely: financial data analysis (FDA); retail and telecommunication industries (RTI); science and engineering (SE); intrusion detection and prevention (IDP); and recommender systems (RS).
   Choose ONLY one from the five areas named above and describe how the application will work and the benefit(s) of such an application. [6 marks]

b) Privacy and Data Mining - 6 marks
   Vinay banks with the Bank of the South Pacific (BSP), which has a robust data mining system. The bank's data mining team has been studying Vinay's bank card usage patterns. They notice that recently he has made numerous payments at Vinod Patel Hardware Stores. Based on their data analysis, the bank then decided to contact him to discuss their special loans package for home renovations.
   (i) Discuss how this may conflict with your right to privacy. [2 marks]
   (ii) Describe a privacy-preserving data mining method that may allow the bank to perform customer pattern analysis without infringing on customers' right to privacy. [2 marks]
   (iii) Describe an example where data mining could be used to help society. [1 mark]
   (iv) Explain how data mining may be detrimental to society. [1 mark]

c) Big Data Analytics - 8 marks
   (i) What is Big Data in simple terms? [1 mark]
   (ii) How can big data be described?
   [1 mark]
   (iii) Discuss some key enablers for big data. [2 marks]
   (iv) Explain the difference between structured data and unstructured data. Give examples. [2 marks]
   (v) What are the special requirements for data mining procedures when handling big data? [2 marks]

End of Paper
[Hand this examination booklet and the answer script to your supervisor when you complete the examination]
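
Questions 2 to 4 ask for Euclidean, Manhattan and Minkowski distances between the tuples P and Q. The following is a minimal sketch of how those measures relate, not an official solution. The function names are hypothetical, and the Minkowski order h = 3 in the last line is an assumption for illustration only, because the order is missing from the source.

```python
def minkowski(p, q, h):
    """Minkowski distance of order h between two equal-length numeric tuples."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1.0 / h)


def manhattan(p, q):
    """Manhattan distance is the Minkowski distance with h = 1."""
    return minkowski(p, q, 1)


def euclidean(p, q):
    """Euclidean distance is the Minkowski distance with h = 2."""
    return minkowski(p, q, 2)


# Tuples taken from the stem of questions 2 to 4.
P = (22, 1, 42, 10)
Q = (20, 0, 36, 8)

print(f"Manhattan: {manhattan(P, Q):.2f}")
print(f"Euclidean: {euclidean(P, Q):.2f}")
# ASSUMPTION: h = 3 is chosen only to illustrate the general formula.
print(f"Minkowski (h=3): {minkowski(P, Q, 3):.2f}")
```

Writing Manhattan and Euclidean as special cases of one function makes it easy to see that only the exponent changes between the three questions.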

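
Questions 17 to 19 rely on the support of an itemset and the confidence of a rule over the ten listed transactions. The sketch below shows one way to compute both, under stated assumptions: the helper names `support` and `confidence` are hypothetical, the last transaction id is read as 10, and the rule {B, C} -> {E} is an illustrative example rather than one of the exam options.

```python
# Transactions from the table for questions 17 to 19.
# ASSUMPTION: the last row's id is read as 10.
transactions = {
    1: "ABCE", 2: "ABDE", 3: "BCDE", 4: "BDE", 5: "ABD",
    6: "BEC", 7: "BAE", 8: "CBE", 9: "BE", 10: "CE",
}
baskets = [frozenset(items) for items in transactions.values()]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Confidence of the rule antecedent -> consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    joint = support(frozenset(antecedent) | frozenset(consequent), baskets)
    return joint / base


print(f"support({{B, C, E}}) = {support('BCE', baskets):.0%}")
# Illustrative rule only; not one of the options in question 19.
print(f"confidence({{B, C}} -> {{E}}) = {confidence('BC', 'E', baskets):.2f}")
```

Storing each transaction as a `frozenset` lets the subset operator `<=` express "the transaction contains the itemset" directly, regardless of the order in which items are listed.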