
Data Mining Practice Final Exam Solutions

True/False Questions:

1. T F Our use of association analysis will yield the same frequent itemsets and strong
association rules whether a specific item occurs once or three times in an individual
transaction.

2. T F The k-means clustering algorithm that we studied will automatically find the best
value of k as part of its normal operation.

3. T F A density-based clustering algorithm can generate non-globular clusters.

4. T F In association rule mining the generation of the frequent itemsets is the
computationally intensive step.

Multiple Choice Questions

5. In the figure below, there are two clusters. They are connected by a line which represents the
distance used to determine inter-cluster similarity.

Which inter-cluster similarity metric does this line represent (circle one)?

a. MIN
b. MAX
c. Group Average
d. Distance between centroids

Short Form Questions

1. Sometimes a data set is partitioned such that a validation set is provided. What is the purpose of the
validation set?

The validation set is used to select amongst multiple models or to tune a specific model (which can
be viewed as a family of models).

Here is more explanation. In this sense it is used like a test set, in that it is used for evaluation, but it
is not like a test set in that it cannot be used for reporting the performance of the model. That is, the
validation set chooses a model (for example, the right amount of pruning), but then that model can
be evaluated on a test set and that performance number can be reported. If one data set could be used
both for finding the best model and for reporting the performance, you could generate one million
models, pick the one that does best on that data set, and report that performance. But that model
likely does not truly have the best performance; it just happened to do best on that one data set.
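
As a concrete sketch of this workflow (using scikit-learn and the iris data purely for illustration;
none of this is from the exam): the validation set selects the tree depth, and the test set, touched
only once, yields the reported number.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Split into train / validation / test portions.
    X, y = load_iris(return_X_y=True)
    X_trval, X_test, y_trval, y_test = train_test_split(X, y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval, random_state=0)

    # The validation set selects among the family of models (here, tree depths).
    best_depth, best_score = None, -1.0
    for depth in (1, 2, 3, 5, None):
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score > best_score:
            best_depth, best_score = depth, score

    # Only the chosen model is evaluated on the test set, and only once.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
    final.fit(X_train, y_train)
    print(final.score(X_test, y_test))   # this is the number you may report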

2. Are decision trees easy to interpret (circle one): Yes No

Yes, ease of interpretation is one of the main advantages of decision trees.

3. How can you convert a decision tree into a rule set? Explain the process.

Create one rule per leaf node by traversing the conditions from the root node to the leaf and
conjoining those conditions. Note that the rules would be mutually exclusive, meaning that only one
rule could “fire” at a time.
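
A minimal sketch of this process in Python (the nested-tuple tree encoding is an assumption
invented for this example, not a standard representation):

    # Each internal node is (test, subtree_if_true, subtree_if_false);
    # each leaf is a class label. One rule is produced per leaf.
    def tree_to_rules(node, conditions=()):
        if isinstance(node, str):                       # leaf
            yield conditions, node
        else:                                           # internal node
            test, if_true, if_false = node
            yield from tree_to_rules(if_true, conditions + (test,))
            yield from tree_to_rules(if_false, conditions + ("NOT " + test,))

    tree = ("outlook=sunny", ("humidity=high", "No", "Yes"), "Yes")
    for conds, label in tree_to_rules(tree):
        print(" AND ".join(conds) or "TRUE", "->", label)

Because each example follows exactly one root-to-leaf path, the printed rules are mutually
exclusive, as noted above.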

4. List two reasons why data mining is popular now when it wasn't as popular 20 years ago.

Faster computers, cheaper memory, more data being routinely recorded (e.g., popularity of the Web
and devices like smartphones), and to a lesser degree better algorithms.

5. How does an ordinal feature differ from a nominal feature? Explain in one or two sentences.

An ordinal feature is a nominal feature where there is a natural ordering of the attribute values.

6. Sally measures the pressure of all of the tires coming into her garage for an oil change and records the
values. Unknown to her, her tire gauge is miscalibrated and adds 3 psi to each reading. According to
the definition of noise used by our textbook, is this error introduced by the tire gauge considered
noise? Answer “yes” or “no” and justify your answer.

No, since noise must be random, not systematic.



7. For a two-class classification problem, with a positive class P and a negative class N, we can
describe the performance of the algorithm using the following terms: TP, FP, TN, and FN.

a) What do each of these terms refer to?

TP: True Positive

TN: True Negative

FP: False Positive

FN: False Negative

b) Place the 4 terms listed above in part a into the appropriate slots in the table below.

                 Predicted
                 Positive   Negative
Actual Positive  TP         FN
       Negative  FP         TN

c) Provide the formula for accuracy in terms of TP, TN, FP, and FN.

(TP + TN) / (TP + TN + FP + FN)

d) Provide the formula for precision and recall using TP, TN, FP, and FN.

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)
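
The formulas in parts (c) and (d) translate directly into code; here is a small sketch with
made-up counts:

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    # Hypothetical confusion matrix: TP=40, TN=45, FP=5, FN=10.
    print(accuracy(40, 45, 5, 10))   # 0.85
    print(precision(40, 5))          # ~0.889
    print(recall(40, 10))            # 0.8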

8. If we build a classifier and evaluate it on the training set and the test set:

a) Which data set would we expect to have the higher accuracy: training set test set

Answer: the training set.

b) Which data set provides the best estimate of accuracy on new data: training set test set

Answer: the test set.

9. A learning curve shows the performance of a classifier as the training set size increases. Assume that
training set size is plotted on the x-axis and accuracy is plotted on the y-axis.

a) On the figure below, plot a typical/expected learning curve when the accuracy is measured on
the 1) training set data and 2) the test set data (i.e., draw two curves). Should there be any
difference? If so, comment on the expected difference.
[Figure: blank axes for the sketch; y-axis: Accuracy, x-axis: Training set size]

The most important point is that the curve for the test set data increases, steeply at first, and then
begins to plateau. Do not worry too much about the training set curve, since its expected shape is less
clear.

10. You need to split on attribute a1 in your decision tree. The attribute has 8 values. Why might a two-way
split be better than an 8-way split? What might be a problem with the 8-way split?

The 8-way split can lead to the problem of data fragmentation. The data will be split up excessively
leaving smaller amounts of data available for future splits.

Entropy(t) = −Σ_j p(j|t) log p(j|t)          GINI(t) = 1 − Σ_j [p(j|t)]^2

11. Given a training set with 5+ and 10- examples,

a) What is the entropy value associated with this data set? You need not simplify your answer to
get a numerical answer.

-(1/3)log(1/3) - (2/3)log(2/3)

b) What is the Gini associated with this data set? In this case you should simplify your result,
although you may express the answer as a fraction rather than a decimal.

1 − [(1/3)^2 + (2/3)^2] = 1 − 1/9 − 4/9 = 1 − 5/9 = 4/9



c) If you generated a decision tree with just the root node for the examples in this data set, what
class value would you assign and what would be the training-set error rate associated with this
(very short) decision tree?

Majority class is the negative class, so classify it as negative. The training error rate is then 5/15 = 1/3.
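
The arithmetic in parts (a) through (c) can be checked with a few lines of Python:

    import math

    counts = [5, 10]                            # 5 positive, 10 negative examples
    probs = [c / sum(counts) for c in counts]   # [1/3, 2/3]

    entropy = -sum(p * math.log2(p) for p in probs)
    gini = 1 - sum(p ** 2 for p in probs)
    error = min(counts) / sum(counts)           # error of the majority-class tree

    print(entropy)   # ~0.918
    print(gini)      # ~0.444 (= 4/9)
    print(error)     # ~0.333 (= 1/3)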

12. The nearest neighbor algorithm relies on having a good notion of similarity, or distance. In class we
discussed several factors that can make it non-trivial to have a good similarity metric. What were two
of the factors?

A good similarity metric requires that the scales of the features are similar. For example, if one
feature varies from 1 to 100 and another from 1 to 1,000,000, then there is a problem and the values
should be rescaled. Another problem is that some features may be much less important than others,
and yet by default all features are considered equally important. Also, redundant or highly
correlated features will throw off the distance metric, because the related features will be
overweighted.
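
As a sketch of the rescaling fix for the first problem (the feature values below are made up):
min-max scaling maps each feature onto [0, 1] so that no single feature dominates the distance
computation.

    def min_max_scale(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    incomes = [20_000, 50_000, 1_000_000]   # spans roughly 10^6
    ages = [18, 45, 90]                     # spans roughly 10^2
    print(min_max_scale(incomes))           # both now lie in [0, 1]
    print(min_max_scale(ages))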

13. What classifier induction algorithm can effectively generate the most expressive classifiers, in terms
of the decision boundaries that can be formed? Which is the least expressive? Rank order them from
most to least expressive. Briefly justify your ordering.

The induction algorithms are: decision trees, linear classifiers, and nearest neighbor.

Most expressive: nearest neighbor

Middle: decision trees

Least expressive: linear classifier

Justification: a nearest-neighbor classifier can form arbitrarily complex piecewise decision
boundaries, a decision tree is limited to boundaries built from axis-parallel segments, and a linear
classifier can form only a single hyperplane.

14. What is the curse of dimensionality?

The curse of dimensionality is that as the number of features increases, the data points become
increasingly sparse in the instance space, which makes it harder to find patterns. For example, with 100
data points and one variable the space is likely dense, but with 100 features the same points will be
spread very thinly.
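
A rough way to see this empirically (the point counts here are arbitrary): hold the number of
random points fixed and watch the average nearest-neighbor distance grow with the dimension.

    import random

    def avg_nearest_neighbor_distance(n_points, dim):
        pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        # Average, over all points, of the distance to the closest other point.
        return sum(min(dist(p, q) for q in pts if q is not p)
                   for p in pts) / n_points

    for d in (1, 10, 100):
        print(d, avg_nearest_neighbor_distance(100, d))
    # Distances grow with d: the same 100 points become ever sparser.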

15. What does it mean if the rule set for a rule learner is exhaustive?

It means that the rules will collectively cover every possible example.

16. Does the Ripper rule learner build rules from general to specific or specific to general?

It builds rules from general to specific. It starts with a rule where the antecedent has no conditions and
then adds conditions one at a time.
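
Here is a toy sketch of that general-to-specific growth (the tiny dataset and the precision-based
scoring are simplifications assumed for illustration; Ripper's actual growing heuristic is FOIL's
information gain):

    # Start with an empty antecedent (covers everything) and greedily add
    # the condition that most improves precision on the positive class.
    data = [({"sky": "sunny", "windy": "no"}, True),
            ({"sky": "sunny", "windy": "yes"}, False),
            ({"sky": "rainy", "windy": "no"}, False)]
    conditions = [("sky", "sunny"), ("sky", "rainy"),
                  ("windy", "no"), ("windy", "yes")]

    def precision(rule):
        covered = [label for ex, label in data
                   if all(ex[attr] == val for attr, val in rule)]
        return sum(covered) / len(covered) if covered else 0.0

    rule = []
    while precision(rule) < 1.0:   # stops once the rule is pure on this toy data
        rule.append(max(conditions, key=lambda c: precision(rule + [c])))
    print(rule)   # [('sky', 'sunny'), ('windy', 'no')]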

17. (4 points) We generally will be more interested in association rules with high confidence. However,
often we will not be interested in association rules that have a confidence of 100%. Why? Then
specifically explain why association rules with 99% confidence may be interesting (i.e., what might
they indicate)?

While we generally prefer association rules with high confidence, a rule with 100% confidence most
likely represents some already known fact or policy (e.g., checking account → savings account may
just indicate that all customers are required to have a checking account if they have a savings
account). Rules with 99% confidence are interesting not because of the 99% part but because of the
1% part. These are the exceptions to the rule. They may indicate, for example, that a policy is being
violated. They might also indicate that there is a data entry error. Either way, it would be interesting
to understand why the 1% do not follow the general pattern.
18. (4 points) The algorithm that we used to do association rule mining is the Apriori algorithm. This
algorithm is efficient because it relies on and exploits the Apriori property. What is the Apriori
property?
The Apriori property states that if an itemset is frequent then all of its subsets must also be frequent.
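
The property is exploited in the candidate-pruning step; a minimal sketch (the tuple-based itemset
representation is assumed for illustration):

    from itertools import combinations

    # A size-k candidate survives only if every (k-1)-subset was frequent
    # in the previous pass; otherwise the Apriori property rules it out.
    def prune(candidates, prev_frequent):
        prev = set(map(frozenset, prev_frequent))
        return [c for c in candidates
                if all(frozenset(sub) in prev
                       for sub in combinations(c, len(c) - 1))]

    print(prune([("A", "B", "D"), ("A", "B", "E")],
                [("A", "B"), ("A", "D"), ("B", "D")]))
    # Only ('A', 'B', 'D') survives: {A, E} and {B, E} were not frequent.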

19. (4 points) Discuss the basic difference between the agglomerative and divisive hierarchical
clustering algorithms and mention which type of hierarchical clustering algorithm is more commonly
used.

Agglomerative methods start with each object as an individual cluster and then incrementally
build larger clusters by merging clusters. Divisive methods, on the other hand, start with all points
belonging to one cluster and then split apart a cluster at each iteration. The agglomerative method is
more common.
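
A tiny agglomerative run using SciPy (the four 2-D points are made up; linkage() repeatedly
merges the two closest clusters until one remains):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
    Z = linkage(X, method="average")               # group-average distance
    print(fcluster(Z, t=2, criterion="maxclust"))  # e.g., [1 1 2 2]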

20. (4 points) Are the two clusters shown below well separated? Circle an answer: Yes No
Now in one or two sentences justify your answer.

The clusters are not well separated because some points in each cluster are closer to points in the
other cluster than to points in their own cluster.

[Figure: two small clusters of points, each drawn with '+' marks, lying close to one another]

Long Problem (33 points)

1. A database has 4 transactions, shown below.

TID Date items_bought


T100 10/15/04 {K, A, D, B}
T200 10/15/04 {D, A, C, E, B}
T300 10/19/04 {C, A, B, E}
T400 10/22/04 {B, A, D}

Assuming a minimum level of support min_sup = 60% and a minimum level of confidence
min_conf = 80%:
(a) Find all frequent itemsets (not just the ones with the maximum width/length) using the
Apriori algorithm. Show your work—just showing the final answer is not acceptable. For
each iteration show the candidate and acceptable frequent itemsets. You should show your
work similar to the way the example was done in the PowerPoint slides.

Answer:

C1/L1:

Itemset      Support Count
{A}          4
{B}          4
{C}          2
{D}          3
{E}          2
{K}          1

C2/L2:

Itemset      Support Count
{A, B}       4
{A, D}       3
{B, D}       3

C3/L3:

Itemset      Support Count
{A, B, D}    3

The final answer is: {{A}, {B}, {D}, {A, B}, {A, D}, {B, D}, {A, B, D}}
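
The support counts in the tables above can be double-checked directly from the four transactions
(min_sup of 60% of 4 transactions means a count of at least 3):

    transactions = [set("KADB"), set("DACEB"), set("CABE"), set("BAD")]

    def support_count(itemset):
        # Number of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions)

    for itemset in [{"A"}, {"B"}, {"D"}, {"A", "B"}, {"A", "D"},
                    {"B", "D"}, {"A", "B", "D"}]:
        print(sorted(itemset), support_count(itemset))
    # Every itemset printed has count >= 3, i.e., support >= 75%.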

(b) List all of the strong association rules, along with their support and confidence values, which
match the following metarule, where X is a variable representing customers and item_i denotes
variables representing items (e.g., “A”, “B”, etc.).

∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) → buys(X, item3)

Hint: don’t worry about the fact that the statement above uses relations. The point of the
metarule is to tell you to only worry about association rules of the form X ∧ Y → Z (or {X,
Y} → Z if you prefer that notation). That is, you don’t need to worry about rules of the form
X → Z.
Grading: This part is worth 4 points. Each of the strong association rules is worth 2
points.

Answer:
buys(X, A) ∧ buys(X, B) → buys(X, D) (75%, 75%) Not Strong
buys(X, A) ∧ buys(X, D) → buys(X, B) (75%, 100%) Strong
buys(X, B) ∧ buys(X, D) → buys(X, A) (75%, 100%) Strong
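
These confidence values follow from conf(X → Y) = sup(X ∪ Y) / sup(X); a quick check:

    transactions = [set("KADB"), set("DACEB"), set("CABE"), set("BAD")]

    def support_count(s):
        return sum(s <= t for t in transactions)

    def confidence(lhs, rhs):
        return support_count(lhs | rhs) / support_count(lhs)

    print(confidence({"A", "B"}, {"D"}))   # 0.75 -> below min_conf = 80%
    print(confidence({"A", "D"}, {"B"}))   # 1.0  -> strong
    print(confidence({"B", "D"}, {"A"}))   # 1.0  -> strong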
