Data Mining Exercises - Solutions

The document presents solutions to 11 questions on data mining concepts: properties that make Python suitable for data mining, the output of several code snippets, corrections needed to make broken code work, the difference between supervised and unsupervised techniques, overfitting, the predictive modeling process, a worked K-means clustering iteration, the number of possible splits in a dataset, and information gain calculations. It provides explanations, examples, and step-by-step workings for each question.


1. Give 5 properties of Python and explain why Python is suitable for Data Mining.

Python is an easy-to-use, easy-to-read, expressive, object-oriented, open-source, and portable programming language, and it offers a large collection of libraries implementing data mining algorithms (e.g., pandas, NumPy, scikit-learn).

2. Write the output of the following code.

<class 'tuple'>

[2.0, 100, 5]
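The code itself is not reproduced in this extract; a minimal snippet that produces exactly this output (an assumption, not necessarily the original) is:

```python
t = (2.0, 100, 5)   # a tuple literal
print(type(t))      # <class 'tuple'>
print(list(t))      # [2.0, 100, 5] -- list() converts the tuple to a list
```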

3. (10 pts) Please make the necessary change in the given code so that it doesn’t give the following
error message and works as commented:

Line 1 must be → import pandas as pd
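The broken code is not reproduced here, but the fix implies that later lines refer to the pandas module through the alias pd. A minimal sketch (the DataFrame contents are assumed):

```python
import pandas as pd  # the corrected line 1: bind pandas to the alias pd

# Without the "as pd" alias, the line below would raise
# NameError: name 'pd' is not defined.
df = pd.DataFrame({'x': [1, 2, 3]})
print(df)
```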

4. What is the output of the following code? (Hint: if the loop test is False, then execution jumps to the else: row.)

4 320
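The original snippet is not reproduced in this extract. The hint refers to Python's loop-else construct, illustrated by this minimal sketch (the values here are illustrative, not the original code):

```python
# Python's loop-else: the else block runs only when the loop's test
# becomes False (i.e., the loop completes without hitting a break).
for n in [2, 4, 6]:
    if n % 2 != 0:
        break
else:
    print("no odd number found")   # executed: the loop ran to completion

while False:
    pass
else:
    print("while-test was False")  # executed immediately
```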
5. What is the output of the following code?

True

False

None
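The code itself is not shown in this extract; one minimal snippet that would print these three values (an assumption, not the original) is:

```python
def check(x):
    # A function path with no return statement implicitly returns None.
    if x > 0:
        return True
    if x < 0:
        return False

print(check(5))    # True
print(check(-5))   # False
print(check(0))    # None (no return statement was reached)
```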

6. We are writing a sublist function which compares two lists and returns True if the first list (lst1) is a sublist of (is contained inside) the second list (lst2). We created a version of the second list, ls2, in which we eliminated all elements of lst2 that are not in the first list, to see if the final lists are the same. However, even though the final lists contain the same elements, the function returns False.

This output needs to be True, since the elements of list [2,1] are also elements of list [1,2,5,3].

What property of lists can we use in the comparison (?==?) so that the function gives the correct result (True) in the example above?

Line 4 must be → return sorted(lst1) == sorted(ls2)

Note: Another sublist function given in the Apriori algorithm code runs faster.
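A minimal reconstruction of the sublist function described in question 6 (the original code is not reproduced, so everything beyond the names lst1, lst2, and ls2 is an assumption):

```python
def sublist(lst1, lst2):
    # Keep only the elements of lst2 that also occur in lst1.
    ls2 = [x for x in lst2 if x in lst1]
    # Lists are ordered, so [2, 1] != [1, 2] even though they contain
    # the same elements; sorting both sides makes the comparison
    # order-insensitive (the corrected line 4).
    return sorted(lst1) == sorted(ls2)

print(sublist([2, 1], [1, 2, 5, 3]))  # True
```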
7. What is the output of the following code?

25
81
75

8. a) Describe the difference between unsupervised and supervised techniques of Data Mining, and give an example of each. b) Define overfitting (o.f.); for which of the above techniques is o.f. a problem?

a) Supervised techniques can be used when a labeled dataset is available for training and testing, whereas unsupervised techniques do not have (or need) a labeled dataset. Unsupervised techniques are used to detect new patterns and clusters in relatively unknown or unstructured data, while supervised techniques are used to predict future data when there is enough structured and analyzed past information. K-means clustering is an unsupervised technique; decision tree analysis is an example of a supervised technique.

b) Overfitting is a problem of supervised techniques in which the model is customized too closely to the specific training data at hand, so that it does not perform well on the test data or on future data.
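A minimal sketch of overfitting using scikit-learn (an assumed library, not part of the original answer): an unpruned decision tree fits the training data perfectly but generalizes worse to held-out test data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic labeled dataset, split into training and test parts.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No depth limit: the tree is free to memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # typically 1.0 on the training data
print(tree.score(X_te, y_te))  # noticeably lower on unseen test data
```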

9. Describe the predictive modeling process. Which techniques are most suitable for modeling datasets with nominal categories? Give 2 examples of these techniques.

In predictive modeling, a labeled dataset is split into two parts: a training dataset and a test dataset. A model is built using the training dataset, and the test dataset is fed into the model to predict its labels. The actual labels of the test dataset and the predicted labels are then compared to evaluate the performance of the model. Classification techniques are most suited for predicting or describing datasets with binary or nominal categories; Decision Trees and Rule-Based Classifiers are two examples.
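The process above can be sketched in a few lines with scikit-learn (an assumed toolkit; the dataset and classifier choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # a labeled dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)  # build on training data
y_pred = model.predict(X_te)                                 # predict test labels
print(accuracy_score(y_te, y_pred))                          # compare with actual labels
```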
10. At one stage in K-means clustering of the given dataset with two attributes, the distances of the points to each centroid are given in the following table. What will the centroid coordinates be in the next stage?

Id   x   y   distance_from_1  distance_from_2  distance_from_3
 0  12  39   26.93            56.08            56.73
 1  28  30   14.14            41.76            53.34
 2  29  54   38.12            40.80            34.06
 3  24  55   39.05            45.88            37.44
 4  45  63   50.70            31.14            16.40
 5  52  70   59.93            32.25             6.71
 6  52  63   53.71            26.40            13.34
 7  55  58   51.04            20.62            18.00
 8  53  23   27.89            24.21            53.04
 9  55  14   29.07            30.87            62.00
10  64  19   38.12            23.35            57.71
11  69   7   43.93            35.01            70.41

To find the updated centroid coordinates, we first assign each point to the nearest existing centroid (check which centroid each point is closest to; e.g., points 0, 1, and 9 are closest to C1). We then take the arithmetic mean of the x and y coordinates of these points to find the updated coordinates of Centroid 1:

C1 = [a.m.(X0,X1,X9), a.m.(Y0,Y1,Y9)] = [(12+28+55)/3, (39+30+14)/3]

Similarly,

C2 = [a.m.(X8,X10,X11), a.m.(Y8,Y10,Y11)] = [(53+64+69)/3, (23+19+7)/3]

and

C3 = [a.m.(X2,X3,X4,X5,X6,X7), a.m.(Y2,Y3,Y4,Y5,Y6,Y7)] = [(29+24+45+52+52+55)/6, (54+55+63+70+63+58)/6]

Answer:

C1 = [31.67, 27.67], C2 = [62.0, 16.33], C3 = [42.83, 60.5]
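The assignment-and-update step can be checked with a short NumPy sketch (not part of the original answer; it reuses the distance table above):

```python
import numpy as np

# (x, y) coordinates of points 0..11 from the table.
points = np.array([[12, 39], [28, 30], [29, 54], [24, 55],
                   [45, 63], [52, 70], [52, 63], [55, 58],
                   [53, 23], [55, 14], [64, 19], [69, 7]], dtype=float)

# distances[i, j] = distance of point i to centroid j+1 (from the table).
distances = np.array([
    [26.93, 56.08, 56.73], [14.14, 41.76, 53.34], [38.12, 40.80, 34.06],
    [39.05, 45.88, 37.44], [50.70, 31.14, 16.40], [59.93, 32.25,  6.71],
    [53.71, 26.40, 13.34], [51.04, 20.62, 18.00], [27.89, 24.21, 53.04],
    [29.07, 30.87, 62.00], [38.12, 23.35, 57.71], [43.93, 35.01, 70.41]])

nearest = distances.argmin(axis=1)  # index of the closest centroid per point
for j in range(3):
    # New centroid = arithmetic mean of the points assigned to it.
    print(points[nearest == j].mean(axis=0).round(2))
# [31.67 27.67], [62.   16.33], [42.83 60.5 ]
```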

11. How many different splits can be made on the dataset given below, and what is the information gain of each?
Note: Use the "Entropy" measure for information gain, given by the following formula:

Entropy(t) = - Σ_i p(i|t) * log2(p(i|t))
Id  a1  a2  a3   Class
 1  T   T   1.0  +
 2  T   T   6.0  +
 3  T   F   5.0  -
 4  F   F   4.0  +
 5  F   T   7.0  -
 6  F   T   3.0  -
 7  F   F   8.0  -
 8  T   F   7.0  +
 9  F   T   5.0  -

8 splits are possible: one on a1, one on a2, and six on the continuous attribute a3 (at thresholds >= 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0).

Entropy original: -4/9 * log2(4/9) - 5/9 * log2(5/9) = 0.9911


Split information gains:

Ex. a1 children entropy = 4/9 * Entropy(3+,1-) + 5/9 * Entropy(1+,4-)
= 4/9 * [-3/4 * log2(3/4) - 1/4 * log2(1/4)] + 5/9 * [-1/5 * log2(1/5) - 4/5 * log2(4/5)] = 0.7616

Ex. a2 children entropy = 5/9 * Entropy(2+,3-) + 4/9 * Entropy(2+,2-)
= 5/9 * [-2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/9 * [-1/2 * log2(1/2) - 1/2 * log2(1/2)] = 0.9839

a1 info gain: E.O. - (a1 children entropy) = E.O. - 0.7616 = 0.2294
a2 info gain: E.O. - (a2 children entropy) = E.O. - 0.9839 = 0.0072
a3 >= 3.0 i.g.: E.O. - (a3 >= 3.0 children entropy) = E.O. - 0.8484 = 0.1427
a3 >= 4.0 i.g.: E.O. - (a3 >= 4.0 children entropy) = E.O. - 0.9885 = 0.0026
a3 >= 5.0 i.g.: E.O. - (a3 >= 5.0 children entropy) = E.O. - 0.9183 = 0.0728
a3 >= 6.0 i.g.: E.O. - (a3 >= 6.0 children entropy) = E.O. - 0.9839 = 0.0072
a3 >= 7.0 i.g.: E.O. - (a3 >= 7.0 children entropy) = E.O. - 0.9728 = 0.0183
a3 >= 8.0 i.g.: E.O. - (a3 >= 8.0 children entropy) = E.O. - 0.8889 = 0.1022
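The numbers above can be verified with a short Python sketch (not part of the original answer; the helper names are my own):

```python
from math import log2

def entropy(pos, neg):
    # Entropy of a node containing pos positive and neg negative records.
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

# Records 1..9 as (a1, a2, a3, Class) from the table above.
data = [('T', 'T', 1.0, '+'), ('T', 'T', 6.0, '+'), ('T', 'F', 5.0, '-'),
        ('F', 'F', 4.0, '+'), ('F', 'T', 7.0, '-'), ('F', 'T', 3.0, '-'),
        ('F', 'F', 8.0, '-'), ('T', 'F', 7.0, '+'), ('F', 'T', 5.0, '-')]

E0 = entropy(4, 5)      # original entropy
print(round(E0, 4))     # 0.9911

def info_gain(split):
    # split maps a record to True/False (which child branch it falls into).
    gain = E0
    for branch in (True, False):
        child = [r for r in data if split(r) == branch]
        if child:
            pos = sum(r[3] == '+' for r in child)
            gain -= len(child) / len(data) * entropy(pos, len(child) - pos)
    return gain

print(round(info_gain(lambda r: r[0] == 'T'), 4))   # a1: 0.2294
print(round(info_gain(lambda r: r[1] == 'T'), 4))   # a2: 0.0072
for cut in (3.0, 4.0, 5.0, 6.0, 7.0, 8.0):          # the six a3 splits
    print(cut, round(info_gain(lambda r, c=cut: r[2] >= c), 4))
```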
