Data Mining Exercises - Solutions

The document presents solutions to 11 questions on data mining concepts: properties that make Python suitable for data mining, the output of several code snippets, corrections needed to make broken code work, the difference between supervised and unsupervised techniques, overfitting, the predictive modeling process, a worked K-means clustering iteration, the number of possible splits in a dataset, and information gain calculations. It provides explanations, examples, and step-by-step workings for each question.


1. Give 5 properties of Python and explain why Python is suitable for Data Mining.

Python is an easy-to-use, easy-to-read, expressive, object-oriented, open-source, and portable programming language, and it offers a large collection of libraries implementing data mining algorithms (e.g., pandas, NumPy, scikit-learn).

2. Write the output of the following code.

<class 'tuple'>

[2.0, 100, 5]
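The code itself is not reproduced in this extract; a minimal snippet that produces exactly this output (an assumption, not necessarily the original) is:

```python
t = (2.0, 100, 5)   # a tuple literal
print(type(t))      # <class 'tuple'>
print(list(t))      # [2.0, 100, 5] -- list() converts the tuple to a list
```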

3. (10 pts) Please make the necessary change in the given code so that it doesn’t give the following
error message and works as commented:

Line 1 must be → import pandas as pd
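The broken code is not reproduced here, but the fix implies that later lines refer to the pandas module through the alias pd. A minimal sketch (the DataFrame contents are assumed):

```python
import pandas as pd  # the corrected line 1: bind pandas to the alias pd

# Without the "as pd" alias, the line below would raise
# NameError: name 'pd' is not defined.
df = pd.DataFrame({'x': [1, 2, 3]})
print(df)
```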

4. What is the output of the following code? (Hint: if the loop test is False, then execution jumps to the else: row.)

4 320
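The original snippet is not reproduced in this extract. The hint refers to Python's loop-else construct, illustrated by this minimal sketch (the values here are illustrative, not the original code):

```python
# Python's loop-else: the else block runs only when the loop's test
# becomes False (i.e., the loop completes without hitting a break).
for n in [2, 4, 6]:
    if n % 2 != 0:
        break
else:
    print("no odd number found")   # executed: the loop ran to completion

while False:
    pass
else:
    print("while-test was False")  # executed immediately
```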
5. What is the output of the following code?

True

False

None
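The code itself is not shown in this extract; one minimal snippet that would print these three values (an assumption, not the original) is:

```python
def check(x):
    # A function path with no return statement implicitly returns None.
    if x > 0:
        return True
    if x < 0:
        return False

print(check(5))    # True
print(check(-5))   # False
print(check(0))    # None (no return statement was reached)
```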

6. We are writing a sublist function which compares two lists and returns True if the first list (lst1) is a sublist of (is contained inside) the second list (lst2). We created a version of the second list, ls2, in which we eliminated all elements of lst2 that are not in the first list, to see if the final lists are the same. However, even though the final lists contain the same elements, the function returns False.

This output needs to be True, since the elements of list [2,1] are also elements of list [1,2,5,3].

What property of lists can we use in the comparison (?==?) so that the function gives the correct result (True) in the example above?

Line 4 must be → return sorted(lst1) == sorted(ls2)

Note: Another sublist function given in the Apriori algorithm code runs faster.
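A minimal reconstruction of the sublist function described in question 6 (the original code is not reproduced, so everything beyond the names lst1, lst2, and ls2 is an assumption):

```python
def sublist(lst1, lst2):
    # Keep only the elements of lst2 that also occur in lst1.
    ls2 = [x for x in lst2 if x in lst1]
    # Lists are ordered, so [2, 1] != [1, 2] even though they contain
    # the same elements; sorting both sides makes the comparison
    # order-insensitive (the corrected line 4).
    return sorted(lst1) == sorted(ls2)

print(sublist([2, 1], [1, 2, 5, 3]))  # True
```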
7. What is the output of the following code?

25
81
75

8. a) Describe the difference between unsupervised and supervised techniques of Data Mining, and give an example of each. b) Define overfitting (o.f.); for which of the above techniques is o.f. a problem?

a) Supervised techniques can be used when a labeled dataset is available for training and testing, whereas unsupervised techniques do not have (or need) a labeled dataset. Unsupervised techniques are used to detect new patterns and clusters in relatively unknown or unstructured data, while supervised techniques are used to predict future data when there is enough structured and analyzed past information. K-means clustering is an unsupervised technique; decision tree analysis is an example of a supervised technique.

b) Overfitting is a problem of supervised techniques in which the model is customized too closely to the specific training data at hand, so that it does not perform well on the test data or on future data.
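A minimal sketch of overfitting using scikit-learn (an assumed library, not part of the original answer): an unpruned decision tree fits the training data perfectly but generalizes worse to held-out test data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic labeled dataset, split into training and test parts.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No depth limit: the tree is free to memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # typically 1.0 on the training data
print(tree.score(X_te, y_te))  # noticeably lower on unseen test data
```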

9. Describe the predictive modeling process. Which techniques are most suitable for modeling datasets with nominal categories? Give 2 examples of these techniques.

In predictive modeling, a labeled dataset is split into two parts: a training dataset and a test dataset. A model is built using the training dataset, and the test dataset is fed into the model to predict its labels. The actual labels of the test dataset and the predicted labels are then compared to evaluate the performance of the model. Classification techniques are most suited for predicting or describing datasets with binary or nominal categories; Decision Trees and Rule-Based Classifiers are two examples.
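The process above can be sketched in a few lines with scikit-learn (an assumed toolkit; the dataset and classifier choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # a labeled dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)  # build on training data
y_pred = model.predict(X_te)                                 # predict test labels
print(accuracy_score(y_te, y_pred))                          # compare with actual labels
```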
10. At one stage in K-means clustering of the given dataset with two attributes, the distances of the points to each centroid are given in the following table. What will the centroid coordinates be in the next stage?

Id   x   y   distance_from_1  distance_from_2  distance_from_3
 0  12  39   26.93            56.08            56.73
 1  28  30   14.14            41.76            53.34
 2  29  54   38.12            40.80            34.06
 3  24  55   39.05            45.88            37.44
 4  45  63   50.70            31.14            16.40
 5  52  70   59.93            32.25             6.71
 6  52  63   53.71            26.40            13.34
 7  55  58   51.04            20.62            18.00
 8  53  23   27.89            24.21            53.04
 9  55  14   29.07            30.87            62.00
10  64  19   38.12            23.35            57.71
11  69   7   43.93            35.01            70.41

To find the updated centroid coordinates, we first assign each point to the nearest existing centroid (check which centroid each point is closest to; e.g., points 0, 1, and 9 are closest to C1). We then take the arithmetic mean of the x and y coordinates of these points to find the updated coordinates of Centroid 1:

C1 = [a.m.(X0,X1,X9), a.m.(Y0,Y1,Y9)] = [(12+28+55)/3, (39+30+14)/3]

Similarly,

C2 = [a.m.(X8,X10,X11), a.m.(Y8,Y10,Y11)] = [(53+64+69)/3, (23+19+7)/3]

and

C3 = [a.m.(X2,X3,X4,X5,X6,X7), a.m.(Y2,Y3,Y4,Y5,Y6,Y7)] = [(29+24+45+52+52+55)/6, (54+55+63+70+63+58)/6]

Answer:

C1 = [31.67, 27.67], C2 = [62.0, 16.33], C3 = [42.83, 60.5]
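The assignment-and-update step can be checked with a short NumPy sketch (not part of the original answer; it reuses the distance table above):

```python
import numpy as np

# (x, y) coordinates of points 0..11 from the table.
points = np.array([[12, 39], [28, 30], [29, 54], [24, 55],
                   [45, 63], [52, 70], [52, 63], [55, 58],
                   [53, 23], [55, 14], [64, 19], [69, 7]], dtype=float)

# distances[i, j] = distance of point i to centroid j+1 (from the table).
distances = np.array([
    [26.93, 56.08, 56.73], [14.14, 41.76, 53.34], [38.12, 40.80, 34.06],
    [39.05, 45.88, 37.44], [50.70, 31.14, 16.40], [59.93, 32.25,  6.71],
    [53.71, 26.40, 13.34], [51.04, 20.62, 18.00], [27.89, 24.21, 53.04],
    [29.07, 30.87, 62.00], [38.12, 23.35, 57.71], [43.93, 35.01, 70.41]])

nearest = distances.argmin(axis=1)  # index of the closest centroid per point
for j in range(3):
    # New centroid = arithmetic mean of the points assigned to it.
    print(points[nearest == j].mean(axis=0).round(2))
# [31.67 27.67], [62.   16.33], [42.83 60.5 ]
```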

11. How many different splits can be made on the dataset given below, and what is the information gain of each?
Note: Use the "Entropy" measure for information gain, given by the following formula:

Entropy(t) = - Σ_i p(i|t) * log2(p(i|t))
Id  a1  a2  a3   Class
 1  T   T   1.0  +
 2  T   T   6.0  +
 3  T   F   5.0  -
 4  F   F   4.0  +
 5  F   T   7.0  -
 6  F   T   3.0  -
 7  F   F   8.0  -
 8  T   F   7.0  +
 9  F   T   5.0  -

8 splits are possible: one on a1, one on a2, and six on the continuous attribute a3 (at thresholds >= 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0).

Entropy original: -4/9 * log2(4/9) - 5/9 * log2(5/9) = 0.9911


Split information gains:

Ex. a1 children entropy = 4/9 * Entropy(3+,1-) + 5/9 * Entropy(1+,4-)
= 4/9 * [-3/4 * log2(3/4) - 1/4 * log2(1/4)] + 5/9 * [-1/5 * log2(1/5) - 4/5 * log2(4/5)] = 0.7616

Ex. a2 children entropy = 5/9 * Entropy(2+,3-) + 4/9 * Entropy(2+,2-)
= 5/9 * [-2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/9 * [-1/2 * log2(1/2) - 1/2 * log2(1/2)] = 0.9839

a1 info gain: E.O. - (a1 children entropy) = E.O. - 0.7616 = 0.2294
a2 info gain: E.O. - (a2 children entropy) = E.O. - 0.9839 = 0.0072
a3 >= 3.0 i.g.: E.O. - (a3 >= 3.0 children entropy) = E.O. - 0.8484 = 0.1427
a3 >= 4.0 i.g.: E.O. - (a3 >= 4.0 children entropy) = E.O. - 0.9885 = 0.0026
a3 >= 5.0 i.g.: E.O. - (a3 >= 5.0 children entropy) = E.O. - 0.9183 = 0.0728
a3 >= 6.0 i.g.: E.O. - (a3 >= 6.0 children entropy) = E.O. - 0.9839 = 0.0072
a3 >= 7.0 i.g.: E.O. - (a3 >= 7.0 children entropy) = E.O. - 0.9728 = 0.0183
a3 >= 8.0 i.g.: E.O. - (a3 >= 8.0 children entropy) = E.O. - 0.8889 = 0.1022
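The numbers above can be verified with a short Python sketch (not part of the original answer; the helper names are my own):

```python
from math import log2

def entropy(pos, neg):
    # Entropy of a node containing pos positive and neg negative records.
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

# Records 1..9 as (a1, a2, a3, Class) from the table above.
data = [('T', 'T', 1.0, '+'), ('T', 'T', 6.0, '+'), ('T', 'F', 5.0, '-'),
        ('F', 'F', 4.0, '+'), ('F', 'T', 7.0, '-'), ('F', 'T', 3.0, '-'),
        ('F', 'F', 8.0, '-'), ('T', 'F', 7.0, '+'), ('F', 'T', 5.0, '-')]

E0 = entropy(4, 5)      # original entropy
print(round(E0, 4))     # 0.9911

def info_gain(split):
    # split maps a record to True/False (which child branch it falls into).
    gain = E0
    for branch in (True, False):
        child = [r for r in data if split(r) == branch]
        if child:
            pos = sum(r[3] == '+' for r in child)
            gain -= len(child) / len(data) * entropy(pos, len(child) - pos)
    return gain

print(round(info_gain(lambda r: r[0] == 'T'), 4))   # a1: 0.2294
print(round(info_gain(lambda r: r[1] == 'T'), 4))   # a2: 0.0072
for cut in (3.0, 4.0, 5.0, 6.0, 7.0, 8.0):          # the six a3 splits
    print(cut, round(info_gain(lambda r, c=cut: r[2] >= c), 4))
```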
