PS2 Sol
Assignment 2
Data Mining (CSE4052)
1. Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality (e.g., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data? What
is the interquartile range?
Solution:
a) The mean of the data is $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{809}{27} \approx 29.96 \approx 30$. The median of the data is the middle value of the ordered set (the 14th of the 27 values), which is 25.
b) The mode of the data is the value that occurs with the highest frequency. In this
example 25 and 35 each occur four times, which is the highest frequency in the set, and
hence the data are bimodal.
c) The midrange of the data is the average of the largest (70) and smallest (13) values
in the data set: $(70 + 13)/2 = 41.5$.
d) First quartile (Q1) = ((n+1)/4)th term = ((27+1)/4)th term = 7th term, which is 20. It is
also known as the lower quartile.
- The second quartile (Q2), also called the 50th percentile or the median, is the
((n+1)/2)th term = 14th term = 25.
- The third quartile (Q3), or 75th percentile, is the (3(n+1)/4)th term = 21st term = 35. It is
also known as the upper quartile.
- The interquartile range is the upper quartile minus the lower quartile: 35 - 20 = 15.
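The values in parts (a) to (d) can be checked with a short Python sketch. It assumes the same k(n+1)/4 position rule used above for the quartiles. The helper name `quartile_by_position` is hypothetical, and its linear interpolation for fractional positions is only one of several common conventions.

```python
from statistics import mean, median, multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
        33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def quartile_by_position(sorted_values, k):
    """Return the k-th quartile using the k*(n+1)/4 position rule (1-based).

    Assumption: fractional positions are linearly interpolated. This sketch
    does not guard against positions that fall outside the data, which can
    happen for very small samples.
    """
    position = k * (len(sorted_values) + 1) / 4
    lower = int(position)              # whole part of the 1-based position
    fraction = position - lower
    value = sorted_values[lower - 1]
    if fraction:
        value += fraction * (sorted_values[lower] - value)
    return value


avg = mean(ages)                       # 809 / 27
med = median(ages)                     # 14th of the 27 ordered values
modes = multimode(ages)                # every value sharing the top frequency
midrange = (min(ages) + max(ages)) / 2
q1 = quartile_by_position(ages, 1)     # 7th term
q3 = quartile_by_position(ages, 3)     # 21st term
iqr = q3 - q1

print(round(avg, 2), med, modes, midrange, q1, q3, iqr)
```

Other quartile conventions (for example, the median-of-halves rule) can give slightly different Q1 and Q3 on other data sets, so it is worth stating which rule is used.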
Smoothing by bin means:
Bin 1: 13.83, 13.83, 13.83, 13.83, 13.83, 13.83
Bin 2: 20.16, 20.16, 20.16, 20.16, 20.16
Bin 3: 30.67, 30.67, 30.67, 30.67, 30.67, 30.67
Bin 4: 63.5, 63.5, 63.5, 63.5, 63.5, 63.5, 63.5
Smoothing by bin boundaries:
Bin 1: 11, 11, 11, 16, 16, 16
Bin 2: 19, 19, 19, 21, 21, 21
Bin 3: 22, 22, 22, 22, 45, 45
Bin 4: 45, 45, 75, 75, 75, 75
Smoothing by bin medians:
Bin 1: 14, 14, 14, 14, 14, 14
Bin 2: 20, 20, 20, 20, 20, 20
Bin 3: 27, 27, 27, 27, 27, 27
Bin 4: 71.5, 71.5, 71.5, 71.5, 71.5, 71.5
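The three smoothing methods can be sketched in Python as follows. The data set for this question is not reproduced in this section, so the sketch runs on a small illustrative price list that is NOT the assignment's data. The function names are hypothetical, and the rule that ties go to the lower boundary is an assumption.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (rounded to 2 decimals)."""
    return [[round(mean(b), 2)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Assumption: a value equally close to both boundaries goes to the lower one.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative data only, not the data set of this assignment.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, depth=3)

by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)

print(by_means)
print(by_medians)
print(by_boundaries)
```

With equal-depth bins of six values each, the same functions would apply to a 24-value data set such as the one this answer appears to use.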
4. Find Q1, Q2, and Q3 for the following data set, and draw a box-and-whisker plot.
{2,6,7,8,8,11,12,13,14,15,22,23}
Solution: There are 12 data points. The middle two are 11 and 12. So the median, Q2, is 11.5.
The "lower half" of the data set is the set {2,6,7,8,8,11}. The median here is 7.5. So Q1=7.5.
The "upper half" of the data set is the set {12,13,14,15,22,23}. The median here is 14.5.
So Q3=14.5.
A box-and-whisker plot displays the values Q1, Q2, and Q3, along with the extreme values of
the data set (2 and 23, in this case). The "box" has its left edge at Q1 and its right edge at
Q3, with a line inside the box at Q2 (the median), and the "whiskers" extend from the box to
the minimum and the maximum.
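The median-of-halves procedure used in this solution can be sketched in Python. The helper name `quartiles_by_halves` is hypothetical, and leaving the middle value out of both halves when n is odd is an assumption, since this data set has an even number of points.

```python
from statistics import median

data = sorted([2, 6, 7, 8, 8, 11, 12, 13, 14, 15, 22, 23])


def quartiles_by_halves(sorted_values):
    """Return (Q1, Q2, Q3) where Q1 and Q3 are the medians of the two halves."""
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half + n % 2:]   # skip the middle value when n is odd
    return median(lower), median(sorted_values), median(upper)


q1, q2, q3 = quartiles_by_halves(data)

# The five numbers a box-and-whisker plot is drawn from.
five_number_summary = (data[0], q1, q2, q3, data[-1])
print(five_number_summary)
```

The resulting tuple holds the left whisker, the left edge of the box, the median line, the right edge of the box, and the right whisker, in that order.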
5. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using h = 3.
Solution:
(a) The Euclidean distance is
$d(i, j) = \sqrt{(22 - 20)^2 + (1 - 0)^2 + (42 - 36)^2 + (10 - 8)^2} = \sqrt{45} = 6.7082$.
(b) The Manhattan distance is
$d(i, j) = |22 - 20| + |1 - 0| + |42 - 36| + |10 - 8| = 11$.
(c) The Minkowski distance is
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$,
so, with h = 3,
$d(i, j) = \sqrt[3]{|22 - 20|^3 + |1 - 0|^3 + |42 - 36|^3 + |10 - 8|^3} = \sqrt[3]{233} \approx 6.1534$.
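All three distances in this problem are instances of the Minkowski formula with h = 1 (Manhattan), h = 2 (Euclidean), and h = 3. A minimal Python sketch, with the hypothetical function name `minkowski_distance`, can be used to check the arithmetic:

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h between two equal-length numeric tuples."""
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of attributes")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan = minkowski_distance(x, y, 1)     # sum of absolute differences
euclidean = minkowski_distance(x, y, 2)     # square root of 45
minkowski_3 = minkowski_distance(x, y, 3)   # cube root of 233

print(manhattan, round(euclidean, 4), round(minkowski_3, 4))
```

The distance shrinks as h grows because larger h puts more weight on the single largest attribute difference (here 42 versus 36).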
Weight (kg)    Length (cm)
3.63           53.1
3.02           49.7
3.82           48.4
3.42           54.2
3.59           54.9
2.87           43.7
3.03           47.2
3.46           45.2
3.36           54.4
3.3            50.4
Solution:
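Let A denote the weight and B the length, with n = 10 observations.
$\bar{A} = \frac{3.63 + 3.02 + 3.82 + 3.42 + 3.59 + 2.87 + 3.03 + 3.46 + 3.36 + 3.3}{10} = \frac{33.5}{10} = 3.35$
$\bar{B} = \frac{53.1 + 49.7 + 48.4 + 54.2 + 54.9 + 43.7 + 47.2 + 45.2 + 54.4 + 50.4}{10} = \frac{501.2}{10} = 50.12$
$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{10 \sqrt{\frac{\sum_{i=1}^{n} (a_i - \bar{A})^2}{10}} \sqrt{\frac{\sum_{i=1}^{n} (b_i - \bar{B})^2}{10}}} = \frac{5.101}{10 \sqrt{\frac{0.8182}{10}} \sqrt{\frac{143.856}{10}}} = 0.47$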
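The correlation coefficient above can be reproduced with a short Python sketch. It follows the same population form of the formula (dividing by n inside each standard deviation). The function name `pearson_r` and the variable names are hypothetical.

```python
from math import sqrt

weights = [3.63, 3.02, 3.82, 3.42, 3.59, 2.87, 3.03, 3.46, 3.36, 3.3]    # kg
lengths = [53.1, 49.7, 48.4, 54.2, 54.9, 43.7, 47.2, 45.2, 54.4, 50.4]  # cm


def pearson_r(a, b):
    """Pearson correlation using population standard deviations (divide by n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Numerator: sum of the products of the paired deviations from the means.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sigma_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cross / (n * sigma_a * sigma_b)


r = pearson_r(weights, lengths)
print(round(r, 2))
```

A positive value of this size indicates a moderate positive linear relationship between weight and length in this sample.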
8. Perform the chi-square test for correlation on the following survey observations, in which
256 people shared the month of their birth. The expected distribution is that births are evenly
distributed across the months.
January 29
February 24
March 22
April 19
May 21
June 18
July 19
August 20
September 23
October 18
November 20
December 23
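The worked solution for this question is not included in this section. One way to carry out the test is sketched below, treating it as a goodness-of-fit test of the observed counts against equal expected counts. The 0.05 significance level and the critical value taken from a standard chi-square table for 11 degrees of freedom are assumptions, not part of the question.

```python
observed = {
    "January": 29, "February": 24, "March": 22, "April": 19,
    "May": 21, "June": 18, "July": 19, "August": 20,
    "September": 23, "October": 18, "November": 20, "December": 23,
}

total = sum(observed.values())        # number of people surveyed
expected = total / len(observed)      # equal expected count for every month

# Chi-square statistic: sum over months of (observed - expected)^2 / expected.
chi_square = sum((count - expected) ** 2 / expected
                 for count in observed.values())
degrees_of_freedom = len(observed) - 1

# Assumption: alpha = 0.05; 19.675 is the tabulated critical value for 11 d.f.
CRITICAL_VALUE_05 = 19.675
reject_even_distribution = chi_square > CRITICAL_VALUE_05

print(round(chi_square, 4), degrees_of_freedom, reject_even_distribution)
```

Under these assumptions the statistic falls well below the critical value, so the hypothesis that births are evenly distributed across the months would not be rejected.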