Data Mining Mid Term
-Data mining is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or
knowledge from huge amounts of data.
2. List two reasons why data mining is popular now and it wasn’t as popular 20 years ago.
Data mining has become more popular in recent years for several reasons; a comparison
with two decades ago highlights two key factors behind its increased popularity:
increased data availability, and advances in computing power and technology.
3. Describe the steps involved in data mining when viewed as a process of knowledge discovery.
-KDD is used to establish the procedure for recognizing valid, useful, and understandable patterns within
huge and complex data sets.
-The seven steps are cleansing, integration, selection, transformation, mining, measuring, and
visualization.
4.Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in
increasing order) 11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 75.
-To find the mean and median of the given data, start with the data in increasing
order:
11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75
-Mean:
The mean (or average) is calculated by summing all the values and then dividing by the total
number of values.
-Mean = (11 + 15 + 16 + 16 + 19 + 20 + 20 + 20 + 21 + 22 + 22 + 25 + 25 + 25 + 25 + 30 + 33 + 33 + 35 + 35 + 35
+ 35 + 36 + 40 + 45 + 46 + 52 + 70 + 75) / 29
-Mean = 902 / 29
-Mean ≈ 31.10
-Median:
-The median is the middle value of the dataset when it is ordered. If there is an even number of
values, the median is the average of the two middle values.
-In this case, there are 29 values, so the median is the value at position (29 + 1) / 2 = 30 / 2 = 15,
which is the 15th value in the ordered list.
-Median = 25. So, the mean is approximately 31.10, and the median is 25.
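As a quick check of the arithmetic, here is a minimal Python sketch that uses only the standard library's `statistics` module. The variable names are illustrative, not part of the original question.

```python
from statistics import mean, median

# Age values from question 4, already in increasing order
ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

n = len(ages)              # number of values
total = sum(ages)          # sum of all values
age_mean = mean(ages)      # total / n
age_median = median(ages)  # middle value; position (n + 1) / 2 when n is odd

print(n, total, round(age_mean, 2), age_median)  # 29 902 31.1 25
```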
b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
-The mode is the value that appears most frequently in a dataset. Let's examine the
given data to find the mode:
11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75
-In this dataset, the values 25 and 35 each appear four times, more often than any other
value (20 appears three times). Therefore, the modes of the data are 25 and 35.
-As for the modality of the data, it is bimodal because it has two modes. With a single
mode the data would be unimodal; with three modes, trimodal, and so on.
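The mode and the modality can be checked by counting occurrences. The following is a minimal sketch, assuming Python 3.8 or later for `statistics.multimode`; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

counts = Counter(ages)     # value -> number of occurrences
modes = multimode(ages)    # every value tied for the highest count, in first-seen order

print(modes, counts[modes[0]])  # [25, 35] 4
print(len(modes))               # 2, i.e. two modes (bimodal)
```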
- The midrange is the average of the maximum and minimum values in a dataset. Let's
calculate the midrange for the given data:
Midrange = (75 + 11) / 2
Midrange = 86 / 2 = 43
To find the first quartile (Q1) and the third quartile (Q3), split the ordered data at the
median (the 15th value) and take the median of each half:
11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75
Lower half (values 1 to 14): the two middle values are 20 and 20, so its median is 20.
Upper half (values 16 to 29): the two middle values are 35 and 36, so its median is 35.5.
Therefore, the first quartile (Q1) is 20, and the third quartile (Q3) is roughly 36 for this
dataset.
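One common way to compute quartiles by hand is to split the ordered data at the median and take the median of each half. The sketch below follows that convention, which is an assumption here. Other quartile conventions can give 35 or 36 for Q3, which is why the answer is stated as "roughly".

```python
from statistics import median

ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

n = len(ages)
half = n // 2
lower = ages[:half]                                   # values below the median position
upper = ages[half + 1:] if n % 2 else ages[half:]     # values above the median position

q1 = median(lower)   # median of the lower half
q3 = median(upper)   # median of the upper half

print(q1, q3)  # 20.0 35.5
```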
e) Give the five number summary of the data.
11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75
1. Minimum: 11
2. First Quartile (Q1): 20
3. Median (Q2): 25
4. Third Quartile (Q3): 36
5. Maximum: 75
In a boxplot, the box spans Q1 to Q3 with a line at the median, and the whiskers extend
toward the minimum and maximum, so the lengths of the box and whiskers are proportional
to the spread of the data. If you have a tool or software that can create boxplots, you
can input the data values to generate the visual representation.
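To see where the whiskers of such a boxplot would end, the sketch below applies the 1.5 × IQR rule to the quartiles reported in the five-number summary (Q1 = 20, Q3 = 36). The 1.5 × IQR rule is a common boxplot convention but is an assumption here, since the text does not state it.

```python
ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

q1, q3 = 20, 36                 # quartiles as reported in the five-number summary
iqr = q3 - q1                   # interquartile range (length of the box)
low_fence = q1 - 1.5 * iqr      # assumed convention: points below this are outliers
high_fence = q3 + 1.5 * iqr     # assumed convention: points above this are outliers

inside = [x for x in ages if low_fence <= x <= high_fence]
outliers = [x for x in ages if x < low_fence or x > high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(iqr, low_fence, high_fence)           # 16 -4.0 60.0
print(whisker_low, whisker_high, outliers)  # 11 52 [70, 75]
```

Under this convention the values 70 and 75 would be drawn as individual outlier points beyond the upper whisker.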
5. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows.
Age       Frequency
1-5       350
5-15      450
15-20     300
20-50     1500
50-80     700
80-110    30
What is the mode of the data? Compute an approximate median value for the data.
To find the mode of grouped data, we look for the interval with the highest frequency. In this case,
the interval with the highest frequency is the "20-50" age group with a frequency of 1500.
The mode of the data can therefore be approximated by the midpoint of the modal interval. For the
"20-50" interval, the midpoint is:
Midpoint = (20 + 50) / 2 = 35
Now, to compute an approximate median value for grouped data, we can use the following formula:
Median ≈ L + ((N/2 − F) / f) × w
where:
L is the lower limit of the median group (the group containing the median),
N is the total frequency,
F is the cumulative frequency of the group before the median group,
f is the frequency of the median group,
w is the width of the median group.
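The formula can be applied to the table directly. The sketch below is a minimal illustration that locates the median group by accumulating frequencies and then interpolates inside it. The tuple layout and variable names are illustrative choices.

```python
# (lower limit, upper limit, frequency) for each interval in the table
groups = [(1, 5, 350), (5, 15, 450), (15, 20, 300),
          (20, 50, 1500), (50, 80, 700), (80, 110, 30)]

N = sum(freq for _, _, freq in groups)   # total frequency
half = N / 2

cumulative = 0                           # frequency accumulated before the current group
for lower, upper, freq in groups:
    if cumulative + freq >= half:        # this group contains the median
        L, F, f, w = lower, cumulative, freq, upper - lower
        break
    cumulative += freq

approx_median = L + (half - F) / f * w

print(N, L, F, f, w, round(approx_median, 1))  # 3330 20 1100 1500 30 31.3
```

Here N = 3330, so N/2 = 1665 falls in the "20-50" group, which gives Median ≈ 20 + ((1665 − 1100) / 1500) × 30 ≈ 31.3.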
6. Given the 5-dimensional numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, -1), find
a) The Euclidean distance between points.
b) The Manhattan distance.
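Both distances follow from the coordinate-wise differences between A and B. The Euclidean distance is the square root of the sum of squared differences, and the Manhattan distance is the sum of absolute differences. A minimal sketch with illustrative variable names:

```python
import math

A = (1, 0, 2, 5, 3)
B = (2, 1, 0, 3, -1)

diffs = [a - b for a, b in zip(A, B)]              # coordinate-wise differences
euclidean = math.sqrt(sum(d * d for d in diffs))   # sqrt(1 + 1 + 4 + 4 + 16) = sqrt(26)
manhattan = sum(abs(d) for d in diffs)             # 1 + 1 + 2 + 2 + 4

print(diffs, round(euclidean, 3), manhattan)  # [-1, -1, 2, 2, 4] 5.099 10
```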
(a) Equal-frequency:
The equal-frequency method divides the data into bins such that each bin contains approximately the same number of
data points. In this case, with 12 students and 4 bins, each bin holds 3 data points.
(b) Equal-width:
The equal-width method creates bins of equal width. The bin width is (maximum − minimum) / number of bins, and each data point is
assigned to the interval it falls in.
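The two binning methods can be compared side by side. The student data is not reproduced in this section, so the sketch below uses 12 hypothetical scores. They are an assumption made purely for illustration, as are the variable names.

```python
# HYPOTHETICAL scores for 12 students (not the data from the original question)
scores = sorted([35, 42, 48, 51, 55, 60, 64, 70, 73, 81, 88, 95])
k = 4  # number of bins

# Equal-frequency: consecutive runs of len(scores) // k sorted values
size = len(scores) // k
equal_freq = [scores[i * size:(i + 1) * size] for i in range(k)]

# Equal-width: k intervals, each of width (max - min) / k
lo, hi = scores[0], scores[-1]
width = (hi - lo) / k
equal_width = [[] for _ in range(k)]
for s in scores:
    idx = min(int((s - lo) / width), k - 1)  # clamp the maximum into the last bin
    equal_width[idx].append(s)

print(equal_freq)   # [[35, 42, 48], [51, 55, 60], [64, 70, 73], [81, 88, 95]]
print(equal_width)  # [[35, 42, 48], [51, 55, 60, 64], [70, 73], [81, 88, 95]]
```

With equal-frequency binning every bin holds 3 values. With equal-width binning every bin covers 15 score points, but the bin counts differ.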
(c) Clustering:
The clustering method involves using clustering algorithms to group data points into bins based on similarity or proximity. For
simplicity, let's use k-means clustering with k=4.
Note: The actual clusters may vary with the specific algorithm, initialization, and implementation used, so any worked result is a
simplified representation.
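To make the clustering approach concrete, here is a minimal one-dimensional k-means sketch in plain Python. It makes several assumptions: the 12 scores are hypothetical, the function name `kmeans_1d` is invented for this example, and the starting centroids are picked deterministically from evenly spaced sorted values, whereas library implementations usually use random or k-means++ initialization.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain k-means on 1-D data (k >= 2); returns (centroids, clusters)."""
    data = sorted(values)
    # Assumed deterministic start: k evenly spaced values from the sorted data
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:   # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters


# HYPOTHETICAL scores for 12 students (not the data from the original question)
scores = [35, 42, 48, 51, 55, 60, 64, 70, 73, 81, 88, 95]
centroids, clusters = kmeans_1d(scores, k=4)

print(clusters)   # [[35, 42], [48, 51, 55, 60], [64, 70, 73, 81], [88, 95]]
print(centroids)  # [38.5, 53.5, 72.0, 91.5]
```

Each resulting cluster acts as one bin. Unlike the equal-frequency and equal-width methods, the bin boundaries follow the natural gaps in the data.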