Data Mining Mid Term


1. What is data mining?

-Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or
knowledge from huge amounts of data.

2. List two reasons why data mining is popular now and it wasn’t as popular 20 years ago.

-Data mining has become more popular in recent years for several reasons. Compared with two decades
ago, two key factors stand out:

-Increased data availability.

-Advancements in computing power and technology.

3. Describe the steps involved in data mining when viewed as a process of knowledge discovery.

-Knowledge discovery in databases (KDD) is the procedure for recognizing valid, useful, and understandable
patterns within huge and complex data sets.

-The seven steps are data cleaning, data integration, data selection, data transformation, data mining,
pattern evaluation (measuring), and knowledge presentation (visualization).

4. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in
increasing order) 11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 75.

a) What is the mean of the data? What is the median?

-To find the mean and median of the given data, let's start by organizing the data in increasing
order:

11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75

-Mean:

The mean (or average) is calculated by summing up all the values and then dividing by the total
number of values.

$\text{Mean} = \frac{11 + 15 + 16 + \dots + 52 + 70 + 75}{29} = \frac{902}{29}$

-Mean ≈ 31.10

-Median:

-The median is the middle value of the dataset when it is ordered. If there is an even number of
values, the median is the average of the two middle values.
-In this case, there are 29 values, so the median is the value at position
$\frac{29 + 1}{2} = \frac{30}{2} = 15$, which is the 15th value in the ordered list.

-Median = 25

So, the mean is approximately 31.10, and the median is 25.

b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

-The mode is the value that appears most frequently in a dataset. Let's examine the
given data to find the mode:

11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75

-In this dataset, the values 25 and 35 each appear four times, more often than any other
value, so both are modes. Therefore, the modes of the data are 25 and 35.

-As for the modality of the data, it is bimodal because it has two modes. A dataset with
one mode is unimodal, and a dataset with three modes is trimodal.
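
As a cross-check, the mean, median, and mode of the age data can be computed with Python's standard `statistics` module. This is a minimal sketch rather than part of the original answer: the variable names are illustrative, and `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics

# Age values from question 4, already in increasing order.
ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # middle value of the sorted list
modes = statistics.multimode(ages)    # every value tied for the highest count

print(len(ages), sum(ages))                   # 29 902
print(round(mean_age, 2), median_age, modes)  # 31.1 25 [25, 35]
```

Two values share the highest count, which is what makes the data bimodal.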

c) What is the midrange of the data?

- The midrange is the average of the maximum and minimum values in a dataset. Let's
calculate the midrange for the given data:

Given data (in increasing order):

11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75

Maximum value: 75
Minimum value: 11

$\text{Midrange} = \frac{\text{Maximum value} + \text{Minimum value}}{2} = \frac{75 + 11}{2} = \frac{86}{2} = 43$

Therefore, the midrange of the given data is 43.


d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

To find the first quartile (Q1) and the third quartile (Q3), we can use the following
steps:

1. Find the positions of Q1 and Q3:

• Q1 is the median of the lower half of the data.
• Q3 is the median of the upper half of the data.

Since the data set has 29 values, Q1 is at position $\frac{29 + 1}{4} = \frac{30}{4} = 7.5$, and
Q3 is at position $\frac{3 \times (29 + 1)}{4} = \frac{90}{4} = 22.5$. Each position falls halfway
between two neighbouring values.

2. Identify the values at Q1 and Q3:

• Q1 lies between the 7th and 8th values, which are both 20.
• Q3 lies between the 22nd value (35) and the 23rd value (36).

Ordered dataset:

11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75

Q1 = 20

Q3 ≈ 35.5, or roughly 36

Therefore, the first quartile (Q1) is 20, and the third quartile (Q3) is roughly 36 for this
dataset.

e) Give the five-number summary of the data.

The five-number summary consists of the following values:

1. Minimum: The smallest value in the dataset.
2. First Quartile (Q1): The value below which 25% of the data falls.
3. Median (Q2): The middle value of the dataset.
4. Third Quartile (Q3): The value below which 75% of the data falls.
5. Maximum: The largest value in the dataset.

For the given data:

11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75

The five-number summary is:

1. Minimum: 11
2. First Quartile (Q1): 20
3. Median (Q2): 25
4. Third Quartile (Q3): 36
5. Maximum: 75

Therefore, the five-number summary of the given data is {11, 20, 25, 36, 75}.
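
The five-number summary can also be checked programmatically. The sketch below assumes Python's `statistics.quantiles` with its default "exclusive" method, which places the quartiles at positions p × (n + 1) and interpolates between neighbouring values. Other quartile conventions exist and can give slightly different results, so Q3 comes out as 35.5 here rather than the rounded value 36.

```python
import statistics

# Age values from question 4, already in increasing order.
ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

# n=4 returns the three quartile cut points (Q1, Q2, Q3).
q1, q2, q3 = statistics.quantiles(ages, n=4)

five_number_summary = (min(ages), q1, q2, q3, max(ages))
print(five_number_summary)  # (11, 20.0, 25.0, 35.5, 75)
```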


f) Show a boxplot of the data.

Creating a boxplot involves visualizing the five-number summary (minimum, first
quartile (Q1), median (Q2), third quartile (Q3), and maximum) to provide a graphical
representation of the distribution of the data. The boxplot displays the central tendency
and spread of the data.

(The boxplot figure is not reproduced here.)

Each component of the boxplot corresponds to a specific part of the five-number
summary. The box represents the interquartile range (Q3 - Q1), and the line inside the
box represents the median (Q2). The "whiskers" extend from the box to the minimum
and maximum values, and any potential outliers may be plotted individually.

The lengths of the box and whiskers are proportional to the spread of the data. Any tool
or software that supports boxplots can generate the plot from the data values.
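
The remark about outliers can be made concrete with the common 1.5 × IQR rule. That rule is an assumption here, since the answer does not say how outliers would be identified, and the quartiles come from `statistics.quantiles` with its default method, so they may differ slightly from hand-rounded values. A minimal sketch:

```python
import statistics

# Age values from question 4, already in increasing order.
ages = [11, 15, 16, 16, 19, 20, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70, 75]

q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1                      # height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [a for a in ages if lower_fence <= a <= upper_fence]
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# Under this convention the whiskers stop at the most extreme non-outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(iqr, lower_fence, upper_fence)      # 15.5 -3.25 58.75
print(whisker_low, whisker_high, outliers)  # 11 52 [70, 75]
```

Under this convention, 70 and 75 would be drawn as individual points and the upper whisker would end at 52 instead of at the maximum.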
5. Suppose that the values for a given set of data are grouped into intervals. The intervals and
corresponding frequencies are as follows.

Age       Frequency
1-5       350
5-15      450
15-20     300
20-50     1500
50-80     700
80-110    30

What is the mode of the data? Compute an approximate median value for the data.

To find the mode of grouped data, we look for the interval with the highest frequency. In this case,
the interval with the highest frequency is the "20-50" age group with a frequency of 1500.

Therefore, the mode of the data is the midpoint of the modal interval. For the "20-50" interval, the
midpoint is:

$\text{Midpoint} = \frac{\text{Lower limit} + \text{Upper limit}}{2} = \frac{20 + 50}{2} = 35$

So, the mode of the data is approximately 35.

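Now, to compute an approximate median value for grouped data, we can use the following formula:

$\text{Median} \approx L + \left( \frac{N/2 - F}{f} \right) \times w$

where:

• L is the lower limit of the median group (the group containing the median),
• N is the total frequency,
• F is the cumulative frequency of the group before the median group,
• f is the frequency of the median group,
• w is the width of the median group.

For this data, N = 3330, so N/2 = 1665. The cumulative frequencies are 350, 800, 1100, and 2600, so the
median falls in the "20-50" group, with L = 20, F = 1100, f = 1500, and w = 30:

$\text{Median} \approx 20 + \left( \frac{1665 - 1100}{1500} \right) \times 30 = 20 + 11.3 = 31.3$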
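
The grouped-median formula can be applied mechanically. The sketch below is one possible implementation: the function name `grouped_median` and the tuple layout are illustrative choices, and it assumes the intervals are listed in increasing order.

```python
# (lower limit, upper limit, frequency) for each age interval in question 5.
groups = [(1, 5, 350), (5, 15, 450), (15, 20, 300),
          (20, 50, 1500), (50, 80, 700), (80, 110, 30)]

def grouped_median(groups):
    """Approximate median of grouped data: L + ((N/2 - F) / f) * w."""
    total = sum(freq for _, _, freq in groups)   # N
    cumulative = 0                               # F: frequency before the current group
    for lower, upper, freq in groups:
        if cumulative + freq >= total / 2:       # this group contains the median
            width = upper - lower                # w
            return lower + (total / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no groups given")

approx_median = grouped_median(groups)
print(round(approx_median, 2))  # 31.3
```

The median group is "20-50", with N = 3330, F = 1100, f = 1500, and w = 30.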
6. Given the 5-dimensional numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, -1), find:

a) The Euclidean distance between the points.

b) The Manhattan distance.

c) The Minkowski distance for q = 3.
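
No worked answer is recorded for this question. As a hedged sketch, all three distances follow from the Minkowski formula, (Σ |a_i − b_i|^q)^(1/q), with q = 2 giving the Euclidean distance and q = 1 the Manhattan distance. The helper name `minkowski` is illustrative.

```python
A = (1, 0, 2, 5, 3)
B = (2, 1, 0, 3, -1)

def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

euclidean = minkowski(A, B, 2)    # sqrt(1 + 1 + 4 + 4 + 16) = sqrt(26)
manhattan = minkowski(A, B, 1)    # 1 + 1 + 2 + 2 + 4
minkowski_3 = minkowski(A, B, 3)  # cube root of (1 + 1 + 8 + 8 + 64) = 82 ** (1/3)

print(round(euclidean, 3), manhattan, round(minkowski_3, 3))  # 5.099 10.0 4.344
```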


7. Suppose a group of 12 students has the test scores listed as follows:
19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95.
Partition them into four bins by (a) the equal-frequency (equi-depth) method, (b) the equal-width
method, and (c) clustering.

(a) Equal-Frequency (Equi-Depth) Method:

The equal-frequency method involves dividing the data into bins such that each bin contains approximately the same number of
data points. In this case, with 12 students, we want to create 4 bins.

1. Sort the data in ascending order: 19, 35, 48, 63, 69, 71, 72, 81, 85, 88, 95, 99
2. Assign the data points to the bins, three per bin (12 / 4 = 3):
• Bin 1: 19, 35, 48
• Bin 2: 63, 69, 71
• Bin 3: 72, 81, 85
• Bin 4: 88, 95, 99
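
The equal-frequency procedure can be sketched in a few lines of Python. The function name `equal_frequency_bins` is illustrative, and giving any leftover items to the first bins is an assumed convention that only matters when the count is not divisible by the number of bins.

```python
scores = [19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95]

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # Assumed convention: the first `extra` bins take one additional item.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

depth_bins = equal_frequency_bins(scores, 4)
print(depth_bins)  # [[19, 35, 48], [63, 69, 71], [72, 81, 85], [88, 95, 99]]
```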

(b) Equal Width Method:

The equal-width method involves creating bins of equal width. We need to determine the width of each bin and assign data points
accordingly.

1. Find the range of the data: $\text{Range} = \text{Max} - \text{Min} = 99 - 19 = 80$
2. Determine the width of each bin: $\text{Bin Width} = \frac{\text{Range}}{\text{Number of Bins}} = \frac{80}{4} = 20$
3. Create bins and assign data points:
• Bin 1 (19 - 38): 19, 35
• Bin 2 (39 - 58): 48
• Bin 3 (59 - 78): 63, 69, 71, 72
• Bin 4 (79 - 99): 81, 85, 88, 95, 99
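
A minimal sketch of equal-width binning follows. The function name `equal_width_bins` is illustrative, and the code assumes the data contains at least two distinct values so that the bin width is non-zero. Each bin covers a half-open interval of width 20 starting at the minimum, with the maximum value placed in the last bin.

```python
scores = [19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95]

def equal_width_bins(values, n_bins):
    """Assign values to n_bins intervals of equal width between min and max."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins          # (99 - 19) / 4 = 20 for the scores
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins

width_bins = equal_width_bins(scores, 4)
print(width_bins)  # [[19, 35], [48], [63, 69, 71, 72], [81, 85, 88, 95, 99]]
```

Unlike the equal-frequency method, the bins here hold different numbers of scores.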

(c) Clustering:

The clustering method involves using clustering algorithms to group data points into bins based on similarity or proximity. For
simplicity, let's use k-means clustering with k=4.

1. Choose initial centroids (randomly or based on data points):
• Centroid 1: 25
• Centroid 2: 50
• Centroid 3: 75
• Centroid 4: 90
2. Assign each data point to the nearest centroid:
• Bin 1: 19, 35
• Bin 2: 48
• Bin 3: 63, 69, 71, 72, 81
• Bin 4: 85, 88, 95, 99
3. Update centroids based on the mean of each cluster (here 27, 48, 71.2, and 91.75). Repeat the assignment and update steps until
convergence. With these starting centroids, the second pass leaves the assignments unchanged, so the bins above are final.

Note: The actual clustering process may vary based on the specific algorithm and implementation used. The above steps provide a
simplified representation.
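
The assignment and update loop can be sketched as a small one-dimensional k-means. This is an illustrative implementation, not a specific library's algorithm: the function name `kmeans_1d`, the tie-breaking toward the lower-numbered centroid, and keeping the old centroid for an empty cluster are all assumptions.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Simple k-means on one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:   # no centroid moved: converged
            break
        centroids = updated
    return clusters, centroids

scores = sorted([19, 71, 48, 63, 35, 85, 69, 81, 72, 88, 99, 95])
clusters, centroids = kmeans_1d(scores, [25, 50, 75, 90])
print(clusters)   # [[19, 35], [48], [63, 69, 71, 72, 81], [85, 88, 95, 99]]
print(centroids)  # [27.0, 48.0, 71.2, 91.75]
```

Different starting centroids can lead to different final bins, which is why the result depends on the initialization.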
