Data Mining Notes C3

This document summarizes key concepts from Chapter 3 of an introduction to data mining textbook. It discusses summary statistics such as frequency, mode, percentiles, mean, median, range and variance that can describe a dataset. It also covers common visualization techniques including histograms, box plots, scatter plots and heatmaps to explore univariate and multivariate data. Visualization is important for representing data objects and relationships through graphical elements.

Uploaded by

wuziqi

Notes on Introduction to Data Mining:

Chapter 3 Data Exploration


wuziqing
5th November 2020

1 Summary Statistics
Summary statistics are quantities, such as the mean and standard deviation,
that capture various characteristics of a potentially large set of values with a
single number or a small set of numbers.

1.1 Frequencies and Mode


Frequency and mode are typically useful only for categorical attributes.
$$\mathrm{frequency}(v_i) = \frac{\text{number of objects with value } v_i}{\text{total number of objects}} \qquad (1)$$
The mode of a categorical attribute is the value with the highest frequency.
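As a minimal sketch of Equation (1) and the mode, the snippet below counts a small categorical attribute. The list `colors` is made-up illustration data, and the helper names `frequencies` and `mode` are hypothetical.

```python
from collections import Counter

def frequencies(values):
    """Return {value: fraction of objects with that value} (Equation 1)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the value with the highest frequency."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute, made up for illustration.
colors = ["red", "blue", "red", "green", "red", "blue"]
freq = frequencies(colors)
print(freq)
print(mode(colors))
```

If several values share the highest count, `Counter.most_common` returns the one encountered first.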

1.2 Percentiles
Percentiles are useful for showing the distribution of an ordered attribute.
The $p$th percentile $x_p$ is a value of $x$ such that $p\%$ of the observed values of $x$ are
less than $x_p$. For example, $x_{50\%} = 3.0$ means that 50% of the observed values of $x$ are less than 3.0.
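Several conventions exist for computing percentiles from a finite sample, and the notes do not pick one. The sketch below uses the nearest-rank convention, which is an assumption made here for illustration. The function name `percentile` and the data are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of the observations are less than or equal to it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted data
    return ordered[rank - 1]

# Hypothetical observations of an ordered attribute.
data = [7, 1, 4, 10, 2, 9, 3, 8, 5, 6]
for p in (25, 50, 75, 100):
    print(p, percentile(data, p))
```

Other conventions interpolate between neighbouring values, so results can differ slightly for small samples.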

1.3 Mean and Median


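The mean and median measure the location of a continuous attribute. Let $x_1, \dots, x_m$ be the $m$ observed values and $x_{(1)} \le \dots \le x_{(m)}$ the same values in sorted order.
$$\mathrm{mean}(x) = \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (2)$$
$$\mathrm{median}(x) = \begin{cases} x_{(r+1)} & \text{if } m \text{ is odd, } m = 2r + 1 \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right) & \text{if } m \text{ is even, } m = 2r \end{cases} \qquad (3)$$
The median is the middle value of the sorted attribute and is robust to
outliers, while the mean can be strongly affected by them.
A trimmed mean with a specified percentage p can be used to reduce the
effect of outliers on the mean. The top and bottom (p/2)% of the data are thrown out
before calculating the mean.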
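The following sketch implements Equations (2) and (3) and the trimmed mean on made-up data with one outlier. The function names are hypothetical, and rounding the number of trimmed values down to an integer is an assumption, since the notes do not say how to handle fractional counts.

```python
def mean(x):
    """Arithmetic mean (Equation 2)."""
    return sum(x) / len(x)

def median(x):
    """Median of the sorted values (Equation 3)."""
    s = sorted(x)
    m = len(s)
    r = m // 2
    if m % 2 == 1:
        return s[r]                  # m = 2r + 1: value x_(r+1), 0-based index r
    return 0.5 * (s[r - 1] + s[r])   # m = 2r: average of x_(r) and x_(r+1)

def trimmed_mean(x, p):
    """Mean after discarding the top and bottom (p/2)% of the data."""
    if not 0 <= p < 100:
        raise ValueError("p must be in [0, 100)")
    s = sorted(x)
    k = int(len(s) * (p / 2) / 100)  # values dropped at each end (rounded down, an assumption)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

# Hypothetical data with a single outlier (100).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(mean(data), median(data), trimmed_mean(data, 20))
```

On this data the outlier pulls the mean well above the median, while trimming 20% brings the mean back to the median's value.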

1.4 Range and Variance
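The range and variance measure the spread of an attribute.

$$\mathrm{range}(x) = \max(x) - \min(x) \qquad (4)$$

$$\mathrm{variance}(x) = s_x^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2 \qquad (5)$$
where $\bar{x}$ is the mean of $x$. The variance is particularly sensitive to outliers, because the difference between
each value and the mean is squared.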
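A short sketch of Equations (4) and (5), comparing a made-up data set with a copy in which one value is replaced by an outlier. The names `value_range`, `variance`, `data` and `with_outlier` are hypothetical.

```python
def value_range(x):
    """Range of an attribute (Equation 4)."""
    return max(x) - min(x)

def variance(x):
    """Sample variance with the 1/(m - 1) factor (Equation 5)."""
    m = len(x)
    x_bar = sum(x) / m
    return sum((xi - x_bar) ** 2 for xi in x) / (m - 1)

# Hypothetical data, and the same data with the largest value replaced by an outlier.
data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = [2, 4, 4, 4, 5, 5, 7, 90]

print(value_range(data), variance(data))
print(value_range(with_outlier), variance(with_outlier))
```

A single changed value is enough to inflate both measures, the variance far more so because of the squaring.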
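Other, more robust measures of spread are:
1. Absolute average deviation (AAD):
$$\mathrm{AAD}(x) = \frac{1}{m}\sum_{i=1}^{m}|x_i - \bar{x}| \qquad (6)$$

2. Median absolute deviation (MAD):
$$\mathrm{MAD}(x) = \mathrm{median}\left(\{|x_1 - \bar{x}|, \dots, |x_m - \bar{x}|\}\right) \qquad (7)$$

3. Interquartile range:
$$\mathrm{IR}(x) = x_{75\%} - x_{25\%} \qquad (8)$$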
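The three robust measures can be sketched as follows. The code follows Equations (6) to (8) as written in these notes, so the MAD is centered on the mean; some references center it on the median instead. The nearest-rank percentile used for the interquartile range is an assumption, and all function names and data are hypothetical.

```python
import math
from statistics import mean, median

def aad(x):
    """Absolute average deviation (Equation 6)."""
    x_bar = mean(x)
    return sum(abs(xi - x_bar) for xi in x) / len(x)

def mad(x):
    """Median absolute deviation, centered on the mean as in Equation 7."""
    x_bar = mean(x)
    return median([abs(xi - x_bar) for xi in x])

def nearest_rank_percentile(x, p):
    """Nearest-rank percentile (one of several conventions, an assumption here)."""
    ordered = sorted(x)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def interquartile_range(x):
    """Interquartile range (Equation 8)."""
    return nearest_rank_percentile(x, 75) - nearest_rank_percentile(x, 25)

# Hypothetical data, with and without an outlier.
data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = [2, 4, 4, 4, 5, 5, 7, 90]

print(aad(data), mad(data), interquartile_range(data))
print(aad(with_outlier), mad(with_outlier), interquartile_range(with_outlier))
```

In this example the interquartile range is unchanged by the outlier, because it ignores the extreme quarters of the data.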

1.5 Multivariate Statistics


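For data with multiple continuous attributes, the location of the data can be
computed separately for each attribute, for example as the vector of attribute means:
$$\bar{\mathbf{x}} = (\bar{x}_1, \dots, \bar{x}_n) \qquad (9)$$
The spread of the data is captured by the covariance matrix $S$. Its entry $s_{ij}$, the covariance
of attributes $x_i$ and $x_j$, measures the degree to which the two attributes
vary together; its value also depends on the magnitudes of the attributes.
$$s_{ij} = \mathrm{covariance}(x_i, x_j) = \frac{1}{m-1}\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j) \qquad (10)$$
where $x_{ki}$ is the value of attribute $i$ for the $k$th object.

Note that $\mathrm{covariance}(x_i, x_i) = \mathrm{variance}(x_i)$.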


Because the covariance depends on magnitude, it is normalized to obtain the correlation, which
measures how strongly two attributes are linearly related, independent of their magnitudes.
The correlation matrix $R$ is calculated by:
$$r_{ij} = \mathrm{correlation}(x_i, x_j) = \frac{\mathrm{covariance}(x_i, x_j)}{s_i s_j} \qquad (11)$$
where $s_i$ and $s_j$ are the standard deviations of $x_i$ and $x_j$. $r_{ij}$ lies in $[-1, 1]$, where 0
indicates no linear relationship, and $1$ / $-1$ indicate a perfect positive/negative linear
relationship. Note that $\mathrm{correlation}(x_i, x_i) = 1$.
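As an illustration of Equations (10) and (11), the sketch below builds the covariance matrix S and the correlation matrix R for a tiny made-up data set with three attributes. The layout (one tuple per object) and the function names are assumptions for the example.

```python
import math

def covariance(rows, i, j):
    """Sample covariance of attributes i and j (Equation 10)."""
    m = len(rows)
    mean_i = sum(row[i] for row in rows) / m
    mean_j = sum(row[j] for row in rows) / m
    return sum((row[i] - mean_i) * (row[j] - mean_j) for row in rows) / (m - 1)

def correlation(rows, i, j):
    """Correlation of attributes i and j (Equation 11)."""
    s_i = math.sqrt(covariance(rows, i, i))  # standard deviation of attribute i
    s_j = math.sqrt(covariance(rows, j, j))  # standard deviation of attribute j
    return covariance(rows, i, j) / (s_i * s_j)

# Hypothetical data: one tuple per object, three continuous attributes.
# Attribute 1 is exactly twice attribute 0; attribute 2 tends to decrease.
rows = [(1.0, 2.0, 5.0), (2.0, 4.0, 3.0), (3.0, 6.0, 4.0), (4.0, 8.0, 1.0)]
n = len(rows[0])

S = [[covariance(rows, i, j) for j in range(n)] for i in range(n)]
R = [[correlation(rows, i, j) for j in range(n)] for i in range(n)]

for row in R:
    print([round(value, 3) for value in row])
```

Attributes 0 and 1 have different covariances with other attributes because of their different scales, but their mutual correlation is 1 since one is a linear function of the other.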

2 Visualization
2.1 General Concepts: Representing Data as Graphical Elements
When mapping a data object to a graph, the general considerations are:
1. If the object contains only a single attribute, the attribute can be
mapped according to its type. For ordered attributes (continuous or ordinal),
ordered graphical features such as an axis should be used. For categorical
attributes, each category should be mapped to a distinct representation such as
a color or position.
2. If the object has multiple attributes, it can be represented as a row or column
in a table, or as a line in a graph.
3. Objects with multiple attributes can also be represented as points in a
2D or 3D plot. If there are more than two or three attributes, we
need to select a subset of them first.
4. Relationships between data objects can usually be shown by standard graph
representations, such as nodes and edges. Sometimes the relationship can
also be shown implicitly, for example as distance in a 2D coordinate
system or on a real-world map.

2.2 Plot Examples


Some common plotting techniques are shown in this section.

1. Stem and Leaf Plots: They show the distribution of one-dimensional
integer or continuous data, as shown in Figure 1. However, they scale
poorly and are practical only for small integers.

Stem | Leaf
1    | 1 1 2 3 3 4 4
1    | 5 6 6 8
2    | 0 3
2    | 7 8
3    |
3    | 5 7 8 8
4    | 0 0 0 1 2 4 4 4
4    | 5 5 6 7 7 7 8 8 9

Figure 1: Stem and leaf plot (1|1 = 1.1)

2. Histogram: It shows the distribution of the values of a single attribute by
dividing the objects into bins and showing the number of objects that fall into
each bin. For categorical attributes, each category can be a bin. For continuous
attributes, the value range is divided into intervals that serve as bins, as shown in Figure 2.
A Pareto histogram sorts the category bins by their counts.
Sometimes a 3D histogram is used to show the counts for two attributes
together.

Figure 2: Histogram (x-axis: attribute value; y-axis: object count)
3. Box Plots: This is another method for displaying the distribution of a single
continuous attribute. It shows the 90th, 75th, 50th, 25th and 10th
percentiles of the data. An example is shown in Figure 3.
4. Pie Chart: It shows the distribution of the categories of a categorical
attribute, i.e., the percentage of the data that each category takes up.
An example is shown in Figure 4.
5. Percentile Plots and Empirical Cumulative Distribution Functions
(ECDF): These show the distribution of an ordered attribute
more quantitatively.
A Cumulative Distribution Function (CDF) gives, for each value x,
the percentage of the data that is less than x.
An Empirical Cumulative Distribution Function (ECDF) shows, for each
observed value, the fraction of points that are less than this value. Since
the number of observed values is finite, it is a step function.
An example of an ECDF is shown in Figure 5.
A percentile plot, on the other hand, draws P(x), the value of the
xth percentile, for all percentiles x.

Figure 3: Box Plot

Figure 4: Pie Chart

Figure 5: Empirical Cumulative Distribution Function

6. Scatter Plots: Each data object is drawn as a point in a 2D or 3D
coordinate system, which shows the relationships between data objects.
If the data has a categorical attribute, it can be shown through different
marker styles. An example is shown in Figure 6.

Figure 6: Scatter Plot

7. Contour Plots: They are useful for displaying continuous values associated
with geographical locations, such as temperature or pressure. An example is
shown in Figure 7.

Figure 7: Contour Plot

8. Surface Plots: They can be used to plot geographical information or
mathematical functions. An example is shown in Figure 8.
9. Vector Field Plots: They can show quantities that have both magnitude
and direction, such as wind flow or gradients. An
example is shown in Figure 9.

10. Heatmap: We can visualize an m × n matrix by treating each
value as a point and setting the brightness/color of the point according to
its value. An example of a heatmap is shown in Figure 10.
11. Parallel Coordinates: Each attribute gets its own position on the x-axis, and
each data object is plotted as a line connecting its attribute values. This can
sometimes reveal interesting patterns among attributes or among different classes
of data objects. An example is shown in Figure 11.
12. Radar Plot: It can represent the multi-dimensional data of a
single data object. Each axis shows the value of one attribute. An
example is shown in Figure 12.

Figure 8: Surface Plot

Figure 9: Vector Field Plot

Figure 10: Heatmap for data matrix

Figure 11: Parallel Coordinates

Figure 12: Radar Plot
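The numbers behind two of the plots in this section, the histogram and the ECDF, can be computed without a plotting library. The sketch below is one possible way to do it, using equal-width bins and made-up data. It also uses the common "less than or equal to" convention for the ECDF so that the curve reaches 1 at the largest value, which is an assumption that differs slightly from the "less than" wording used in these notes. The function names are hypothetical.

```python
from bisect import bisect_right

def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts

def ecdf(values):
    """Return (value, fraction of points <= value) for each distinct observed value."""
    ordered = sorted(values)
    m = len(ordered)
    return [(v, bisect_right(ordered, v) / m) for v in sorted(set(ordered))]

# Hypothetical observations of a single attribute.
values = [1, 2, 2, 3, 5]
print(histogram_counts(values, low=0, high=6, n_bins=3))
print(ecdf(values))
```

The ECDF output is a list of step positions and heights; drawing it as a step function gives a plot like Figure 5.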

3 Multidimensional Data Analysis
If we view the data as a multidimensional array, we can aggregate it in
various ways and analyze the results.
As an example, assume transaction data with date, product, store and
transaction amount. We can perform the following manipulations on the
multidimensional array:

1. Select a target quantity that we want to investigate. It
could be the total transaction amount, the number of distinct products
or stores involved, etc.

2. Select the relevant attributes/dimensions. For example, we may want to
see the transaction amount for each store, or the transaction amount on
each date for each store.
3. For the remaining attributes, perform dimension reduction
using aggregation. For example, to see the transaction amount
on each date for each product, we sum the transactions of all stores
for each product on each date.
If we aggregate over k of the n dimensions, we obtain data with n − k
dimensions. Aggregating over all dimensions except two, as in the previous
example, is called pivoting.

4. Sometimes we want to select the data with a specified value for an attribute
(slicing), or with attribute values in a specified range (dicing).
5. Sometimes we want to aggregate over rows. For example, instead of daily
sales, we may want weekly or monthly sales. This row aggregation is called
roll-up. Conversely, splitting aggregated data into more detailed
data, such as going from monthly data to daily data, is called drill-down.
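The operations above can be sketched on a handful of made-up transaction records. The record layout `(date, product, store, amount)`, the `SALES` data and the helper names `aggregate` and `roll_up_to_month` are all assumptions for illustration, not part of the notes.

```python
from collections import defaultdict

# Hypothetical transaction records: (date, product, store, amount).
SALES = [
    ("2020-11-01", "pen", "A", 3.0),
    ("2020-11-01", "pen", "B", 2.0),
    ("2020-11-02", "ink", "A", 5.0),
    ("2020-12-01", "pen", "A", 4.0),
]
DIMS = {"date": 0, "product": 1, "store": 2}  # position of each dimension in a record

def aggregate(records, keep):
    """Sum the amount over every dimension that is not listed in `keep`."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[DIMS[name]] for name in keep)
        totals[key] += record[3]
    return dict(totals)

def roll_up_to_month(records):
    """Replace each daily date by its month (roll-up along the date dimension)."""
    return [(date[:7], product, store, amount)
            for date, product, store, amount in records]

# Pivoting: keep two dimensions, aggregate over the rest (here, store).
by_date_product = aggregate(SALES, ["date", "product"])
# Slicing: fix one attribute to a single value.
store_a = [r for r in SALES if r[2] == "A"]
# Dicing: restrict an attribute to a range of values.
november = [r for r in SALES if "2020-11-01" <= r[0] <= "2020-11-30"]
# Roll-up: daily sales aggregated to monthly sales.
monthly = aggregate(roll_up_to_month(SALES), ["date"])

print(by_date_product)
print(monthly)
```

Drill-down is the reverse direction and needs the detailed records, so it cannot be recovered from `monthly` alone.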
