Data Mining Notes C3
1 Summary Statistics
Summary statistics are quantities, such as the mean and standard deviation,
that capture various characteristics of a potentially large set of values with a
single number or a small set of numbers.
1.2 Percentiles
Percentiles are useful for showing the distribution of an ordered attribute.
The pth percentile $x_p$ is a value of x such that p% of the observed values of x
are less than $x_p$. For example, $x_{50\%} = 3.0$ means that 50% of the observed
values of x are less than 3.0.
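As a minimal sketch, the definition can be turned into code with the nearest-rank rule. This is only one of several common conventions (others interpolate between neighbouring values), and it counts values at or below the percentile rather than strictly below it. The function name `percentile` and the sample data are illustrative assumptions.

```python
import math

def percentile(values, p):
    """Return the p-th percentile of values using the nearest-rank rule.

    Assumption: nearest-rank is one of several conventions; interpolating
    methods can give slightly different answers on small samples.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Smallest rank k such that at least p% of the values lie at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

data = [2.0, 1.0, 5.0, 3.0, 4.0]   # hypothetical sample
median = percentile(data, 50)       # middle value of the sorted data
```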
1.4 Range and Variance
Range and variance measure the spread of an attribute.
$$\mathrm{variance}(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \tag{5}$$
Variance is particularly sensitive to outliers, because the difference between
each value and the mean is squared.
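The outlier sensitivity can be checked numerically. The sketch below computes the range (largest value minus smallest value) and the sample variance with the 1/(m - 1) normalisation of equation (5). The function names and both data sets are hypothetical.

```python
import statistics

def value_range(values):
    """Range: difference between the largest and smallest value."""
    return max(values) - min(values)

def sample_variance(values):
    """Sample variance with the 1/(m - 1) normalisation of equation (5)."""
    m = len(values)
    if m < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / m
    return sum((x - mean) ** 2 for x in values) / (m - 1)

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]   # hypothetical data with one outlier

var_clean = sample_variance(clean)
var_outlier = sample_variance(with_outlier)
# statistics.variance from the standard library also divides by m - 1.
stdlib_var = statistics.variance(clean)
```

Replacing a single value with an outlier moves the variance from 2.5 to 1902.5 on this data, which illustrates how strongly the squared term reacts.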
Other more robust measures of spread are:
1. Absolute average deviation:
$$\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}| \tag{6}$$
3. Interquartile range:
$$\mathrm{IR}(x) = x_{75\%} - x_{25\%} \tag{8}$$
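A short sketch of the absolute average deviation and the interquartile range follows. The percentiles are computed with the nearest-rank rule, which is an assumed convention, and the function names and data are illustrative only.

```python
import math

def aad(values):
    """Absolute average deviation: mean absolute distance from the mean."""
    m = len(values)
    mean = sum(values) / m
    return sum(abs(x - mean) for x in values) / m

def nearest_rank_percentile(values, p):
    """p-th percentile by the nearest-rank rule (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def interquartile_range(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    upper = nearest_rank_percentile(values, 75)
    lower = nearest_rank_percentile(values, 25)
    return upper - lower

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]   # hypothetical data with one outlier

aad_clean, aad_outlier = aad(clean), aad(with_outlier)
iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(with_outlier)
```

On this data the outlier leaves the interquartile range unchanged at 2, while the absolute average deviation still grows, though without the squaring that inflates the variance.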
2 Visualization
2.1 General Concepts: Representing data as graphical elements
When mapping a data object to a graph, the general considerations are:
1. If the object contains only a single attribute, the attribute can be
mapped based on its type. For ordered continuous or ordinal attributes,
ordered graphical features such as an axis should be used. For categorical
attributes, each category should be mapped to a distinct representation
such as a color or position.
2. If the object has multiple attributes, it can be represented as a row or
column in a table, or as a line in a graph.
3. Objects with multiple attributes can also be represented as a point in a
2D or 3D diagram if the number of attributes is 2 or 3. To use this
representation with more attributes, we may need to select a subset of them.
4. Relationships between data objects can usually be shown by standard graph
representations, such as nodes and edges. Sometimes the relationship can
also be shown implicitly, for example as distance in a 2D coordinate
system or on a real-world map.
Stem | Leaf
1    | 1 1 2 3 3 4 4
1    | 5 6 6 8
2    | 0 3
2    | 7 8
3    |
3    | 5 7 8 8
4    | 0 0 0 1 2 4 4 4
4    | 5 5 6 7 7 7 8 8 9
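A display like the one above can be generated mechanically. The sketch below assumes non-negative integers with the units digit as the leaf and the remaining digits as the stem, and it assumes each stem is split into two rows (leaves 0 to 4, then 5 to 9). The function name `stem_and_leaf` is illustrative, and the input list is read off the display under those assumptions.

```python
from collections import defaultdict

def stem_and_leaf(values, split=True):
    """Build stem-and-leaf rows for non-negative integers.

    The leaf is the units digit and the stem is everything before it. With
    split=True each stem gets two rows: leaves 0-4, then leaves 5-9.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(leaf)
    if not rows:
        return []
    first, last = min(rows), max(rows)
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for half in halves:
            key = (stem, half)
            if key < first or key > last:
                continue  # skip empty rows outside the data range
            leaves = " ".join(str(leaf) for leaf in rows.get(key, []))
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Values assumed to underlie the display above.
values = [11, 11, 12, 13, 13, 14, 14, 15, 16, 16, 18, 20, 23, 27, 28,
          35, 37, 38, 38, 40, 40, 40, 41, 42, 44, 44, 44,
          45, 45, 46, 47, 47, 47, 48, 48, 49]
lines = stem_and_leaf(values)
```

Rows inside the data range that have no leaves are kept, so gaps in the distribution stay visible.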
Figure 2: Histogram (x-axis: attribute value; y-axis: object count)
the bin. For categorical attributes, each type can be a bin. For continuous
attributes, some arbitrary interval can be a bin, as shown in Figure 2.
A Pareto histogram sorts the category bins by their counts.
Sometimes we can also use a 3-d histogram to show the counts for two
attributes together.
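The counting behind a histogram can be sketched as follows. Equal-width bins over a chosen range are one possible choice of interval, and descending order is assumed for the Pareto ordering. The function names and sample values are hypothetical.

```python
from collections import Counter

def histogram_counts(values, low, high, bins):
    """Count values in `bins` equal-width intervals covering [low, high]."""
    if bins <= 0 or high <= low:
        raise ValueError("need bins > 0 and high > low")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the chosen range
        # A value equal to `high` is folded into the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

def pareto_order(categories):
    """Category counts sorted from most to least frequent."""
    return Counter(categories).most_common()

# Continuous attribute: five bins of width 5 over [0, 25].
counts = histogram_counts([1, 2, 2, 7, 12, 13, 24, 25], low=0, high=25, bins=5)
# Categorical attribute: one bin per category, sorted by count.
ranked = pareto_order(["b", "a", "c", "a", "b", "a"])
```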
3. Box Plots: These are another method for displaying the distribution of a
single continuous attribute. A box plot shows the 90th, 75th, 50th, 25th and
10th percentiles of the data. An example is shown in Figure 3.
4. Pie Chart: It shows the distribution of the categories of a categorical
attribute, that is, what percentage of the data each category takes up.
An example is shown in Figure 4.
5. Percentile plots and Empirical Cumulative Distribution Functions
(ECDF): These show the distribution of an ordered attribute more
quantitatively.
A Cumulative Distribution Function (CDF) gives, for each value x, the
percentage of data less than x, written CDF(x).
An Empirical Cumulative Distribution Function (ECDF) shows, for each
observed value, the fraction of points that are less than it. Since the
number of observed values is finite, it is a step function.
An example of an ECDF is shown in Figure 5.
A percentile plot, on the other hand, draws P(x), the value of the xth
percentile, for all percentiles.
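The step structure of an ECDF can be made concrete with a small sketch. It assumes the common convention of counting points less than or equal to each observed value; a strict "less than" variant would shift each step. The function name `ecdf` and the sample data are illustrative.

```python
def ecdf(values):
    """Return the ECDF as (x, fraction) pairs, one step per distinct value.

    Assumption: the fraction counts points less than or equal to x; a
    strict "less than" variant would shift each step.
    """
    ordered = sorted(values)
    n = len(ordered)
    steps = []
    for i, x in enumerate(ordered, start=1):
        if steps and steps[-1][0] == x:
            steps[-1] = (x, i / n)   # tied values share one step
        else:
            steps.append((x, i / n))
    return steps

steps = ecdf([3, 1, 2, 2])   # hypothetical sample
```

Each pair marks a jump of the step function. Reading the pairs the other way round, from fraction to value, corresponds to the percentile plot.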
6. Scatter Plots: A scatter plot draws each data object as a point in a 2-d or
3-d coordinate system. It can show the relationship between data objects.
If the data has a categorical attribute, its categories can be shown as
different styles of dots. An example is shown in Figure 6.
Figure 7: Contour Plot
Figure 8: Surface Plot
Figure 10: Heatmap for data matrix
Figure 12: Radar Plot
3 Multidimensional Data Analysis
If we view the data as a multi-dimensional array, we can aggregate it in
various ways and analyze the results accordingly.
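Aggregation over such a view can be sketched with plain dictionaries. The records, the dimension names and the helper `aggregate` below are all hypothetical; the sketch only illustrates summing a measure over the dimensions that are not kept.

```python
from collections import defaultdict

# Hypothetical transactions: (date, product, store, amount).
transactions = [
    ("2024-01-01", "pen", "north", 10.0),
    ("2024-01-01", "ink", "north", 5.0),
    ("2024-01-02", "pen", "south", 7.5),
    ("2024-01-02", "pen", "north", 2.5),
]

# Position of each dimension inside a record (an assumed layout).
DIMENSIONS = {"date": 0, "product": 1, "store": 2}

def aggregate(records, keep):
    """Sum amounts over every dimension not listed in `keep`."""
    idx = [DIMENSIONS[name] for name in keep]
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[i] for i in idx)
        totals[key] += record[3]
    return dict(totals)

by_product = aggregate(transactions, ["product"])
by_store_and_date = aggregate(transactions, ["store", "date"])
grand_total = aggregate(transactions, [])
```

Keeping fewer dimensions gives a coarser summary, and keeping none collapses the whole array to a single total.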
As an example, assume transaction data with a date, product, store and
transaction amount. We could perform the following manipulations on the
multi-dimensional array: