02data Part4
http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
University of Management and Technology
Fall 2016
Data Mining:
Concepts and Techniques
— Chapter 2 —
https://www2.stat.duke.edu/courses/Fall98/sta110b/minitab/mean-var.html
http://www.mathsisfun.com/data/standard-deviation.html
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
Similarity and Dissimilarity
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Data Matrix and Dissimilarity Matrix
Data matrix (object-by-attribute structure): used to store the data objects
This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes)
Proximity Measure for Nominal Attributes
$d(i, j) = \frac{p - m}{p}$
m: # of matches (variables on which objects i and j take the same value)
p: total # of variables
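The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `nominal_dissimilarity` and the two sample objects are hypothetical and do not come from the slides.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Return (p - m) / p for two equal-length sequences of nominal values.

    p is the total number of variables and m is the number of matches.
    """
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    # Count the attributes on which both objects take the same value.
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


# Hypothetical objects described by three nominal attributes.
a = ("red", "circle", "small")
b = ("red", "square", "small")

# One mismatch out of three attributes.
print(nominal_dissimilarity(a, b))
```

Identical objects give 0 and objects that differ on every attribute give 1, so the value can be read as the fraction of mismatched attributes.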
Example
Proximity Measure for Binary Attributes
Dissimilarity between Binary Variables
Example
Dissimilarity of Numeric Data: Euclidean Distance
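Euclidean Distance: $d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$
for two p-dimensional data objects i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp)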
Example:
Data Matrix and Dissimilarity Matrix
Let x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5)
[Figure: scatter plot of points x1 to x4]
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix (with Euclidean Distance)
       x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0
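As a rough check, the lower-triangular dissimilarity matrix for these four points can be recomputed with a short Python sketch. The helper name `euclidean` and the dictionary layout are illustrative choices, not part of the slides.

```python
import math

# The four example points (attribute1, attribute2) from the data matrix.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


names = list(points)
# Build only the lower triangle, since the matrix is symmetric
# and the diagonal is zero.
matrix = [
    [round(euclidean(points[row], points[col]), 2) for col in names[: i + 1]]
    for i, row in enumerate(names)
]

for name, row in zip(names, matrix):
    print(name, row)
```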
Dissimilarity of Numeric Data: Minkowski Distance
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
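A minimal Python sketch of the Minkowski distance follows. The function name `minkowski` is a hypothetical choice, and the guard `h >= 1` is an assumption added here (it is the usual condition for the distance to behave as a norm), not something stated on the slide.

```python
def minkowski(i, j, h):
    """Minkowski distance of order h between two equal-length numeric tuples."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of attributes")
    if h < 1:
        # Assumption: restrict to h >= 1 so the result is a proper norm.
        raise ValueError("h must be >= 1")
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1.0 / h)


x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))  # h = 1: Manhattan (L1) distance
print(minkowski(x1, x2, 2))  # h = 2: Euclidean (L2) distance
```

Setting h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance, so one function covers both cases.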
Properties
Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric.
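The three properties can be spot-checked numerically on a small set of points. The sketch below is only a brute-force check on the four example points used elsewhere in this chapter, not a proof; the function names and the tolerance `tol` are illustrative assumptions.

```python
import itertools
import math

# Four distinct example points (attribute1, attribute2).
points = [(1, 2), (3, 5), (2, 0), (4, 5)]


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def satisfies_metric_properties(dist, pts, tol=1e-9):
    """Check positive definiteness, symmetry and the triangle inequality on pts."""
    for a, b in itertools.product(pts, repeat=2):
        d = dist(a, b)
        if a == b and abs(d) > tol:
            return False  # d(i, i) must be 0
        if a != b and d <= 0:
            return False  # d(i, j) must be > 0 for i != j
        if abs(d - dist(b, a)) > tol:
            return False  # symmetry
    for a, b, c in itertools.product(pts, repeat=3):
        if dist(a, b) > dist(a, c) + dist(c, b) + tol:
            return False  # triangle inequality
    return True


print(satisfies_metric_properties(manhattan, points))
print(satisfies_metric_properties(euclidean, points))
```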
Example: Minkowski Distance
Dissimilarity Matrices
[Figure: scatter plot of points x1 to x4]
Data Matrix
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Manhattan (L1)
L      x1     x2     x3     x4
x1     0
x2     5      0
x3     3      6      0
x4     6      1      7      0
Euclidean (L2)
L2     x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0
Ordinal Variables
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document.
Example: Cosine Similarity
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
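The arithmetic above can be reproduced with a short Python sketch. The function name `cosine_similarity` is a hypothetical choice, and raising an error for a zero-length vector is an assumption made here because the ratio is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: treat a zero vector as an error rather than returning 0.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Term-frequency vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```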
Example 2.23
The measure computes the cosine of the angle between vectors x and y. A
cosine value of 0 means that the two vectors are at 90 degrees to
each other (orthogonal) and have no match.
The closer the cosine value is to 1, the smaller the angle and the greater the
match between the vectors.
Therefore, if we were using the cosine similarity measure to compare these
documents, they would be considered quite similar.
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
Many methods have been developed, but this is still an active area of research.