02data Part4

This document summarizes Chapter 2 from the book "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber, and Jian Pei. Chapter 2 discusses getting to know your data, including data objects and attribute types, basic statistical descriptions of data, data visualization, and measuring data similarity and dissimilarity. It describes different ways to calculate similarity and dissimilarity between data objects, including using proximity measures for nominal, binary, and numeric attribute types. Examples are provided to illustrate dissimilarity matrices calculated using Euclidean and Minkowski distances.

Uploaded by

baigsalman251

Data Mining

Dr. Shahid Mahmood Awan

http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
University of Management and Technology

Fall 2016
Data Mining:
Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
• https://en.wikipedia.org/wiki/Standard_deviation
• https://www2.stat.duke.edu/courses/Fall98/sta110b/minitab/mean-var.html
• http://www.mathsisfun.com/data/standard-deviation.html

Chapter 2: Getting to Know Your Data
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

Similarity and Dissimilarity
• In applications such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another.
• A store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age).
• A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters.

Similarity and Dissimilarity
• Similarity
  - Numerical measure of how alike two data objects are
  - Value is higher when objects are more alike
  - Often falls in the range [0, 1]
  - 1 means the objects are identical; 0 means they are completely unalike

Similarity and Dissimilarity
• Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0 (identical objects)
  - Upper limit varies
• Proximity refers to either a similarity or a dissimilarity (closeness or nearness in space, time, or relationship).

Data Matrix and Dissimilarity Matrix
• Data matrix (object-by-attribute structure), used to store the data objects
  - Stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes)
  - n data points with p dimensions
  - Each row corresponds to an object
  - Two modes: two kinds of entities or "things," namely rows (for objects) and columns (for attributes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Data Matrix and Dissimilarity Matrix
• Dissimilarity matrix (object-by-object structure), used to store dissimilarity values for pairs of objects
  - Represented by an n-by-n table
  - n data points, but registers only the distance d(i, j) between each pair
  - A triangular matrix
  - Symmetric: d(i, j) = d(j, i)
  - Single mode (dissimilarities)
  - Note that d(i, i) = 0; the difference between an object and itself is 0.

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]

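The relationship between the two structures can be sketched in code. The following is a minimal illustration, not the slides' own implementation: it turns an n-by-p data matrix into the lower triangle of an n-by-n dissimilarity matrix. The function names, the sample data, and the placeholder distance `abs_diff_sum` are all assumptions made for the example, and indices are 0-based here whereas the slides count from 1.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dissimilarity_matrix(
    data: List[Vector], dist: Callable[[Vector, Vector], float]
) -> List[List[float]]:
    """Return the lower triangle of the n-by-n dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i). The upper triangle is omitted
    because d(i, j) = d(j, i).
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(len(data))]


def lookup(matrix: List[List[float]], i: int, j: int) -> float:
    """Read d(i, j) from the lower triangle, relying on symmetry."""
    return matrix[i][j] if i >= j else matrix[j][i]


def abs_diff_sum(a: Vector, b: Vector) -> float:
    """Placeholder distance (assumption): sum of absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data matrix: n = 3 objects (rows), p = 2 attributes (columns).
data_matrix = [(1.0, 2.0), (3.0, 5.0), (2.0, 0.0)]

tri = dissimilarity_matrix(data_matrix, abs_diff_sum)
for row in tri:
    print(row)
```

Storing only the lower triangle keeps n(n + 1)/2 entries instead of n², which is why the slide calls the matrix triangular and single-mode.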
• Similarity can be derived from dissimilarity, and vice versa; for example, when d(i, j) lies in [0, 1], sim(i, j) = 1 - d(i, j).

Proximity Measure for Nominal Attributes
• A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  - m: number of matches, i.e., the number of attributes for which i and j are in the same state
  - p: total number of attributes

\[
d(i, j) = \frac{p - m}{p}
\]

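As a rough illustration of simple matching, the sketch below counts the matching attributes m and returns (p - m) / p. The function name and the three sample objects are hypothetical and are not taken from the slides.

```python
from typing import Sequence


def nominal_dissimilarity(obj_i: Sequence[str], obj_j: Sequence[str]) -> float:
    """Simple matching dissimilarity: d(i, j) = (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    # m counts the attributes on which both objects are in the same state.
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


# Hypothetical objects described by three nominal attributes.
shirt_a = ("red", "cotton", "large")
shirt_b = ("red", "wool", "large")
shirt_c = ("blue", "silk", "small")

print(nominal_dissimilarity(shirt_a, shirt_b))  # two of three attributes match
print(nominal_dissimilarity(shirt_a, shirt_c))  # no attribute matches
```

Here `shirt_a` and `shirt_b` agree on two of three attributes, so m = 2 and d = 1/3; `shirt_a` and `shirt_c` agree on none, so d = 1.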
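Proximity Measure for Binary Attributes
• A contingency table for binary data (counts taken over the attributes of objects i and j):

                     object j
                     1        0        sum
  object i    1      q        r        q + r
              0      s        t        s + t
              sum    q + s    r + t    p

• The total number of attributes is p, where p = q + r + s + t.
• Distance measure for symmetric binary variables:

\[
d(i, j) = \frac{r + s}{q + r + s + t}
\]

• Distance measure for asymmetric binary variables (the negative matches t are ignored):

\[
d(i, j) = \frac{r + s}{q + r + s}
\]
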
Proximity Measure for Binary Attributes
• Jaccard coefficient (similarity measure for asymmetric binary variables):

\[
sim_{Jaccard}(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)
\]

Dissimilarity between Binary Variables
• Example
  - Gender is a symmetric attribute
  - The remaining attributes are asymmetric binary
  - Let the values Y and P be 1, and the value N be 0

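The binary measures can be sketched as follows. The slide's data table did not survive, so the three records below are hypothetical stand-ins with six asymmetric attributes each, encoded with Y and P as 1 and N as 0 as the slide prescribes. The function names are assumptions, as is the choice to return 0 when two objects have no positive values at all.

```python
from typing import List, Sequence, Tuple


def encode(record: str) -> List[int]:
    """Map Y and P to 1 and N to 0, as stated on the slide."""
    return [1 if value in ("Y", "P") else 0 for value in record.split()]


def contingency(i: Sequence[int], j: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count q (1,1), r (1,0), s (0,1), and t (0,0) over paired attributes."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(i: Sequence[int], j: Sequence[int]) -> float:
    """d(i, j) = (r + s) / (q + r + s + t)."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(i: Sequence[int], j: Sequence[int]) -> float:
    """d(i, j) = (r + s) / (q + r + s); negative matches t are ignored."""
    q, r, s, _ = contingency(i, j)
    denominator = q + r + s
    # Assumption: two objects with no positive values are treated as identical.
    return (r + s) / denominator if denominator else 0.0


def jaccard(i: Sequence[int], j: Sequence[int]) -> float:
    """sim(i, j) = q / (q + r + s) = 1 - asymmetric dissimilarity."""
    return 1.0 - asymmetric_dissimilarity(i, j)


# Hypothetical records: six asymmetric binary attributes per patient.
patient_a = encode("Y N P N N N")
patient_b = encode("Y N P N P N")
patient_c = encode("Y P N N N N")

print(asymmetric_dissimilarity(patient_a, patient_b))
print(asymmetric_dissimilarity(patient_a, patient_c))
print(asymmetric_dissimilarity(patient_c, patient_b))
```

For `patient_a` and `patient_b` the counts are q = 2, r = 0, s = 1, giving d = 1/3; the other two pairs give 2/3 and 3/4, so under this measure `patient_a` and `patient_b` are the most alike.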
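Dissimilarity of Numeric Data: Euclidean Distance
• For two p-dimensional objects i = (x_{i1}, ..., x_{ip}) and j = (x_{j1}, ..., x_{jp}):
• Euclidean distance:

\[
d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}
\]

• Manhattan (or city block) distance:

\[
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
\]
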
Example: Data Matrix and Dissimilarity Matrix
• Let x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5)

Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean Distance)
        x1      x2      x3      x4
x1      0
x2      3.61    0
x3      2.24    5.1     0
x4      4.24    1       5.39    0

[Figure: scatter plot of the points x1, x2, x3, and x4]

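The Euclidean dissimilarity matrix for the four example points can be recomputed with a few lines of Python. This is only a sketch: it assumes Python 3.8 or later for `math.dist`, and the variable names are chosen for the example. Rounding to two decimals matches the precision used in the table, e.g. d(x1, x3) = √5 ≈ 2.24.

```python
import math

# The four points from the example, keyed by their labels.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
labels = list(points)

# Lower triangle of the Euclidean dissimilarity matrix, rounded to 2 decimals.
euclidean_matrix = [
    [round(math.dist(points[a], points[b]), 2) for b in labels[: row + 1]]
    for row, a in enumerate(labels)
]

for label, row in zip(labels, euclidean_matrix):
    print(label, row)
```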
Dissimilarity of Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure that generalizes the Euclidean and Manhattan distances. It is defined as:

\[
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
\]

  where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm).
• It represents the Manhattan distance when h = 1 (i.e., L1 norm) and the Euclidean distance when h = 2 (i.e., L2 norm).

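A minimal sketch of the Minkowski distance shows how one function covers both special cases. The function name and the argument check are assumptions: restricting h to values of at least 1 is the usual condition under which the measure is a metric, and the slides themselves only call h "the order".

```python
from typing import Sequence


def minkowski(i: Sequence[float], j: Sequence[float], h: float) -> float:
    """Minkowski distance of order h between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if h < 1:
        # Assumption: only orders h >= 1 are accepted in this sketch.
        raise ValueError("h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1.0 / h)


x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))  # Manhattan (L1): |1 - 3| + |2 - 5|
print(minkowski(x1, x2, 2))  # Euclidean (L2): sqrt(2^2 + 3^2)
```

With h = 1 the call returns 5, and with h = 2 it returns √13 ≈ 3.61, the same values listed for x1 and x2 in the example dissimilarity matrices.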
• Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric.

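The three properties can be checked numerically on a finite set of points. The sketch below is a spot check, not a proof: `is_metric_on` and the sample point sets are assumptions made for illustration. As a contrast that is not on the slides, squared Euclidean distance is included because it is a standard example of a measure that fails the triangle inequality.

```python
import itertools
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def manhattan(i: Point, j: Point) -> float:
    return sum(abs(a - b) for a, b in zip(i, j))


def squared_euclidean(i: Point, j: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(i, j))


def is_metric_on(
    points: Sequence[Point], dist: Callable[[Point, Point], float], tol: float = 1e-12
) -> bool:
    """Check the three metric properties on distinct sample points only."""
    for i in points:
        if abs(dist(i, i)) > tol:  # d(i, i) = 0
            return False
    for i, j in itertools.permutations(points, 2):
        if dist(i, j) <= 0:  # d(i, j) > 0 for i != j
            return False
        if abs(dist(i, j) - dist(j, i)) > tol:  # symmetry
            return False
    for i, j, k in itertools.permutations(points, 3):
        if dist(i, j) > dist(i, k) + dist(k, j) + tol:  # triangle inequality
            return False
    return True


sample = [(1, 2), (3, 5), (2, 0), (4, 5)]
collinear = [(0, 0), (1, 0), (2, 0)]

print(is_metric_on(sample, math.dist))             # Euclidean
print(is_metric_on(sample, manhattan))             # Manhattan
print(is_metric_on(collinear, squared_euclidean))  # fails: 4 > 1 + 1
```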
Example: Minkowski Distance
Dissimilarity Matrices

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
L1      x1      x2      x3      x4
x1      0
x2      5       0
x3      3       6       0
x4      6       1       7       0

Euclidean (L2)
L2      x1      x2      x3      x4
x1      0
x2      3.61    0
x3      2.24    5.1     0
x4      4.24    1       5.39    0

[Figure: scatter plot of the points x1, x2, x3, and x4]

Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables:
  - Replace each value x_{if} by its rank r_{if} ∈ {1, ..., M_f}
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

\[
z_{if} = \frac{r_{if} - 1}{M_f - 1}
\]

  - Compute the dissimilarity using methods for interval-scaled variables

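The rank-and-rescale steps for an ordinal variable might look like the sketch below. The state names, the observed values, and the function name are hypothetical. The final line uses an absolute difference as the interval-scaled distance for a single attribute, which is one possible choice among those covered earlier.

```python
from typing import List, Sequence


def normalize_ordinal(values: Sequence[str], ordered_states: Sequence[str]) -> List[float]:
    """Replace each value by its rank r in 1..M, then map to z = (r - 1) / (M - 1)."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("an ordinal variable needs at least two states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[value] - 1) / (m_f - 1) for value in values]


# Hypothetical ordinal variable with M = 3 states, listed from lowest to highest.
states = ["fair", "good", "excellent"]
observed = ["excellent", "fair", "good", "excellent"]

z = normalize_ordinal(observed, states)
print(z)

# Dissimilarity between the first two objects on this single attribute.
print(abs(z[0] - z[1]))
```

The ranks 3, 1, 2, 3 become 1.0, 0.0, 0.5, 1.0, so "excellent" and "fair" end up at the maximum dissimilarity of 1 while "good" sits halfway between them.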
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Other vector objects: gene features in micro-arrays, ...
• Applications: information retrieval, biological taxonomy, gene feature mapping, ...
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

\[
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}
\]

  where \cdot indicates the vector dot product and \|d\| is the length of vector d.

Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94

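The worked example can be reproduced with a short function. This is a sketch under the assumption that neither vector is all zeros; the function name is chosen for illustration, while the vectors d1 and d2 are the two term-frequency vectors from the example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: the cosine is left undefined for an all-zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 25 / sqrt(42 * 17), about 0.94
```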
Example 2.23
• The measure computes the cosine of the angle between two vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match.
• The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.
• Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.

Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
  - Basic statistical data description: central tendency, dispersion, graphical displays
  - Data visualization: map data onto graphical primitives
  - Measuring data similarity
• The above steps are the beginning of data preprocessing.
• Many methods have been developed, but this is still an active area of research.

