02data Part4

This document summarizes Chapter 2 from the book "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber, and Jian Pei. Chapter 2 discusses getting to know your data, including data objects and attribute types, basic statistical descriptions of data, data visualization, and measuring data similarity and dissimilarity. It describes different ways to calculate similarity and dissimilarity between data objects, including using proximity measures for nominal, binary, and numeric attribute types. Examples are provided to illustrate dissimilarity matrices calculated using Euclidean and Minkowski distances.

Uploaded by

baigsalman251


Data Mining

Dr. Shahid Mahmood Awan

http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
[email protected]
University of Management and Technology

Fall 2016
Data Mining:
Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.

• https://en.wikipedia.org/wiki/Standard_deviation
• https://www2.stat.duke.edu/courses/Fall98/sta110b/minitab/mean-var.html
• http://www.mathsisfun.com/data/standard-deviation.html

Chapter 2: Getting to Know Your Data
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

Similarity and Dissimilarity
• In applications such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another.
• A store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age).
• A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters.

Similarity and Dissimilarity
• Similarity
  ◦ Numerical measure of how alike two data objects are
  ◦ Value is higher when objects are more alike
  ◦ Often falls in the range [0, 1]
  ◦ 1 means the objects are identical; 0 means they are completely unalike

Similarity and Dissimilarity
• Dissimilarity (e.g., distance)
  ◦ Numerical measure of how different two data objects are
  ◦ Lower when objects are more alike
  ◦ Minimum dissimilarity is often 0 (identical objects)
  ◦ Upper limit varies
• Proximity refers to either a similarity or a dissimilarity (closeness or nearness in space, time, or relationship)

Data Matrix and Dissimilarity Matrix
• Data matrix (object-by-attribute structure), used to store the data objects
  ◦ Stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes)
  ◦ n data points with p dimensions
  ◦ Each row corresponds to an object
  ◦ Two modes: two kinds of entities or "things," namely rows (for objects) and columns (for attributes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Data Matrix and Dissimilarity Matrix
• Dissimilarity matrix (object-by-object structure), used to store dissimilarity values for pairs of objects
  ◦ Represented by an n-by-n table
  ◦ n data points, but registers only the distance d(i, j) between objects i and j
  ◦ A triangular matrix
  ◦ Symmetric: d(i, j) = d(j, i)
  ◦ Single mode (dissimilarities)

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

• Note that d(i, i) = 0: the difference between an object and itself is 0.
• Similarity can be obtained from dissimilarity in other ways; for nominal attributes, for example, sim(i, j) = 1 − d(i, j).

Proximity Measure for Nominal Attributes
• A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  ◦ m: number of matches, i.e., the number of attributes for which i and j are in the same state
  ◦ p: total number of attributes (variables)

$$
d(i, j) = \frac{p - m}{p}
$$

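The simple matching formula can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name and the three-attribute objects (color, size, shape) are hypothetical, and each object is assumed to be a tuple of nominal values in the same attribute order.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Return d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)  # total number of attributes
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# Hypothetical objects described by (color, size, shape)
a = ("red", "small", "round")
b = ("red", "large", "round")

d_ab = simple_matching_dissimilarity(a, b)  # one mismatch out of three attributes
sim_ab = 1 - d_ab                           # similarity derived from dissimilarity
print(f"d = {d_ab:.2f}, sim = {sim_ab:.2f}")
```

Here the two objects differ only in size, so m = 2 and p = 3.
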
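Proximity Measure for Binary Attributes
• A contingency table for binary data, counting the attributes on which objects i and j take each combination of values:

            j = 1    j = 0    sum
i = 1       q        r        q + r
i = 0       s        t        s + t
sum         q + s    r + t    p

• The total number of attributes is p, where p = q + r + s + t.
• Distance measure for symmetric binary variables:

$$
d(i, j) = \frac{r + s}{q + r + s + t}
$$

• Distance measure for asymmetric binary variables (the negative matches t are ignored):

$$
d(i, j) = \frac{r + s}{q + r + s}
$$

• Jaccard coefficient (similarity measure for asymmetric binary variables):

$$
sim_{Jaccard}(i, j) = \frac{q}{q + r + s}
$$
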
Dissimilarity between Binary Variables
• Example
  ◦ Gender is a symmetric attribute
  ◦ The remaining attributes are asymmetric binary
  ◦ Let the values Y and P be 1, and the value N be 0

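The binary measures can be sketched as follows. The two 0/1 vectors are hypothetical (they are not the records from the slide's example table), and returning 0.0 when neither object has any 1 is an assumption made here to avoid division by zero, not a rule stated in the slides.

```python
def binary_counts(x, y):
    """Count q, r, s, t of the 2x2 contingency table for two 0/1 sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1 in both
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # 0 in both
    return q, r, s, t


def symmetric_binary_distance(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_distance(x, y):
    q, r, s, _ = binary_counts(x, y)  # negative matches t are ignored
    if q + r + s == 0:
        return 0.0  # assumption: no 1s in either object, treat as identical
    return (r + s) / (q + r + s)


def jaccard_similarity(x, y):
    return 1.0 - asymmetric_binary_distance(x, y)


# Hypothetical objects with six asymmetric binary attributes (Y/P -> 1, N -> 0)
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]

print(binary_counts(x, y))
print(symmetric_binary_distance(x, y))
print(asymmetric_binary_distance(x, y))
print(jaccard_similarity(x, y))
```

Because the many shared 0s count as agreement only in the symmetric measure, the symmetric distance comes out smaller than the asymmetric one for the same pair.
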
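Dissimilarity of Numeric Data: Euclidean Distance
• Let i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) be two objects described by p numeric attributes.
• Euclidean distance:

$$
d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}
$$

• Manhattan (or city block) distance:

$$
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$
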
Example: Data Matrix and Dissimilarity Matrix
• Let x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5)

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

[Figure: scatter plot of the four points x1 to x4]

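The Euclidean dissimilarity matrix for these four points can be reproduced with a short Python sketch. The function name and the dictionary layout are illustrative choices, not part of the slides; `math.dist` requires Python 3.8 or later.

```python
import math

# The four points from the example data matrix
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def euclidean_dissimilarity_matrix(data):
    """Return object names and the lower-triangular matrix of Euclidean distances."""
    names = list(data)
    matrix = []
    for i, name_i in enumerate(names):
        # Row i holds d(i, 1), ..., d(i, i); the last entry d(i, i) is 0.
        row = [math.dist(data[name_i], data[names[j]]) for j in range(i + 1)]
        matrix.append(row)
    return names, matrix


names, matrix = euclidean_dissimilarity_matrix(points)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{d:.2f}" for d in row))
```

Only the lower triangle is stored, since d(i, j) = d(j, i) and d(i, i) = 0.
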
Dissimilarity of Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure that generalizes the Euclidean and Manhattan distances.
• It is defined as:

$$
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
$$

  where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and h ≥ 1 is the order (the distance so defined is also called the L-h norm).
• It represents the Manhattan distance when h = 1 (i.e., L1 norm) and the Euclidean distance when h = 2 (i.e., L2 norm).

• Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
  ◦ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  ◦ d(i, j) = d(j, i) (symmetry)
  ◦ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric.

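The three properties can be spot-checked numerically. The sketch below tests them on the four example points only, so it is a sanity check on a finite sample rather than a proof. The helper names are hypothetical, and the squared Euclidean distance is added here purely as a contrast: it is not discussed in the slides.

```python
import itertools
import math

points = [(1, 2), (3, 5), (2, 0), (4, 5)]


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    # Contrast case (not from the slides): violates the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_properties(dist, pts, tol=1e-12):
    """Check positive definiteness, symmetry, and the triangle inequality on pts."""
    for i in pts:
        if abs(dist(i, i)) > tol:
            return False
    for i, j in itertools.permutations(pts, 2):
        if i != j and dist(i, j) <= 0:
            return False
        if abs(dist(i, j) - dist(j, i)) > tol:
            return False
    for i, j, k in itertools.product(pts, repeat=3):
        if dist(i, j) > dist(i, k) + dist(k, j) + tol:
            return False
    return True


print(satisfies_metric_properties(manhattan, points))
print(satisfies_metric_properties(euclidean, points))
print(satisfies_metric_properties(squared_euclidean, points))
```

For the squared Euclidean case, going from (1, 2) to (4, 5) directly costs 18, while the detour through (3, 5) costs 13 + 1 = 14, which breaks the triangle inequality.
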
Example: Minkowski Distance
Dissimilarity Matrices

Data Matrix
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

[Figure: scatter plot of the four points x1 to x4]

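Both matrices above follow from a single Minkowski function evaluated at h = 1 and h = 2. The following is a minimal sketch under that reading; the function names are illustrative, and rejecting h < 1 is a design choice made here.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric tuples."""
    if h < 1:
        raise ValueError("order h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


def lower_triangle(pts, h):
    """Lower-triangular dissimilarity matrix (diagonal included) for order h."""
    return [[minkowski(pts[i], pts[j], h) for j in range(i + 1)]
            for i in range(len(pts))]


points = [(1, 2), (3, 5), (2, 0), (4, 5)]  # x1, x2, x3, x4

l1 = lower_triangle(points, 1)  # Manhattan
l2 = lower_triangle(points, 2)  # Euclidean

for label, matrix in (("L1", l1), ("L2", l2)):
    print(label)
    for row in matrix:
        print("  ", " ".join(f"{d:.2f}" for d in row))
```
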
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables:
  ◦ Replace each value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  ◦ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

  ◦ Compute the dissimilarity using methods for interval-scaled variables

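The rank normalization can be sketched as below. The attribute states ("fair", "good", "excellent") and the observed values are hypothetical, and using the absolute difference of the normalized values as the single-attribute dissimilarity is one possible interval-scaled method, chosen here for brevity.

```python
def normalize_ordinal(values, ordered_states):
    """Map ordinal values onto [0, 1] using z = (r - 1) / (M - 1)."""
    m_f = len(ordered_states)  # number of ordered states M_f
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal attribute with M_f = 3 states, lowest to highest
states = ["fair", "good", "excellent"]
observed = ["excellent", "fair", "good", "excellent"]

z = normalize_ordinal(observed, states)
print(z)

# Dissimilarity on this one attribute, treated as interval-scaled
d_01 = abs(z[0] - z[1])
d_02 = abs(z[0] - z[2])
print(d_01, d_02)
```
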
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Other vector objects: gene features in micro-arrays, ...
• Applications: information retrieval, biological taxonomy, gene feature mapping, ...
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

$$
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}
$$

  where · indicates the vector dot product and ||d|| is the length of vector d.

Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94

• Example 2.23
  ◦ The measure computes the cosine of the angle between the two vectors. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match.
  ◦ The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.
  ◦ Therefore, if we were using the cosine similarity measure to compare these documents, they would be considered quite similar.

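The worked cosine example can be checked with a short Python sketch. The function name is illustrative, and raising an error for a zero vector is a choice made here, since the cosine is undefined when either length is 0.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Term-frequency vectors from the example
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94

# Orthogonal vectors share no terms and give a cosine of 0
print(cosine_similarity((1, 0), (0, 1)))
```
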
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
  ◦ Basic statistical data description: central tendency, dispersion, graphical displays
  ◦ Data visualization: map data onto graphical primitives
  ◦ Measure data similarity
• The above steps are the beginning of data preprocessing
• Many methods have been developed, but this is still an active area of research

