02 KnowYourData
Information Management
course
Teacher: Alberto Ceselli
Lecture 02 : 03/10/2012
Data Mining:
Concepts and
Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2012 Han, Kamber, and Pei. All rights
reserved.
Chapter 2: Getting to Know Your
Data
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix,
crosstabs
Document data: text documents:
term-frequency vector
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data: .bmp
Video data: .avi
Example of transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Important Characteristics of
Structured Data
Dimensionality
Curse of dimensionality
(the volume of the space grows fast with the number
of dimensions, and the available data becomes sparse)
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
Data Objects
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red,
white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome
(e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude
between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
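Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words
in a collection of documents
Sometimes, represented as integer variables
Note: binary attributes are a special case of
discrete attributes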
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured
as floating-point variables
Basic Statistical Descriptions of
Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance...
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple
granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed
cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example data (sorted):
i    x_i
4    50
5    52
6    52
7    56
8    60
9    63
10   70
11   70
12   110
Mean: 58
Median: (52+56)/2 = 54
Mode: 52 and 70 (bimodal)
Midrange: (30+110)/2 = 70
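The four measures can be checked with a minimal Python sketch. Only rows 4 to 12 of the data survive on this slide; the first three values (30, 36, 47) are an assumption taken from the textbook's salary example, chosen because they reproduce the stated mean of 58 and the minimum of 30 used in the midrange. The helper name `midrange` is introduced here for illustration.

```python
from statistics import mean, median, multimode

# Rows 4-12 come from the slide. The first three values (30, 36, 47) are an
# ASSUMPTION taken from the textbook example; they match mean 58 and min 30.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def midrange(values):
    """Average of the smallest and the largest value."""
    return (min(values) + max(values)) / 2


central = {
    "mean": mean(salaries),        # sum of values divided by n
    "median": median(salaries),    # n is even: average of the two middle values
    "modes": multimode(salaries),  # every value with the highest frequency
    "midrange": midrange(salaries),
}
print(central)
```

Note how the single large value 110 pulls the mean and the midrange above the median, which is why the median is often preferred for skewed data.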
Symmetric vs.
Skewed Data
Median, mean and mode of symmetric, positively
skewed, and negatively skewed data
Graphic Displays of Basic Statistical
Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis shows values, y-axis represents
frequencies
Quantile plot: each value xi is paired with fi indicating
that approximately 100 fi % of data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
Histogram Analysis
Histograms Often Tell More than
Boxplots
Quantile Plot
Displays all of the data (allowing the user to
assess both the overall behavior and unusual
occurrences)
Plots quantile information
For data xi sorted in increasing order, fi
indicates that approximately 100 fi% of the data
are below or equal to the value xi
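A small sketch of how the (fi, xi) pairs of a quantile plot could be computed. The slide does not say how fi is obtained; the convention fi = (i - 0.5)/n used below is an assumption (it is one common choice), and both the function name `quantile_plot_points` and the sample values are hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    ASSUMPTION: f_i = (i - 0.5) / n for the i-th smallest value (i = 1..n);
    the slide does not state which convention it uses.
    """
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


# Hypothetical sample, chosen only for illustration.
sample = [70, 30, 52, 110, 56]
points = quantile_plot_points(sample)
for f, x in points:
    print(f"{f:.2f}  {x}")
```

Plotting xi against fi then shows every observation, so unusual values at either end stay visible instead of being summarized away.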
Scatter plot
Provides a first look at bivariate data to see
clusters of points, outliers, etc.
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Positively and Negatively Correlated
Data
Uncorrelated Data
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects
are
Value is higher when objects are more alike
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects
are
Lower when objects are more alike
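Data matrix
n data points (objects) with p dimensions (features)
Two modes
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]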
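The step from a data matrix to a triangular dissimilarity matrix can be sketched as follows. The function name `dissimilarity_matrix` and the three sample objects are hypothetical, and Euclidean distance (`math.dist`, Python 3.8+) is used only as one possible choice of d.

```python
import math


def dissimilarity_matrix(data, dist):
    """Lower-triangular dissimilarity matrix for the rows of a data matrix.

    Row i holds d(i, 0), ..., d(i, i-1) followed by 0 for d(i, i),
    so only n(n-1)/2 distances are actually computed.
    """
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0]
            for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects, p = 2 features.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
tri = dissimilarity_matrix(data, math.dist)  # Euclidean distance as one choice of d
for row in tri:
    print(row)
```

Storing only the lower triangle is enough because d(i, j) = d(j, i) and d(i, i) = 0.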
Proximity Measures for Binary
Attributes
A contingency table for binary data
Dissimilarity between Binary
Variables
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
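The table can be turned into pairwise dissimilarities with a short sketch. Several things here are assumptions, because the formula itself is not in the extracted slide: the asymmetric binary dissimilarity d = (r + s) / (q + r + s) is used, Gender is left out as a symmetric attribute, and Y and P are coded as 1 while N is coded as 0. The function name is hypothetical.

```python
def asymmetric_binary_dissim(a, b):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s).

    q: attributes that are 1 in both objects
    r: 1 in a, 0 in b;  s: 0 in a, 1 in b
    Matches on 0 are ignored (ASSUMED to be the unimportant outcome).
    Raises ZeroDivisionError if both objects are all zeros.
    """
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1 .. Test-4 from the table; ASSUMED coding: Y, P -> 1, N -> 0.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

d_jack_mary = asymmetric_binary_dissim(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_dissim(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_dissim(patients["Jim"], patients["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under these assumptions Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.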
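Minkowski distance:
\[ d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h} \]
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)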
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive
definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are
different between two binary vectors
\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
h = 2: (L2 norm) Euclidean distance
\[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
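The special cases can be tried out with one small function. This is a minimal sketch: the function name `minkowski` and the two sample points are hypothetical, and passing `float("inf")` as the order is a convention chosen here to obtain the supremum (L∞) distance as the largest coordinate difference.

```python
def minkowski(x, y, h):
    """Minkowski (L-h) distance between two equal-length numeric vectors.

    h = 1 gives Manhattan, h = 2 gives Euclidean; h = float("inf") is
    treated here as the supremum distance (largest coordinate difference).
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


# Hypothetical 2-D objects, chosen only for illustration.
i, j = (1, 2), (3, 5)
manhattan = minkowski(i, j, 1)             # |1-3| + |2-5|
euclidean = minkowski(i, j, 2)             # sqrt(2^2 + 3^2)
supremum = minkowski(i, j, float("inf"))   # max(2, 3)
print(manhattan, round(euclidean, 3), supremum)
```

For the same pair of objects the distance does not increase as h grows, which the three values above illustrate.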
Example: Minkowski Distance
Dissimilarity Matrices
Manhattan (L1)
Euclidean (L2)
Supremum (L∞)
Standardizing Numeric Data
Z-score:
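\[ z = \frac{x - \mu}{\sigma} \]
where µ is the mean and σ the standard deviation
Alternative standardized measure (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
where m_f is the mean and s_f the mean absolute deviation of attribute f
Using the mean absolute deviation is more robust than using the standard deviation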
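Both standardizations can be compared on a toy attribute. This is a sketch under assumptions: the population standard deviation (`statistics.pstdev`) is used for σ, s_f is taken to be the mean absolute deviation around the mean, and the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Classic z-score (x - mu) / sigma, with the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


def mad_scores(values):
    """Variant that divides by the mean absolute deviation instead of sigma."""
    m = mean(values)
    s = mean(abs(x - m) for x in values)
    return [(x - m) / s for x in values]


# Hypothetical attribute values: mean 5, standard deviation 2,
# mean absolute deviation 1.5.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(values)
z_mad = mad_scores(values)
print(z)
print(z_mad)
```

Because deviations are not squared in the mean absolute deviation, a single extreme value inflates the denominator less than it inflates σ, so outliers remain more visible after standardization.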
Ordinal Variables
Attributes of mixed type, e.g., numeric, ordinal
One may use a weighted formula to combine their
effects
\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
Choice of \delta_{ij}^{(f)}
Set \delta_{ij}^{(f)} = 0 if x_{if} or x_{jf} is missing
When f is ordinal
Compute ranks r_{if} and
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
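A compact sketch of the rank normalization and of the weighted combination. It makes several assumptions: each per-attribute dissimilarity already lies in [0, 1], a missing measurement is represented by `None` (the δ = 0 case), every other attribute gets δ = 1, and an ordinal attribute contributes |z_if - z_jf|. The function names and the example attributes are hypothetical.

```python
def ordinal_to_unit(rank, num_states):
    """Map a rank r_if in 1..M_f onto [0, 1]: z_if = (r_if - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)


def mixed_dissimilarity(per_attribute):
    """Combine per-attribute dissimilarities d_ij^(f), each ASSUMED in [0, 1].

    None marks an attribute that is missing for one of the objects, i.e.
    delta_ij^(f) = 0; every other attribute is given delta_ij^(f) = 1.
    """
    present = [d for d in per_attribute if d is not None]
    if not present:
        raise ValueError("no attribute is available for both objects")
    return sum(present) / len(present)


# Ordinal attribute Size = {small, medium, large}, so M_f = 3.
rank = {"small": 1, "medium": 2, "large": 3}
z_i = ordinal_to_unit(rank["small"], 3)
z_j = ordinal_to_unit(rank["large"], 3)
d_size = abs(z_i - z_j)

# Second attribute: a hypothetical numeric dissimilarity already scaled to [0, 1].
# Third attribute: missing for one of the two objects.
d = mixed_dissimilarity([d_size, 0.25, None])
print(d_size, d)
```

The missing attribute drops out of both the numerator and the denominator, so the result is the average over the two attributes that are actually observed.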
Cosine Similarity
A document can be represented by thousands of attributes,
each recording the frequency of a particular word (such as
keywords) or phrase in the document.
Cosine Similarity
Cosine measure: If x and y are two vectors (e.g., term-frequency
vectors), then
\[ \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \]
where
• indicates vector dot product,
||x||: the L2 norm (length) of vector x
\[ \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} \]
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
x • y = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1
= 25
||x|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5
= 6.481
||y|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5
= 4.12
cos(x, y) = 25 / (6.481 * 4.12) = 0.94
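The worked example can be reproduced in a few lines of Python. This is a minimal sketch using the two term-frequency vectors from the slide; the function name `cosine_similarity` is introduced here for illustration, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||) for two equal-length vectors.

    Assumes neither vector is the zero vector (otherwise division by zero).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Term-frequency vectors from the example above.
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine_similarity(x, y)
print(round(sim, 2))
```

Since term frequencies are non-negative, the result lies between 0 and 1, and it depends only on the direction of the vectors, not on document length.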