
Mod 4 Types of Data in Cluster Analysis

The document provides an overview of data types, including structured and unstructured data, and their attributes. It discusses statistical descriptions of data, methods for measuring similarity and dissimilarity, and various distance measures such as Minkowski distance. Additionally, it covers the importance of data visualization and the characteristics of data objects.

Uploaded by

pobocow192
Copyright
© All Rights Reserved

Getting to Know Your Data

◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and Dissimilarity
◼ Summary

Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents: term-frequency vector
  ◼ Transaction data
◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data

Document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Important Characteristics of Structured Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion

Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
  ◼ sales database: customers, store items, sales
  ◼ medical database: patients, treatments
  ◼ university database: students, professors, courses
◼ Also called samples, examples, instances, data points, objects, tuples.
◼ Data objects are described by attributes.

Attributes
◼ Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
  ◼ E.g., customer_ID, name, address
◼ Types:
  ◼ Nominal
  ◼ Binary
  ◼ Ordinal
  ◼ Numeric: quantitative
    ◼ Interval-scaled
    ◼ Ratio-scaled

Attribute Types
◼ Nominal: categories, states, or “names of things”
  ◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ◼ marital status, occupation, ID numbers, zip codes
◼ Binary
  ◼ Nominal attribute with only 2 states (0 and 1)
  ◼ Symmetric binary: both outcomes equally important
    ◼ e.g., gender
  ◼ Asymmetric binary: outcomes not equally important.
    ◼ e.g., medical test (positive vs. negative)
    ◼ Convention: assign 1 to most important outcome (e.g., HIV positive)
◼ Ordinal
  ◼ Values have a meaningful order (ranking) but magnitude between successive values is not known.
  ◼ Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
  ◼ Measured on a scale of equal-sized units
  ◼ Values have order
  ◼ E.g., temperature in °C or °F, calendar dates
  ◼ No true zero-point
◼ Ratio
  ◼ Inherent zero-point
  ◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).

Discrete vs. Continuous Attributes
◼ Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
  ◼ E.g., zip codes, profession, or the set of words in a collection of documents
  ◼ Sometimes, represented as integer variables
  ◼ Note: Binary attributes are a special case of discrete attributes
◼ Continuous Attribute
  ◼ Has real numbers as attribute values
  ◼ E.g., temperature, height, or weight
  ◼ Practically, real values can only be measured and represented using a finite number of digits

Basic Statistical Descriptions of Data
◼ Motivation
  ◼ To better understand the data: central tendency, variation and spread
◼ Data dispersion characteristics
  ◼ median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◼ Data dispersion: analyzed with multiple granularities of precision
  ◼ Boxplot or quantile analysis on sorted intervals

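Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  ◼ Note: n is sample size and N is population size.
  ◼ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◼ Trimmed mean: chopping extreme values
◼ Median
  ◼ Middle value if odd number of values, or average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below it, $\text{freq}_{\text{median}}$ is its frequency, and width is its width
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: mean − mode = 3 × (mean − median)
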
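The measures on this slide can be tried out with a minimal Python sketch. It uses only the standard library; the helper names (`weighted_mean`, `trimmed_mean`, `modes`) and the `salaries` list are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped per tail
    return mean(ordered[k:len(ordered) - k])


def modes(values):
    """All most-frequent values (one if unimodal, two if bimodal, ...)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical sample used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", mean(salaries))
print("median:", median(salaries))
print("modes:", modes(salaries))            # two modes, so the sample is bimodal
print("10% trimmed mean:", trimmed_mean(salaries, 0.1))
```

The trimmed mean drops the single largest and smallest value here, which pulls the estimate toward the median because the value 110 no longer contributes.
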
Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distribution curves, labeled symmetric, positively skewed, and negatively skewed]

March 7, 2023 Data Mining: Concepts and Techniques

Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
  ◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◼ Inter-quartile range: IQR = Q3 − Q1
  ◼ Five number summary: min, Q1, median, Q3, max
  ◼ Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
  ◼ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◼ Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample), $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population)
  ◼ Standard deviation s (or σ) is the square root of variance s² (or σ²)

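A short Python sketch of the five-number summary and the 1.5 × IQR outlier rule. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the `salaries` sample and function names are illustrative only.

```python
from statistics import median, pvariance, quantiles, variance


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical sample used only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))  # (30, 49.25, 54.0, 64.75, 110)
print(iqr_outliers(salaries))         # [110]
print("sample variance s^2:", variance(salaries))       # divides by n - 1
print("population variance:", pvariance(salaries))      # divides by N
```

With a different quartile convention (for example the default "exclusive" method) Q1 and Q3 shift slightly, so the outlier fences shift as well.
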
Similarity and Dissimilarity
◼ Similarity
  ◼ Numerical measure of how alike two data objects are
  ◼ Value is higher when objects are more alike
  ◼ Often falls in the range [0,1]
◼ Dissimilarity (e.g., distance)
  ◼ Numerical measure of how different two data objects are
  ◼ Lower when objects are more alike
  ◼ Minimum dissimilarity is often 0
  ◼ Upper limit varies
◼ Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
◼ Data matrix
  ◼ n data points with p dimensions
  ◼ Two modes

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

◼ Dissimilarity matrix
  ◼ n data points, but registers only the distance
  ◼ A triangular matrix

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Proximity Measure for Nominal Attributes
◼ Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
◼ Method 1: Simple matching
  ◼ m: # of matches, p: total # of variables
  ◼ $d(i, j) = \frac{p - m}{p}$
◼ Method 2: Use a large number of binary attributes
  ◼ creating a new binary attribute for each of the M nominal states

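Method 1 (simple matching) can be sketched in a few lines of Python. The function name and the two example objects are hypothetical; only the formula d(i, j) = (p − m) / p comes from the slide.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p for two equal-length tuples."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                   # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of matches
    return (p - m) / p


# Hypothetical objects described by (color, marital status, occupation).
a = ("red", "single", "engineer")
b = ("red", "married", "engineer")

print(nominal_dissimilarity(a, b))  # one mismatch out of three attributes
print(nominal_dissimilarity(a, a))  # 0.0 for identical objects
```
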
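Proximity Measure for Binary Attributes
◼ A contingency table for binary data, where q counts the attributes that are 1 for both objects, r those that are 1 for object i and 0 for object j, s those that are 0 for i and 1 for j, and t those that are 0 for both (p = q + r + s + t):

                 Object j
                 1        0        sum
Object i    1    q        r        q + r
            0    s        t        s + t
           sum   q + s    r + t    p

◼ Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
◼ Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
◼ Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
◼ Note: Jaccard coefficient is the same as “coherence”: $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$
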
Dissimilarity between Binary Variables
◼ Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0

$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

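The three distances in this example can be reproduced with a small Python sketch. It assumes the asymmetric measure counts only the 1-1, 1-0 and 0-1 cells of the contingency table, as the worked fractions above do; the function names and the choice to return 0.0 for two all-zero objects are assumptions made for this illustration.

```python
def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s); 0-0 matches are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (r + s) / (q + r + s)


def encode(record):
    """Map Y and P to 1 and N to 0, one character per attribute."""
    return [1 if value in ("Y", "P") else 0 for value in record]


# Fever, Cough, Test-1 .. Test-4 (Gender is symmetric and left out).
patients = {
    "Jack": encode("YNPNNN"),
    "Mary": encode("YNPNPN"),
    "Jim": encode("YPNNNN"),
}

for a, b in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```
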
Standardizing Numeric Data
◼ Z-score: $z = \frac{x - \mu}{\sigma}$
  ◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  ◼ the distance between the raw score and the population mean in units of the standard deviation
  ◼ negative when the raw score is below the mean, positive when above
◼ An alternative way: Calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
  ◼ standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
◼ Using mean absolute deviation is more robust than using standard deviation

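As a rough sketch of the alternative standardization, the following Python function computes m_f, s_f and z_if for one attribute. The function name, the error raised for a constant attribute, and the sample column are assumptions for illustration.

```python
def standardize_mad(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean of attribute f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("attribute is constant; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute column with four objects.
attribute1 = [1, 3, 2, 4]
print(standardize_mad(attribute1))  # [-1.5, 0.5, -0.5, 1.5]
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so the z-scores of outliers stay more visible.
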
Example: Data Matrix and Dissimilarity Matrix

Data Matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: A popular distance measure

$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
◼ Properties
  ◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  ◼ d(i, j) = d(j, i) (Symmetry)
  ◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)

Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
  ◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors

$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

◼ h = 2: (L2 norm) Euclidean distance

$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

◼ h → ∞: “supremum” (Lmax norm, L∞ norm) distance
  ◼ This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity Matrices

Manhattan (L1)

L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2)

L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞)

L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0

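The three dissimilarity matrices in this example can be recomputed with a short Python sketch. The function names and the dictionary layout are assumptions for illustration; `math.inf` is used here to stand for h → ∞.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


def dissimilarity_matrix(points, h):
    """Lower-triangular matrix as {(row, col): distance}, rounded to 2 decimals."""
    names = list(points)
    return {
        (a, b): round(minkowski(points[a], points[b], h), 2)
        for i, a in enumerate(names)
        for b in names[: i + 1]
    }


points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for label, h in [("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)]:
    matrix = dissimilarity_matrix(points, h)
    print(label)
    for row in points:
        cells = [str(matrix[(row, col)]) for col in points if (row, col) in matrix]
        print(" ", row, " ".join(cells))
```

Only the lower triangle is stored because d(i, j) = d(j, i), which mirrors the triangular dissimilarity matrix described earlier.
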
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Can be treated like interval-scaled
  ◼ replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
  ◼ map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  ◼ compute the dissimilarity using methods for interval-scaled variables

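A minimal Python sketch of the rank-and-rescale step, assuming the ordered list of states is known in advance. The function name and the `sizes` sample are hypothetical.

```python
def ordinal_to_interval(values, ordered_states):
    """Replace each value by its rank r in 1..M_f, then map to (r - 1) / (M_f - 1)."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal attribute with states small < medium < large.
sizes = ["small", "large", "medium", "small"]
z = ordinal_to_interval(sizes, ["small", "medium", "large"])
print(z)  # [0.0, 1.0, 0.5, 0.0]

# Dissimilarity between the first two objects, treated as interval-scaled.
print(abs(z[0] - z[1]))  # 1.0
```
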
Attributes of Mixed Type
◼ A database may contain all attribute types
  ◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
◼ One may use a weighted formula to combine their effects

$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

  ◼ f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  ◼ f is numeric: use the normalized distance
  ◼ f is ordinal
    ◼ Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    ◼ Treat $z_{if}$ as interval-scaled

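The weighted formula can be sketched in Python as below. Two details are assumptions that the slide does not spell out: the indicator δ is taken to be 0 when a value is missing or when an asymmetric binary attribute is 0 for both objects, and the numeric distance is normalized by the attribute's range (max − min). The schema layout, function name, and sample objects are all hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, schema):
    """d(i, j) = sum(delta * d) / sum(delta) over all attributes f.

    schema holds one (kind, info) pair per attribute:
      ("nominal", None), ("binary", None), ("asymmetric", None),
      ("numeric", (min_f, max_f)), ("ordinal", M_f) with ranks 1..M_f.
    """
    numerator = 0.0
    denominator = 0
    for x, y, (kind, info) in zip(obj_i, obj_j, schema):
        # Assumed indicator: delta = 0 for missing values and for
        # asymmetric binary attributes that are 0 in both objects.
        if x is None or y is None:
            continue
        if kind == "asymmetric" and x == 0 and y == 0:
            continue
        if kind in ("nominal", "binary", "asymmetric"):
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            low, high = info
            d = abs(x - y) / (high - low)   # assumed range normalization
        elif kind == "ordinal":
            d = abs(x - y) / (info - 1)     # |z_if - z_jf| with z = (r - 1) / (M_f - 1)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1
    return numerator / denominator if denominator else 0.0


# Illustrative objects: (nominal code, ordinal rank out of 3, numeric score).
schema = [("nominal", None), ("ordinal", 3), ("numeric", (22, 64))]
obj1 = ("code A", 3, 45)
obj2 = ("code B", 1, 22)
obj4 = ("code A", 3, 28)

print(round(mixed_dissimilarity(obj2, obj1, schema), 2))  # 0.85
print(round(mixed_dissimilarity(obj4, obj1, schema), 2))  # 0.13
```

Every attribute contributes a value in [0, 1], so the combined dissimilarity also stays in [0, 1] regardless of the mix of types.
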
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
◼ Other vector objects: gene features in micro-arrays, …
◼ Applications: information retrieval, biological taxonomy, gene feature mapping, ...
◼ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates vector dot product, ||d||: the length of vector d

Example: Cosine Similarity
◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates vector dot product, ||d||: the length of vector d
◼ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94

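The arithmetic in this example can be checked with a few lines of Python. The function name is an assumption; the vectors are the two term-frequency vectors given above.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

The function divides by both vector lengths, so an all-zero vector would raise `ZeroDivisionError`; real term-frequency data may need a guard for empty documents.
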
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image.
◼ Gain insight into the data by:
  ◼ Basic statistical data description: central tendency, dispersion, graphical displays
  ◼ Data visualization: map data onto graphical primitives
  ◼ Measure data similarity
◼ Above steps are the beginning of data preprocessing.
◼ Many methods have been developed, but this is still an active area of research.

