Lect 3

Uploaded by

kisan patro

Data Mining: Concepts and Techniques

— Chapter 2 —

Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Example of document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Important Characteristics of Structured Data
 Dimensionality
   Curse of dimensionality
 Sparsity
   Only presence counts
 Resolution
   Patterns depend on the scale
 Distribution
   Centrality and dispersion

Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
   Sales database: customers, store items, sales
   Medical database: patients, treatments
   University database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.

We want to know the following
 What are the types of attributes or fields in the data?
 What kinds of values does each attribute have?
 Which attributes are discrete and which are continuous-valued?
 What do the data look like?
 How are the values distributed?
 Are there ways we can visualize the data to get a better sense of it all?
 Can we spot any outliers?
 Can we measure the similarity of some data objects with respect to others?

Data/Attributes
 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric (quantitative)
     Interval-scaled
     Ratio-scaled

Binary Attribute Types
 Binary variable: an attribute with only two states (0 and 1)
 A binary variable can be symmetric or asymmetric
 Symmetric binary: both outcomes are equally important
   E.g., gender
 Asymmetric binary: outcomes are not equally important
   E.g., outcome of a medical test (positive vs. negative)
   Convention: assign 1 to the most important outcome (e.g., HIV positive)

Nominal Attribute Types
 Nominal variables: categories, states, or "names of things", e.g., marital status, occupation, ID numbers
 A generalization of the binary variable in that it can take on more than two states
 For example, a color can be white, green, blue, or red.
 How is dissimilarity computed?
   Matching approach: $d(i,j) = \frac{p - m}{p}$
   m is the number of attributes on which i and j match (have the same state)
   p is the total number of attributes describing i and j

Ordinal Attribute Types
 Ordinal
   Values have a meaningful order (ranking), but the magnitude between successive values is not known.
   Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
 Interval-Scaled Variables
   Continuous measurements on a roughly linear scale
   Weight, height, latitude, temperature
   How to compute their differences?


Numeric Attribute Types
 Ratio-Scaled Variables
   A positive measurement on a nonlinear scale, such as an exponential scale
   Growth of a bacteria population
   Decay of a radioactive element
 How to compute dissimilarity?
   Just like interval-scaled variables
   But a transformation is needed first:
     Apply a logarithmic transformation to the ratio-scaled variable, then treat the result as interval-scaled
     Sometimes we may need to use log-log, log-log-log, and so on...
Very exciting!
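
The logarithmic transformation can be sketched in a few lines of Python. This is only an illustration: the function name `log_transform` and the bacteria counts are hypothetical and do not come from the slides.

```python
import math

def log_transform(values):
    """Map positive ratio-scaled values onto a roughly linear scale."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(v) for v in values]

# Hypothetical bacteria counts that double at every measurement.
counts = [100, 200, 400, 800]
logged = log_transform(counts)

# After the transform, successive differences are all equal to log(2),
# so distance measures for interval-scaled variables can be applied.
steps = [b - a for a, b in zip(logged, logged[1:])]
print(steps)
```

The raw counts differ by 100, 200, and 400, while the transformed values differ by the same amount at every step.
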
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
   Measured on a scale of equal-sized units
   Values have order
     E.g., temperature in °C or °F, calendar dates
   No true zero-point
 Ratio
   Inherent zero-point
   We can speak of values as being a multiple of the unit of measurement (10 K is twice as high as 5 K).
     E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
 Discrete Attribute
   Has only a finite or countably infinite set of values
     E.g., zip codes, profession, or the set of words in a collection of documents
   Sometimes represented as integer variables
   Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
   Has real numbers as attribute values
     E.g., temperature, height, or weight
   Practically, real values can only be measured and represented using a finite number of digits
   Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
 Motivation
   To better understand the data: central tendency, variation and spread
 Data dispersion characteristics
   Median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
   Data dispersion: analyzed with multiple granularities of precision
   Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
   Folding measures into numerical dimensions
   Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is sample size and N is population size.
   Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
   Trimmed mean: chopping extreme values
 Median:
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):
    $\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$

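
The measures of central tendency above can be tried out with Python's standard `statistics` module. This is a minimal sketch: the helper names `weighted_mean` and `trimmed_mean`, the trimming fraction, and the salary values are assumptions made for illustration.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(w * x for w, x in zip(weights, values)) / total_weight

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

# Hypothetical salaries in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))               # arithmetic mean
print(median(salaries))             # average of the two middle values
print(multimode(salaries))          # every most-frequent value
print(trimmed_mean(salaries, 0.1))  # drops the lowest and highest value
```

For this sample the mean (58) is pulled above the median (54) by the single large value 110, and the data are bimodal.
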
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

Data Mining: Concepts and Techniques, December 5, 2024

Boxplot Analysis
 Five-number summary of a distribution
   Minimum, Q1, Median, Q3, Maximum
 Boxplot
   Data is represented with a box
   The ends of the box are at the first and third quartiles, i.e., the height of the box is the interquartile range (IQR)
   The median is marked by a line within the box
   Whiskers: two lines outside the box extended to Minimum and Maximum
   Outliers: points beyond a specified outlier threshold, plotted individually

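
A small sketch of the five-number summary and an outlier check follows. The slide only says "a specified outlier threshold"; the 1.5 × IQR rule used here is a common convention and is an assumption, as are the function names and the sample data. Quartile conventions also differ between tools, so Q1 and Q3 may vary slightly elsewhere.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population;
    # other quartile conventions give slightly different Q1 and Q3.
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3 (k = 1.5 is assumed)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical sample
print(five_number_summary(data))
print(iqr_outliers(data))
```

Here the IQR is 4, so anything above 7 + 1.5 × 4 = 13 is flagged, which singles out the value 100.
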
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of the five-number summary
 Histogram: the x-axis shows values, the y-axis represents frequencies
 Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane

Histogram Analysis
 Histogram: graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; y-axis ticks from 0 to 40, x-axis ticks from 10000 to 90000]

Quantile Plot
 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
 Plots quantile information
   For data xi sorted in increasing order, fi indicates that approximately 100 fi % of the data are below or equal to the value xi

Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
 Allows the user to view whether there is a shift in
going from one distribution to another

Scatter plot

 Provides a first look at bivariate data to see clusters of points, outliers, etc.
 Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Histograms Often Tell More than Boxplots
 The two histograms shown on the left may have the same boxplot representation
   The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions

Positively and Negatively Correlated Data
 The left half fragment is positively correlated
 The right half is negatively correlated

Uncorrelated Data

Data Visualization
 Why data visualization?
   Gain insight into an information space by mapping data onto graphical primitives
   Provide a qualitative overview of large data sets
   Search for patterns, trends, structure, irregularities, relationships among data
   Help find interesting regions and suitable parameters for further quantitative analysis
   Provide a visual proof of computer representations derived
 Categorization of visualization methods:
   Pixel-oriented visualization techniques
   Geometric projection visualization techniques
   Icon-based visualization techniques
   Hierarchical visualization techniques
   Visualizing complex data and relations

Pixel-Oriented Visualization Techniques
 For a data set of m dimensions, create m windows on the screen, one for each dimension
 The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
 The colors of the pixels reflect the corresponding values
[Figure panels: (a) Income, (b) Credit limit, (c) Transaction volume, (d) Age]

Measure of Similarity and Dissimilarity
 In clustering, outlier analysis, and nearest-neighbor classification, we need to assess how alike and unalike objects are in comparison to one another.
 Example: A store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing.
 A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters.
 Knowledge of similarities can be used in a nearest-neighbor classification scheme.
 Knowledge of dissimilarities can be used in outlier analysis.

Similarity and Dissimilarity
 Similarity
   Numerical measure of how alike two data objects i and j are
   Value is higher when objects are more alike
     0 if the objects are unalike
     1 if the objects are alike (complete similarity)
   Often falls in the range [0,1]
 Dissimilarity (e.g., distance; the opposite of similarity)
   Numerical measure of how different two data objects i and j are
   Value is lower when objects are more alike
     0 if the objects are the same
     >0 if the objects are dissimilar
   Minimum dissimilarity is often 0
   Upper limit varies. The higher the dissimilarity value, the more dissimilar the two objects are.
 Proximity refers to either similarity or dissimilarity

Data Matrix
 Data matrix (object-by-attribute structure)
   n data objects with p attributes. Each row corresponds to an object.
   Two-mode matrix (rows for objects and columns for attributes)

$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

Dissimilarity Matrix
 Dissimilarity matrix (object-by-object structure)
   n data points, but registers only the distance d(i,j), i.e., the difference between objects i and j
   d(i,j) is close to 0 if objects i and j are highly similar or near each other
   The larger d(i,j), the more they differ
   d(i,j) = d(j,i), so a triangular matrix suffices
   One-mode matrix (contains one kind of entity)

$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

Similarity and Dissimilarity Matrix
 Measures of similarity can often be expressed as a function of measures of dissimilarity. For nominal data: sim(i,j) = 1 - d(i,j)
 Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix.
 Data in the form of a data matrix can be transformed into a dissimilarity matrix before applying such algorithms.

Proximity Measure for Nominal Attributes
 Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
 Method 1: Simple matching
   m: # of matches, p: total # of attributes
   $d(i,j) = \frac{p - m}{p}$
 Method 2: Use a large number of binary attributes
   Create a new binary attribute for each of the M nominal states

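
Both methods for nominal attributes are easy to sketch in Python. The function names and the two example objects below are hypothetical; the matching formula is d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of attributes.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Method 1: d(i, j) = (p - m) / p for two tuples of nominal values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by (color, marital status, occupation).
a = ("red", "single", "engineer")
b = ("red", "married", "engineer")

print(simple_matching_dissimilarity(a, b))               # one mismatch out of three
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```

After the one-hot step in Method 2, the resulting binary attributes can be compared with the binary proximity measures.
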
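Proximity Measure for Binary Attributes
 A contingency table for binary data

                 Object j
                 1      0      sum
  Object i  1    q      r      q+r
            0    s      t      s+t
           sum   q+s    r+t    p

 Distance measure for symmetric binary variables: $d(i,j) = \frac{r + s}{q + r + s + t}$
 Distance measure for asymmetric binary variables: $d(i,j) = \frac{r + s}{q + r + s}$
 Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i,j) = \frac{q}{q + r + s}$
 Note: Jaccard coefficient is the same as "coherence": $coherence(i,j) = \frac{sup(i,j)}{sup(i) + sup(j) - sup(i,j)} = \frac{q}{(q + r) + (q + s) - q}$
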
Dissimilarity between Binary Variables
 Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      Y      N       N       N       N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       1      0      1       0       0       0
Mary  F       1      0      1       0       1       0
Jim   M       1      1      0       0       0       0

 Using the asymmetric binary attributes only:
$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$ (likely to have a similar disease)
$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$ (unlikely to have a similar disease)

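
The three distances in this example can be reproduced with a short Python sketch. It assumes the asymmetric binary measure d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 in both objects, r those that are 1 only in i, and s those that are 1 only in j. The function name and the handling of two all-zero objects are assumptions.

```python
def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """d(i, j) = (r + s) / (q + r + s); 0/0 matches are ignored."""
    q = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 1)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects count as identical
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1 .. Test-4 from the coded table (Gender is left out
# because it is symmetric).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```
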
Example: Data Matrix and Dissimilarity Matrix
Data Matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Figure: scatter plot of x1, x2, x3, x4 with both axes marked at 0, 2, 4]

Dissimilarity Matrix (with Euclidean Distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

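
Turning a data matrix into a dissimilarity matrix can be sketched with `math.dist` from the Python standard library (Python 3.8 or later). The function name `dissimilarity_matrix` is hypothetical; the points are x1 to x4 from the example. Under this computation d(x3, x1) = sqrt(1 + 4), which rounds to 2.24.

```python
import math

def dissimilarity_matrix(points):
    """Lower-triangular matrix of pairwise Euclidean distances."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]

points = [(1, 2), (3, 5), (2, 0), (4, 5)]  # x1..x4 from the example
matrix = dissimilarity_matrix(points)
rounded = [[round(d, 2) for d in row] for row in matrix]

for name, row in zip(["x1", "x2", "x3", "x4"], rounded):
    print(name, row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.
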
Distance on Numeric Data
 Commonly used distance measures for computing the dissimilarity of objects described by numeric attributes:
   Euclidean distance
   Manhattan distance
   Minkowski distance

Euclidean distance
 Let i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) be two p-dimensional data objects (i.e., described by p numeric attributes)
 The Euclidean distance between two objects i and j is defined as
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Minkowski Distance
 Minkowski distance: A popular distance measure
$d(i,j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
 Euclidean and Manhattan distance satisfy the following properties:
   d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness; the distance from an object to itself is 0)
   d(i, j) = d(j, i) (symmetry)
   d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
 A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
   E.g., the Hamming distance: the number of bits that are different between two binary vectors
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
 h = 2: Euclidean (L2 norm) distance
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
 h → ∞: "supremum" (Lmax norm, L∞ norm) distance
   This is the maximum difference between any component (attribute) of the vectors
$d(i,j) = \max_{f} |x_{if} - x_{jf}|$

Example: Minkowski Distance
Dissimilarity Matrices

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

[Figure: scatter plot of x1, x2, x3, x4 with both axes marked at 0, 2, 4]

Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0

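
The three matrices can be checked with one general Minkowski function. This is a sketch under the usual definition (the h-th root of the summed h-th powers of the absolute differences, with the maximum difference as the limit for h → ∞); the function name and the dictionary of points are illustrative.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for h in (1, 2, math.inf):
    print(h, round(minkowski(points["x1"], points["x2"], h), 2))
```

For x1 and x2 this gives 5 (Manhattan), 3.61 (Euclidean), and 3 (supremum), matching the first entry of each matrix.
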
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
   Replace xif by its rank $r_{if} \in \{1, \ldots, M_f\}$
   Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
   Compute the dissimilarity using methods for interval-scaled variables

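
The rank normalization can be sketched as follows. The function name is hypothetical, and the ordered list of states is taken from the earlier Size = {small, medium, large} example; ranks start at 1 as in the formula.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """z_if = (r_if - 1) / (M_f - 1), with ranks r_if starting at 1."""
    m_f = len(ordered_states)
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    r_if = ordered_states.index(value) + 1
    return (r_if - 1) / (m_f - 1)

sizes = ["small", "medium", "large"]
z = [ordinal_to_unit_interval(s, sizes) for s in sizes]
print(z)
```

The resulting values lie in [0, 1], so the dissimilarity between "small" and "large" can then be computed as an ordinary numeric difference.
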
Attributes of Mixed Type
 A database may contain all attribute types
   Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects
$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
   f is binary or nominal: dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
   f is numeric: use the normalized distance
   f is ordinal:
     Compute ranks rif and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
     Treat zif as interval-scaled

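
A rough sketch of the combined formula follows. Several details are assumptions not spelled out on the slide: the indicator is taken to be 0 when a value is missing (marked by `None`) and 1 otherwise, the normalized numeric distance is taken to be |xif - xjf| divided by the attribute's range (max - min), and ordinal attributes are expected to be converted to zif beforehand and passed as numeric with range 1. The special handling of asymmetric binary 0/0 matches is left out. All names are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Average per-attribute dissimilarity over attributes present in both objects.

    kinds[f] is "nominal" or "numeric"; ranges[f] is max - min of numeric
    attribute f (ignored for nominal ones). None marks a missing value.
    """
    total, count = 0.0, 0
    for x_i, x_j, kind, value_range in zip(obj_i, obj_j, kinds, ranges):
        if x_i is None or x_j is None:
            continue  # indicator is 0: attribute f is skipped
        if kind == "nominal":
            d_f = 0.0 if x_i == x_j else 1.0
        else:
            d_f = abs(x_i - x_j) / value_range
        total += d_f
        count += 1
    if count == 0:
        raise ValueError("no attribute is present in both objects")
    return total / count

# Hypothetical objects: (color, age, size already mapped to z in [0, 1]).
kinds = ("nominal", "numeric", "numeric")
ranges = (None, 40, 1.0)
a = ("red", 30, 0.0)
b = ("blue", 50, 0.5)
c = ("red", None, 0.0)

print(mixed_dissimilarity(a, b, kinds, ranges))
print(mixed_dissimilarity(a, c, kinds, ranges))
```

For a and b the three contributions are 1, 20/40, and 0.5, so the combined dissimilarity is 2/3. For a and c the missing age is skipped and the remaining two attributes agree.
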
Cosine Similarity
 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
 Other vector objects: gene features in micro-arrays, …
 Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
where · indicates the vector dot product and ||d|| is the length of vector d

Example: Cosine Similarity
 $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where · indicates the vector dot product and ||d|| is the length of vector d
 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94

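
The worked example can be verified with a few lines of Python. The function name `cosine_similarity` is illustrative, and raising an error for a zero vector is an assumption about how to handle that undefined case.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```
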
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
   Basic statistical data description: central tendency, dispersion, graphical displays
   Data visualization: map data onto graphical primitives
   Measure data similarity
 The above steps are the beginning of data preprocessing.
 Many methods have been developed, but this is still an active area of research.
