X Chapter 02 Data


Getting to Know Your Data


 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

MODULE 5

Types of data in cluster analysis

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3      0     5     0      2     6    0     2        0       2
  Document 2     0      7     0     2      1     0    0     3        0       0
  Document 3     0      1     0     0      1     2    2     0        3       0

   Transaction data

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
   sales database: customers, store items, sales
   medical database: patients, treatments
   university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.

Attributes
 Attribute (also called a dimension in data warehousing, a feature in machine learning, or a variable in statistics): a data field representing a characteristic or feature of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric (quantitative)
     Interval-scaled
     Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
   Hair_color = {auburn, black, blond, brown, grey, red, white}
   Marital status, occupation, ID numbers, zip codes
   Values have no meaningful order and are not quantitative
   Mathematical operations on values of nominal attributes are not meaningful
 Binary
   Nominal attribute with only 2 states (0 and 1)
   Symmetric binary: both outcomes equally important
     e.g., gender
   Asymmetric binary: outcomes not equally important
     e.g., medical test (positive vs. negative)
     Convention: assign 1 to the most important outcome (e.g., HIV positive)

Attribute Types
 Ordinal
   Values have a meaningful order (ranking), but the magnitude between successive values is not known.
   Size = {small, medium, large}, grades, army rankings
   Central tendency can be represented by the mode and the median
   The mean cannot be defined

Numeric Attribute Types
 Numeric
   Quantitative: a measurable quantity represented in integer or real values
   Can be interval-scaled or ratio-scaled
 Interval
   Continuous measurements on a roughly linear scale
   Measured on a scale of equal-sized units
   Allows us to compare and quantify the difference between values
   Values have order
   E.g., temperature in °C or °F, calendar dates
   No true zero-point
     Neither 0 °C nor 0 °F indicates "no temperature"
   Ratios are not valid
 Ratio
   Inherent zero-point
   Values are ordered, and one value can be expressed as a multiple (ratio) of another
   Can compute the difference, mean, mode, and median
   We can speak of a value as being a multiple of another (10 K is twice as high as 5 K, because the Kelvin scale has a true zero: 0 K = -273.15 °C)
   E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
 Discrete Attribute
   Has only a finite or countably infinite set of values
   E.g., zip codes, profession, or the set of words in a collection of documents
   Sometimes represented as integer variables
   Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
   Has real numbers as attribute values
   E.g., temperature, height, or weight
   Practically, real values can only be measured and represented using a finite number of digits
   Continuous attributes are typically represented as floating-point variables

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Similarity and Dissimilarity
 Similarity
   Numerical measure of how alike two data objects are
   Value is higher when objects are more alike
   Often falls in the range [0, 1]
 Dissimilarity (e.g., distance)
   Numerical measure of how different two data objects are
   Lower when objects are more alike
   Minimum dissimilarity is often 0
 Proximity refers to either a similarity or a dissimilarity

Data Matrix and Dissimilarity Matrix
 Data structures
1. Data matrix
   Stores the n data objects with p attributes in the form of a relational table
   It is an n-by-p matrix
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
2. Dissimilarity matrix
   Stores the collection of proximities that are available for all pairs of the n objects
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
   d(i, j): the difference (dissimilarity) between objects i and j
   d(i, j) = 0 when objects i and j are the same

Proximity Measure for Nominal Attributes
 Let the number of states of a nominal attribute be M.
 Simple matching
   m: # of matches (number of attributes for which i and j are in the same state)
   p: total # of attributes
\[ d(i, j) = \frac{p - m}{p} \]

Dissimilarity: Nominal Data

  Obj. ID  test-1  test-2
  101      A       X
  102      B       X
  103      A       Y
  104      C       Z

Calculate the dissimilarity matrix (p = 2 attributes, d(i, j) = (p - m) / p):
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0 \\
1/2 & 0 \\
1/2 & 1 & 0 \\
1 & 1 & 1 & 0
\end{bmatrix}
\]

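The simple matching calculation for this example can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `simple_matching_dissimilarity` are assumptions made here, not part of the slides.

```python
from itertools import combinations

# Objects from the example: ID -> (test-1, test-2)
objects = {
    101: ("A", "X"),
    102: ("B", "X"),
    103: ("A", "Y"),
    104: ("C", "Z"),
}

def simple_matching_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
    return (p - m) / p

# Lower triangle of the dissimilarity matrix, keyed by the pair of object IDs
pairwise = {
    (i, j): simple_matching_dissimilarity(objects[i], objects[j])
    for i, j in combinations(objects, 2)
}

for (i, j), d in pairwise.items():
    print(f"d({j}, {i}) = {d}")
```

Objects 101 and 102 agree on test-2 only, so one of two attributes matches and the dissimilarity is 1/2; object 104 shares no value with any other object, so its row is all 1.
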
Proximity Measure for Binary Attributes
 A contingency table for binary data

                      Object j
                  1      0      sum
  Object i  1     q      r      q + r
            0     s      t      s + t
            sum   q + s  r + t  p

  where q: # of attributes equal to 1 for both objects
        t: # of attributes equal to 0 for both objects
        s: # of attributes equal to 0 for object i and 1 for object j
        r: # of attributes equal to 1 for object i and 0 for object j

 Symmetric binary dissimilarity:
\[ d(i, j) = \frac{r + s}{q + r + s + t} \]
 Asymmetric binary dissimilarity (matches on 0 are ignored):
\[ d(i, j) = \frac{r + s}{q + r + s} \]
 Asymmetric binary similarity between objects i and j:
\[ sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j) \]
 Jaccard coefficient (similarity measure for asymmetric binary variables):
\[ sim_{Jaccard}(i, j) = \frac{q}{q + r + s} \]

Dissimilarity between Binary Variables
 Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

   Gender is a symmetric attribute
   The remaining attributes are asymmetric binary
   Let the values Y and P be 1, and the value N be 0
   Using only the asymmetric attributes, d(i, j) = (r + s) / (q + r + s):
\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]

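The three values above can be checked with a short Python sketch. It assumes the asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s) and encodes only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4), leaving Gender out; the helper names are illustrative.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors a (object i) and b (object j)."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t

def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s); 0/0 matches (t) are ignored.

    Raises ZeroDivisionError if the two vectors share no 1s and no mismatches.
    """
    q, r, s, _ = binary_counts(a, b)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```

The corresponding Jaccard similarity is one minus each of these values, since both share the denominator q + r + s.
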
Dissimilarity on Numeric Data: Minkowski Distance
 Minkowski distance: a popular distance measure
\[ d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h} \]
  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
 Properties
   d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
   d(i, j) = d(j, i) (symmetry)
   d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
 A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
   E.g., the Hamming distance: the number of bits that are different between two binary vectors
\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
 h = 2: Euclidean (L2 norm) distance
\[ d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 } \]
 h → ∞: “supremum” (L_max norm, L_∞ norm) distance
   This is the maximum difference between any component (attribute) of the vectors
\[ d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}| \]

Example: Minkowski Distance

  point  attribute 1  attribute 2
  x1     1            2
  x2     3            5
  x3     2            0
  x4     4            5

[Figure: scatter plot of the four points x1 to x4]

Dissimilarity Matrices

Manhattan (L1)
  L1   x1     x2     x3     x4
  x1   0
  x2   5      0
  x3   3      6      0
  x4   6      1      7      0

Euclidean (L2)
  L2   x1     x2     x3     x4
  x1   0
  x2   3.61   0
  x3   2.24   5.1    0
  x4   4.24   1      5.39   0

Supremum (L∞)
  L∞   x1     x2     x3     x4
  x1   0
  x2   3      0
  x3   2      5      0
  x4   3      1      5      0

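The three matrices can be reproduced with one small function. The sketch below is a plain-Python illustration; the function name `minkowski` and the use of `math.inf` to select the supremum case are choices made here rather than anything prescribed by the slides.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for label, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", math.inf)]:
    print(label)
    # Print the lower triangle, one row per point
    for row, p in enumerate(names):
        values = [round(minkowski(points[p], points[q], h), 2) for q in names[: row + 1]]
        print(" ", p, values)
```

For example, x1 and x2 differ by 2 and 3 in the two attributes, giving 5 for Manhattan, the square root of 13 (about 3.61) for Euclidean, and 3 for supremum.
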
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
   Replace x_{if} by its rank
\[ r_{if} \in \{1, \ldots, M_f\} \]
   Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
   Compute the dissimilarity using methods for interval-scaled variables (for example, after standardizing as (x_{if} - mean) / sd)

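A minimal sketch of the rank mapping z = (r - 1) / (M - 1), assuming the ordered list of states is known in advance. The attribute values and the function name are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to [0, 1] via z = (r - 1) / (M - 1).

    ordered_states lists the M states from lowest to highest; M must be at least 2.
    """
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)
    return [(rank[v] - 1) / (m_f - 1) for v in values]

# Hypothetical "size" attribute for four objects
sizes = ["small", "large", "medium", "small"]
z = ordinal_to_unit_interval(sizes, ["small", "medium", "large"])
print(z)

# Once mapped, the values can be compared like interval-scaled ones,
# e.g. with an absolute difference between the first two objects
print(abs(z[0] - z[1]))
```

Because every variable ends up on the same [0, 1] scale, an attribute with many states does not dominate one with few.
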
Attributes of Mixed Type
 A database may contain all attribute types
   Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects
\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
   \delta_{ij}^{(f)} = 0 if either (1) x_{if} or x_{jf} is missing, or (2) x_{if} = x_{jf} = 0 and attribute f is asymmetric binary; otherwise \delta_{ij}^{(f)} = 1
   f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
   f is numeric: use the normalized distance
   f is ordinal:
     Compute ranks r_{if} and z_{if} = (r_{if} - 1) / (M_f - 1)
     Treat z_{if} as interval-scaled

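The weighted combination can be sketched as follows. Several details are assumptions made for illustration: missing values are encoded as `None`, a 0/0 match on an asymmetric binary attribute is treated as uninformative (indicator 0), numeric attributes are normalized by a caller-supplied range (max minus min of that attribute), and ordinal attributes are expected to be converted to z-values beforehand and passed as numeric with range 1.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Weighted mixed-type dissimilarity between objects a and b.

    kinds[f] is one of "nominal", "symmetric_binary", "asymmetric_binary", "numeric".
    ranges[f] is max - min of attribute f when it is numeric (ignored otherwise).
    Returns None when no attribute contributes (all indicators are 0).
    """
    numerator = 0.0
    denominator = 0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # indicator is 0: a value is missing
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue  # indicator is 0: a 0/0 match carries no information
        if kind == "numeric":
            d = abs(x - y) / value_range  # assumed range normalization
        else:
            d = 0.0 if x == y else 1.0
        numerator += d
        denominator += 1
    return numerator / denominator if denominator else None

# Hypothetical objects: (nominal, symmetric binary, asymmetric binary, numeric)
kinds = ("nominal", "symmetric_binary", "asymmetric_binary", "numeric")
ranges = (None, None, None, 40.0)
obj_i = ("A", 1, 0, 30.0)
obj_j = ("B", 1, 0, 50.0)

print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```

In this example the asymmetric binary attribute drops out, so the result averages the remaining three contributions: 1 (nominal mismatch), 0 (binary match), and 20/40 (numeric).
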
Cosine Similarity
 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
 Other vector objects: gene features in micro-arrays, …
 Applications: information retrieval, biological taxonomy, gene feature mapping, ...
 Cosine measure: If d_1 and d_2 are two vectors (e.g., term-frequency vectors), then
\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} \]
  where \cdot indicates the vector dot product and \|d\| is the length of vector d

Example: Cosine Similarity
 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94

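The same calculation as a short Python sketch, using only the standard library. The function name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```

The function divides by the product of the norms, so it fails with a division by zero if either vector is all zeros; a real pipeline would need to decide how to treat empty documents.
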
Basic Statistical Descriptions of Data
 Motivation
   To better understand the data: central tendency, variation and spread
 Central tendency
   Location of the middle or center of a data distribution
 Dispersion
   How the data are spread out
   Range, quartiles, interquartile range, five-number summary

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
  Note: n is sample size and N is population size.
   Weighted arithmetic mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
   Trimmed mean: chopping extreme values
 Median:
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):
\[ median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width \]
   L_1: lower boundary of the median interval
   n: the number of values in the entire data set
   (\sum freq)_l: sum of the frequencies of all intervals lower than the median interval
   freq_{median}: frequency of the median interval
   width: width of the median interval

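These measures can be tried out with Python's standard `statistics` module plus three small helpers. The data values, the trimming fraction, and the grouped-data figures below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping the lowest and highest `fraction` of the sorted values."""
    data = sorted(values)
    k = int(len(data) * fraction)
    return statistics.mean(data[k:len(data) - k])

def grouped_median(l1, n, freq_below, freq_median, width):
    """median = L1 + ((n/2 - freq_below) / freq_median) * width."""
    return l1 + (n / 2 - freq_below) / freq_median * width

# Hypothetical sample with one extreme value
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(values))       # arithmetic mean
print(statistics.median(values))     # average of the two middle values (n is even)
print(trimmed_mean(values, 0.1))     # drops the smallest and largest value here
print(weighted_mean([1, 2, 3], [1, 1, 2]))

# Hypothetical grouped data: median interval starts at 20, has width 10 and
# frequency 25; 40 of the 100 values fall in lower intervals
print(grouped_median(l1=20, n=100, freq_below=40, freq_median=25, width=10))
```

Comparing the plain mean with the trimmed mean shows how much a single extreme value (110 here) pulls the average upward.
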
Measuring the Central Tendency
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula:
\[ mean - mode \approx 3 \times (mean - median) \]

Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively skewed, and negatively skewed data

[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

June 18, 2024 Data Mining: Concepts and Techniques


Measuring the Dispersion of Data
 Variance and standard deviation (sample: s, population: σ)
   Variance: (algebraic, scalable computation)
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]
   Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)

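A short check that the two forms of the population variance agree, using hypothetical data. The comparison with `statistics.pvariance` (population) and `statistics.variance` (sample, dividing by n - 1) is added here for orientation.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(data)
mu = sum(data) / n

# Definition: mean squared deviation from the mean
var_definition = sum((x - mu) ** 2 for x in data) / n

# Shortcut: mean of squares minus square of the mean (can be computed in one pass)
var_shortcut = sum(x * x for x in data) / n - mu ** 2

sigma = math.sqrt(var_definition)

print(var_definition, var_shortcut, sigma)
print(statistics.pvariance(data))   # population variance, divides by n
print(statistics.variance(data))    # sample variance, divides by n - 1
```

The shortcut form is what makes the computation scalable, since the sum of values and the sum of squares can be accumulated in a single scan; with floating-point data it can lose precision when the mean is large relative to the spread.
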
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
   Quartiles: points that split the sorted data into consecutive sets of equal size
     Q1 (25th percentile), Q3 (75th percentile)
   Inter-quartile range: IQR = Q3 - Q1
   Five-number summary: fuller summary of the shape of the distribution
     min, Q1, median, Q3, max
   Boxplot: ends of the box are the quartiles; median is marked; add whiskers (lines outside the box), and plot outliers individually
   Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

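A sketch of the five-number summary and the 1.5 × IQR outlier rule using `statistics.quantiles` from the standard library (Python 3.8 or later). The data are hypothetical, and the "inclusive" quartile method is one of several conventions; other tools may report slightly different Q1 and Q3 for the same data.

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # hypothetical data

# Three cut points that split the sorted data into four parts
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

summary = (min(values), q1, median, q3, max(values))

# Values more than 1.5 * IQR below Q1 or above Q3 are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print("five-number summary:", summary)
print("IQR:", iqr)
print("outliers:", outliers)
```

In a boxplot of these data, the box would span Q1 to Q3 and the value 110 would be drawn as an individual point beyond the upper whisker.
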
Boxplot Analysis
 Five-number summary of a distribution
   Minimum, Q1, Median, Q3, Maximum
 Boxplot
   Data is represented with a box
   The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
   The median is marked by a line within the box
   Whiskers: two lines outside the box extended to the Minimum and Maximum
   Outliers: points beyond a specified outlier threshold, plotted individually

Visualization of Data Dispersion: 3-D Boxplots



Properties of Normal Distribution Curve
 The normal (distribution) curve (μ: mean, σ: standard deviation)
   From μ - σ to μ + σ: contains about 68% of the measurements
   From μ - 2σ to μ + 2σ: contains about 95% of the measurements
   From μ - 3σ to μ + 3σ: contains about 99.7% of the measurements

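These percentages follow from the normal cumulative distribution: the probability of falling within k standard deviations of the mean is erf(k / √2). A minimal check with the standard library (the function name is illustrative):

```python
import math

def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage_within(k):.4f}")
```

The result does not depend on the particular values of μ and σ, which is why the rule can be stated once for every normal curve.
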
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of the five-number summary
 Histogram: the x-axis shows values, the y-axis represents frequencies
 Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis
 Histogram: graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram; x-axis values from 10000 to 90000, y-axis frequencies from 0 to 40]

Histograms Often Tell More than Boxplots
 The two histograms shown on the left may have the same boxplot representation
   The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions

[Figure: two histograms with different shapes]

Quantile Plot
 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
 Plots quantile information
   For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i

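A sketch of how the (f_i, x_i) pairs for a quantile plot might be computed. The slide does not say how f_i is defined; the formula f_i = (i - 0.5) / n used below is one common convention and is an assumption here, as are the function name and data.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n.

    Roughly 100 * f_i percent of the data are at or below x_i.
    """
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

# Hypothetical unit prices
pairs = quantile_pairs([40, 10, 30, 20])
for f, x in pairs:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f for every observation gives the quantile plot; with this convention the f-values run from just above 0 to just below 1 in equal steps.
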


Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

Scatter Plot
 Provides a first look at bivariate data to see clusters of points, outliers, etc.
 Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data
 The left half of the figure shows positively correlated data
 The right half shows negatively correlated data

Uncorrelated Data
