
BIG DATA ANALYTICS

Lecture 10 --- Week 11


Content

 Overview of Clustering

 Some Applications of Clustering

 Uses of Clustering

 Similarity and Distance Measures

 Jaccard’s coefficient and distance, simple matching coefficient and distance, and Hamming distance
Overview of Clustering

 In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups
 A good clustering minimizes intra-cluster distances and maximizes inter-cluster distances
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no
predefined classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may unveil important information
 As a preprocessing step for other algorithms
 Efficient indexing or compression often relies on clustering
Some Applications of Clustering

 Pattern Recognition
 Image Processing
 cluster images based on their visual content
 Bio-informatics
 WWW and IR
 document classification
 cluster Weblog data to discover groups of similar access
patterns
Uses of Clustering

 Understanding
 Group related documents for browsing, genes and proteins that have similar functionality, stocks with similar price fluctuations, users with the same behavior
 Example: discovered stock clusters and their industry groups
 Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
 Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
 Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
 Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

 Summarization
 Reduce the size of large data sets

 Applications
 Recommendation systems
 Search personalization
 (Figure: clustering precipitation in Australia)
Outliers
 Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
 (Figure: one large cluster and a few isolated outlier points)
 In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Data Structures

 Data matrix (two modes): n tuples/objects by p attributes/dimensions; the "classic" data input

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

 Dissimilarity or distance matrix (one mode): objects by objects; only the lower triangle is stored, assuming a symmetric distance d(i,j) = d(j,i)

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
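As a minimal sketch (the small `data` matrix is hypothetical, and Euclidean distance is only defined later in this lecture), the one-mode dissimilarity matrix can be derived from the two-mode data matrix like this:

```python
import math

def euclidean(u, v):
    """Straight-line distance between two p-dimensional objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

data = [[1.0, 2.0], [2.0, 4.0], [5.0, 1.0]]   # n = 3 objects, p = 2 attributes
n = len(data)
dist = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
# Since d(i,j) = d(j,i) and d(i,i) = 0, only the lower triangle
# actually needs to be stored, as in the one-mode matrix above.
```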
Similarity and Distance

 For many different problems we need to quantify how close two objects
are.
 Examples:
 For an item bought by a customer, find other similar items
 Group together the customers of a site so that similar customers are shown
the same ad.
 Group together web documents so that you can separate the ones that talk
about politics and the ones that talk about sports.
 Find all the near-duplicate mirrored web documents.
 Find credit card transactions that are very different from previous transactions.
 To solve these problems we need a definition of similarity, or distance.
 The definition depends on the type of data that we have
Similarity

 Numerical measure of how alike two data objects are.


 A function that maps pairs of objects to real values
 Higher when objects are more alike.
 Often falls in the range [0,1], sometimes in [-1,1]

 Desirable properties for similarity


1. s(p, q) = 1 (or maximum similarity) only if p = q. (Identity)
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
Similarity between sets

 Consider the following documents:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

 Which ones are more similar?
 How would you quantify their similarity?
Similarity: Intersection

 Similarity = number of words in common

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

 Sim(D1,D2) = 3, Sim(D1,D3) = Sim(D2,D3) = 2
 What about this document?

D4: Vefa rereases new book with apple pie recipes

 Sim(D1,D4) = Sim(D2,D4) = 3
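A minimal sketch of this word-overlap similarity, treating each document as a set of words (the set literals below just transcribe D1 through D3):

```python
d1 = {"apple", "releases", "new", "ipod"}
d2 = {"apple", "releases", "new", "ipad"}
d3 = {"new", "apple", "pie", "recipe"}

def intersect_sim(a, b):
    """Number of words the two documents have in common."""
    return len(a & b)

print(intersect_sim(d1, d2))  # 3
print(intersect_sim(d1, d3))  # 2
print(intersect_sim(d2, d3))  # 2
```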
Measuring Similarity in Clustering
 Dissimilarity/Similarity metric:
 The dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
 d(i, j) ≥ 0 (non-negativity)
 d(i, i) = 0 (isolation)
 d(i, j) = d(j, i) (symmetry)
 d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

 The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.

 Weights may be associated with different variables based on applications and data semantics.
Type of data in cluster analysis
 Interval-scaled variables
 e.g., salary, height

 Binary variables
 e.g., gender (M/F), has_cancer(T/F)

 Nominal (categorical) variables


 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

 Ratio-scaled variables
 population growth (1,10,100,1000,...)

 Variables of mixed types


 multiple attributes with various types
Similarity and Dissimilarity Between Objects
 Distance metrics are normally used to measure the similarity or dissimilarity between two data objects
 The most popular conform to the Minkowski distance:

$$
L_p(i,j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
$$

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects, and p is a positive integer

 If p = 1, L1 is the Manhattan (or city block) distance:

$$
L_1(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
$$
Similarity and Dissimilarity Between Objects (Cont.)
 If p = 2, L2 is the Euclidean distance:

$$
d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2}
$$

 Properties:
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)

 Also one can use a weighted distance:

$$
d(i,j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}
$$
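A minimal sketch of this family of distances (assuming plain numeric lists; p = 1 recovers Manhattan, p = 2 Euclidean):

```python
def minkowski(i, j, p):
    """Minkowski distance: (sum_k |i_k - j_k|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1 / p)

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
```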
Jaccard Similarity
 The Jaccard similarity (Jaccard coefficient) of two sets S1, S2 is the size of their intersection divided by the size of their union:
 JSim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
 Example: with 3 elements in the intersection and 8 in the union, the Jaccard similarity is 3/8

 Extreme behavior:
 JSim(X,Y) = 1 iff X = Y
 JSim(X,Y) = 0 iff X and Y have no elements in common
 JSim is symmetric
Jaccard Similarity between sets

 The Jaccard similarities for the documents:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa rereases new book with apple pie recipes

 JSim(D1,D2) = 3/5
 JSim(D1,D3) = JSim(D2,D3) = 2/6
 JSim(D1,D4) = JSim(D2,D4) = 3/9
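A small sketch of the Jaccard computation on the first three documents, using the same word sets as before:

```python
def jsim(a, b):
    """Jaccard similarity: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

d1 = {"apple", "releases", "new", "ipod"}
d2 = {"apple", "releases", "new", "ipad"}
d3 = {"new", "apple", "pie", "recipe"}
print(jsim(d1, d2))  # 3/5 = 0.6
print(jsim(d1, d3))  # 2/6 ≈ 0.333
```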
Binary Variables
 A binary variable has two states: 0 (absent), 1 (present)
 A contingency table for binary data, e.g., i = (0011101001) and j = (1001100110):

                   object j
                  1      0      sum
object i    1     a      b      a+b
            0     c      d      c+d
           sum   a+c    b+d      p

 Simple matching coefficient distance (invariant, if the binary variable is symmetric):

$$
d(i,j) = \frac{b + c}{a + b + c + d}
$$

 Jaccard coefficient distance (noninvariant, if the binary variable is asymmetric):

$$
d(i,j) = \frac{b + c}{a + b + c}
$$
Binary Variables
 Another approach is to define the similarity of two objects and not their distance.
 In that case we have the following:
 Simple matching coefficient similarity:

$$
s(i,j) = \frac{a + d}{a + b + c + d}
$$

 Jaccard coefficient similarity:

$$
s(i,j) = \frac{a}{a + b + c}
$$

 Note that: s(i,j) = 1 − d(i,j)
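As a small sketch, both coefficients can be computed from the contingency counts a, b, c, d; the vectors i and j below are the example pair from the contingency-table slide:

```python
def counts(u, v):
    """Contingency counts for two binary vectors."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # both 1
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # 1 in u only
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # 1 in v only
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # both 0
    return a, b, c, d

i = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
j = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
a, b, c, d = counts(i, j)
print((a + d) / (a + b + c + d))  # simple matching similarity: 0.4
print(a / (a + b + c))            # Jaccard similarity: 0.25
```

Each similarity is 1 minus the corresponding distance, as noted above.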


Dissimilarity between Binary Variables
 Example (Jaccard coefficient distance)

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     1       0       1        0        0        0
Mary     1       0       1        0        1        0
Jim      1       1       0        0        0        0

 All attributes are asymmetric binary
 1 denotes presence or positive test
 0 denotes absence or negative test

$$
d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$
$$
d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$
$$
d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$
A simpler definition
 Each object is mapped to a bitmap (binary vector)

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     1       0       1        0        0        0
Mary     1       0       1        0        1        0
Jim      1       1       0        0        0        0

 Jack: 101000
 Mary: 101010
 Jim: 110000

 Simple match distance:

$$
d(i,j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}
$$

 Jaccard coefficient distance:

$$
d(i,j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}
$$
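A minimal sketch of this bitmap formulation, reproducing the Jack/Mary/Jim distances computed above:

```python
def jaccard_distance(u, v):
    """Jaccard coefficient distance over bitmaps; positions where both
    bits are 0 are ignored, as fits asymmetric binary attributes."""
    ones_and = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # 1's in i AND j
    ones_or  = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)   # 1's in i OR j
    return 1 - ones_and / ones_or

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```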
Distance

 Numerical measure of how different two data objects are


 A function that maps pairs of objects to real values
 Lower when objects are more alike
 Higher when two objects are different
 Minimum distance is 0, when comparing an object with itself.
 Upper limit varies
Distance Metric

 A distance function d is a distance metric if it is a function from pairs of objects to real numbers such that:
 1. d(x,y) ≥ 0. (non-negativity)
 2. d(x,y) = 0 iff x = y. (identity)
 3. d(x,y) = d(y,x). (symmetry)
 4. d(x,y) ≤ d(x,z) + d(z,y). (triangle inequality)
Hamming Distance

 Hamming distance is the number of positions in which two bit-vectors differ.
 Example: p1 = 10101, p2 = 10011
 d(p1, p2) = 2, because the bit-vectors differ in the 3rd and 4th positions.
 It is the L1 norm for binary vectors.

 The Hamming distance between two vectors of categorical attributes is the number of positions in which they differ.
 Example: x = (married, low income, cheat), y = (single, low income, not cheat)
 d(x,y) = 2
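A short sketch covering both cases; the same function works for bit-strings and for tuples of categorical values:

```python
def hamming(x, y):
    """Number of positions in which two equal-length sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("10101", "10011"))                        # 2
print(hamming(("married", "low income", "cheat"),
              ("single", "low income", "not cheat")))   # 2
```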
