
BIG DATA ANALYTICS

Lecture 10 --- Week 11


Content

 Overview of Clustering

 Some Applications of Clustering

 Uses of Clustering

 Similarity and Distance Measures

 Jaccard’s coefficient and distance, simple matching coefficient and distance, and Hamming distance
Overview of Clustering

 In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups
 A good clustering minimizes intra-cluster distances and maximizes inter-cluster distances
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no
predefined classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may unveil important information
 As a preprocessing step for other algorithms
 Efficient indexing or compression often relies on clustering
Some Applications of Clustering

 Pattern Recognition
 Image Processing
 cluster images based on their visual content
 Bio-informatics
 WWW and IR
 document classification
 cluster Weblog data to discover groups of similar access
patterns
Uses of Clustering

 Understanding
 Group related documents for browsing, genes and proteins that have similar functionality, stocks with similar price fluctuations, users with the same behavior
 Example: discovered stock clusters and their industry groups
 Cluster 1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
 Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
 Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
 Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

 Summarization
 Reduce the size of large data sets

 Applications
 Recommendation systems
 Search personalization
 (Figure: clustering precipitation in Australia)
Outliers
 Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
 (Figure: one large cluster and a few isolated outlier points)
 In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Data Structures

 Data matrix (two modes): n tuples/objects by p attributes/dimensions; the "classic" data input

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

 Dissimilarity or distance matrix (one mode): objects by objects; only the lower triangle is stored, assuming a symmetric distance d(i,j) = d(j,i)

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
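As a minimal sketch (the small `data` matrix is hypothetical, and Euclidean distance is only defined later in this lecture), the one-mode dissimilarity matrix can be derived from the two-mode data matrix like this:

```python
import math

def euclidean(u, v):
    """Straight-line distance between two p-dimensional objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

data = [[1.0, 2.0], [2.0, 4.0], [5.0, 1.0]]   # n = 3 objects, p = 2 attributes
n = len(data)
dist = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
# Since d(i,j) = d(j,i) and d(i,i) = 0, only the lower triangle
# actually needs to be stored, as in the one-mode matrix above.
```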
Similarity and Distance

 For many different problems we need to quantify how close two objects
are.
 Examples:
 For an item bought by a customer, find other similar items
 Group together the customers of a site so that similar customers are shown
the same ad.
 Group together web documents so that you can separate the ones that talk
about politics and the ones that talk about sports.
 Find all the near-duplicate mirrored web documents.
 Find credit card transactions that are very different from previous transactions.
 To solve these problems we need a definition of similarity, or distance.
 The definition depends on the type of data that we have
Similarity

 Numerical measure of how alike two data objects are.


 A function that maps pairs of objects to real values
 Higher when objects are more alike.
 Often falls in the range [0,1], sometimes in [-1,1]

 Desirable properties for similarity


1. s(p, q) = 1 (or maximum similarity) only if p = q. (Identity)
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
Similarity between sets

 Consider the following documents:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

 Which ones are more similar?
 How would you quantify their similarity?
Similarity: Intersection

 Similarity = number of words in common

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

 Sim(D1,D2) = 3, Sim(D1,D3) = Sim(D2,D3) = 2
 What about this document?

D4: Vefa rereases new book with apple pie recipes

 Sim(D1,D4) = Sim(D2,D4) = 3
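A minimal sketch of this word-overlap similarity, treating each document as a set of words (the set literals below just transcribe D1 through D3):

```python
d1 = {"apple", "releases", "new", "ipod"}
d2 = {"apple", "releases", "new", "ipad"}
d3 = {"new", "apple", "pie", "recipe"}

def intersect_sim(a, b):
    """Number of words the two documents have in common."""
    return len(a & b)

print(intersect_sim(d1, d2))  # 3
print(intersect_sim(d1, d3))  # 2
print(intersect_sim(d2, d3))  # 2
```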
Measuring Similarity in Clustering
 Dissimilarity/Similarity metric:
 The dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
 d(i, j) ≥ 0 (non-negativity)
 d(i, i) = 0 (isolation)
 d(i, j) = d(j, i) (symmetry)
 d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

 The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.

 Weights may be associated with different variables based on applications and data semantics.
Type of data in cluster analysis
 Interval-scaled variables
 e.g., salary, height

 Binary variables
 e.g., gender (M/F), has_cancer(T/F)

 Nominal (categorical) variables


 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

 Ratio-scaled variables
 population growth (1,10,100,1000,...)

 Variables of mixed types


 multiple attributes with various types
Similarity and Dissimilarity Between Objects
 Distance metrics are normally used to measure the similarity or dissimilarity between two data objects
 The most popular conform to the Minkowski distance:

$$
L_p(i,j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
$$

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects, and p is a positive integer

 If p = 1, L1 is the Manhattan (or city block) distance:

$$
L_1(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
$$
Similarity and Dissimilarity Between Objects (Cont.)
 If p = 2, L2 is the Euclidean distance:

$$
d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2}
$$

 Properties:
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)

 Also one can use a weighted distance:

$$
d(i,j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}
$$
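A minimal sketch of this family of distances (assuming plain numeric lists; p = 1 recovers Manhattan, p = 2 Euclidean):

```python
def minkowski(i, j, p):
    """Minkowski distance: (sum_k |i_k - j_k|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1 / p)

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
```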
Jaccard Similarity
 The Jaccard similarity (Jaccard coefficient) of two sets S1, S2 is the size of their intersection divided by the size of their union:
 JSim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
 Example: with 3 elements in the intersection and 8 in the union, the Jaccard similarity is 3/8

 Extreme behavior:
 JSim(X,Y) = 1 iff X = Y
 JSim(X,Y) = 0 iff X and Y have no elements in common
 JSim is symmetric
Jaccard Similarity between sets

 The Jaccard similarities for the documents:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa rereases new book with apple pie recipes

 JSim(D1,D2) = 3/5
 JSim(D1,D3) = JSim(D2,D3) = 2/6
 JSim(D1,D4) = JSim(D2,D4) = 3/9
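A small sketch of the Jaccard computation on the first three documents, using the same word sets as before:

```python
def jsim(a, b):
    """Jaccard similarity: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

d1 = {"apple", "releases", "new", "ipod"}
d2 = {"apple", "releases", "new", "ipad"}
d3 = {"new", "apple", "pie", "recipe"}
print(jsim(d1, d2))  # 3/5 = 0.6
print(jsim(d1, d3))  # 2/6 ≈ 0.333
```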
Binary Variables
 A binary variable has two states: 0 (absent), 1 (present)
 A contingency table for binary data, e.g., i = (0011101001) and j = (1001100110):

                   object j
                  1      0      sum
object i    1     a      b      a+b
            0     c      d      c+d
           sum   a+c    b+d      p

 Simple matching coefficient distance (invariant, if the binary variable is symmetric):

$$
d(i,j) = \frac{b + c}{a + b + c + d}
$$

 Jaccard coefficient distance (noninvariant, if the binary variable is asymmetric):

$$
d(i,j) = \frac{b + c}{a + b + c}
$$
Binary Variables
 Another approach is to define the similarity of two objects and not their distance.
 In that case we have the following:
 Simple matching coefficient similarity:

$$
s(i,j) = \frac{a + d}{a + b + c + d}
$$

 Jaccard coefficient similarity:

$$
s(i,j) = \frac{a}{a + b + c}
$$

 Note that: s(i,j) = 1 − d(i,j)
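As a small sketch, both coefficients can be computed from the contingency counts a, b, c, d; the vectors i and j below are the example pair from the contingency-table slide:

```python
def counts(u, v):
    """Contingency counts for two binary vectors."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # both 1
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # 1 in u only
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # 1 in v only
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # both 0
    return a, b, c, d

i = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
j = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
a, b, c, d = counts(i, j)
print((a + d) / (a + b + c + d))  # simple matching similarity: 0.4
print(a / (a + b + c))            # Jaccard similarity: 0.25
```

Each similarity is 1 minus the corresponding distance, as noted above.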


Dissimilarity between Binary Variables
 Example (Jaccard coefficient distance)

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     1       0       1        0        0        0
Mary     1       0       1        0        1        0
Jim      1       1       0        0        0        0

 All attributes are asymmetric binary
 1 denotes presence or positive test
 0 denotes absence or negative test

$$
d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$
$$
d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$
$$
d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$
A simpler definition
 Each object is mapped to a bitmap (binary vector)

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     1       0       1        0        0        0
Mary     1       0       1        0        1        0
Jim      1       1       0        0        0        0

 Jack: 101000
 Mary: 101010
 Jim: 110000

 Simple match distance:

$$
d(i,j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}
$$

 Jaccard coefficient distance:

$$
d(i,j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}
$$
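A minimal sketch of this bitmap formulation, reproducing the Jack/Mary/Jim distances computed above:

```python
def jaccard_distance(u, v):
    """Jaccard coefficient distance over bitmaps; positions where both
    bits are 0 are ignored, as fits asymmetric binary attributes."""
    ones_and = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # 1's in i AND j
    ones_or  = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)   # 1's in i OR j
    return 1 - ones_and / ones_or

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```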
Distance

 Numerical measure of how different two data objects are


 A function that maps pairs of objects to real values
 Lower when objects are more alike
 Higher when two objects are different
 Minimum distance is 0, when comparing an object with itself.
 Upper limit varies
Distance Metric

 A distance function d is a distance metric if it is a function from pairs of objects to real numbers such that:
 1. d(x,y) ≥ 0. (non-negativity)
 2. d(x,y) = 0 iff x = y. (identity)
 3. d(x,y) = d(y,x). (symmetry)
 4. d(x,y) ≤ d(x,z) + d(z,y). (triangle inequality)
Hamming Distance

 Hamming distance is the number of positions in which two bit-vectors differ.
 Example: p1 = 10101, p2 = 10011
 d(p1, p2) = 2, because the bit-vectors differ in the 3rd and 4th positions.
 It is the L1 norm for binary vectors.

 The Hamming distance between two vectors of categorical attributes is the number of positions in which they differ.
 Example: x = (married, low income, cheat), y = (single, low income, not cheat)
 d(x,y) = 2
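A short sketch covering both cases; the same function works for bit-strings and for tuples of categorical values:

```python
def hamming(x, y):
    """Number of positions in which two equal-length sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("10101", "10011"))                        # 2
print(hamming(("married", "low income", "cheat"),
              ("single", "low income", "not cheat")))   # 2
```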
