Lesson 6: Similarities and KNN

The document discusses distance and similarity measures, which are important for many data mining and analytics tasks. It describes how objects can be represented as vectors to allow computation of distances and similarities. Several common distance measures are introduced, including Manhattan distance, Euclidean distance, and Minkowski distance. It also discusses normalization techniques and vector-based similarity measures such as cosine similarity.


Distance and Similarity Measures

CSE450: Data Mining


Summer 2018
SAH@DIU
Distance or Similarity Measures
 Many data mining and analytics tasks involve the comparison of
objects and determining their similarities (or dissimilarities)
 Clustering
 Nearest-neighbor search, classification, and prediction
 Characterization and discrimination
 Automatic categorization
 Correlation analysis
 Many of today's real-world applications rely on the computation of
similarities or distances among objects
 Personalization
 Recommender systems
 Document categorization
 Information retrieval
 Target marketing

2
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]

 Dissimilarity (e.g., distance)


 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Upper limit varies

 Proximity refers to a similarity or dissimilarity

3
Distance or Similarity Measures
 Measuring Distance or Similarity
 In order to group similar items, we need a way to measure the distance
between objects (e.g., records)
 Often requires the representation of objects as “feature vectors”

 An Employee DB

    ID  Gender  Age  Salary
    1   F       27   19,000
    2   M       51   64,000
    3   M       52   100,000
    4   F       33   55,000
    5   M       45   45,000

 Term Frequencies for Documents

          T1  T2  T3  T4  T5  T6
    Doc1  0   4   0   0   0   2
    Doc2  3   1   4   3   1   2
    Doc3  3   0   0   0   3   0
    Doc4  0   1   0   3   0   0
    Doc5  2   2   2   3   1   4

 Feature vector corresponding to Employee 2: <M, 51, 64000.0>
 Feature vector corresponding to Document 4: <0, 1, 0, 3, 0, 0>

4
Distance or Similarity Measures
 Representation of objects as vectors:
 Each data object (item) can be viewed as an n-dimensional vector, where
the dimensions are the attributes (features) in the data
 Example (employee DB): Emp. ID 2 = <M, 51, 64000>
 Example (Documents): DOC2 = <3, 1, 4, 3, 1, 2>
 The vector representation allows us to compute distance or similarity
between pairs of items using standard vector operations, e.g.,
 Cosine of the angle between vectors
 Manhattan distance
 Euclidean distance
 Hamming Distance

 Properties of Distance Measures:


 for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
 for any object A, dist(A, A) = 0
 dist(A, C) ≤ dist(A, B) + dist(B, C)

5
Data Matrix and Distance Matrix
 Data matrix
   Conceptual representation of a table
   Cols = features; rows = data objects
   n data points with p dimensions
   Each row in the matrix is the vector representation of a data object

        | x11  ...  x1f  ...  x1p |
        | ...  ...  ...  ...  ... |
        | xi1  ...  xif  ...  xip |
        | ...  ...  ...  ...  ... |
        | xn1  ...  xnf  ...  xnp |

 Distance (or Similarity) Matrix
   n data points, but indicates only the pairwise distance (or similarity)
   A triangular matrix
   Symmetric

        | 0                           |
        | d(2,1)  0                   |
        | d(3,1)  d(3,2)  0           |
        |   :       :     :          |
        | d(n,1)  d(n,2)  ...  ...  0 |

6
Proximity Measure for Nominal Attributes
 If object attributes are all nominal (categorical), then proximity
measures are used to compare objects

 A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green
   (a generalization of a binary attribute)

 Method 1: Simple matching

      d(i, j) = (p - m) / p

   m: # of matches, p: total # of variables
 Method 2: Convert to Standard Spreadsheet format


 For each attribute A create M binary attribute for the M nominal states of A
 Then use standard vector-based similarity or distance metrics

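As an illustration, here is a minimal Python sketch of both methods; the function names and the example attribute values are just for illustration and are not from the slides.

```python
def simple_matching_distance(a, b):
    """Method 1: d(i, j) = (p - m) / p, with m = # of matching attributes, p = total # of attributes."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: encode one nominal attribute as M binary attributes (one per state)."""
    return [1 if value == s else 0 for s in states]

# Two objects described by the nominal attributes (color, size, shape):
print(simple_matching_distance(("red", "small", "round"),
                               ("red", "large", "round")))    # 0.333... (1 mismatch out of 3)
print(one_hot("yellow", ["red", "yellow", "blue", "green"]))  # [0, 1, 0, 0]
```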
7
Normalizing or Standardizing Numeric Data
 Z-score:
   x: raw value to be standardized, μ: mean of the population, σ: standard deviation

      z = (x - μ) / σ

   the distance between the raw score and the population mean, in units of the standard deviation
   negative when the value is below the mean, "+" when above
 Min-Max Normalization

    Original data:                     Min-max normalized:

    ID  Gender  Age  Salary            ID  Gender  Age   Salary
    1   F       27   19,000            1   1       0.00  0.00
    2   M       51   64,000            2   0       0.96  0.56
    3   M       52   100,000           3   0       1.00  1.00
    4   F       33   55,000            4   1       0.24  0.44
    5   M       45   45,000            5   0       0.72  0.32
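A minimal Python sketch of both normalizations (illustrative only; the min-max output reproduces the Age column in the table above):

```python
def z_score(x, mu, sigma):
    """z = (x - mu) / sigma"""
    return (x - mu) / sigma

def min_max(values):
    """Rescale a list of numeric values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [27, 51, 52, 33, 45]
print([round(a, 2) for a in min_max(ages)])  # [0.0, 0.96, 1.0, 0.24, 0.72]
```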
8
Common Distance Measures for Numeric Data

 Consider two vectors
   Rows in the data matrix: X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>

 Common Distance Measures:

   Manhattan distance:
      dist(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

   Euclidean distance:
      dist(X, Y) = SQRT( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

 Distance can be defined as a dual of a similarity measure:

      dist(X, Y) = 1 - sim(X, Y),   where   sim(X, Y) = Σi (xi * yi) / ( Σi xi^2 + Σi yi^2 )
9
Example: Data Matrix and Distance Matrix
 Data Matrix

    point  attribute1  attribute2
    x1     1           2
    x2     3           5
    x3     2           0
    x4     4           5

 Distance Matrix (Manhattan)

        x1   x2   x3   x4
    x1  0
    x2  5    0
    x3  3    6    0
    x4  6    1    7    0

 Distance Matrix (Euclidean)

        x1    x2    x3    x4
    x1  0
    x2  3.61  0
    x3  2.24  5.10  0
    x4  4.24  1.00  5.39  0
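The Manhattan distance matrix above can be reproduced with a small Python sketch (illustrative only):

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)
for i, a in enumerate(names):
    # print only the lower-triangular part, as in the distance matrix above
    print(a, [manhattan(points[a], points[b]) for b in names[:i + 1]])
# x1 [0]
# x2 [5, 0]
# x3 [3, 6, 0]
# x4 [6, 1, 7, 0]
```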
10
Distance on Numeric Data:
Minkowski Distance
 Minkowski distance: A popular distance measure

      d(i, j) = ( |xi1 - xj1|^h + |xi2 - xj2|^h + ... + |xip - xjp|^h )^(1/h)

   where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data
   objects, and h is the order (the distance so defined is also called the L-h norm)

 Note that Euclidean and Manhattan distances are special cases
   h = 1: (L1 norm) Manhattan distance

      d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

   h = 2: (L2 norm) Euclidean distance

      d(i, j) = SQRT( |xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2 )
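A minimal Python sketch of the general L-h distance (illustrative only; h = 1 and h = 2 reproduce the Manhattan and Euclidean values from the earlier example):

```python
def minkowski(x, y, h):
    """General L-h (Minkowski) distance; h = 1 is Manhattan, h = 2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

print(minkowski((1, 2), (3, 5), 1))            # 5.0  (Manhattan)
print(round(minkowski((1, 2), (3, 5), 2), 2))  # 3.61 (Euclidean)
```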
11
Vector-Based Similarity Measures
 In some situations, distance measures provide a skewed view of data
 E.g., when the data is very sparse and 0’s in the vectors are not significant
 In such cases, typically vector-based similarity measures are used
 Most common measure: Cosine similarity

      X = <x1, x2, ..., xn>     Y = <y1, y2, ..., yn>

 Dot product of two vectors:

      sim(X, Y) = X · Y = Σi (xi * yi)

 Cosine Similarity = normalized dot product
   the norm of a vector X is:  ||X|| = SQRT( Σi xi^2 )
   the cosine similarity is:

      sim(X, Y) = (X · Y) / ( ||X|| * ||Y|| ) = Σi (xi * yi) / ( SQRT(Σi xi^2) * SQRT(Σi yi^2) )
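A minimal Python sketch of the cosine similarity formula (illustrative only; the vectors are Doc2 and Doc4 from the 6-term document table shown earlier):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

doc2 = (3, 1, 4, 3, 1, 2)   # Doc2 from the earlier term-frequency table
doc4 = (0, 1, 0, 3, 0, 0)   # Doc4 from the earlier term-frequency table
print(cosine(doc2, doc4))   # 0.5
```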

12
Vector-Based Similarity Measures
 Why divide by the norm?

      X = <x1, x2, ..., xn>          ||X|| = SQRT( Σi xi^2 )

 Example:
 X = <2, 0, 3, 2, 1, 4>

 ||X|| = SQRT(4+0+9+4+1+16) = 5.83

 X* = X / ||X|| = <0.343, 0, 0.514, 0.343, 0.171, 0.686>

 Now, note that ||X*|| = 1

 So, dividing a vector by its norm, turns it into a unit-length vector


 Cosine similarity measures the cosine of the angle between two vectors (i.e., the
   magnitudes of the vectors are ignored).
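A minimal Python sketch of the unit-length normalization above (illustrative only):

```python
import math

X = [2, 0, 3, 2, 1, 4]
norm_X = math.sqrt(sum(a * a for a in X))
X_unit = [a / norm_X for a in X]                          # X* = X / ||X||

print(round(norm_X, 2))                                   # 5.83
print([round(a, 3) for a in X_unit])                      # [0.343, 0.0, 0.514, 0.343, 0.171, 0.686]
print(round(math.sqrt(sum(a * a for a in X_unit)), 6))    # 1.0
```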

13
Example Application: Information Retrieval
 Documents are represented as “bags of words”
 Represented as vectors when used computationally
 A vector is an array of floating-point values (or binary values in the case of bit maps)
 Has direction and magnitude
 Each vector has a place for every term in the collection (most entries are zero, i.e., the vectors are sparse)
 Rows are document IDs (A-I); each row is a document vector.

          nova  galaxy  heat  actor  film  role
A 1.0 0.5 0.3
B 0.5 1.0
C 1.0 0.8 0.7
D 0.9 1.0 0.5
E 1.0 1.0
F 0.7
G 0.5 0.7 0.9
H 0.6 1.0 0.3 0.2
I 0.7 0.5 0.3

14
Documents & Query in n-dimensional Space

 Documents are represented as vectors in the term space


 Typically values in each dimension correspond to the frequency of the
corresponding term in the document
 Queries represented as vectors in the same vector-space
 Cosine similarity between the query and documents is often used
to rank retrieved documents
15
Example: Similarities among Documents
 Consider the following document-term matrix

T1 T2 T3 T4 T5 T6 T7 T8
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2

 Dot-Product(Doc2, Doc4) = <3,1,4,3,1,2,0,1> · <0,1,0,3,0,0,2,0>
                         = 0 + 1 + 0 + 9 + 0 + 0 + 0 + 0 = 10

 Norm(Doc2) = SQRT(9+1+16+9+1+4+0+1) = 6.40
 Norm(Doc4) = SQRT(0+1+0+9+0+0+4+0) = 3.74

 Cosine(Doc2, Doc4) = 10 / (6.40 * 3.74) = 0.42
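The same computation in a short Python sketch (illustrative only):

```python
import math

doc2 = [3, 1, 4, 3, 1, 2, 0, 1]
doc4 = [0, 1, 0, 3, 0, 0, 2, 0]

dot = sum(a * b for a, b in zip(doc2, doc4))    # 10
norm2 = math.sqrt(sum(a * a for a in doc2))     # 6.40
norm4 = math.sqrt(sum(a * a for a in doc4))     # 3.74
print(round(dot / (norm2 * norm4), 2))          # 0.42
```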

16
Correlation as Similarity

 In cases where the data objects can have very different means (e.g., movie ratings,
   where users apply the rating scale differently), the Pearson correlation coefficient
   is a better choice as a similarity measure

 Pearson Correlation:

      sim(X, Y) = Σi (xi - mean(X))(yi - mean(Y)) / ( SQRT(Σi (xi - mean(X))^2) * SQRT(Σi (yi - mean(Y))^2) )

 Often used in recommender systems based on Collaborative Filtering
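A minimal Python sketch of Pearson correlation used as a similarity measure (illustrative only; the example reproduces the Sally/Karen value computed in the collaborative filtering example on the last slide):

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Sally vs. Karen on the three movies both have rated:
print(round(pearson([7, 6, 3], [7, 4, 3]), 2))  # 0.85
```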

17
Distance-Based Classification
 Basic Idea: classify new instances based on their similarity to or
distance from instances we have seen before
 Sometimes called “instance-based learning”

 Basic Idea:
 Save all previously encountered instances
 Given a new instance, find those instances that are most similar to the new one
 Assign new instance to the same class as these “nearest neighbors”

 “Lazy” Classifiers
 The approach defers all of the real work until a new instance is obtained; no
   attempt is made to learn a generalized model from the training set
 Less data preprocessing and model evaluation, but more work has to be done at
classification time

18
Nearest Neighbor Classifiers
 Basic idea:
 If it walks like a duck, quacks like a duck, then it’s probably a duck

 [Figure: compute the distance between the test record and the training records, then choose the k "nearest" records]

19
K-Nearest-Neighbor Strategy
 Given object x, find the k most similar objects to x
 The k nearest neighbors
 Variety of distance or similarity measures can be used to identify and rank
neighbors
 Note that this requires comparison between x and all objects in the database
 Classification:
 Find the class label for each of the k neighbors
 Use a voting or weighted voting approach to determine the majority class
among the neighbors (a combination function)
 Weighted voting means the closest neighbors count more
 Assign the majority class label to x
 Prediction:
 Identify the value of the target attribute for the k neighbors
 Return the weighted average as the predicted value of the target attribute for x

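A minimal Python sketch of this strategy for classification (illustrative only; the training data below is hypothetical, loosely based on the normalized employee table, with made-up labels):

```python
from collections import Counter
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(x, training_data, k):
    """training_data: list of (feature_vector, class_label) pairs."""
    neighbors = sorted(training_data, key=lambda rec: euclidean(x, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]            # majority class among the k neighbors

# Hypothetical data: normalized <age, salary> vectors with made-up yes/no labels.
train = [((0.00, 0.00), "no"), ((0.96, 0.56), "yes"), ((1.00, 1.00), "yes"),
         ((0.24, 0.44), "yes"), ((0.72, 0.32), "no")]
print(knn_classify((0.72, 1.00), train, k=3))    # 'yes' (2 of the 3 nearest neighbors say yes)
```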
20
Combination Functions
 Once the Nearest Neighbors are identified, the “votes” of these
neighbors must be combined to generate a prediction

 Voting: the “democracy” approach


 poll the neighbors for the answer and use the majority vote
 the number of neighbors (k) is often taken to be odd in order to avoid ties
 works when the number of classes is two
 if there are more than two classes, take k to be the number of classes plus 1

 Impact of k on predictions
 in general different values of k affect the outcome of classification
 we can associate a confidence level with predictions (this can be the % of
neighbors that are in agreement)
 a problem is that no single category may get a majority vote
 if there are strong variations in results for different choices of k, this is an
   indication that the training set is not large enough
21
Voting Approach - Example
 Will a new customer respond to solicitation?

    ID    Gender  Age  Salary   Respond?
    1     F       27   19,000   no
    2     M       51   64,000   yes
    3     M       52   105,000  yes
    4     F       33   55,000   yes
    5     M       45   45,000   no
    new   F       45   100,000  ?

Using the voting method without confidence


              Neighbors   Answers     k=1   k=2   k=3   k=4   k=5
    D_man     4,3,5,2,1   Y,Y,N,Y,N   yes   yes   yes   yes   yes
    D_euclid  4,1,5,2,3   Y,N,N,Y,Y   yes   ?     no    ?     yes

 Using the voting method with a confidence

              k=1         k=2         k=3        k=4        k=5
    D_man     yes, 100%   yes, 100%   yes, 67%   yes, 75%   yes, 60%
    D_euclid  yes, 100%   yes, 50%    no, 67%    yes, 50%   yes, 60%

22
Combination Functions
 Weighted Voting: not so "democratic"
   similar to voting, but the votes of some neighbors count more
   "shareholder democracy?"
   the question is which neighbors' votes should count more

 How can weights be obtained?


 Distance-based
 closer neighbors get higher weights
 “value” of the vote is the inverse of the distance (may need to add a small constant)
 the weighted sum for each class gives the combined score for that class
 to compute confidence, need to take weighted average
 Heuristic
 weight for each neighbor is based on domain-specific characteristics of that neighbor

 Advantage of weighted voting


 introduces enough variation to prevent ties in most cases
 helps distinguish between competing neighbors

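A minimal Python sketch of distance-based weighted voting (illustrative only; the 1/distance weighting and the small constant follow the bullet points above):

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-6):
    """neighbors: list of (distance, class_label) pairs for the k nearest neighbors."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)              # closer neighbors get higher weights
    winner = max(scores, key=scores.get)
    confidence = scores[winner] / sum(scores.values())   # weighted share of the winning class
    return winner, confidence

print(weighted_vote([(1.0, "yes"), (2.0, "yes"), (2.5, "no")]))
# ('yes', ~0.79): the two closer "yes" votes outweigh the single "no" vote
```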
23
KNN and Collaborative Filtering
 Collaborative Filtering Example
 A movie rating system
 Ratings scale: 1 = “hate it”; 7 = “love it”
 Historical DB of users includes ratings of movies by Sally, Bob, Chris, and Lynn
 Karen is a new user who has rated 3 movies, but has not yet seen “Independence
Day”; should we recommend it to her?
 Approach: use kNN to find similar users, then combine their ratings to get
prediction for Karen.

Sally Bob Chris Lynn Karen


Star Wars 7 7 3 4 7
Jurassic Park 6 4 7 4 4
Terminator II 3 4 7 6 3
Independence Day 7 6 2 2 ?

Will Karen like “Independence Day?”

24
Collaborative Filtering
(k Nearest Neighbor Example)

            Star Wars  Jurassic Park  Terminator 2  Indep. Day  Average  Cosine  Distance  Euclid  Pearson
    Sally   7          6              3             7           5.33     0.983   2         2.00    0.85
    Bob     7          4              4             6           5.00     0.995   1         1.00    0.97
    Chris   3          7              7             2           5.67     0.787   11        6.40    -0.97
    Lynn    4          4              6             2           4.67     0.874   6         4.24    -0.69

    Karen   7          4              3             ?           4.67     1.000   0         0.00    1.00

    K   Pearson Prediction
    1   6
    2   6.5
    3   5

 K is the number of nearest neighbors used to find the average predicted rating of
 Karen on Indep. Day.

 Example computation:
 Pearson(Sally, Karen) = ( (7-5.33)*(7-4.67) + (6-5.33)*(4-4.67) + (3-5.33)*(3-4.67) )
   / SQRT( ((7-5.33)^2 + (6-5.33)^2 + (3-5.33)^2) * ((7-4.67)^2 + (4-4.67)^2 + (3-4.67)^2) ) = 0.85
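A minimal Python sketch of the whole procedure (illustrative only): rank the historical users by their Pearson correlation with Karen, then average the Independence Day ratings of the top k.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Ratings on (Star Wars, Jurassic Park, Terminator 2) and the Indep. Day rating:
users = {"Sally": ([7, 6, 3], 7), "Bob": ([7, 4, 4], 6),
         "Chris": ([3, 7, 7], 2), "Lynn": ([4, 4, 6], 2)}
karen = [7, 4, 3]

ranked = sorted(users.items(), key=lambda kv: pearson(karen, kv[1][0]), reverse=True)
for k in (1, 2, 3):
    prediction = sum(iday for _, (_, iday) in ranked[:k]) / k
    print(k, prediction)   # 1 -> 6.0, 2 -> 6.5, 3 -> 5.0
```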

25
