Recommender Systems Assignment

DATA SIMILARITY

Data similarity measures are techniques used to quantify the similarity or
dissimilarity between two sets of data. These measures are commonly
employed in various fields, including data mining, machine learning, pattern
recognition, and information retrieval. The goal is to assess the degree of
resemblance or closeness between two data sets, which could be vectors, time
series, images, text documents, or any other type of data.

Some commonly used data similarity measures are:


1. Euclidean Distance:
• Measures the straight-line distance between two points in Euclidean
space.
• d = √[(x₂ − x₁)² + (y₂ − y₁)²]

• Advantages:
o Simple and easy to compute.
o Works well for continuous numeric data.
• Disadvantages:
o Sensitive to the scale of the data.
o May not perform well with high-dimensional data.

2. Manhattan Distance (L1 Norm):


• Computes the sum of absolute differences between corresponding
elements of two vectors.
• Distance(x, y) = ∑ᵢ₌₁ⁿ |xᵢ − yᵢ| = |x₁ − y₁| + |x₂ − y₂| + ... + |xₙ − yₙ|
• Advantages:
o Similar to Euclidean distance but less sensitive to outliers.
o Suitable for data with different scales.
• Disadvantages:
o May not be appropriate for datasets with complex structures.

3. Cosine Similarity:
• Measures the cosine of the angle between two vectors. It is commonly
used for text data and is robust to the vector's length.
• cos θ = (A · B) / (||A|| * ||B||)
• Advantages:
o Effective for text data and high-dimensional spaces.
o Insensitive to the magnitude of the vectors.
• Disadvantages:
o Ignores non-linear relationships in the data.
o Assumes that the data is represented as vectors.

4. Jaccard Similarity:
• Calculates the size of the intersection divided by the size of the union of
two sets. Commonly used for comparing sets.
• J(A, B) = |A∩B| / |A∪B|
• Advantages:
o Suitable for comparing sets, especially in binary or categorical
data.
o Handles sparsity well.
• Disadvantages:
o Ignores the magnitude of the elements in the sets.
o Not suitable for cases where the order of elements matters.

5. Hamming Distance:
• Measures the number of positions at which corresponding bits are
different in two binary strings of equal length.
• d(a, b) = number of 1s in a ⊕ b (the bitwise XOR of a and b)
• Advantages:
o Specifically designed for binary data.
o Simple and easy to interpret.
• Disadvantages:
o Only applicable to data of equal length.
o Limited to binary data.

6. Pearson Correlation Coefficient:


• Measures the linear correlation between two variables, providing a value
between −1 and 1.
• ρ(X, Y) = cov(X, Y) / (σX · σY)
• Advantages:
o Captures linear relationships between variables.
o Invariant to changes in the scale and location of the variables.
• Disadvantages:
o Assumes a linear relationship and may not capture non-linear
patterns.
o Affected by outliers.
The choice of similarity measure depends on the nature of the data and the
specific task at hand. Each measure has its strengths and weaknesses, and
selecting the appropriate one is crucial for meaningful comparisons.

IMPLEMENTING DATA SIMILARITY MEASURES USING PYTHON

1. Cosine Similarity
# import required libraries
import numpy as np
from numpy.linalg import norm

# define two arrays


A = np.array([2,1,2,3,2,9])
B = np.array([3,4,2,4,5,5])

print("A:", A)
print("B:", B)

# compute cosine similarity


cosine = np.dot(A,B)/(norm(A)*norm(B))
print("Cosine Similarity:", cosine)

2. Jaccard Similarity

A = {1,2,3,4,6}
B = {1,2,5,8,9}
C = A.intersection(B)
D = A.union(B)
print('AnB = ', C)
print('AUB = ', D)
print('J(A,B) = ', float(len(C))/float(len(D)))

3. Hamming Distance

# function to calculate Hamming distance
def hammingDist(str1, str2):
    i = 0
    count = 0
    while i < len(str1):
        if str1[i] != str2[i]:
            count += 1
        i += 1
    return count

# Driver code
str1 = "geekspractice"
str2 = "nerdspractise"

# function call
print(hammingDist(str1, str2))
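As a complement to the string version above, the XOR form of the formula (d(a, b) = number of 1s in a ⊕ b) can be applied directly to integers; a minimal sketch:

def hammingDistInt(a, b):
    # positions where the bits differ are exactly the 1-bits of a ^ b
    return bin(a ^ b).count("1")

# 0b1011 and 0b1001 differ only in the second-lowest bit
print(hammingDistInt(0b1011, 0b1001))  # prints 1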

4. Manhattan Distance

#create function to calculate Manhattan distance
def manhattan(a, b):
    return sum(abs(val1 - val2) for val1, val2 in zip(a, b))

#define vectors
A = [2, 4, 4, 6]
B = [5, 5, 7, 8]

#calculate and print Manhattan distance between vectors
print(manhattan(A, B))

5. Euclidean Distance

# Python code to find Euclidean distance
# using linalg.norm()

import numpy as np

# initializing points in
# numpy arrays
point1 = np.array((1, 2, 3))
point2 = np.array((1, 1, 1))

# calculating Euclidean distance
# using linalg.norm()
dist = np.linalg.norm(point1 - point2)

# printing Euclidean distance
print(dist)

6. Pearson Correlation Coefficient

import pandas as pd
from scipy.stats import pearsonr

# Import your data into Python


df = pd.read_csv("Auto.csv")

# Convert dataframe into series


list1 = df['weight']
list2 = df['mpg']

# Apply the pearsonr()


corr, _ = pearsonr(list1, list2)
print("Pearson's correlation: %.3f" % corr)
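If Auto.csv is not at hand, the same call can be checked on small in-memory lists (a minimal sketch; the values below are made up and nearly linear, so the coefficient should be close to 1):

from scipy.stats import pearsonr

x = [2, 4, 6, 8, 10]
y = [1, 3, 5, 7, 11]

corr, _ = pearsonr(x, y)
print("Pearson's correlation: %.3f" % corr)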

SINGULAR VALUE DECOMPOSITION (SVD)

• Singular Value Decomposition (SVD) is a mathematical technique used
in linear algebra to decompose a matrix into three other matrices. It has
wide applications in various fields, including signal processing, image
compression, data analysis, and machine learning. For a given matrix A,
the SVD is represented as:

• A = UΣVᵀ
• Here, U and V are orthogonal matrices (i.e., UUᵀ = I and
VVᵀ = I), and Σ is a diagonal matrix with singular values on its
diagonal.
• Let's break down each component of the SVD:
1. Matrix U:
o The columns of U are called left singular vectors.
o The columns of U form an orthonormal basis for the column
space of A.
o If A is an m×n matrix, U is m×m.
2. Diagonal Matrix Σ:
o The diagonal elements of Σ are the singular values of A, denoted
as σ₁, σ₂, …, σᵣ, where r is the rank of A.
o The singular values are always non-negative and represent the
magnitude of the singular vectors in U and V.
o The remaining elements of Σ are zero.
o If A is m×n, Σ is m×n with zeros outside the main diagonal.
3. Matrix Vᵀ:
o The rows of Vᵀ are called right singular vectors.
o The columns of V form an orthonormal basis for the row space of
A.
o If A is m×n, Vᵀ is n×n.

The SVD provides a powerful way to represent and analyze a matrix. The
singular values in Σ indicate the importance of each singular vector in
capturing the overall structure of the data. Higher singular values
correspond to more significant contributions to the matrix.
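A minimal sketch of computing and verifying the decomposition with NumPy (the matrix below is an arbitrary example):

import numpy as np

A = np.array([[3., 1.],
              [1., 3.],
              [0., 2.]])

# full SVD: for a 3x2 matrix, U is 3x3 and Vt is 2x2
U, s, Vt = np.linalg.svd(A)

# rebuild the 3x2 diagonal matrix Sigma from the singular values
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

print("singular values:", s)                   # non-negative, in descending order
print("A recovered:", np.allclose(U @ Sigma @ Vt, A))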

ADVANTAGES OF SINGULAR VALUE DECOMPOSITION (SVD):

1. Dimensionality Reduction:
- SVD allows for dimensionality reduction by selecting a subset of the most
significant singular values and corresponding vectors (see the truncated-SVD
sketch after this list). This is useful in reducing storage requirements and
computational complexity.

2. Noise Reduction:
- In the context of data analysis, retaining only the most significant
singular values can help filter out noise and focus on the most essential
features of the data.

3. Data Compression:
- SVD is used in data compression techniques, where it helps represent
data in a more compact form by capturing the dominant patterns and
relationships.

4. Numerical Stability:
- SVD is a numerically stable method for decomposing matrices, making it
robust in various numerical applications.

5. Unique Representation:
- SVD provides an essentially unique and optimal decomposition for any
matrix, allowing for a clear representation of its structure.

6. Applications in Signal Processing and Image Compression:
- SVD is widely used in signal processing and image compression, providing
efficient representations for these types of data.

7. Solving Linear Systems:
- SVD can be used to solve systems of linear equations and find solutions
to overdetermined or underdetermined systems through the use of the
pseudoinverse (see the sketch after this list).

8. Principal Component Analysis (PCA):
- PCA, which relies on SVD, is a powerful technique for identifying and
analyzing the principal components in a dataset.
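A minimal sketch of two of these uses with NumPy (the matrix, rank k, and right-hand side below are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))

# dimensionality reduction: keep only the k largest singular triplets
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A
print("rank-k approximation error:", np.linalg.norm(A - A_k))

# solving an overdetermined system Ax = b in the least-squares sense
# via the SVD-based pseudoinverse
b = rng.random(6)
x = np.linalg.pinv(A) @ b
print("least-squares solution:", x)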

DISADVANTAGES OF SINGULAR VALUE DECOMPOSITION (SVD):

1. Computational Complexity:
- The computational cost of performing the full SVD can be high, especially
for large matrices. Efficient algorithms and approximations are often used to
address this issue.

2. Storage Requirements:
- Storing the entire decomposition, especially for large matrices, may
require significant memory. However, in many applications, only a subset of
the singular values and vectors needs to be retained.

3. Interpretability:
- While SVD provides a unique decomposition, interpreting the meaning of
the singular values and vectors in real-world terms may not always be
straightforward, especially in high-dimensional spaces.

4. Sensitivity to Outliers:
- SVD can be sensitive to outliers in the data, potentially affecting the
accuracy of the decomposition.
5. Limited Applicability to Sparse Matrices:
- SVD is not directly applicable to sparse matrices, which have a large
number of zero entries. However, there are variants of SVD designed for
sparse matrices (see the sketch after this list).

6. Assumes Linearity:
- SVD assumes that relationships in the data are linear. In cases where
non-linear relationships dominate, other techniques may be more
appropriate.
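One such variant is SciPy's truncated SVD for sparse matrices, which computes only the k largest singular triplets without densifying the matrix; a minimal sketch (the size, density, and k below are arbitrary):

from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# a 100x50 sparse matrix with ~5% non-zero entries
A = sparse_random(100, 50, density=0.05, format='csr', random_state=0)

k = 5
U, s, Vt = svds(A, k=k)

# svds returns the singular values in ascending order
print("largest singular values:", s[::-1])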

APPLICATIONS OF SINGULAR VALUE DECOMPOSITION (SVD)

Singular Value Decomposition (SVD) finds applications in various fields:

1. Image Compression:
o SVD compresses images by capturing essential features with fewer
singular values and vectors.

2. Recommendation Systems:
o SVD factorizes user-item matrices for collaborative filtering,
enabling personalized recommendations (see the sketch after this list).

3. Principal Component Analysis (PCA):


o SVD aids PCA, reducing data dimensionality while preserving
variability.

4. Latent Semantic Analysis (LSA) in NLP:
o SVD uncovers hidden relationships in document-term matrices for
tasks like clustering and topic modeling.

5. Signal Processing and System Identification:
o SVD analyzes signals, identifying dominant frequencies, and aids
system identification in control theory.
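A minimal sketch of the recommendation use case, using a made-up user-item rating matrix (0 marks an unrated item; real systems treat missing entries more carefully, but this illustrates the factorization step):

import numpy as np

# rows = users, columns = items; the ratings are invented
R = np.array([[5., 4., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])

# rank-k factorization of the rating matrix
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat fills every cell with a predicted score,
# including the previously unrated (zero) entries
print(np.round(R_hat, 2))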

These applications highlight the versatility of Singular Value Decomposition in


extracting meaningful information from diverse types of data, making it a
valuable tool in fields ranging from computer vision and natural language
processing to recommendation systems and signal processing.
