Vector Applications in Data Science
Correlation and Cosine Similarity
• Correlation is one of the most fundamental and important analysis
methods in statistics and machine learning.
• The L1 norm is calculated as the sum of the absolute values of the vector elements, ‖x‖₁ = |x₁| + |x₂| + … + |xₙ|, where |a| denotes the absolute value of a scalar.
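As a quick sketch (not part of the slides), the L1 norm can be computed in NumPy and checked against the built-in norm function:

```python
import numpy as np

v = np.array([1, -2, 3, -4])

# L1 norm: sum of the absolute values of the vector elements
l1 = np.sum(np.abs(v))  # |1| + |-2| + |3| + |-4| = 10

# NumPy's built-in norm gives the same result with ord=1
print(l1, np.linalg.norm(v, ord=1))  # 10 10.0
```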
• Formula for Pearson correlation coefficient: ρ = x̃ᵀỹ / (‖x̃‖ ‖ỹ‖), where x̃ and ỹ are the mean-centered variables
• So there you go: the famous and widely used Pearson correlation coefficient is simply the dot product between two variables, normalized by the magnitudes of the variables.
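A minimal NumPy sketch of this formula (the function name is my own), checked against NumPy's built-in np.corrcoef:

```python
import numpy as np

def pearson_r(x, y):
    # Mean-center both variables, then take the dot product
    # normalized by the product of the vector magnitudes
    xm = x - x.mean()
    ym = y - y.mean()
    return np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))

x = np.array([1.0, 3.0, 5.0, 10.0])
y = np.array([2.0, 4.0, 6.0, 20.0])

# The hand-rolled version matches NumPy's built-in correlation coefficient
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```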
• Correlation is not the only way to assess similarity between two
variables.
• Another method is called cosine similarity.
• The formula for cosine similarity is: cos(θ) = xᵀy / (‖x‖ ‖y‖)
• Solution: first, we create a word table
• Then, we calculate cosine similarity for x = (1, 3, 5, 10) and y = (2, 4, 6, 20)
• Dot product: xᵀy = 1·2 + 3·4 + 5·6 + 10·20 = 244
• Magnitudes: ‖x‖ = √135 ≈ 11.62 and ‖y‖ = √456 ≈ 21.35
• Cosine similarity: 244 / (11.62 × 21.35) ≈ 0.983
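The arithmetic for this example can be reproduced with NumPy (a sketch, not from the slides):

```python
import numpy as np

x = np.array([1, 3, 5, 10])
y = np.array([2, 4, 6, 20])

# Cosine similarity: dot product normalized by the vector magnitudes
dot = np.dot(x, y)  # 1*2 + 3*4 + 5*6 + 10*20 = 244
cos_sim = dot / (np.linalg.norm(x) * np.linalg.norm(y))
print(dot, round(cos_sim, 4))  # 244 0.9834
```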
• Correlation versus cosine similarity: which one is better?
• Pearson correlation and cosine similarity can give different results for
the same data because they start from different assumptions.
• Example: the variables [0, 1, 2, 3] and [100, 101, 102, 103] are perfectly correlated, with ρ = 1
• The cosine similarity between those variables is 0.808; they are not in the same numerical scale and are therefore not perfectly related.
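This example is easy to verify numerically; a sketch in NumPy:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([100.0, 101.0, 102.0, 103.0])

# Pearson correlation mean-centers the data, so the offset of 100 vanishes
r = np.corrcoef(x, y)[0, 1]

# Cosine similarity does not mean-center, so the offset matters
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(round(r, 3), round(cos_sim, 3))  # 1.0 0.808
```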
• Neither measure is incorrect, nor is one better than the other; they simply make different assumptions about the data.
Time Series Filtering and Feature Detection
Explain the difference!
• To filter a time series, we compute the dot product between the kernel and a short snippet of the data of the same length as the kernel.
• This procedure produces one time point in the filtered signal.
• The kernel is then moved one time step to the right to compute the dot product with a different (overlapping) signal segment.
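The sliding dot product described above can be sketched in NumPy (the signal and smoothing kernel here are made-up examples):

```python
import numpy as np

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0])
kernel = np.array([0.25, 0.5, 0.25])  # a simple smoothing kernel

# Slide the kernel along the signal: each dot product with an
# overlapping snippet produces one time point of the filtered signal
n = len(kernel)
filtered = np.array([np.dot(kernel, signal[i:i + n])
                     for i in range(len(signal) - n + 1)])

# Same result as NumPy's convolution in 'valid' mode
# (this kernel is symmetric, so flipping it makes no difference)
print(filtered)
```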
k-Means Clustering
• k-means clustering is an unsupervised method of classifying multivariate data into a relatively small number of groups, or categories, based on minimizing the distance to the group center.
• Finally, Step 5 is to put the previous steps into a loop that iterates
until a good solution is obtained
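A minimal sketch of the whole loop (assuming the standard k-means steps of initialize, assign, and update; the slides' earlier Steps 1-4 are not shown in this excerpt, so the step comments below are my own reading):

```python
import numpy as np

def kmeans(data, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as starting centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):  # Step 5: iterate the steps below
        # Assign: label each point with its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# Two made-up clusters of 2D points
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(data, k=2)
```

A fixed iteration count is used here for simplicity; a production version would stop when the centroids no longer move.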