
Vector Applications in Data Science
Correlation and Cosine Similarity
• Correlation is one of the most fundamental and important analysis
methods in statistics and machine learning.

• A correlation coefficient is a single number that quantifies the linear relationship between two variables.

• Correlation coefficients range from −1 to +1, with −1 indicating a perfect negative relationship, +1 a perfect positive relationship, and 0 indicating no linear relationship.
Correlation and Cosine Similarity
• A few examples of pairs of variables and their correlation coefficients.
Correlation and Cosine Similarity
• We mentioned that the dot product is involved in the correlation
coefficient, and that the magnitude of the dot product is related to
the magnitude of the numerical values in the data (remember the
discussion about using grams versus pounds for measuring weight).

• Therefore, the correlation coefficient requires some normalizations to be in the expected range of −1 to +1.
• Those two normalizations are:
Correlation and Cosine Similarity
1. Mean center each variable
• Mean centering means to subtract the average value from each data value.

2. Divide the dot product by the product of the vector norms
• This divisive normalization cancels the measurement units and scales the maximum possible correlation magnitude to |1|.
Correlation and Cosine Similarity
• What is a vector norm?
• It is the length of the vector.

• How do we find it in Python?

1. Using the L1 norm
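The slide's code is not reproduced here; a minimal NumPy sketch (the vector v is a made-up example) might look like this:

```python
import numpy as np

# made-up example vector
v = np.array([1, -2, 3])

# L1 norm: sum of the absolute values of the elements
l1 = np.linalg.norm(v, 1)        # equivalently: np.sum(np.abs(v))
print(l1)                        # 6.0
```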
Correlation and Cosine Similarity
• The L1 norm is calculated as the sum of the absolute vector values,
where the absolute value of a scalar uses the notation |a1|.

• In effect, the norm is a calculation of the Manhattan distance from the origin of the vector space.
Correlation and Cosine Similarity
2. Using the L2 norm
• The L2 norm is calculated as the square root of the sum of the
squared vector values.
• ||v||2 = sqrt(a1^2 + a2^2 + a3^2)

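A minimal NumPy sketch of the L2 norm, using the same made-up vector as before:

```python
import numpy as np

v = np.array([1, -2, 3])

# L2 norm: square root of the sum of squared values
l2 = np.linalg.norm(v)           # the default ord for vectors is the L2 norm
print(l2)                        # 3.7416... (sqrt(1 + 4 + 9) = sqrt(14))
```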
Correlation and Cosine Similarity
• Formula for Pearson correlation coefficient:
• ρ = (x̃ · ỹ) / (||x̃|| ||ỹ||), where x̃ and ỹ are the mean-centered variables

• So there you go: the famous and widely used Pearson correlation coefficient is simply the dot product between two variables, normalized by the magnitudes of the variables.
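A minimal NumPy sketch of this recipe (x and y are made-up example data); the result is compared against NumPy's built-in np.corrcoef:

```python
import numpy as np

# made-up example data
x = np.array([2., 4., 5., 9.])
y = np.array([1., 3., 6., 8.])

# step 1: mean center each variable
xm = x - x.mean()
ym = y - y.mean()

# step 2: dot product divided by the product of the vector norms
r = np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))

print(r)
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in result matches
```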
Correlation and Cosine Similarity
• Correlation is not the only way to assess similarity between two
variables.
• Another method is called cosine similarity.
• The formula for cosine similarity is:
• cos(θ) = α / (||x|| ||y||)

• where α is the dot product between x and y.


Correlation and Cosine Similarity
• Cosine similarity measures the similarity between two vectors of an inner product space.

• Specifically, it measures the similarity in the direction or orientation of the vectors, ignoring differences in their magnitude or scale.

• Smaller angles between vectors produce larger cosine values, indicating greater cosine similarity.
Correlation and Cosine Similarity
• Example:
• Suppose that our goal is to calculate the cosine similarity of the two
documents given below.
• Document 1 = 'the best data science course'
• Document 2 = 'data science is popular'

• Solution:
• First, we create a word table
Correlation and Cosine Similarity
• The word table (word counts for each document):

  Word:         the  best  data  science  course  is  popular
  Document 1:    1     1     1      1        1     0     0
  Document 2:    0     0     1      1        0     1     1

Correlation and Cosine Similarity
• Then, we calculate cosine similarity
• The dot product is:
• α = (1×0) + (1×0) + (1×1) + (1×1) + (1×0) + (0×1) + (0×1) = 2

• Then, calculate the magnitude of the vectors:
• ||Document 1|| = sqrt(1+1+1+1+1+0+0) = sqrt(5) ≈ 2.236
• ||Document 2|| = sqrt(0+0+1+1+0+1+1) = sqrt(4) = 2
Correlation and Cosine Similarity
• Lastly, find the cosine similarity:
• cos(θ) = 2 / (sqrt(5) × 2) ≈ 0.447
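A minimal NumPy sketch of this worked example, using the word-count vectors from the table above:

```python
import numpy as np

# word-count vectors: [the, best, data, science, course, is, popular]
d1 = np.array([1, 1, 1, 1, 1, 0, 0])   # 'the best data science course'
d2 = np.array([0, 0, 1, 1, 0, 1, 1])   # 'data science is popular'

alpha = np.dot(d1, d2)                                       # dot product = 2
cos_sim = alpha / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos_sim)                                               # ≈ 0.447
```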
Correlation and Cosine Similarity
• Example: Pearson correlation coefficient (class work)

• x = (1, 3, 5, 10)

• y = (2, 4, 6, 20)
Correlation and Cosine Similarity
• Correlation versus cosine similarity: which one is better?
• Pearson correlation and cosine similarity can give different results for
the same data because they start from different assumptions.
• Example: the variables [0, 1, 2, 3] and [100, 101, 102, 103] are perfectly correlated, with ρ = 1.
• The cosine similarity between those variables is .808: they are not in the same numerical scale and are therefore not perfectly related.
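A quick NumPy check of this example, computing both quantities from their dot-product definitions:

```python
import numpy as np

x = np.array([0., 1., 2., 3.])
y = np.array([100., 101., 102., 103.])

# Pearson correlation: mean center, then take the normalized dot product
xm, ym = x - x.mean(), y - y.mean()
corr = np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))

# cosine similarity: normalized dot product without mean centering
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(corr)     # 1.0
print(cos_sim)  # ≈ 0.808
```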
Correlation and Cosine Similarity
• Neither measure is incorrect nor better than the other; it is simply the case that different statistical methods make different assumptions about data, and those assumptions have implications for the results and for proper interpretation.
Time Series Filtering and Feature Detection
• The dot product is also used in time series filtering.
• Filtering is essentially a feature detection method.
• How?
• A template, called a kernel, is matched against portions of a time series signal.
• The result of filtering is another time series that indicates how much the characteristics of the signal match the characteristics of the kernel.
Time Series Filtering and Feature Detection
• The mechanism of filtering is to compute the dot product between the kernel and the time series signal.

• We compute the dot product between the kernel and a short snippet of the data of the same length as the kernel.
• This procedure produces one time point in the filtered signal.
• Then the kernel is moved one time step to the right to compute the dot product with a different (overlapping) signal segment.
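A minimal sketch of this sliding-dot-product mechanism (the signal and kernel below are made up for illustration; in practice a library routine such as np.convolve would typically be used):

```python
import numpy as np

# made-up example: a noisy signal containing a sharp bump, and a bump-shaped kernel
signal = np.random.randn(100)
signal[40:45] += np.array([1., 3., 5., 3., 1.])
kernel = np.array([1., 3., 5., 3., 1.])

# slide the kernel along the signal, computing one dot product per time point
n = len(signal) - len(kernel) + 1
filtered = np.zeros(n)
for t in range(n):
    filtered[t] = np.dot(kernel, signal[t:t + len(kernel)])

# 'filtered' peaks where the signal best matches the kernel (around t = 40 here)
```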
Time Series Filtering and Feature Detection
• Explain the difference! (comparison figure not reproduced)
k-Means Clustering
• k-means clustering is an unsupervised method of classifying multivariate data into a relatively small number of groups, or categories, based on minimizing distance to the group center.

• k-means clustering is an important analysis method in machine learning, and there are sophisticated variants of k-means clustering.
k-Means Clustering
• Algorithm:
1. Initialize k centroids as random points in the data space. Each centroid is a class, or category, and the next steps will assign each data observation to a class.
2. Compute the Euclidean distance between each data observation
and each centroid.
3. Assign each data observation to the group with the closest centroid.
4. Update each centroid as the average of all data observations
assigned to that centroid.
5. Repeat steps 2–4 until a convergence criterion is satisfied, or for N iterations.
k-Means Clustering
• We will test the algorithm using randomly generated 2D data to confirm that our code is correct.
• The data are contained in the variable data (this variable is 150 × 2, corresponding to 150 observations and 2 features).
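The slide's own code is not shown; a minimal sketch that generates comparable data (the three cluster centers below are made up) could be:

```python
import numpy as np

# 150 observations x 2 features, drawn around three made-up cluster centers
centers = np.array([[0., 0.], [4., 4.], [0., 5.]])
data = np.vstack([c + np.random.randn(50, 2) for c in centers])
print(data.shape)   # (150, 2)
```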
k-Means Clustering
• Let’s start with step 1: initialize k random cluster centroids.

• k is a parameter of k-means clustering; in real data, it is difficult to determine the optimal k, but here we will fix k = 3.

• Select k data samples at random to be the initial centroids.
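A minimal sketch of this step, continuing from the data variable above (np.random.choice is one common way to pick rows without replacement):

```python
import numpy as np

k = 3
# pick k distinct data observations (rows) at random to serve as initial centroids
ridx = np.random.choice(len(data), k, replace=False)
centroids = data[ridx, :]          # k x 2 matrix of starting centroids
```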


k-Means Clustering
• Now for step 2: compute the distance between each data
observation and each cluster centroid.
• Here is where we use linear algebra concepts you learned previously.
• For one data observation d and centroid c, the Euclidean distance is computed as:
• δ = sqrt((dx − cx)^2 + (dy − cy)^2), where x and y denote the two data features
k-Means Clustering
• You might think that this step needs to be implemented using a double for loop: one loop over k centroids and a second loop over N data observations.

• However, broadcasting lets us compute the distances from all observations to one centroid in a single expression, so only the loop over centroids is needed (see the sketch below).

• The sizes of these variables: data is 150 × 2 (observations by features) and centroids[ci,:] is 1 × 2 (cluster ci by features).
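A minimal sketch of the distance computation, continuing from the data, k, and centroids variables above; broadcasting removes the need for a loop over observations:

```python
import numpy as np

# distances: N observations x k centroids
dists = np.zeros((len(data), k))
for ci in range(k):
    # broadcasting subtracts the 1 x 2 centroid from every row of the 150 x 2 data,
    # so no explicit loop over observations is needed
    dists[:, ci] = np.sqrt(np.sum((data - centroids[ci, :])**2, axis=1))
```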
k-Means Clustering
• Step 3 is to assign each data observation to the group with minimum
distance. This step is quite compact in Python, and can be
implemented using one function:

• np.argmin returns the index at which the minimum occurs.
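The slide's one-liner is not reproduced; using the dists matrix from the previous sketch, it would plausibly be:

```python
# for each observation, the index (0..k-1) of the closest centroid
groupidx = np.argmin(dists, axis=1)
```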


k-Means Clustering
• Step 4 is to recompute the centroids as the mean of all data points
within the class. Here we can loop over the k clusters, and use Python
indexing to find all data points assigned to each cluster:

• Finally, Step 5 is to put the previous steps into a loop that iterates until a good solution is obtained (a sketch combining steps 2–5 is shown below).
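A minimal sketch tying steps 2–5 together, continuing from the data, k, and centroids variables above (a fixed iteration count stands in for a convergence criterion; a robust implementation would also handle clusters that end up empty):

```python
import numpy as np

for _ in range(50):
    # step 2: distance from every observation to every centroid
    dists = np.zeros((len(data), k))
    for ci in range(k):
        dists[:, ci] = np.sqrt(np.sum((data - centroids[ci, :])**2, axis=1))

    # step 3: assign each observation to the closest centroid
    groupidx = np.argmin(dists, axis=1)

    # step 4: recompute each centroid as the mean of its assigned observations
    for ci in range(k):
        centroids[ci, :] = np.mean(data[groupidx == ci, :], axis=0)
```

After the loop finishes, groupidx holds the cluster assignment of every observation and centroids holds the final cluster centers.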