
Vector Applications in Data Science
Correlation and Cosine Similarity
• Correlation is one of the most fundamental and important analysis
methods in statistics and machine learning.

• A correlation coefficient is a single number that quantifies the linear relationship between two variables.

• Correlation coefficients range from −1 to +1, with −1 indicating a perfect negative relationship, +1 a perfect positive relationship, and 0 indicating no linear relationship.
Correlation and Cosine Similarity
• A few examples of pairs of variables and their correlation coefficients.
Correlation and Cosine Similarity
• We mentioned that the dot product is involved in the correlation
coefficient, and that the magnitude of the dot product is related to
the magnitude of the numerical values in the data (remember the
discussion about using grams versus pounds for measuring weight).

• Therefore, the correlation coefficient requires some normalizations to be in the expected range of −1 to +1.
• Those two normalizations are:
Correlation and Cosine Similarity
1. Mean center each variable
• Mean centering means to subtract the average value from each data value.

2. Divide the dot product by the product of the vector norms
• This divisive normalization cancels the measurement units and scales the maximum possible correlation magnitude to |1|.
Correlation and Cosine Similarity
• What is a vector norm?
• It is the length of the vector.

• How do we find it in Python?

1. Using the L1 norm
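The slide's code is not reproduced here; a minimal NumPy sketch (the vector v is a made-up example) might look like this:

```python
import numpy as np

# made-up example vector
v = np.array([1, -2, 3])

# L1 norm: sum of the absolute values of the elements
l1 = np.linalg.norm(v, 1)        # equivalently: np.sum(np.abs(v))
print(l1)                        # 6.0
```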
Correlation and Cosine Similarity
• The L1 norm is calculated as the sum of the absolute vector values,
where the absolute value of a scalar uses the notation |a1|.

• In effect, the norm is a calculation of the Manhattan distance from the origin of the vector space.
Correlation and Cosine Similarity
2. Using the L2 norm
• The L2 norm is calculated as the square root of the sum of the
squared vector values.
• ||v||2 = sqrt(a1^2 + a2^2 + a3^2)

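A minimal NumPy sketch of the L2 norm, using the same made-up vector as before:

```python
import numpy as np

v = np.array([1, -2, 3])

# L2 norm: square root of the sum of squared values
l2 = np.linalg.norm(v)           # the default ord for vectors is the L2 norm
print(l2)                        # 3.7416... (sqrt(1 + 4 + 9) = sqrt(14))
```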
Correlation and Cosine Similarity
• Formula for Pearson correlation coefficient:
• ρ = (x̃ · ỹ) / (||x̃|| ||ỹ||), where x̃ and ỹ are the mean-centered variables

• So there you go: the famous and widely used Pearson correlation coefficient is simply the dot product between two variables, normalized by the magnitudes of the variables.
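A minimal NumPy sketch of this recipe (x and y are made-up example data); the result is compared against NumPy's built-in np.corrcoef:

```python
import numpy as np

# made-up example data
x = np.array([2., 4., 5., 9.])
y = np.array([1., 3., 6., 8.])

# step 1: mean center each variable
xm = x - x.mean()
ym = y - y.mean()

# step 2: dot product divided by the product of the vector norms
r = np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))

print(r)
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in result matches
```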
Correlation and Cosine Similarity
• Correlation is not the only way to assess similarity between two
variables.
• Another method is called cosine similarity.
• The formula for cosine similarity is:
• cos(θ) = α / (||x|| ||y||)

• where α is the dot product between x and y.


Correlation and Cosine Similarity
• Cosine similarity measures the similarity between two vectors of an inner product space.

• Specifically, it measures the similarity in the direction or orientation of the vectors, ignoring differences in their magnitude or scale.

• Smaller angles between vectors produce larger cosine values, indicating greater cosine similarity.
Correlation and Cosine Similarity
• Example:
• Suppose that our goal is to calculate the cosine similarity of the two
documents given below.
• Document 1 = 'the best data science course'
• Document 2 = 'data science is popular'

• Solution:
• First, we create a word table
Correlation and Cosine Similarity
• The word table (word counts for each document):

  Word:         the  best  data  science  course  is  popular
  Document 1:    1     1     1      1        1     0     0
  Document 2:    0     0     1      1        0     1     1

Correlation and Cosine Similarity
• Then, we calculate cosine similarity
• The dot product is:
• α = (1×0) + (1×0) + (1×1) + (1×1) + (1×0) + (0×1) + (0×1) = 2

• Then, calculate the magnitude of the vectors:
• ||Document 1|| = sqrt(1+1+1+1+1+0+0) = sqrt(5) ≈ 2.236
• ||Document 2|| = sqrt(0+0+1+1+0+1+1) = sqrt(4) = 2
Correlation and Cosine Similarity
• Lastly, find the cosine similarity:
• cos(θ) = 2 / (sqrt(5) × 2) ≈ 0.447
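A minimal NumPy sketch of this worked example, using the word-count vectors from the table above:

```python
import numpy as np

# word-count vectors: [the, best, data, science, course, is, popular]
d1 = np.array([1, 1, 1, 1, 1, 0, 0])   # 'the best data science course'
d2 = np.array([0, 0, 1, 1, 0, 1, 1])   # 'data science is popular'

alpha = np.dot(d1, d2)                                       # dot product = 2
cos_sim = alpha / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos_sim)                                               # ≈ 0.447
```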
Correlation and Cosine Similarity
• Example: Pearson correlation coefficient (class work)

• x = (1, 3, 5, 10)

• y = (2, 4, 6, 20)
Correlation and Cosine Similarity
• Correlation versus cosine similarity: which one is better?
• Pearson correlation and cosine similarity can give different results for
the same data because they start from different assumptions.
• Example: the variables [0, 1, 2, 3] and [100, 101, 102, 103] are perfectly correlated, with ρ = 1.
• The cosine similarity between those variables is .808: they are not in the same numerical scale and are therefore not perfectly related.
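A quick NumPy check of this example, computing both quantities from their dot-product definitions:

```python
import numpy as np

x = np.array([0., 1., 2., 3.])
y = np.array([100., 101., 102., 103.])

# Pearson correlation: mean center, then take the normalized dot product
xm, ym = x - x.mean(), y - y.mean()
corr = np.dot(xm, ym) / (np.linalg.norm(xm) * np.linalg.norm(ym))

# cosine similarity: normalized dot product without mean centering
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(corr)     # 1.0
print(cos_sim)  # ≈ 0.808
```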
Correlation and Cosine Similarity
• Neither measure is incorrect nor better than the other; it is simply the case that different statistical methods make different assumptions about data, and those assumptions have implications for the results and for proper interpretation.
Time Series Filtering and Feature Detection
• The dot product is also used in time series filtering.
• Filtering is essentially a feature detection method.
• How?
• A template, called a kernel, is matched against portions of a time series signal.
• The result of filtering is another time series that indicates how much the characteristics of the signal match the characteristics of the kernel.
Time Series Filtering and Feature Detection
• The mechanism of filtering is to compute the dot product between the kernel and the time series signal.

• We compute the dot product between the kernel and a short snippet of the data of the same length as the kernel.
• This procedure produces one time point in the filtered signal.
• Then the kernel is moved one time step to the right to compute the dot product with a different (overlapping) signal segment.
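A minimal sketch of this sliding-dot-product mechanism (the signal and kernel below are made up for illustration; in practice a library routine such as np.convolve would typically be used):

```python
import numpy as np

# made-up example: a noisy signal containing a sharp bump, and a bump-shaped kernel
signal = np.random.randn(100)
signal[40:45] += np.array([1., 3., 5., 3., 1.])
kernel = np.array([1., 3., 5., 3., 1.])

# slide the kernel along the signal, computing one dot product per time point
n = len(signal) - len(kernel) + 1
filtered = np.zeros(n)
for t in range(n):
    filtered[t] = np.dot(kernel, signal[t:t + len(kernel)])

# 'filtered' peaks where the signal best matches the kernel (around t = 40 here)
```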
Time Series Filtering and Feature Detection
• Explain the difference! (comparison figure not reproduced)
k-Means Clustering
• k-means clustering is an unsupervised method of classifying multivariate data into a relatively small number of groups, or categories, based on minimizing distance to the group center.

• k-means clustering is an important analysis method in machine learning, and there are sophisticated variants of k-means clustering.
k-Means Clustering
• Algorithm:
1. Initialize k centroids as random points in the data space. Each centroid is a class, or category, and the next steps will assign each data observation to a class.
2. Compute the Euclidean distance between each data observation
and each centroid.
3. Assign each data observation to the group with the closest centroid.
4. Update each centroid as the average of all data observations
assigned to that centroid.
5. Repeat steps 2–4 until a convergence criterion is satisfied, or for N iterations.
k-Means Clustering
• We will test the algorithm using randomly generated 2D data to confirm that our code is correct.
• The data are contained in the variable data (this variable is 150 × 2, corresponding to 150 observations and 2 features).
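The slide's own code is not shown; a minimal sketch that generates comparable data (the three cluster centers below are made up) could be:

```python
import numpy as np

# 150 observations x 2 features, drawn around three made-up cluster centers
centers = np.array([[0., 0.], [4., 4.], [0., 5.]])
data = np.vstack([c + np.random.randn(50, 2) for c in centers])
print(data.shape)   # (150, 2)
```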
k-Means Clustering
• Let’s start with step 1: initialize k random cluster centroids.

• k is a parameter of k-means clustering; in real data, it is difficult to determine the optimal k, but here we will fix k = 3.

• Select k data samples at random to be the initial centroids.
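A minimal sketch of this step, continuing from the data variable above (np.random.choice is one common way to pick rows without replacement):

```python
import numpy as np

k = 3
# pick k distinct data observations (rows) at random to serve as initial centroids
ridx = np.random.choice(len(data), k, replace=False)
centroids = data[ridx, :]          # k x 2 matrix of starting centroids
```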


k-Means Clustering
• Now for step 2: compute the distance between each data
observation and each cluster centroid.
• Here is where we use linear algebra concepts you learned previously.
• For one data observation d and centroid c, the Euclidean distance is computed as:
• δ = sqrt((dx − cx)^2 + (dy − cy)^2), where x and y denote the two data features
k-Means Clustering
• You might think that this step needs to be implemented using a double for loop: one loop over k centroids and a second loop over N data observations.

• However, broadcasting lets us compute the distances from all observations to one centroid in a single expression, so only the loop over centroids is needed (see the sketch below).

• The sizes of these variables: data is 150 × 2 (observations by features) and centroids[ci,:] is 1 × 2 (cluster ci by features).
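A minimal sketch of the distance computation, continuing from the data, k, and centroids variables above; broadcasting removes the need for a loop over observations:

```python
import numpy as np

# distances: N observations x k centroids
dists = np.zeros((len(data), k))
for ci in range(k):
    # broadcasting subtracts the 1 x 2 centroid from every row of the 150 x 2 data,
    # so no explicit loop over observations is needed
    dists[:, ci] = np.sqrt(np.sum((data - centroids[ci, :])**2, axis=1))
```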
k-Means Clustering
• Step 3 is to assign each data observation to the group with minimum
distance. This step is quite compact in Python, and can be
implemented using one function:

• np.argmin returns the index at which the minimum occurs.
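The slide's one-liner is not reproduced; using the dists matrix from the previous sketch, it would plausibly be:

```python
# for each observation, the index (0..k-1) of the closest centroid
groupidx = np.argmin(dists, axis=1)
```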


k-Means Clustering
• Step 4 is to recompute the centroids as the mean of all data points
within the class. Here we can loop over the k clusters, and use Python
indexing to find all data points assigned to each cluster:

• Finally, Step 5 is to put the previous steps into a loop that iterates until a good solution is obtained (a sketch combining steps 2–5 is shown below).
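A minimal sketch tying steps 2–5 together, continuing from the data, k, and centroids variables above (a fixed iteration count stands in for a convergence criterion; a robust implementation would also handle clusters that end up empty):

```python
import numpy as np

for _ in range(50):
    # step 2: distance from every observation to every centroid
    dists = np.zeros((len(data), k))
    for ci in range(k):
        dists[:, ci] = np.sqrt(np.sum((data - centroids[ci, :])**2, axis=1))

    # step 3: assign each observation to the closest centroid
    groupidx = np.argmin(dists, axis=1)

    # step 4: recompute each centroid as the mean of its assigned observations
    for ci in range(k):
        centroids[ci, :] = np.mean(data[groupidx == ci, :], axis=0)
```

After the loop finishes, groupidx holds the cluster assignment of every observation and centroids holds the final cluster centers.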