Cosine Similarity in Machine Learning
MACHINE LEARNING
by Natnael Teklemariam
WHAT IS COSINE SIMILARITY?
Cosine similarity measures how similar two things are by calculating the angle between their vector
representations, ignoring their size and focusing only on their direction.
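In symbols: similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖), which ranges from -1 (opposite directions) through 0 (unrelated, perpendicular) to 1 (same direction).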
Real-World Examples:
1. Daily.dev – Recommends articles based on what you read (not just keywords, but meaning!)
2. Spotify/Netflix – Suggests songs/movies similar to your taste
3. RAG Chatbots – Retrieves the most relevant info before generating an answer
Problem:
* Machines don’t "understand" text like humans.
* We need a way to measure semantic similarity, not just exact word matches.
Solution: Cosine Similarity – A simple yet powerful math trick to compare meanings!
Why Cosine Similarity? (The "Before & After" Story)
Analogy: "Think of it like comparing two people's music tastes. It's not about how many songs they've listened to (Euclidean distance), but how alike their preferences are (cosine similarity)."
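Here is a tiny NumPy sketch of that contrast; the genre play counts are made up purely for illustration:

import numpy as np

# Two listeners with the same taste profile, but one has listened to ten times more songs.
light_listener = np.array([2.0, 1.0, 0.0])     # play counts per genre
heavy_listener = np.array([20.0, 10.0, 0.0])

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|): direction only, overall size cancels out
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.linalg.norm(light_listener - heavy_listener))    # Euclidean distance: large (~20.1)
print(cosine_similarity(light_listener, heavy_listener))  # cosine similarity: 1.0 (identical taste)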
VECTORS RECAP:
1. What’s a Vector?
Definition:
A list of numbers that represents data in multi-dimensional space.
Analogy: Like GPS coordinates, but for words, animals, images, videos, and more.
e.g., "Cat" = [0.7, -0.2, 0.4, ...]
(Note: Real embeddings have 100s of dimensions, but we’ll visualize in 2D for clarity.)
"Use vector embeddings and cosine similarity to match user questions with answers."
Architecture:
Key Properties:
1. Native Vector Support: Handles high-dimensional data (e.g., 768D embeddings).
2. Similarity Search: Finds closest vectors via cosine/L2 distance.
3. Hybrid Storage: Can also store metadata (e.g., text, timestamps).
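A minimal sketch of that idea: store each answer's embedding alongside its text (hybrid storage) and pick the answer closest to the question by cosine similarity. The embed() function below is a hypothetical stand-in that returns random 768-dim vectors just so the example runs end to end; in practice you would call a real embedding model and a real vector database.

import numpy as np

def embed(text):
    # Hypothetical stand-in for a real embedding model (e.g., one that returns a 768-dim vector).
    # Random vectors are used here only so the example is self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=768)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hybrid storage: each entry keeps the vector plus metadata (here, the original answer text).
answers = [
    "Cosine similarity compares the angle between two vectors.",
    "Euclidean distance measures how far apart two points are.",
]
store = [{"text": a, "vector": embed(a)} for a in answers]

# Similarity search: return the stored answer whose vector is closest to the question's vector.
question_vector = embed("How does cosine similarity work?")
best = max(store, key=lambda entry: cosine_similarity(question_vector, entry["vector"]))
print(best["text"])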
DRAWBACKS OF COSINE SIMILARITY
1. Magnitude Ignorance
Example: Short text ("cat") vs. long text ("a large domesticated feline") may have identical direction but different magnitudes.
Fix: Normalize vectors or combine with Euclidean distance.
2. Computational Cost in High Dimensions
Fix: Use dimensionality reduction (PCA, UMAP) or switch to the inner product for normalized embeddings.
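A quick sketch of the normalization fix: once vectors are scaled to unit length, the plain inner product gives the same number as cosine similarity (and skips the extra norm computations). The example vectors are made up for illustration.

import numpy as np

def normalize(v):
    # Scale to unit length so magnitude differences disappear.
    return v / np.linalg.norm(v)

a = np.array([0.7, -0.2, 0.4])
b = np.array([7.0, -2.0, 4.0])   # same direction as a, ten times the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(normalize(a), normalize(b))   # inner product of the normalized vectors

print(cosine, inner)   # both 1.0: for unit-length vectors, inner product equals cosine similarity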
FINAL THOUGHT