Cosine Similarity in Machine Learning
MACHINE LEARNING
by Natnael Teklemariam
WHAT IS COSINE SIMILARITY?
Cosine similarity measures how similar two things are by calculating the angle between their vector
representations, ignoring their size and focusing only on their direction.
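In symbols: similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖), which ranges from -1 (opposite directions) through 0 (unrelated, perpendicular) to 1 (same direction).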
Real-World Examples:
1. Daily.dev – Recommends articles based on what you read (not just keywords, but meaning!)
2. Spotify/Netflix – Suggests songs/movies similar to your taste
3. RAG Chatbots – Retrieves the most relevant info before generating an answer
Problem:
* Machines don’t "understand" text like humans.
* We need a way to measure semantic similarity, not just exact word matches.
Solution: Cosine Similarity – A simple yet powerful math trick to compare meanings!
Why Cosine Similarity? (The "Before & After" Story)
Analogy: "Think of it like comparing two people's music tastes. It's not about how many songs they've listened to (Euclidean distance), but how alike their preferences are (cosine similarity)."
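Here is a tiny NumPy sketch of that contrast; the genre play counts are made up purely for illustration:

import numpy as np

# Two listeners with the same taste profile, but one has listened to ten times more songs.
light_listener = np.array([2.0, 1.0, 0.0])     # play counts per genre
heavy_listener = np.array([20.0, 10.0, 0.0])

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|): direction only, overall size cancels out
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.linalg.norm(light_listener - heavy_listener))    # Euclidean distance: large (~20.1)
print(cosine_similarity(light_listener, heavy_listener))  # cosine similarity: 1.0 (identical taste)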
VECTORS RECAP:
1. What’s a Vector?
Definition:
A list of numbers that represents data in multi-dimensional space.
Analogy: Like GPS coordinates, but for words, animals, images, videos, and more.
e.g., "Cat" = [0.7, -0.2, 0.4, ...]
(Note: Real embeddings have 100s of dimensions, but we’ll visualize in 2D for clarity.)
"Use vector embeddings and cosine similarity to match user questions with answers."
Architecture:
Key Properties:
1. Native Vector Support: Handles high-dimensional data (e.g., 768D embeddings).
2. Similarity Search: Finds closest vectors via cosine/L2 distance.
3. Hybrid Storage: Can also store metadata (e.g., text, timestamps).
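A minimal sketch of that idea: store each answer's embedding alongside its text (hybrid storage) and pick the answer closest to the question by cosine similarity. The embed() function below is a hypothetical stand-in that returns random 768-dim vectors just so the example runs end to end; in practice you would call a real embedding model and a real vector database.

import numpy as np

def embed(text):
    # Hypothetical stand-in for a real embedding model (e.g., one that returns a 768-dim vector).
    # Random vectors are used here only so the example is self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=768)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hybrid storage: each entry keeps the vector plus metadata (here, the original answer text).
answers = [
    "Cosine similarity compares the angle between two vectors.",
    "Euclidean distance measures how far apart two points are.",
]
store = [{"text": a, "vector": embed(a)} for a in answers]

# Similarity search: return the stored answer whose vector is closest to the question's vector.
question_vector = embed("How does cosine similarity work?")
best = max(store, key=lambda entry: cosine_similarity(question_vector, entry["vector"]))
print(best["text"])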
DRAWBACKS OF COSINE SIMILARITY
1. Magnitude Ignorance
Example: Short text ("cat") vs. long text ("a large domesticated feline") may have identical direction but different magnitudes.
Fix: Normalize vectors or combine with Euclidean distance.
2. Computational Cost in High Dimensions
Fix: Use dimensionality reduction (PCA, UMAP) or switch to the inner product for normalized embeddings.
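A quick sketch of the normalization fix: once vectors are scaled to unit length, the plain inner product gives the same number as cosine similarity (and skips the extra norm computations). The example vectors are made up for illustration.

import numpy as np

def normalize(v):
    # Scale to unit length so magnitude differences disappear.
    return v / np.linalg.norm(v)

a = np.array([0.7, -0.2, 0.4])
b = np.array([7.0, -2.0, 4.0])   # same direction as a, ten times the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(normalize(a), normalize(b))   # inner product of the normalized vectors

print(cosine, inner)   # both 1.0: for unit-length vectors, inner product equals cosine similarity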
FINAL THOUGHT