CS 3308 Learning Journal 4

The document discusses the importance of document similarity in information retrieval, specifically using cosine similarity to recommend similar documents. It outlines the process of document vectorization through text preprocessing and TF-IDF representation, followed by the calculation of cosine similarity. The methodology is applied to example documents to illustrate how to recommend the most relevant content based on similarity scores.

Uploaded by

djromodeste

CS 3308-01 - AY2025-T3 Learning Journal Unit 4

Introduction

In information retrieval, measuring document similarity is central to tasks such as ranking search
results and recommending content. One effective measure is cosine similarity, which computes the
cosine of the angle between two document vectors in a high-dimensional term space. This journal
uses it to recommend the documents in a given corpus that are most similar to a user's preferred
document.

Document Vectorization

The first step in this process involves converting the provided documents into numerical
representations. This can be done through the following stages:

1. Text Preprocessing:

• Tokenize each document into individual words.

• Remove stop words, such as "is," to focus on meaningful terms that contribute to the document's
content.

2. Vector Representation:

• Convert each document into a vector representation using the term frequency-inverse document
frequency (TF-IDF) method. This approach assigns weights to terms based on their frequency in
a document and their rarity across the entire corpus.
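The preprocessing stage above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `preprocess` and the small `STOP_WORDS` set are assumptions made for this example, and a real system would use a much larger stop-word list.

```python
import re

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"is", "a", "an", "the", "of", "and"}

def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Earth is round."))  # ['earth', 'round']
```

Lowercasing before tokenizing means "Earth" and "earth" are counted as the same term.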

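The TF-IDF vector $v_d$ for a document $d$ is computed as follows:

$$v_d = \left( w_{t_1,d},\ w_{t_2,d},\ \ldots,\ w_{t_n,d} \right)$$

where

$$w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$

Here, tf(t, d) represents the term frequency of term t in document d, idf(t) is the inverse
document frequency of term t across the corpus, and t_1, ..., t_n are the distinct terms in the
corpus vocabulary.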
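As a rough sketch of this weighting, the code below builds one sparse TF-IDF vector per document. It assumes raw counts for tf(t, d) and the common definition idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The journal does not state which idf variant it uses, so that choice, like the function name `tfidf_vectors`, is an assumption.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    n_docs = len(tokenized_docs)
    # df(t): number of documents that contain term t at least once.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Assumed idf variant: natural log of N / df(t).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors

# Already-preprocessed token lists (stop words removed).
docs = [["earth", "round"], ["moon", "round"], ["day", "nice"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Terms that occur in fewer documents receive a larger idf, so "earth" (one document) outweighs "round" (two documents) in the first vector.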

Cosine Similarity Calculation

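Once each document has a vector representation, we can calculate the cosine similarity between
the document vectors. The cosine similarity between documents d1 and d2 is given by the formula:

$$\cos(d_1, d_2) = \frac{v_{d_1} \cdot v_{d_2}}{\lVert v_{d_1} \rVert \, \lVert v_{d_2} \rVert}$$

Where:

• $v_{d_1} \cdot v_{d_2}$ is the dot product of the vectors $v_{d_1}$ and $v_{d_2}$

• $\lVert v_{d_1} \rVert$ and $\lVert v_{d_2} \rVert$ are the Euclidean norms (lengths) of the
vectors $v_{d_1}$ and $v_{d_2}$, respectively.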
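The formula translates directly into code. The sketch below stores each vector as a dictionary from term to weight, which is one possible representation chosen for this example. The function name `cosine_similarity` and the decision to return 0.0 for an empty vector are assumptions, not part of the journal.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

a = {"x": 1.0, "y": 1.0}
b = {"x": 1.0}
print(round(cosine_similarity(a, b), 4))  # 0.7071
```

Because the result is normalized by both vector lengths, scaling either vector by a positive constant leaves the similarity unchanged.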

Recommendation Process

Now, let's apply this methodology to the provided documents:

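1. Document Representation:

• Document 1: "Earth is round."

• Document 2: "Moon is round."

• Document 3: "Day is nice."

2. Vectorization:

• After removing stop words ("is"), the remaining terms are "earth" and "round" for Document 1,
"moon" and "round" for Document 2, and "day" and "nice" for Document 3. We represent each
document as a TF-IDF vector over these terms.

3. Cosine Similarity Calculation:

• Compute the cosine similarity between Document 1 and each of Documents 2 and 3.

4. Recommendation:

• Recommend the document with the highest cosine similarity to Document 1. Documents 1 and 2
share the term "round", while Document 3 shares no terms with Document 1, so its similarity is 0
and Document 2 is the recommendation.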
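The numbers behind this recommendation can be worked out by hand. The derivation below is a sketch under two assumptions that the journal does not state: tf is the raw count, and idf(t) = ln(N / df(t)) with N = 3 documents. The vector components are ordered as (earth, round, moon, day, nice).

```latex
\begin{align*}
\mathrm{idf}(\text{round}) &= \ln\tfrac{3}{2} \approx 0.4055, \qquad
\mathrm{idf}(\text{earth}) = \mathrm{idf}(\text{moon}) = \mathrm{idf}(\text{day})
  = \mathrm{idf}(\text{nice}) = \ln 3 \approx 1.0986 \\[4pt]
v_{d_1} &= (\ln 3,\ \ln\tfrac{3}{2},\ 0,\ 0,\ 0) \\
v_{d_2} &= (0,\ \ln\tfrac{3}{2},\ \ln 3,\ 0,\ 0) \\
v_{d_3} &= (0,\ 0,\ 0,\ \ln 3,\ \ln 3) \\[4pt]
\cos(d_1, d_2) &= \frac{(\ln\tfrac{3}{2})^2}{(\ln 3)^2 + (\ln\tfrac{3}{2})^2}
  \approx \frac{0.1644}{1.3714} \approx 0.12 \\[4pt]
\cos(d_1, d_3) &= \frac{0}{\lVert v_{d_1} \rVert \, \lVert v_{d_3} \rVert} = 0
\end{align*}
```

Since 0.12 > 0, Document 2 ranks above Document 3. The base of the logarithm cancels in the ratio, so using a base-10 logarithm instead would give the same similarity values.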

Conclusion

In conclusion, recommending similar documents involves transforming text into numerical vectors,
calculating cosine similarity between these vectors, and using the results to identify the most
relevant documents. This approach, grounded in information retrieval principles, enhances
document recommendation systems and improves user experience in search engines.

References

Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval
(Online ed.). Cambridge, MA: Cambridge University Press. Available at
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
