CS 3308 Learning Journal 4

The document discusses the importance of document similarity in information retrieval, specifically using cosine similarity to recommend similar documents. It outlines the process of document vectorization through text preprocessing and TF-IDF representation, followed by the calculation of cosine similarity. The methodology is applied to example documents to illustrate how to recommend the most relevant content based on similarity scores.

Uploaded by djromodeste
© All Rights Reserved
CS 3308-01 - AY2025-T3 Learning Journal Unit 4

Introduction

In information retrieval, measuring document similarity is central to tasks such as ranking search
results and recommending content. One effective measure is cosine similarity, which is the cosine
of the angle between two document vectors in a high-dimensional term space: the smaller the angle,
the closer the score is to 1. This journal uses the technique to recommend documents that are
similar to a user's preferred document within a given corpus.

Document Vectorization

The first step in this process involves converting the provided documents into numerical
representations. This can be done through the following stages:

1. Text Preprocessing:

• Tokenize each document into individual words.

• Remove stop words, such as "is," to focus on meaningful terms that contribute to the document's
content.

2. Vector Representation:

• Convert each document into a vector representation using the term frequency-inverse document
frequency (TF-IDF) method. This approach assigns weights to terms based on their frequency in
a document and their rarity across the entire corpus.
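The text preprocessing stage described above can be sketched in a few lines of Python. This is a minimal illustration, not a full pipeline: the function name `preprocess`, the lowercasing step, and the regular-expression tokenizer are assumptions, and the stop-word list contains only "is" because that is the only stop word the journal names.

```python
import re

# Assumed stop-word list for illustration; the journal only names "is".
STOP_WORDS = {"is"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("Earth is round."))  # ['earth', 'round']
```

A real system would normally use a much longer stop-word list and may also apply stemming or lemmatization.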

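The TF-IDF vector ($v_d$) for a document ($d$) is computed as follows:

$$v_d = \left(w_{t_1,d},\; w_{t_2,d},\; \dots,\; w_{t_n,d}\right)$$

where $t_1, \dots, t_n$ are the terms of the corpus vocabulary and

$$w_{t,d} = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$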

Here, tf(t, d) represents the term frequency of term t in document d, and idf(t) is the inverse
document frequency of term t across the corpus.
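As a rough sketch of how these weights might be computed, the snippet below builds TF-IDF vectors from already tokenized documents. It assumes raw counts for tf(t, d) and idf(t) = log10(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Other weighting variants exist, and the journal does not say which one it uses. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors), with one TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # df(t): number of documents that contain term t at least once
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Assumed idf variant: log10(N / df(t))
    idf = {t: math.log10(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


docs = [["earth", "round"], ["moon", "round"], ["day", "nice"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)       # ['day', 'earth', 'moon', 'nice', 'round']
print(vectors[0])  # non-zero only at the positions of 'earth' and 'round'
```

Under this weighting, a term that occurs in fewer documents (such as "earth") receives a larger weight than one that occurs in more documents (such as "round").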
Cosine Similarity Calculation

Once we have the vectorized representation of each document, we can calculate the cosine
similarity between the document vectors. The cosine similarity between documents d1 and d2 is
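given by the formula:

$$\cos(d_1, d_2) = \frac{v_{d_1} \cdot v_{d_2}}{\lVert v_{d_1} \rVert \, \lVert v_{d_2} \rVert}$$

Where:

• $v_{d_1} \cdot v_{d_2}$ is the dot product of the vectors $v_{d_1}$ and $v_{d_2}$, and

• $\lVert v_{d_1} \rVert$ and $\lVert v_{d_2} \rVert$ are the Euclidean norms (lengths) of the vectors $v_{d_1}$ and $v_{d_2}$, respectively.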
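The calculation can be sketched directly from this definition: a dot product divided by the product of the two Euclidean norms. The function name `cosine_similarity` and the choice to return 0.0 when either vector has zero length are assumptions made for this illustration, since the ratio is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 0], [1, 0]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal, no shared terms)
```

Because TF-IDF weights are non-negative, scores for document vectors fall between 0 and 1.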

Recommendation Process

Now, let's apply this methodology to the provided documents:

1. Document Representation:

• Document 1: "Earth is round."

• Document 2: "Moon is round."

• Document 3: "Day is nice."

2. Vectorization:

• After removing stop words ("is"), we represent each document as a TF-IDF vector.

3. Cosine Similarity Calculation:

• Compute the cosine similarity between Document 1 and each of Documents 2 and 3.

4. Recommendation:

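• Recommend the document with the highest cosine similarity to Document 1. Document 2 shares the
term "round" with Document 1, so its similarity is greater than zero. Document 3 shares no terms
with Document 1, so its similarity is exactly zero. Document 2 is therefore the recommendation.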
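The four steps above can be sketched end to end on the three example documents. This is an illustrative sketch under stated assumptions rather than the journal's own implementation: the helper names are hypothetical, the stop-word list contains only "is", tf is a raw count, and idf(t) = log10(N / df(t)).

```python
import math
import re
from collections import Counter

STOP_WORDS = {"is"}  # assumed list; the journal only names "is"


def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(tokenized_docs):
    """One TF-IDF vector per document (raw tf, idf = log10(N / df))."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log10(n_docs / df[t]) for t in vocab}
    return [[Counter(doc)[t] * idf[t] for t in vocab] for doc in tokenized_docs]


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


documents = {
    "Document 1": "Earth is round.",
    "Document 2": "Moon is round.",
    "Document 3": "Day is nice.",
}
names = list(documents)
vectors = tfidf_vectors([preprocess(documents[name]) for name in names])

# Score every other document against Document 1, the user's preferred document.
query = vectors[0]
scores = {
    name: cosine_similarity(query, vec)
    for name, vec in zip(names[1:], vectors[1:])
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Recommended:", best)
```

Under these assumptions, Document 2 scores roughly 0.12 and Document 3 scores 0, so Document 2 is recommended. The exact value for Document 2 depends on the TF-IDF variant chosen, but the ranking does not, because Document 3 has no terms in common with Document 1.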

Conclusion

In conclusion, recommending similar documents involves transforming text into numerical vectors,
calculating cosine similarity between these vectors, and using the results to identify the most
relevant documents. This approach, grounded in information retrieval principles, enhances
document recommendation systems and improves user experience in search engines.
References

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval
(Online ed.). Cambridge, England: Cambridge University Press. Available at
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
