CS 3308 Learning Journal Unit 4

The document outlines a method for applying cosine similarity to recommend documents based on user preferences, specifically focusing on three documents in a corpus. It details the preprocessing steps, formation of binary term vectors, and the calculation of cosine similarity, ultimately recommending Document 2 due to its similarity with Document 1. The methodology is scalable and can enhance the relevance of search results in larger corpora.


Applying Cosine Similarity for Document Recommendation

The primary objective of this exercise is to determine the cosine similarity between documents within a corpus and to recommend the document most pertinent to the user based on their preferences. The corpus comprises three distinct documents, and the user frequently consults Document 1. This investigation describes the methodology used to compute the cosine similarity and proposes an analogous document from the corpus.

Pre-processing Steps

To compute the similarity, the documents underwent a series of pre-processing steps that are fundamental to achieving consistent and reliable results:

1. Lowercasing: All textual content was converted to lowercase to standardize the input data, ensuring that the analysis is not skewed by variations in capitalization.

2. Tokenization: Each document was split into discrete tokens, the individual words that constitute the text, which enables analysis of the text on a word level.

3. Stop-word Elimination: Common terms, such as "is" in Document 1, that do not significantly contribute to the semantic content of the documents were removed so that they would not influence the similarity calculations.

4. Stemming/Lemmatization: Although unnecessary in this instance because of the simplicity of the words involved, this step typically reduces words to their base or root forms so that different inflections of a word are not treated as distinct terms.
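The steps above can be sketched in a few lines of Python. The exact wording of the corpus documents is not reproduced in this journal, so the sample sentence and the small stop-word list below are illustrative assumptions rather than the actual corpus:

```python
# Illustrative pre-processing: lowercase, tokenize, remove stop words.
# STOP_WORDS is a small hand-picked set for this sketch, not a standard list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "it"}

def preprocess(text):
    tokens = text.lower().split()                 # lowercasing + tokenization
    tokens = [t.strip(".,!?;:") for t in tokens]  # strip surrounding punctuation
    return [t for t in tokens if t and t not in STOP_WORDS]  # stop-word removal

print(preprocess("The Earth is round."))  # -> ['earth', 'round']
```

Stemming is omitted here, matching step 4 above.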


Formation of Vocabulary and Binary Term Vectors

The vocabulary was constructed by identifying the unique terms present across the entire corpus. The vocabulary for the given documents is as follows:

[earth, moon, round, day, nice]

Using this vocabulary, each document was represented as a binary vector, with a '1' indicating the presence of a term and a '0' signifying its absence. The resulting binary term vectors for Documents 1, 2, and 3 are:

Document 1: [1, 0, 1, 0, 0]

Document 2: [0, 1, 1, 0, 0]

Document 3: [0, 0, 0, 1, 1]
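A short sketch of how these vectors can be produced. The tokenized document contents below are inferred from the vectors shown above, not quoted from the original corpus:

```python
# Build binary term vectors over a fixed vocabulary.
docs = {
    "Document 1": ["earth", "round"],  # contents inferred from its vector
    "Document 2": ["moon", "round"],
    "Document 3": ["day", "nice"],
}
vocabulary = ["earth", "moon", "round", "day", "nice"]

vectors = {
    name: [1 if term in tokens else 0 for term in vocabulary]
    for name, tokens in docs.items()
}

print(vectors["Document 1"])  # -> [1, 0, 1, 0, 0]
print(vectors["Document 3"])  # -> [0, 0, 0, 1, 1]
```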

Calculating Cosine Similarity

The cosine similarity metric was utilized to measure the similarity between the vectors of the documents. This technique evaluates the cosine of the angle formed between two vectors in a multi-dimensional space. The formula is presented as:

Cosine Similarity(D1, D2) = (D1 · D2) / (||D1|| * ||D2||)

where:

 D1 · D2 represents the dot product of vectors D1 and D2,

 ||D1|| and ||D2|| denote the magnitudes of vectors D1 and D2, respectively.
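As a sketch, the formula translates directly into a small Python function:

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))   # D1 · D2
    mag1 = math.sqrt(sum(a * a for a in v1))   # ||D1||
    mag2 = math.sqrt(sum(b * b for b in v2))   # ||D2||
    if mag1 == 0 or mag2 == 0:
        return 0.0                             # guard against all-zero vectors
    return dot / (mag1 * mag2)

print(cosine_similarity([1, 0, 1, 0, 0], [0, 1, 1, 0, 0]))  # -> 0.5
```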

Procedure
The cosine similarity was computed for Document 1 against Documents 2 and 3 as follows:

1. Document 1 and Document 2:

Dot product: 1 (since "round" is common to both)

Magnitudes: ||Document 1|| = √(1² + 0² + 1² + 0² + 0²) = √2,

||Document 2|| = √(0² + 1² + 1² + 0² + 0²) = √2

Cosine Similarity: 1 / (√2 × √2) = 1/2 = 0.5

2. Document 1 and Document 3:

Dot product: 0 (no common terms)

Magnitudes: ||Document 1|| = √(1² + 0² + 1² + 0² + 0²) = √2,

||Document 3|| = √(0² + 0² + 0² + 1² + 1²) = √2

Cosine Similarity: 0 / (√2 × √2) = 0
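The two calculations above can be checked end to end with a small self-contained sketch:

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    mags = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / mags if mags else 0.0

d1 = [1, 0, 1, 0, 0]  # Document 1
d2 = [0, 1, 1, 0, 0]  # Document 2
d3 = [0, 0, 0, 1, 1]  # Document 3

print(cosine_similarity(d1, d2))  # -> 0.5
print(cosine_similarity(d1, d3))  # -> 0.0
```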

Recommendation

Based on the cosine similarity values obtained:

 The similarity between Document 1 and Document 2 is 0.5, indicating a partial but meaningful match.

 The similarity between Document 1 and Document 3 is 0, indicating no shared terms.

Consequently, Document 2, which shares the concept of roundness with Document 1, is the most suitable recommendation for the user.
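The recommendation step itself reduces to taking the maximum similarity score against the user's preferred document, as in this sketch:

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    mags = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / mags if mags else 0.0

liked = [1, 0, 1, 0, 0]  # Document 1, the user's preferred document
candidates = {
    "Document 2": [0, 1, 1, 0, 0],
    "Document 3": [0, 0, 0, 1, 1],
}

# Recommend the candidate with the highest cosine similarity to Document 1.
best = max(candidates, key=lambda name: cosine_similarity(liked, candidates[name]))
print(best)  # -> Document 2
```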

Conclusion

This exercise has illustrated the practical application of cosine similarity in the context of a search engine algorithm. By systematically pre-processing the text and creating vector representations, the engine can determine the most pertinent document for the user's query. This methodology is highly scalable, permitting its application to extensive corpora containing millions of documents and thereby enhancing the precision and relevance of search results.
