CS 3308 Learning Journal Unit 4
The objective of this exercise is to compute the cosine similarity between documents in a
corpus and to recommend the document most relevant to the user. The corpus comprises three
documents, and the user frequently consults Document 1. This journal describes how the cosine
similarity was computed and identifies the document in the corpus that is most similar to
Document 1.
Preparatory Methods
Before computing similarity, the documents underwent a series of preprocessing steps that
are fundamental to achieving consistent and reliable results:
1. Lowercasing: All text was converted to lowercase to normalize the input data, thereby
ensuring that the analysis is not skewed by variations in capitalization.
2. Tokenization: This step involved dissecting each document into discrete tokens, which
are essentially the individual words that constitute the text. This process facilitates the
[...]
3. Stemming: [...] of the words involved, this step typically involves reducing words to
their base or root form.
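The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for the exercise: the function name `preprocess`, the regular expression, and the sample sentence are all assumptions, and stemming is left out because the Python standard library provides no stemmer.

```python
import re

def preprocess(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    lowered = text.lower()
    # Assumption: a token is any run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", lowered)

# Illustrative input only; the actual corpus text is not reproduced here.
tokens = preprocess("Search engines RANK documents.")
print(tokens)  # ['search', 'engines', 'rank', 'documents']
```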
The vocabulary was constructed by identifying the unique terms present across the entire
corpus. Using this vocabulary, each document was represented as a binary vector, with a '1'
indicating the presence of a term and a '0' signifying its absence. The resulting binary term
vectors are:
Document 1: [1, 0, 1, 0, 0]
Document 2: [0, 1, 1, 0, 0]
Document 3: [0, 0, 0, 1, 1]
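The construction of the vocabulary and the binary vectors can be sketched as follows. The token lists are hypothetical placeholders, chosen only so that the output reproduces the vectors listed above. Ordering the vocabulary alphabetically is also an assumption, as is the helper name `to_binary_vector`.

```python
# Hypothetical tokenized documents (placeholder terms, not the real corpus).
docs = {
    "Document 1": ["apple", "cherry"],
    "Document 2": ["banana", "cherry"],
    "Document 3": ["date", "elderberry"],
}

# Vocabulary: the unique terms across the whole corpus, sorted alphabetically
# (the ordering is an assumption; any fixed order works).
vocabulary = sorted({term for tokens in docs.values() for term in tokens})

def to_binary_vector(tokens, vocabulary):
    """Return 1 for each vocabulary term present in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

vectors = {name: to_binary_vector(tokens, vocabulary) for name, tokens in docs.items()}
for name, vector in vectors.items():
    print(name, vector)
```

With these placeholder terms the vocabulary has five entries, so each document becomes a five-element vector, the same length as the vectors in this journal.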
The cosine similarity metric was used to measure the similarity between the document
vectors. This technique evaluates the cosine of the angle formed between two vectors in a
vector space:

$$\text{Cosine Similarity}(D_1, D_2) = \frac{D_1 \cdot D_2}{\|D_1\| \, \|D_2\|}$$

where:
D1 · D2 denotes the dot product of vectors D1 and D2.
||D1|| and ||D2|| denote the magnitudes of vectors D1 and D2, respectively.
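The formula above translates directly into a short function. This is a sketch using only the standard library. The name `cosine_similarity` and the decision to return 0.0 when either vector has zero magnitude are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))       # D1 · D2
    norm_a = math.sqrt(sum(x * x for x in a))    # ||D1||
    norm_b = math.sqrt(sum(y * y for y in b))    # ||D2||
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))  # identical direction: 1.0
print(cosine_similarity([1, 0], [0, 1]))  # no shared terms: 0.0
```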
Procedure
The cosine similarity was computed for Document 1 against Documents 2 and 3 as follows:
$$\text{Cosine Similarity}(D_1, D_2) = \frac{1}{\sqrt{2} \cdot \sqrt{2}} = \frac{1}{2} = 0.5$$

$$\text{Cosine Similarity}(D_1, D_3) = \frac{0}{\sqrt{2} \cdot \sqrt{2}} = 0$$
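The arithmetic above can be checked by computing the numerator and denominator separately for each pair. The sketch below uses the binary vectors from this journal. The helper names `dot` and `norm` and the variable `results` are illustrative, and the values are formatted to two decimals because floating-point square roots are not exact.

```python
import math

d1 = [1, 0, 1, 0, 0]
others = {
    "Document 2": [0, 1, 1, 0, 0],
    "Document 3": [0, 0, 0, 1, 1],
}

def dot(a, b):
    """Dot product: the number of shared terms for binary vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean magnitude of a vector."""
    return math.sqrt(dot(a, a))

results = {}
for name, vector in others.items():
    numerator = dot(d1, vector)
    denominator = norm(d1) * norm(vector)
    results[name] = numerator / denominator
    print(f"Document 1 vs {name}: {numerator} / {denominator:.2f} = {results[name]:.2f}")
```

For binary vectors the dot product counts the shared terms, which is why Document 3, with no terms in common with Document 1, scores exactly 0.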
Recommendation
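The similarity between Document 1 and Document 2 is 0.5, which signifies a partial match,
whereas the similarity between Document 1 and Document 3 is 0 because the two documents share
no terms. Document 2 is therefore the document most similar to Document 1 and is the one
recommended to the user.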
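The recommendation step amounts to selecting the candidate with the highest similarity score. A minimal sketch follows, using the scores computed in the Procedure section. The function name `recommend` and the choice to return `None` when no candidate has a positive score are assumptions.

```python
# Similarity of each candidate to Document 1, taken from the Procedure section.
similarities = {"Document 2": 0.5, "Document 3": 0.0}

def recommend(similarities):
    """Return the name of the most similar document, or None if nothing matches."""
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    # Assumption: a score of 0 means no shared terms, so nothing is recommended.
    return best if similarities[best] > 0 else None

print(recommend(similarities))  # Document 2
```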
Conclusion
This exercise illustrated the practical application of cosine similarity in the context of
a search engine. By systematically pre-processing the text and creating vector
representations, the engine can identify the document most similar to the one the user already
consults. Because the same steps apply unchanged to any number of documents, the method can in
principle be extended to much larger corpora to improve the relevance of search results.