CS 3308 Learning Journal Unit 4

The document outlines a method for applying cosine similarity to recommend documents based on user preferences, specifically focusing on three documents in a corpus. It details the preprocessing steps, formation of binary term vectors, and the calculation of cosine similarity, ultimately recommending Document 2 due to its similarity with Document 1. The methodology is scalable and can enhance the relevance of search results in larger corpora.

Uploaded by

Reg

Applying Cosine Similarity for Document Recommendation

The primary objective of this exercise is to determine the cosine similarity between documents within a corpus and to subsequently recommend a document that is most pertinent to the user based on their preferences. The corpus at hand comprises three distinct documents, and the user frequently consults Document 1. This investigation will elucidate the methodology employed to compute the cosine similarity and ultimately propose an analogous document from the corpus.

Preparatory Methods

To initiate the process of computing similarity, the documents underwent a series of pre-processing steps that are fundamental to achieving consistent and reliable results:

1. Lowercasing: The transformation of all textual content to lowercase serves to standardize the input data, thereby ensuring that the analysis is not skewed by variations in capitalization.

2. Tokenization: This step involved dissecting each document into discrete tokens, which are essentially the individual words that constitute the text. This process facilitates the analysis of the text on a word-level basis.

3. Stop-word Elimination: Common terms, such as "is" in Document 1, that do not significantly contribute to the semantic essence of the documents were removed to prevent them from influencing the similarity calculations.

4. Stemming/Lemmatization: Although unnecessary in this instance due to the simplicity of the words involved, this step typically involves reducing words to their base or root forms, thereby eliminating the consideration of various inflections as unique terms.
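The first three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the journal does not quote the document texts, so the three sentences in `DOCS` are assumptions inferred from the reported vocabulary, and `STOP_WORDS` and `preprocess` are hypothetical names with a deliberately tiny stop-word list.

```python
# Hypothetical document texts: the source does not quote them, so these
# are assumed from the vocabulary and vectors reported in this journal.
DOCS = {
    "Document 1": "Earth is round",
    "Document 2": "Moon is round",
    "Document 3": "Day is nice",
}

# A tiny illustrative stop-word list; real systems use a much longer one.
STOP_WORDS = {"is", "a", "an", "the", "of", "and"}


def preprocess(text):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens_by_doc = {name: preprocess(text) for name, text in DOCS.items()}
for name, tokens in tokens_by_doc.items():
    print(name, tokens)
```

Whitespace splitting is enough for these short sentences; text with punctuation would need a more careful tokenizer.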

Formation of Vocabulary and Binary Term Vectors

The vocabulary was constructed by identifying the unique terms present across the entire corpus. The vocabulary for the given documents is as follows:

[earth, moon, round, day, nice]

Using this vocabulary, each document was represented as a binary vector, with a '1' indicating the presence of a term and a '0' signifying its absence. The resulting binary term vectors for Documents 1, 2, and 3 are:

Document 1: [1, 0, 1, 0, 0]

Document 2: [0, 1, 1, 0, 0]

Document 3: [0, 0, 0, 1, 1]
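As a rough sketch of how such vectors might be produced, the snippet below maps each document's tokens onto the fixed vocabulary order given above. The token lists are assumed (the document texts are not quoted in this journal), and `binary_vector` is a hypothetical helper name.

```python
# Token lists after pre-processing; the underlying texts are assumed.
tokens_by_doc = {
    "Document 1": ["earth", "round"],
    "Document 2": ["moon", "round"],
    "Document 3": ["day", "nice"],
}

# Fixed term order, matching the vocabulary listed in the text.
vocabulary = ["earth", "moon", "round", "day", "nice"]


def binary_vector(tokens, vocab):
    """Return 1 for each vocabulary term present in tokens, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


vectors = {
    name: binary_vector(tokens, vocabulary)
    for name, tokens in tokens_by_doc.items()
}
for name, vector in vectors.items():
    print(name, vector)
```

The vocabulary order is fixed by hand here so that the positions line up with the vectors reported in the text.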

Calculating Cosine Similarity

The cosine similarity metric was utilized to measure the similarity between the vectors of the documents. This technique evaluates the cosine of the angle formed between two vectors in a multi-dimensional space. The formula is presented as:

$$\text{Cosine Similarity}(D_1, D_2) = \frac{D_1 \cdot D_2}{\|D_1\| \, \|D_2\|}$$

where:

• $D_1 \cdot D_2$ represents the dot product of vectors $D_1$ and $D_2$,

• $\|D_1\|$ and $\|D_2\|$ denote the magnitudes of vectors $D_1$ and $D_2$, respectively.
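The formula translates fairly directly into code. The following is a minimal sketch using only the standard library; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, since the formula itself is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


d1 = [1, 0, 1, 0, 0]
d2 = [0, 1, 1, 0, 0]
print(round(cosine_similarity(d1, d2), 3))
```

Because the magnitudes are computed in floating point, results should be compared with a tolerance rather than tested for exact equality.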

Procedure

The cosine similarity was computed for Document 1 against Documents 2 and 3 as follows:

1. Document 1 and Document 2:

Dot product: 1 (since "round" is common to both)

Magnitudes:

$$\|\text{Document 1}\| = \sqrt{1^2 + 0^2 + 1^2 + 0^2 + 0^2} = \sqrt{2}, \qquad \|\text{Document 2}\| = \sqrt{0^2 + 1^2 + 1^2 + 0^2 + 0^2} = \sqrt{2}$$

Cosine Similarity:

$$\frac{1}{\sqrt{2} \cdot \sqrt{2}} = \frac{1}{2} = 0.5$$

2. Document 1 and Document 3:

Dot product: 0 (no common terms)

Magnitudes:

$$\|\text{Document 1}\| = \sqrt{1^2 + 0^2 + 1^2 + 0^2 + 0^2} = \sqrt{2}, \qquad \|\text{Document 3}\| = \sqrt{0^2 + 0^2 + 0^2 + 1^2 + 1^2} = \sqrt{2}$$

Cosine Similarity:

$$\frac{0}{\sqrt{2} \cdot \sqrt{2}} = 0$$
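Because cosine similarity is the cosine of the angle between two vectors, the two results can also be read as angles. The restatement below introduces the symbol $\theta$ for that angle purely for illustration; it is not part of the original notation.

```latex
\[
\theta_{1,2} = \arccos(0.5) = 60^\circ,
\qquad
\theta_{1,3} = \arccos(0) = 90^\circ
\]
```

An angle of 90° means the two vectors are orthogonal, which for binary term vectors corresponds to documents with no terms in common.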

Recommendation

Based on the cosine similarity values obtained:

• The similarity between Document 1 and Document 2 is 0.5, which indicates a partial match: each document has two terms, and they share one of them ("round").

• The similarity between Document 1 and Document 3 is 0, indicating no similarity.

Consequently, Document 2, which shares the concept of roundness with Document 1, is the most suitable recommendation for the user.
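The recommendation step amounts to scoring every other document against the one the user consults and picking the highest score. A minimal sketch follows, assuming the binary vectors reported earlier; `recommend` and `cosine_similarity` are hypothetical helper names, not part of any library.

```python
import math

# Binary term vectors as reported in this journal.
vectors = {
    "Document 1": [1, 0, 1, 0, 0],
    "Document 2": [0, 1, 1, 0, 0],
    "Document 3": [0, 0, 0, 1, 1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(query_name, vectors):
    """Score all other documents against the query and return the best."""
    query = vectors[query_name]
    scores = {
        name: cosine_similarity(query, vector)
        for name, vector in vectors.items()
        if name != query_name
    }
    best = max(scores, key=scores.get)
    return best, scores


best, scores = recommend("Document 1", vectors)
print("Recommended:", best)
```

With ties, `max` returns the first highest-scoring document in dictionary order, so a real system would need an explicit tie-breaking rule.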

Conclusion

This exercise has illustrated the practical application of cosine similarity in the context of a search engine algorithm. By systematically pre-processing the text and creating vector representations, the engine can determine the most pertinent document for the user's query. The methodology extends to extensive corpora containing millions of documents, because each comparison involves only the two term vectors being compared, and it can thereby enhance the precision and relevance of search results.
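For larger vocabularies, most entries of a binary term vector are zero, so one common way to keep the computation cheap is to store only the terms that occur. For binary vectors the dot product equals the number of shared terms and each magnitude is the square root of the number of terms, so the same similarity can be computed on sets. The sketch below assumes that set representation; `binary_cosine` is a hypothetical name.

```python
import math


def binary_cosine(terms_a, terms_b):
    """Cosine similarity of two binary term vectors stored as sets."""
    if not terms_a or not terms_b:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    shared = len(terms_a & terms_b)  # dot product of the binary vectors
    return shared / math.sqrt(len(terms_a) * len(terms_b))


doc1 = {"earth", "round"}
doc2 = {"moon", "round"}
doc3 = {"day", "nice"}
print(binary_cosine(doc1, doc2), binary_cosine(doc1, doc3))
```

This gives the same values as the vector calculation in this journal, while the cost of each comparison depends on the size of the two documents rather than on the size of the vocabulary.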
