CS 3308 Learning Journal Unit 5
The central objective of this unit was to synthesize the knowledge acquired thus far to construct a comprehensive search system with scoring and ranking capabilities. This endeavor required me to enhance an existing inverted index and integrate it with a scoring mechanism based on cosine similarity. Throughout this process, I encountered various challenges and surprises that have significantly deepened my understanding of information retrieval.
Methodology
The implementation of the search system began with a review of the inverted index that I
had developed in previous units. To optimize the index for the new scoring and ranking
functionality, I incorporated several preprocessing steps: stop word removal, token filtering, and
the Porter Stemmer algorithm. These techniques aimed to refine the dataset by eliminating
irrelevant terms and normalizing the remaining tokens for more accurate matching.
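As an illustration, the preprocessing pipeline might be sketched as follows. The stop word list here is a small illustrative subset, and `light_stem` is a simplified suffix stripper standing in for the full Porter algorithm (the actual implementation used the real Porter Stemmer), so the example stays dependency-free:

```python
import re

# Small illustrative stop word list (assumption; the real list was larger)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "their"}

def light_stem(token):
    # Simplified suffix stripping standing in for the full Porter algorithm
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenize on letter runs, lowercase, drop stop words and short tokens, stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [light_stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(preprocess("The borrowers refinanced their home mortgages"))
```

Normalizing "mortgages" and "mortgage" to the same stem is what lets a query term match all inflected forms in the index.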
To achieve the goal of returning the top 20 most pertinent documents for a given query, I implemented the following steps:
1. Query Processing: A function was created to parse user queries and compute the term frequency-inverse document frequency (tf-idf) weight for each term. This involved multiplying the frequency of each term within the query by its inverse document frequency.
2. Document Retrieval: Utilizing the enhanced inverted index, the system identified all documents containing at least one of the query's terms. This step was optimized to avoid scoring documents that share no terms with the query.
3. Cosine Similarity Calculation: For each candidate document, the system computed the cosine similarity between the query vector and the document vector. This metric is widely used in information retrieval to assess the degree of relevance between queries and documents:
cos(θ) = Σ(tᵢ × dᵢ) / (√Σ(tᵢ²) × √Σ(dᵢ²))
Where tᵢ and dᵢ are the tf-idf weights of term i in the query and the document, respectively.
4. Sorting and Presentation of Results: After calculating the cosine similarity scores, the
system sorted the documents in descending order of relevance and displayed the top 20
results, including the filename, similarity score, and the total number of candidates
considered.
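The four steps above can be sketched end to end on a toy corpus. The document collection, file names, and idf base used here are illustrative assumptions, not the actual assignment data:

```python
import math
from collections import Counter

# Toy corpus standing in for the indexed collection (assumption)
docs = {
    "doc1.txt": ["home", "mortgage", "rate"],
    "doc2.txt": ["home", "garden", "design"],
    "doc3.txt": ["mortgage", "interest", "rate"],
}
N = len(docs)

# Document frequency of each term, derived from the toy corpus
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

def tfidf_vector(tokens):
    # tf-idf weight = raw term frequency * idf, with idf = log10(N / df)
    tf = Counter(tokens)
    return {t: tf[t] * math.log10(N / df[t]) for t in tf if df.get(t)}

def cosine(q, d):
    # cos(theta) = dot(q, d) / (|q| * |d|)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def search(query, k=20):
    qvec = tfidf_vector(query.lower().split())
    # Step 2: only documents sharing at least one query term are candidates
    candidates = {name: tfidf_vector(toks) for name, toks in docs.items()
                  if set(qvec) & set(toks)}
    # Steps 3-4: score candidates, sort descending, keep the top k
    ranked = sorted(((cosine(qvec, d), name) for name, d in candidates.items()),
                    reverse=True)
    return [(name, round(score, 3)) for score, name in ranked[:k]]

print(search("home mortgage"))
```

On this corpus, doc1.txt contains both query terms and is ranked first, which mirrors the "home mortgage" sanity check described later.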
One of the most challenging aspects of this assignment was the computational complexity
associated with calculating cosine similarity for large datasets. However, by breaking the task
into smaller, manageable components, I gained a clearer understanding of the process. The
implementation of the dot product and vector normalization functions provided insight into the mathematics behind the similarity computation.
The moment of truth came when I executed a sample query, such as "home mortgage,"
and observed the results accurately ordered by relevance. This success reinforced my confidence
in the accuracy of the tf-idf calculations and the effectiveness of the inverted index. Moreover, I
found it intriguing to explore methods for efficient retrieval, such as inexact top K retrieval and index elimination.
Collaborative discussions with my peers yielded invaluable suggestions for improving the
search engine's efficiency. One peer introduced the concept of Champion Lists, which prioritize
documents with high weights during retrieval. This strategy is particularly useful in systems with very large collections, where it narrows the candidate set while still surfacing highly relevant documents.
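The Champion List idea can be sketched minimally as follows; the term weights and the cutoff r are invented for illustration. For each term, only the r highest-weighted documents are precomputed as "champions," and query-time scoring is restricted to the union of the champion lists:

```python
# Per-term document weights (e.g. tf-idf) for a toy index (assumption)
postings = {
    "mortgage": {"doc1": 0.9, "doc3": 0.4, "doc7": 0.2, "doc9": 0.1},
    "home":     {"doc1": 0.8, "doc2": 0.5, "doc9": 0.3},
}

def build_champion_lists(postings, r=2):
    # At index time, keep only the r highest-weight documents per term
    champions = {}
    for term, weights in postings.items():
        champions[term] = sorted(weights, key=weights.get, reverse=True)[:r]
    return champions

def candidate_docs(query_terms, champions):
    # At query time, score only the union of the query terms' champion lists
    cands = set()
    for t in query_terms:
        cands.update(champions.get(t, ()))
    return cands

champs = build_champion_lists(postings, r=2)
print(candidate_docs(["home", "mortgage"], champs))
```

The trade-off is that a document outside every champion list can never be returned, which is why this is an inexact (but usually good enough) top K method.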
This assignment also strengthened my programming skills. Initially, the complexity of integrating cosine similarity into the search
system was intimidating. However, as I progressed through the steps, my understanding grew,
and the process became more manageable. The realization that seemingly minor optimizations,
like ignoring stop words and low-idf terms, can significantly enhance search efficiency was eye-opening.
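Dropping low-idf query terms (a form of index elimination) can be sketched as follows; the document frequencies, corpus size, and idf threshold are illustrative assumptions:

```python
import math

# Toy corpus statistics (assumption): N documents, df per term
N = 1000
df = {"the": 990, "home": 120, "mortgage": 40}

def idf(term):
    # Unseen terms get idf 0 and are pruned along with near-ubiquitous ones
    return math.log10(N / df[term]) if term in df else 0.0

def prune_query(terms, min_idf=0.5):
    # Index elimination: skip terms whose idf is too low to affect ranking much
    return [t for t in terms if idf(t) >= min_idf]

print(prune_query(["the", "home", "mortgage"]))
```

A term like "the" appears in almost every document, so its postings list is huge while its contribution to any score is negligible; skipping it avoids traversing that list entirely.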
The experience underscored the importance of writing clean, modular code for complex
projects. Dividing the system into discrete functions not only simplified the implementation but also made each component easier to test and debug in isolation. Among the key concepts this unit reinforced were:
Efficient Scoring and Ranking: I now appreciate the inefficiency of computing cosine similarity for all documents, and the value of techniques such as inexact top K retrieval that practical search engines rely on.
Query Types: Distinctions between boolean retrieval, wildcard queries, and phrase
queries.
Optimizing Search Systems: The role of advanced strategies like Champion Lists and static quality scores in speeding up retrieval.
I was particularly intrigued by the substantial impact that seemingly small preprocessing
decisions, such as removing stop words, can have on search efficiency. Additionally,
understanding and applying the cosine similarity metric to a real-world context was an
enlightening experience.
The most significant challenge was managing computational complexity for cosine
similarity calculations on large datasets. This required a careful study of optimization strategies to keep query processing tractable.
The skills and knowledge acquired in this unit are highly pertinent to my aspirations in
software engineering and data science. For instance, constructing efficient search systems is
essential for applications such as e-commerce product searches and document retrieval tools.
Furthermore, the principles of tf-idf and cosine similarity are fundamental in text
mining and natural language processing (NLP), which are fields I am keen on exploring in the
future.
A disciplined approach, coupled with attention to detail, has been invaluable in this process. These qualities are indispensable in the software profession.
Conclusion
By creating a functioning search engine and gaining a deeper insight into the workings of modern
information retrieval systems, I have developed a solid foundation in the field. The knowledge
and skills honed through this experience will undoubtedly serve me well in my future career.