CS 3308 Learning Journal Unit 5

This journal entry reflects on the author's experiences in developing a comprehensive search system with scoring and ranking capabilities, focusing on enhancing an inverted index and implementing cosine similarity for document retrieval. The author discusses the challenges faced, insights gained, and the importance of optimization techniques in improving search efficiency. Overall, the unit has significantly contributed to the author's programming skills and understanding of information retrieval systems, which are relevant to their career aspirations in software engineering and data science.

Uploaded by

Reg

In this journal entry, I will discuss my experiences and insights from Unit 5, where the

central objective was to synthesize the knowledge acquired thus far to construct a comprehensive

search system with scoring and ranking capabilities. This endeavor required me to enhance an

existing inverted index and integrate it with a scoring mechanism based on cosine similarity.

Throughout this process, I encountered various challenges and surprises that have significantly

contributed to my academic growth.

Methodology

The implementation of the search system began with a review of the inverted index that I

had developed in previous units. To optimize the index for the new scoring and ranking

functionality, I incorporated several preprocessing steps: stop word removal, token filtering, and

the Porter Stemmer algorithm. These techniques aimed to refine the dataset by eliminating

irrelevant terms and normalizing the remaining tokens for more accurate matching.
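As an illustration of these preprocessing steps, here is a minimal sketch rather than the implementation used in the assignment. The stop list, the filtering rules (dropping one-character and purely numeric tokens), and the function name `preprocess` are assumptions made for the example. The `stem` argument stands in for a Porter stemmer and defaults to leaving tokens unchanged.

```python
import re

# Hypothetical, deliberately tiny stop list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def preprocess(text, stem=lambda token: token):
    """Lowercase, tokenize, filter, and stem a piece of text.

    `stem` is a placeholder for a Porter stemmer implementation;
    the default simply returns each token unchanged.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # stop word removal
        if len(token) < 2 or token.isdigit():
            continue  # token filtering (assumed rule: too short or purely numeric)
        kept.append(stem(token))  # normalization by stemming
    return kept
```

Applying the same function to both documents and queries keeps their tokens comparable, which is what makes matching against the index reliable.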

To achieve the goal of returning the top 20 most pertinent documents for a given query,

the system followed a structured approach:

1. Query Processing: A function was created to parse user queries and compute the term

frequency-inverse document frequency (tf-idf) weights for each term. This involved

multiplying the frequency of each term within the query by its inverse document

frequency, log(N / df), where N is the total number of documents and df is the number of

documents in which the term appears (Manning et al., 2009).

2. Document Retrieval: Utilizing the enhanced inverted index, the system identified all

documents containing at least one of the query's terms. This step was optimized to

minimize computational effort by focusing on relevant documents.


3. Scoring with Cosine Similarity: For each retrieved document, I computed the cosine

similarity between the query vector and the document vector. This metric is widely used

in information retrieval to assess the degree of relevance between queries and documents

(Kowalski, 2007). The formula for cosine similarity is:

$$\cos(\theta) = \frac{\sum_i t_i d_i}{\sqrt{\sum_i t_i^2} \, \sqrt{\sum_i d_i^2}}$$

Where:

$t_i$ is the weight of term i in the query and

$d_i$ is the weight of term i in the document.

4. Sorting and Presentation of Results: After calculating the cosine similarity scores, the

system sorted the documents in descending order of relevance and displayed the top 20

results, including the filename, similarity score, and the total number of candidates

considered.
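The four steps above can be sketched in a few dozen lines. This is an illustrative outline under stated assumptions, not the code submitted for the assignment: the names `build_index` and `search`, the use of raw term counts for tf, the base-10 logarithm for idf, and the in-memory dictionary layout are all choices made for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a weighted inverted index.

    docs: {filename: list of preprocessed tokens}
    Returns (postings, idf, doc_norms), where postings maps
    term -> {filename: tf-idf weight}.
    """
    n_docs = len(docs)
    term_counts = {name: Counter(tokens) for name, tokens in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())
    idf = {term: math.log10(n_docs / n) for term, n in df.items()}

    postings = defaultdict(dict)
    norm_sq = defaultdict(float)
    for name, counts in term_counts.items():
        for term, tf in counts.items():
            weight = tf * idf[term]
            postings[term][name] = weight
            norm_sq[name] += weight * weight
    doc_norms = {name: math.sqrt(v) for name, v in norm_sq.items()}
    return postings, idf, doc_norms


def search(query_tokens, postings, idf, doc_norms, k=20):
    """Return (top-k list of (filename, score), number of candidates)."""
    # Step 1: tf-idf weights for the query; unknown terms are ignored.
    q_weights = {t: tf * idf[t]
                 for t, tf in Counter(query_tokens).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Steps 2 and 3: visit only documents that share a term with the query
    # and accumulate the dot product term by term.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for name, d_w in postings[term].items():
            dots[name] += q_w * d_w

    scores = {}
    for name, dot in dots.items():
        denom = q_norm * doc_norms[name]
        if denom > 0:  # skip zero-length vectors
            scores[name] = dot / denom

    # Step 4: sort by descending similarity and keep the top k.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], len(dots)
```

With three toy documents, `a.txt` containing "home mortgage rate", `b.txt` containing "home garden", and `c.txt` containing "car loan", the query "home mortgage" yields two candidates and ranks `a.txt` above `b.txt`, since `a.txt` matches both query terms.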

Challenges and Insights

One of the most challenging aspects of this assignment was the computational complexity

associated with calculating cosine similarity for large datasets. However, by breaking the task

into smaller, manageable components, I gained a clearer understanding of the process. The

implementation of the dot product and vector normalization functions provided insight into the

practical use of linear algebra in information retrieval systems.
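To make the linear algebra concrete, the two helper functions could look roughly like this. It is a sketch that assumes vectors are stored sparsely as `{term: weight}` dictionaries; the function names are illustrative.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def normalize(v):
    """Scale a sparse vector to unit length; an all-zero vector is returned as a copy."""
    length = math.sqrt(dot(v, v))
    if length == 0:
        return dict(v)
    return {term: w / length for term, w in v.items()}


def cosine(u, v):
    """Cosine similarity is the dot product of the two unit-length vectors."""
    return dot(normalize(u), normalize(v))
```

Because normalization removes the effect of vector length, a long document is not favored over a short one merely for containing more words.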

The moment of truth came when I executed a sample query, such as "home mortgage,"

and observed the results accurately ordered by relevance. This success reinforced my confidence

in the accuracy of the tf-idf calculations and the effectiveness of the inverted index. Moreover, I
found it intriguing to explore methods for efficient retrieval, such as inexact top K retrieval and

index elimination, which can drastically reduce computational load.
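Index elimination can be illustrated with two small filters, in the spirit of the heuristics described by Manning et al. (2009). This is a sketch only: the threshold values, the function names, and the fallback to the full query when every term is filtered out are assumptions made for the example.

```python
from collections import Counter


def eliminate_low_idf(query_terms, idf, min_idf=0.2):
    """Keep only query terms whose idf reaches a threshold.

    Low-idf terms have long postings lists but contribute little to the score.
    Falls back to the original terms if the filter would remove everything.
    """
    kept = [t for t in query_terms if idf.get(t, 0.0) >= min_idf]
    return kept or list(query_terms)


def candidates_with_min_terms(query_terms, postings, min_terms):
    """Return documents that contain at least `min_terms` distinct query terms.

    postings: term -> {document: weight}
    """
    counts = Counter()
    for term in set(query_terms):
        counts.update(postings.get(term, {}).keys())
    return {doc for doc, c in counts.items() if c >= min_terms}
```

Both filters shrink the set of documents that reach the cosine computation, at the cost of possibly missing a document that an exhaustive scan would have ranked in the top K.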

Peer Feedback and Instructor Interactions

Collaborative discussions with my peers yielded invaluable suggestions for improving the

search engine's efficiency. One peer introduced the concept of Champion Lists, which prioritize

documents with high weights during retrieval. This strategy is particularly useful in systems with

vast document collections, as it minimizes the number of computations needed to identify

relevant documents.
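A champion list can be precomputed per term and then used to restrict the candidate set at query time. The following is a minimal sketch assuming postings stored as `term -> {document: weight}`; the list size `r` and the function names are hypothetical.

```python
import heapq


def build_champion_lists(postings, r=2):
    """For each term, keep the r documents with the highest weight for that term."""
    return {
        term: set(heapq.nlargest(r, docs, key=docs.get))
        for term, docs in postings.items()
    }


def champion_candidates(query_terms, champions):
    """Candidate set: the union of the champion lists of the query terms."""
    result = set()
    for term in query_terms:
        result |= champions.get(term, set())
    return result
```

Cosine similarity is then computed only for the returned candidates. If the union holds fewer than K documents, a system would need a fallback such as consulting the full postings lists.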

My instructor's feedback on previous programming assignments highlighted the necessity

of thorough testing and debugging. Incorporating this advice, I rigorously tested my

implementation with diverse queries to ensure its precision and reliability.

Emotional and Attitudinal Reflections

Engaging in this assignment has had a profound impact on my confidence and

programming skills. Initially, the complexity of integrating cosine similarity into the search

system was intimidating. However, as I progressed through the steps, my understanding grew,

and the process became more manageable. The realization that seemingly minor optimizations,

like ignoring stop words and low-idf terms, can significantly enhance search efficiency was

surprising and motivating.

The experience underscored the importance of writing clean, modular code for complex

projects. Dividing the system into discrete functions not only simplified the implementation but

also improved its readability and maintainability.


Key Learning Outcomes

This unit has been instrumental in deepening my understanding of:

• Efficient Scoring and Ranking: I now appreciate the inefficiency of computing cosine

similarity for all documents and the value of techniques such as inexact top K retrieval,

index elimination, and impact ordering for large datasets.

• Cosine Similarity and tf-idf: The practical application of these mathematical

foundations has solidified my grasp of how relevance is determined in modern search

engines.

• Query Types: Distinctions between Boolean retrieval, wildcard queries, and phrase

queries.

• Optimizing Search Systems: The role of advanced strategies like Champion Lists, static

quality scores, and cluster pruning in enhancing search performance.

• Practical Programming Skills: Translating theoretical knowledge into a functional

system and developing robust code.

Surprising Findings and Challenges

I was particularly intrigued by the substantial impact that seemingly small preprocessing

decisions, such as removing stop words, can have on search efficiency. Additionally,

understanding and applying the cosine similarity metric to a real-world context was an

enlightening experience.

The most significant challenge was managing computational complexity for cosine

similarity calculations on large datasets. This required a careful study of optimization strategies

and efficient algorithmic implementation.
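One inexpensive optimization of this kind is to select the top K scores with a heap instead of sorting every candidate. The sketch below assumes scores are held in a `{filename: score}` dictionary; the function name `top_k` is illustrative.

```python
import heapq


def top_k(scores, k=20):
    """Return the k highest-scoring (filename, score) pairs, best first.

    heapq.nlargest runs in roughly O(n log k) time, compared with
    O(n log n) for sorting all n candidates.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

The saving matters most when the number of candidates is far larger than K, which is the usual situation for a top-20 result list.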


Application to Career and Personal Interests

The skills and knowledge acquired in this unit are highly pertinent to my aspirations in

software engineering and data science. For instance, constructing efficient search systems is

essential for applications such as e-commerce product searches and document retrieval tools.

Furthermore, the principles of tf-idf and cosine similarity are fundamental in text

mining and natural language processing (NLP), which are fields I am keen on exploring in the

future.

Developing a systematic approach to problem-solving, with a focus on persistence and

attention to detail, has been invaluable in this process. These qualities are indispensable in the

programming field and will undeniably aid in my professional growth.

Conclusion

In conclusion, this unit has been a substantial milestone in my academic development. By

creating a functioning search engine and gaining a deeper insight into the workings of modern

information retrieval systems, I have developed a solid foundation in the field. The knowledge

and skills honed through this experience will undoubtedly serve me well in my future career

pursuits and academic endeavors.


References

Kowalski, G. J. (2007). Information retrieval systems: theory and implementation (Vol. 1).

Springer.

Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval.

Retrieved from http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4),

35-43.
