CS 3308 Learning Journal Unit 5
The central objective of this unit was to synthesize the knowledge acquired thus far to construct a comprehensive search system with scoring and ranking capabilities. This endeavor required me to enhance an existing inverted index and integrate it with a scoring mechanism based on cosine similarity. Throughout this process, I encountered various challenges and surprises that have significantly deepened my understanding of information retrieval.
Methodology
The implementation of the search system began with a review of the inverted index that I
had developed in previous units. To optimize the index for the new scoring and ranking
functionality, I incorporated several preprocessing steps: stop word removal, token filtering, and
the Porter Stemmer algorithm. These techniques aimed to refine the dataset by eliminating
irrelevant terms and normalizing the remaining tokens for more accurate matching.
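As an illustration, the preprocessing pipeline might be sketched as follows. The stop word list here is a small illustrative subset, and `light_stem` is a simplified suffix stripper standing in for the full Porter algorithm (the actual implementation used the real Porter Stemmer), so the example stays dependency-free:

```python
import re

# Small illustrative stop word list (assumption; the real list was larger)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "their"}

def light_stem(token):
    # Simplified suffix stripping standing in for the full Porter algorithm
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenize on letter runs, lowercase, drop stop words and short tokens, stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [light_stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(preprocess("The borrowers refinanced their home mortgages"))
```

Normalizing "mortgages" and "mortgage" to the same stem is what lets a query term match all inflected forms in the index.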
To achieve the goal of returning the top 20 most pertinent documents for a given query, I implemented the following steps:
1. Query Processing: A function was created to parse user queries and compute the term frequency-inverse document frequency (tf-idf) weight for each term. This involved multiplying the frequency of each term within the query by its inverse document frequency.
2. Document Retrieval: Utilizing the enhanced inverted index, the system identified all documents containing at least one of the query's terms. This step was optimized to avoid scoring documents that share no terms with the query.
3. Cosine Similarity Calculation: For each candidate document, the system computed the cosine similarity between the query vector and the document vector. This metric is widely used in information retrieval to assess the degree of relevance between queries and documents:
cos(θ) = Σ(tᵢ × dᵢ) / (√Σ(tᵢ²) × √Σ(dᵢ²))
Where tᵢ and dᵢ are the tf-idf weights of term i in the query and the document, respectively.
4. Sorting and Presentation of Results: After calculating the cosine similarity scores, the
system sorted the documents in descending order of relevance and displayed the top 20
results, including the filename, similarity score, and the total number of candidates
considered.
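The four steps above can be sketched end to end on a toy corpus. The document collection, file names, and idf base used here are illustrative assumptions, not the actual assignment data:

```python
import math
from collections import Counter

# Toy corpus standing in for the indexed collection (assumption)
docs = {
    "doc1.txt": ["home", "mortgage", "rate"],
    "doc2.txt": ["home", "garden", "design"],
    "doc3.txt": ["mortgage", "interest", "rate"],
}
N = len(docs)

# Document frequency of each term, derived from the toy corpus
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

def tfidf_vector(tokens):
    # tf-idf weight = raw term frequency * idf, with idf = log10(N / df)
    tf = Counter(tokens)
    return {t: tf[t] * math.log10(N / df[t]) for t in tf if df.get(t)}

def cosine(q, d):
    # cos(theta) = dot(q, d) / (|q| * |d|)
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def search(query, k=20):
    qvec = tfidf_vector(query.lower().split())
    # Step 2: only documents sharing at least one query term are candidates
    candidates = {name: tfidf_vector(toks) for name, toks in docs.items()
                  if set(qvec) & set(toks)}
    # Steps 3-4: score candidates, sort descending, keep the top k
    ranked = sorted(((cosine(qvec, d), name) for name, d in candidates.items()),
                    reverse=True)
    return [(name, round(score, 3)) for score, name in ranked[:k]]

print(search("home mortgage"))
```

On this corpus, doc1.txt contains both query terms and is ranked first, which mirrors the "home mortgage" sanity check described later.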
One of the most challenging aspects of this assignment was the computational complexity
associated with calculating cosine similarity for large datasets. However, by breaking the task
into smaller, manageable components, I gained a clearer understanding of the process. The
implementation of the dot product and vector normalization functions provided insight into the mathematics behind the similarity computation.
The moment of truth came when I executed a sample query, such as "home mortgage,"
and observed the results accurately ordered by relevance. This success reinforced my confidence
in the accuracy of the tf-idf calculations and the effectiveness of the inverted index. Moreover, I
found it intriguing to explore methods for efficient retrieval, such as inexact top K retrieval and index elimination.
Collaborative discussions with my peers yielded invaluable suggestions for improving the
search engine's efficiency. One peer introduced the concept of Champion Lists, which prioritize
documents with high weights during retrieval. This strategy is particularly useful in systems with very large collections, where it narrows the candidate set while still surfacing highly relevant documents.
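The Champion List idea can be sketched minimally as follows; the term weights and the cutoff r are invented for illustration. For each term, only the r highest-weighted documents are precomputed as "champions," and query-time scoring is restricted to the union of the champion lists:

```python
# Per-term document weights (e.g. tf-idf) for a toy index (assumption)
postings = {
    "mortgage": {"doc1": 0.9, "doc3": 0.4, "doc7": 0.2, "doc9": 0.1},
    "home":     {"doc1": 0.8, "doc2": 0.5, "doc9": 0.3},
}

def build_champion_lists(postings, r=2):
    # At index time, keep only the r highest-weight documents per term
    champions = {}
    for term, weights in postings.items():
        champions[term] = sorted(weights, key=weights.get, reverse=True)[:r]
    return champions

def candidate_docs(query_terms, champions):
    # At query time, score only the union of the query terms' champion lists
    cands = set()
    for t in query_terms:
        cands.update(champions.get(t, ()))
    return cands

champs = build_champion_lists(postings, r=2)
print(candidate_docs(["home", "mortgage"], champs))
```

The trade-off is that a document outside every champion list can never be returned, which is why this is an inexact (but usually good enough) top K method.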
This assignment also strengthened my programming skills. Initially, the complexity of integrating cosine similarity into the search
system was intimidating. However, as I progressed through the steps, my understanding grew,
and the process became more manageable. The realization that seemingly minor optimizations,
like ignoring stop words and low-idf terms, can significantly enhance search efficiency was eye-opening.
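Dropping low-idf query terms (a form of index elimination) can be sketched as follows; the document frequencies, corpus size, and idf threshold are illustrative assumptions:

```python
import math

# Toy corpus statistics (assumption): N documents, df per term
N = 1000
df = {"the": 990, "home": 120, "mortgage": 40}

def idf(term):
    # Unseen terms get idf 0 and are pruned along with near-ubiquitous ones
    return math.log10(N / df[term]) if term in df else 0.0

def prune_query(terms, min_idf=0.5):
    # Index elimination: skip terms whose idf is too low to affect ranking much
    return [t for t in terms if idf(t) >= min_idf]

print(prune_query(["the", "home", "mortgage"]))
```

A term like "the" appears in almost every document, so its postings list is huge while its contribution to any score is negligible; skipping it avoids traversing that list entirely.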
The experience underscored the importance of writing clean, modular code for complex
projects. Dividing the system into discrete functions not only simplified the implementation but also made each component easier to test and debug in isolation. Among the key concepts this unit reinforced were:
Efficient Scoring and Ranking: I now appreciate the inefficiency of computing cosine similarity for all documents, and the value of techniques such as inexact top K retrieval that practical search engines rely on.
Query Types: Distinctions between boolean retrieval, wildcard queries, and phrase
queries.
Optimizing Search Systems: The role of advanced strategies like Champion Lists and static quality scores in speeding up retrieval.
I was particularly intrigued by the substantial impact that seemingly small preprocessing
decisions, such as removing stop words, can have on search efficiency. Additionally,
understanding and applying the cosine similarity metric to a real-world context was an
enlightening experience.
The most significant challenge was managing computational complexity for cosine
similarity calculations on large datasets. This required a careful study of optimization strategies to keep query processing tractable.
The skills and knowledge acquired in this unit are highly pertinent to my aspirations in
software engineering and data science. For instance, constructing efficient search systems is
essential for applications such as e-commerce product searches and document retrieval tools.
Furthermore, the principles of tf-idf and cosine similarity are fundamental in text
mining and natural language processing (NLP), which are fields I am keen on exploring in the
future.
A disciplined approach, coupled with attention to detail, has been invaluable in this process. These qualities are indispensable in the software profession.
Conclusion
By creating a functioning search engine and gaining a deeper insight into the workings of modern
information retrieval systems, I have developed a solid foundation in the field. The knowledge
and skills honed through this experience will undoubtedly serve me well in my future career.