Did It Make The News?
Overview
Millions of people visit news websites daily. In some cases, a knowledge worker might need the ability to focus a search on
more specific topics or industries. One very relevant and recent example related to Covid-19 is a dataset named CORD-19. A
virologist or biochemist might be interested in targeted searches of the ~500,000 scientific articles it contains. Similarly,
someone who works in the financial markets might be interested in a search engine that operates on financial news source
data. And that is the type of data you’re going to use in this project.
For this project, you’re going to build a search engine for a large collection of financial news articles from Jan - May 2018.
The dataset contains more than 300,000 articles.
You can download the dataset from Kaggle at https://www.kaggle.com/jeet2016/us-financial-news-articles. You will need to
make a Kaggle account to download it. Note that the download is around 1.3 GB and the uncompressed dataset is around 2.5
GB.
The files containing the news articles are in JSON format. JSON is a “lightweight data interchange format”
(https://www.json.org/json-en.html) that is easily understood by both humans and machines. There are a number of open
source JSON parsing libraries available. The “officially supported parser” for this project is RapidJSON.
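To give a feel for RapidJSON, here is a minimal sketch that parses one article file. The file name and the "title"/"text" field names are assumptions for illustration; verify them against the actual dataset files.

#include <fstream>
#include <iostream>
#include "rapidjson/document.h"
#include "rapidjson/istreamwrapper.h"

int main() {
    // "article.json" is a placeholder for one file from the dataset.
    std::ifstream ifs("article.json");
    rapidjson::IStreamWrapper isw(ifs);

    rapidjson::Document doc;
    doc.ParseStream(isw);
    if (doc.HasParseError()) {
        std::cerr << "parse error\n";
        return 1;
    }
    // Field names assumed here; check the actual dataset files.
    if (doc.HasMember("title") && doc["title"].IsString())
        std::cout << "title: " << doc["title"].GetString() << "\n";
    if (doc.HasMember("text") && doc["text"].IsString())
        std::cout << "body length: " << doc["text"].GetStringLength() << "\n";
}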
Figure 1 – Sample Search Engine System Architecture
The index handler, the workhorse of the search engine, is responsible for:
● Reading from and writing to the main word index. You'll be creating an inverted file index, which stores references from
each element to be indexed to the corresponding document(s) in which that element appears.
● Creating and maintaining an index of ORGANIZATION entities and an index of PERSON entities.
● Searching the inverted file index based on requests from the query processor.
● Storing other data with each indexed item (such as word frequency or entity frequency).
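A minimal sketch of what an index handler's interface might look like follows; the class and method names are illustrative assumptions, not part of the assignment.

#include <string>
#include <vector>

// Hypothetical interface sketch -- names are illustrative, not prescribed.
class IndexHandler {
public:
    // Record that `word` occurred in document `docId` (e.g., a file path or UUID).
    void addWord(const std::string& word, const std::string& docId);
    void addOrganization(const std::string& org, const std::string& docId);
    void addPerson(const std::string& person, const std::string& docId);

    // Return the IDs of all documents containing `word`.
    std::vector<std::string> getDocsForWord(const std::string& word) const;

    // Persistence hooks (see "Index Persistence" below).
    void saveToDisk(const std::string& path) const;
    void loadFromDisk(const std::string& path);
};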
The document parser/processor is responsible for the following tasks:
● Processing each news article in the corpus. The dataset contains one news article per file. Each document is in
JSON format. Processing of an article involves the following steps:
○ Removing stopwords from the articles. Stopwords are words that appear so commonly in text that they
provide little useful information about a document's relevance to a query. Example stop
words include “a”, “the”, and “if”. One possible list of stop words to use for this project can be found at
http://www.webconfs.com/stop-words.php. You may use other stop word lists you find online.
○ Stemming words. Stemming2 refers to removing certain grammatical modifications to words. For instance,
the stemmed version of “running” may be “run”. For this project, you may make use of any previously
implemented stemming algorithm that you can find online.
■ One such algorithm is the Porter Stemming algorithm. More information as well as
implementations can be found at http://tartarus.org/~martin/PorterStemmer/.
■ Another option is http://www.oleandersolutions.com/stemming/stemming.html.
■ A C++ implementation of Porter 2: https://bitbucket.org/smassung/porter2_stemmer/src.
● Computing/maintaining information for relevancy ranking. You’ll have to design and implement some algorithm to
determine how to rank the results that will be returned from the execution of a query. You can make use of the metadata
provided, important words in the articles (look up the term frequency–inverse document frequency, or tf-idf, metric), and/or a
combination of several metrics. A sketch of stop-word removal, stemming, and tf-idf appears after this list.
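The sketch below strings the three processing steps together on one tiny string. The stop-word list is a stub, the stemmer is a crude stand-in for a real Porter stemmer, and the corpus size N and document frequency df are made-up placeholders, so treat it as an illustration of the pipeline shape rather than working project code.

#include <algorithm>
#include <cctype>
#include <cmath>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Stub stop-word list; a real build would load a full list from a file.
static const std::set<std::string> kStopWords = {"a", "the", "if", "and", "of"};

// Crude stand-in stemmer: strips a trailing "ing" or "s".
// A real Porter stemmer is far more careful than this.
std::string stem(std::string w) {
    if (w.size() > 4 && w.compare(w.size() - 3, 3, "ing") == 0) {
        w.erase(w.size() - 3);
        if (w.size() > 1 && w[w.size() - 1] == w[w.size() - 2])
            w.pop_back();                      // "running" -> "run"
    } else if (w.size() > 3 && w.back() == 's') {
        w.pop_back();                          // "networks" -> "network"
    }
    return w;
}

int main() {
    std::string article = "The networks running the markets";
    std::map<std::string, int> termFreq;       // per-document term counts
    std::istringstream in(article);
    std::string word;
    int kept = 0;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (kStopWords.count(word)) continue;  // drop stop words
        ++termFreq[stem(word)];                // stem, then count
        ++kept;
    }
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)).
    // N and df are placeholders here; real values come from the corpus.
    const double N = 300000.0, df = 1500.0;
    for (const auto& [term, tf] : termFreq)
        std::cout << term << " tf-idf = "
                  << (tf / double(kept)) * std::log(N / df) << "\n";
}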
The query processor is responsible for:
● Parsing queries entered by the user (for example, the Boolean query “AND computer security” described below).
● Interacting with the index handler to retrieve the matching documents and returning them in ranked order.
The Index
The inverted file index4 is a data structure that relates each unique word from the corpus to the document(s) in which it
appears. It allows for efficient execution of a query to quickly determine in which documents a particular query term
appears. For instance, let's assume we have the following documents:
• doc d1 = Computer network security
• doc d2 = network cryptography
• doc d3 = database security
The inverted file index for these documents would contain, at a very minimum, the following:
• computer = d1
• network = d1, d2
• security = d1, d3
• cryptography = d2
• database = d3
The query “AND computer security” would find the intersection of the documents that contained computer and the
documents that contained security.
• set of documents containing computer = d1
• set of documents containing security = d1, d3
• the intersection of the set of documents containing computer AND security = d1
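A minimal sketch of this intersection, assuming the inverted index is held in a std::map of posting sets (one plausible representation; an AVL-based map is what the assignment actually calls for):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <string>

int main() {
    // The toy index from the example above: word -> set of document IDs.
    std::map<std::string, std::set<std::string>> index = {
        {"computer",     {"d1"}},
        {"network",      {"d1", "d2"}},
        {"security",     {"d1", "d3"}},
        {"cryptography", {"d2"}},
        {"database",     {"d3"}},
    };

    // "AND computer security": intersect the two posting sets.
    const auto& a = index["computer"];
    const auto& b = index["security"];
    std::set<std::string> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(result, result.end()));
    for (const auto& d : result)
        std::cout << d << "\n";               // prints: d1
}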
Mechanics of Implementation
● This project must be implemented using an object-oriented design methodology.
● You are free to use as much of the C++ standard library as you would like. In fact, I encourage you to make
generous use of it. You may use other libraries as well, subject to the caveat below.
○ You must implement your own version of an AVL tree. You may, of course, refer to other
implementations for guidance, but you MAY NOT incorporate a complete implementation from another
source. (A minimal node-and-rotation sketch appears at the end of this section.)
● You should research and use the RapidJSON parser. See https://rapidjson.org/ for more info. The alternative
is to create your own parser from scratch (which isn’t as bad as it sounds).
● All of your code must be properly documented and formatted.
● Include an explanation of how the code is used.
● Each class should be separated into interface and implementation (.h and .cpp) files unless templated.
● Each file should have appropriate header comments to include the owner of the class and a history of
updates/modifications to the class.
● Check-in: you should have a complete AVL tree implementation and be well on your way to parsing all the documents.
● Parsing speed check-in: collect and report parsing timing data.
● Final submission: the complete project with a full user interface, including a thorough explanation of how the code is used.
● The check-in and speed-check deliverables must be clearly labeled.
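Because the AVL tree must be your own work, only the node shape and one rotation are sketched here as a way to frame the implementation; everything else (insertion with rebalancing, the mirror-image rotation, traversal) is deliberately left out, and all names are illustrative.

#include <algorithm>
#include <string>
#include <vector>

// Minimal AVL sketch for mapping a word to the documents containing it.
struct AvlNode {
    std::string word;                  // key
    std::vector<std::string> docs;     // posting list for this word
    AvlNode* left = nullptr;
    AvlNode* right = nullptr;
    int height = 1;
};

int height(const AvlNode* n) { return n ? n->height : 0; }
int balanceFactor(const AvlNode* n) { return n ? height(n->left) - height(n->right) : 0; }
void updateHeight(AvlNode* n) { n->height = 1 + std::max(height(n->left), height(n->right)); }

// Right rotation around y, used when y's left subtree is too tall.
// The mirror-image left rotation is analogous.
AvlNode* rotateRight(AvlNode* y) {
    AvlNode* x = y->left;
    y->left = x->right;
    x->right = y;
    updateHeight(y);
    updateHeight(x);
    return x;                          // new root of this subtree
}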
Index Persistence
The index must also be persistent once it is created. This means:
● the contents of the index should be written to disk when requested by the user;
● the contents of the persistent index should be read in when requested by the user, replacing any data that is
currently indexed in memory;
● reading the contents of the persistent index should be much faster than reparsing all the data from scratch;
● the user should have the option of clearing the persistent index and starting over;
● you may use separate files for words, organizations, and persons.
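As one way to approach this, here is a sketch of a save/load round trip using a hypothetical one-line-per-word text format (the word followed by its document IDs). A real index would also need to persist frequencies and the entity indexes; this only shows the basic mechanism.

#include <fstream>
#include <map>
#include <set>
#include <sstream>
#include <string>

using Index = std::map<std::string, std::set<std::string>>;

// Write one line per word: the word, then the IDs of documents containing it.
void saveIndex(const Index& index, const std::string& path) {
    std::ofstream out(path);
    for (const auto& [word, docs] : index) {
        out << word;
        for (const auto& d : docs) out << ' ' << d;
        out << '\n';
    }
}

// Rebuild the in-memory index from the file, replacing whatever was there.
Index loadIndex(const std::string& path) {
    Index index;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string word, doc;
        ss >> word;
        while (ss >> doc) index[word].insert(doc);
    }
    return index;
}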