Did It Make The News?
Overview
Millions of people visit news websites daily. In some cases, a knowledge worker might need the ability to focus a search on
more specific topics or industries. One very relevant and recent example related to Covid-19 is a dataset named CORD-19. A
virologist or biochemist might be interested in targeted searches of the ~500,000 scientific articles it contains. Similarly,
someone who works in the financial markets might be interested in a search engine that operates on financial news source
data. And that is the type of data you’re going to use in this project.
For this project, you’re going to build a search engine for a large collection of financial news articles from Jan - May 2018.
The dataset contains more than 300,000 articles.
You can download the dataset from Kaggle at https://www.kaggle.com/jeet2016/us-financial-news-articles. You will need to
make a Kaggle account to download it. Note that the download is around 1.3 GB and the uncompressed dataset is around 2.5
GB.
The files containing the news articles are in JSON format. JSON is a “lightweight data interchange format”
(https://www.json.org/json-en.html) that is easily understood by both humans and machines. There are a number of open
source JSON parsing libraries available. The “officially supported parser” for this project is RapidJSON.
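To give a feel for RapidJSON, here is a minimal sketch that parses one article file. The file name and the "title"/"text" field names are assumptions for illustration; verify them against the actual dataset files.

#include <fstream>
#include <iostream>
#include "rapidjson/document.h"
#include "rapidjson/istreamwrapper.h"

int main() {
    // "article.json" is a placeholder for one file from the dataset.
    std::ifstream ifs("article.json");
    rapidjson::IStreamWrapper isw(ifs);

    rapidjson::Document doc;
    doc.ParseStream(isw);
    if (doc.HasParseError()) {
        std::cerr << "parse error\n";
        return 1;
    }
    // Field names assumed here; check the actual dataset files.
    if (doc.HasMember("title") && doc["title"].IsString())
        std::cout << "title: " << doc["title"].GetString() << "\n";
    if (doc.HasMember("text") && doc["text"].IsString())
        std::cout << "body length: " << doc["text"].GetStringLength() << "\n";
}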
Figure 1 – Sample Search Engine System Architecture
The index handler, the workhorse of the search engine, is responsible for:
● Reading from and writing to the main word index. You'll be creating an inverted file index, which stores references from
each element to be indexed to the corresponding document(s) in which that element appears.
● Creating and maintaining an index of ORGANIZATION entities and an index of PERSON entities.
● Searching the inverted file index based on requests from the query processor.
● Storing other data with each indexed item (such as word frequency or entity frequency).
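A minimal sketch of what an index handler's interface might look like follows; the class and method names are illustrative assumptions, not part of the assignment.

#include <string>
#include <vector>

// Hypothetical interface sketch -- names are illustrative, not prescribed.
class IndexHandler {
public:
    // Record that `word` occurred in document `docId` (e.g., a file path or UUID).
    void addWord(const std::string& word, const std::string& docId);
    void addOrganization(const std::string& org, const std::string& docId);
    void addPerson(const std::string& person, const std::string& docId);

    // Return the IDs of all documents containing `word`.
    std::vector<std::string> getDocsForWord(const std::string& word) const;

    // Persistence hooks (see "Index Persistence" below).
    void saveToDisk(const std::string& path) const;
    void loadFromDisk(const std::string& path);
};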
The document parser/processor is responsible for the following tasks:
● Processing each news article in the corpus. The dataset contains one news article per file. Each document is in
JSON format. Processing of an article involves the following steps:
○ Removing stopwords from the articles. Stopwords are words that appear so commonly in text that they
provide little useful information about a document's relevance to a query. Example stop
words include “a”, “the”, and “if”. One possible list of stop words to use for this project can be found at
http://www.webconfs.com/stop-words.php. You may use other stop word lists you find online.
○ Stemming words. Stemming2 refers to removing certain grammatical modifications to words. For instance,
the stemmed version of “running” may be “run”. For this project, you may make use of any previously
implemented stemming algorithm that you can find online.
■ One such algorithm is the Porter Stemming algorithm. More information as well as
implementations can be found at http://tartarus.org/~martin/PorterStemmer/.
■ Another option is http://www.oleandersolutions.com/stemming/stemming.html.
■ A C++ implementation of Porter 2: https://bitbucket.org/smassung/porter2_stemmer/src.
● Computing/maintaining information for relevancy ranking. You’ll have to design and implement some algorithm to
determine how to rank the results that will be returned from the execution of a query. You can make use of the metadata
provided, important words in the articles (look up the term frequency–inverse document frequency, or tf-idf, metric), and/or a
combination of several metrics. A sketch of stop-word removal, stemming, and tf-idf appears after this list.
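The sketch below strings the three processing steps together on one tiny string. The stop-word list is a stub, the stemmer is a crude stand-in for a real Porter stemmer, and the corpus size N and document frequency df are made-up placeholders, so treat it as an illustration of the pipeline shape rather than working project code.

#include <algorithm>
#include <cctype>
#include <cmath>
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Stub stop-word list; a real build would load a full list from a file.
static const std::set<std::string> kStopWords = {"a", "the", "if", "and", "of"};

// Crude stand-in stemmer: strips a trailing "ing" or "s".
// A real Porter stemmer is far more careful than this.
std::string stem(std::string w) {
    if (w.size() > 4 && w.compare(w.size() - 3, 3, "ing") == 0) {
        w.erase(w.size() - 3);
        if (w.size() > 1 && w[w.size() - 1] == w[w.size() - 2])
            w.pop_back();                      // "running" -> "run"
    } else if (w.size() > 3 && w.back() == 's') {
        w.pop_back();                          // "networks" -> "network"
    }
    return w;
}

int main() {
    std::string article = "The networks running the markets";
    std::map<std::string, int> termFreq;       // per-document term counts
    std::istringstream in(article);
    std::string word;
    int kept = 0;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        if (kStopWords.count(word)) continue;  // drop stop words
        ++termFreq[stem(word)];                // stem, then count
        ++kept;
    }
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)).
    // N and df are placeholders here; real values come from the corpus.
    const double N = 300000.0, df = 1500.0;
    for (const auto& [term, tf] : termFreq)
        std::cout << term << " tf-idf = "
                  << (tf / double(kept)) * std::log(N / df) << "\n";
}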
The query processor is responsible for:
● Parsing queries entered by the user (for example, the Boolean query “AND computer security” described below).
● Interacting with the index handler to retrieve the matching documents and returning them in ranked order.
The Index
The inverted file index4 is a data structure that relates each unique word from the corpus to the document(s) in which it
appears. It allows for efficient execution of a query to quickly determine in which documents a particular query term
appears. For instance, let's assume we have the following documents:
• doc d1 = Computer network security
• doc d2 = network cryptography
• doc d3 = database security
The inverted file index for these documents would contain, at a very minimum, the following:
• computer = d1
• network = d1, d2
• security = d1, d3
• cryptography = d2
• database = d3
The query “AND computer security” would find the intersection of the documents that contained computer and the
documents that contained security.
• set of documents containing computer = d1
• set of documents containing security = d1, d3
• the intersection of the set of documents containing computer AND security = d1
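A minimal sketch of this intersection, assuming the inverted index is held in a std::map of posting sets (one plausible representation; an AVL-based map is what the assignment actually calls for):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <string>

int main() {
    // The toy index from the example above: word -> set of document IDs.
    std::map<std::string, std::set<std::string>> index = {
        {"computer",     {"d1"}},
        {"network",      {"d1", "d2"}},
        {"security",     {"d1", "d3"}},
        {"cryptography", {"d2"}},
        {"database",     {"d3"}},
    };

    // "AND computer security": intersect the two posting sets.
    const auto& a = index["computer"];
    const auto& b = index["security"];
    std::set<std::string> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(result, result.end()));
    for (const auto& d : result)
        std::cout << d << "\n";               // prints: d1
}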
Mechanics of Implementation
● This project must be implemented using an object-oriented design methodology.
● You are free to use as much of the C++ standard library as you would like. In fact, I encourage you to make
generous use of it. You may use other libraries as well, subject to the caveat below.
○ You must implement your own version of an AVL tree. You may, of course, refer to other
implementations for guidance, but you MAY NOT incorporate a complete implementation from another
source. (A minimal node-and-rotation sketch appears at the end of this section.)
● You should research and use the RapidJSON parser. See https://rapidjson.org/ for more info. The alternative
is to create your own parser from scratch (which isn’t as bad as it sounds).
● All of your code must be properly documented and formatted.
● Include an explanation of how the code is used.
● Each class should be separated into interface and implementation (.h and .cpp) files unless templated.
● Each file should have appropriate header comments to include the owner of the class and a history of
updates/modifications to the class.
● Check-in: you should have a complete AVL tree implementation and be well on your way to parsing all the documents.
● Parsing speed check-in: collect and report parsing timing data.
● Final submission: the complete project with a full user interface, including a thorough explanation of how the code is used.
● The check-in and speed-check deliverables must be clearly labeled.
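Because the AVL tree must be your own work, only the node shape and one rotation are sketched here as a way to frame the implementation; everything else (insertion with rebalancing, the mirror-image rotation, traversal) is deliberately left out, and all names are illustrative.

#include <algorithm>
#include <string>
#include <vector>

// Minimal AVL sketch for mapping a word to the documents containing it.
struct AvlNode {
    std::string word;                  // key
    std::vector<std::string> docs;     // posting list for this word
    AvlNode* left = nullptr;
    AvlNode* right = nullptr;
    int height = 1;
};

int height(const AvlNode* n) { return n ? n->height : 0; }
int balanceFactor(const AvlNode* n) { return n ? height(n->left) - height(n->right) : 0; }
void updateHeight(AvlNode* n) { n->height = 1 + std::max(height(n->left), height(n->right)); }

// Right rotation around y, used when y's left subtree is too tall.
// The mirror-image left rotation is analogous.
AvlNode* rotateRight(AvlNode* y) {
    AvlNode* x = y->left;
    y->left = x->right;
    x->right = y;
    updateHeight(y);
    updateHeight(x);
    return x;                          // new root of this subtree
}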
Index Persistence
The index must also be persistent once it is created. This means:
● the contents of the index should be written to disk when requested by the user;
● the contents of the persistent index should be read in when requested by the user, replacing any data that is
currently indexed in memory;
● reading the contents of the persistent index should be much faster than reparsing all the data from scratch;
● the user should have the option of clearing the persistent index and starting over;
● you may use separate files for words, organizations, and persons.
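As one way to approach this, here is a sketch of a save/load round trip using a hypothetical one-line-per-word text format (the word followed by its document IDs). A real index would also need to persist frequencies and the entity indexes; this only shows the basic mechanism.

#include <fstream>
#include <map>
#include <set>
#include <sstream>
#include <string>

using Index = std::map<std::string, std::set<std::string>>;

// Write one line per word: the word, then the IDs of documents containing it.
void saveIndex(const Index& index, const std::string& path) {
    std::ofstream out(path);
    for (const auto& [word, docs] : index) {
        out << word;
        for (const auto& d : docs) out << ' ' << d;
        out << '\n';
    }
}

// Rebuild the in-memory index from the file, replacing whatever was there.
Index loadIndex(const std::string& path) {
    Index index;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string word, doc;
        ss >> word;
        while (ss >> doc) index[word].insert(doc);
    }
    return index;
}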