Hadoop Exercise To Create An Inverted Index
Objectives:
We’ll be using a subset of 74 files from a total of 408 files (text extracted from HTML tags) derived from the
Stanford WebBase project, available here. It was obtained from a web crawl done in February 2007, and it
is one of the largest collections, totaling more than 100 million web pages from more than 50,000 websites.
This version has been cleaned for the purpose of this assignment.
These files will be placed in a bucket on your Google cloud storage and the Hadoop job will be instructed to
read the input from this bucket.
https://drive.google.com/drive/u/1/folders/1Z4KyalIuddPGVkIm6dUjkpD_FiXyNIcq
a. Download the data from the Google Drive link above; you may use your USC account to get access.
The compressed full data is around 1.1 GB. Uncompressed, it is 3.12 GB of data for the files for this
project.
b. Unzip the contents. You will find two folders inside named ‘development’ and ‘full data’. Each
of the folders contains the actual data (txt files). We suggest you use the development data
initially while you are testing your code. Using the full data will take up to a few minutes for each
run of the Map-Reduce job, and you may risk spending all your cloud credits while testing the
code.
c. Click on ‘Dataproc’ in the left navigation menu. Next, locate the address of the default
Google Cloud Storage staging bucket for your cluster, shown in Figure 1 below. If you’ve previously
disabled billing, you need to re-enable it before you can upload the data. Refer to the “Enable
and Disable Billing account” section to see how to do this.
Figure 1: The default Cloud Storage staging bucket for your cluster
d. Go to the Storage section in the left navigation bar and select your cluster’s default bucket from
the list of buckets. At the top you should see the menu items UPLOAD FILES, UPLOAD FOLDER,
CREATE FOLDER, etc. (Figure 2). Click on the UPLOAD FOLDER button and upload the
dev_data folder and the full_data folder individually. This will take a while, but there will be a
progress bar (Figure 3). You may not see this progress bar as soon as you start the upload, but
it will show up eventually.
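If the browser upload is slow, the same upload can be done from the command line with gsutil, which is
available in Google Cloud Shell. This is an optional alternative sketch; the bucket name below is a
placeholder for your cluster’s default bucket:
$ gsutil -m cp -r ./dev_data gs://<your-default-bucket>/dev_data
$ gsutil -m cp -r ./full_data gs://<your-default-bucket>/full_data
The -m flag uploads files in parallel, which helps with the large number of text files.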
Refer to the examples below and write a Map-Reduce job in Java that creates an inverted index given a
collection of text files. You can very easily tweak a word-count example to create an inverted index instead
(Hint: change the mapper to output (word, docID) pairs instead of (word, count) pairs, and in the reducer use
a HashMap to aggregate per-document counts; a sketch follows the format description below).
Here are some helpful examples of Map-Reduce jobs:
1. https://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
2. https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
The example in the following pages explains a Hadoop word count implementation in detail. It takes one text file
as input and returns the word count for every word in the file. Refer to the comments in the code for explanation.
(The word count code appeared here as listings of the Mapper class, the Reducer class, and the Main class;
a reconstructed sketch is given below.)
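Since the listings themselves did not survive the conversion to text, here is a minimal reconstruction of the
classic Hadoop word count, closely following the Apache MapReduce tutorial linked above; treat it as a sketch
rather than the handout’s exact code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The Mapper class: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // one output pair per word occurrence
      }
    }
  }

  // The Reducer class: sums the 1s emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // final (word, totalCount) pair
    }
  }

  // The Main class: wires the mapper and reducer into a Job and runs it.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}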
We’ve already cleaned up the input data so you don’t have to worry about any stray characters. Each
input file consists of text that has been cleared of ‘\n\r’, ‘\n’ and all but one ‘\t’. The only ‘\t’ separates
the key (Document ID) from the value (Document). The input files are in a key-value format as below
(<TAB> stands for the tab character):
DocumentID<TAB>document
Sample document:
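The sample image did not survive the conversion to text. An illustrative line in the same format,
reconstructed from the description below (the actual file’s text differs), might be:
5722018411<TAB>one aspect of modern economics is that economics relies on data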
The above example indicates that the word aspect occurred 1 time in the document with docID 5722018411
and economics 2 times.
The reducer takes this as input, aggregates the word counts using a HashMap, and creates the inverted index.
The format of the index is as follows.
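The sample image is likewise missing. For the document above, entries might plausibly read as follows;
the docID:count delimiter shown here is an assumption, so use whatever format the course expects:
aspect<TAB>5722018411:1
economics<TAB>5722018411:2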
The above sample shows a portion of the inverted index created by the reducer.
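Putting the pieces together, here is a minimal sketch of the mapper and reducer for the inverted index,
assuming the input format above. These two classes can replace TokenizerMapper and IntSumReducer in the
word-count skeleton earlier; the class names, the lowercasing, and the docID:count output format are
illustrative choices, not requirements:

// Additional imports needed: java.util.HashMap, java.util.Map

// Mapper: each input line is "docID<TAB>document"; emit (word, docID) per occurrence.
public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
  private final Text word = new Text();
  private final Text docId = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split only on the first tab: docID on the left, document text on the right.
    String[] parts = value.toString().split("\t", 2);
    if (parts.length < 2) return;   // skip malformed lines
    docId.set(parts[0]);
    StringTokenizer itr = new StringTokenizer(parts[1].toLowerCase());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, docId);   // (word, docID) instead of (word, 1)
    }
  }
}

// Reducer: count occurrences per document with a HashMap, then render the postings.
public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Map<String, Integer> counts = new HashMap<>();
    for (Text docId : values) {
      counts.merge(docId.toString(), 1, Integer::sum);
    }
    // Render as "docID1:count1 docID2:count2 ..." (delimiters assumed, as noted above).
    StringBuilder postings = new StringBuilder();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (postings.length() > 0) postings.append(' ');
      postings.append(entry.getKey()).append(':').append(entry.getValue());
    }
    context.write(key, new Text(postings.toString()));
  }
}

In the main class you would also change setOutputValueClass to Text.class and drop the combiner line,
since the reducer’s input and output value types no longer match IntWritable. Note that TextOutputFormat
writes a tab between the word and the postings list by default, which matches the note about tabs later in
this handout.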
To write the Hadoop Java code you can use the vi or nano editors that come pre-installed on the master
node. You can test your code on the cluster itself. Be sure to use the development data while testing the
code. You are expected to write a simple Hadoop job. You can just tweak this example if you’d like, but make
sure you understand it first.
1. Say your Java job file is called InvertedIndex.java. Create a JAR as follows:
○ hadoop com.sun.tools.javac.Main InvertedIndex.java
If you get the following notes you can ignore them:
Note: InvertedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
○ jar cf invertedindex.jar InvertedIndex*.class
Now you have a jar file for your job. You need to place this jar file in the default cloud bucket of your
cluster. Just create a folder called JAR on your bucket and upload the jar to that folder. If you created your
jar file on the cluster’s master node itself, use the following commands to copy it to the JAR folder.
○ hadoop fs -copyFromLocal ./invertedindex.jar
○ hadoop fs -cp ./invertedindex.jar gs://dataproc-69070.../JAR
The gs://dataproc-69070.../JAR part of the last command is the default bucket of your cluster. It needs to
be prefixed with gs:// to tell the Hadoop environment that it is a bucket and not a regular location on the
filesystem.
Note: This is not the only way to package your code into a jar file. You can follow any method that will create
a single jar file that can be uploaded to the Google cloud.
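For instance, if you built the jar on your own machine, a single gsutil command can upload it directly;
the bucket name here is again a placeholder:
$ gsutil cp ./invertedindex.jar gs://<your-default-bucket>/JAR/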
If you’d like to submit the job via the command line, follow the instructions here:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
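As a rough sketch, from the master node the jar can be run directly with hadoop jar; the class name and
the paths below are placeholders for your own:
$ hadoop jar invertedindex.jar InvertedIndex gs://<your-default-bucket>/full_data gs://<your-default-bucket>/output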
Follow the instructions below to submit a job to the cluster via the console’s UI.
1. Go to the “Jobs” section in the left navigation bar of the Dataproc page and click on “Submit job”.
2. Fill in the job submission form (see Figure 5) as follows:
○ Jar File: Full path to the jar file you uploaded earlier to the Google storage bucket. Don’t forget
the gs://
○ Main Class or jar: The name of the Java class that contains your mapper and reducer.
○ Arguments: This takes two arguments
i. Input: Path to the input data you uploaded
ii. Output: Path to the storage bucket followed by a new folder name. The folder is created
during execution. You will get an error if you give the name of an existing folder.
○ Leave the rest at their default settings.
Figure 5: Job submission details
3. Submit the job. It will take quite a while; please be patient. You can see the progress in the job’s
status section.
NOTE: If you encounter a java.lang.InterruptedException you can safely ignore it.
Your job will still execute.
4. Once the job executes, copy all the log entries that were generated to a text file called log.txt. You
need to submit this log along with the Java code. You need to do this only for the job you run on the
full data; no need to submit the logs for the dev_data.
5. The output files will be stored in the output folder on the bucket. If you open this folder you’ll notice
that the inverted index is in several segments. (Delete the _SUCCESS file in the folder before merging
all the output files.)
To merge the output files, run the following commands in the master node’s command line (SSH):
○ hadoop fs -getmerge gs://dataproc-69070458-bbe2-.../output ./output.txt
○ hadoop fs -copyFromLocal ./output.txt
○ hadoop fs -cp ./output.txt gs://dataproc-69070458-bbe2-.../output.txt
The output.txt file in the bucket contains the full Inverted Index for all the files.
Use grep to search for the words mentioned in the submissions section. Using grep is the fastest way to get
the entries associated with the words.
Note: the whitespace following the word (e.g., 'little') is actually a tab rather than a space.
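For example, with GNU grep, anchoring on the word plus a literal tab (written \t under the -P flag) keeps a
search for 'house' from also matching lines for 'household'; the file names are the ones used above:
$ grep -P '^house\t' ./output.txt >> ./index.txt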
Submission Instructions:
1. Include all the code that you have written (Java) and the log file created for the full-data job submission.
2. Also include the inverted index file for the document “5722018442.txt”.
3. Create a text file named index.txt and include the index entries for the following words
a. incidence
b. documentation
c. tendency
d. standard
e. university
f. interpret
g. house
h. friend
Add the full line from the index, including the word itself.
4. Also submit a screenshot of the output folder for the full data run in GCP.
5. Also submit the log file generated from running the job on the full data.
6. Do NOT submit your full index.
7. Compress your code and the text file into a single zip archive and name it index.zip. Use a
standard zip format and not zipx, rar, ace, etc.
8. To submit your file electronically to the csci572 account enter the following command from your UNIX
prompt:
$ submit -user csci572 -tag hw3 index.zip
FAQ:
Q) Is it fine to submit only one .java file which has all the classes (Mapper and Reducer) inside it?
A) One .java file containing your entire program is good enough.
Q) Approximately how long does it take for a submitted job to finish in GCloud Dataproc?
A) It takes approximately 10 minutes.
Q) Should the index be in the same (sorted) order, or can it be different (unsorted)?
A) Order does not matter. The accuracy of results is important.
Q) My code runs fine on the development data but produces a strange file size on the full data.
A) Check whether running on dev_data produces huge file sizes as well. If so, you need to check your code.
If not, check whether your full_data was uploaded correctly.
Q) I'm getting this error repeatedly, but I've already created the output directory and have set the argument
path to that directory. Can someone help me with it?
A) You need to delete the output folder: the driver attempts to create the output folder from the argument
provided, and the job fails if that folder already exists.
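For example, something like this (run on the master node; the path is illustrative) clears a stale output
folder before re-running the job:
$ hadoop fs -rm -r gs://<your-default-bucket>/output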
Q) I am able to run on the dev_data and it generates results, but when I run the same code on the full data
I get an error. The job runs until the map phase reaches about 25% and then throws an error.
A) Please check that all the files uploaded correctly; you should have 74 files in full_data.
Q) Did anyone run into a situation where, under Dataproc > Clusters > (name of cluster instance)
> VM instances > SSH, the only available option is to use another SSH client?
A) You probably didn't start the VM instances. Every time you disable and re-enable billing, you need
to start the VMs manually.
Important Points: