Word Count (2021)

This document provides a comprehensive guide on implementing a Word Count application using Hadoop 3.2.1 and the MapReduce technique. It includes step-by-step instructions for setting up the environment, compiling Java code, creating JAR files, and running the Word Count process on sample text files. Additionally, it covers advanced features such as handling punctuation and utilizing a pattern file for improved word counting.


IM5211701 – Big Data Analytics

and Applications
Word Count Example on
Hadoop 3.2.1

Instructor: Chao-Lung Yang, Ph.D.


Department of Industrial
Management
National Taiwan University of
Science and Technology

C.-L. Yang, Big Data, NTUST IM 1


Introduction
 This exercise will “count” words in multiple documents
 WordCount is a simple application that counts the number of occurrences of each word in a given input set.
 This exercise uses Hadoop 3.2.1 and the MapReduce technique
 The Java code can be downloaded from the moodle system



Inputs and Outputs
 The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
 Input and Output types of a MapReduce job:
 (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
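The chain above can be imitated in plain Java (a toy, in-memory sketch for intuition only; no Hadoop involved, and the byte-offset keys k1 are omitted):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {
    // Counts words across lines the way the map -> combine -> reduce chain
    // does, collapsed into a single in-memory pass for illustration.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // final <k3, v3> pairs
        for (String line : lines) {                    // each line is a <k1, v1> value
            for (String word : line.split("\\s+")) {   // map emits <word, 1>
                counts.merge(word, 1, Integer::sum);   // combine/reduce sum the 1s
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop")));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In the real job the three stages run as separate, distributed phases; here they collapse into one loop because everything fits in one map.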



Create your own WordCount.java
 Create a folder in your HOME
 mkdir /home/hadoop/java
 cd /home/hadoop/java
 Download WordCount.java and WordCount2.java from the moodle system (under the Wordcount Example folder) to your /home/hadoop/java



Compile JAVA and Create JAR



Compile Your First WordCount Java File
 In your /home/hadoop/java
 Create a new folder called wordcount_classes
 $ mkdir /home/hadoop/java/wordcount_classes
 Go to your java folder
 $ cd /home/hadoop/java
 Compile your java file to create class files
 $ hadoop com.sun.tools.javac.Main -d wordcount_classes WordCount.java



Compile Result



Create Your jar file
 JAR (Java ARchive) is an archive file format typically used to aggregate many Java class files and associated metadata and resources
 An executable Java program can be packaged in a JAR file
 $ cd /home/hadoop/java
 $ jar cvf WordCount.jar -C wordcount_classes/ .
 (there is a space between / and .)




First WordCount Example



Create Some Text for WordCount
 Assuming that
 Input directory in HDFS: /user/hadoop/wordcount/input
 Output directory in HDFS: /user/hadoop/wordcount/output
 Create folders on HDFS
 hadoop fs -mkdir /user/hadoop/wordcount
 hadoop fs -mkdir /user/hadoop/wordcount/input



Create Two Text Files on Ubuntu File System
 file01 contains the text below
 Hello World Bye World
 file02 contains the text below
 Hello Hadoop Goodbye Hadoop
 You can create them with nano or download them from the moodle system



Upload Text Files to HDFS
 You can upload file01 and file02 to HDFS by hadoop command
 hadoop fs -copyFromLocal file01 file02 /user/hadoop/wordcount/input



Check Text
 You can check the contents of file01 and file02 on HDFS by hadoop command
 hadoop fs -cat /user/hadoop/wordcount/input/file01
 hadoop fs -cat /user/hadoop/wordcount/input/file02



Run Word Count by Your WordCount.jar File
 Make sure wordcount/output does not exist on HDFS (it will be created every time we run WordCount)
 hadoop fs -rm -r /user/hadoop/wordcount/output
 Execute WordCount.jar (located at your /home/hadoop/java)
 hadoop jar WordCount.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output
Check Results
 Print out the result of “word count”
 hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000



Explanation of WordCount Code



Main
 The main function contains the running sequence of the job



Mapper
 The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat.
 It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>.
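The tokenize-and-emit step can be sketched in plain Java (illustration only; the real map method works on Hadoop's Text/IntWritable types and writes pairs to a Context rather than returning a list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapSketch {
    // Tokenizes one input line on whitespace and emits a <word, 1> pair
    // per token, mimicking what WordCount's map method does.
    static List<String> map(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add("< " + itr.nextToken() + ", 1>");
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The first input line, file01's contents
        map("Hello World Bye World").forEach(System.out::println);
        // prints < Hello, 1> / < World, 1> / < Bye, 1> / < World, 1>
    }
}
```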



Results of Mapper
 Output of the first map (file01):
 < Hello, 1>
 < World, 1>
 < Bye, 1>
 < World, 1>
 Output of the second map (file02):
 < Hello, 1>
 < Hadoop, 1>
 < Goodbye, 1>
 < Hadoop, 1>



Combiner
 WordCount also specifies a combiner.
 Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
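The local aggregation can be sketched in plain Java (illustration only; a TreeMap stands in for the framework's sort on keys):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombineSketch {
    // Sums the 1s emitted by a single map locally, with keys kept sorted,
    // mimicking what the combiner does before data crosses the network.
    static Map<String, Integer> combine(List<String> mappedWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : mappedWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        // Output of the first map (file01): Hello World Bye World
        System.out.println(combine(List.of("Hello", "World", "Bye", "World")));
        // prints {Bye=1, Hello=1, World=2}
    }
}
```

Because the combiner runs per map, each map's output shrinks before the shuffle; the reducer still merges results across maps.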



Result of Combiner
 Combiner output of the first map:
 < Bye, 1>
 < Hello, 1>
 < World, 2>
 Combiner output of the second map:
 < Goodbye, 1>
 < Hadoop, 2>
 < Hello, 1>



Reducer
 The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e., the words in this example).
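The summing step can be sketched in plain Java (illustration only; the real reduce method receives an Iterable<IntWritable> per key from the framework):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    // For each key, sums the list of values the framework grouped together,
    // mimicking what WordCount's reduce method does.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((word, values) ->
                out.put(word, values.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("Hello", List.of(1, 1)); // one value from each map's combiner
        grouped.put("World", List.of(2));
        System.out.println(reduce(grouped));
        // prints {Hello=2, World=2}
    }
}
```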



Result of Reducer
 < Bye, 1>
 < Goodbye, 1>
 < Hadoop, 2>
 < Hello, 2>
 < World, 2>



Second WordCount Example



WordCount2.java
 You can download WordCount2.java from the moodle system
 This version uses many of the features provided by the MapReduce framework and adds functions that improve the word count process



Compile Your Second WordCount Java File
 Clean up your wordcount_classes folder
 $ rm wordcount_classes/*.*
 Compile your java file to create class files
 $ hadoop com.sun.tools.javac.Main -d wordcount_classes WordCount2.java



Create Your WordCount2.jar
 An executable Java program can be packaged in a
JAR file
 $ cd /home/hadoop/java
 $ jar cvf WordCount2.jar -C wordcount_classes/ .
 (there is a space between / and .)



Create More Complicated Text
 Create text files that contain some punctuation marks
 newfile01 contains the text below
 Hello World, Bye World!
 newfile02 contains the text below
 Hello Hadoop, Goodbye to hadoop.
 You can create them with nano or download them from the moodle system



Upload New Text Files on HDFS
 Clean the old files first
 hadoop fs -rm /user/hadoop/wordcount/input/*
 You can upload newfile01 and newfile02 to HDFS by hadoop command
 hadoop fs -copyFromLocal newfile01 newfile02 /user/hadoop/wordcount/input



Check Text
 You can check the contents of newfile01 and newfile02 on HDFS by hadoop command
 hadoop fs -cat /user/hadoop/wordcount/input/newfile01
 hadoop fs -cat /user/hadoop/wordcount/input/newfile02



Rerun Word Count by Your WordCount2.jar File
 Make sure wordcount/output does not exist on HDFS (it will be created every time we run WordCount)
 hadoop fs -rm -r /user/hadoop/wordcount/output
 Execute WordCount2.jar (located at your /home/hadoop/java)
 hadoop jar WordCount2.jar WordCount2 /user/hadoop/wordcount/input /user/hadoop/wordcount/output
Check Results of WordCount2
 Print out the result of “word count”
 hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000



How to Remove Punctuation
 With WordCount2.java, we can plug in a “pattern file” which lists the word patterns to be ignored, via the DistributedCache.
 Create a local patterns.txt (or download it from moodle)
 nano patterns.txt
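For reference, the patterns.txt used in the Apache MapReduce tutorial looks like the fragment below; each line is a pattern to skip (the escaped entries strip punctuation marks, and the plain word "to" is dropped entirely). The copy on the moodle system may differ:

```
\.
\,
\!
to
```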



Upload patterns.txt to HDFS
 You can upload patterns.txt to HDFS by
hadoop command
 hadoop fs -copyFromLocal patterns.txt /user/hadoop/wordcount



Rerun WordCount2.jar with -skip
 Make sure wordcount/output does not exist on HDFS (it will be created every time we run WordCount)
 hadoop fs -rm -r /user/hadoop/wordcount/output
 Execute WordCount2.jar (located at your /home/hadoop/java)
 hadoop jar WordCount2.jar WordCount2 -Dwordcount.case.sensitive=true /user/hadoop/wordcount/input /user/hadoop/wordcount/output -skip /user/hadoop/wordcount/patterns.txt



Check Results
 Print out the result of “word count”
 hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000

No punctuation is counted!
More DATA



Download Some Text Files for Testing
 There are a lot of free ebooks on the internet, e.g. Gutenberg.org (http://www.gutenberg.org/)
 mkdir /home/hadoop/gutenberg
 cd /home/hadoop/gutenberg
 wget http://www.gutenberg.org/files/57100/57100-0.txt
 wget http://www.gutenberg.org/files/1342/1342-0.txt
 wget http://www.gutenberg.org/files/74/74-0.txt
 wget http://www.gutenberg.org/files/98/98-0.txt
 wget http://www.gutenberg.org/files/219/219-0.txt
 …



Copy Text to HDFS
 Create a folder in HDFS called ‘gutenberg’
 Copy all txt files from /home/hadoop/gutenberg to HDFS /user/hadoop/gutenberg (it will copy the whole folder to HDFS)
 hadoop fs -mkdir /user/hadoop
 hadoop fs -mkdir /user/hadoop/gutenberg
 hadoop fs -copyFromLocal /home/hadoop/gutenberg/*.* /user/hadoop/gutenberg
Ps. hadoop is your account name. It might be different for everyone.
Run WordCount.jar on gutenberg
 Make sure wordcount-output does not exist on HDFS (it will be created every time we run WordCount)
 hadoop fs -rm -r wordcount-output
 Execute WordCount.jar (located at your /home/hadoop/java)
 hadoop jar WordCount.jar WordCount /user/hadoop/gutenberg /user/hadoop/wordcount-output
Check the Results
 You can check the mapreduce wordcount result on HDFS
 hadoop fs -cat /user/hadoop/wordcount-output/part-r-00000
 You can copy it to a local folder
 hadoop fs -getmerge /user/hadoop/wordcount-output /home/hadoop/java/wordcount-output.txt



Update patterns-more.txt
 You can create an updated patterns-more.txt to skip more punctuation
 Once finished, upload patterns-more.txt to HDFS
 hadoop fs -copyFromLocal patterns-more.txt /user/hadoop/wordcount



Rerun WordCount2.jar on gutenberg with -skip
 Make sure wordcount-output does not exist on HDFS (it will be created every time we run WordCount)
 hadoop fs -rm -r wordcount-output
 Execute WordCount2.jar (located at your /home/hadoop/java)
 hadoop jar WordCount2.jar WordCount2 -Dwordcount.case.sensitive=true /user/hadoop/gutenberg /user/hadoop/wordcount-output -skip /user/hadoop/wordcount/patterns-more.txt



Check the Results
 You can check the mapreduce wordcount result on HDFS
 hadoop fs -cat /user/hadoop/wordcount-output/part-r-00000
 You can copy it to a local folder
 hadoop fs -getmerge /user/hadoop/wordcount-output /home/hadoop/java/wordcount-output-skip.txt



References
 https://hadoop.apache.org/docs/r3.0.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

