
Pr. ROCHD
2022-2023

TP 3

Big Data Tutorial 3: distributed wordcount

Hugues Talbot
November 18, 2019

1 Big data tutorial 3

1.1 Wordcount on Hadoop using Python

Lesson 1, Introduction to Map/Reduce Module: Running Wordcount with streaming, using Python code

1. Open a Terminal (right-click on the Desktop, or click the Terminal icon in the top toolbar).
2. Review the following two sections to create the Python code.

1.1.1 Section 1: wordcount_mapper.py

[1]: #!/usr/bin/env python

#the above just indicates to use python to interpret this file

# ---------------------------------------------------------------
#This mapper code will input a line of text and output <word, 1>
# ---------------------------------------------------------------

import sys   #a python module with system functions for this OS

# ------------------------------------------------------------
# this 'for loop' will set 'line' to an input line from system
# standard input file
# ------------------------------------------------------------
for line in sys.stdin:
    # -----------------------------------
    # 'sys.stdin' reads a line from standard input; note that
    # 'line' is a string object, i.e. a variable, and it has
    # methods that you can apply to it, as in the next line
    # -----------------------------------
    line = line.strip()   #strip is a method, i.e. a function, associated
                          # with string variables; it strips
                          # the carriage return (by default)
    keys = line.split()   #split the line at blanks (by default),
                          # and return a list of keys
    for key in keys:      #a for loop through the list of keys
        value = 1
        print('{0}\t{1}'.format(key, value))
        #the {0},{1} are replaced by the 0th and 1st items in the
        # format list; also note that the Hadoop default is that a
        # tab separates the key from the value
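
Before running on Hadoop, you can test the mapper by itself from the terminal, since Hadoop streaming simply feeds text to it on standard input. This is a minimal local check; it assumes you have saved the code above as wordcount_mapper.py in the current directory and made it executable, as described further below:

echo "far far away" | ./wordcount_mapper.py

which should print one <word, 1> pair per word:

far     1
far     1
away    1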

1.1.2 Section 2: wordcount_reducer.py

The reducer code has some basic parts; see the comments in the code. The Lesson 2 assignment
will have a similar basic structure.
[3]: #!/usr/bin/env python

# ---------------------------------------------------------------
#This reducer code will input a line of text and
# output <word, total-count>
# ---------------------------------------------------------------
import sys

last_key = None        #initialize these variables
this_key = None
running_total = 0

# -----------------------------------
# Loop through the input file
# -----------------------------------
for input_line in sys.stdin:
    input_line = input_line.strip()

    # --------------------------------
    # Get the next word
    # --------------------------------
    this_key, value = input_line.split("\t", 1)
    #the Hadoop default is that a tab separates key and value;
    # split returns a list of strings, in this case unpacked
    # into 2 variables

    value = int(value)   #int() will convert a string to an integer
                         # (this program does no error checking)

    # ---------------------------------
    # Key check part:
    # if the current key is the same as
    # the last one, consolidate;
    # otherwise emit
    # ---------------------------------
    if last_key == this_key:    #check if the key has changed
                                # ('==' is the logical equality check)
        running_total += value  #add value to the running total
    else:
        if last_key:    #if the key that was just read in is different,
                        # and the previous (i.e. last) key is not empty,
                        # then output the previous <key, running-count>
            print("{0}\t{1}".format(last_key, running_total))
            #hadoop expects tab (i.e. '\t') separation
        running_total = value   #reset the values
        last_key = this_key

if last_key == this_key:    #emit the final <key, running-count>
    print("{0}\t{1}".format(last_key, running_total))

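Because Hadoop sorts the map output by key before it reaches the reducer, you can simulate the whole map/reduce job locally by putting a sort between the two scripts. A minimal sketch, again assuming both scripts are saved in the current directory and executable:

echo "far far away" | ./wordcount_mapper.py | sort | ./wordcount_reducer.py

which should print the accumulated counts:

away    1
far     2
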

[ ]: NOTE: If you have not programmed in Python, please read the following notes:

# 1 indentation is required to indicate blocks of code
# 2 all code to be executed as part of some flow control
#   (e.g. 'if' or 'for' statements) must have the same indentation
#   (to be safe, use 4 spaces per indentation level, and don't
#   mix spaces with tabs)
# 3 flow-control statements have a ':' before
#   the corresponding block of code
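
As a minimal illustration of these three rules (a hypothetical snippet, not part of the wordcount scripts):

for word in ['far', 'away']:   # rule 3: the ':' introduces the block
    print(word)                # rules 1 and 2: 4-space indentation marks the block
print('done')                  # back to zero indentation: outside the loop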

You can cut and paste the mapper and reducer code above into text files, as follows, from the terminal prompt in the Cloudera VM.
Type the following to open a text editor, then cut and paste the lines above for wordcount_mapper.py into the editor, save, and exit. Repeat for wordcount_reducer.py:
gedit wordcount_mapper.py
gedit wordcount_reducer.py
Enter the following to check that the indentations line up as above:
more wordcount_mapper.py
more wordcount_reducer.py
Enter the following to make both scripts executable:
chmod +x wordcount_mapper.py
chmod +x wordcount_reducer.py
Enter the following to see what directory you are in:
pwd
It should be /home/cloudera, or something like that.

1.1.3 Section 3. Create some data:

echo "A long time ago in a galaxy far far away" > /home/cloudera/testfile1
echo "Another episode of Star Wars" > /home/cloudera/testfile2

1.1.4 Section 4. Create a directory on the HDFS file system (if it already exists, that's OK):

hdfs dfs -mkdir /user/cloudera/input

1.1.5 Section 5. Copy the files from the local filesystem to the HDFS filesystem:

hdfs dfs -put /home/cloudera/testfile1 /user/cloudera/input
hdfs dfs -put /home/cloudera/testfile2 /user/cloudera/input

1.1.6 Section 6. You can see your files on HDFS:

hdfs dfs -ls /user/cloudera/input

1.1.7 Section 7. Run the Hadoop WordCount example with the input and output specified.

Note that your file paths may differ. The '\' just means the command continues on the next line.

[ ]: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py
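
On this single-node VM the absolute local paths above work, because the mapper and reducer scripts are visible to the node that runs the tasks. On a real cluster you would typically also ship the scripts to the worker nodes with the streaming -file option; a sketch of the same command written that way (same paths as above):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /home/cloudera/wordcount_mapper.py -mapper wordcount_mapper.py \
-file /home/cloudera/wordcount_reducer.py -reducer wordcount_reducer.py \
-input /user/cloudera/input \
-output /user/cloudera/output_new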

Hadoop prints out a lot of logging and error information. If the job runs, you will see something like the following scroll by on the screen:

[ ]: ...
INFO mapreduce.Job: map 0% reduce 0%
INFO mapreduce.Job: map 67% reduce 0%
INFO mapreduce.Job: map 100% reduce 0%
INFO mapreduce.Job: map 100% reduce 100%
INFO mapreduce.Job: Job job_1442937183788_0003 completed successfully
...

1.1.8 Section 8. Check the output file to see the results:

hdfs dfs -cat /user/cloudera/output_new/part-00000
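
With the two test files created in Section 3 and a single reducer, the output should look something like the following (Hadoop sorts keys in byte order, so capitalized words come before lowercase ones; your exact output depends on your input):

A        1
Another  1
Star     1
Wars     1
a        1
ago      1
away     1
episode  1
far      2
galaxy   1
in       1
long     1
of       1
time     1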

1.1.9 Section 9. View the output directory:

hdfs dfs -ls /user/cloudera/output_new

Look at the files there and check their contents, e.g.:

hdfs dfs -cat /user/cloudera/output_new/part-00000

1.1.10 Section 10. Streaming options:

Try:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar --help

or see hadoop.apache.org/docs/r1.2.1/

Let's change the number of reduce tasks to see its effects. Setting it to 0 will execute no reducer and only produce the map output; the -reducer option is then effectively ignored. (Note that the output directory is changed in the snippet below, because Hadoop refuses to overwrite an existing output directory.)

[ ]: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new_0 \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py \
-numReduceTasks 0

Get the output file from this run, and then upload it:

hdfs dfs -getmerge /user/cloudera/output_new_0/* wordcount_num0_output.txt

Try to notice the differences between the output when the reducers are run in Step 9, versus the output when there are no reducers and only the mapper is run in this step. The point of the task is to be aware of what the intermediate results look like. A successful result will have words and counts that are not accumulated (accumulation is what the reducer performs). Hopefully, this will help you get a sense of how data and tasks are split up in the map/reduce framework; we will build upon that in the next lesson.
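
For comparison, the map-only output for testfile1 would look something like the sketch below: one <word, 1> pair per occurrence, in input order, with nothing accumulated (note that far appears twice with count 1, instead of once with count 2):

A       1
long    1
time    1
ago     1
in      1
a       1
galaxy  1
far     1
far     1
away    1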

1.1.11 Section 11. Change the number of reducers to 2

When you use 2 reducers instead of 1 reducer, what is the difference in global sort order?
• With 1 reducer, but not 2 reducers, the word counts are in global sort order by word.
• With 2 reducers, but not 1 reducer, the word counts are in global sort order by word.
• With 1 reducer or 2 reducers, the word counts are in global sort order by word.
• With 1 reducer or 2 reducers, the word counts are NOT in global sort order by word.
