Run Python MapReduce On Local Docker Hadoop Cluster - DEV Community

This document discusses running a Python MapReduce program on a local Docker Hadoop cluster. It provides instructions for setting up the cluster using Docker Compose, deploying the necessary Python mapper and reducer files, and executing a basic word count MapReduce job on sample input text files. The program counts the frequency of words in the files by mapping words to a count of 1 and reducing the counts. Running the Python code in Hadoop requires using the Hadoop Streaming library to interface the Python with the Java framework.

Uploaded by

Ahmed Mohamed

Run Python MapReduce on local Docker Hadoop Cluster
#hadoop #mapreduce

Boyu ・ Oct 5, 2020 ・ Updated on Oct 20, 2020 ・ 4 min read

Introduction
This post covers how to deploy a local Docker Hadoop cluster and run custom Python
mapper and reducer functions on it, using the classic word count example.

Environment Setup
Docker, get Docker here
Docker Compose, get Docker Compose here
Git, get Git here

Deploy Hadoop Cluster using Docker

We will use the Docker images from the big-data-europe/docker-hadoop repository to set up Hadoop.

git clone git@github.com:big-data-europe/docker-hadoop.git

With the Docker image for Hadoop on your local machine, we can use docker-compose
to configure the local Hadoop cluster. Replace the docker-compose.yml file with the
file from this GitHub Gist.
This docker-compose file configures a Hadoop cluster with a master node (namenode)
and three worker nodes. It also configures the network ports to allow communication
between the nodes. To start the cluster, run:

docker-compose up -d

Use docker ps to verify that the containers are up. You should see a container list
similar to the following:

IMAGE                          PORTS                   NAMES
docker-hadoop_resourcemanager                          resourcemanager
docker-hadoop_nodemanager1     0.0.0.0:8042->8042/tcp  nodemanager1
docker-hadoop_historyserver    0.0.0.0:8188->8188/tcp  historyserver
docker-hadoop_datanode3        9864/tcp                datanode3
docker-hadoop_datanode2        9864/tcp                datanode2
docker-hadoop_datanode1        9864/tcp                datanode1
docker-hadoop_namenode         0.0.0.0:9870->9870/tcp  namenode
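
Checking the list by eye is easy to get wrong, so here is a minimal sketch of the same check in Python. It assumes the container names shown in the listing above and uses the `--format` option of `docker ps` to print one name per line. The helper names `missing_containers` and `running_container_names` are made up for this illustration.

```python
import subprocess

# Container names taken from the `docker ps` listing above (an assumption:
# your docker-compose.yml may name them differently).
EXPECTED = {
    "namenode", "datanode1", "datanode2", "datanode3",
    "resourcemanager", "nodemanager1", "historyserver",
}


def missing_containers(names_output, expected=EXPECTED):
    """Return the expected container names that are absent from the output."""
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return sorted(set(expected) - running)


def running_container_names():
    """Ask Docker for the names of running containers, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage on a machine with Docker installed:
#   print(missing_containers(running_container_names()))
```

An empty list means every expected container is running.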

The current status of the local Hadoop cluster is available at localhost:9870.

Running Python MapReduce function

For this simple MapReduce program, we will use the classic word count example.
The program reads text files and counts how often each word occurs.
The mapper function reads the text and emits one key-value pair per word, which in
this case is <word, 1>. Copy the following code into mapper.py:

#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

The reducer function processes the result from the mapper and returns the word
count. Copy the following code into reducer.py

#!/usr/bin/env python
"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking for None avoids printing "None 0" when the input is empty)
if current_word is not None and current_word == word:
    print('%s\t%s' % (current_word, current_count))
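
The manual bookkeeping with current_word and current_count can also be written with itertools.groupby. The following is only a sketch of that alternative, not the code used in the rest of this post. It relies on the same assumption as reducer.py, namely that the input is already sorted by word. The function names `parse` and `reduce_counts` are invented for this example.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs from tab-delimited lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if not sep:
            # no tab on this line, so it is not mapper output
            continue
        try:
            yield word, int(count)
        except ValueError:
            # count was not a number, ignore the line
            continue


def reduce_counts(lines):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Used as a streaming reducer, the body would be roughly:
#   import sys
#   for word, total in reduce_counts(sys.stdin):
#       print('%s\t%s' % (word, total))
```

groupby only merges adjacent equal keys, which is exactly why the sorted-input guarantee matters here.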

Note that because Hadoop is written in Java, a MapReduce job is normally submitted
as a Java JAR file. To execute Python in Hadoop, we need to use the Hadoop Streaming
library, which pipes data through the Python executables via standard input and
standard output. This is why both scripts read their input from STDIN.
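
To make the data flow concrete, here is a rough in-memory imitation of what Hadoop Streaming does with the two scripts: map each line, sort the mapper output by key, then reduce. It is a sketch for local experimentation only. The function names are hypothetical, and Python's string sort is only an approximation of Hadoop's key ordering.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' line per word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%s" % (word, 1)


def reducer(sorted_lines):
    """Sum the counts of adjacent equal words, like reducer.py."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield "%s\t%s" % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield "%s\t%s" % (current_word, current_count)


def run_streaming_locally(lines):
    """Mimic map -> sort by key -> reduce on in-memory text."""
    mapped = sorted(mapper(lines), key=lambda l: l.split("\t", 1)[0])
    return list(reducer(mapped))
```

Running it on a couple of short lines before touching the cluster is a cheap way to catch logic errors in the map and reduce steps.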
Copy the local mapper.py and reducer.py to the namenode:

docker cp LOCAL_PATH/mapper.py namenode:mapper.py

docker cp LOCAL_PATH/reducer.py namenode:reducer.py

Enter the namenode container of the Hadoop cluster:

docker exec -it namenode bash

Run ls and you should find mapper.py and reducer.py in the namenode container.
Now let's prepare the input. For this simple example, we will use a set of text files,
each containing a short string. For a more realistic example, you can use an e-book
from Project Gutenberg; download it in the Plain Text UTF-8 encoding.

mkdir input
echo "Hello World" >input/f1.txt
echo "Hello Docker" >input/f2.txt
echo "Hello Hadoop" >input/f3.txt
echo "Hello MapReduce" >input/f4.txt
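
With real prose such as a Project Gutenberg e-book, splitting on whitespace alone would count "Hello", "hello" and "hello," as three different words. One possible refinement, shown here as a sketch and not part of the mapper used in this post, is to lowercase each line and keep only letters, digits and apostrophes. The `tokenize` helper and its regular expression are assumptions chosen for illustration.

```python
import re

# Hypothetical token pattern: runs of lowercase letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(line):
    """Lowercase a line and return its tokens with punctuation stripped."""
    return WORD_RE.findall(line.lower())
```

In mapper.py this would replace the `line.split()` call. The pattern only handles ASCII text, so accented characters would need a broader pattern.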

The MapReduce program accesses files from the Hadoop Distributed File System
(HDFS). Run the following to transfer the input directory and files to HDFS:

hadoop fs -mkdir -p input

hdfs dfs -put ./input/* input

Use find / -name 'hadoop-streaming*.jar' to locate the Hadoop Streaming library JAR
file. The path should look something like PATH/hadoop-streaming-3.2.1.jar.
Finally, we can execute the MapReduce program:

hadoop jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
-file mapper.py -mapper mapper.py \
-file reducer.py -reducer reducer.py \
-input input -output output
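
Once the job finishes, the result is written to the output directory in HDFS as tab-delimited lines, which you can typically print with `hdfs dfs -cat output/part-00000` (the file name is an assumption that holds for a single reducer). The sketch below parses such text into a dictionary and shows the counts that the four sample input files should produce. `parse_counts` is a made-up helper name.

```python
def parse_counts(text):
    """Turn 'word<TAB>count' lines into a {word: count} dictionary."""
    counts = {}
    for line in text.splitlines():
        word, sep, count = line.partition("\t")
        if sep:
            counts[word] = int(count.strip())
    return counts


# Counts implied by the four sample files f1.txt to f4.txt created earlier.
EXPECTED_COUNTS = {
    "Docker": 1,
    "Hadoop": 1,
    "Hello": 4,
    "MapReduce": 1,
    "World": 1,
}
```

Comparing the parsed job output against EXPECTED_COUNTS is a quick sanity check that the cluster, the scripts and the input upload all worked.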

To safely shut down the cluster and remove containers, run:

docker-compose down

Reference
Yen V. (2019). How to set up a Hadoop cluster in Docker.
Retrieved from: here
Noll M. Writing An Hadoop MapReduce Program In Python.
Retrieved from: here
