Run Python MapReduce On Local Docker Hadoop Cluster
#hadoop #mapreduce
Introduction
This post covers how to deploy a local Dockerized Hadoop cluster and run custom Python
mapper and reducer functions on it, using the classic word count example.
Environment Setup
Docker
Docker Compose
Git
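The compose setup below assumes a Hadoop Docker project is already on your machine. The original post does not name one, but a commonly used starting point (an assumption here, not confirmed by the post) is the big-data-europe/docker-hadoop repository:
git clone https://github.com/big-data-europe/docker-hadoop
cd docker-hadoop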
With the Docker image for Hadoop on your local machine, we can use docker-compose
to configure the local Hadoop cluster. Replace the docker-compose.yml file with the
following file from this GitHub Gist.
This docker-compose file configures a Hadoop cluster with a master node (the namenode)
and three worker nodes (datanodes). It also maps the network ports that allow communication
between the nodes. To start the cluster, run:
docker-compose up -d
Use docker ps to verify the containers are up; you should see the namenode and the
three worker containers listed, each with a status of Up.
The current status of the local Hadoop cluster will be available at localhost:9870
The mapper function reads text from STDIN and emits each word with a count of 1. Copy the following code into mapper.py:
#!/usr/bin/env python
"""mapper.py"""
import sys

# Emit a tab-separated <word> 1 pair for every word read from STDIN
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
The reducer function aggregates the sorted output of the mapper and emits the final
count for each word. Copy the following code into reducer.py:
#!/usr/bin/env python
"""reducer.py"""
import sys

current_word = None
current_count = 0
word = None

# Hadoop sorts the mapper output by key, so all counts for a given
# word arrive on consecutive lines of STDIN.
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # skip lines where the count is not a number
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, count

# Flush the count for the final word
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
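Before moving anything to the cluster, you can sanity-check both scripts with a local shell pipe; the sort step stands in for Hadoop's shuffle-and-sort phase:
echo "Hello World Hello" | python mapper.py | sort -k1,1 | python reducer.py
# Expected output:
# Hello   2
# World   1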
Note that because Hadoop itself is written in Java, a MapReduce job normally takes
a Java JAR file as input. To execute Python in Hadoop, we will need to use the
Hadoop Streaming library, which pipes data between the Python executables and the
Java framework. As a result, our Python scripts read their input from STDIN.
Copy the local mapper.py and reducer.py to the namenode:
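A minimal sketch, assuming the master container is named namenode (check the NAMES column of docker ps):
docker cp mapper.py namenode:/mapper.py
docker cp reducer.py namenode:/reducer.py
# Open a shell inside the namenode container
docker exec -it namenode bash
# Make both scripts executable so Hadoop Streaming can invoke them directly
chmod +x mapper.py reducer.py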
Run ls and you should find mapper.py and reducer.py in the namenode container.
Now let's prepare the input. For this simple example, we will use a set of text files,
each containing a short string. For a more realistic example, you can use an e-book from
Project Gutenberg; download the Plain Text UTF-8 encoding.
mkdir input
echo "Hello World" >input/f1.txt
echo "Hello Docker" >input/f2.txt
echo "Hello Hadoop" >input/f3.txt
echo "Hello MapReduce" >input/f4.txt
The MapReduce program accesses files from the Hadoop Distributed File System
(HDFS). Run the following to transfer the input directory and its files to HDFS:
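A sketch of the transfer, run inside the namenode container; the /user/root target directory is an assumption based on the container's default user:
hdfs dfs -mkdir -p /user/root        # create a home directory in HDFS if needed
hdfs dfs -put input /user/root/input # copy the local input directory into HDFS
hdfs dfs -ls /user/root/input        # verify the files arrived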
Use find / -name 'hadoop-streaming*.jar' to locate the Hadoop Streaming library JAR
file. The path should look something like PATH/hadoop-streaming-3.2.1.jar.
Finally, we can execute the MapReduce program:
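A sketch of the streaming invocation, run inside the namenode container and assuming the JAR path reported by the find command above and the HDFS paths created earlier:
hadoop jar PATH/hadoop-streaming-3.2.1.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/root/input \
    -output /user/root/output
# Inspect the word counts once the job completes
hdfs dfs -cat /user/root/output/part-00000
The -files option ships both scripts to every task's working directory, and the shebang line lets Hadoop execute them directly.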
When the job is done, exit the namenode container and shut down the cluster:
docker-compose down
References
Yen V. (2019). How to set up a Hadoop cluster in Docker.
Noll M. Writing An Hadoop MapReduce Program In Python.