
Common Hadoop commands

Commands:
ls: This command is used to list all the files. Use lsr (replaced by -ls -R in recent releases; see below) for a recursive listing, which is useful when you
want the full hierarchy of a folder.
Syntax:
hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so bin/hdfs
means we want the hdfs executable, in particular its dfs (Distributed File System) commands.
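
As mentioned above, recent Hadoop releases replace the deprecated lsr with the -R flag. To list the full hierarchy recursively:
hdfs dfs -ls -R /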

mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first
create it.
Syntax:
hdfs dfs -mkdir <folder name>
creating home directory:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/username -> replace username with your computer's username
Example:
hdfs dfs -mkdir /geeks -> '/' means an absolute path
hdfs dfs -mkdir geeks2 -> a relative path; the folder will be created relative to the home directory
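
You can confirm that the directories were created with the ls command covered above:
hdfs dfs -ls /
hdfs dfs -ls /user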

touchz: It creates an empty file.


Syntax:
hdfs dfs -touchz <file_path>
Example:
hdfs dfs -touchz /geeks/myfile.txt

copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is one of the
most important commands. Local filesystem means the files present on the OS.
Syntax:
hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder geeks
present on hdfs.
hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
OR
hdfs dfs -put ../Desktop/AI.txt /geeks
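
To confirm the copy, list the target folder on HDFS:
hdfs dfs -ls /geeks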

cat: To print file contents.


Syntax:
hdfs dfs -cat <path>
Example:
// print the content of AI.txt present inside the geeks folder
hdfs dfs -cat /geeks/AI.txt

copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
hdfs dfs -copyToLocal /geeks ../Desktop/hero
OR
hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on Desktop.

Note: Observe that we don't write bin/hdfs when checking things present on the local filesystem; ordinary shell commands are used there.
moveFromLocal: This command will move file from local to hdfs.
Syntax:
hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
hdfs dfs -cp /geeks /geeks_copied
mv: This command is used to move files within HDFS. Let's cut and paste the file myfile.txt from the geeks
folder to geeks_copied.
Syntax:
hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
hdfs dfs -mv /geeks/myfile.txt /geeks_copied

rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you
want to delete a non-empty directory.
Syntax:
hdfs dfs -rmr <filename/directoryName>
Example:
hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the
directory then the directory itself.
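Note that rmr is deprecated in recent Hadoop releases; the modern equivalent is rm with the -r flag:
hdfs dfs -rm -r /geeks_copied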
du: It will give the size of each file in the directory.
Syntax:
hdfs dfs -du <dirName>
Example:
hdfs dfs -du /geeks
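The -h flag prints sizes in human-readable units (K, M, G) instead of raw bytes:
hdfs dfs -du -h /geeks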

dus: This command will give the total size of a directory/file.


Syntax:
hdfs dfs -dus <dirName>
Example:

hdfs dfs -dus /geeks
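
Like rmr, dus is deprecated in recent releases; the modern equivalent is du with the -s (summary) flag:
hdfs dfs -du -s /geeks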

stat: It will give the last modified time of a directory or file. In short, it will give the stats of the
directory or file.
Syntax:
hdfs dfs -stat <hdfs file>
Example:
hdfs dfs -stat /geeks
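stat also accepts a format string, e.g. %y for the modification time, %b for the size in bytes and %r for the replication factor:
hdfs dfs -stat "%y %b %r" /geeks/AI.txt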
setrep: This command is used to change the replication factor of a file/directory in HDFS. By
default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.
hdfs dfs -setrep -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS.
hdfs dfs -setrep -R 4 /geeks
Note: The -w flag means wait until the replication is complete. The -R flag means recursive; we use it
for directories, since they may contain many files and folders inside them.
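
After changing the replication factor, you can verify the new value with the %r format of stat described above:
hdfs dfs -stat %r geeks.txt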

Note: There are more commands in HDFS but we discussed the commands which are commonly
used when working with Hadoop. You can check out the list of dfs commands using the
following command:
hdfs dfs
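For a detailed description of any single command, use the built-in help:
hdfs dfs -help ls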
How to set up a Hadoop cluster in Docker

Introduction
Apache Hadoop is a popular big data framework that is widely used in the software industry.
As a distributed system, Hadoop runs on clusters ranging from a single node to thousands of
nodes.
If you want to test out Hadoop, or don’t currently have access to a big Hadoop cluster network,
you can set up a Hadoop cluster on your own computer, using Docker. Docker is a popular
independent software container platform that allows you to build and ship your applications,
along with all their environments, libraries and dependencies, in containers. The containers are
portable, so you can set up the exact same system on another machine by running some simple
Docker commands. Thanks to Docker, it’s easy to build, share and run your application
anywhere, without having to depend on the current operating system configuration.
For example, if you have a laptop that is running Windows but need to set up an application that
only runs on Linux, thanks to Docker, you don’t need to install a new OS or set up a virtual
machine. You can set up a Docker container containing all the libraries you need and delete it the
moment you are done with your work.
In this tutorial, we will set up a 3-node Hadoop cluster using Docker and run the classic Hadoop
Word Count program to test the system.

1. Setting up Docker
If you don’t already have Docker installed, you can install it easily by following the instructions
on the official Docker homepage.
To check the version of your Docker Engine, Machine and Compose, use the following
commands:

$ docker --version
$ docker-compose --version
$ docker-machine --version

If this is your first time running Docker, test that things are working properly by launching your
first Dockerized web server:
$ docker run -d -p 80:80 --name myserver nginx

Since this is the first time you will run this command, and the image is not yet available offline,
Docker will pull it from the Docker Hub library. After everything is finished, visit
http://localhost to view the homepage of your new server.
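
Once you have confirmed the server works, you can stop and remove the test container (myserver is the name we gave it in the run command above):
$ docker stop myserver
$ docker rm myserver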

2. Setting up a Hadoop Cluster Using Docker


To install Hadoop in a Docker container, we need a Hadoop Docker image. To generate the
image, we will use the Big Data Europe repository. If Git is installed on your system, run the
following command; if not, simply download the compressed zip file to your computer:

$ git clone https://github.com/big-data-europe/docker-hadoop.git

Once we have the docker-hadoop folder on our local machine, we need to edit the docker-compose.yml
file to expose some listening ports and to change where Docker Compose pulls the images from
if we already have them locally (Docker will download the files and build the images the first
time we run it, but on subsequent runs we would rather reuse the existing images on disk instead
of rebuilding everything from scratch). Open the docker-compose.yml file and replace its content
with the following (you can also download or copy and paste from this Github Gist):
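
The gist itself is not reproduced here, but as a rough sketch of the kind of change involved (service name per the Big Data Europe compose file, port number per the namenode web UI visited later in this tutorial), exposing the namenode UI amounts to a port mapping in the namenode service:

namenode:
  ports:
    - "9870:9870"

The other services (datanodes, resourcemanager, historyserver, nodemanager) follow the same pattern with their own ports.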
To deploy a Hadoop cluster, use this command:

$ docker-compose up -d

Docker Compose is a powerful tool for setting up multiple containers at the same time. The
-d parameter tells Docker Compose to run the command in the background and give you
back your command prompt so you can do other things. With just the single command above, you
are setting up a Hadoop cluster with 3 slaves (datanodes), one HDFS namenode (the master
node that manages the datanodes), one YARN resourcemanager, one historyserver and one
nodemanager.
Docker Compose will pull the images from the Docker Hub library if they are not available
locally, build the images and start the containers. After it finishes, you can use this
command to check the currently running containers:
$ docker ps

Go to http://localhost:9870 to view the current status of the system from the namenode.
3. Testing your Hadoop cluster
Now we can test the Hadoop cluster by running the classic WordCount program.
Enter into the running namenode container by executing this command:
$ docker exec -it namenode bash
First, we will create some simple input text files to feed into the WordCount program:

$ mkdir input
$ echo "Hello World" >input/f1.txt
$ echo "Hello Docker" >input/f2.txt
Now create the input directory on HDFS:
$ hadoop fs -mkdir -p input
To put the input files to all the datanodes on HDFS, use this command:

$ hdfs dfs -put ./input/* input
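
You can verify that the files are now on HDFS:

$ hdfs dfs -ls input
$ hdfs dfs -cat input/f1.txt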


Download the example WordCount program from this link (here I'm downloading it to my
Documents folder, which is the parent directory of my docker-hadoop folder).
Now we need to copy the WordCount program from our local machine to our Docker namenode
container.
Use this command to find out the ID of your namenode container:

$ docker container ls

Copy the container ID of your namenode from the first column and use it in the following command
to start copying the jar file to your Docker Hadoop cluster:
$ docker cp ../hadoop-mapreduce-examples-2.7.1-sources.jar cb0c13085cd3:hadoop-
mapreduce-examples-2.7.1-sources.jar
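Since the compose file assigns the container the fixed name namenode (which is why docker exec -it namenode bash worked earlier), you can also address it by name instead of by ID:
$ docker cp ../hadoop-mapreduce-examples-2.7.1-sources.jar namenode:hadoop-mapreduce-examples-2.7.1-sources.jar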
Now you are ready to run the WordCount program from inside namenode:
root@namenode:/# hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar
org.apache.hadoop.examples.WordCount input output
To print out the WordCount program result:
root@namenode:/# hdfs dfs -cat output/part-r-00000
Docker 1
Hello 2
World 1
Congratulations, you just successfully set up a Hadoop cluster using Docker!
To safely shut down the cluster and remove containers, use this command:
$ docker-compose down
