Lab 1

Hadoop: HDFS
Rania Yangui

Objectives of the Lab


⁃ Introduction to the Hadoop Framework
⁃ Use of Docker to launch a Hadoop cluster of 3 nodes: 1 master and 2 slaves.
⁃ Learn the concepts and commands to properly manage files on HDFS.

Tools

⁃ Apache Hadoop [http://hadoop.apache.org/]
⁃ Docker [https://www.docker.com/]
⁃ Python
⁃ Unix-like or Unix-based systems (various Linux distributions and macOS)
Basic concepts
⁃ Apache Hadoop: an open-source framework for storing and processing large volumes of data on a cluster. It is used by many contributors and users.
⁃ HDFS (Hadoop Distributed File System): a distributed file system for storing very large files.
Hadoop and Docker

To deploy the Hadoop Framework, we will use Docker containers
[https://www.docker.com/]. Using containers guarantees consistency across development environments and considerably reduces both the complexity of configuring the machines and the runtime overhead, compared to using a virtual machine.

Installing Docker

Download the Windows version of Docker [Docker Desktop for Windows, docs.docker.com].

Prerequisites

⁃ Windows 10 64-bit: Pro, Enterprise, or Education (Build 16299 or later).
⁃ For Windows 10 Home, see Install Docker Desktop on Windows Home.
⁃ Hyper-V and Containers Windows features must be enabled.
⁃ The following hardware prerequisites are required to successfully run Client Hyper-V on Windows 10:
o 64-bit processor with Second Level Address Translation (SLAT)
o 4 GB of system RAM
o BIOS-level hardware virtualization support must be enabled in the BIOS settings. For more information, see Virtualization.
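
Once Docker Desktop is installed, a quick optional check (not part of the original handout) is to confirm that the Docker client and daemon respond from the command line:

docker --version
docker info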

Installing Hadoop containers

Throughout this lab, we will use three containers: one master node (NameNode) and two slave nodes (DataNodes).

To do this, you must have Docker installed on your machine and have it correctly
configured.

1. Open the command line, and type the following instructions:

docker pull csturm/hadoop-python:h3.2-p3.9.10-j11
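
To check that the image is now available locally, you can list it (an optional check):

docker images csturm/hadoop-python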

2. Create the three containers from the downloaded image. For that:

2.1 Create a network that will connect the three containers

docker network create --driver=bridge hadoop
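
You can confirm that the network exists (and, later, see which containers are attached to it) with these optional checks:

docker network ls
docker network inspect hadoop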

2.2 Create and launch the three containers (the -p option maps ports on the host machine to ports inside the container)

Master
docker run -itd --net=hadoop -p 8031:8031 --name hadoop-master --hostname hadoop-master csturm/hadoop-python:h3.2-p3.9.10-j11

Slaves
docker run -itd -p 8040:8042 --net=hadoop --name hadoop-slave1 --hostname hadoop-slave1 csturm/hadoop-python:h3.2-p3.9.10-j11

docker run -itd -p 8041:8042 --net=hadoop --name hadoop-slave2 --hostname hadoop-slave2 csturm/hadoop-python:h3.2-p3.9.10-j11
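
At this point the three containers should be up. An optional check is to list the running containers of the cluster:

docker ps --filter "name=hadoop"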

3. Go to the master container to start using it

docker exec -it hadoop-master bash

The result of this execution will be as follows: root@hadoop-master:~#

We will find ourselves in the NameNode shell, and we will be able to manipulate
the cluster as we wish. The first thing to do, once in the container, is to launch
Hadoop and YARN. A script is provided for this, called start-all.sh (in the sbin
folder). Run this script:
# ls -l
# cd hadoop
# cd sbin
# ls -l
# ./start-all.sh
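
Once the script has finished, you can verify that the daemons started. These checks are not in the original handout, but jps (which lists running Java processes) and the HDFS admin report are standard ways to do it; on the master you should typically see processes such as NameNode, SecondaryNameNode and ResourceManager:

# jps
# hdfs dfsadmin -report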

Getting Started with Hadoop: Manipulating Files on HDFS

All commands for interacting with HDFS begin with "hdfs dfs"; the options that
follow are largely inspired by standard Unix commands, as illustrated below.
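
For example (an illustrative sample, not an exhaustive list), the built-in help lists all available sub-commands, and familiar Unix-style operations work as expected:

# hdfs dfs -help
# hdfs dfs -ls /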
1. Create a directory in HDFS, called input. To do this, type:

# hdfs dfs -mkdir -p /user/root/input
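
The -p flag creates any missing parent directories. To verify that the directory now exists (an optional check):

# hdfs dfs -ls /user/root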

2. We will use the purchases.txt file as input for MapReduce processing.
[https://www.kaggle.com/datasets/dsfelix/purchasestxt?resource=download]

2.1 Leave the container and return to the local machine

exit

2.2 Copy the file from the local machine to the Docker container

docker cp c:/purchases.txt hadoop-master:/purchases.txt

2.3 Connect to container again

docker exec -it hadoop-master bash

2.4 Check that the file exists and display the end of its contents

# tail purchases.txt

2.5 Load the purchases file into the input directory you created

hdfs dfs -put purchases.txt /user/root/input

2.6 To display the contents of the input directory, the command is

hdfs dfs -ls /user/root/input
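
To inspect the file further in HDFS (optional commands, not in the original handout), you can display its size and preview its first lines:

hdfs dfs -du -h /user/root/input
hdfs dfs -cat /user/root/input/purchases.txt | head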
