Hadoop
Introduction
"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper)
Data center
Blade server
Connected machines
All these machines are connected to each other in order to share storage space and computing power.
Files and folders are organized in a tree (as in Unix), but the files are stored on a large number of machines in such a way that the exact location of a file is invisible to the user.
Install Hadoop 3.2.2
Step 5 : Update the hadoop-env command file
In the directory "Hadoop\hadoop-3.2.2\etc\hadoop", open the file "hadoop-env.cmd" and add the following commands at the end.
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
Install Hadoop 3.2.2
Step 6 : start hadoop
In the Command Prompt, execute the command "spark-shell". Then execute the command "for %I in (.) do echo %~sI".
This last command must be executed in the Java directory to display the short name of your installed JDK.
Use the short name to update the "hadoop-env.cmd" file.
Install Hadoop 3.2.2
Step 7 : start hadoop
HDFS
As with many file systems, each HDFS file is split into fixed-size blocks (128 MB by default; the block size is configurable, and 256 MB is also common). Depending on its size, a file will need a certain number of blocks; on HDFS, the last block of a file only occupies the remaining size.
The blocks of the same file are not necessarily all on the same machine. They are each copied to different machines in order to be accessed simultaneously by several processes. By default, each block is copied to 3 different machines (this is configurable).
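As a rough illustration of this block arithmetic (plain Python, not HDFS code; the 128 MB block size and replication factor of 3 are the defaults mentioned above, and hdfs_block_usage is just a hypothetical helper name):

import math

def hdfs_block_usage(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is split and replicated on HDFS."""
    nb_blocks = math.ceil(file_size_mb / block_size_mb)               # full blocks plus one partial block
    last_block_mb = file_size_mb - (nb_blocks - 1) * block_size_mb    # the last block only uses what remains
    stored_mb = file_size_mb * replication                            # each block is copied `replication` times
    return nb_blocks, last_block_mb, stored_mb

# Example: a 500 MB file -> 4 blocks, a last block of 116 MB, 1500 MB stored in total
print(hdfs_block_usage(500))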
• One of the machines is the HDFS master, called the namenode. This
machine contains all the file names and blocks, like a big phone book.
• Another machine is the secondary namenode, a kind of backup
namenode, which saves backups of the directory at regular intervals.
• Some machines are clients. These are the access points used to connect to the cluster and work with it.
• All other machines are datanodes. They store blocks of file content.
Diagram of HDFS nodes
MapReduce algorithms
Parallelization of Map
Value1=functionM(element1)
Value2=functionM(element2)
Value3=functionM(element3)
Value4=functionM(element4)
The four calculations can be done simultaneously, for example on 4 different machines, provided that the data has been copied there.
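A minimal sketch of the same idea in plain Python (not Hadoop; functionM is the placeholder used above, here simply squaring its input): each call is independent, so a process pool can run them on different workers, just as Hadoop runs map tasks on the datanodes that hold the data.

from concurrent.futures import ProcessPoolExecutor

def functionM(element):
    # placeholder for the per-element computation (here: squaring)
    return element * element

elements = [1, 2, 3, 4]   # element1 ... element4

if __name__ == "__main__":
    # the four calls are independent, so the pool may execute them in parallel
    with ProcessPoolExecutor(max_workers=4) as pool:
        values = list(pool.map(functionM, elements))
    print(values)   # [1, 4, 9, 16]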
Diagram: Data → Map → Reduce
YARN and MapReduce
What is YARN?
YARN (Yet Another Resource Negotiator) is the mechanism in Hadoop for managing jobs on a cluster of machines. YARN allows users to launch MapReduce jobs on data present in HDFS, to follow (monitor) their progress, and to retrieve the messages (logs) displayed by the programs.
If necessary, YARN can move a process from one machine to another in the event of a failure or when progress is deemed too slow. In practice, YARN is transparent to the user: we launch the execution of a MapReduce program and YARN ensures that it is executed as quickly as possible.
YARN and MapReduce
What is MapReduce?
MapReduce is a Java environment for writing programs that run on YARN. Java is not the simplest language for this: there are packages to import, class paths to provide...
There are several points to know:
Result
Workout 1
Some commands
hadoop version                        # display the Hadoop version
hadoop fs -mkdir /test                # create a new directory named "test"
hadoop fs -ls /                       # list the contents of the root directory
hadoop fs -copyFromLocal <localsrc> <hdfs destination>
                                      # copy a file from <localsrc> to the HDFS destination
hadoop fs -put <localsrc> <dest>      # copy a local file to the Hadoop file system
hadoop fs -get <src> <localdest>      # copy a file or directory from the Hadoop file system to the local file system
hadoop fs -cat /path_to_file_in_hdfs  # display the contents of a file stored in HDFS
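These commands can also be driven from a script. A small sketch in plain Python (assuming the hadoop binary is on the PATH; the hdfs helper and the file scores.txt are hypothetical, not part of the course material):

import subprocess

def hdfs(*args):
    """Run a 'hadoop fs' command and return its standard output."""
    result = subprocess.run(["hadoop", "fs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# a typical session using the commands listed above
hdfs("-mkdir", "-p", "/test")                    # create the directory
hdfs("-put", "scores.txt", "/test/scores.txt")   # hypothetical local file copied to HDFS
print(hdfs("-cat", "/test/scores.txt"))          # read it back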
Workout 2
MapReduce
Diagram: a namenode and p datanodes; the dataset of n tuples (Nb_tuples=n) is split into m blocks (Nb_blocks=m) of size Block_size, distributed over the p datanodes (Nb_datanodes=p).
Workout 2
Student score (example)
Tuple architecture
{'st_id': 89, 'sp': 'GLSD', 'math': 3.09, 'phy': 16.89, 'sci': 14.26, 'phyl': 12.45, 'geog': 19.15, 'eng': 14.1}
Block architecture
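The block layout is not reproduced on this slide; as an illustration consistent with the code on the following slides, a block is a dict with an 'id_bk' field and a 'data' list of student tuples. For example (values taken from the sample output shown later, abridged):

# hypothetical block holding the first five student tuples (block_size = 5)
block = {
    'id_bk': 0,
    'data': [
        {'st_id': 0, 'sp': 'GLSD', 'math': 16.25, 'phy': 11.24, 'sci': 12.98,
         'phyl': 19.56, 'geog': 13.32, 'eng': 7.6},
        # ... four more tuples with st_id 1..4 ...
    ],
}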
Workout 2
System architecture
The dataset generator is a Python function that takes nb_tuples, block_size, nb_dataNode, nb_copies and specialities as parameters, fills a data_nodes list with replicated blocks of tuples, and returns [data_nodes, nb_blocks]. The following slides walk through it step by step; a complete sketch is given after them.
Workout 2
System architecture
specialities=['ST','MATH','GLSD','STW']
data_nodes=[]
# one (initially empty) list of blocks per datanode
for i in range(nb_dataNode):
    data_nodes.append([])
nb_blocks=int(nb_tuples/block_size)
if (nb_tuples%block_size!=0):
    nb_blocks=nb_blocks+1
for i in range(nb_blocks):
    block={}
    block['id_bk']=i
    block['data']=[]
    for j in range(block_size):
Workout 2
System architecture
# calculate the number of blocks
nb_blocks=int(nb_tuples/block_size)
if (nb_tuples%block_size!=0):
    nb_blocks=nb_blocks+1
for i in range(nb_blocks):
    block={}
    block['id_bk']=i
    block['data']=[]
    for j in range(block_size):
        if (i*block_size+j==nb_tuples):
            break
        sp=random.randint(0, len(specialities)-1)
        tuple={}
        tuple['st_id']=i*block_size+j
        tuple['sp']=specialities[sp]
        tuple['math']= round(random.uniform(0.,20.),2)
        tuple['phy']= round(random.uniform(0.,20.),2)
Workout 2
System architecture
# create blocks
for i in range(nb_blocks):
    block={}
    block['id_bk']=i
    block['data']=[]
    for j in range(block_size):
        if (i*block_size+j==nb_tuples):
            break
        sp=random.randint(0, len(specialities)-1)
        tuple={}
        tuple['st_id']=i*block_size+j
        tuple['sp']=specialities[sp]
        tuple['math']= round(random.uniform(0.,20.),2)
        tuple['phy']= round(random.uniform(0.,20.),2)
        tuple['sci']= round(random.uniform(0.,20.),2)
        tuple['phyl']= round(random.uniform(0.,20.),2)
        tuple['geog']=round(random.uniform(0.,20.),2)
        tuple['eng']= round(random.uniform(0.,20.),2)
Workout 2
System architecture
    # create tuples
    for j in range(block_size):
        if (i*block_size+j==nb_tuples):
            break
        sp=random.randint(0, len(specialities)-1)
        tuple={}
        tuple['st_id']=i*block_size+j
        tuple['sp']=specialities[sp]
        tuple['math']= round(random.uniform(0.,20.),2)
        tuple['phy']= round(random.uniform(0.,20.),2)
        tuple['sci']= round(random.uniform(0.,20.),2)
        tuple['phyl']= round(random.uniform(0.,20.),2)
        tuple['geog']=round(random.uniform(0.,20.),2)
        tuple['eng']= round(random.uniform(0.,20.),2)
        block['data'].append(tuple)
    for k in range(nb_copies):
        dns=[]
Workout 2
System architecture
    # save dataset: copy each block to nb_copies datanodes
        block['data'].append(tuple)
    for k in range(nb_copies):
        dns=[]
        dn=random.randint(0, len(data_nodes)-1)
        while dn in dns:
            dn=random.randint(0, len(data_nodes)-1)
        data_nodes[dn].append(block)
        dns.append(dn)
return [data_nodes,nb_blocks]
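Putting the fragments above together, here is a minimal runnable sketch of the generator. The function name generate_dataset is an assumption (the slides do not show the def line), tuple is renamed tuple_ to avoid shadowing the Python built-in, and dns is initialised before the copy loop so that the nb_copies replicas land on distinct datanodes, which is what the replication described earlier suggests (on the slides dns is reset inside the loop).

import random

specialities = ['ST', 'MATH', 'GLSD', 'STW']

def generate_dataset(nb_tuples, block_size, nb_dataNode, nb_copies, specialities=specialities):
    data_nodes = []
    for i in range(nb_dataNode):              # one (initially empty) list of blocks per datanode
        data_nodes.append([])
    nb_blocks = int(nb_tuples / block_size)   # calculate the number of blocks
    if nb_tuples % block_size != 0:
        nb_blocks = nb_blocks + 1
    for i in range(nb_blocks):                # create blocks
        block = {'id_bk': i, 'data': []}
        for j in range(block_size):           # create tuples
            if i * block_size + j == nb_tuples:
                break
            sp = random.randint(0, len(specialities) - 1)
            tuple_ = {'st_id': i * block_size + j, 'sp': specialities[sp]}
            for subject in ['math', 'phy', 'sci', 'phyl', 'geog', 'eng']:
                tuple_[subject] = round(random.uniform(0., 20.), 2)
            block['data'].append(tuple_)
        dns = []                              # save dataset: copy the block to nb_copies distinct datanodes
        for k in range(nb_copies):            # requires nb_copies <= nb_dataNode
            dn = random.randint(0, len(data_nodes) - 1)
            while dn in dns:
                dn = random.randint(0, len(data_nodes) - 1)
            data_nodes[dn].append(block)
            dns.append(dn)
    return [data_nodes, nb_blocks]

# example: 20 students, blocks of 5 tuples, 4 datanodes, 2 copies of each block
dataset = generate_dataset(nb_tuples=20, block_size=5, nb_dataNode=4, nb_copies=2)
print(dataset[1], 'blocks distributed over', len(dataset[0]), 'datanodes')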
def findBlock(id,dataset):
random_sort=np.arange(len(dataset[0]))
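The findBlock function is truncated on the slide. A plausible completion, inferred only from its first line (visit the datanodes in a random order and return the first copy of the requested block together with the datanode that serves it), might be:

import numpy as np

def findBlock(id, dataset):
    random_sort = np.arange(len(dataset[0]))   # indices of the datanodes
    np.random.shuffle(random_sort)             # visit them in random order (assumption)
    for dn in random_sort:
        for block in dataset[0][dn]:
            if block['id_bk'] == id:
                return [block['data'], int(dn)]   # the block's tuples and the datanode serving it
    return None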
System architecture
Random dataset (sample of the generator's output, abridged): data_nodes is a list with one entry per datanode, and each datanode holds copies of blocks such as
{'id_bk': 0, 'data': [{'st_id': 0, 'sp': 'GLSD', 'math': 16.25, 'phy': 11.24, 'sci': 12.98, 'phyl': 19.56, 'geog': 13.32, 'eng': 7.6}, ..., {'st_id': 4, 'sp': 'STW', 'math': 16.75, 'phy': 0.65, 'sci': 13.35, 'phyl': 5.9, 'geog': 19.54, 'eng': 18.28}]}
Parameters shown on the slide: Nb_tuples=n, Nb_blocks=m, Nb_datanodes=p, Block_size, Specialities, nb_copies.
Generate a dataset (example output, abridged): [[{'st_id': 0, 'sp': 'GLSD', ...}, ..., {'st_id': 4, 'sp': 'STW', ...}], 2], ...
Diagram: Generate a dataset → Random dataset → Map function → Averages
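The closing slides only show this pipeline; as a hedged sketch of what the map and reduce steps could look like on the student dataset (the names map_block and reduce_results and the choice of computing per-subject averages are assumptions, and the example reuses the generate_dataset sketch given earlier):

SUBJECTS = ['math', 'phy', 'sci', 'phyl', 'geog', 'eng']

def map_block(block):
    # map: for one block, compute the sum of each subject and the number of tuples
    sums = {s: 0.0 for s in SUBJECTS}
    for t in block['data']:
        for s in SUBJECTS:
            sums[s] += t[s]
    return [sums, len(block['data'])]

def reduce_results(mapped):
    # reduce: combine the per-block partial results into global averages
    total = {s: 0.0 for s in SUBJECTS}
    count = 0
    for sums, n in mapped:
        count += n
        for s in SUBJECTS:
            total[s] += sums[s]
    return {s: round(total[s] / count, 2) for s in SUBJECTS}

# assumes the generate_dataset sketch above has been defined
data_nodes, nb_blocks = generate_dataset(nb_tuples=20, block_size=5, nb_dataNode=4, nb_copies=2)
seen, mapped = set(), []
for dn in data_nodes:
    for block in dn:
        if block['id_bk'] not in seen:   # run the map on one copy of each block only
            seen.add(block['id_bk'])
            mapped.append(map_block(block))
print(reduce_results(mapped))            # per-subject averages over all students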