Part 03 Intro To Hadoop
Introduction to Hadoop
Hadoop Cluster
Types of Nodes in a Hadoop Cluster
• Name node
• the “master”
• Maintains a record of what data are stored, and where, across the cluster
• Could be a single point of failure.
• Multiple name nodes may be used for resilience and scalability
• Data node
• the “worker”
• Each node usually has large local storage and large memory
• Large data sets are stored across data nodes with replication
• Resilient to node failures and disk errors
• Service node
• Metadata nodes for other applications
• Backup name node
• Usually not user accessible
From MapReduce to Hadoop
• Hadoop has become a basic platform supporting common big data analysis tasks
• Hadoop includes multiple components that can be used beyond the MapReduce programming model
Hadoop Distributed File System (HDFS)
The HDFS will be set up with three top-level directories
Working with HDFS
HDFS provides a file system shell:
hadoop fs [commands]
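For example, a few common operations (the /user/alice paths below are only illustrative; actual home directories depend on the cluster setup):
hadoop fs -ls /                    # list top-level directories
hadoop fs -mkdir /user/alice/data  # create a directory
hadoop fs -ls /user/alice          # list a user directory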
Getting Data in and out of HDFS
hadoop fs -put local_file [path_in_HDFS]
Puts a file from your local system into HDFS
Each file is stored in one or more “blocks”
The default block size is 128 MB
The block size can be overridden by users
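A minimal sketch of moving data in and out, including a per-file block size override via dfs.blocksize (file names and paths are illustrative):
hadoop fs -put mydata.csv /user/alice/mydata.csv   # local -> HDFS
hadoop fs -get /user/alice/mydata.csv copy.csv     # HDFS -> local
# override the 128 MB default block size for one upload (value in bytes)
hadoop fs -D dfs.blocksize=268435456 -put big.csv /user/alice/big.csv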
Other File Shell Commands
-stat returns status information for a path
-cat/-tail output a file to stdout
-setrep sets the replication factor
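Illustrative uses of these commands (paths are hypothetical):
hadoop fs -stat "%b %r %n" /user/alice/mydata.csv  # size in bytes, replication, name
hadoop fs -cat /user/alice/out/part-00000          # print a whole file to stdout
hadoop fs -tail /user/alice/out/part-00000         # print the last 1 KB of a file
hadoop fs -setrep -w 2 /user/alice/mydata.csv      # set replication to 2 and wait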
YARN
YARN: Yet Another Resource Negotiator
Manages computing resources within the Hadoop cluster
All jobs should be submitted to YARN to run,
e.g. using either yarn jar or hadoop jar
When using other Hadoop-supported applications, such as Spark,
also specify YARN as the resource manager.
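For example (jar, class, and script names below are placeholders):
yarn jar myapp.jar MyMainClass arg1 arg2    # submit a job through YARN
hadoop jar myapp.jar MyMainClass arg1 arg2  # equivalent submission
spark-submit --master yarn myscript.py      # run Spark with YARN as its resource manager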
YARN Commands
Show cluster status
Help manage jobs running inside the Hadoop cluster
YARN Commands
yarn application
-list lists applications submitted to YARN
By default this shows active/queued applications
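For example (the application ID is illustrative):
yarn application -list                                   # active and queued applications
yarn application -list -appStates FINISHED,FAILED        # filter by application state
yarn application -status application_1526900291229_0001  # details for one application
yarn application -kill application_1526900291229_0001    # stop a running application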
YARN Commands
yarn node
-list lists the status of data nodes
Lets us know if there are fewer live data nodes than expected
yarn logs
Dumps the logs of a finished application
-applicationId specifies which application to fetch logs from
-containerId specifies which container to fetch logs from
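For example (the application and container IDs are illustrative):
yarn node -list       # live data nodes and their status
yarn node -list -all  # include unhealthy/lost nodes as well
yarn logs -applicationId application_1526900291229_0001
yarn logs -applicationId application_1526900291229_0001 \
  -containerId container_1526900291229_0001_01_000002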
Running Hadoop Applications
All Hadoop applications can be run as console commands.
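For example, the MapReduce examples bundled with Hadoop can be launched directly from the console (the jar path varies by installation):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100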
Many Relevant Cluster/Job Settings Matter
# of mappers: mapred.map.tasks
# of reducers: mapred.reduce.tasks
# of executors: spark.executor.instances
# of cores per executor: spark.executor.cores
CPU cores available per node: yarn.nodemanager.resource.cpu-vcores
Cores per container: yarn.scheduler.minimum-allocation-vcores,
yarn.scheduler.maximum-allocation-vcores
Memory available per node: yarn.nodemanager.resource.memory-mb
Memory per container: yarn.scheduler.minimum-allocation-mb,
yarn.scheduler.maximum-allocation-mb,
yarn.scheduler.increment-allocation-mb
….
Interfacing with Other Programming Languages
• Enabling MR jobs in other languages
• Python, Perl, R, C, etc.
Putting it together:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.map.tasks=512 \
-D mapred.reduce.tasks=256 \
-D stream.num.map.output.key.fields=1 \
-input /tmp/data/20news-all/alt.atheism \
-output wiki_wc_bash \
-mapper ./mapwc.sh -reducer ./reducewc.sh \
-file ./mapwc.sh -file ./reducewc.sh
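The contents of mapwc.sh and reducewc.sh are not shown here; a minimal word-count sketch of what such streaming scripts could look like:
#!/bin/bash
# mapwc.sh (sketch): emit "word<TAB>1" for every word on stdin
tr -s '[:space:]' '\n' | awk 'NF {print $1 "\t" 1}'

#!/bin/bash
# reducewc.sh (sketch): sum the counts per word
# (streaming delivers mapper output to the reducer sorted by key)
awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}'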
Note: with Hadoop streaming, you can use a higher number of mappers.
Why?
Running the Wordcount Example Code
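A run of the example might be launched like this (the jar path and input path are illustrative, the latter reused from the streaming example above):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount /tmp/data/20news-all/alt.atheism wordcount_out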
The on-screen output will show the job's running status
Try Hadoop Locally
• Main reference
• https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
• Prerequisites
• Java
• SSH
• Download
• e.g. from http://apache.claz.org/hadoop/common/hadoop-2.8.0/
• Unpack
• Try bin/hadoop (see the sketch below)
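A minimal standalone-mode walkthrough, adapted from the Apache single-node setup guide (version numbers should match the tarball actually downloaded):
tar xzf hadoop-2.8.0.tar.gz
cd hadoop-2.8.0
bin/hadoop version
# run a bundled example locally: grep the config files for a pattern
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar \
  grep input output 'dfs[a-z.]+'
cat output/*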