
Big Data Analysis Workshop:

Introduction to Hadoop

Drs. Weijia Xu, Ruizhu Huang and Amit Gupta


Data Mining & Statistics Group
Texas Advanced Computing Center
University of Texas at Austin

Sept. 28-29, 2017


Atlanta, GA
Common Software Components with a Hadoop Cluster

Hadoop Cluster

Types of Nodes in a Hadoop Cluster
•  Name node
•  The “master”
•  Maintains a record of what data are stored where across the cluster
•  Could be a single point of failure
•  Multiple name nodes may be used for security and extensibility
•  Data node
•  The “worker”
•  Each node usually has large local storage and large memory
•  Large data sets are stored across data nodes with duplication
•  Resilient towards node failures/disk errors
•  Service node
•  Metadata nodes for other applications
•  Backup name node
•  Usually not user accessible
From MapReduce to Hadoop
•  Hadoop has now become a basic platform to support common big data analysis tasks
•  Hadoop includes multiple components that can be used beyond the MapReduce programming model as well
Hadoop Distributed File System (HDFS)
HDFS will be set up with three top-level directories:

/tmp   publicly writable; used by many Hadoop-based applications as temporary space
/user  all users' home directories, /user/$USERNAME
/var   publicly readable; used by many Hadoop-based applications to store log files etc.
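
As a quick check of this layout, you can list these directories with the HDFS file system shell introduced on the next slide (a hypothetical session; exact contents depend on the installation):

>hadoop fs -ls /             # shows /tmp, /user and /var on a cluster set up this way
>hadoop fs -ls /user/$USER   # list your own HDFS home directory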

Working with HDFS
HDFS has a file system shell:
hadoop fs [commands]

The file system shell includes a set of commands to work with HDFS.
The commands are similar to common Linux commands, e.g.
>hadoop fs -ls        # list the contents of the default user directory
>hadoop fs -mkdir abc # make a directory in HDFS

Getting Data in and out of HDFS
hadoop fs -put local_file [path_in_HDFS]
Puts a file from your local system into HDFS.
Each file is stored in one or more “blocks”.
The default block size is 128 MB.
The block size can be overridden by users.

hadoop fs -get path_in_hdfs [path_in_local]
Gets a file from the Hadoop cluster to the local file system.
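
A minimal sketch of these commands (the file and directory names are hypothetical; dfs.blocksize is the Hadoop 2.x block-size property, given here in bytes):

>hadoop fs -put data.csv /user/$USER/data.csv                        # upload with the default 128 MB block size
>hadoop fs -D dfs.blocksize=268435456 -put data.csv big_blocks.csv   # override the block size to 256 MB for this upload
>hadoop fs -get /user/$USER/data.csv ./data_copy.csv                 # copy the file back to the local file system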

Other File Shell Commands
-stat     returns stat information for a path
-cat/-tail  output a file to stdout
-setrep   set the replication factor

For a complete list, just run
hadoop fs
or visit http://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-common/FileSystemShell.html
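
A few illustrative invocations (the paths are hypothetical; %n, %o and %r are the name, block-size and replication fields of -stat):

>hadoop fs -stat "%n %o %r" /user/$USER/data.csv   # print name, block size and replication factor
>hadoop fs -cat /user/$USER/data.csv | head        # write the file to stdout, piped through head
>hadoop fs -tail /user/$USER/data.csv              # show the last kilobyte of the file
>hadoop fs -setrep 2 /user/$USER/data.csv          # change the replication factor of the file to 2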

YARN
YARN: Yet Another Resource Negotiator
Manages computing resources within the Hadoop cluster.
All jobs should be submitted to YARN to run,
e.g. using either yarn jar or hadoop jar.
When using other Hadoop-supported applications, such as Spark, also specify YARN as the resource manager.

YARN commands
Show cluster status.
Help manage jobs running inside the Hadoop cluster.
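
For example, a Spark job submitted to the same cluster names YARN as its resource manager; a sketch (my_job.py and the executor counts are hypothetical, and the exact spark-submit options depend on the Spark version):

spark-submit --master yarn --deploy-mode client \
    --num-executors 4 --executor-cores 2 \
    my_job.py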

YARN Commands
yarn application
-list    list applications submitted to YARN;
         by default shows active/queued applications
-kill    kill the application specified by application ID
-appStates/-appTypes    filter options
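
For example (the application ID is hypothetical):

>yarn application -list                                   # active and queued applications
>yarn application -list -appStates FINISHED               # filter by application state
>yarn application -kill application_1506500000000_0001    # kill a specific application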

YARN Commands
yarn node
-list    list the status of the data nodes;
         lets us know if there are fewer live data nodes than expected

yarn logs
dump the logs of a finished application
-applicationId    specify which application to get logs from
-containerId      specify which container to get logs from
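
For example (the application and container IDs are hypothetical; some Hadoop versions also require -nodeAddress together with -containerId):

>yarn node -list                                              # list data nodes and their status
>yarn logs -applicationId application_1506500000000_0001     # dump all logs of a finished application
>yarn logs -applicationId application_1506500000000_0001 -containerId container_1506500000000_0001_01_000002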

Running a Hadoop Application
All Hadoop applications can be run as console commands.

The basic format is as follows:

hadoop jar java_jar_name java_class_name [parameters]

The user can use -D to specify more Hadoop options, e.g.
-D mapred.map.tasks     # number of map instances to be generated
-D mapred.reduce.tasks  # number of reduce instances to be used
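
For example, running the bundled word-count program (the jar path and the input/output directories are hypothetical and depend on the installation; mapred.reduce.tasks is the older property name used on this slide):

>hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
    -D mapred.reduce.tasks=8 \
    /user/$USER/input /user/$USER/wordcount_out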

Many Relevant Cluster/Job Settings Matter
# of mappers                mapred.map.tasks
# of reducers               mapred.reduce.tasks
# of executors              spark.executor.instances
# of cores per executor     spark.executor.cores
# of cores per node         yarn.nodemanager.resource.cpu-vcores
# of cores per container    yarn.scheduler.minimum-allocation-vcores,
                            yarn.scheduler.maximum-allocation-vcores
Memory per node             yarn.nodemanager.resource.memory-mb
Memory per container        yarn.scheduler.minimum-allocation-mb,
                            yarn.scheduler.maximum-allocation-mb,
                            yarn.scheduler.increment-allocation-mb
Memory per mapper/reducer   mapreduce.map.memory.mb, mapreduce.reduce.memory.mb
…


Interfacing with Other Programming Languages
•  Enabling MR jobs with other languages
•  Python, Perl, R, C, etc.
•  Users need to provide scripts/programs for the map and reduce processing
•  The input/output format needs to be compatible with key-value pairs
•  Intermediate data are passed through stdin and stdout
•  A trade-off between convenience and performance
Hadoop Streaming API
hadoop jar hadoop-streaming.jar
-input /path/to/input/in/hdfs    # input file location
-output /path/to/output/in/hdfs  # output file location
-mapper map                      # mapper implementation
-reducer reduce                  # reducer implementation
-file map                        # location of the map code on the local file system
-file reduce                     # location of the reduce code on the local file system

The map and reduce steps can be implemented in any programming language, even as Bash scripts.
WordCount using Bash with Hadoop streaming
The map code:
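
The script itself appears as a screenshot in the original slides; a minimal sketch of what mapwc.sh (the file name used in the command two slides ahead) might look like, emitting one "word<TAB>1" pair per word:

#!/bin/bash
# mapwc.sh -- read text from stdin, emit "word<TAB>1" for every word
while read -r line; do
    for word in $line; do
        printf '%s\t1\n' "$word"
    done
done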
WordCount using Bash with Hadoop streaming
The reduce code:
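
Likewise a hedged sketch of reducewc.sh; it relies on Hadoop streaming sorting the mapper output by key before it reaches the reducer:

#!/bin/bash
# reducewc.sh -- sum the counts for each word; input lines are "word<TAB>count", sorted by word
current_word=""
count=0
while IFS=$'\t' read -r word value; do
    if [ "$word" = "$current_word" ]; then
        count=$((count + value))
    else
        if [ -n "$current_word" ]; then
            printf '%s\t%d\n' "$current_word" "$count"
        fi
        current_word="$word"
        count=$value
    fi
done
# flush the last word
if [ -n "$current_word" ]; then
    printf '%s\t%d\n' "$current_word" "$count"
fi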
WordCount using Bash with Hadoop streaming

Putting it together:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.map.tasks=512 \
-D mapred.reduce.tasks=256 \
-D stream.num.map.output.key.fields=1 \
-input /tmp/data/20news-all/alt.atheism \
-output wiki_wc_bash \
-mapper ./mapwc.sh -reducer ./reducewc.sh \
-file ./mapwc.sh -file ./reducewc.sh
Note: with Hadoop streaming, you could use a higher number of mappers. Why?
Running the WordCount Example Code
The on-screen output will show the job's running status.

Try Hadoop Locally
•  Main reference
•  https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

•  Prerequisites
•  Java
•  SSH

•  Download
•  e.g. from http://apache.claz.org/hadoop/common/hadoop-2.8.0/

•  Unpack

•  Try bin/hadoop (a sketch of these steps follows)
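
A hedged sketch of these steps on a Linux machine (the mirror URL and version are taken from the slide; the JAVA_HOME path is hypothetical):

wget http://apache.claz.org/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
tar -xzf hadoop-2.8.0.tar.gz
cd hadoop-2.8.0
export JAVA_HOME=/usr/lib/jvm/default-java   # point this at your own Java installation
bin/hadoop version                           # prints the Hadoop version if the unpack succeeded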
