Part 03 Intro To Hadoop
Introduction to Hadoop
Hadoop Cluster
Types of Nodes in a Hadoop Cluster
• Name node
• the “master”
• Maintains a record of what data are stored, and where, across the cluster
• Could be a single point of failure.
• Multiple name nodes may be used for resilience and scalability
• Data node
• the “worker”
• Each node usually has large local storage and large memory
• Large data sets are stored across data nodes with replication
• Resilient to node failures and disk errors
• Service node
• Metadata nodes for other applications
• Backup name node
• Usually not user accessible
From MapReduce to Hadoop
• Hadoop has become a basic platform supporting common big data analysis tasks
• Hadoop includes multiple components that can be used beyond the MapReduce programming model
Hadoop Distributed File System (HDFS)
The HDFS will be set up with three top-level directories
Working with HDFS
HDFS provides a file system shell:
hadoop fs [commands]
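For example, a few common operations (the /user/alice paths below are only illustrative; actual home directories depend on the cluster setup):
hadoop fs -ls /                    # list top-level directories
hadoop fs -mkdir /user/alice/data  # create a directory
hadoop fs -ls /user/alice          # list a user directory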
Getting Data in and out of HDFS
hadoop fs -put local_file [path_in_HDFS]
Puts a file from your local system into HDFS
Each file is stored in one or more “blocks”
The default block size is 128 MB
The block size can be overridden by users
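A minimal sketch of moving data in and out, including a per-file block size override via dfs.blocksize (file names and paths are illustrative):
hadoop fs -put mydata.csv /user/alice/mydata.csv   # local -> HDFS
hadoop fs -get /user/alice/mydata.csv copy.csv     # HDFS -> local
# override the 128 MB default block size for one upload (value in bytes)
hadoop fs -D dfs.blocksize=268435456 -put big.csv /user/alice/big.csv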
Other File Shell Commands
-stat returns status information for a path
-cat/-tail output a file to stdout
-setrep sets the replication factor
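Illustrative uses of these commands (paths are hypothetical):
hadoop fs -stat "%b %r %n" /user/alice/mydata.csv  # size in bytes, replication, name
hadoop fs -cat /user/alice/out/part-00000          # print a whole file to stdout
hadoop fs -tail /user/alice/out/part-00000         # print the last 1 KB of a file
hadoop fs -setrep -w 2 /user/alice/mydata.csv      # set replication to 2 and wait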
YARN
YARN: Yet Another Resource Negotiator
Manages computing resources within the Hadoop cluster
All jobs should be submitted to YARN to run,
e.g. using either yarn jar or hadoop jar
When using other Hadoop-supported applications, such as Spark,
also specify YARN as the resource manager.
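For example (jar, class, and script names below are placeholders):
yarn jar myapp.jar MyMainClass arg1 arg2    # submit a job through YARN
hadoop jar myapp.jar MyMainClass arg1 arg2  # equivalent submission
spark-submit --master yarn myscript.py      # run Spark with YARN as its resource manager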
YARN Commands
Show cluster status
Help manage jobs running inside the Hadoop cluster
YARN Commands
yarn application
-list lists applications submitted to YARN
By default this shows active/queued applications
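For example (the application ID is illustrative):
yarn application -list                                   # active and queued applications
yarn application -list -appStates FINISHED,FAILED        # filter by application state
yarn application -status application_1526900291229_0001  # details for one application
yarn application -kill application_1526900291229_0001    # stop a running application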
YARN Commands
yarn node
-list lists the status of data nodes
Lets us know if there are fewer live data nodes than expected
yarn logs
Dumps the logs of a finished application
-applicationId specifies which application to fetch logs from
-containerId specifies which container to fetch logs from
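For example (the application and container IDs are illustrative):
yarn node -list       # live data nodes and their status
yarn node -list -all  # include unhealthy/lost nodes as well
yarn logs -applicationId application_1526900291229_0001
yarn logs -applicationId application_1526900291229_0001 \
  -containerId container_1526900291229_0001_01_000002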
Running Hadoop Applications
All Hadoop applications can be run as console commands.
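For example, the MapReduce examples bundled with Hadoop can be launched directly from the console (the jar path varies by installation):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100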
Many Relevant Cluster/Job Settings Matter
# of mappers: mapred.map.tasks
# of reducers: mapred.reduce.tasks
# of executors: spark.executor.instances
# of cores per executor: spark.executor.cores
CPU cores available per node: yarn.nodemanager.resource.cpu-vcores
Cores per container: yarn.scheduler.minimum-allocation-vcores,
yarn.scheduler.maximum-allocation-vcores
Memory available per node: yarn.nodemanager.resource.memory-mb
Memory per container: yarn.scheduler.minimum-allocation-mb,
yarn.scheduler.maximum-allocation-mb,
yarn.scheduler.increment-allocation-mb
….
Interfacing with Other Programming Languages
• Enabling MR jobs in other languages
• Python, Perl, R, C, etc.
Putting it together:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.map.tasks=512 \
-D mapred.reduce.tasks=256 \
-D stream.num.map.output.key.fields=1 \
-input /tmp/data/20news-all/alt.atheism \
-output wiki_wc_bash \
-mapper ./mapwc.sh -reducer ./reducewc.sh \
-file ./mapwc.sh -file ./reducewc.sh
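The contents of mapwc.sh and reducewc.sh are not shown here; a minimal word-count sketch of what such streaming scripts could look like:
#!/bin/bash
# mapwc.sh (sketch): emit "word<TAB>1" for every word on stdin
tr -s '[:space:]' '\n' | awk 'NF {print $1 "\t" 1}'

#!/bin/bash
# reducewc.sh (sketch): sum the counts per word
# (streaming delivers mapper output to the reducer sorted by key)
awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}'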
Note: with Hadoop streaming, you can use a higher number of mappers.
Why?
Running the Wordcount Example Code
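A run of the example might be launched like this (the jar path and input path are illustrative, the latter reused from the streaming example above):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount /tmp/data/20news-all/alt.atheism wordcount_out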
The on-screen output will show the job's running status
Try Hadoop Locally
• Main reference
• https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
• Prerequisites
• Java
• SSH
• Download
• e.g. from http://apache.claz.org/hadoop/common/hadoop-2.8.0/
• Unpack
• Try bin/hadoop (see the sketch below)
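A minimal standalone-mode walkthrough, adapted from the Apache single-node setup guide (version numbers should match the tarball actually downloaded):
tar xzf hadoop-2.8.0.tar.gz
cd hadoop-2.8.0
bin/hadoop version
# run a bundled example locally: grep the config files for a pattern
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar \
  grep input output 'dfs[a-z.]+'
cat output/*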