BDA - Module II
• grunt> quit
Apache Pig Examples - Execution Modes
• To use Hadoop MapReduce, start Pig as follows (or just enter pig):
• $ pig -x mapreduce
• To use the Tez engine instead:
• $ pig -x tez
Pig Script
• /* id.pig */
A = load 'passwd' using PigStorage(':');  -- load the passwd file
B = foreach A generate $0 as id;          -- extract the user IDs
dump B;
store B into 'id.out';                    -- write the results to a directory named id.out
• Remove any previous results and run the script in local mode:
• $ /bin/rm -r id.out/
$ pig -x local id.pig
Pig Script in MapReduce Mode
• To run the MapReduce version,
– use the same procedure;
– the only difference is that now all reading and
writing takes place in HDFS.
• $ hdfs dfs -rm -r id.out
$ pig id.pig
Apache Pig example
• Execute the Load Statement
• Now load the data from the
file student_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
• grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
  USING PigStorage(',')
  AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Apache Pig Example
• Following is the description of the above statement.
• Relation name
– We have stored the data in the relation student.
• Input file path
– We are reading data from the file student_data.txt, which is in the
/pig_data/ directory of HDFS.
• Storage function
– We have used the PigStorage() function. It loads and stores data as structured text files. It takes the delimiter that separates the fields of a tuple as a parameter; by default, it uses '\t'.
• Schema
– We have stored the data using the following schema:
– column:   id,  firstname, lastname,  phone,     city
– datatype: int, chararray, chararray, chararray, chararray
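• To check that the load worked as intended (these commands are not on the original slide), the declared schema and a sample of the data can be inspected from the Grunt shell; DESCRIBE prints the schema, while DUMP runs a job and prints the tuples:
• grunt> DESCRIBE student;
• grunt> DUMP student;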
Apache Hive
• Apache Hive is a data warehouse infrastructure built on top of
Hadoop for providing
• data summarization, ad hoc queries, and
• the analysis of large data sets using a SQL-like language called
HiveQL.
• Hive is considered the de facto standard for interactive SQL queries
over petabytes of data using Hadoop and offers the following
features:
• Tools to enable easy data extraction, transformation, and loading
(ETL)
• A mechanism to impose structure on a variety of data formats
• Access to files stored either directly in HDFS or in other data
storage systems such as HBase
• Query execution via MapReduce and Tez (optimized MapReduce)
Apache Hive
• Hive provides users who are already familiar with
SQL the capability to query the data on Hadoop
clusters.
• At the same time, Hive makes it possible for
programmers who are familiar with the
MapReduce framework to add their custom
mappers and reducers to Hive queries.
• Hive queries can also be dramatically accelerated
using the Apache Tez framework under YARN in
Hadoop version 2.
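• The slides do not show how Tez is selected; assuming a Hive installation with Tez support, the execution engine is typically chosen per session with the hive.execution.engine property:
• hive> SET hive.execution.engine=tez;
• Setting the value back to mr reverts to classic MapReduce execution.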
Hive Example Walk-Through
• To start Hive, simply enter the hive command.
If Hive starts correctly, you should get
a hive>prompt.
• $ hive
(some messages may show up here)
• hive>
Hive Example Walk-Through
• As a simple test, create and drop a table. Note that
Hive commands must end with a semicolon (;).
• hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
Hive Example Walk-Through
• A more detailed example can be developed using
a web server log file to summarize message
types. First, create a table using the following
command:
• hive> CREATE TABLE logs(t1 string, t2 string, t3
string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED
BY ' ';
OK
Time taken: 0.129 seconds
Hive Example Walk-Through
• Next, load the data—in this case, from
the sample.log file.
• Note that the file is found in the local directory and not
in HDFS.
• hive> LOAD DATA LOCAL INPATH 'sample.log'
OVERWRITE INTO TABLE logs;
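• Note that the listing below shows message-type counts, which come from a summarizing query rather than from the LOAD itself; that query is missing from the slide, but it would take roughly the following form (assuming the message type is held in column t4):
• hive> SELECT t4, COUNT(*) FROM logs WHERE t4 LIKE '[%' GROUP BY t4;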
Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-
2c6569791368
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
Hive Example Walk-Through
• To exit Hive, simply type exit;:
• hive> exit;
A More Advanced Hive Example
• The data files are collected from Movie-Lens
website (http://movielens.org).
• The files contain various numbers of movie
reviews,
• starting at 100,000 and going up to 20 million
entries
A More Advanced Hive Example
• In this example, 100,000 records will be
transformed
• from
– userid, movieid, rating, unixtime
• to
– userid, movieid, rating, and weekday
• using Apache Hive and a Python program
• (i.e., the UNIX time notation will be transformed
to the day of the week).
A More Advanced Hive Example
• The first step is to download and extract the
data:
• $ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
$ cd ml-100k
A More Advanced Hive Example
• Before we use Hive, we will create a short Python program
called weekday_mapper.py with following contents:
• import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
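• As a quick sanity check (not on the slide), the mapper can be run locally against a few lines of the data before it is used from Hive:
• $ head -5 u.data | python weekday_mapper.py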
A More Advanced Hive Example
• Next, start Hive and create the data table
(u_data) by entering the following at
the hive> prompt:
• CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
A More Advanced Hive Example
• Load the movie data into the table with the
following command:
• hive> LOAD DATA LOCAL INPATH './u.data'
OVERWRITE INTO TABLE u_data;
A More Advanced Hive Example
• hive> SELECT COUNT(*) FROM u_data;
• This command will start a single MapReduce job and
should finish with the following lines:
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU:
2.26 sec HDFS Read: 1979380
HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
100000
Time taken: 28.366 seconds, Fetched: 1 row(s)
A More Advanced Hive Example
• Now that the table data are loaded, use the
following command to make the new table
(u_data_new):
• hive> CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
A More Advanced Hive Example
• The next command adds
the weekday_mapper.py to Hive resources:
• hive> add FILE weekday_mapper.py;
A More Advanced Hive Example
• Once weekday_mapper.py is successfully loaded,
we can enter the transformation query:
• hive> INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
A More Advanced Hive Example
• If the transformation was successful, the final
portion of the output should look like the following:
...
Table default.u_data_new stats: [numFiles=1,
numRows=100000, totalSize=1179173,
rawDataSize=1079173]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.44
sec HDFS Read: 1979380 HDFS Write:
1179256 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 440 msec
OK
Time taken: 24.06 seconds
A More Advanced Hive Example
• The final query will sort and group the reviews by weekday:
• hive> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
• Final output for the review counts by weekday should look like the following:
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.39 sec HDFS Read: 1179386
HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 390 msec
OK
1 13278
2 14816
3 15426
4 13774
5 17964
6 12318
7 12424
Time taken: 22.645 seconds, Fetched: 7 row(s)
HIVE- Drop a Table
• $ hive -e 'drop table u_data_new'
$ hive -e 'drop table u_data'
USING APACHE SQOOP TO ACQUIRE
RELATIONAL DATA
• Sqoop is a tool designed to transfer data between
Hadoop and relational databases.
• You can use Sqoop to import data from a relational
database management system (RDBMS) into the
Hadoop Distributed File System (HDFS),
• transform the data in Hadoop, and then export the
data back into an RDBMS.
• Sqoop can be used with any Java Database
Connectivity (JDBC)–compliant database and has been
tested on Microsoft SQL Server, PostgreSQL, MySQL,
and Oracle.
Apache Sqoop Import and Export
Methods
• Figure 7.1 describes the Sqoop data import (to HDFS)
process.
• The data import is done in two steps.
• In the first step, shown in the figure, Sqoop examines
the database to gather the necessary metadata for the
data to be imported.
• The second step is a map-only (no reduce step) Hadoop
job that Sqoop submits to the cluster.
• This job does the actual data transfer using the
metadata captured in the previous step.
• Note that each node doing the import must have
access to the database.
Data Import
• The imported data are saved in an HDFS directory.
• Sqoop will use the database name for the directory, or
the user can specify any alternative directory where
the files should be populated.
• By default, these files contain comma-delimited fields,
with new lines separating different records.
• You can easily override the format in which data are
copied over by explicitly specifying the field separator
and record terminator characters.
• Once placed in HDFS, the data are ready for processing.
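• For example, the field and record delimiters can be overridden at import time; a sketch using the City table and connection settings from the walk-through later in this section (the target directory name is illustrative):
• $ sqoop import --connect jdbc:mysql://limulus/world --username sqoop --password sqoop \
    --table City -m 1 \
    --fields-terminated-by '|' --lines-terminated-by '\n' \
    --target-dir /user/hdfs/sqoop-mysql-import/city-pipe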
Data Export
• The export step again uses a map-only
Hadoop job to write the data to the database.
• Sqoop divides the input data set into splits,
then uses individual map tasks to push the
splits to the database.
• Again, this process assumes the map tasks
have access to the database.
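• A sketch of a full export command with the connection options spelled out inline; the staged-export form actually used later in this chapter relies on an options file instead:
• $ sqoop export --connect jdbc:mysql://limulus/world --username sqoop --password sqoop \
    --table CityExport --export-dir /user/hdfs/sqoop-mysql-import/city -m 4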
Connectors
Sqoop Example Walk-Through
• To install Sqoop using the HDP distribution RPM files, simply enter:
• # yum install sqoop sqoop-metastore
• For this example, we will use the world example database from the
MySQL site (http://dev.mysql.com/doc/world-
setup/en/index.html).
• This database has three tables:
• Country: information about countries of the world
• City: information about some of the cities in those countries
• CountryLanguage: languages spoken in each country
• To get the database, use wget to download and then extract the
file:
• $ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
$ gunzip world_innodb.sql.gz
• Next, log into MySQL (assumes you have privileges to create a
database) and
• import the desired database by following these steps:
• $ mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City |
| Country |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
Add Sqoop User Permissions for the
Local Machine and Cluster
• In MySQL, add the following privileges for
user sqoop to MySQL.
• Note that you must use both the local host name and
the cluster subnet for Sqoop to work properly.
• Also, for the purposes of this example, the sqoop
password is sqoop.
• mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'limulus' IDENTIFIED BY 'sqoop';
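• A second GRANT covering the cluster subnet is also required (the subnet below is illustrative; use your cluster's subnet):
• mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'10.0.0.%' IDENTIFIED BY 'sqoop';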
• Listing the databases visible to the sqoop user (e.g., with sqoop list-databases) should now include:
• information_schema
• test
• world
• The tables in the world database can be listed in the same way:
• $ sqoop list-tables --connect jdbc:mysql://limulus/world --username sqoop --password sqoop
• City
• Country
• CountryLanguage
• To import data, we need to make a directory in
HDFS:
• $ hdfs dfs -mkdir sqoop-mysql-import
• Sqoop options can be placed in an options file to avoid retyping them. Create a file named world-options.txt with the following contents:
import
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop
• The following command then imports the City table into HDFS using a single map task (-m 1):
• $ sqoop --options-file world-options.txt --table City -m 1 --target-dir /user/hdfs/sqoop-mysql-import/city
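• The imported files can then be inspected in HDFS (these checks are not shown on the slide):
• $ hdfs dfs -ls sqoop-mysql-import/city
$ hdfs dfs -cat sqoop-mysql-import/city/part-m-00000 | head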
SQL Query in the import step
• It is also possible to include an SQL Query in
the import step.
• For example, suppose we want just cities in
Canada:
• $ sqoop --options-file world-options.txt -m 1 --target-dir /user/hdfs/sqoop-mysql-import/canada-city --query "SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS"
• The \$CONDITIONS token is required by Sqoop in free-form queries; Sqoop replaces it with split conditions at run time.
SQL Query in the import step
• Inspecting the results confirms that only cities from Canada
have been imported:
• $ hdfs dfs -cat sqoop-mysql-import/canada-city/part-m-00000
1810,Montréal
1811,Calgary
1812,Toronto
...
1856,Sudbury
1857,Kelowna
1858,Barrie
Split Option in Query
• Adding -m 4 runs the query with four map tasks; the --split-by option tells Sqoop which column to use when dividing the work among them:
• $ sqoop --options-file world-options.txt -m 4 --target-dir /user/hdfs/sqoop-mysql-import/canada-city --query "SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS" --split-by ID
• $ hdfs dfs -ls sqoop-mysql-import/canada-city
Found 5 items
-rw-r--r-- 2 hdfs hdfs   0 2014-08-18 21:31 sqoop-mysql-import/canada-city/_SUCCESS
-rw-r--r-- 2 hdfs hdfs 175 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00000
-rw-r--r-- 2 hdfs hdfs 153 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00001
-rw-r--r-- 2 hdfs hdfs 186 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00002
-rw-r--r-- 2 hdfs hdfs 182 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00003
Export Data from HDFS to MySQL
• Sqoop can also be used to export data from
HDFS.
• The first step is to create tables for exported data.
• There are actually two tables needed for each
exported table.
• The first table holds the exported data
(CityExport), and the second is used for staging
the exported data (CityExportStaging).
• Enter the following MySQL commands to create
these tables:
• mysql> CREATE TABLE `CityExport` (
    `ID` int(11) NOT NULL AUTO_INCREMENT,
    `Name` char(35) NOT NULL DEFAULT '',
    `CountryCode` char(3) NOT NULL DEFAULT '',
    `District` char(20) NOT NULL DEFAULT '',
    `Population` int(11) NOT NULL DEFAULT '0',
    PRIMARY KEY (`ID`));
mysql> CREATE TABLE `CityExportStaging` (
    `ID` int(11) NOT NULL AUTO_INCREMENT,
    `Name` char(35) NOT NULL DEFAULT '',
    `CountryCode` char(3) NOT NULL DEFAULT '',
    `District` char(20) NOT NULL DEFAULT '',
    `Population` int(11) NOT NULL DEFAULT '0',
    PRIMARY KEY (`ID`));
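• The export below uses an options file named cities-export-options.txt, whose contents are not shown on the slide; presumably it mirrors world-options.txt with export as the command, for example:
export
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop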
• $ sqoop --options-file cities-export-options.txt --table CityExport --staging-table CityExportStaging --clear-staging-table -m 4 --export-dir /user/hdfs/sqoop-mysql-import/city
• mysql> select * from CityExport limit 10;
+----+----------+-------------+----------+------------+
| ID | Name     | CountryCode | District | Population |
+----+----------+-------------+----------+------------+
|  1 | Kabul    | AFG         | Kabol    |    1780000 |
|  2 | Qandahar | AFG         | Qandahar |     237500 |
|  3 | Herat    | AFG         | Herat    |     186800 |
...
• mysql> drop table `CityExportStaging`;
Apache HBase ImportTsv Example
• Convert the comma-separated Apple-stock.csv data to tab-separated format and copy it into HDFS:
• $ convert-to-tsv.sh Apple-stock.csv
• $ hdfs dfs -put Apple-stock.tsv /tmp
• Finally, ImportTsv is run using the following command line.
• Note the column designation in the -Dimporttsv.columns option.
• In the example, HBASE_ROW_KEY is set as the first column, that is, the date for the data.
• $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY ...
• The ImportTsv command will use MapReduce to load the data into HBase.
Apache HBase Web Interface
MANAGE HADOOP WORKFLOWS
WITH APACHE OOZIE
• Oozie is a workflow director system designed to run
and manage multiple related Apache Hadoop jobs.
• For instance, complete data input and analysis may
require several discrete Hadoop jobs to be run as a
workflow in which the output of one job serves as the
input for a successive job.
• Oozie is designed to construct and manage these
workflows.
• Oozie is not a substitute for the YARN scheduler.
• That is, YARN manages resources for individual Hadoop
jobs, and Oozie provides a way to connect and control
Hadoop jobs on the cluster
• Oozie workflow jobs are represented as directed acyclic
graphs (DAGs) of actions. (DAGs are basically graphs that
cannot have directed loops.)
• Three types of Oozie jobs are permitted:
• Workflow—a specified sequence of Hadoop jobs with
outcome-based decision points and control dependency.
– Progress from one action to another cannot happen until the
first action is complete.
• Coordinator—a scheduled workflow job that can run at
various time intervals or when data become available.
• Bundle—a higher-level Oozie abstraction that will batch a
set of coordinator jobs.
• Oozie is integrated with the rest of the Hadoop stack,
– supporting several types of Hadoop jobs out of the box
– (e.g., Java MapReduce,
– Streaming MapReduce,
– Pig,
– Hive, and Sqoop)
– as well as system-specific jobs (e.g., Java programs and
shell scripts).
– Oozie also provides a CLI and a web UI for monitoring jobs.
Workflow
• Oozie workflow definitions are written in hPDL (an XML Process Definition
Language). Such workflows contain several types of nodes:
• Control flow nodes define the beginning and the end of a workflow. They
include start, end, and optional fail nodes.
• Action nodes are where the actual processing tasks are defined. When an
action node finishes, the remote systems notify Oozie and the next node
in the workflow is executed. Action nodes can also include HDFS
commands.
• Fork/join nodes enable parallel execution of tasks in the workflow. The
fork node enables two or more tasks to run at the same time. A join node
represents a rendezvous point that must wait until all forked tasks
complete.
• Decision nodes enable decisions to be made about the previous task. Control decisions are based on the results of the previous action (e.g., file size or file existence). Decision nodes are essentially switch-case statements that use JSP EL (Java Server Pages Expression Language) expressions that evaluate to either true or false.
Oozie Example Walk-Through
• Step 1: Download Oozie Examples
• For HDP 2.1, the following command can be used to extract the files
into the working directory used for the demo:
• $ tar xvzf /usr/share/doc/oozie-4.0.0.2.1.2.1/oozie-examples.tar.gz
• For HDP 2.2, the following command will extract the files:
• $ tar xvzf /usr/hdp/2.2.4.2-2/oozie/doc/oozie-examples.tar.gz
• Once extracted, rename the examples directory to oozie-
examples so that you will not confuse it with the other examples
directories.
• $ mv examples oozie-examples
• The examples must also be placed in HDFS. Enter the following
command to move the example files into HDFS:
• $ hdfs dfs -put oozie-examples/ oozie-examples
Oozie Example Walk-Through
• The example applications are found under
the oozie-examples/apps directory, one directory
per example.
• Each directory contains at
least workflow.xml and job.properties files.
• Other files needed for each example are also in
its directory.
• The inputs for all examples are in the oozie-
examples/input-data directory.
• The examples will create output under
the oozie-examples/output-data directory in HDFS.
• Move to the simple MapReduce example directory:
• $ cd oozie-examples/apps/map-reduce/
• This directory contains two files and a lib directory. The
files are:
• The job.properties file defines parameters (e.g., path
names, ports) for a job. This file may change per job.
• The workflow.xml file provides the actual workflow for
the job. In this case, it is a simple MapReduce
(pass/fail). This file usually stays the same between
jobs.
Run the Simple MapReduce Example
• The job.properties file included in the examples requires a few edits to work properly.
• Using a text editor, change the following lines by adding the host name of the
NameNode and ResourceManager (indicated by jobTracker in the file).
• nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
• to the following (note the port change for jobTracker):
• nameNode=hdfs://_HOSTNAME_:8020
jobTracker=_HOSTNAME_:8050
• The examplesRoot variable must also be changed to oozie-examples, reflecting the
change made previously:
• examplesRoot=oozie-examples
• These changes must be made for all the job.properties files in the Oozie examples
that you choose to run.
• For example, for the cluster created with Ambari in Chapter 2, the lines were changed
to
• nameNode=hdfs://limulus:8020
jobTracker=limulus:8050
• The DAG for the simple MapReduce example is shown
in Figure 7.6. The workflow.xml file describes these simple
steps and has the following workflow nodes:
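• The node listing itself is missing from the slide; a minimal sketch of the structure such a workflow.xml typically has (element contents and node names are illustrative, not copied from the example):
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce> ... configuration for the MapReduce job ... </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>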
• To run the Oozie MapReduce example job from
the oozie-examples/apps/map-reduce directory, enter
the following line:
• $ oozie job -run -oozie http://limulus:11000/oozie -config job.properties
• When Oozie accepts the job, a job ID will be printed:
• job: 0000001-150424174853048-oozie-oozi-W
• You will need to change the “limulus” host name to
match the name of the node running your Oozie server.
The job ID can be used to track and control job
progress.
• When trying to run Oozie, you may get the puzzling error:
• oozie is not allowed to impersonate oozie
• If you receive this message, make sure the following is defined in the core-site.xml
file:
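• The property block itself is missing from the slide; based on the standard fix for this error (analogous to the NFS proxyuser settings shown later in this module), it would look like the following:
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>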
• To avoid having to provide the -oozie option with the
Oozie URL every time you run the oozie command, set
the OOZIE_URL environment variable as follows (using
your Oozie server host name in place of “limulus”):
• $ export OOZIE_URL="http://limulus:11000/oozie"
• You can now run all subsequent Oozie commands
without specifying the -oozie URL option. For instance,
using the job ID, you can learn about a particular job’s
progress by issuing the following command:
• $ oozie job -info 0000001-150424174853048-oozie-oozi-W
• The resulting output (line length compressed) is shown in the following listing.
• Because this job is just a simple test, it may be complete by the time you issue the -
info command.
• If it is not complete, its progress will be indicated in the listing.
Step 3: Run the Oozie Demo
Application
• A more sophisticated example can be found in the
demo directory (oozie-examples/apps/demo).
• This workflow includes MapReduce, Pig, and file
system tasks as well as fork, join, decision, action, start,
stop, kill, and end nodes.
• Move to the demo directory and edit
the job.properties file as described previously.
• Entering the following command runs the workflow
(assuming the OOZIE_URL environment variable has
been set):
• $ oozie job -run -config job.properties
Web GUI for OOZIE
• $ firefox http://limulus:11000/oozie/
Short Summary of OOZIE Commands
• Run a workflow job (returns _OOZIE_JOB_ID_):
• $ oozie job -run -config JOB_PROPERTIES
• Submit a workflow job (returns _OOZIE_JOB_ID_ but does not start):
• $ oozie job -submit -config JOB_PROPERTIES
• Start a submitted job:
• $ oozie job -start _OOZIE_JOB_ID_
• Check a job’s status:
• $ oozie job -info _OOZIE_JOB_ID_
• Suspend a workflow:
• $ oozie job -suspend _OOZIE_JOB_ID_
• Resume a workflow:
• $ oozie job -resume _OOZIE_JOB_ID_
• Rerun a workflow:
• $ oozie job -rerun _OOZIE_JOB_ID_ -config JOB_PROPERTIES
• Kill a job:
• $ oozie job -kill _OOZIE_JOB_ID_
• View server logs:
• $ oozie job -logs _OOZIE_JOB_ID_
• Full logs are available at /var/log/oozie on the Oozie server.
Hadoop2 - YARN
Hadoop 1 Framework
Motivation for Hadoop 2
YARN Architecture
YARN Components
WordCount Example in YARN
YARN For Distributed Applications
YARN Distributed Shell
YARN Distributed Shell
• The introduction of Hadoop version 2 has
drastically increased the number and scope of
new applications.
• By splitting the version 1 monolithic MapReduce
engine into two parts, a scheduler and the
MapReduce framework, Hadoop has become a
general-purpose large-scale data analytics
platform.
• A simple example of a non-MapReduce Hadoop
application is the YARN Distributed-Shell
described in this chapter.
YARN DISTRIBUTED-SHELL
• The Hadoop YARN project includes the
Distributed-Shell application, which is an example
of a Hadoop non-MapReduce application built on
top of YARN.
• Distributed-Shell is a simple mechanism for
running shell commands and scripts in containers
on multiple nodes in a Hadoop cluster.
• This application is not meant to be a production
administration tool, but rather a demonstration
of the non-MapReduce capability that can be
implemented on top of YARN.
USING THE YARN DISTRIBUTED-SHELL
• For the purposes of the examples presented in the remainder of this chapter, we assume the following installation path for the Distributed-Shell application, based on Hortonworks HDP 2.2:
• $ export YARN_DS=/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
• For a pseudo-distributed installation using Apache Hadoop version 2.6.0, the following path will run the Distributed-Shell application (assuming $HADOOP_HOME is defined to reflect the location of Hadoop):
• $ export YARN_DS=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar
• If another distribution is used, search for the file hadoop-yarn-
applications-distributedshell*.jar and set $YARN_DS based on its
location.
YARN DISTRIBUTED-SHELL
• Distributed-Shell exposes various options that
can be found by running the following
command:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -help
• The output of this command follows; we will
explore some of these options in the examples
illustrated in this chapter.
YARN DISTRIBUTED-SHELL
• A Simple Example
• The simplest use-case for the Distributed-Shell application is to run an arbitrary
shell command in a container.
• We will demonstrate the use of the uptime command as an example.
• This command is run on the cluster using Distributed-Shell as follows:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar
$YARN_DS -shell_command uptime
• By default, Distributed-Shell spawns only one instance of a given shell command.
• When this command is run, you can see progress messages on the screen but
nothing about the actual shell command.
• If the shell command succeeds, the following should appear at the end of the
output:
• 15/05/27 14:48:53 INFO distributedshell.Client: Application completed successfully
• If the shell command did not work for whatever reason, the following message will
be displayed:
• 15/05/27 14:58:42 ERROR distributedshell.Client: Application failed to complete
successfully
UPTIME Command - Output
• The next step is to examine the output for the application.
• Distributed-Shell redirects the output of the individual shell
commands run on the cluster nodes into the log files,
• which are found either on the individual nodes or aggregated onto
HDFS,
– depending on whether log aggregation is enabled.
• Assuming log aggregation is enabled,
– the results for each instance of the command can be found by using
the yarn logs command.
• For the previous uptime example, the following command can be
used to inspect the logs:
• $ yarn logs -applicationId application_1432831236474_0001
• The applicationId can be found from the program output or by
using the yarn application command
• Notice that there are two containers.
• The first container (con..._000001) is the ApplicationMaster for the job.
• The second container (con..._000002) runs the actual shell command.
• The output of the uptime command is located in the second container's stdout, after the Log Contents: label.
Using More Containers
• Distributed-Shell can run commands to be executed on any
number of containers by way of the
– -num_containers argument.
• For example, to see on which nodes the Distributed-Shell
command was run, the following command can be used:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -shell_command hostname -num_containers 4
• If we now examine the results for this job,
– there will be five containers in the log.
• The four command containers (2 through 5) will print the
name of the node on which the container was run.
Distributed-Shell Examples with Shell
Arguments
• Arguments can be added to the shell command using the -
shell_args option. For example, to do a ls -l in the directory from where
the shell command was run, we can use the following commands:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar
$YARN_DS -shell_command ls -shell_args -l
• The resulting output from the log file is as follows:
• total 20
-rw-r--r-- 1 yarn hadoop 74 May 28 10:37 container_tokens
-rwx------ 1 yarn hadoop 643 May 28 10:37
default_container_executor_session.sh
-rwx------ 1 yarn hadoop 697 May 28 10:37 default_container_executor.sh
-rwx------ 1 yarn hadoop 1700 May 28 10:37 launch_container.sh
drwx--x--- 2 yarn hadoop 4096 May 28 10:37 tmp
Snapshot
• The HDFS snapshot example that follows uses the war-and-peace.txt file, which is present in HDFS:
• Found 1 items
-rw-r--r-- 2 hdfs hdfs 3288746 2015-06-24 19:56 /user/hdfs/war-and-peace-input/war-and-peace.txt
Snapshot
• If the file is deleted, it can be restored from the snapshot:
• $ hdfs dfs -rm -skipTrash /user/hdfs/war-and-peace-input/war-and-peace.txt
Deleted /user/hdfs/war-and-peace-input/war-and-peace.txt
• $ hdfs dfs -ls /user/hdfs/war-and-peace-input/
• The restoration process is basically a simple copy from the snapshot to the previous directory (or anywhere else).
• Note the use of the .snapshot/wapi-snap-1 path under the snapshottable directory to restore the file:
• $ hdfs dfs -cp /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-1/war-and-peace.txt /user/hdfs/war-and-peace-input
Snapshot
• To delete a snapshot, give the following command:
• $ hdfs dfs -deleteSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
• To make a directory "un-snapshottable" (or go back to the default state), use the following command:
• $ hdfs dfsadmin -disallowSnapshot /user/hdfs/war-and-peace-input
Disallowing snapshot on /user/hdfs/war-and-peace-input succeeded
Configuring an NFSv3 Gateway to
HDFS
• HDFS supports an NFS version 3 (NFSv3) gateway.
• This feature enables files to be easily moved between HDFS and client
systems.
• The NFS gateway supports NFSv3 and allows HDFS to be mounted as part
of the client’s local file system.
• Currently the NFSv3 gateway supports the following capabilities:
– Users can browse the HDFS file system through their local file system using an
NFSv3 client-compatible operating system.
– Users can download files from the HDFS file system to their local file system.
– Users can upload files from their local file system directly to the HDFS file
system.
– Users can stream data directly to HDFS through the mount point. File append
is supported, but random write is not supported.
• The gateway must be run on the same host as a DataNode, NameNode, or
any HDFS client
• Several properties need to be added to the /etc/hadoop/conf/core-site.xml file. (Here root is the proxy user that will run the gateway; substitute a different user name if needed.)
• <property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
• Next, move to the Advanced hdfs-site.xml section
and set the following property:
• <property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>3600000</value>
</property>
• This property ensures client mounts with access
time updates work properly. (See
the mount default atime option.)
• Finally, move to the Custom hdfs-site section, click the Add Property
link, and add the following property:
• <property>
  <name>dfs.nfs3.dump.dir</name>
  <value>/tmp/.hdfs-nfs</value>
</property>
• The NFSv3 dump directory is needed because the NFS client often
reorders writes.
• Sequential writes can arrive at the NFS gateway in random order.
• This directory is used to temporarily save out-of-order writes before
writing to HDFS.
• Make sure the dump directory has enough space.
• For example, if the application uploads 10 files, each of size 100MB,
it is recommended that this directory have 1GB of space to cover a
worst-case write reorder for every file.
• To confirm the gateway is working, issue the following
command. The output should look like the following:
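• The command and its output are missing from the slide; typical checks (host name illustrative) use the standard NFS client utilities:
• $ rpcinfo -p limulus
• $ showmount -e limulus
• rpcinfo should list the portmapper, mountd, and nfs services, and showmount should report the HDFS root (/) in the export list.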
CAPACITY SCHEDULER
• The Capacity scheduler is the default scheduler for YARN that
enables multiple groups to securely share a large Hadoop cluster.
• Developed by the original Hadoop team at Yahoo!, the Capacity
scheduler has successfully run many of the largest Hadoop clusters.
• To use the Capacity scheduler, one or more queues are configured
with a predetermined fraction of the total slot (or processor)
capacity.
• This assignment guarantees a minimum amount of resources for
each queue.
• Administrators can configure soft limits and optional hard limits on
the capacity allocated to each queue.
• Each queue has strict ACLs (Access Control Lists) that control which
users can submit applications to individual queues.
• Also, safeguards are in place to ensure that users cannot view or
modify applications from other users.
CAPACITY SCHEDULER
• The Capacity scheduler permits sharing a cluster while giving each user or
group certain minimum capacity guarantees.
• These minimum amounts are not given away in the absence of demand (i.e., a group is always guaranteed that its minimum share of resources is available).
• Excess slots are given to the most starved queues, based on the number of
running tasks divided by the queue capacity.
• Thus, the fullest queues as defined by their initial minimum capacity
guarantee get the most needed resources.
• Idle capacity can be assigned and provides elasticity for the users in a cost-
effective manner.
• Administrators can change queue definitions and properties, such as
capacity and ACLs, at run time without disrupting users.
• They can also add more queues at run time, but cannot delete queues at
run time.
• In addition, administrators can stop queues at run time to ensure that
while existing applications run to completion, no new applications can be
submitted.
CAPACITY SCHEDULER
• The Capacity scheduler currently supports memory-intensive
applications, where an application can optionally specify higher
memory resource requirements than the default.
• Using information from the NodeManagers, the Capacity scheduler
can then place containers on the best-suited nodes.
• The Capacity scheduler works best when the workloads are well
known, which helps in assigning the minimum capacity.
• For this scheduler to work most effectively, each queue should be
assigned a minimal capacity that is less than the maximal expected
workload.
• Within each queue, multiple jobs are scheduled using hierarchical FIFO (first-in, first-out) queues, similar to the approach used with the stand-alone FIFO scheduler.
• If there are no queues configured, all jobs are placed in the default
queue
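• The slides do not show a queue definition; a minimal sketch of how two queues might be declared in capacity-scheduler.xml (queue names and percentages are illustrative):
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,prod</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>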
CONFIGURING YARN
• In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so
that processing is not constrained by any one of these cluster resources.
• As a general recommendation, we’ve found that allowing for 1-2
Containers per disk and per core gives the best balance for cluster
utilization.
• So with our example cluster node with 12 disks and 12 cores,
• we will allow for 20 maximum Containers to be allocated to each node.
• Each machine in our cluster has 48 GB of RAM.
• Some of this RAM should be reserved for Operating System usage.
• On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for
the Operating System.
• The following property sets the maximum memory YARN can utilize on the
node:
Container Memory Config
• In yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>
• The next step is to provide YARN guidance on how to break up the total resources available into Containers.
• You do this by specifying the minimum unit of RAM to allocate for a Container.
• We want to allow for a maximum of 20 Containers, and thus need (40 GB total RAM) / (20 Containers) = 2 GB minimum per Container:
• In yarn-site.xml:
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
• YARN will allocate Containers with RAM amounts of at least yarn.scheduler.minimum-allocation-mb.
CONFIGURING MAPREDUCE 2
• MapReduce 2 runs on top of YARN and utilizes YARN Containers to
schedule and execute its map and reduce tasks.
• When configuring MapReduce 2 resource utilization on YARN, there are
three aspects to consider:
• Physical RAM limit for each Map and Reduce task
• The JVM heap size limit for each task
• The amount of virtual memory each task will get
• You can define how much maximum memory each Map and Reduce task
will take.
• Since each Map and each Reduce will run in a separate Container, these
maximum memory settings should be at least equal to or more than the
YARN minimum Container allocation.
• For our example cluster, we have the minimum RAM for a Container
(yarn.scheduler.minimum-allocation-mb) = 2 GB.
• We’ll thus assign 4 GB for Map task Containers, and 8 GB for Reduce tasks
Containers.
Map & Reduce – Memory Config
• In mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>
• Each Container will run JVMs for the Map and Reduce tasks.
• The JVM heap size should be set lower than the Map and Reduce memory defined above, so that it stays within the bounds of the Container memory allocated by YARN.
Heap Config
• In mapred-site.xml:
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6144m</value>
</property>
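• The third aspect listed in the MapReduce 2 configuration, virtual memory per task, is not configured on these slides; it is normally governed by the yarn.nodemanager.vmem-pmem-ratio property in yarn-site.xml (default 2.1), for example:
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>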