Bda - Module Ii


MODULE II

• Essential Hadoop Tools,


• Hadoop YARN Applications,
• Managing Hadoop with Apache Ambari,
• Basic Hadoop Administration Procedures
Essential Hadoop Tools
• APACHE PIG
• Apache Pig is a high-level language that enables
programmers to write complex MapReduce
transformations using a simple scripting language.
• Pig Latin (the actual language) defines a set of
transformations on a data set such as aggregate, join,
and sort.
• Pig is often used for extract, transform, and load (ETL)
data pipelines, quick research on raw data, and iterative
data processing.
APACHE PIG
• Apache Pig has several usage modes.
• The first is a local mode in which all processing
is done on the local machine.
• The non-local (cluster) modes are MapReduce
and Tez.
• These modes execute the job on the cluster
using either the MapReduce engine or the
optimized Tez engine.
Apache Pig Examples
• To begin the example, copy the passwd file to
a working directory for local Pig operation:
• $ cp /etc/passwd .
• Next, copy the data file into HDFS for Hadoop
MapReduce operation:
• $ hdfs dfs -put passwd passwd
Apache Pig Examples
• You can confirm the file is in HDFS by entering
the following command:
• $ hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
Apache Pig Examples
• In the following example of local Pig operation, all
processing is done on the local machine (Hadoop is not
used).
• First, the interactive command line is started:
• $ pig -x local
• If Pig starts correctly, you will see a grunt> prompt.
• You may also see a bunch of INFO messages, which you can
ignore.
• Next, enter the following commands to load the passwd file,
then grab the user name and dump it to the terminal.
• Note that Pig commands must end with a semicolon (;).
Apache Pig Examples
• grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;

• grunt> quit
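For readers new to Pig Latin, the three statements above can be mirrored in plain Python. This is only a sketch of what the load, foreach, and dump steps compute; the function name and the sample lines are hypothetical, and Pig itself prints each result as a tuple such as (root).

```python
# Plain-Python equivalent of: load with ':' delimiter, keep field $0, dump.
def first_fields(lines, delimiter=":"):
    """Return the first delimited field of every non-empty line."""
    return [line.split(delimiter)[0] for line in lines if line.strip()]


# Hypothetical sample lines in /etc/passwd format (not taken from a real system).
sample = [
    "root:x:0:0:root:/root:/bin/bash",
    "hdfs:x:1001:1001::/home/hdfs:/bin/bash",
]

for user_id in first_fields(sample):
    print(user_id)
```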
Apache Pig Examples - Mode
• To use Hadoop MapReduce, start Pig as
follows (or just enter pig):
• $ pig -x mapreduce
• To use the Tez engine instead, start Pig as follows:
• $ pig -x tez
Pig Script
• /* id.pig */
A = load 'passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
dump B;
store B into 'id.out'; -- write the results to a directory named id.out
• Remove any previous id.out directory, then run the script in local mode:
• $ /bin/rm -r id.out/
$ pig -x local id.pig
Pig Script in Mapreduce mode
• To run the MapReduce version,
– use the same procedure;
– the only difference is that now all reading and
writing takes place in HDFS.
• $ hdfs dfs -rm -r id.out
$ pig id.pig
Apache Pig example
• Execute the Load Statement
• Now load the data from the
file student_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
• grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',') as ( id:int,
firstname:chararray, lastname:chararray,
phone:chararray, city:chararray );
Apache Pig Example
• Following is the description of the above statement.
• Relation name
– We have stored the data in the relation student.
• Input file path
– We are reading data from the file student_data.txt, which is in the
/pig_data/ directory of HDFS.
• Storage function
– We have used the PigStorage() function. It loads and stores data as
structured text files. It takes as a parameter the delimiter that separates
the fields of a tuple. By default, the delimiter is '\t' (tab).
• Schema
– We have stored the data using the following schema.
column    id   firstname  lastname   phone      city
datatype  int  chararray  chararray  chararray  chararray
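The effect of the delimiter and the schema can be illustrated outside Pig. The following is a minimal Python sketch, assuming a comma-delimited record; the sample record and the helper name are hypothetical, and Python's int and str stand in for Pig's int and chararray.

```python
# Schema from the LOAD statement: field name and a Python type standing in
# for the Pig type (int -> int, chararray -> str).
SCHEMA = [
    ("id", int),
    ("firstname", str),
    ("lastname", str),
    ("phone", str),
    ("city", str),
]


def parse_record(line, delimiter=","):
    """Split one line on the delimiter and cast each field per the schema."""
    fields = line.rstrip("\n").split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


# Hypothetical record, not taken from student_data.txt.
record = parse_record("1,Asha,Rao,5550100,Mysuru")
print(record)
```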
Apache Hive
• Apache Hive is a data warehouse infrastructure built on top of
Hadoop for providing data summarization, ad hoc queries, and
the analysis of large data sets using a SQL-like language called
HiveQL.
• Hive is considered the de facto standard for interactive SQL queries
over petabytes of data using Hadoop and offers the following
features:
• Tools to enable easy data extraction, transformation, and loading
(ETL)
• A mechanism to impose structure on a variety of data formats
• Access to files stored either directly in HDFS or in other data
storage systems such as HBase
• Query execution via MapReduce and Tez (optimized MapReduce)
Apache Hive
• Hive provides users who are already familiar with
SQL the capability to query the data on Hadoop
clusters.
• At the same time, Hive makes it possible for
programmers who are familiar with the
MapReduce framework to add their custom
mappers and reducers to Hive queries.
• Hive queries can also be dramatically accelerated
using the Apache Tez framework under YARN in
Hadoop version 2
Hive Example Walk-Through
• To start Hive, simply enter the hive command.
If Hive starts correctly, you should get
a hive> prompt.
• $ hive
(some messages may show up here)

• hive>
Hive Example Walk-Through
• As a simple test, create and drop a table. Note that
Hive commands must end with a semicolon (;).
• hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
Hive Example Walk-Through
• A more detailed example can be developed using
a web server log file to summarize message
types. First, create a table using the following
command:
• hive> CREATE TABLE logs(t1 string, t2 string, t3
string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED
BY ' ';

OK
Time taken: 0.129 seconds
Hive Example Walk-Through
• Next, load the data—in this case, from
the sample.log file.
• Note that the file is found in the local directory and not
in HDFS.
• hive> LOAD DATA LOCAL INPATH 'sample.log'
OVERWRITE INTO TABLE logs;

Loading data to table default.logs


Table default.logs stats: [numFiles=1, numRows=0,
totalSize=99271, rawDataSize=0]
OK
Time taken: 0.953 seconds
Hive Example Walk-Through
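• Finally, summarize the message types with a SELECT query. The counts
appear at the end of the output:
• hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%'
GROUP BY t4;

Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368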
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
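The logic of this query (split each line on spaces, keep rows whose fourth field starts with "[", and count per value) can be sketched in Python. The log lines below are hypothetical and only mimic a space-delimited layout; they are not from sample.log.

```python
from collections import Counter


def count_severities(lines):
    """Count values of the 4th space-delimited field that start with '['."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        # Equivalent of: WHERE t4 LIKE '[%' ... GROUP BY t4
        if len(fields) >= 4 and fields[3].startswith("["):
            counts[fields[3]] += 1
    return counts


# Hypothetical log lines for illustration only.
sample_log = [
    "2015-03-27 13:00:01 web01 [INFO] service started",
    "2015-03-27 13:00:02 web01 [WARN] disk usage high",
    "2015-03-27 13:00:03 web01 [INFO] request served",
    "malformed line",
]

for severity, count in sorted(count_severities(sample_log).items()):
    print(severity, count)
```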
Hive Example Walk-Through
• To exit Hive, simply type exit;:
• hive> exit;
A More Advanced Hive Example
• The data files are collected from the MovieLens
website (http://movielens.org).
• The files contain various numbers of movie
reviews, starting at 100,000 and going up to 20 million
entries.
A More Advanced Hive Example
• In this example, 100,000 records will be
transformed
• from
– userid, movieid, rating, unixtime
• to
– userid, movieid, rating, and weekday
• using Apache Hive and a Python program
• (i.e., the UNIX time notation will be transformed
to the day of the week).
A More Advanced Hive Example
• The first step is to download and extract the
data:
• $ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

$ unzip ml-100k.zip

$ cd ml-100k
A More Advanced Hive Example
• Before we use Hive, we will create a short Python program
called weekday_mapper.py with following contents:
• import sys
import datetime

for line in sys.stdin:
    # each input line holds four tab-separated fields
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # convert the UNIX time to the ISO weekday (1 = Monday ... 7 = Sunday)
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # a parenthesized print works under both Python 2 and Python 3
    print('\t'.join([userid, movieid, rating, str(weekday)]))
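The per-line transformation in weekday_mapper.py can be checked without Hive by wrapping it in a function. This sketch assumes UTC so that the result is reproducible on any machine (the script itself uses the local time zone); the function name and the sample records are hypothetical.

```python
import datetime


def map_line(line):
    """Turn 'userid<TAB>movieid<TAB>rating<TAB>unixtime' into a weekday record."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    # Assumption: interpret the timestamp in UTC for a deterministic result.
    moment = datetime.datetime.fromtimestamp(
        float(unixtime), tz=datetime.timezone.utc
    )
    return "\t".join([userid, movieid, rating, str(moment.isoweekday())])


# UNIX time 0 is Thursday, 1 January 1970 (ISO weekday 4).
print(map_line("1\t10\t5\t0"))
```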
A More Advanced Hive Example
• Next, start Hive and create the data table
(u_data) by entering the following at
the hive> prompt:
• CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
A More Advanced Hive Example
• Load the movie data into the table with the
following command:
• hive> LOAD DATA LOCAL INPATH './u.data'
OVERWRITE INTO TABLE u_data;
A More Advanced Hive Example
• hive> SELECT COUNT(*) FROM u_data;
• This command will start a single MapReduce job and
should finish with the following lines:
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.26 sec HDFS Read: 1979380 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
100000
Time taken: 28.366 seconds, Fetched: 1 row(s)
A More Advanced Hive Example
• Now that the table data are loaded, use the
following command to make the new table
(u_data_new):
• hive> CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
A More Advanced Hive Example
• The next command adds
the weekday_mapper.py to Hive resources:
• hive> add FILE weekday_mapper.py;
A More Advanced Hive Example
• Once weekday_mapper.py is successfully loaded,
we can enter the transformation query:
• hive> INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
A More Advanced Hive Example
• If the transformation was successful, the following final
portion of the output should be displayed:
...
Table default.u_data_new stats: [numFiles=1, numRows=100000, totalSize=1179173, rawDataSize=1079173]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.44 sec HDFS Read: 1979380 HDFS Write: 1179256 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 440 msec
OK
Time taken: 24.06 seconds
A More Advanced Hive Example
• The final query will sort and group the reviews by weekday:
• hive> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
• Final output for the review counts by weekday should look like the following:
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.39 sec HDFS Read: 1179386 HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 390 msec
OK
1 13278
2 14816
3 15426
4 13774
5 17964
6 12318
7 12424
Time taken: 22.645 seconds, Fetched: 7 row(s)
HIVE- Drop a Table
• $ hive -e 'drop table u_data_new'
$ hive -e 'drop table u_data'
USING APACHE SQOOP TO ACQUIRE
RELATIONAL DATA
• Sqoop is a tool designed to transfer data between
Hadoop and relational databases.
• You can use Sqoop to import data from a relational
database management system (RDBMS) into the
Hadoop Distributed File System (HDFS),
transform the data in Hadoop, and then export the
data back into an RDBMS.
• Sqoop can be used with any Java Database
Connectivity (JDBC)-compliant database and has been
tested on Microsoft SQL Server, PostgreSQL, MySQL,
and Oracle.
Apache Sqoop Import and Export
Methods
• Figure 7.1 describes the Sqoop data import (to HDFS)
process.
• The data import is done in two steps.
• In the first step, shown in the figure, Sqoop examines
the database to gather the necessary metadata for the
data to be imported.
• The second step is a map-only (no reduce step) Hadoop
job that Sqoop submits to the cluster.
• This job does the actual data transfer using the
metadata captured in the previous step.
• Note that each node doing the import must have
access to the database.
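The two import steps can be imitated on a single machine. The sketch below is not Sqoop: it assumes Python's built-in sqlite3 module as a stand-in for the relational database, and the table contents and function name are illustrative. It first gathers the column metadata and then transfers the rows as comma-delimited, newline-terminated records.

```python
import sqlite3


def import_table(conn, table):
    """Imitate the two import steps against a SQLite stand-in database."""
    # Step 1: examine the database to gather metadata (the column names).
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    # Step 2: transfer the rows using that metadata, one record per line.
    select = f"SELECT {', '.join(columns)} FROM {table}"
    records = [",".join(str(v) for v in row) for row in conn.execute(select)]
    return columns, "\n".join(records) + "\n"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City (ID INTEGER, Name TEXT, CountryCode TEXT)")
conn.executemany(
    "INSERT INTO City VALUES (?, ?, ?)",
    [(1, "Kabul", "AFG"), (2, "Qandahar", "AFG")],
)

columns, text = import_table(conn, "City")
print(columns)
print(text, end="")
```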
Data Import
• The imported data are saved in an HDFS directory.
• Sqoop will use the database name for the directory, or
the user can specify any alternative directory where
the files should be populated.
• By default, these files contain comma-delimited fields,
with new lines separating different records.
• You can easily override the format in which data are
copied over by explicitly specifying the field separator
and record terminator characters.
• Once placed in HDFS, the data are ready for processing.
Data Export
• The export step again uses a map-only
Hadoop job to write the data to the database.
• Sqoop divides the input data set into splits,
then uses individual map tasks to push the
splits to the database.
• Again, this process assumes the map tasks
have access to the database.
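The split-and-push idea behind the export can be sketched as follows. This is a simplified, sequential illustration (real map tasks run in parallel on the cluster); sqlite3 stands in for the target database, and the records and helper names are hypothetical.

```python
import sqlite3


def make_splits(records, num_splits):
    """Divide records into num_splits contiguous, nearly equal chunks."""
    size, extra = divmod(len(records), num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        end = start + size + (1 if i < extra else 0)
        splits.append(records[start:end])
        start = end
    return splits


def export_split(conn, split):
    """Stand-in for one map task pushing its split to the database."""
    conn.executemany("INSERT INTO CityExport VALUES (?, ?)", split)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CityExport (ID INTEGER PRIMARY KEY, Name TEXT)")

# Hypothetical input records.
records = [(i, f"city-{i}") for i in range(1, 11)]
splits = make_splits(records, 4)
for split in splits:
    export_split(conn, split)

row_count = conn.execute("SELECT COUNT(*) FROM CityExport").fetchone()[0]
print([len(s) for s in splits], row_count)
```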
Connectors
Sqoop Example Walk-Through
• To install Sqoop using the HDP distribution RPM files, simply enter:
• # yum install sqoop sqoop-metastore

• For this example, we will use the world example database from the
MySQL site (http://dev.mysql.com/doc/world-setup/en/index.html).
• This database has three tables:
• Country: information about countries of the world
• City: information about some of the cities in those countries
• CountryLanguage: languages spoken in each country
• To get the database, use wget to download and then extract the
file:
• $ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
$ gunzip world_innodb.sql.gz
• Next, log into MySQL (this assumes you have privileges to create a
database) and import the desired database by following these steps:
• $ mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City |
| Country |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
Add Sqoop User Permissions for the
Local Machine and Cluster
• In MySQL, add the following privileges for
user sqoop to MySQL.
• Note that you must use both the local host name and
the cluster subnet for Sqoop to work properly.
• Also, for the purposes of this example, the sqoop
password is sqoop.
• mysql> GRANT ALL PRIVILEGES ON world.* TO
'sqoop'@'limulus' IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* TO
'sqoop'@'10.0.0.%' IDENTIFIED BY 'sqoop';
mysql> quit
• Next, log in as sqoop to test the permissions:
• $ mysql -u sqoop -p
mysql> USE world;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City |
| Country |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
mysql> quit
Import Data Using Sqoop
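• Sqoop can list the databases available on the MySQL server:
• $ sqoop list-databases --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop
information_schema
test
world
• In a similar fashion, Sqoop can list the tables in the world database:
• $ sqoop list-tables --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop
City
Country
CountryLanguage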
• To import data, we need to make a directory in
HDFS:
• $ hdfs dfs -mkdir sqoop-mysql-import

• Then import the Country table into that directory:
• $ sqoop import --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop --table Country -m 1 \
  --target-dir /user/hdfs/sqoop-mysql-import/country
• The import can be confirmed by examining HDFS:
• $ hdfs dfs -ls sqoop-mysql-import/country
Found 2 items
-rw-r--r-- 2 hdfs hdfs 0 2014-08-18 16:47 sqoop-mysql-import/country/_SUCCESS
-rw-r--r-- 2 hdfs hdfs 31490 2014-08-18 16:47 sqoop-mysql-import/country/part-m-00000
• The file can be viewed using the hdfs dfs -cat command:
• $ hdfs dfs -cat sqoop-mysql-import/country/part-m-00000
ABW,Aruba,North America,Caribbean,193.0,null,103000,78.4,828.0,793.0,Aruba,Nonmetropolitan Territory of The Netherlands,Beatrix,129,AW
Options File
• To make the Sqoop command more convenient, you can create an options
file and use it on the command line.
• Such a file enables you to avoid having to rewrite the same options.
• For example, a file called world-options.txt with the following contents will
include the import command and the --connect, --username, and
--password options:

• import
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop
• $ sqoop --options-file world-options.txt --table City -m 1 --target-dir
/user/hdfs/sqoop-mysql-import/city
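How the options file combines with the rest of the command line can be pictured with a short Python sketch. It assumes one option or value per line, as in the file above; skipping blank lines and lines starting with # is a convenience of this sketch, and the function name is hypothetical.

```python
def read_options(text):
    """Return the options in file order, one per non-empty, non-comment line."""
    options = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            options.append(line)
    return options


# Contents of world-options.txt as shown above.
OPTIONS_TEXT = """import
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop
"""

# The remaining options come from the command line itself.
argv = ["sqoop"] + read_options(OPTIONS_TEXT) + [
    "--table", "City", "-m", "1",
    "--target-dir", "/user/hdfs/sqoop-mysql-import/city",
]
print(" ".join(argv))
```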
SQL Query in the import step
• It is also possible to include an SQL Query in
the import step.
• For example, suppose we want just cities in
Canada:
• $ sqoop --options-file world-options.txt -m 1 \
  --target-dir /user/hdfs/sqoop-mysql-import/canada-city \
  --query "SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS"
SQL Query in the import step
• Inspecting the results confirms that only cities from Canada
have been imported:
• $ hdfs dfs -cat sqoop-mysql-import/canada-city/part-m-00000
1810,Montréal
1811,Calgary
1812,Toronto
...
1856,Sudbury
1857,Kelowna
1858,Barrie
Split Option in Query
• With the --split-by option, the query can be spread over several mappers
(here -m 4), each of which writes its own part file:
• $ sqoop --options-file world-options.txt -m 4 \
  --target-dir /user/hdfs/sqoop-mysql-import/canada-city \
  --query "SELECT ID,Name from City WHERE CountryCode='CAN' AND \$CONDITIONS" \
  --split-by ID
• $ hdfs dfs -ls sqoop-mysql-import/canada-city
Found 5 items
-rw-r--r-- 2 hdfs hdfs 0 2014-08-18 21:31 sqoop-mysql-import/canada-city/_SUCCESS
-rw-r--r-- 2 hdfs hdfs 175 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00000
-rw-r--r-- 2 hdfs hdfs 153 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00001
-rw-r--r-- 2 hdfs hdfs 186 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00002
-rw-r--r-- 2 hdfs hdfs 182 2014-08-18 21:31 sqoop-mysql-import/canada-city/part-m-00003
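One way to picture the role of $CONDITIONS together with --split-by is that each mapper runs the same query with the placeholder replaced by its own range of the split column. The sketch below is a simplified illustration, not Sqoop's actual boundary computation; the ID range 1810 to 1858 is taken from the earlier output, and the function name is hypothetical.

```python
def split_conditions(column, lo, hi, num_mappers):
    """Return one range predicate per mapper, together covering lo..hi."""
    size, extra = divmod(hi - lo + 1, num_mappers)
    conditions, start = [], lo
    for i in range(num_mappers):
        end = start + size + (1 if i < extra else 0)
        conditions.append(f"{column} >= {start} AND {column} < {end}")
        start = end
    return conditions


QUERY = "SELECT ID,Name from City WHERE CountryCode='CAN' AND $CONDITIONS"

conditions = split_conditions("ID", 1810, 1858, 4)
for condition in conditions:
    # Each mapper would run the query with its own predicate substituted.
    print(QUERY.replace("$CONDITIONS", condition))
```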
Export Data from HDFS to MySQL
• Sqoop can also be used to export data from
HDFS.
• The first step is to create tables for exported data.
• There are actually two tables needed for each
exported table.
• The first table holds the exported data
(CityExport), and the second is used for staging
the exported data (CityExportStaging).
• Enter the following MySQL commands to create
these tables:
• mysql> CREATE TABLE `CityExport` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Name` char(35) NOT NULL DEFAULT '',
`CountryCode` char(3) NOT NULL DEFAULT '',
`District` char(20) NOT NULL DEFAULT '',
`Population` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`));
mysql> CREATE TABLE `CityExportStaging` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Name` char(35) NOT NULL DEFAULT '',
`CountryCode` char(3) NOT NULL DEFAULT '',
`District` char(20) NOT NULL DEFAULT '',
`Population` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`));
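• Next, using an options file like world-options.txt but with the export
command in place of import (here called cities-export-options.txt),
export the city data to MySQL:
• $ sqoop --options-file cities-export-options.txt \
  --table CityExport --staging-table CityExportStaging \
  --clear-staging-table -m 4 \
  --export-dir /user/hdfs/sqoop-mysql-import/city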
• mysql> select * from CityExport limit 10;
+----+----------------+-------------+---------------+------------+
| ID | Name           | CountryCode | District      | Population |
+----+----------------+-------------+---------------+------------+
|  1 | Kabul          | AFG         | Kabol         |    1780000 |
|  2 | Qandahar       | AFG         | Qandahar      |     237500 |
|  3 | Herat          | AFG         | Herat         |     186800 |
• To remove a table, enter this command:
– mysql> drop table `CityExportStaging`;
• To remove the data in a table, enter this command:
– mysql> delete from CityExportStaging;
• To clean up imported files, enter this command:
– $ hdfs dfs -rm -r -skipTrash sqoop-mysql-import/{country,city,canada-city}
FLUME
FLUME AGENT
USING APACHE FLUME TO ACQUIRE
DATA STREAMS
• Apache Flume is an independent agent
designed to collect, transport, and store data
into HDFS.
• Often data transport involves a number of
Flume agents that may traverse a series of
machines and locations.
• Flume is often used for log files, social media-
generated data, email messages, and just
about any continuous data source.
Flume agent is composed of three
components
• Source. The source component receives data and
sends it to a channel. It can send the data to
more than one channel. The input data can be
from a real-time source (e.g., weblog) or another
Flume agent.
• Channel. A channel is a data queue that forwards
the source data to the sink destination. It can be
thought of as a buffer that manages input
(source) and output (sink) flow rates.
• Sink. The sink delivers data to a destination such as
HDFS, a local file, or another Flume agent.
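The relationship between the three components can be modeled with a bounded queue. This is a toy, single-threaded Python sketch of the buffering idea, not Flume code; the function name, the capacity, and the events are hypothetical.

```python
import queue


def run_agent(events, capacity=2):
    """Move events from a source to a sink through a bounded channel."""
    channel = queue.Queue(maxsize=capacity)  # buffer between source and sink
    pending = list(events)
    delivered = []
    while pending or not channel.empty():
        # Source: push incoming events until the channel is full.
        while pending and not channel.full():
            channel.put(pending.pop(0))
        # Sink: take one event from the channel and deliver it.
        if not channel.empty():
            delivered.append(channel.get())
    return delivered


print(run_agent(["event-1", "event-2", "event-3"]))
```

The bounded capacity is what lets the channel absorb a source that is temporarily faster than the sink.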
Flume Pipeline Agent
• In a Flume pipeline, the sink from one agent is
connected to the source of another.
• The data transfer format normally used by Flume,
which is called Apache Avro, provides several useful
features.
• First, Avro is a data serialization/deserialization system
that uses a compact binary format.
• The schema is sent as part of the data exchange and is
defined using JSON (JavaScript Object Notation).
• Avro also uses remote procedure calls (RPCs) to send
data. That is, an Avro sink will contact an Avro source
to send data.
Flume Consolidation
Flume Example Walk-Through
• To install Flume, enter the following as root:
• # yum install flume flume-agent
• In addition, for the simple example, telnet will
be needed:
• # yum install telnet
Flume Simple Test
• A simple test of Flume can be done on a single
machine.
• To start the Flume agent, enter the flume-ng command shown here.
• This command uses the simple-example.conf file to configure the agent.
• $ flume-ng agent --conf conf --conf-file simple-example.conf \
  --name simple_agent -Dflume.root.logger=INFO,console
• In another terminal window, use telnet to contact
the agent:
• $ telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection
refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
testing 1 2 3
OK
• If Flume is working correctly, the window
where the Flume agent was started will show
the testing message entered in the telnet
window:
• 14/08/14 16:20:58 INFO sink.LoggerSink: Event: { headers:{} body: 74 65 73 74 69 6E 67 20 20 31 20 32 20 33 0D testing 1 2 3. }
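The body in the logger output is the hexadecimal form of the bytes received from telnet. A short Python check, using only the hex string shown above, decodes it back to text; the trailing 0D is the carriage return sent by telnet, which the logger prints as a dot.

```python
# Hex bytes copied from the LoggerSink event body above.
BODY_HEX = "74 65 73 74 69 6E 67 20 20 31 20 32 20 33 0D"

body = bytes.fromhex(BODY_HEX)   # whitespace between byte pairs is allowed
text = body.decode("ascii")

print(repr(text))
```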
Weblog Example

• In this example, a record from the weblogs from
the local machine (Ambari output) will be placed
into HDFS using Flume.
• This example is easily modified to use other
weblogs from different machines.
• Two files are needed to configure Flume.
– web-server-target-agent.conf—the target Flume agent
that writes the data to HDFS
– web-server-source-agent.conf—the source Flume
agent that captures the weblog data
• The weblog is also mirrored on the local file
system by the agent that writes to HDFS.
• To run the example, create the directory
as root:
• # mkdir /var/log/flume-hdfs
# chown hdfs:hadoop /var/log/flume-hdfs/
• Next, as user hdfs, make a Flume data
directory in HDFS:
• $ hdfs dfs -mkdir /user/hdfs/flume-channel/
• Start the Flume target agent:
• $ flume-ng agent -c conf -f web-server-target-agent.conf -n collector
• This agent writes the data into HDFS and
should be started before the source agent.
(The source reads the weblogs.)
• Then start the source agent as root:
• # flume-ng agent -c conf -f web-server-source-agent.conf -n source_agent
Flume Result Checking
• To see if Flume is working correctly, check the
local log by using the tail command.
• Also confirm that the flume-ng agents are not
reporting any errors (the file name will vary).
• $ tail -f /var/log/flume-hdfs/1430164482581-1
• To inspect the contents in HDFS:
– $ hdfs dfs -tail flume-channel/apache_access_combined/150427/FlumeData.1430164801381
• The source is set in the web-server-source-agent.conf file.
• The HDFS settings are placed in the web-server-target-agent.conf file.
HBASE
• HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is an open-source project and is horizontally
scalable.
• HBase is a data model, similar to Google's Bigtable, designed
to provide quick random access to huge amounts of structured
data. It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
• One can store the data in HDFS either directly or through HBase.
Data consumer reads/accesses the data in HDFS randomly using
HBase. HBase sits on top of the Hadoop File System and provides
read and write access.
Storage Mechanism in HBase
• HBase is a column-oriented database and the tables in it are sorted
by row.
• The table schema defines only column families, which hold
key-value pairs.
• A table can have multiple column families, and each column family can
have any number of columns.
• Subsequent column values are stored contiguously on the disk.
• Each cell value of the table has a timestamp.
• In short, in an HBase:
– Table is a collection of rows.
– Row is a collection of column families.
– Column family is a collection of columns.
– Column is a collection of key value pairs.
Features
• Linear and modular scalability
• Strictly consistent reads and writes
• Automatic and configurable sharding of tables
• Automatic failover support between
RegionServers
• Convenient base classes for backing Hadoop
MapReduce jobs with Apache HBase tables
• Easy-to-use Java API for client access
HBase Data Model Overview
• A table in HBase is similar to other databases, having rows and columns.
• Columns in HBase are grouped into column families, all with the same
prefix.
• For example, consider a table of daily stock prices.
• There may be a column family called “price” that has four members—
price:open, price:close, price:low, and price:high.
• A column does not need to be a family.
• For instance, the stock table may have a column named “volume”
indicating how many shares were traded.
• All column family members are stored together in the physical file system.
• Specific HBase cell values are identified by a row key, column (column
family and column), and version (timestamp).
• It is possible to have many versions of data within an HBase cell.
• A version is specified as a timestamp and is created each time data are
written to a cell.
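The addressing scheme described above (row key, column, and timestamped versions) can be modeled with nested dictionaries. This is a toy in-memory sketch for illustration only, not the HBase client API; the class name and the second price value are hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Toy model: row key -> 'family:qualifier' -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        # Every write adds a new version instead of overwriting the cell.
        self._rows[row][column].append((timestamp, value))

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest at or before timestamp."""
        versions = self._rows[row][column]
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        if not versions:
            return None
        return max(versions, key=lambda v: v[0])[1]


table = TinyTable()
table.put("6-May-15", "price:open", "126.56", timestamp=1)
table.put("6-May-15", "price:open", "126.60", timestamp=2)  # hypothetical update

print(table.get("6-May-15", "price:open"))
print(table.get("6-May-15", "price:open", timestamp=1))
```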
HBase Example Walk-Through
• HBase provides a shell for interactive use. To
enter the shell, type the following as a user:
• $ hbase shell
• For this example, we will use a small set of daily stock price data
for Apple Computer. The data can be downloaded with
the following command:
• $ wget -O Apple-stock.csv http://www.google.com/finance/historical?q=NASDAQ:AAPL\&authuser=0\&output=csv
Create the Database

• The next step is to create the database in
HBase using the following command:
• hbase(main):006:0> create 'apple', 'price', 'volume'
0 row(s) in 0.8150 seconds
• For instance, the preceding data can be
entered by using the following commands:
• put 'apple','6-May-15','price:open','126.56'
put 'apple','6-May-15','price:high','126.75'
put 'apple','6-May-15','price:low','123.36'
put 'apple','6-May-15','price:close','125.01'
put 'apple','6-May-15','volume','71820387'
Inspect the Database
• The entire database can be listed using the scan command. Be
careful when using this command with large databases. This
example is for one row.
• hbase(main):006:0> scan 'apple'
ROW COLUMN+CELL
6-May-15 column=price:close, timestamp=1430955128359, value=125.01
6-May-15 column=price:high, timestamp=1430955126024, value=126.75
6-May-15 column=price:low, timestamp=1430955126053, value=123.36
6-May-15 column=price:open, timestamp=1430955125977, value=126.56
6-May-15 column=volume:, timestamp=1430955141440, value=71820387
Get a Row
• You can use the row key to access an individual row. In the stock
price database, the date is the row key.
• hbase(main):008:0> get 'apple', '6-May-15'
COLUMN CELL
price:close timestamp=1430955128359, value=125.01
price:high timestamp=1430955126024, value=126.75
price:low timestamp=1430955126053, value=123.36
price:open timestamp=1430955125977, value=126.56
volume: timestamp=1430955141440, value=71820387
5 row(s) in 0.0130 seconds
Get Table Cells
• A single cell can be accessed using
the get command and the COLUMN option:
• hbase(main):013:0> get 'apple', '5-May-15', {COLUMN => 'price:low'}
COLUMN CELL
price:low timestamp=1431020767444, value=125.78
1 row(s) in 0.0080 seconds
• In a similar fashion, multiple columns can be
accessed as follows:
• hbase(main):012:0> get 'apple', '5-May-15', {COLUMN => ['price:low', 'price:high']}
COLUMN CELL
price:high timestamp=1431020767444, value=128.45
price:low timestamp=1431020767444, value=125.78
2 row(s) in 0.0070 seconds
Delete a Cell
• A specific cell can be deleted using the following command:
• hbase(main):009:0> delete 'apple', '6-May-15' , 'price:low'
• If the row is inspected using get, the price:low cell is not listed.
• hbase(main):010:0> get 'apple', '6-May-15'
COLUMN CELL
price:close timestamp=1430955128359, value=125.01
price:high timestamp=1430955126024, value=126.75
price:open timestamp=1430955125977, value=126.56
volume: timestamp=1430955141440, value=71820387
4 row(s) in 0.0130 seconds
Delete a Row
• You can delete an entire row by giving
the deleteall command as follows:
• hbase(main):009:0> deleteall 'apple', '6-May-15'
Remove a Table
• To remove (drop) a table, you must first disable it.
The following two commands remove
the apple table from HBase:
• hbase(main):009:0> disable 'apple'
hbase(main):010:0> drop 'apple'
Scripting Input
• Commands to the HBase shell can be placed in bash
scripts for automated processing. For instance, the
following can be placed in a bash script:
• echo "put 'apple','6-May-15','price:open','126.56'" |
hbase shell
• A script (input_to_hbase.sh) imports the Apple-stock.csv
file into HBase using this method.
• It also removes the column titles in the first line.
• The script will load the entire file into HBase when you
issue the following command:
• $ input_to_hbase.sh Apple-stock.csv
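The contents of input_to_hbase.sh are not shown here, but the idea of turning CSV rows into put commands can be sketched in Python. The column order Date, Open, High, Low, Close, Volume is an assumption of this sketch, as is the function name; the sample row reuses the values from the put commands shown earlier.

```python
import csv
import io

# Assumed mapping of CSV value columns (after the date) to HBase columns.
HBASE_COLUMNS = ["price:open", "price:high", "price:low", "price:close", "volume"]


def put_commands(csv_text, table="apple"):
    """Build one HBase shell put command per cell, skipping the title line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # drop the column titles in the first line
    commands = []
    for row in reader:
        row_key, values = row[0], row[1:]
        for column, value in zip(HBASE_COLUMNS, values):
            commands.append(f"put '{table}','{row_key}','{column}','{value}'")
    return commands


SAMPLE_CSV = (
    "Date,Open,High,Low,Close,Volume\n"
    "6-May-15,126.56,126.75,123.36,125.01,71820387\n"
)

commands = put_commands(SAMPLE_CSV)
for command in commands:
    print(command)
```

Each generated line could then be piped to hbase shell, as in the echo example above.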
Adding Data in Bulk
• First, convert the comma-separated file to a tab-separated file and copy it to HDFS:
• $ convert-to-tsv.sh Apple-stock.csv
• $ hdfs dfs -put Apple-stock.tsv /tmp
• Finally, ImportTsv is run using the following command line.
• Note the column designation in the -Dimporttsv.columns option.
• In the example, the HBASE_ROW_KEY is set as the first column,
that is, the date for the data.
• $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY
• The ImportTsv command will use MapReduce to load the data into
HBase.
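The comma-to-tab conversion step could look like the following Python sketch. The actual convert-to-tsv.sh script is not shown in this module, so the behavior here, including dropping the title line by default, is an assumption; the function name is hypothetical.

```python
import csv
import io


def csv_to_tsv(csv_text, skip_header=True):
    """Convert comma-separated text to tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    if skip_header:
        next(reader, None)  # assumption: the title line is not wanted in HBase
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


tsv = csv_to_tsv("Date,Open\n6-May-15,126.56\n")
print(tsv, end="")
```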
Apache HBase Web Interface
MANAGE HADOOP WORKFLOWS
WITH APACHE OOZIE
• Oozie is a workflow director system designed to run
and manage multiple related Apache Hadoop jobs.
• For instance, complete data input and analysis may
require several discrete Hadoop jobs to be run as a
workflow in which the output of one job serves as the
input for a successive job.
• Oozie is designed to construct and manage these
workflows.
• Oozie is not a substitute for the YARN scheduler.
• That is, YARN manages resources for individual Hadoop
jobs, and Oozie provides a way to connect and control
Hadoop jobs on the cluster
• Oozie workflow jobs are represented as directed acyclic
graphs (DAGs) of actions. (DAGs are basically graphs that
cannot have directed loops.)
• Three types of Oozie jobs are permitted:
• Workflow—a specified sequence of Hadoop jobs with
outcome-based decision points and control dependency.
– Progress from one action to another cannot happen until the
first action is complete.
• Coordinator—a scheduled workflow job that can run at
various time intervals or when data become available.
• Bundle—a higher-level Oozie abstraction that will batch a
set of coordinator jobs.
• Oozie is integrated with the rest of the Hadoop stack,
supporting several types of Hadoop jobs out of the box
(e.g., Java MapReduce, Streaming MapReduce, Pig, Hive, and Sqoop)
as well as system-specific jobs (e.g., Java programs and
shell scripts).
• Oozie also provides a CLI and a web UI for monitoring jobs.
Workflow
• Oozie workflow definitions are written in hPDL (an XML Process Definition
Language). Such workflows contain several types of nodes:
• Control flow nodes define the beginning and the end of a workflow. They
include start, end, and optional fail nodes.
• Action nodes are where the actual processing tasks are defined. When an
action node finishes, the remote systems notify Oozie and the next node
in the workflow is executed. Action nodes can also include HDFS
commands.
• Fork/join nodes enable parallel execution of tasks in the workflow. The
fork node enables two or more tasks to run at the same time. A join node
represents a rendezvous point that must wait until all forked tasks
complete.
• Decision nodes enable decisions to be made about the previous task.
Control decisions are based on the results of the previous action (e.g., file
size or file existence). Decision nodes are essentially switch-case
statements that use JSP EL (Java Server Pages Expression Language)
predicates that evaluate to either true or false.
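The DAG structure behind a workflow with fork and join nodes can be illustrated with Python's standard graphlib module (Python 3.9 or later). This is not hPDL and not how Oozie is implemented; the node names are hypothetical, and the sketch only shows that a valid workflow admits an execution order while a workflow with a directed loop does not.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each node maps to the nodes that must finish first.
WORKFLOW = {
    "fork": {"start"},
    "pig-job": {"fork"},
    "hive-job": {"fork"},
    "join": {"pig-job", "hive-job"},  # waits for all forked tasks
    "end": {"join"},
}


def execution_order(workflow):
    """Return one order in which the nodes can run."""
    return list(TopologicalSorter(workflow).static_order())


def is_valid_workflow(workflow):
    """A workflow is valid only if it contains no directed loop."""
    try:
        execution_order(workflow)
        return True
    except CycleError:
        return False


order = execution_order(WORKFLOW)
print(order)
print(is_valid_workflow({"a": {"b"}, "b": {"a"}}))
```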
Oozie Example Walk-Through
• Step 1: Download Oozie Examples
• For HDP 2.1, the following command can be used to extract the files
into the working directory used for the demo:
• $ tar xvzf /usr/share/doc/oozie-4.0.0.2.1.2.1/oozie-examples.tar.gz
• For HDP 2.2, the following command will extract the files:
• $ tar xvzf /usr/hdp/2.2.4.2-2/oozie/doc/oozie-examples.tar.gz
• Once extracted, rename the examples directory to oozie-examples
so that you will not confuse it with the other examples
directories.
• $ mv examples oozie-examples
• The examples must also be placed in HDFS. Enter the following
command to copy the example files into HDFS:
• $ hdfs dfs -put oozie-examples/ oozie-examples
• The example applications are found under
the oozie-examples/apps directory, one directory
per example.
• Each directory contains at
least workflow.xml and job.properties files.
• Other files needed for each example are also in
its directory.
• The inputs for all examples are in the
oozie-examples/input-data directory.
• The examples will create output under
the examples/output-data directory in HDFS.
• Move to the simple MapReduce example directory:
• $ cd oozie-examples/apps/map-reduce/
• This directory contains two files and a lib directory. The
files are:
• The job.properties file defines parameters (e.g., path
names, ports) for a job. This file may change per job.
• The workflow.xml file provides the actual workflow for
the job. In this case, it is a simple MapReduce
(pass/fail). This file usually stays the same between
jobs.
Run the Simple MapReduce Example
• The job.properties file included in the examples requires a few edits to work properly.
• Using a text editor, change the following lines by adding the host name of the
NameNode and ResourceManager (indicated by jobTracker in the file).
• nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
• to the following (note the port change for jobTracker):
• nameNode=hdfs://_HOSTNAME_:8020
jobTracker=_HOSTNAME_:8050
• The examplesRoot variable must also be changed to oozie-examples, reflecting the
change made previously:
• examplesRoot=oozie-examples
• These changes must be made in all the job.properties files of the Oozie examples
that you choose to run.

• For example, for the cluster created with Ambari in Chapter 2, the lines were changed
to
• nameNode=hdfs://limulus:8020
jobTracker=limulus:8050
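• The edits described above can also be scripted. The following is a minimal sketch, not part of the Oozie distribution: the function name and the sample file contents are assumptions for illustration, and the ports are the ones quoted in the text.

```python
def update_properties(text, overrides):
    """Return job.properties text with the given keys replaced.

    Lines that are not key=value pairs for an overridden key pass through
    unchanged; keys missing from the file are appended at the end.
    """
    seen = set()
    out = []
    for line in text.splitlines():
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and key in overrides:
            out.append(f"{key}={overrides[key]}")
            seen.add(key)
        else:
            out.append(line)
    for key, value in overrides.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Sample contents, assumed for this sketch.
original = """nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
examplesRoot=examples
"""

host = "limulus"  # replace with your NameNode/ResourceManager host name
updated = update_properties(original, {
    "nameNode": f"hdfs://{host}:8020",
    "jobTracker": f"{host}:8050",
    "examplesRoot": "oozie-examples",
})
print(updated)
```

• Lines that are not overridden (here, queueName) are left as they were, so the same helper can be applied to each example's job.properties file.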
• The DAG for the simple MapReduce example is shown
in Figure 7.6. The workflow.xml file describes these simple
steps as workflow nodes.
• To run the Oozie MapReduce example job from
the oozie-examples/apps/map-reduce directory, enter
the following line:
• $ oozie job -run -oozie http://limulus:11000/oozie -config job.properties
• When Oozie accepts the job, a job ID will be printed:
• job: 0000001-150424174853048-oozie-oozi-W
• You will need to change the “limulus” host name to
match the name of the node running your Oozie server.
The job ID can be used to track and control job
progress.
• When trying to run Oozie, you may get the puzzling error:
• oozie is not allowed to impersonate oozie
• If you receive this message, make sure the following is defined in the core-site.xml
file:
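• The settings in question are the Hadoop proxy-user properties for the oozie account (hadoop.proxyuser.oozie.hosts and hadoop.proxyuser.oozie.groups). The sketch below renders them as property elements that can be pasted into core-site.xml; the helper name is hypothetical, and the wildcard values are a permissive assumption that should be narrowed on a shared cluster.

```python
import xml.etree.ElementTree as ET


def hadoop_properties(props):
    """Render name/value pairs as Hadoop-style <property> elements."""
    chunks = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        ET.indent(prop)  # pretty-print; requires Python 3.9 or later
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


proxyuser = {
    # "*" allows any host/group; this is an assumption for a test cluster.
    "hadoop.proxyuser.oozie.hosts": "*",
    "hadoop.proxyuser.oozie.groups": "*",
}
snippet = hadoop_properties(proxyuser)
print(snippet)
```

• After changing core-site.xml, the affected Hadoop services typically need a restart before the new values take effect.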
• To avoid having to provide the -oozie option with the
Oozie URL every time you run the oozie command, set
the OOZIE_URL environment variable as follows (using
your Oozie server host name in place of “limulus”):
• $ export OOZIE_URL="http://limulus:11000/oozie"
• You can now run all subsequent Oozie commands
without specifying the -oozie URL option. For instance,
using the job ID, you can learn about a particular job’s
progress by issuing the following command:
• $ oozie job -info 0000001-150424174853048-oozie-oozi-W
• The resulting output (line length compressed) is shown in the following listing.
• Because this job is just a simple test, it may be complete by the time you issue the
-info command.
• If it is not complete, its progress will be indicated in the listing.
Step 3: Run the Oozie Demo Application
• A more sophisticated example can be found in the
demo directory (oozie-examples/apps/demo).
• This workflow includes MapReduce, Pig, and file
system tasks as well as fork, join, decision, action, start,
stop, kill, and end nodes.
• Move to the demo directory and edit
the job.properties file as described previously.
• Entering the following command runs the workflow
(assuming the OOZIE_URL environment variable has
been set):
• $ oozie job -run -config job.properties
Web GUI for OOZIE
• $ firefox http://limulus:11000/oozie/
Short Summary of OOZIE Commands
• Run a workflow job (returns _OOZIE_JOB_ID_):
• $ oozie job -run -config JOB_PROPERTIES
• Submit a workflow job (returns _OOZIE_JOB_ID_ but does not start):
• $ oozie job -submit -config JOB_PROPERTIES
• Start a submitted job:
• $ oozie job -start _OOZIE_JOB_ID_
• Check a job’s status:
• $ oozie job -info _OOZIE_JOB_ID_
• Suspend a workflow:
• $ oozie job -suspend _OOZIE_JOB_ID_
• Resume a workflow:
• $ oozie job -resume _OOZIE_JOB_ID_
• Rerun a workflow:
• $ oozie job -rerun _OOZIE_JOB_ID_ -config JOB_PROPERTIES
• Kill a job:
• $ oozie job -kill _OOZIE_JOB_ID_
• View a job's log:
• $ oozie job -log _OOZIE_JOB_ID_
• Full logs are available at /var/log/oozie on the Oozie server.
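• When these commands are driven from a script, it can help to build the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of Oozie; it only assembles the argument vector and does not run anything.

```python
CONFIG_ACTIONS = {"run", "submit", "rerun"}
# The Oozie CLI spells the log option "-log".
JOB_ID_ACTIONS = {"start", "info", "suspend", "resume", "rerun", "kill", "log"}


def oozie_job_argv(action, job_id=None, config=None, oozie_url=None):
    """Build the argv list for one 'oozie job' invocation."""
    if action not in CONFIG_ACTIONS | JOB_ID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    argv = ["oozie", "job", f"-{action}"]
    if action in JOB_ID_ACTIONS:
        if job_id is None:
            raise ValueError(f"-{action} needs a job ID")
        argv.append(job_id)
    if oozie_url is not None:
        # Not needed when the OOZIE_URL environment variable is set.
        argv += ["-oozie", oozie_url]
    if action in CONFIG_ACTIONS:
        if config is None:
            raise ValueError(f"-{action} needs a properties file")
        argv += ["-config", config]
    return argv


run_cmd = oozie_job_argv("run", config="job.properties",
                         oozie_url="http://limulus:11000/oozie")
info_cmd = oozie_job_argv("info", job_id="0000001-150424174853048-oozie-oozi-W")
print(" ".join(run_cmd))
print(" ".join(info_cmd))
```

• The resulting list can be passed to subprocess.run on a machine where the oozie client is installed.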
Hadoop2 - YARN
Hadoop 1 Framework
Motivation for Hadoop 2
YARN Architecture
YARN Components
WordCount Example in YARN
YARN For Distributed Applications
YARN Distributed Shell
• The introduction of Hadoop version 2 has
drastically increased the number and scope of
new applications.
• By splitting the version 1 monolithic MapReduce
engine into two parts, a scheduler and the
MapReduce framework, Hadoop has become a
general-purpose large-scale data analytics
platform.
• A simple example of a non-MapReduce Hadoop
application is the YARN Distributed-Shell
described in this chapter.
YARN DISTRIBUTED-SHELL
• The Hadoop YARN project includes the
Distributed-Shell application, which is an example
of a Hadoop non-MapReduce application built on
top of YARN.
• Distributed-Shell is a simple mechanism for
running shell commands and scripts in containers
on multiple nodes in a Hadoop cluster.
• This application is not meant to be a production
administration tool, but rather a demonstration
of the non-MapReduce capability that can be
implemented on top of YARN.
USING THE YARN DISTRIBUTED-SHELL
• For the purpose of the examples presented in the remainder of this
chapter, we assume the following installation path, based on
Hortonworks HDP 2.2, for the Distributed-Shell application:
• $ export YARN_DS=/usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
• For the pseudo-distributed install using Apache Hadoop version
2.6.0, the following path will run the Distributed-Shell application
(assuming $HADOOP_HOME is defined to reflect the location of
Hadoop):

• $ export YARN_DS=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar
• If another distribution is used, search for the file
hadoop-yarn-applications-distributedshell*.jar and set $YARN_DS based on its
location.
• Distributed-Shell exposes various options that
can be found by running the following
command:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -help
• The output of this command follows; we will
explore some of these options in the examples
illustrated in this chapter.
• A Simple Example
• The simplest use-case for the Distributed-Shell application is to run an arbitrary
shell command in a container.
• We will demonstrate the use of the uptime command as an example.
• This command is run on the cluster using Distributed-Shell as follows:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar
$YARN_DS -shell_command uptime
• By default, Distributed-Shell spawns only one instance of a given shell command.
• When this command is run, you can see progress messages on the screen but
nothing about the actual shell command.
• If the shell command succeeds, the following should appear at the end of the
output:
• 15/05/27 14:48:53 INFO distributedshell.Client: Application completed successfully
• If the shell command did not work for whatever reason, the following message will
be displayed:
• 15/05/27 14:58:42 ERROR distributedshell.Client: Application failed to complete
successfully
UPTIME Command - Output
• The next step is to examine the output for the application.
• Distributed-Shell redirects the output of the individual shell
commands run on the cluster nodes into the log files,
• which are found either on the individual nodes or aggregated onto
HDFS,
– depending on whether log aggregation is enabled.
• Assuming log aggregation is enabled,
– the results for each instance of the command can be found by using
the yarn logs command.
• For the previous uptime example, the following command can be
used to inspect the logs:
• $ yarn logs -applicationId application_1432831236474_0001
• The applicationId can be found from the program output or by
using the yarn application command
• Notice that there are two containers.
• The first container (con..._000001) is the
ApplicationMaster for the job.
• The second container (con..._000002) is the actual shell
script.
• The output for the uptime command is located in the second
container's stdout after the Log Contents: label.
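• Container IDs encode the application they belong to. The sketch below parses an ID of the form used in this chapter; the field layout (cluster timestamp, application number, attempt number, container number) is an assumption based on those IDs, and newer Hadoop releases may add further fields. Treating container number 1 as the ApplicationMaster is a heuristic that mirrors the observation above.

```python
import re

# Assumed layout: container_<clusterTimestamp>_<appNumber>_<attempt>_<containerNumber>
CONTAINER_RE = re.compile(r"^container_(\d+)_(\d+)_(\d+)_(\d+)$")


def parse_container_id(container_id):
    """Split a container ID into its application ID and numeric fields."""
    match = CONTAINER_RE.match(container_id)
    if match is None:
        raise ValueError(f"unrecognized container ID: {container_id}")
    timestamp, app_number, attempt, number = match.groups()
    return {
        "application_id": f"application_{timestamp}_{app_number}",
        "attempt": int(attempt),
        "container": int(number),
        # Heuristic: the first container of an attempt hosts the ApplicationMaster.
        "is_application_master": int(number) == 1,
    }


info = parse_container_id("container_1432831236474_0003_01_000002")
print(info["application_id"])
print(info["is_application_master"])
```

• The derived application ID is the value expected by the yarn logs -applicationId command.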
Using More Containers
• Distributed-Shell can run commands to be executed on any
number of containers by way of the
– -num_containers argument.
• For example, to see on which nodes the Distributed-Shell
command was run, the following command can be used:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -shell_command hostname -num_containers 4
• If we now examine the results for this job,
– there will be five containers in the log.
• The four command containers (2 through 5) will print the
name of the node on which the container was run.
Distributed-Shell Examples with Shell
Arguments
• Arguments can be added to the shell command using the
-shell_args option. For example, to run ls -l in the directory from where
the shell command was run, we can use the following command:
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar
$YARN_DS -shell_command ls -shell_args -l
• The resulting output from the log file is as follows:
• total 20
-rw-r--r-- 1 yarn hadoop 74 May 28 10:37 container_tokens
-rwx------ 1 yarn hadoop 643 May 28 10:37 default_container_executor_session.sh
-rwx------ 1 yarn hadoop 697 May 28 10:37 default_container_executor.sh
-rwx------ 1 yarn hadoop 1700 May 28 10:37 launch_container.sh
drwx--x--- 2 yarn hadoop 4096 May 28 10:37 tmp
• $ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -shell_command cat -shell_args launch_container.sh
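• The Distributed-Shell invocations shown so far differ only in a few options. The following sketch assembles the command line from those options; the function names and the jar path are placeholders chosen for illustration, and nothing is executed.

```python
import shlex

CLIENT = "org.apache.hadoop.yarn.applications.distributedshell.Client"


def distributed_shell_argv(jar, shell_command, shell_args=None, num_containers=None):
    """Build the argv list for one Distributed-Shell run."""
    argv = ["yarn", CLIENT, "-jar", jar, "-shell_command", shell_command]
    if shell_args is not None:
        argv += ["-shell_args", shell_args]
    if num_containers is not None:
        if num_containers < 1:
            raise ValueError("num_containers must be at least 1")
        argv += ["-num_containers", str(num_containers)]
    return argv


def expected_log_containers(num_containers=1):
    """Command containers plus the one ApplicationMaster container."""
    return num_containers + 1


# "/path/to/distributedshell.jar" stands in for the value of $YARN_DS.
argv = distributed_shell_argv("/path/to/distributedshell.jar", "hostname",
                              num_containers=4)
print(shlex.join(argv))
print(expected_log_containers(4))  # 5
```

• The second helper restates the container count seen in the logs: one ApplicationMaster container plus one container per command instance.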
• When we explore further by giving a pwd command for Distributed-Shell,
– the following directory is listed; it is created on the node that ran the
shell command:
• /hdfs2/hadoop/yarn/local/usercache/hdfs/appcache/application_1432831236474_0003/container_1432831236474_0003_01_000002/
• Searching for this directory will prove to be problematic because
these transient files are used by YARN to run the Distributed-Shell
application
• and are removed once the application finishes.
• You can preserve these files for a specific interval by adding the
– following lines to the yarn-site.xml configuration file and restarting
YARN:
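• The NodeManager property that delays cleanup of container directories is yarn.nodemanager.delete.debug-delay-sec. The sketch below prints a property element for it; the helper name is hypothetical and the ten-minute interval is only an example value.

```python
from xml.sax.saxutils import escape


def property_xml(name, value):
    """Format one Hadoop configuration property as XML text."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


# Example interval (an assumption): keep container files for ten minutes.
delay_seconds = 10 * 60
snippet = property_xml("yarn.nodemanager.delete.debug-delay-sec", delay_seconds)
print(snippet)
```

• A long delay keeps more files on the local disks of every node, so this setting is best used only while debugging.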
STRUCTURE OF YARN APPLICATIONS
• The YARN ResourceManager runs as a scheduling daemon on a dedicated machine and
acts as the central authority for allocating resources to the various competing
applications in the cluster.
• The ResourceManager has a central and global view of all cluster resources and,
therefore, can ensure fairness, capacity, and locality are shared across all users.
• Depending on the application demand, scheduling priorities, and resource
availability, the ResourceManager dynamically allocates resource containers to
applications to run on particular nodes.
• A container is a logical bundle of resources (e.g., memory, cores) bound to a
particular cluster node.
• To enforce and track such assignments, the ResourceManager interacts with a
special system daemon running on each node called the NodeManager.
• Communications between the ResourceManager and NodeManagers are
heartbeat based for scalability.
• NodeManagers are responsible for local monitoring of resource availability, fault
reporting, and container life-cycle management (e.g., starting and killing jobs).
• The ResourceManager depends on the NodeManagers for its “global view” of the
cluster.
• User applications are submitted to the ResourceManager via a
public protocol and go through an admission control phase during
which security credentials are validated and various operational and
administrative checks are performed.
• Those applications that are accepted pass to the scheduler and are
allowed to run.
• Once the scheduler has enough resources to satisfy the request, the
application is moved from an accepted state to a running state.
• Aside from internal bookkeeping,
– this process involves allocating a container for the single
ApplicationMaster and spawning it on a node in the cluster.
• Often called container 0, the ApplicationMaster does not have any
additional resources at this point,
– but rather must request additional resources from the
ResourceManager.
• The ApplicationMaster is the “master” user
job that manages all application life-cycle
aspects,
– including dynamically increasing and decreasing
resource consumption (i.e., containers),
– managing the flow of execution (e.g., in case of
MapReduce jobs, running reducers against the
output of maps),
– handling faults and computation skew, and
performing other local optimizations.
• Typically, an ApplicationMaster will need to harness the processing power of
multiple servers to complete a job.
• To achieve this, the ApplicationMaster issues resource requests to the
ResourceManager.
• The form of these requests includes specification of locality preferences (e.g., to
accommodate HDFS use) and properties of the containers.
• The ResourceManager will attempt to satisfy the resource requests coming from
each application according to availability and scheduling policies.
• When a resource is scheduled on behalf of an ApplicationMaster, the
ResourceManager generates a lease for the resource, which is acquired by a
subsequent ApplicationMaster heartbeat.
• The ApplicationMaster then works with the NodeManagers to start the resource.
• A token-based security mechanism guarantees its authenticity when the
ApplicationMaster presents the container lease to the NodeManager.
• In a typical situation, running containers will communicate with the
ApplicationMaster through an application-specific protocol to report status and
health information and to receive framework-specific commands.
• Figure 8.1 illustrates the relationship between the
application and YARN components.
• The YARN components appear as the large outer boxes
(ResourceManager and NodeManagers), and
• the two applications appear as smaller boxes
(containers), one dark and one light.
• Each application uses a different ApplicationMaster;
• the darker client is running a Message Passing Interface
(MPI) application and
• the lighter client is running a traditional MapReduce
application.
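• The request-and-lease flow described in this section can be illustrated with a toy model. Everything below is a simplification written for this sketch: the class and method names are invented, only memory is tracked, and none of it corresponds to the real YARN client APIs or scheduling policies.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ToyResourceManager:
    """Grants container leases from a global view of node capacity."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.next_id = 1

    def allocate(self, app, memory_mb, count, preferred=()):
        """Return up to `count` (container, node) leases."""
        # Preferred nodes come first (locality), then the rest by free memory.
        others = sorted(
            (n for n in self.nodes.values() if n.name not in preferred),
            key=lambda n: -n.free_mb,
        )
        order = [self.nodes[p] for p in preferred if p in self.nodes] + others
        leases = []
        for node in order:
            while len(leases) < count and node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                container = f"{app}_c{self.next_id:06d}"
                self.next_id += 1
                node.containers.append(container)
                leases.append((container, node.name))
        return leases


rm = ToyResourceManager([Node("n0", 4096), Node("n1", 2048), Node("n2", 8192)])
# The ApplicationMaster occupies the first container of the application...
am = rm.allocate("app1", 1024, 1)
# ...and then requests worker containers, preferring the node holding its data.
workers = rm.allocate("app1", 2048, 4, preferred=("n1",))
print(am)
print(workers)
```

• In the toy run, the locality preference is honored for one container on n1; the remaining requests spill over to the node with the most free memory.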
SEMINAR TOPIC
YARN APPLICATION FRAMEWORKS
Managing Hadoop with Apache Ambari
• In This Chapter:
• A tour of the Apache Ambari graphical
management tool is provided.
• The procedure for restarting a stopped
Hadoop service is explained.
• The procedure for changing Hadoop
properties and tracking configurations is
presented.
• Managing a Hadoop installation by hand can be
tedious and time consuming.
• In addition to keeping configuration files
synchronized across a cluster, starting, stopping,
and restarting Hadoop services and dependent
services in the right order is not a simple task.
• The Apache Ambari graphical management tool is
designed to help you easily manage these and
other Hadoop administrative issues.
• This chapter provides some basic navigation and
usage scenarios for Apache Ambari.
• Along with being an installation tool, Ambari can be
used as a centralized point of administration for a
Hadoop cluster.
• Using Ambari, the user can configure cluster services,
monitor the status of cluster hosts (nodes) or services,
visualize hotspots by service metric, start or stop
services, and add new hosts to the cluster.
• All of these features infuse a high level of agility into
the processes of managing and monitoring a
distributed computing environment.
• Ambari also attempts to provide real-time reporting of
important metrics.
QUICK TOUR OF APACHE AMBARI
• $ firefox localhost:8080
• The default login and password
are admin and admin, respectively.
• The dashboard view provides a number of high-
level metrics for many of the installed services.
• A glance at the dashboard should allow you to
get a sense of how the cluster is performing.
• Two of the services managed by Ambari are
Nagios and Ganglia;
– these standard cluster management services are installed by
Ambari and provide cluster monitoring
(Nagios) and metrics (Ganglia).
Dashboard View
• The Dashboard view provides small status widgets for many of the services
running on the cluster.
• The actual services are listed on the left-side vertical menu.
• The widgets correspond to the installed services and can be moved, edited, removed, or added as follows:
• Moving: Click and hold a widget while it is moved about the grid.
• Edit: Place the mouse on the widget and click the gray edit symbol in the upper-
right corner of the widget. You can change several different aspects (including
thresholds) of the widget.
• Remove: Place the mouse on the widget and click the X in the upper-left corner.
• Add: Click the small triangle next to the Metrics tab and select Add. The available
widgets will be displayed. Select the widgets you want to add and click Apply.
• Some widgets provide additional information when you move the mouse over
them.
• For instance, the DataNodes widget displays the number of live, dead, and
decommissioning hosts.
• Clicking directly on a graph widget provides an enlarged view. For instance, Figure
9.2 provides a detailed view of the CPU Usage widget from Figure 9.1.
HeatMap
• The Dashboard view also includes a heatmap view of
the cluster.
• Cluster heatmaps physically map selected metrics
across the cluster.
• When you click the Heatmaps tab, a heatmap for the
cluster will be displayed.
• To select the metric used for the heatmap, choose the
desired option from the Select Metric pull-down menu.
• Note that the scale and color ranges are different for
each metric. The heatmap for percentage host memory
used is displayed in Figure 9.3.
Configuration
• Configuration history is the final tab in the
dashboard window.
• This view provides a list of configuration
changes made to the cluster.
• As shown in Figure 9.4, Ambari enables
configurations to be sorted by Service,
Configuration Group, Date, and Author.
• To find the specific configuration settings, click
the service name.
Services View
• The Services menu provides a detailed look at
each service running on the cluster.
• It also provides a graphical method for
configuring each service (i.e., instead of hand-editing
the /etc/hadoop/conf XML files).
• The summary tab provides a current Summary
view of important service metrics and an
Alerts and Health Checks sub-window.
• Similar to the Dashboard view, the currently installed services are listed on
the left-side menu.
• To select a service, click the service name in the menu.
• When applicable, each service will have its own Summary, Alerts and
Health Monitoring, and Service Metrics windows.
• For example, Figure 9.5 shows the Service view for HDFS.
• Important information such as the status of NameNode,
SecondaryNameNode, DataNodes, uptime, and available disk space is
displayed in the Summary window.
• The Alerts and Health Checks window provides the latest status of the
service and its component systems.
• Finally, several important real-time service metrics are displayed as
widgets at the bottom of the screen.
• As on the dashboard, these widgets can be expanded to display a more
detailed view.
• Clicking the Configs tab will open an options
form, shown in Figure 9.6, for the service.
• The options (properties) are the same ones
that are set in the Hadoop XML files.
• When using Ambari, the user has complete
control over the XML files and should manage
them only through the Ambari interface—that
is, the user should not edit the files by hand.
Hosts View
• Selecting the Hosts menu item provides the
information shown in Figure 9.7. The host
name, IP address, number of cores, memory,
disk usage, current load average, and Hadoop
components are listed in this window in
tabular form.
• To display the Hadoop components installed
on each host, click the links in the rightmost
columns.
• You can also add new hosts by using the
Actions pull-down menu.
• The new host must be running the Ambari
agent
• Further details for a particular host can be found by clicking the
host name in the left column.
• As shown in Figure 9.8, the individual host view provides three sub-
windows: Components, Host Metrics, and Summary information.
• The Components window lists the services that are currently
running on the host.
• Each service can be stopped, restarted, decommissioned, or placed
in maintenance mode.
• The Metrics window displays widgets that provide important
metrics (e.g., CPU, memory, disk, and network usage).
• Clicking the widget displays a larger version of the graphic.
• The Summary window provides basic information about the host,
including the last time a heartbeat was received.
ADMIN VIEW
• The Administration (Admin) view provides three options.
• The first, as shown in Figure 9.9, displays a list of installed
software.
• This Repositories listing generally reflects the version of
Hortonworks Data Platform (HDP) used during the
installation process.
• The Service Accounts option lists the service accounts
added when the system was installed.
• These accounts are used to run various services and tests
for Ambari.
• The third option, Security, sets the security on the cluster.
Views View
• Ambari Views is a framework offering a systematic way to plug in
user interface capabilities that provide for custom visualization,
management, and monitoring features in Ambari.
• Views allows you to extend and customize Ambari to meet your
specific needs.
• Admin Pull-Down Menu
• The Administrative (Admin) pull-down menu provides the following
options:
• About: Provides the current version of Ambari.
• Manage Ambari: Opens the management screen where Users,
Groups, Permissions, and Ambari Views can be created and
configured.
• Settings: Provides the option to turn off the progress window.
• Sign Out: Exits the interface.
MANAGING HADOOP SERVICES
• During the course of normal Hadoop cluster operation, services
may fail for any number of reasons.
• Ambari monitors all of the Hadoop services and reports any service
interruption to the dashboard.
• In addition, when the system was installed, an administrative email
for the Nagios monitoring system was required.
• All service interruption notifications are sent to this email address.
• Figure 9.10 shows the Ambari dashboard reporting a down
DataNode.
• The service error indicator numbers next to the HDFS service and
Hosts menu item indicate this condition.
• The DataNode widget also has turned red and indicates that 3/4
DataNodes are operating.
• Clicking the HDFS service link in the left
vertical menu will bring up the service
summary screen shown in Figure 9.11.
• The Alerts and Health Checks window
confirms that a DataNode is down.
Node Down in Host View
• The specific host (or hosts) with an issue can be
found by examining the Hosts window.
• As shown in Figure 9.12, the status of host n1
has changed from a green dot with a check mark
inside to a yellow dot with a dash inside.
• An orange dot with a question mark inside
indicates the host is not responding and is
probably down.
• Other service interruption indicators may also be
set as a result of the unresponsive node.
After Restart Dashboard View
• Once the DataNode has been restarted
successfully, the dashboard will reflect the
new status (e.g., 4/4 DataNodes are Live).
• As shown in Figure 9.16, all four DataNodes
are now working and the service error
indicators are beginning to slowly disappear.
CHANGING HADOOP PROPERTIES
• One of the challenges of managing a Hadoop cluster is managing
changes to cluster-wide configuration properties.
• In addition to modifying a large number of properties, making
changes to a property often requires restarting daemons (and
dependent daemons) across the entire cluster.
• This process is tedious and time consuming. Fortunately, Ambari
provides an easy way to manage this process.
• As described previously, each service provides a Configs tab that
opens a form displaying all the possible service properties.
• Any service property can be changed (or added) using this interface.
• As an example, the configuration properties for the YARN scheduler
are shown in Figure 9.17.
YARN LOG AGGREGATION PROPERTY
Restart after new Property values
• Once the new property is changed, an orange Restart
button will appear at the top left of the window.
• The new property will not take effect until the required
services are restarted.
• As shown in Figure 9.21, the Restart button provides
two options: Restart All and Restart NodeManagers.
• To be safe, Restart All should be used.
• Note that Restart All does not mean all the Hadoop
services will be restarted;
– rather, only those that use the new property will be
restarted.
Progress of restart
Previous Version of Config
Ambari versioning tool:
• There are several important points to remember about
the Ambari versioning tool:
• Every time you change the configuration, a new
version is created.
– Reverting to a previous version creates a new version.
• You can view or compare a version to other versions
without having to change or restart services. (See the
buttons in the V11 box in Figure 9.26.)
• Each service has its own version record.
• Every time you change the properties, you must
restart the service by using the Restart button.
– When in doubt, restart all services.
Basic Hadoop Administration Procedures
• Hadoop configuration is accomplished through
the use of XML configuration files.
• The basic files and their function are as follows:
• core-default.xml: System-wide properties
• hdfs-default.xml: Hadoop Distributed File System
properties
• mapred-default.xml: Properties for the YARN
MapReduce framework
• yarn-default.xml: YARN properties
BASIC HADOOP YARN ADMINISTRATION
• YARN has several built-in administrative
features and commands.
• The main administration command is yarn
rmadmin (resource manager administration).
• Enter yarn rmadmin -help to learn more about
the various options.
Decommissioning YARN Nodes
• If a NodeManager host/node needs to be
removed from the cluster, it should be
decommissioned first.
• Assuming the node is responding, you can easily
decommission it from the Ambari web UI.
• Simply go to the Hosts view, click on the host, and
select Decommission from the pull-down menu
next to the NodeManager component.
• Note that the host may also be acting as a HDFS
DataNode. Use the Ambari Hosts view to
decommission the HDFS host in a similar fashion.
• By default, the YARN web proxy runs as part of the
ResourceManager itself,
– but it can be configured to run in a stand-alone mode by
adding the configuration property yarn.web-proxy.address to yarn-site.xml.
– (Using Ambari, go to the YARN Configs view, scroll to the bottom, and
select Custom yarn-site.xml/Add property.)
– In stand-alone mode, yarn.web-proxy.principal and
yarn.web-proxy.keytab control the Kerberos principal name and the
corresponding keytab, respectively, for use in secure
mode.
– These elements can be added to the yarn-site.xml if
required.
• Using the JobHistoryServer
• The removal of the JobTracker and migration of
MapReduce from a system to an application-level
framework necessitated creation of a place to
store MapReduce job history.
• The JobHistoryServer provides all YARN
MapReduce applications with a central location in
which to aggregate completed jobs for historical
reference and debugging.
• The settings for the JobHistoryServer can be
found in the mapred-site.xml file.
Managing YARN Jobs
• YARN jobs can be managed using the yarn
application command.
• The following options, including -kill, -list,
and -status, are available to the administrator
with this command.
• MapReduce jobs can also be controlled with
the mapred job command.
• Neither the YARN ResourceManager UI nor
the Ambari UI can be used to kill YARN
applications.
• If a job needs to be killed,
– give the yarn application command to find
the Application ID and then use the -kill argument.
Setting Container Memory
• YARN manages application resource containers over the
entire cluster.
• Controlling the amount of container memory takes place
through three important values in the yarn-site.xml file:
• yarn.nodemanager.resource.memory-mb is the amount of
memory the NodeManager can use for containers.
• yarn.scheduler.minimum-allocation-mb is the smallest container
allowed by the ResourceManager.
– A requested container smaller than this value will result in an
allocated container of this size (default 1024MB).
• yarn.scheduler.maximum-allocation-mb
– is the largest container allowed by the ResourceManager
(default 8192MB).
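• The effect of these three values can be sketched with a few lines of arithmetic. This is a simplified model using the defaults quoted above: the function names are invented, a request above the maximum is simply rejected, and real schedulers may additionally round requests up to a multiple of an allocation increment.

```python
def allocated_container_mb(requested_mb, min_mb=1024, max_mb=8192):
    """Size of the container granted for a request, under the simple model."""
    if requested_mb > max_mb:
        raise ValueError(
            f"request of {requested_mb} MB exceeds the maximum of {max_mb} MB"
        )
    # Requests below the minimum are raised to the minimum allocation.
    return max(requested_mb, min_mb)


def containers_per_node(node_memory_mb, container_mb):
    """How many containers of one size fit in the NodeManager's memory."""
    return node_memory_mb // container_mb


small = allocated_container_mb(512)    # raised to the 1024 MB minimum
medium = allocated_container_mb(2048)  # granted as requested
print(small, medium)
print(containers_per_node(8192, medium))  # 4
```

• Here node_memory_mb plays the role of yarn.nodemanager.resource.memory-mb, so an 8192 MB NodeManager can host four 2048 MB containers.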
Setting Container Cores
• You can set the number of cores for containers using the following
properties in the yarn-site.xml:
• yarn.scheduler.minimum-allocation-vcores:
– The minimum allocation for every container request at the ResourceManager,
in terms of virtual CPU cores.
– Requests smaller than this allocation will not take effect; the container
will be allocated this minimum number of cores instead.
– The default is 1 core.
• yarn.scheduler.maximum-allocation-vcores:
– The maximum allocation for every container request at the ResourceManager,
in terms of virtual CPU cores.
– Requests larger than this allocation will not take effect, and the number of
cores will be capped at this value.
– The default is 32.
• yarn.nodemanager.resource.cpu-vcores:
– The number of CPU cores that can be allocated for containers. The default is 8.
Setting MapReduce Properties
• MapReduce now runs as a YARN application.
• Consequently, it may be necessary to adjust some of the mapred-
site.xml properties as they relate to the map and reduce containers.
• The following properties are used to set some Java arguments and
memory size for both the map and reduce containers:
• mapred.child.java.opts provides a larger or smaller heap size for
child JVMs of maps (e.g., -Xmx2048m).
• mapreduce.map.memory.mb provides a larger or smaller resource
limit for maps (default = 1536MB).
• mapreduce.reduce.memory.mb provides a larger or smaller resource
limit for reduces (default = 3072MB).
• mapreduce.reduce.java.opts provides a larger or smaller heap size
for child JVMs of reduces.
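• The JVM heap must fit inside the container that hosts it, so the -Xmx value is normally set below the container memory limit. The sketch below applies a rule of thumb of roughly 80% heap; that fraction is an assumption for illustration, not a value from this chapter, and mapreduce.map.java.opts is used here as the map-side counterpart of mapreduce.reduce.java.opts.

```python
def java_opts_for(container_mb, heap_fraction=0.8):
    """Return a -Xmx option sized as a fraction of the container memory."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(container_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"


settings = {
    "mapreduce.map.memory.mb": 1536,
    "mapreduce.map.java.opts": java_opts_for(1536),
    "mapreduce.reduce.memory.mb": 3072,
    "mapreduce.reduce.java.opts": java_opts_for(3072),
}
for name, value in settings.items():
    print(f"{name} = {value}")
```

• The gap between the heap size and the container limit leaves room for JVM overhead outside the heap.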
BASIC HDFS ADMINISTRATION
• $ firefox localhost:50070
Adding Users to HDFS
• To quickly create user accounts manually on a
Linux-based system, perform the following
steps:
• 1. Add the user to the group for your
operating system on the HDFS client system.
• In most cases, the groupname should be that
of the HDFS superuser,
• which is often hadoop or hdfs.
• 2. Create the username directory in HDFS.
• hdfs dfs -mkdir /user/<username>
• 3. Give that account ownership over its
directory in HDFS.
• hdfs dfs -chown <username>:<groupname> /user/<username>
Perform an FSCK on HDFS
• To check the health of HDFS, you can issue
the hdfs fsck <path> (file system check)
command.
• The entire HDFS namespace can be checked,
– or a subdirectory can be entered as an argument
to the command.
• The following example checks the entire HDFS
namespace:
• $ hdfs fsck /
Fsck options
• -move moves corrupted files to /lost+found.
• -delete deletes corrupted files.
• -files prints out files being checked.
• -openforwrite prints out files opened for writes during check.
• -includeSnapshots includes snapshot data if the given path indicates a
snapshottable directory or there are snapshottable directories
under it.
• -list-corruptfileblocks prints out a list of missing blocks and the files
to which they belong.
• -blocks prints out a block report.
• -locations prints out locations for every block.
• -racks prints out network topology for data-node locations.
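Several of these options are commonly combined in one invocation. The sketch below assembles such a command line and rejects any option not listed above; the helper name fsck_command is hypothetical and nothing is executed.

```python
# Option names copied from the list above.
FSCK_OPTIONS = {
    "-move", "-delete", "-files", "-openforwrite", "-includeSnapshots",
    "-list-corruptfileblocks", "-blocks", "-locations", "-racks",
}


def fsck_command(path, *options):
    """Return an 'hdfs fsck' argument list, validating options against FSCK_OPTIONS."""
    unknown = [opt for opt in options if opt not in FSCK_OPTIONS]
    if unknown:
        raise ValueError(f"unsupported fsck option(s): {unknown}")
    return ["hdfs", "fsck", path, *options]


# Check the whole namespace and list files, blocks, and block locations.
print(" ".join(fsck_command("/", "-files", "-blocks", "-locations")))
```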
Balancing HDFS
• Based on usage patterns and DataNode availability, the number of
data blocks across the DataNodes may become unbalanced.
• To avoid over-utilized DataNodes, the HDFS balancer tool
rebalances data blocks across the available DataNodes.
• Data blocks are moved from over-utilized to under-utilized nodes to
within a certain percent threshold.
• Rebalancing can be done when new DataNodes are added or when
a DataNode is removed from service.
• This step does not create more space in HDFS, but rather improves
efficiency.
• The HDFS superuser must run the balancer. The simplest way to run
the balancer is to enter the following command:
• $ hdfs balancer
Balancing HDFS
• By default, the balancer will continue to rebalance the
nodes until the disk utilization of every DataNode is within
10% of the cluster-wide average utilization.
• The balancer can be stopped, without harming HDFS,
at any time by entering a Ctrl-C.
• Lower or higher thresholds can be set using the
-threshold argument.
• For example, giving the following command sets a 5%
threshold:
• $ hdfs balancer -threshold 5
• The lower the threshold, the longer the balancer will
run.
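To see what the threshold means in practice, the sketch below classifies DataNodes, assuming the threshold is compared against each node's disk utilization relative to the cluster-wide average. The function name classify_datanodes and the three sample nodes are hypothetical; the real balancer also decides which blocks to move, which is not modeled here.

```python
def classify_datanodes(nodes, threshold=10.0):
    """Label each node; nodes maps a node name to (used_gb, capacity_gb)."""
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    average = 100.0 * total_used / total_capacity  # cluster utilization in percent

    labels = {}
    for name, (used, capacity) in nodes.items():
        utilization = 100.0 * used / capacity
        if utilization > average + threshold:
            labels[name] = "over-utilized"
        elif utilization < average - threshold:
            labels[name] = "under-utilized"
        else:
            labels[name] = "balanced"
    return labels


# Hypothetical cluster: average utilization is 50%.
cluster = {"dn1": (900, 1000), "dn2": (500, 1000), "dn3": (100, 1000)}
print(classify_datanodes(cluster))               # default 10% threshold
print(classify_datanodes(cluster, threshold=5))  # stricter 5% threshold
```

A lower threshold leaves fewer nodes inside the balanced band, which is why the balancer runs longer.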
HDFS Safe Mode
• When the NameNode starts,
– it loads the file system state from the fsimage and then applies the edits log
file.
– It then waits for DataNodes to report their blocks.
– During this time, the NameNode stays in a read-only Safe Mode.
– The NameNode leaves Safe Mode automatically after the DataNodes have
reported that most file system blocks are available.
• The administrator can place HDFS in Safe Mode by giving the following command:
• $ hdfs dfsadmin -safemode enter
• Entering the following command turns off Safe Mode:
• $ hdfs dfsadmin -safemode leave
• HDFS may drop into Safe Mode if a major issue arises within the file system (e.g., a
full DataNode).
• The file system will not leave Safe Mode until the situation is resolved.
• To check whether HDFS is in Safe Mode, enter the following command:
• $ hdfs dfsadmin -safemode get
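A script can act on the result of the get command. The sketch below assumes the command prints a line beginning with "Safe mode is ON" or "Safe mode is OFF"; that output format is an assumption, and the helper name safemode_is_on is hypothetical.

```python
def safemode_is_on(dfsadmin_output):
    """Interpret the text printed by 'hdfs dfsadmin -safemode get' (assumed format)."""
    text = dfsadmin_output.strip()
    if text.startswith("Safe mode is ON"):
        return True
    if text.startswith("Safe mode is OFF"):
        return False
    raise ValueError(f"unrecognized safemode output: {text!r}")


# Sample strings stand in for real command output.
print(safemode_is_on("Safe mode is ON"))
print(safemode_is_on("Safe mode is OFF\n"))
```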
HDFS Snapshots
• HDFS snapshots are read-only, point-in-time copies of HDFS.
• Snapshots can be taken on a subtree of the file system or the entire
file system.
• Some common use-cases for snapshots are data backup, protection
against user errors, and disaster recovery.
• Snapshots can be taken on any directory once the directory has
been set as snapshottable.
• A snapshottable directory is able to accommodate 65,536
simultaneous snapshots.
• There is no limit on the number of snapshottable directories.
Administrators may set any directory to be snapshottable, but
nested snapshottable directories are not allowed.
• For example, a directory cannot be set to snapshottable if one of its
ancestors/descendants is a snapshottable directory.
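The nesting rule can be expressed as a simple path check. This is a sketch of the rule as stated above, not of the NameNode's implementation; the function name can_allow_snapshot is hypothetical.

```python
from pathlib import PurePosixPath


def can_allow_snapshot(candidate, snapshottable_dirs):
    """Return False if an ancestor or descendant of candidate is already snapshottable."""
    cand = PurePosixPath(candidate)
    for existing in map(PurePosixPath, snapshottable_dirs):
        if cand == existing:
            continue  # same directory, so nothing is nested
        if existing in cand.parents or cand in existing.parents:
            return False
    return True


existing = ["/user/hdfs/war-and-peace-input"]
print(can_allow_snapshot("/user/hdfs", existing))                          # ancestor of existing
print(can_allow_snapshot("/user/hdfs/war-and-peace-input/ch1", existing))  # descendant of existing
print(can_allow_snapshot("/user/other", existing))                         # unrelated directory
```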
Snapshot
• The following example walks through the
procedure for creating a snapshot.
• The first step is to declare a directory as
“snapshottable” using the following command:
• $ hdfs dfsadmin -allowSnapshot /user/hdfs/war-and-peace-input

Allowing snapshot on /user/hdfs/war-and-peace-input succeeded
Snapshot
• Once the directory has been made snapshottable,
the snapshot can be taken with the following
command.
• The command requires the directory path and a
name for the snapshot, in this case wapi-snap-1.
• $ hdfs dfs -createSnapshot /user/hdfs/war-and-peace-input wapi-snap-1

Created snapshot /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-1
Snapshot
• The path of the snapshot is /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-1.
• The /user/hdfs/war-and-peace-input directory
has one file, as shown by issuing the following
command:
• $ hdfs dfs -ls /user/hdfs/war-and-peace-input/

Found 1 items
-rw-r--r-- 2 hdfs hdfs 3288746 2015-06-24 19:56 /user/hdfs/war-and-peace-input/war-and-peace.txt
Snapshot
• If the file is deleted, it can be restored from the snapshot:
• $ hdfs dfs -rm -skipTrash /user/hdfs/war-and-peace-input/war-and-peace.txt

Deleted /user/hdfs/war-and-peace-input/war-and-peace.txt
• $ hdfs dfs -ls /user/hdfs/war-and-peace-input/
• The restoration process is basically a simple copy from the snapshot
to the previous directory (or anywhere else).
• Note the use of the .snapshot/wapi-snap-1 path inside the snapshottable
directory to restore the file:
• $ hdfs dfs -cp /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-1/war-and-peace.txt /user/hdfs/war-and-peace-input
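The snapshot path follows a fixed pattern: the snapshottable directory, then .snapshot, then the snapshot name. The sketch below builds that path and the matching copy command; snapshot_path and restore_command are hypothetical helper names, and the command is only assembled, not run.

```python
from pathlib import PurePosixPath


def snapshot_path(directory, snapshot_name, relative_file=None):
    """Return <directory>/.snapshot/<snapshot_name>[/<relative_file>]."""
    path = PurePosixPath(directory) / ".snapshot" / snapshot_name
    if relative_file:
        path = path / relative_file
    return str(path)


def restore_command(directory, snapshot_name, file_name):
    """Build the copy command that restores a top-level file into the directory."""
    source = snapshot_path(directory, snapshot_name, file_name)
    return ["hdfs", "dfs", "-cp", source, directory]


print(snapshot_path("/user/hdfs/war-and-peace-input", "wapi-snap-1"))
print(" ".join(restore_command("/user/hdfs/war-and-peace-input",
                               "wapi-snap-1", "war-and-peace.txt")))
```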
Snapshot
• To delete a snapshot, give the following command:
• $ hdfs dfs -deleteSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
• To make a directory “un-snapshottable” (or go back to
the default state), use the following command:
• $ hdfs dfsadmin -disallowSnapshot /user/hdfs/war-and-peace-input

Disallowing snapshot on /user/hdfs/war-and-peace-input succeeded
Configuring an NFSv3 Gateway to HDFS
• HDFS supports an NFS version 3 (NFSv3) gateway.
• This feature enables files to be easily moved between HDFS and client
systems.
• The NFS gateway supports NFSv3 and allows HDFS to be mounted as part
of the client’s local file system.
• Currently the NFSv3 gateway supports the following capabilities:
– Users can browse the HDFS file system through their local file system using an
NFSv3 client-compatible operating system.
– Users can download files from the HDFS file system to their local file system.
– Users can upload files from their local file system directly to the HDFS file
system.
– Users can stream data directly to HDFS through the mount point. File append
is supported, but random write is not supported.
• The gateway must be run on the same host as a DataNode, NameNode, or
any HDFS client.
• Several properties need to be added to
the /etc/hadoop/config/core-site.xml file:
• <property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
• Next, move to the Advanced hdfs-site.xml section
and set the following property:
• <property>
<name>dfs.namenode.accesstime.precision</name>
<value>3600000</value>
</property>
• This property ensures client mounts with access
time updates work properly. (See
the mount default atime option.)
• Finally, move to the Custom hdfs-site section, click the Add Property
link, and add the following property:
• <property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
• The NFSv3 dump directory is needed because the NFS client often
reorders writes.
• Sequential writes can arrive at the NFS gateway in random order.
• This directory is used to temporarily save out-of-order writes before
writing to HDFS.
• Make sure the dump directory has enough space.
• For example, if the application uploads 10 files, each of size 100MB,
it is recommended that this directory have 1GB of space to cover a
worst-case write reorder for every file.
• To confirm the gateway is working, issue the following
command. The output should look like the following:
CAPACITY SCHEDULER
• The Capacity scheduler is the default scheduler for YARN that
enables multiple groups to securely share a large Hadoop cluster.
• Developed by the original Hadoop team at Yahoo!, the Capacity
scheduler has successfully run many of the largest Hadoop clusters.
• To use the Capacity scheduler, one or more queues are configured
with a predetermined fraction of the total slot (or processor)
capacity.
• This assignment guarantees a minimum amount of resources for
each queue.
• Administrators can configure soft limits and optional hard limits on
the capacity allocated to each queue.
• Each queue has strict ACLs (Access Control Lists) that control which
users can submit applications to individual queues.
• Also, safeguards are in place to ensure that users cannot view or
modify applications from other users.
CAPACITY SCHEDULER
• The Capacity scheduler permits sharing a cluster while giving each user or
group certain minimum capacity guarantees.
• These minimum amounts are not given away in the absence of demand
(i.e., a group is always guaranteed that a minimum amount of resources is
available).
• Excess slots are given to the most starved queues, based on the number of
running tasks divided by the queue capacity.
• Thus, the fullest queues as defined by their initial minimum capacity
guarantee get the most needed resources.
• Idle capacity can be assigned and provides elasticity for the users in a cost-
effective manner.
• Administrators can change queue definitions and properties, such as
capacity and ACLs, at run time without disrupting users.
• They can also add more queues at run time, but cannot delete queues at
run time.
• In addition, administrators can stop queues at run time to ensure that
while existing applications run to completion, no new applications can be
submitted.
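The starvation measure mentioned above (running tasks divided by queue capacity) can be illustrated with a toy calculation. This is a simplified sketch, not the scheduler's actual algorithm; the function name most_starved_queue and the sample queues are hypothetical.

```python
def most_starved_queue(queues):
    """Return the queue with the lowest running_tasks / capacity ratio.

    queues maps a queue name to (running_tasks, capacity).
    """
    return min(queues, key=lambda name: queues[name][0] / queues[name][1])


# Hypothetical queues: ratios are 0.60, about 0.33, and 0.75.
queues = {"analytics": (30, 50), "etl": (10, 30), "adhoc": (15, 20)}
print(most_starved_queue(queues))
```

Under this measure the etl queue is using the smallest share of its capacity, so it would be first in line for excess slots.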
CAPACITY SCHEDULER
• The Capacity scheduler currently supports memory-intensive
applications, where an application can optionally specify higher
memory resource requirements than the default.
• Using information from the NodeManagers, the Capacity scheduler
can then place containers on the best-suited nodes.
• The Capacity scheduler works best when the workloads are well
known, which helps in assigning the minimum capacity.
• For this scheduler to work most effectively, each queue should be
assigned a minimal capacity that is less than the maximal expected
workload.
• Within each queue, multiple jobs are scheduled using hierarchical
first-in, first-out (FIFO) queues similar to the approach used with the
stand-alone FIFO scheduler.
• If there are no queues configured, all jobs are placed in the default
queue.
CONFIGURING YARN
• In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so
that processing is not constrained by any one of these cluster resources.
• As a general recommendation, we’ve found that allowing for 1-2
Containers per disk and per core gives the best balance for cluster
utilization.
• So with our example cluster node with 12 disks and 12 cores,
we will allow for a maximum of 20 Containers to be allocated to each node.
• Each machine in our cluster has 48 GB of RAM.
• Some of this RAM should be reserved for Operating System usage.
• On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for
the Operating System.
• The following property sets the maximum memory YARN can utilize on the
node:
Container Memory Config
• In yarn-site.xml

<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>

• The next step is to provide YARN guidance on how to break up the total
resources available into Containers.
• You do this by specifying the minimum unit of RAM to allocate for a
Container.
• We want to allow for a maximum of 20 Containers, and thus need (40 GB
total RAM) / (20 # of Containers) = 2 GB minimum per container:

• In yarn-site.xml
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>

• YARN will allocate Containers with RAM amounts greater than or equal to
the yarn.scheduler.minimum-allocation-mb value.
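The sizing arithmetic above can be reproduced for other node sizes. A minimal sketch, assuming 1 GB = 1024 MB as in the example values; the function name container_plan is hypothetical.

```python
def container_plan(node_ram_gb, os_reserved_gb, max_containers):
    """Derive the two yarn-site.xml memory values from the node's RAM budget."""
    yarn_mb = (node_ram_gb - os_reserved_gb) * 1024   # RAM handed to YARN
    min_allocation_mb = yarn_mb // max_containers     # smallest Container size
    return {
        "yarn.nodemanager.resource.memory-mb": yarn_mb,
        "yarn.scheduler.minimum-allocation-mb": min_allocation_mb,
    }


# Example node from the slides: 48 GB RAM, 8 GB kept for the OS, 20 Containers.
print(container_plan(48, 8, 20))
```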
CONFIGURING MAPREDUCE 2
• MapReduce 2 runs on top of YARN and utilizes YARN Containers to
schedule and execute its map and reduce tasks.
• When configuring MapReduce 2 resource utilization on YARN, there are
three aspects to consider:
• Physical RAM limit for each Map and Reduce task
• The JVM heap size limit for each task
• The amount of virtual memory each task will get
• You can define how much maximum memory each Map and Reduce task
will take.
• Since each Map and each Reduce will run in a separate Container, these
maximum memory settings should be equal to or greater than the
YARN minimum Container allocation.
• For our example cluster, we have the minimum RAM for a Container
(yarn.scheduler.minimum-allocation-mb) = 2 GB.
• We’ll thus assign 4 GB for Map task Containers, and 8 GB for Reduce task
Containers.
Map & Reduce – Memory Config
• In mapred-site.xml:

<name>mapreduce.map.memory.mb</name>
<value>4096</value>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>

• Each Container will run JVMs for the Map and Reduce
tasks.
• The JVM heap size should be set to lower than the Map and
Reduce memory defined above, so that they are within the
bounds of the Container memory allocated by YARN.
Heap Config
• In mapred-site.xml:

<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>

• The above settings configure the upper limit of the physical RAM that Map
and Reduce tasks will use.
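In the example values, each heap size is 75% of its Container size (3072 of 4096 MB, 6144 of 8192 MB). The sketch below derives a heap option from a Container size using that ratio and checks that a heap fits inside its Container. The 0.75 factor is inferred from the example rather than a stated rule, and heap_opts and heap_fits are hypothetical helper names.

```python
import re


def heap_opts(container_mb, heap_fraction=0.75):
    """Return a -Xmx option sized as a fraction of the Container memory."""
    return f"-Xmx{int(container_mb * heap_fraction)}m"


def heap_fits(container_mb, java_opts):
    """Check that the -Xmx value (in MB) is below the Container memory."""
    match = re.fullmatch(r"-Xmx(\d+)m", java_opts)
    if match is None:
        raise ValueError(f"expected an option like -Xmx3072m, got {java_opts!r}")
    return int(match.group(1)) < container_mb


print(heap_opts(4096), heap_fits(4096, heap_opts(4096)))   # map Containers
print(heap_opts(8192), heap_fits(8192, heap_opts(8192)))   # reduce Containers
```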
ACKNOWLEDGMENT
• Photographs & content used in this presentation
are taken from the following textbooks.
– Douglas Eadline, "Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem", 1st Edition, Pearson Education, 2016. ISBN-13: 978-9332570351
– Anil Maheshwari, "Data Analytics", 1st Edition, McGraw Hill Education, 2017. ISBN-13: 978-9352604180
