Hadoop® 2
Quick-Start Guide
Learn the Essentials of Big
Data Computing in the Apache
Hadoop® 2 Ecosystem
Douglas Eadline
2015030746
Copyright © 2016 Pearson Education, Inc.
Apache®, Apache Hadoop®, and Hadoop® are trademarks of The Apache
Software Foundation. Used with permission. No endorsement by The Apache
Software Foundation is implied by the use of these marks.
All rights reserved. Printed in the United States of America. This publication is
protected by copyright, and permission must be obtained from the publisher
prior to any prohibited reproduction, storage in a retrieval system, or
transmission in any form or by any means, electronic, mechanical,
photocopying, recording, or likewise. To obtain permission to use material from
this work, please submit a written request to Pearson Education, Inc.,
Permissions Department, 200 Old Tappan Road, Old Tappan, New Jersey 07675,
or you may fax your request to (201) 236-3290.
ISBN-13: 978-0-13-404994-6
ISBN-10: 0-13-404994-2
Text printed in the United States on recycled paper at RR Donnelley in
Crawfordsville, Indiana.
First printing, November 2015
Contents
Foreword
Preface
Acknowledgments
About the Author
1 Background and Concepts
Defining Apache Hadoop
A Brief History of Apache Hadoop
Defining Big Data
Hadoop as a Data Lake
Using Hadoop: Administrator, User, or Both
First There Was MapReduce
Apache Hadoop Design Principles
Apache Hadoop MapReduce Example
MapReduce Advantages
Apache Hadoop V1 MapReduce Operation
Moving Beyond MapReduce with Hadoop V2
Hadoop V2 YARN Operation Design
The Apache Hadoop Project Ecosystem
Summary and Additional Resources
2 Installation Recipes
Core Hadoop Services
Hadoop Configuration Files
Planning Your Resources
Hardware Choices
Software Choices
Installing on a Desktop or Laptop
Installing Hortonworks HDP 2.2 Sandbox
Installing Hadoop from Apache Sources
Installing Hadoop with Ambari
Performing an Ambari Installation
Undoing the Ambari Install
Installing Hadoop in the Cloud Using Apache Whirr
Step 1: Install Whirr
Step 2: Configure Whirr
Step 3: Launch the Cluster
Step 4: Take Down Your Cluster
Summary and Additional Resources
3 Hadoop Distributed File System Basics
Hadoop Distributed File System Design Features
HDFS Components
HDFS Block Replication
HDFS Safe Mode
Rack Awareness
NameNode High Availability
HDFS Namespace Federation
HDFS Checkpoints and Backups
HDFS Snapshots
HDFS NFS Gateway
HDFS User Commands
Brief HDFS Command Reference
General HDFS Commands
List Files in HDFS
Make a Directory in HDFS
Copy Files to HDFS
Copy Files from HDFS
Copy Files within HDFS
Delete a File within HDFS
Delete a Directory in HDFS
Get an HDFS Status Report
HDFS Web GUI
Using HDFS in Programs
HDFS Java Application Example
HDFS C Application Example
Summary and Additional Resources
4 Running Example Programs and Benchmarks
Running MapReduce Examples
Listing Available Examples
Running the Pi Example
Using the Web GUI to Monitor Examples
Running Basic Hadoop Benchmarks
Running the Terasort Test
Running the TestDFSIO Benchmark
Managing Hadoop MapReduce Jobs
Summary and Additional Resources
5 Hadoop MapReduce Framework
The MapReduce Model
MapReduce Parallel Data Flow
Fault Tolerance and Speculative Execution
Speculative Execution
Hadoop MapReduce Hardware
Summary and Additional Resources
6 MapReduce Programming
Compiling and Running the Hadoop WordCount Example
Using the Streaming Interface
Using the Pipes Interface
Compiling and Running the Hadoop Grep Chaining Example
Debugging MapReduce
Listing, Killing, and Job Status
Hadoop Log Management
Summary and Additional Resources
7 Essential Hadoop Tools
Using Apache Pig
Pig Example Walk-Through
Using Apache Hive
Hive Example Walk-Through
A More Advanced Hive Example
Using Apache Sqoop to Acquire Relational Data
Apache Sqoop Import and Export Methods
Apache Sqoop Version Changes
Sqoop Example Walk-Through
Using Apache Flume to Acquire Data Streams
Flume Example Walk-Through
Manage Hadoop Workflows with Apache Oozie
Oozie Example Walk-Through
Using Apache HBase
HBase Data Model Overview
HBase Example Walk-Through
Summary and Additional Resources
8 Hadoop YARN Applications
YARN Distributed-Shell
Using the YARN Distributed-Shell
A Simple Example
Using More Containers
Distributed-Shell Examples with Shell Arguments
Structure of YARN Applications
YARN Application Frameworks
Distributed-Shell
Hadoop MapReduce
Apache Tez
Apache Giraph
Hoya: HBase on YARN
Dryad on YARN
Apache Spark
Apache Storm
Apache REEF: Retainable Evaluator Execution Framework
Hamster: Hadoop and MPI on the Same Cluster
Apache Flink: Scalable Batch and Stream Data Processing
Apache Slider: Dynamic Application Management
Summary and Additional Resources
9 Managing Hadoop with Apache Ambari
Quick Tour of Apache Ambari
Dashboard View
Services View
Hosts View
Admin View
Views View
Admin Pull-Down Menu
Managing Hadoop Services
Changing Hadoop Properties
Summary and Additional Resources
10 Basic Hadoop Administration Procedures
Basic Hadoop YARN Administration
Decommissioning YARN Nodes
YARN WebProxy
Using the JobHistoryServer
Managing YARN Jobs
Setting Container Memory
Setting Container Cores
Setting MapReduce Properties
Basic HDFS Administration
The NameNode User Interface
Adding Users to HDFS
Perform an FSCK on HDFS
Balancing HDFS
HDFS Safe Mode
Decommissioning HDFS Nodes
SecondaryNameNode
HDFS Snapshots
Configuring an NFSv3 Gateway to HDFS
Capacity Scheduler Background
Hadoop Version 2 MapReduce Compatibility
Enabling ApplicationMaster Restarts
Calculating the Capacity of a Node
Running Hadoop Version 1 Applications
Summary and Additional Resources
A Book Webpage and Code Download
B Getting Started Flowchart and Troubleshooting Guide
Getting Started Flowchart
General Hadoop Troubleshooting Guide
Rule 1: Don’t Panic
Rule 2: Install and Use Ambari
Rule 3: Check the Logs
Rule 4: Simplify the Situation
Rule 5: Ask the Internet
Other Helpful Tips
C Summary of Apache Hadoop Resources by Topic
General Hadoop Information
Hadoop Installation Recipes
HDFS
Examples
MapReduce
MapReduce Programming
Essential Tools
YARN Application Frameworks
Ambari Administration
Basic Hadoop Administration
D Installing the Hue Hadoop GUI
Hue Installation
Steps Performed with Ambari
Install and Configure Hue
Starting Hue
Hue User Interface
E Installing Apache Spark
Spark Installation on a Cluster
Starting Spark across the Cluster
Installing and Starting Spark on the Pseudo-distributed Single-Node Installation
Run Spark Examples
Index
Foreword
Apache Hadoop 2 introduced new methods of processing and working with data
that moved beyond the basic MapReduce paradigm of the original Hadoop
implementation. Whether you are a newcomer to Hadoop or a seasoned
professional who has worked with the previous version, this book provides a
fantastic introduction to the concepts and tools within Hadoop 2.
Over the past few years, many projects have fallen under the umbrella of the
original Hadoop project to make storing, processing, and collecting large
quantities of data easier while integrating with the Hadoop core. This book
introduces many of these projects in the larger Hadoop ecosystem, giving
readers the high-level basics to get them started using tools that fit their needs.
Doug Eadline adapted much of this material from his very popular video
series Hadoop Fundamentals LiveLessons. However, his qualifications don’t
stop there. He is also a coauthor of the in-depth book Apache Hadoop™ YARN:
Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2, so few are
as well qualified to deliver coverage of Hadoop 2 and the new features it brings to
users.
I’m excited about the great wealth of knowledge that Doug has brought to the
series with his books covering Hadoop and its related projects. This book will be
a great resource for both newcomers looking to learn more about the problems
that Hadoop can help them solve and for existing users looking to learn about the
benefits of upgrading to the new version.
—Paul Dix, Series Editor
Preface
Apache Hadoop 2 has changed the data analytics landscape. The Hadoop 2
ecosystem has moved beyond a single MapReduce data processing methodology
and framework. That is, Hadoop version 2 opens the Hadoop platform to almost
any type of data processing while providing full backward
compatibility with the venerable MapReduce paradigm from version 1.
This change has already had a dramatic effect on many areas of data
processing and data analytics. The increased volume of online data has invited
new and scalable approaches to data analytics. As discussed in Chapter 1, the
concept of the Hadoop data lake represents a paradigm shift away from many
established approaches to online data usage and storage. A Hadoop version 2
installation is an extensible platform that can grow and adapt as both data
volumes increase and new processing models become available.
For this reason, the “Hadoop approach” is important and should not be
dismissed as a simple “one-trick pony” for Big Data applications. In addition,
the open source nature of Hadoop and much of the surrounding ecosystem
provides an important incentive for adoption. Thanks to the Apache Software
Foundation (ASF), Hadoop has always been an open source project whose inner
workings are available to anyone. The open model has allowed vendors and
users to share a common goal without lockin or legal barriers that might
otherwise splinter a huge and important project such as Hadoop. All software
used in this book is open source and is freely available. Links leading to the
software are provided at the end of each chapter and in Appendix C.
Book Structure
The basic structure of this book was adapted from my video tutorials, Hadoop
Fundamentals LiveLessons, Second Edition, and Apache Hadoop YARN
Fundamentals LiveLessons, from Addison-Wesley. Almost all of the examples
are identical to those found in the videos. Some readers may find it beneficial to
watch the videos in conjunction with reading the book as I carefully step through
all the examples.
A few small pieces have been borrowed from Apache Hadoop™ YARN:
Moving beyond MapReduce and Batch Processing with Apache Hadoop™ 2, a
book that I coauthored. If you want to explore YARN application development
in more detail, you may want to consider reading this book and viewing its
companion video.
Much of this book uses the Hortonworks Data Platform (HDP) for Hadoop.
The HDP is a fully open source Hadoop distribution made available by
Hortonworks. While it is possible to download and install the core Hadoop
system and tools (as is discussed in Chapter 2), using an integrated distribution
reduces many of the issues that may arise from the “roll your own” approach. In
addition, the Apache Ambari graphical installation and management tool is too
good to pass up and supports the Hortonworks HDP packages. HDP version 2.2
and Ambari 1.7 were used for this book. As I write this preface, Hortonworks
has just announced the launch of HDP version 2.3 with Apache Ambari 2.0. (So
much for staying ahead of the curve in the Hadoop world!) Fortunately, the
fundamentals remain the same and the examples are all still relevant.
The chapters in this text have been arranged to provide a flexible introduction
for new readers. As delineated in Appendix B, “Getting Started Flowchart and
Troubleshooting Guide,” there are two paths you can follow: read Chapters 1, 3,
and 5 and then start playing with the examples, or jump right in and run the
examples in Chapter 4. If you don’t have a Hadoop environment, Chapter 2
provides a way to install Hadoop on a variety of systems, including a laptop or
small desk-side computer, a cluster, or even in the cloud. Presumably after
running examples, you will go back and read the background chapters.
Chapter 1 provides essential background on Hadoop technology and history.
The Hadoop data lake is introduced, along with an overview of the MapReduce
process found in version 1 of Hadoop. The big changes in Hadoop version 2 are
described, and the YARN resource manager is introduced as a way forward for
almost any computing model. Finally, a brief overview of the many software
projects that make up the Hadoop ecosystem is presented. This chapter provides
an underpinning for the rest of the book.
If you need access to a Hadoop system, a series of installation recipes is
provided in Chapter 2. There is also an explanation of the core Hadoop services
and the way in which they are configured. Some general advice for choosing
hardware and software environments is provided, but the main focus is on
providing a platform to learn about Hadoop. Fortunately, there are two ways to
do this without purchasing or renting any hardware. The Hortonworks Hadoop
sandbox provides a Linux virtual machine that can be run on almost any
platform. The sandbox is a full Hadoop install and provides an environment
through which to explore Hadoop. As an alternative to the sandbox, the
installation of Hadoop on a single Linux machine provides a learning platform
and offers some insights into the Hadoop core components. Chapter 2 also
addresses cluster installation using Apache Ambari for a local cluster or Apache
Whirr for a cloud deployment.
All Hadoop applications use the Hadoop Distributed File System (HDFS).
Chapter 3 covers some essential HDFS features and offers quick tips on how to
navigate and use the file system. The chapter concludes with some HDFS
programming examples. It provides important background and should be
consulted before trying the examples in later chapters.
Chapter 4 provides a show-and-tell walk-through of some Hadoop examples
and benchmarks. The Hadoop Resource Manager web GUI is also introduced as
a way to observe application progress. The chapter concludes with some tips on
controlling Hadoop MapReduce jobs. Use this chapter to get a feel for how
Hadoop applications run and operate.
The MapReduce programming model, while simple in nature, can be a bit
confusing when run across a cluster. Chapter 5 provides a basic introduction to
the MapReduce programming model using simple examples. The chapter
concludes with a simplified walk-through of the parallel Hadoop MapReduce
process. This chapter will help you understand the basic Hadoop MapReduce
terminology.
If you are interested in low-level Hadoop programming, Chapter 6 provides an
introduction to Hadoop MapReduce programming. Several basic approaches are
covered, including Java, the streaming interface with Python, and the C++ Pipes
interface. A short example also explains how to view application logs. This
chapter is not essential for using Hadoop. In fact, many Hadoop users begin with
the high-level tools discussed in Chapter 7.
While many applications have been written to run on the native Hadoop Java
interface, a wide variety of tools are available that provide a high-level approach
to programing and data movement. Chapter 7 introduces (with examples)
essential Hadoop tools including Apache Pig (scripting language), Apache Hive
(SQL-like language), Apache Sqoop (RDMS import/export), and Apache Flume
(serial data import). An example demonstrating how to use the Oozie workflow
manager is also provided. The chapter concludes with an Apache HBase (big
table database) example.
If you are interested in learning more about Hadoop YARN applications,
Chapter 8 introduces non-MapReduce applications under Hadoop. As a simple
example, the YARN Distributed-Shell is presented, along with a discussion of
how YARN applications work under Hadoop version 2. A description of the
latest non-MapReduce YARN applications is provided as well.
If you installed Hadoop with Apache Ambari in Chapter 2, Chapter 9 provides
a tour of its capabilities and offers some examples that demonstrate how to use
Ambari on a real Hadoop cluster. A tour of Ambari features and procedures to
restart Hadoop services and change system-wide Hadoop properties is presented
as well. The basic steps outlined in this chapter are used in Chapter 10 to make
administrative changes to the cluster.
Chapter 10 provides some basic Hadoop administration procedures. Although
administrators will find information on basic procedures and advice in this
chapter, other users will also benefit by discovering how HDFS, YARN, and the
Capacity scheduler can be configured for their workloads.
Consult the appendixes for information on the book webpage, a getting started
flowchart, and a general Hadoop troubleshooting guide. The appendixes also
include a resources summary page and procedures for installing Apache Hue (a
high-level Hadoop GUI) and Apache Spark (a popular non-MapReduce
programming model).
Finally, the Hadoop ecosystem continues to grow rapidly. Many of the
existing Hadoop applications and tools were intentionally not covered in this text
because their inclusion would have turned this book into a longer and slower
introduction to Hadoop 2. And, there are many more tools and applications on
the way! Given the dynamic nature of the Hadoop ecosystem, this introduction
to Apache Hadoop 2 is meant to provide both a compass and some important
waypoints to aid in your navigation of the Hadoop 2 data lake.
Book Conventions
Code and file references are displayed in a monospaced font. Code input lines
that wrap because they are too long to fit on one line in this book are denoted
with a continuation symbol. Long output lines are wrapped at page boundaries without
the symbol.
Accompanying Code
Please see Appendix A, “Book Webpage and Code Download,” for the location
of all code used in this book.
Acknowledgments
Some of the figures and examples were inspired and derived from the Yahoo!
Hadoop Tutorial (https://developer.yahoo.com/hadoop/tutorial/), the Apache
Software Foundation (ASF; http://www.apache.org), Hortonworks
(http://hortonworks.com), and Michael Noll (http://www.michael-noll.com). Any
copied items either had permission for use granted by the author or were
available under an open sharing license.
Many people have worked behind the scenes to make this book possible.
Thank you to the reviewers who took the time to carefully read the rough drafts:
Jim Lux, Prentice Bisbal, Jeremy Fischer, Fabricio Cannini, Joshua Mora,
Matthew Helmke, Charlie Peck, and Robert P. J. Day. Your feedback was very
valuable and helped make for a sturdier book.
To Debra Williams Cauley of Addison-Wesley, your kind efforts and office at
the GCT Oyster Bar made the book-writing process almost easy. I also cannot forget to
thank my support crew: Emily, Marlee, Carla, and Taylor—yes, another book
you know nothing about. And, finally, the biggest thank you to my patient and
wonderful wife, Maddy, for her constant support.
About the Author
Currently, he is a writer and consultant to the HPC industry and leader of the
Limulus Personal Cluster Project (http://limulus.basement-supercomputing.com).
He is the author of the Hadoop Fundamentals LiveLessons and
Apache Hadoop YARN Fundamentals LiveLessons videos from Addison-Wesley
and book coauthor of Apache Hadoop™ YARN: Moving beyond MapReduce and
Batch Processing with Apache Hadoop™ 2.
1. Background and Concepts
In This Chapter:
The Apache Hadoop project is introduced along with a working definition
of Big Data.
The concept of a Hadoop data lake is developed and contrasted with
traditional data storage methods.
A basic overview of the Hadoop MapReduce process is presented.
The evolution of Hadoop version 1 (V1) to Hadoop version 2 (V2) with
YARN is explained.
The Hadoop ecosystem is explained and many of the important projects
are introduced.
Apache Hadoop represents a new way to process large amounts of data. Rather
than a single program or product, Hadoop is more of an approach to scalable
data processing. The Hadoop ecosystem encompasses many components, and
the current capabilities of Hadoop version 2 far exceed those of version 1. Many
of the important Hadoop concepts and components are introduced in this chapter.
Figure 1.2 Loading data into HDFS (Adapted from Yahoo Hadoop
Documentation)
After files are loaded into HDFS, the MapReduce engine can use them.
Consider the following simplified example. If we were to load the text file War
and Peace into HDFS, it would be transparently sliced and remain unchanged
from a content perspective.
The mapping step is where a user query is “mapped” to all nodes. That is, the
query is applied to all the slices independently. Actually, the map is applied to
logical splits of the data so that words (or records, or some other partitions) that
were physically split by data slicing are kept together. For example, the query
“How many times is the name Kutuzov mentioned in War and Peace?” might be
applied to each text slice or split. This process is depicted in Figure 1.3. In this
case, the mapping function takes an input list (data slices or splits) and produces
an output list—a count of how many times Kutuzov appears in the text slice. The
output list is a list of numbers.
Figure 1.3 Applying the mapping function to the sliced data (Adapted from
Yahoo Hadoop Documentation)
Once the mapping is done, the output list of the map processes becomes the
input list to the reduce process. For example, the individual sums, or the counts
for Kutuzov from each input list (the output list of the map step), are combined
to form a single number. As shown in Figure 1.4, the resulting data are reduced in
this step. Like the mapping function, the reduction can take many forms and, in
general, collects and “reduces” the information from the mapping step. In this
example, the reduction is a sum.
Figure 1.4 Reducing the results of the mapping step to a single output value
(Adapted from Yahoo Hadoop Documentation)
MapReduce is a simple two-step algorithm that provides the user with
complete control over the mapping and reducing steps. The following is a
summary of the basic aspects of the MapReduce process:
1. Files are loaded into HDFS (performed one time). Example: Load the War and
Peace text file.
2. The user query is “mapped” to all slices. Example: How many times is the
name Kutuzov mentioned in this slice?
3. Results are “reduced” to one answer. Example: Collect and sum the counts
for Kutuzov from each map step. The answer is a single number.
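For readers who want to try this idea right away, the same kind of count can be run with the grep example program that ships with Hadoop (the example programs and the streaming interface are covered in Chapters 4 and 6). This is only a sketch: the JAR path assumes the Hadoop 2.6.0 layout used later in this book, and war-and-peace-input is an HDFS directory that holds the text.
$ yarn jar /opt/hadoop-2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep war-and-peace-input kutuzov-count 'Kutuzov'
$ hdfs dfs -cat kutuzov-count/part-r-00000
The map step counts regular expression matches in each input split, and the reduce step sums the per-split counts into the single number written to the output file.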
MapReduce Advantages
The MapReduce process can be considered a functional approach because the
original input data does not change. The steps in the MapReduce process only
create new data. Like the original input data, this intermediate data is not
changed by the MapReduce process.
As shown in the previous section, the actual processing is based on a single
one-way communication path from mappers to reducers (you can’t go back and
change data!). Since it is the same for all MapReduce processes, it can be made
transparent to the end user. There is no need for the user to specify
communication or data movement as part of the MapReduce process. This
design provides the following features:
Highly scalable. As the input data size grows, more nodes can be applied to
the problem (often with linear scalability).
Easily managed work flow. Since all jobs use the same underlying
processes, the workflow and load can also be handled in a transparent
fashion. The user does not need to manage cluster resources as part of the
MapReduce process.
Fault tolerance. Inputs are immutable, so any result can be recalculated
because inputs have not changed. A failed process can be restarted on other
nodes. A single failure does not stop the entire MapReduce job. Multiple
failures can often be tolerated as well—depending on where they are
located. In general, a hardware failure may slow down the MapReduce
process but not stop it entirely.
MapReduce is a powerful paradigm for solving many problems. These problems
are often referred to as data parallel or single instruction/multiple data
(SIMD) problems.
In This Chapter:
The Core Apache Hadoop services and configuration files are introduced.
Background on basic Apache Hadoop Resource planning is provided.
Step-by-step single-machine installation procedures for a virtual Apache
Hadoop sandbox and a pseudo-distributed mode are provided.
A full cluster graphical installation is performed using the Apache Ambari
installation and management tool.
A cloud-based Apache Hadoop cluster is created and configured using the
Apache Whirr toolkit.
Installing Hadoop is an open-ended subject. As mentioned in Chapter 1,
“Background and Concepts,” Hadoop is an expanding ecosystem of tools for
processing various types and volumes of data. Any Hadoop installation is
ultimately dependent on your goals and project plans. In this chapter, we start by
installing on a single system, then move to a full local cluster install, and finish
with a recipe for installing Hadoop in the cloud. Each installation scenario has a
different goal—learning on a small scale or implementing a full production
cluster.
Hardware Choices
The first hardware choice is often whether to use a local machine or cloud
services. Both options depend on your needs and budget. In general, local
machines take longer to procure and provision and incur administrative and power
costs, but they offer fast internal data transfers and on-premises security. For their
part, cloud-based clusters are quickly procured and provisioned, do not require
on-site power and administration, but still require Hadoop administration and
off-site data transfer and storage. Each option has both advantages and
disadvantages. Many Hadoop projects begin with a cloud-based feasibility stage
and end up in production on an internal cluster. In
addition, because Hadoop uses generic hardware, it is possible to cobble together
several older servers and easily create an internal test system.
Hadoop components are designed to work on commodity servers. These
systems are usually multicore x86-based servers with hard drives for storing
HDFS data. Newer systems employ 10 Gigabit Ethernet (10GbE) as the
communication network. The Hadoop design provides multiple levels of failover
that can tolerate a failed server or even an entire rack of servers. Building large
clusters is not a trivial process, however; it requires designs that provide
adequate network performance, support for failover strategies, server storage
capacity, processor size (cores), workflow policies, and more.
The various Hadoop distributers provide freely available guides that can help
in choosing the right hardware. Many hardware vendors also have Hadoop
recipes. Nevertheless, using a qualified consultant or Hadoop vendor is
recommended for large projects.
Software Choices
The system software requirements for a Hadoop installation are somewhat basic.
The installation of the official Apache Hadoop releases still relies on a
Linux host and a Linux file system such as ext3, ext4, XFS, or btrfs. A Java
Development Kit, either a later release of version 1.6 or version 1.7, is required. The
officially supported versions can be found at
http://wiki.apache.org/hadoop/HadoopJavaVersions. Various vendors have tested
both the Oracle JDK and the OpenJDK. The OpenJDK that comes with many
popular Linux distributions should work for most installs (make sure the version
number is 1.7 or higher). All of the major distributions of Linux should work as
a base operating system; these include Red Hat Enterprise Linux (or rebuilds like
CentOS), Fedora, SLES, Ubuntu, and Debian.
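As a quick sanity check before installing (a minimal sketch that assumes a standard Linux host), the following commands report the installed JDK version and the actual path behind the java command, which is also a reasonable starting point for setting JAVA_HOME:
$ java -version
$ readlink -f $(which java)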
Hadoop versions 2.2 and later include native support for Windows. The
official Apache Hadoop releases do not include Windows binaries (as of July
2015). However, building a Windows package from the sources is fairly
straightforward.
Many decisions go into a production Hadoop cluster that can be ignored for
small feasibility projects (although the feasibility projects are certainly a good
place to test the various options before putting them into production). These
decisions include choices related to Secure Mode Hadoop operation, HDFS
Federation and High Availability, and checkpointing.
By default, Hadoop runs in non-secure mode in which no actual authentication
is required throughout the cluster other than the basic POSIX-level security.
When Hadoop is configured to run in secure mode, each user and service needs
to be authenticated by Kerberos to use Hadoop services. More information on
Secure Mode Hadoop can be found at
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html. Security features of Hadoop consist of
authentication, service level authorization, authentication for web consoles, and
data confidentiality.
HDFS NameNode Federation and NameNode HA (High Availability) are the
two important decisions for most organizations. NameNode Federation
significantly improves the scalability and performance of HDFS by introducing
the ability to deploy multiple NameNodes for a single cluster. In addition to
federation, HDFS introduces built-in high availability for the NameNode via a
new feature called the Quorum Journal Manager (QJM). QJM-based HA
includes an active NameNode and a standby NameNode. The standby
NameNode can become active either by a manual process or automatically.
Background on these HDFS features is presented in Chapter 3.
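As a small illustration (assuming an HA configuration is already in place and the two NameNodes were given the IDs nn1 and nn2 in hdfs-site.xml; the IDs here are placeholders), the hdfs haadmin tool reports which NameNode is currently active and which is the standby:
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2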
The password is hadoop. Once you are logged in, you can use all of the
Hadoop features installed in the appliance. You can also connect via a web
interface by entering http://127.0.0.1:8888 into your host browser. On
first use, there is a Hortonworks registration screen that requests some basic
information. Once you enter the information, the web GUI is available for use.
# cd /root
# wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Next, move to the YARN installation root and create the log directory and set
the owner and group as follows:
# cd /opt/hadoop-2.6.0
# mkdir logs
# chmod g+w logs
# chown -R yarn:hadoop .
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=250
Finally, to stop some warnings about native Hadoop libraries, edit hadoop-
env.sh and add the following to the end:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
Step 10: Format HDFS
For the HDFS NameNode to start, it needs to initialize the directory where it will
hold its data. The NameNode service tracks all the metadata for the file system.
The format process will use the value assigned to
dfs.namenode.name.dir in etc/hadoop/hdfs-site.xml earlier
(i.e., /var/data/hadoop/hdfs/nn). Formatting destroys everything in the
directory and sets up a new file system. Format the NameNode directory as the
HDFS superuser, which is typically the hdfs user account.
From the base of the Hadoop distribution, change directories to the bin
directory and execute the following commands:
# su - hdfs
$ cd /opt/hadoop-2.6.0/bin
$ ./hdfs namenode -format
If the command worked, you should see a message near the end of a long list of
messages indicating that the storage directory has been successfully formatted.
Once formatting is complete, the HDFS services can be started. As user hdfs,
change to the sbin directory and start the NameNode daemon with the
hadoop-daemon.sh script:
$ cd /opt/hadoop-2.6.0/sbin
$ ./hadoop-daemon.sh start namenode
This command should result in the following output (the logging file name has
the host name appended—in this case, the host name is limulus):
starting namenode, logging to /opt/hadoop-2.6.0/logs/hadoop-hdfs-namenode-limulus.out
The SecondaryNameNode and DataNode services can be started in the same
way:
$ ./hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/hadoop-2.6.0/logs/hadoop-hdfs-secondarynamenode-limulus.out
$ ./hadoop-daemon.sh start datanode
starting datanode, logging to /opt/hadoop-2.6.0/logs/hadoop-hdfs-datanode-limulus.out
If the daemon started, you should see responses that will point to the log file.
(Note that the actual log file is appended with .log, not .out.) As a sanity
check, issue a jps command to confirm that all the services are running. The
actual PID (Java process ID) values will be different than shown in this listing:
$ jps
15140 SecondaryNameNode
15015 NameNode
15335 Jps
15214 DataNode
If the process did not start, it may be helpful to inspect the log files. For
instance, examine the log file for the NameNode. (Note that the path is taken
from the preceding command and the host name is part of the file name.)
vi /opt/hadoop-2.6.0/logs/hadoop-hdfs-namenode-limulus.log
If you get warning messages that the system is “Unable to load native-hadoop
library for your platform,” you can ignore them. The Apache Hadoop
distribution is compiled for 32-bit operation, and this warning often appears
when it is run on 64-bit systems.
All Hadoop services can be stopped using the hadoop-daemon.sh script.
For example, to stop the DataNode service, enter the following command (as
user hdfs in the /opt/hadoop-2.6.0/sbin directory):
$ ./hadoop-daemon.sh stop datanode
The same can be done for the NameNode and SecondaryNameNode services.
The other service we will need is the MapReduce history server, which keeps
track of MapReduce jobs.
$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/hadoop-2.6.0/logs/mapred-yarn-historyserver-limulus.out
As when the HDFS daemons were started earlier, the status of the running
daemons is sent to their respective log files. To check whether the services are
running, issue a jps command. The following shows all the services necessary
to run YARN on a single server:
$ jps
15933 Jps
15567 ResourceManager
15785 NodeManager
15919 JobHistoryServer
If there are missing services, check the log file for the specific service. Similar
to the case with HDFS services, the YARN services can be stopped by issuing a
stop argument to the daemon script:
./yarn-daemon.sh stop nodemanager
Step 13: Verify the Running Services Using the Web Interface
Both HDFS and the YARN ResourceManager have a web interface. These
interfaces offer a convenient way to browse many of the aspects of your Hadoop
installation. To monitor HDFS, enter the following:
$ firefox http://localhost:50070
Connecting to port 50070 will bring up a web interface similar to Figure 2.9.
$ firefox http://localhost:8088
If the program worked correctly, the following should be displayed at the end
of the program output stream:
Estimated value of Pi is 3.14250000000000000000
This example submits a MapReduce job to YARN from the included samples
in the share/hadoop/mapreduce directory. The master JAR file contains
several sample applications to test your YARN installation. After you submit the
job, its progress can be viewed by updating the ResourceManager webpage
shown in Figure 2.10.
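As a sketch of how such a job is submitted, the pi example takes the number of map tasks and the number of samples per map as arguments; a small run such as 16 maps with 1,000 samples each completes quickly on a single node (the YARN_EXAMPLES variable below is only a convenience for the path under the Hadoop 2.6.0 install used in this chapter):
$ export YARN_EXAMPLES=/opt/hadoop-2.6.0/share/hadoop/mapreduce
$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.6.0.jar pi 16 1000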
Once the Pig tar file is downloaded (we will assume it is downloaded into
/root), it can be extracted into the /opt directory.
# cd /opt
# tar xvzf /root/pig-0.14.0.tar.gz
Similar to the case with the earlier Hadoop install, Pig defines may be placed
in /etc/profile.d so that when users log in, the defines are automatically
placed in their environment.
# echo 'export PATH=/opt/pig-0.14.0/bin:$PATH; export PIG_HOME=/opt/pig-0.14.0/; PIG_CLASSPATH=/opt/hadoop-2.6.0/etc/hadoop/' > /etc/profile.d/pig.sh
If the Pig environment variables are needed for this session, they can be added
by sourcing the new script:
# source /etc/profile.d/pig.sh
Pig is now installed and ready for use. See Chapter 7, “Essential Hadoop
Tools,” for examples of how to use Apache Pig.
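As a quick, optional sanity check (assuming the environment defines above are in place), Pig can be started in local mode and then exited from the Grunt prompt:
$ pig -x local
grunt> quit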
# wget http://mirrors.ibiblio.org/apache/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
# su - hdfs
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse
Note
If you are using Hadoop 2.6.0 and Hive 1.1.0, there is a library mismatch
that will generate a jline-related error message when you start Hive.
This error arises because Hive has upgraded to Jline2, but Jline 0.94 exists
in the Hadoop lib directory.
To fix the error, perform the following steps:
1. Delete jline from the Hadoop lib directory (it's pulled in transitively
from ZooKeeper):
# rm $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar
If the Hive environment variables are needed for this session, they can be
added by sourcing the new script:
$ source /etc/profile.d/hive.sh
Hive is now installed and ready for use. See Chapter 7, “Essential Hadoop
Tools,” for examples of how to use Apache Hive.
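A similar optional check can be performed for Hive by starting the command-line interface, issuing a trivial query, and exiting (this assumes the jline fix described in the note above has been applied):
$ hive
hive> show databases;
hive> exit;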
Note
Apache Ambari cannot be installed on top of an existing Hadoop
installation. To maintain the cluster state, Ambari must perform the
installation of all Hadoop components.
The pdsh package needs to be installed only on the main Ambari server
node. For pdsh to work properly, root must be able to ssh without a
password from the Ambari server node to all worker nodes. This capability
requires that each worker node has the Ambari server root public ssh key
installed in /root/.ssh.
Once the Ambari agent is installed, the Ambari server host name must be set
on all nodes. Substitute _FQDN_ in the line below with the name of your Ambari
server (the server node nickname should work as well). Again, this task is easily
accomplished with pdsh.
# pdsh -w n[0-2] "sed -i 's/hostname=localhost/hostname=_FQDN_/g' /etc/ambari-agent/conf/ambari-agent.ini"
Finally, the Ambari agents can be started across the cluster. (Hint: Placing a
|sort after a pdsh command will sort the output by node.)
# pdsh -w n[0-2] "service ambari-agent start" | sort
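If desired, the agent status can be confirmed in the same fashion before moving on (the node names n[0-2] match the example cluster used here):
# pdsh -w n[0-2] "service ambari-agent status" | sort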
The following is an example Ambari server dialog (inputs are in bold). In this
system, iptables has been configured to allow all traffic on the internal
cluster network.
Using python /usr/bin/python2.6
Setup ambari-server
Checking SELinux...
SELinux status is 'disabled'
Customize user account for ambari-server daemon [y/n] (n)? n
Adjusting ambari-server permissions and ownership...
Checking firewall...
WARNING: iptables is running. Confirm the necessary Ambari ports are
accessible.
Refer to the Ambari documentation for more details on ports.
OK to continue [y/n] (y)? y
Checking JDK...
WARNING: JAVA_HOME /usr/lib/jvm/java-1.7.0-openjdk.x86_64 must be
valid on ALL hosts
WARNING: JCE Policy files are required for configuring Kerberos
security. If you
plan to use Kerberos, please make sure JCE Unlimited Strength
Jurisdiction Policy
Files are valid on all hosts.
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? n
Default properties detected. Using built-in database.
Checking PostgreSQL...
Running initdb: This may take upto a minute.
Initializing database: [ OK ]
Finally, the Ambari server and agent can be started on the main node by
entering the following commands:
# service ambari-agent start
# ambari-server start
If everything is working properly, you should see the sign-in screen shown in
Figure 2.12. The default username is admin and the password is admin. The
password should be changed after the cluster is installed.
Figure 2.20 Ambari assign slaves and clients (note limulus also serves as a
DataNode and NodeManager)
As with the nodes running Hadoop services, the roles of individual slave
nodes depend on your specific needs. In this example, slaves can take on all
roles (all boxes checked). In addition, the main node (limulus) is used as a
worker node. In a production system with more nodes, this configuration is not
recommended.
Once you’re finished with this screen, click Next to bring up the Customize
Services window, shown in Figure 2.21.
Figure 2.21 Ambari customization window
In this step, Hadoop services can be customized. These settings are placed in
the /etc/hadoop/conf XML configuration files. Each service can be tuned
to your specific needs using this screen. (As will be discussed in more detail in
Chapter 9, “Managing Hadoop with Apache Ambari,” you should not modify the
XML files by hand.) Make sure to check the NameNode, Secondary NameNode,
and DataNode directory assignments. An explanation of each setting can be
obtained by placing the mouse over the text box. Settings can be undone by
clicking the undo box that appears under the text box.
The services with red numbers near their names require user attention. In the
case of Hive, Nagios, and Oozie, passwords need to be assigned for the service.
In addition, Nagios requires a system administration email address to send alerts.
When you’re finished, click Next at the bottom of the page. Note that the Next
icon will be grayed out until all the required settings have been made. A Review
window will be presented with all the settings listed, as shown in Figure 2.22. If
you like, you can print this page for reference. If you find an issue or want to
make a change, it is possible to go back and make changes at this point.
At this point, it is probably a good idea to restart all the Ambari agents on the
nodes. For instance, you can restart the Ambari agent for the previous
installation example (on both the worker nodes and the main node):
# service ambari-agent restart
# pdsh -w n[0-2] "service ambari-agent restart" | sort
Next, as was done previously, the Ambari server must be set up and restarted.
# ambari-server setup -j /usr/lib/jvm/java-1.7.0-openjdk.x86_64
# ambari-server start
# python /usr/lib/python2.6/site-packages/ambari_agent/HostCleanup.py
--help
Usage: HostCleanup.py [options]
Options:
-h, --help show this help message and exit
-v, --verbose output verbosity.
-f FILE, --file=FILE host check result file to read.
-o FILE, --out=FILE log file to store results.
-k SKIP, --skip=SKIP (packages|users|directories|repositories|processes|alternatives). Use , as separator.
-s, --silent Silently accepts default prompt values
Note
Some versions of the Java JDK have a bug that will cause Whirr to fail. If
you get an error that looks like this:
org.jclouds.rest.RestContext<org.jclouds.aws.ec2.AWSEC2Client, A> cannot be used as a key; it is not fully specified.
then you may want to try a different JDK. This bug is reported to have
surfaced in Java 1.7u51 (java-1.7.0-openjdk-devel-1.7.0.51-2.4.4.1.el6_5.x86_64).
For the example, Java 1.7u45 was used (java-1.7.0-openjdk-1.7.0.45-2.4.3.2.el6_4.x86_64).
$ wget http://mirrors.ibiblio.org/apache/whirr/stable/whirr-0.8.2.tar.gz
Whirr comes with recipes for setting up Hadoop clusters in the cloud. We will
use a basic Hadoop recipe, but the Whirr documentation offers tips on further
customization. To configure a Hadoop version 2.6 cluster, copy the recipe as
follows:
$ cp whirr-0.8.2/recipes/hadoop-yarn-ec2.properties .
Next, open the hadoop-yarn-ec2.properties file in a text editor and set the Hadoop version line to
whirr.hadoop.version=2.6.0
Next, comment out the following lines that set the provider and
credentials (add a # in front of each line):
#whirr.provider=aws-ec2
#whirr.identity=${env:AWS_ACCESS_KEY_ID}
#whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
If you do not have a proper ssh public and private key in ~/.ssh, you will
need to run ssh-keygen (i.e., you should have both id_rsa and
id_rsa.pub in your ~/.ssh).
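If a key pair needs to be generated, the default prompts of ssh-keygen are sufficient for this example (the passphrase can be left empty):
$ ssh-keygen -t rsa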
Finally, the roles of the cloud instances are set with the following line:
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,3 hadoop-datanode+yarn-nodemanager
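The cluster can then be launched with the whirr command-line tool and the properties file edited above (the relative paths below assume the file was copied into the current directory, as shown earlier):
$ whirr-0.8.2/bin/whirr launch-cluster --config hadoop-yarn-ec2.properties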
After some time, and if all goes well, you will see something similar to the
following output. A large number of messages will scroll across the screen while
the cluster boots; when the boot is finished, the following will be displayed (IP
address will be different):
[hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver]: ssh -i /home/hdfs/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no hdfs@54.146.139.132
[hadoop-datanode+yarn-nodemanager]: ssh -i /home/hdfs/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no hdfs@<worker-node-1-IP>
[hadoop-datanode+yarn-nodemanager]: ssh -i /home/hdfs/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no hdfs@<worker-node-2-IP>
[hadoop-datanode+yarn-nodemanager]: ssh -i /home/hdfs/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no hdfs@<worker-node-3-IP>
To destroy cluster, run 'whirr destroy-cluster' with the same options
used to
launch it.
There are four IP addresses—one for the main node (54.146.139.132) and
three for the worker nodes. The following line will allow you to ssh to the main
node without a password. Whirr also creates an account under your local user
name and imports your ssh public key. (The command ssh hdfs@54.146.139.132
should work as well.)
$ ssh -i /home/hdfs/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no hdfs@54.146.139.132
If the login was successful, the following prompt showing the private IP
address should be available. The ip-10-234-19-148 will be different from
the public IP address used to reach the main node. This IP address is for the
private network that is available only within the cluster:
hdfs@ip-10-234-19-148:~$ hdfs dfsadmin -report
-------------------------------------------------
Live datanodes (3):
...
The worker nodes’ private IP addresses can be found using the hdfs
command as follows:
$ hdfs dfsadmin -report|grep Name
Name: 10.12.93.252:50010 (ip-10-12-93-252.ec2.internal)
Name: 10.152.159.179:50010 (ip-10-152-159-179.ec2.internal)
Name: 10.166.54.102:50010 (ip-10-166-54-102.ec2.internal)
You can ssh without a password to these IP addresses from the main node
when working as user hdfs. Note that this user name is the local user name
from which you started your whirr cluster.
When working as user hadoop, you can give the jps command to verify
that the NameNode, ResourceManager, and JobHistoryServer are running:
$ jps
7226 JobHistoryServer
7771 Jps
7150 ResourceManager
5703 NameNode
Logging into the worker nodes can be done by using the private or public IP
address. For instance, you can log in from the main node to a worker node using
one of the private IP addresses reported above:
$ ssh 10.166.54.102
Once on the worker node, you can check which services are running by using the
jps command. In this case, as specified in the properties file, the workers are
running the DataNode and NodeManager daemons:
hdfs@ip-10-166-54-102:~$ sudo su - hadoop
$ jps
5590 DataNode
6730 Jps
6504 NodeManager
Finally, you can view the HDFS web interface on your local machine by
starting a browser with the following IP address and port number. (Your IP
address will be different.) For Firefox:
$ firefox http://54.146.139.132:50070
Similarly, the YARN web interface can be viewed by entering this command:
$ firefox http://54.146.139.132:8088
Whirr has more features and options that can be explored by consulting the
project web page located at https://whirr.apache.org.
In This Chapter:
The design and operation of the Hadoop Distributed File System (HDFS)
are presented.
Important HDFS topics such as block replication, Safe Mode, rack
awareness, High Availability, Federation, backup, snapshots, NFS
mounting, and the HDFS web GUI are discussed.
Examples of basic HDFS user commands are provided.
HDFS programming examples using Java and C are provided.
The Hadoop Distributed File System is the backbone of Hadoop MapReduce
processing. New users and administrators often find HDFS different than most
other UNIX/Linux file systems. This chapter highlights the design goals and
capabilities of HDFS that make it useful for Big Data processing.
HDFS Components
The design of HDFS is based on two types of nodes: a NameNode and multiple
DataNodes. In a basic design, a single NameNode manages all the metadata
needed to store and retrieve the actual data from the DataNodes. No data is
actually stored on the NameNode, however. For a minimal Hadoop installation,
there needs to be a single NameNode daemon and a single DataNode daemon
running on at least one machine (see the section “Installing Hadoop from
Apache Sources” in Chapter 2, “Installation Recipes”).
The design is a master/slave architecture in which the master (NameNode)
manages the file system namespace and regulates access to files by clients. File
system namespace operations such as opening, closing, and renaming files and
directories are all managed by the NameNode. The NameNode also determines
the mapping of blocks to DataNodes and handles DataNode failures.
The slaves (DataNodes) are responsible for serving read and write requests
from the file system to the clients. The NameNode manages block creation,
deletion, and replication.
An example of the client/NameNode/DataNode interaction is provided in
Figure 3.1. When a client writes data, it first communicates with the NameNode
and requests to create a file. The NameNode determines how many blocks are
needed and provides the client with the DataNodes that will store the data. As
part of the storage process, the data blocks are replicated after they are written to
the assigned node. Depending on how many nodes are in the cluster, the
NameNode will attempt to write replicas of the data blocks on nodes that are in
other separate racks (if possible). If there is only one rack, then the replicated
blocks are written to other servers in the same rack. After the DataNode
acknowledges that the file block replication is complete, the client closes the file
and informs the NameNode that the operation is complete. Note that the
NameNode does not write any data directly to the DataNodes. It does, however,
give the client a limited amount of time to complete the operation. If it does not
complete in the time period, the operation is canceled.
Figure 3.1 Various system roles in an HDFS deployment
Reading data happens in a similar fashion. The client requests a file from the
NameNode, which returns the best DataNodes from which to read the data. The
client then accesses the data directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps
back and lets the conversation between the client and the DataNodes proceed.
While data transfer is progressing, the NameNode also monitors the DataNodes
by listening for heartbeats sent from DataNodes. The lack of a heartbeat signal
indicates a potential node failure. In such a case, the NameNode will route
around the failed DataNode and begin re-replicating the now-missing blocks.
Because the file system is redundant, DataNodes can be taken offline
(decommissioned) for maintenance by informing the NameNode of the
DataNodes to exclude from the HDFS pool.
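As a brief sketch of the mechanism (Chapter 10 covers the full procedure), decommissioning typically involves adding the DataNode host name to the exclude file referenced by the dfs.hosts.exclude property in hdfs-site.xml and then telling the NameNode to reread its host lists:
$ hdfs dfsadmin -refreshNodes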
The mappings between data blocks and the physical DataNodes are not kept in
persistent storage on the NameNode. For performance reasons, the NameNode
stores all metadata in memory. Upon startup, each DataNode provides a block
report (which it keeps in persistent storage) to the NameNode. The block reports
are sent every 10 heartbeats. (The interval between reports is a configurable
property.) The reports enable the NameNode to keep an up-to-date account of all
data blocks in the cluster.
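One way to see this block-to-DataNode mapping for yourself is the hdfs fsck tool (also discussed in Chapter 10), which reports the blocks and their locations for a given path; the path below is only an example:
$ hdfs fsck /user/hdfs/war-and-peace-input -files -blocks -locations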
In almost all Hadoop deployments, there is a SecondaryNameNode. While not
explicitly required by a NameNode, it is highly recommended. The term
“SecondaryNameNode” (now called CheckPointNode) is somewhat misleading.
It is not an active failover node and cannot replace the primary NameNode in
case of its failure. (See the section “NameNode High Availability” later in this
chapter for more explanation.)
The purpose of the SecondaryNameNode is to perform periodic checkpoints
that evaluate the status of the NameNode. Recall that the NameNode keeps all
system metadata in memory for fast access. It also has two disk files that track
changes to the metadata:
An image of the file system state when the NameNode was started. This
file begins with fsimage_* and is used only at startup by the
NameNode.
A series of modifications done to the file system after starting the
NameNode. These files begin with edits_* and reflect the changes made
after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property
in the hdfs-site.xml file.
The SecondaryNameNode periodically downloads fsimage and edits files,
joins them into a new fsimage, and uploads the new fsimage file to the
NameNode. Thus, when the NameNode restarts, the fsimage file is reasonably
up-to-date and requires only the edit logs to be applied since the last checkpoint.
If the SecondaryNameNode were not running, a restart of the NameNode could
take a prohibitively long time due to the number of changes to the file system.
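For reference, a copy of the most recent fsimage can also be pulled from a running NameNode with the dfsadmin tool, which can be handy for simple metadata backups; the destination directory below is arbitrary:
$ hdfs dfsadmin -fetchImage /tmp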
Thus, the various roles in HDFS can be summarized as follows:
HDFS uses a master/slave model designed for large file reading/streaming.
The NameNode is a metadata server or “data traffic cop.”
HDFS provides a single namespace that is managed by the NameNode.
Data is redundantly stored on DataNodes; there is no data on the
NameNode.
The SecondaryNameNode performs checkpoints of NameNode file
system’s state but is not a failover node.
HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using
the hdfs dfs -createSnapshot command. HDFS snapshots are read-only point-in-
time copies of the file system. They offer the following features:
Snapshots can be taken of a sub-tree of the file system or the entire file
system.
Snapshots can be used for data backup, protection against user errors, and
disaster recovery.
Snapshot creation is instantaneous.
Blocks on the DataNodes are not copied, because the snapshot files record
the block list and the file size. There is no data copying, although it
appears to the user that there are duplicate files.
Snapshots do not adversely affect regular HDFS operations.
See Chapter 10, “Basic Hadoop Administration Procedures,” for information
on creating HDFS snapshots.
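As a brief preview of the procedure covered in Chapter 10, a directory must first be marked as snapshottable by an administrator; a snapshot can then be created and browsed under a read-only .snapshot path. The directory and snapshot names below are only examples:
# su - hdfs
$ hdfs dfsadmin -allowSnapshot /user/hdfs/war-and-peace-input
$ hdfs dfs -createSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
$ hdfs dfs -ls /user/hdfs/war-and-peace-input/.snapshot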
Found 10 items
drwxrwxrwx - yarn hadoop 0 2015-04-29 16:52 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:28 /apps
drwxr-xr-x - hdfs hdfs 0 2015-05-14 10:53 /benchmarks
drwxr-xr-x - hdfs hdfs 0 2015-04-21 15:18 /hdp
drwxr-xr-x - mapred hdfs 0 2015-04-21 14:26 /mapred
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:26 /mr-history
drwxr-xr-x - hdfs hdfs 0 2015-04-21 14:27 /system
drwxrwxrwx - hdfs hdfs 0 2015-05-07 13:29 /tmp
drwxr-xr-x - hdfs hdfs 0 2015-04-27 16:00 /user
drwx-wx-wx - hdfs hdfs 0 2015-05-27 09:01 /var
Found 13 items
drwx------ - hdfs hdfs 0 2015-05-27 20:00 .Trash
drwx------ - hdfs hdfs 0 2015-05-26 15:43 .staging
drwxr-xr-x - hdfs hdfs 0 2015-05-28 13:03 DistributedShell
drwxr-xr-x - hdfs hdfs 0 2015-05-14 09:19 TeraGen-50GB
drwxr-xr-x - hdfs hdfs 0 2015-05-14 10:11 TeraSort-50GB
drwxr-xr-x - hdfs hdfs 0 2015-05-24 20:06 bin
drwxr-xr-x - hdfs hdfs 0 2015-04-29 16:52 examples
drwxr-xr-x - hdfs hdfs 0 2015-04-27 16:00 flume-channel
drwxr-xr-x - hdfs hdfs 0 2015-04-29 14:33 oozie-4.1.0
drwxr-xr-x - hdfs hdfs 0 2015-04-30 10:35 oozie-examples
drwxr-xr-x - hdfs hdfs 0 2015-04-29 20:35 oozie-oozi
drwxr-xr-x - hdfs hdfs 0 2015-05-24 18:11 war-and-peace-input
drwxr-xr-x - hdfs hdfs 0 2015-05-25 15:22 war-and-peace-output
Found 1 items
-rw-r--r-- 2 hdfs hdfs 12857 2015-05-29 13:12 stuff/test
Deleted stuff/test
-------------------------------------------------
report: Access denied for user deadline. Superuser privilege is
required
package org.myorg;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Next, compile the program using `hadoop classpath` (backtick command substitution) to ensure all the
class paths are available:
$ javac -cp `hadoop classpath` -d HDFSClient-classes HDFSClient.java
A simple file copy from the local system to HDFS can then be accomplished
with the compiled application.
The file can be seen in HDFS by using the hdfs dfs -ls command:
$ hdfs dfs -ls NOTES.txt
-rw-r--r-- 2 hdfs hdfs 502 2015-06-03 15:43 NOTES.txt
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "hdfs.h"
The example can be built using the following steps. The following software
environment is assumed:
Operating system: Linux
Platform: RHEL 6.6
Hortonworks HDP 2.2 with Hadoop Version: 2.6
The first step loads the Hadoop environment paths. In particular, the
$HADOOP_LIB path is needed for the compiler.
$ . /etc/hadoop/conf/hadoop-env.sh
The program is compiled using gcc and the following command line. In
addition to $HADOOP_LIB, the $JAVA_HOME path is assumed to be in the
local environment. If the compiler issues errors or warnings, confirm that all
paths are correct for the Hadoop and Java environment.
$ gcc hdfs-simple-test.c -I$HADOOP_LIB/include -I$JAVA_HOME/include \
  -L$HADOOP_LIB/lib -L$JAVA_HOME/jre/lib/amd64/server -ljvm -lhdfs \
  -o hdfs-simple-test
The location of the run-time library path needs to be set with the following
command:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_LIB/lib
The Hadoop class path needs to be set with the following command. The --glob
option is required because Hadoop version 2 uses a wildcard syntax in the
output of the hadoop classpath command. Hadoop version 1 used the full
path to every jar file without wildcards. Unfortunately, Java does not expand the
wildcards automatically when launching an embedded JVM via JNI, so older
scripts may not work. The --glob option expands the wildcards.
$ export CLASSPATH=`hadoop classpath --glob`
The program can be run using the following command. There may be some warnings
that can be ignored.
$ ./hdfs-simple-test
The new file contents can be inspected using the hdfs dfs -cat
command:
$ hdfs dfs -cat /tmp/testfile.txt
Hello, World!
In This Chapter:
The steps needed to run the Hadoop MapReduce examples are provided.
An overview of the YARN ResourceManager web GUI is presented.
The steps needed to run two important benchmarks are provided.
The mapred command is introduced as a way to list and kill MapReduce
jobs.
When using new or updated hardware or software, simple examples and
benchmarks help confirm proper operation. Apache Hadoop includes many
examples and benchmarks to aid in this task. This chapter provides instructions
on how to run, monitor, and manage some basic MapReduce examples and
benchmarks.
Once you define the examples path, you can run the Hadoop examples using
the commands discussed in the following sections.
Note
In previous versions of Hadoop, the command hadoop jar . . .
was used to run MapReduce programs. Newer versions provide the yarn
command, which offers more capabilities. Both commands will work for
these examples.
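For instance, the pi example whose output appears below can be launched with
either command. A typical invocation, assuming $HADOOP_EXAMPLES points to the
directory containing hadoop-mapreduce-examples.jar (as it does elsewhere in
this chapter), is:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 1000000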
If the program runs correctly, you should see output similar to the following.
(Some of the Hadoop INFO messages have been removed for clarity.)
Number of Maps = 16
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
...
15/05/13 20:10:30 INFO mapreduce.Job: map 0% reduce 0%
15/05/13 20:10:37 INFO mapreduce.Job: map 19% reduce 0%
15/05/13 20:10:39 INFO mapreduce.Job: map 50% reduce 0%
15/05/13 20:10:46 INFO mapreduce.Job: map 56% reduce 0%
15/05/13 20:10:47 INFO mapreduce.Job: map 94% reduce 0%
15/05/13 20:10:48 INFO mapreduce.Job: map 100% reduce 100%
15/05/13 20:10:48 INFO mapreduce.Job: Job job_1429912013449_0047
completed
successfully
15/05/13 20:10:48 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=358
FILE: Number of bytes written=1949395
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4198
HDFS: Number of bytes written=215
HDFS: Number of read operations=67
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=16
Launched reduce tasks=1
Data-local map tasks=16
Total time spent by all maps in occupied slots (ms)=158378
Total time spent by all reduces in occupied slots (ms)=8462
Total time spent by all map tasks (ms)=158378
Total time spent by all reduce tasks (ms)=8462
Total vcore-seconds taken by all map tasks=158378
Total vcore-seconds taken by all reduce tasks=8462
Total megabyte-seconds taken by all map tasks=243268608
Total megabyte-seconds taken by all reduce tasks=12997632
Map-Reduce Framework
Map input records=16
Map output records=32
Map output bytes=288
Map output materialized bytes=448
Input split bytes=2310
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=448
Reduce input records=32
Reduce output records=0
Spilled Records=64
Shuffled Maps=16
Failed Shuffles=0
Merged Map outputs=16
GC time elapsed (ms)=1842
CPU time spent (ms)=11420
Physical memory (bytes) snapshot=13405769728
Virtual memory (bytes) snapshot=33911930880
Total committed heap usage (bytes)=17026777088
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1888
File Output Format Counters
Bytes Written=97
Job Finished in 23.718 seconds
Estimated value of Pi is 3.14159125000000000000
Notice that the MapReduce progress is shown in the same way as Hadoop
version 1, but the application statistics are different. Most of the statistics are
self-explanatory. The one important item to note is that the YARN MapReduce
framework is used to run the program. (See Chapter 1, “Background and
Concepts,” and Chapter 8, “Hadoop YARN Applications,” for more information
about YARN frameworks.)
Figure 4.1 Hadoop RUNNING Applications web GUI for the pi example
For those readers who have used or read about Hadoop version 1, if you look
at the Cluster Metrics table, you will see some new information. First, you will
notice that the “Map/Reduce Task Capacity” has been replaced by the number of
running containers. If YARN is running a MapReduce job, these containers can
be used for both map and reduce tasks. Unlike in Hadoop version 1, the number
of mappers and reducers is not fixed. There are also memory metrics and links to
node status. If you click on the Nodes link (left menu under About), you can get
a summary of the node activity and state. For example, Figure 4.2 is a snapshot
of the node activity while the pi application is running. Notice the number of
containers, which are used by the MapReduce framework as either mappers or
reducers.
Figure 4.2 Hadoop YARN ResourceManager nodes status window
Going back to the main Applications/Running window (Figure 4.1), if you
click on the application_14299... link, the Application status window in
Figure 4.3 will appear. This window provides an application overview and
metrics, including the cluster node on which the ApplicationMaster container is
running.
To report results, the time for the actual sort (terasort) is measured and the
benchmark rate in megabytes/second (MB/s) is calculated. For best performance,
the actual terasort benchmark should be run with a replication factor of 1. In
addition, the default number of terasort reducer tasks is set to 1. Increasing
the number of reducers often helps with benchmark performance. For example,
the following command will instruct terasort to use four reducer tasks:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort \
  -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
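For reference, the 50GB input directory used above is created with the teragen
example, which writes 100-byte rows. A sketch that produces 50GB
(500,000,000 rows) into the same directory is:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen \
  500000000 /user/hdfs/TeraGen-50GB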
Also, do not forget to clean up the terasort data between runs (and after
testing is finished). The following command will perform the cleanup for the
previous example:
$ hdfs dfs -rm -r -skipTrash Tera*
Example results are as follows (date and time prefix removed). The large
standard deviation is due to the placement of tasks on a small four-node
cluster.
fs.TestDFSIO: ----- TestDFSIO ----- : read
fs.TestDFSIO: Date & time: Thu May 14 10:44:09 EDT 2015
fs.TestDFSIO: Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO: Throughput mb/sec: 32.38643494172466
fs.TestDFSIO: Average IO rate mb/sec: 58.72880554199219
fs.TestDFSIO: IO rate std deviation: 64.60017624360337
fs.TestDFSIO: Test exec time sec: 62.798
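The commands that produce results like these are not shown here. A sketch of
the write, read, and cleanup runs, assuming $HADOOP_JOBCLIENT_TESTS is set to
the hadoop-mapreduce-client-jobclient tests jar that contains TestDFSIO (a
hypothetical variable used only for this sketch), is:
$ yarn jar $HADOOP_JOBCLIENT_TESTS TestDFSIO -write -nrFiles 16 -fileSize 1000
$ yarn jar $HADOOP_JOBCLIENT_TESTS TestDFSIO -read -nrFiles 16 -fileSize 1000
$ yarn jar $HADOOP_JOBCLIENT_TESTS TestDFSIO -clean
The -nrFiles 16 and -fileSize 1000 (MB) settings match the 16 files and
16,000 MBytes processed in the report above.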
In This Chapter:
The MapReduce computation model is presented using simple examples.
The Apache Hadoop MapReduce framework data flow is explained.
MapReduce fault tolerance, speculative execution, and hardware are
discussed.
The MapReduce programming model is conceptually simple. Based on two
simple steps—applying a mapping process and then reducing
(condensing/collecting) the results—it can be applied to many real-world
problems. In this chapter, we examine the MapReduce process using basic
command-line tools. We then expand this concept into a parallel MapReduce
model.
Though not strictly a MapReduce process, this idea is quite similar to and
much faster than the manual process of counting the instances of Kutuzov in the
printed book. The analogy can be taken a bit further by using the two simple
(and naive) shell scripts shown in Listing 5.1 and Listing 5.2. The shell scripts
are available from the book download page (see Appendix A). We can perform
the same operation (much more slowly) and tokenize both the Kutuzov and
Petersburg strings in the text:
$ cat war-and-peace.txt |./mapper.sh |./reducer.sh
Kutuzov,315
Petersburg,128
Notice that more instances of Kutuzov have been found (the first grep
command ignored instances like “Kutuzov.” or “Kutuzov,”). The mapper
inputs a text file and then outputs data in a (key, value) pair (token-name, count)
format. Strictly speaking, the input to the script is the file and the keys are
Kutuzov and Petersburg. The reducer script takes these key–value pairs
and combines the similar tokens and counts the total number of instances. The
result is a new key–value pair (token-name, sum).
#!/bin/bash
# mapper.sh: emit a "<token>,1" pair for each Kutuzov or Petersburg token
while read line ; do
for token in $line; do
if [ "$token" = "Kutuzov" ] ; then
echo "Kutuzov,1"
elif [ "$token" = "Petersburg" ] ; then
echo "Petersburg,1"
fi
done
done
#!/bin/bash
# reducer.sh: sum the "<token>,1" pairs emitted by mapper.sh
kcount=0
pcount=0
while read line ; do
if [ "$line" = "Kutuzov,1" ] ; then
let kcount=kcount+1
elif [ "$line" = "Petersburg,1" ] ; then
let pcount=pcount+1
fi
done
echo "Kutuzov,$kcount"
echo "Petersburg,$pcount"
The reducer function is then applied to each key–value pair, which in turn
produces a collection of values in the same domain:
Reduce(key2, list (value2)) → list(value3)
Each reducer call typically produces either one value (value3) or an empty
response. Thus, the MapReduce framework transforms a list of (key, value) pairs
into a list of values.
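For reference, the corresponding mapping function that precedes this step takes
a (key, value) pair and produces a list of intermediate pairs in a new domain:
Map(key1, value1) → list(key2, value2)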
The MapReduce model is inspired by the map and reduce functions
commonly used in many functional programming languages. The functional
nature of MapReduce has some important properties:
Data flow is in one direction (map to reduce). It is possible to use the
output of a reduce step as the input to another MapReduce process.
As with functional programming, the input data are not changed. By
applying the mapping and reduction functions to the input data, new data
are produced. In effect, the original state of the Hadoop data lake is always
preserved (see Chapter 1, “Background and Concepts”).
Because there is no dependency on how the mapping and reducing
functions are applied to the data, the mapper and reducer data flow can be
implemented in any number of ways to provide better performance.
Distributed (parallel) implementations of MapReduce enable large amounts of
data to be analyzed quickly. In general, the mapper process is fully scalable and
can be applied to any subset of the input data. Results from multiple parallel
mapping functions are then combined in the reducer phase.
As mentioned in Chapter 1, Hadoop accomplishes parallelism by using a
distributed file system (HDFS) to slice and spread data over multiple servers.
Apache Hadoop MapReduce will try to move the mapping tasks to the server
that contains the data slice. Results from each data slice are then combined in the
reducer step. This process is explained in more detail in the next section.
HDFS is not required for Hadoop MapReduce, however. A sufficiently fast
parallel file system can be used in its place. In these designs, each server in the
cluster has access to a high-performance parallel file system that can rapidly
provide any data slice. These designs are typically more expensive than the
commodity servers used for many Hadoop clusters.
The first thing MapReduce will do is create the data splits. For simplicity,
each line will be one split. Since each split will require a map task, there are
three mapper processes that count the number of words in the split. On a cluster,
the results of each map task are written to local disk and not to HDFS. Next,
similar keys need to be collected and sent to a reducer process. The shuffle step
requires data movement and can be expensive in terms of processing time.
Depending on the nature of the application, the amount of data that must be
shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can
begin (even if only partial results are available). It is not necessary—and not
normally recommended—to have a reducer for each key–value pair as shown in
Figure 5.1. In some cases, a single reducer will provide adequate performance;
in other cases, multiple reducers may be required to speed up the reduce phase.
The number of reducers is a tunable option for many applications. The final step
is to write the output to HDFS.
As mentioned, a combiner step enables some pre-reduction of the map output
data. For instance, in the previous example, one map produced the following
counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure 5.2, the count for run can be combined into (run,2)
before the shuffle. This optimization can help minimize the amount of data
transfer needed for the shuffle phase.
Speculative Execution
One of the challenges with many large clusters is the inability to predict or
manage unexpected system bottlenecks or failures. In theory, it is possible to
control and monitor resources so that network traffic and processor load can be
evenly balanced; in practice, however, this problem represents a difficult
challenge for large systems. Thus, it is possible that a congested network, slow
disk controller, failing disk, high processor load, or some other similar problem
might lead to slow performance without anyone noticing.
When one part of a MapReduce process runs slowly, it ultimately slows down
everything else because the application cannot complete until all processes are
finished. The nature of the parallel MapReduce model provides an interesting
solution to this problem. Recall that input data are immutable in the MapReduce
process. Therefore, it is possible to start a copy of a running map process without
disturbing any other running mapper processes. For example, suppose that as
most of the map tasks are coming to a close, the ApplicationMaster notices that
some are still running and schedules redundant copies of the remaining jobs on
less busy or free servers. Should the secondary processes finish first, the other
first processes are then terminated (or vice versa). This process is known as
speculative execution. The same approach can be applied to reducer processes
that seem to be taking a long time. Speculative execution can reduce cluster
efficiency because redundant resources are assigned to applications that seem to
have a slow spot. It can be turned on or off for map and reduce tasks with the
mapreduce.map.speculative and mapreduce.reduce.speculative properties in the
mapred-site.xml configuration file (see Chapter 9, “Managing Hadoop with
Apache Ambari”).
In This Chapter:
The classic Java WordCount program for Hadoop is compiled and run.
A Python WordCount application using the Hadoop streaming interface is
introduced.
The Hadoop Pipes interface is used to run a C++ version of WordCount.
An example of MapReduce chaining is presented using the Hadoop Grep
example.
Strategies for MapReduce debugging are presented.
At the base level, Hadoop provides a platform for Java-based MapReduce
programming. These applications run natively on most Hadoop installations. To
offer more variability, a streaming interface is provided that enables almost any
programming language to take advantage of the Hadoop MapReduce engine. In
addition, a pipes C++ interface is provided that can work directly with the
MapReduce components. This chapter provides programming examples of these
interfaces and presents some debugging strategies.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
The mapper implementation, via the map method, processes one line at a time
as provided by the specified TextInputFormat class. It then splits the line
into tokens separated by whitespaces using the StringTokenizer and emits
a key–value pair of <word, 1>. The relevant code section is as follows:
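The map method shown below is the one from the standard Apache Hadoop WordCount
example on which this program is based; word and one are the Text and
IntWritable(1) fields declared in the mapper class.
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  // split the line into whitespace-separated tokens and emit <word, 1>
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}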
Given two input files with contents Hello World Bye World and
Hello Hadoop Goodbye Hadoop, the WordCount mapper will produce
two map outputs. The first map produces
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
and the second map produces
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
WordCount also specifies a combiner
job.setCombinerClass(IntSumReducer.class);
and a reducer
job.setReducerClass(IntSumReducer.class);
Hence, the output of each map is passed through the local combiner (which sums
the values in the same way as the reducer) for local aggregation before being
sent on to the final reducer. After the combiner, the first map's output becomes
< Bye, 1>
< Hello, 1>
< World, 2>
and the second map's output becomes
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
The reducer implementation, via the reduce method, simply sums the values,
which are the occurrence counts for each key. The relevant code section is as
follows:
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
The source code for WordCount.java is available from the book download
page (see Appendix A, “Book Webpage and Code Download”). To compile and
run the program from the command line, perform the following steps:
1. Make a local wordcount_classes directory.
$ mkdir wordcount_classes
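Steps 2 and 3, which compile the source and package the classes into a jar, are
not shown here. A sketch of those steps, assuming the jar is named
wordcount.jar (a name chosen only for this sketch), is:
$ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .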
4. To run the example, create an input directory in HDFS and place a text file
in the new directory. For this example, we will use the war-and-
peace.txt file (available from the book download page; see Appendix
A):
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
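The job itself is then launched against the new input directory. A sketch,
assuming the wordcount.jar built above and a driver class named WordCount in
the default package, is:
$ yarn jar wordcount.jar WordCount war-and-peace-input war-and-peace-output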
If everything is working correctly, Hadoop messages for the job should look
like the following (abbreviated version):
15/05/24 18:13:26 INFO impl.TimelineClientImpl: Timeline service
address:
http://limulus:8188/ws/v1/timeline/
15/05/24 18:13:26 INFO client.RMProxy: Connecting to ResourceManager
at
limulus/10.0.0.1:8050
15/05/24 18:13:26 WARN mapreduce.JobSubmitter: Hadoop command-line
option parsing
not performed. Implement the Tool interface and execute your
application with
ToolRunner to remedy this.
15/05/24 18:13:26 INFO input.FileInputFormat: Total input paths to
process : 1
15/05/24 18:13:27 INFO mapreduce.JobSubmitter: number of splits:1
[...]
File Input Format Counters
Bytes Read=3288746
File Output Format Counters
Bytes Written=467839
The complete list of word counts can be copied from HDFS to the working
directory with the following command:
$ hdfs dfs -get war-and-peace-output/part-r-00000 .
If the WordCount program is run again using the same outputs, it will fail
when it tries to overwrite the war-and-peace-output directory. The output
directory and all contents can be removed with the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
#!/usr/bin/env python
# mapper.py
import sys
#!/usr/bin/env python
# reducer.py
current_word = None
current_count = 0
word = None
Piping the results of the map into the sort command can create a simulated
shuffle phase:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1
bar 1
foo 1
foo 1
foo 1
labs 1
quux 1
quux 1
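Piping the sorted map output into the reducer completes the simulated
MapReduce run. Assuming reducer.py is the usual streaming word-count reducer
that sums the counts for each key, the result looks like this:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar 1
foo 3
labs 1
quux 2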
Make sure the output directory is removed from any previous test runs:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
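The streaming job itself is launched with the hadoop-streaming jar. The jar
location varies by distribution; the path below assumes an HDP 2.2 layout, so
adjust it for your installation:
$ hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -file ./mapper.py -mapper ./mapper.py \
  -file ./reducer.py -reducer ./reducer.py \
  -input war-and-peace-input -output war-and-peace-output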
The output will be the familiar (_SUCCESS and part-00000) in the war-
and-peace-output directory. The actual file name may be slightly different
depending on your Hadoop version. Also note that the Python scripts used in this
example could be Bash, Perl, Tcl, Awk, compiled C code, or any language that
can read and write from stdin and stdout.
Although the streaming interface is rather simple, it does have some
disadvantages over using Java directly. In particular, not all applications are
string and character based, and it would be awkward to try to use stdin and
stdout as a way to transmit binary data. Another disadvantage is that many
tuning parameters available through the full Java Hadoop API are not available
in streaming.
#include <algorithm>
#include <limits>
#include <string>
#include "stdint.h" // <--- to prevent uint64_t errors!
#include "Pipes.hh"
#include "TemplateFactory.hh"
#include "StringUtils.hh"
The wordcount.cpp source is available from the book download page (see
Appendix A) or from http://wiki.apache.org/hadoop/C++WordCount. The
location of the Hadoop include files and libraries may need to be specified
when compiling the code. If $HADOOP_HOME is defined, the following options
should provide the correct path. Check to make sure the paths are correct for
your installation.
-L$HADOOP_HOME/lib/native/ -I$HADOOP_HOME/include
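A full compile line follows the same pattern. The sketch below assumes g++ and
the standard Hadoop pipes and utils libraries; additional libraries (such as
-lssl) may be required on some systems:
$ g++ wordcount.cpp -o wordcount \
  -L$HADOOP_HOME/lib/native/ -I$HADOOP_HOME/include \
  -lhadooppipes -lhadooputils -lpthread -lcrypto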
As mentioned, the executable must be placed into HDFS so YARN can find
the program. Also, the output directory must be removed before running the
program:
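A sketch of those commands, assuming the executable built above is named
wordcount and is placed in a bin directory under the user's HDFS home
directory (matching the -program bin/wordcount argument used below), is:
$ hdfs dfs -mkdir -p bin
$ hdfs dfs -put wordcount bin
$ hdfs dfs -rm -r -skipTrash war-and-peace-output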
To run the program, enter the following line (shown in multiple lines for
clarity). The lines specifying the recordreader and recordwriter
indicate that the default Java text versions should be used. Also note that the
location of the program in HDFS must be specified.
$ mapred pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input war-and-peace-input \
-output war-and-peace-output \
-program bin/wordcount
When run, the program will produce the familiar output (_SUCCESS and
part-00000) in the war-and-peace-output directory. The part-
00000 file should be identical to the Java WordCount version.
package org.apache.hadoop.examples;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
In the preceding code, each mapper of the first job takes a line as input and
matches the user-provided regular expression against the line. The
RegexMapper class is used to perform this task and extracts text matching
using the given regular expression. The matching strings are output as
<matching string, 1> pairs. As in the previous WordCount example, each reducer
sums up the number of matching strings and employs a combiner to do local
sums. The actual reducer uses the LongSumReducer class that outputs the
sum of long values per reducer input key.
The second job takes the output of the first job as its input. The mapper is an
inverse map that reverses (or swaps) its input <key, value> pairs into <value,
key>. There is no reduction step, so the IdentityReducer class is used by
default. All input is simply passed to the output. (Note: There is also an
IdentityMapper class.) The number of reducers is set to 1, so the output is
stored in one file and it is sorted by count in descending order. The output text
file contains a count and a string per line.
The example also demonstrates how to pass a command-line parameter to a
mapper or a reducer.
The following discussion describes how to compile and run the Grep.java
example. The steps are similar to the previous WordCount example:
1. Create a directory for the application classes as follows:
$ mkdir Grep_classes
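The compile and package steps mirror the WordCount example. A sketch, assuming
the jar is named Grep.jar (a name chosen only for this sketch), is:
$ javac -cp `hadoop classpath` -d Grep_classes Grep.java
$ jar -cvf Grep.jar -C Grep_classes/ .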
As always, make sure the output directory has been removed by issuing the
following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
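The chained Grep job can then be run from the new jar. A sketch using the
Kutuzov regular expression that produces the output shown below is:
$ yarn jar Grep.jar org.apache.hadoop.examples.Grep \
  war-and-peace-input war-and-peace-output Kutuzov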
As the example runs, two stages will be evident. Each stage is easily
recognizable in the program output. The results can be found by examining the
resultant output file.
$ hdfs dfs -cat war-and-peace-output/part-r-00000
530 Kutuzov
Debugging MapReduce
The best advice for debugging parallel MapReduce applications is this: Don’t.
The key word here is parallel. Debugging on a distributed system is hard and
should be avoided at all costs.
The best approach is to make sure applications run on a simpler system (i.e.,
the HDP Sandbox or the pseudo-distributed single-machine install) with smaller
data sets. Errors on these systems are much easier to locate and track. In
addition, unit testing applications before running at scale is important. If
applications can run successfully on a single system with a subset of real data,
then running in parallel should be a simple task because the MapReduce
algorithm is transparently scalable. Note that many higher-level tools (e.g., Pig
and Hive) enable local mode development for this reason. Should errors occur at
scale, the issue can be tracked from the log file (see the section “Hadoop Log
Management”) and may stem from a systems issue rather than a program
artifact.
When investigating program behavior at scale, the best approach is to use the
application logs to inspect the actual MapReduce progress. The time-tested
debug print statements are also visible in the logs.
Figure 6.1 Log information for map process (stdout, stderr, and syslog)
If log aggregation is not enabled, the logs will be placed locally on the cluster
nodes on which the mapper or reducer ran. The location of the unaggregated
local logs is given by the yarn.nodemanager.log-dirs property in the
yarn-site.xml file. Without log aggregation, the cluster nodes used by the
job must be noted, and then the log files must be obtained directly from the
nodes. Log aggregation is highly recommended.
Note
Log aggregation is disabled in the pseudo-distributed installation
presented in Chapter 2.
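Step 1, which is not shown here, creates the remote aggregation directory in
HDFS and gives it appropriate ownership. A sketch using the /yarn/logs path
referenced below (run as the HDFS superuser) is:
$ hdfs dfs -mkdir -p /yarn/logs
$ hdfs dfs -chown -R yarn:hadoop /yarn/logs
$ hdfs dfs -chmod -R g+rw /yarn/logs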
2. Add the following properties in the yarn-site.xml (on all nodes) and
restart all YARN services on all nodes (the ResourceManager,
NodeManagers, and JobHistoryServer).
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/yarn/logs</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
For example, after running the pi example program (discussed in Chapter 4),
the logs can be examined as follows:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 100000
After the pi example completes, note the applicationId, which can be found
either from the application output or by using the yarn application
command. The applicationId will start with application_ and appear under
the Application-Id column.
$ yarn application -list -appStates FINISHED
Next, run the following command to produce a dump of all the logs for that
application. Note that the output can be long and is best saved to a file.
$ yarn logs -applicationId application_1432667013445_0001 > AppOut
The AppOut file can be inspected using a text editor. Note that for each
container, stdout, stderr, and syslog are provided (the same as the GUI
version in Figure 6.1). The list of actual containers can be found by using the
following command:
$ grep -B 1 ===== AppOut
For example (output truncated):
[...]
Container: container_1432667013445_0001_01_000008 on limulus_45454
====================================================================
--
Container: container_1432667013445_0001_01_000010 on limulus_45454
====================================================================
--
Container: container_1432667013445_0001_01_000001 on n0_45454
===============================================================
--
Container: container_1432667013445_0001_01_000023 on n1_45454
===============================================================
[...]
In This Chapter:
The Pig scripting tool is introduced as a way to quickly examine data both
locally and on a Hadoop cluster.
The Hive SQL-like query tool is explained using two examples.
The Sqoop RDBMS tool is used to import and export data from MySQL
to/from HDFS.
The Flume streaming data transport utility is configured to capture weblog
data into HDFS.
The Oozie workflow manager is used to run basic and complex Hadoop
workflows.
The distributed HBase database is used to store and access data on a
Hadoop cluster.
The Hadoop ecosystem offers many tools to help with data input, high-level
processing, workflow management, and creation of huge databases. Each tool is
managed as a separate Apache Software Foundation project, but is designed to
operate with the core Hadoop services including HDFS, YARN, and
MapReduce. Background on each tool is provided in this chapter, along with a
start to finish example.
Next, copy the data file into HDFS for Hadoop MapReduce operation:
$ hdfs dfs -put passwd passwd
You can confirm the file is in HDFS by entering the following command:
$ hdfs dfs -ls passwd
-rw-r--r-- 2 hdfs hdfs 2526 2015-03-17 11:08 passwd
In the following example of local Pig operation, all processing is done on the
local machine (Hadoop is not used). First, the interactive command line is
started:
$ pig -x local
If Pig starts correctly, you will see a grunt> prompt. You may also see a
bunch of INFO messages, which you can ignore. Next, enter the following
commands to load the passwd file and then grab the user name and dump it to
the terminal. Note that Pig commands must end with a semicolon (;).
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
The processing will start and a list of user names will be printed to the screen.
To exit the interactive session, enter the command quit.
grunt> quit
To use Hadoop MapReduce, start Pig as follows (or just enter pig):
$ pig -x mapreduce
The same sequence of commands can be entered at the grunt> prompt. You
may wish to change the $0 argument to pull out other items in the passwd file.
In the case of this simple script, you will notice that the MapReduce version
takes much longer. Also, because we are running this application under Hadoop,
make sure the file is placed in HDFS.
If you are using the Hortonworks HDP distribution with tez installed, the
tez engine can be used as follows:
$ pig -x tez
Pig can also be run from a script. An example script (id.pig) is available
from the example code download (see Appendix A, “Book Webpage and Code
Download”). This script, which is repeated here, is designed to do the same
things as the interactive version:
/* id.pig */
A = load 'passwd' using PigStorage(':'); -- load the passwd file
B = foreach A generate $0 as id; -- extract the user IDs
dump B;
store B into 'id.out'; -- write the results to a directory named id.out
Comments are delineated by /* */ and -- at the end of a line. The script
will create a directory called id.out for the results. First, ensure that the
id.out directory is not in your local directory, and then start Pig with the script
on the command line:
$ /bin/rm -r id.out/
$ pig -x local id.pig
If the script worked correctly, you should see at least one data file with the
results and a zero-length file with the name _SUCCESS. To run the MapReduce
version, use the same procedure; the only difference is that now all reading and
writing takes place in HDFS.
$ hdfs dfs -rm -r id.out
$ pig id.pig
If Apache tez is installed, you can run the example script using the -x tez
option. You can learn more about writing Pig scripts at
http://pig.apache.org/docs/r0.14.0/start.html.
As a simple test, create and drop a table. Note that Hive commands must end
with a semicolon (;).
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 1.705 seconds
hive> SHOW TABLES;
OK
pokes
Time taken: 0.174 seconds, Fetched: 1 row(s)
hive> DROP TABLE pokes;
OK
Time taken: 4.038 seconds
A more detailed example can be developed using a web server log file to
summarize message types. First, create a table using the following command:
Click here to view code image
Finally, apply the select step to the file. Note that this invokes a Hadoop
MapReduce operation. The results appear at the end of the output (e.g., totals for
the message types DEBUG, ERROR, and so on).
hive> SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%'
GROUP BY t4;
Query ID = hdfs_20150327130000_d1e1a265-a5d7-4ed8-b785-2c6569791368
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size:
1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1427397392757_0001, Tracking URL =
http://norbert:8088/proxy/
application_1427397392757_0001/
Kill Command = /opt/hadoop-2.6.0/bin/hadoop job -kill
job_1427397392757_0001
Hadoop job information for Stage-1: number of mappers: 1; number of
reducers: 1
2015-03-27 13:00:17,399 Stage-1 map = 0%, reduce = 0%
2015-03-27 13:00:26,100 Stage-1 map = 100%, reduce = 0%, Cumulative
CPU 2.14 sec
2015-03-27 13:00:34,979 Stage-1 map = 100%, reduce = 100%, Cumulative
CPU 4.07 sec
MapReduce Total cumulative CPU time: 4 seconds 70 msec
Ended Job = job_1427397392757_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.07 sec HDFS Read:
106384
HDFS Write: 63 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 70 msec
OK
[DEBUG] 434
[ERROR] 3
[FATAL] 1
[INFO] 96
[TRACE] 816
[WARN] 4
Time taken: 32.624 seconds, Fetched: 6 row(s)
Load the movie data into the table with the following command:
hive> LOAD DATA LOCAL INPATH './u.data' OVERWRITE INTO TABLE u_data;
The number of rows in the table can be reported by entering the following
command:
hive> SELECT COUNT(*) FROM u_data;
This command will start a single MapReduce job and should finish with the
following lines:
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.26 sec HDFS Read:
1979380
HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
100000
Time taken: 28.366 seconds, Fetched: 1 row(s)
Now that the table data are loaded, use the following command to make the
new table (u_data_new):
hive> CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
The next command adds the weekday_mapper.py to Hive resources:
hive> add FILE weekday_mapper.py;
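The transformation step itself is not shown here. It follows the standard Hive
TRANSFORM pattern for this MovieLens example, assuming the original u_data
table has the columns (userid, movieid, rating, unixtime):
hive> INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;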
If the transformation was successful, the following final portion of the output
should be displayed:
...
Table default.u_data_new stats: [numFiles=1, numRows=100000,
totalSize=1179173,
rawDataSize=1079173]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.44 sec HDFS Read: 1979380
HDFS Write:
1179256 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 440 msec
OK
Time taken: 24.06 seconds
The final query will sort and group the reviews by weekday:
hive> SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
Final output for the review counts by weekday should look like the following:
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.39 sec HDFS Read:
1179386
HDFS Write: 56 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 390 msec
OK
1 13278
2 14816
3 15426
4 13774
5 17964
6 12318
7 12424
Time taken: 22.645 seconds, Fetched: 7 row(s)
As shown previously, you can remove the tables used in this example with the
DROP TABLE command. In this case, we are also using the -e command-line
option. Note that queries can be loaded from files using the -f option as well.
$ hive -e 'drop table u_data_new'
$ hive -e 'drop table u_data'
Figure 7.1 Two-step Apache Sqoop data import method (Adapted from
Apache Sqoop Documentation)
The imported data are saved in an HDFS directory. Sqoop will use the
database name for the directory, or the user can specify any alternative directory
where the files should be populated. By default, these files contain comma-
delimited fields, with new lines separating different records. You can easily
override the format in which data are copied over by explicitly specifying the
field separator and record terminator characters. Once placed in HDFS, the data
are ready for processing.
Data export from the cluster works in a similar fashion. The export is done in
two steps, as shown in Figure 7.2. As in the import process, the first step is to
examine the database for metadata. The export step again uses a map-only
Hadoop job to write the data to the database. Sqoop divides the input data set
into splits, then uses individual map tasks to push the splits to the database.
Again, this process assumes the map tasks have access to the database.
Figure 7.2 Two-step Sqoop data export method (Adapted from Apache Sqoop
Documentation)
For this example, we will use the world example database from the MySQL
site (http://dev.mysql.com/doc/world-setup/en/index.html). This database has
three tables:
Country: information about countries of the world
City: information about some of the cities in those countries
CountryLanguage: languages spoken in each country
To get the database, use wget to download and then extract the file:
$ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
$ gunzip world_innodb.sql.gz
Next, log into MySQL (assumes you have privileges to create a database) and
import the desired database by following these steps:
$ mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City |
| Country |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
The following MySQL command will let you see the table details (output
omitted for clarity):
mysql> SHOW CREATE TABLE Country;
mysql> SHOW CREATE TABLE City;
mysql> SHOW CREATE TABLE CountryLanguage;
Step 2: Add Sqoop User Permissions for the Local Machine and Cluster
In MySQL, add the following privileges for user sqoop to MySQL. Note that
you must use both the local host name and the cluster subnet for Sqoop to work
properly. Also, for the purposes of this example, the sqoop password is sqoop.
mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'limulus'
IDENTIFIED BY 'sqoop';
mysql> GRANT ALL PRIVILEGES ON world.* To 'sqoop'@'10.0.0.%'
IDENTIFIED BY 'sqoop';
mysql> quit
In a similar fashion, you can use Sqoop to connect to MySQL and list the
tables in the world database:
sqoop list-tables --connect jdbc:mysql://limulus/world --username
sqoop --password sqoop
...
14/08/18 14:39:43 INFO sqoop.Sqoop: Running Sqoop version:
1.4.4.2.1.2.1-471
14/08/18 14:39:43 WARN tool.BaseSqoopTool: Setting your password on
the
command-line is insecure. Consider using -P instead.
14/08/18 14:39:43 INFO manager.MySQLManager: Preparing to use a MySQL
streaming
resultset.
City
Country
CountryLanguage
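The import of the Country table, which produces the file examined next, is not
shown here. A sketch using the same connection options and the target directory
that appears below is:
$ sqoop import --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop \
  --table Country -m 1 --target-dir /user/hdfs/sqoop-mysql-import/country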
The file can be viewed using the hdfs dfs -cat command:
$ hdfs dfs -cat sqoop-mysql-import/country/part-m-00000
ABW,Aruba,North
America,Caribbean,193.0,null,103000,78.4,828.0,793.0,Aruba,
Nonmetropolitan
Territory of The Netherlands,Beatrix,129,AW
...
ZWE,Zimbabwe,Africa,Eastern
Africa,390757.0,1980,11669000,37.8,5951.0,8670.0,
Zimbabwe,
Republic,Robert G. Mugabe,4068,ZW
To make the Sqoop command more convenient, you can create an options file
and use it on the command line. Such a file enables you to avoid having to
rewrite the same options. For example, a file called world-options.txt
with the following contents will include the import command, --connect, -
-username, and --password options:
import
--connect
jdbc:mysql://limulus/world
--username
sqoop
--password
sqoop
The same import command can be performed with the following shorter line:
Click here to view code image
It is also possible to include an SQL Query in the import step. For example,
suppose we want just cities in Canada:
SELECT ID,Name from City WHERE CountryCode='CAN'
In such a case, we can include the --query option in the Sqoop import request.
The --query option also needs a variable called $CONDITIONS, which will
be explained next. In the following query example, a single mapper task is
designated with the -m 1 option:
sqoop --options-file world-options.txt -m 1 --target-dir
/user/hdfs/sqoop-mysql-import/canada-city --query "SELECT ID,Name
from City WHERE CountryCode='CAN' AND \$CONDITIONS"
Inspecting the results confirms that only cities from Canada have been imported:
$ hdfs dfs -cat sqoop-mysql-import/canada-city/part-m-00000
1810,Montréal
1811,Calgary
1812,Toronto
...
1856,Sudbury
1857,Kelowna
1858,Barrie
Since there was only one mapper process, only one copy of the query needed to
be run on the database. The results are also reported in a single file
(part-m-00000).
Multiple mappers can be used to process the query if the --split-by
option is used. The split-by option is used to parallelize the SQL query. Each
parallel task runs a subset of the main query, with the results of each sub-query
being partitioned by bounding conditions inferred by Sqoop. Your query must
include the token $CONDITIONS that each Sqoop process will replace with a
unique condition expression based on the --split-by option. Note that
$CONDITIONS is not an environment variable. Although Sqoop will try to
create balanced sub-queries based on the range of your primary key, it may be
necessary to split on another column if your primary key is not uniformly
distributed.
The following example illustrates the use of the --split-by option. First,
remove the results of the previous query:
$ hdfs dfs -rm -r -skipTrash sqoop-mysql-import/canada-city
Next, run the query using four mappers (-m 4), where we split by the ID
number (--split-by ID):
sqoop --options-file world-options.txt -m 4 --target-dir
/user/hdfs/sqoop-mysql-import/canada-city --query "SELECT ID,Name
from City WHERE CountryCode='CAN' AND \$CONDITIONS" --split-by ID
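The export step that fills the CityExport table is not included here. A sketch
of a typical Sqoop export, assuming the City data were previously imported
under sqoop-mysql-import/city and that an empty CityExport table with the same
schema already exists in MySQL, is:
$ sqoop export --connect jdbc:mysql://limulus/world \
  --username sqoop --password sqoop \
  --table CityExport --export-dir /user/hdfs/sqoop-mysql-import/city -m 4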
Finally, to make sure everything worked correctly, check the table in MySQL
to see if the cities are in the table:
mysql> select * from CityExport limit 10;
+----+----------------+-------------+---------------+------------+
| ID | Name | CountryCode | District | Population |
+----+----------------+-------------+---------------+------------+
| 1 | Kabul | AFG | Kabol | 1780000 |
| 2 | Qandahar | AFG | Qandahar | 237500 |
| 3 | Herat | AFG | Herat | 186800 |
| 4 | Mazar-e-Sharif | AFG | Balkh | 127800 |
| 5 | Amsterdam | NLD | Noord-Holland | 731200 |
| 6 | Rotterdam | NLD | Zuid-Holland | 593321 |
| 7 | Haag | NLD | Zuid-Holland | 440900 |
| 8 | Utrecht | NLD | Utrecht | 234323 |
| 9 | Eindhoven | NLD | Noord-Brabant | 201843 |
| 10 | Tilburg | NLD | Noord-Brabant | 193238 |
+----+----------------+-------------+---------------+------------+
10 rows in set (0.00 sec)
The following examples will also require some configuration files. See
Appendix A for download instructions.
If Flume is working correctly, the window where the Flume agent was started
will show the testing message entered in the telnet window:
14/08/14 16:20:58 INFO sink.LoggerSink: Event: { headers:{} body: 74
65 73 74 69
6E 67 20 20 31 20 32 20 33 0D testing 1 2 3. }
Now that you have created the data directories, you can start the Flume target
agent (execute as user hdfs):
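A sketch of the start command, assuming the target configuration is in
web-server-target.conf (the file mentioned below) and that the agent is named
collector (the agent name used in the configuration lines shown later), is:
$ flume-ng agent -c conf -f web-server-target.conf -n collector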
This agent writes the data into HDFS and should be started before the source
agent. (The source reads the weblogs.) This configuration enables automatic use
of the Flume agent. The /etc/flume/conf/{flume.conf, flume-
env.sh.template} files need to be configured for this purpose. For this
example, the /etc/flume/conf/flume.conf file can be the same as the
web-server-target.conf file (modified for your environment).
Note
With the HDP distribution, Flume can be started as a service when the
system boots (e.g., service flume start).
In this example, the source agent is started as root, which will start to feed
the weblog data to the target agent. Alternatively, the source agent can be on
another machine if desired.
Click here to view code image
To see if Flume is working correctly, check the local log by using the tail
command. Also confirm that the flume-ng agents are not reporting any errors
(the file name will vary).
$ tail -f /var/log/flume-hdfs/1430164482581-1
The contents of the local log under flume-hdfs should be identical to that
written into HDFS. You can inspect this file by using the hdfs -tail
command (the file name will vary). Note that while running Flume, the most
recent file in HDFS may have the extension .tmp appended to it. The .tmp
indicates that the file is still being written by Flume. The target agent can be
configured to write the file (and start another .tmp file) by setting some or all of
the rollCount, rollSize, rollInterval, idleTimeout, and
batchSize options in the configuration file.
$ hdfs dfs -tail flume-channel/apache_access_combined/150427/FlumeData.1430164801381
Both files should contain the same data. For instance, the preceding example
had the following data in both files:
10.0.0.1 - - [27/Apr/2015:16:04:21 -0400] "GET /ambarinagios/nagios/
nagios_alerts.php?q1=alerts&alert_type=all HTTP/1.1" 200 30801 "-"
"Java/1.7.0_65"
10.0.0.1 - - [27/Apr/2015:16:04:25 -0400] "POST /cgi-bin/rrd.py
HTTP/1.1" 200 784
"-" "Java/1.7.0_65"
10.0.0.1 - - [27/Apr/2015:16:04:25 -0400] "POST /cgi-bin/rrd.py
HTTP/1.1" 200 508
"-" "Java/1.7.0_65"
You can modify both the target and source files to suit your system.
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/
logs/access_log
The target file also defines the port and two channels (mc1 and mc2). One
of these channels writes the data to the local file system, and the other
writes to HDFS. The relevant lines are shown here:
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2
collector.sinks.LocalOut.sink.directory = /var/log/flume-hdfs
collector.sinks.LocalOut.channel = mc1
The HDFS rollover settings create a new file when a threshold is
exceeded. In this example, the thresholds allow any file size and roll
over to a new file after 10,000 events or every 600 seconds.
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600
Figure 7.6 A simple Oozie DAG workflow (Adapted from Apache Oozie
Documentation)
Oozie workflow definitions are written in hPDL (an XML Process Definition
Language). Such workflows contain several types of nodes:
Control flow nodes define the beginning and the end of a workflow. They
include start, end, and optional fail nodes.
Action nodes are where the actual processing tasks are defined. When an
action node finishes, the remote systems notify Oozie and the next node in
the workflow is executed. Action nodes can also include HDFS
commands.
Fork/join nodes enable parallel execution of tasks in the workflow. The
fork node enables two or more tasks to run at the same time. A join node
represents a rendezvous point that must wait until all forked tasks
complete.
Control flow nodes enable decisions to be made about the previous task.
Control decisions are based on the results of the previous action (e.g., file
size or file existence). Decision nodes are essentially switch-case
statements that use JSP EL (Java Server Pages—Expression Language)
that evaluate to either true or false.
Figure 7.7 depicts a more complex workflow that uses all of these node types.
More information on Oozie can be found at
http://oozie.apache.org/docs/4.1.0/index.html.
Figure 7.7 A more complex Oozie DAG workflow (Adapted from Apache
Oozie Documentation)
For HDP 2.2, the following command will extract the files:
$ tar xvzf /usr/hdp/2.2.4.2-2/oozie/doc/oozie-examples.tar.gz
The examples must also be placed in HDFS. Enter the following command to
move the example files into HDFS:
Click here to view code image
The Oozie shared library must be installed in HDFS. If you are using the
Ambari installation of HDP 2.x, this library is already found in HDFS:
/user/oozie/share/lib.
Note
In HDP 2.2+, some additional version-tagged directories may appear
below this path. If you installed and built Oozie by hand, then make sure
/user/oozie exists in HDFS and put the oozie-sharelib files in
this directory as user oozie and group hadoop.
$ cd oozie-examples/apps/map-reduce/
This directory contains two files and a lib directory. The files are:
The job.properties file defines parameters (e.g., path names, ports)
for a job. This file may change per job.
The workflow.xml file provides the actual workflow for the job. In this
case, it is a simple MapReduce (pass/fail). This file usually stays the same
between jobs.
The job.properties file included in the examples requires a few edits to
work properly. Using a text editor, change the following lines by adding the host
name of the NameNode and ResourceManager (indicated by jobTracker in
the file).
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
For example, for the cluster created with Ambari in Chapter 2, the lines were
changed to
nameNode=hdfs://limulus:8020
jobTracker=limulus:8050
You will need to change the “limulus” host name to match the name of the
node running your Oozie server. The job ID can be used to track and control job
progress.
If you receive this message, make sure the following is defined in the
core-site.xml file:
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
If you are using Ambari, make this change (or add the lines) in the
Services/HDFS/Config window and restart Hadoop. Otherwise, make the
change by hand and restart all the Hadoop daemons.
This setting is required because Oozie needs to impersonate other users to
run jobs. The group property can be set to a specific user group or to a
wild card. This setting allows the account that runs the Oozie server to run
as part of the user’s group.
To avoid having to provide the -oozie option with the Oozie URL every
time you run the oozie command, set the OOZIE_URL environment variable
as follows (using your Oozie server host name in place of “limulus”):
$ export OOZIE_URL="http://limulus:11000/oozie"
You can now run all subsequent Oozie commands without specifying the -
oozie URL option. For instance, using the job ID, you can learn about a
particular job’s progress by issuing the following command:
$ oozie job -info 0000001-150424174853048-oozie-oozi-W
The resulting output (line length compressed) is shown in the following listing.
Because this job is just a simple test, it may be complete by the time you issue
the -info command. If it is not complete, its progress will be indicated in the
listing.
Job ID : 0000001-150424174853048-oozie-oozi-W
---------------------------------------------------------------------
---------------
Workflow Name : map-reduce-wf
App Path : hdfs://limulus:8020/user/hdfs/examples/apps/map-reduce
Status : SUCCEEDED
Run : 0
User : hdfs
Group : -
Created : 2015-04-29 20:52 GMT
Started : 2015-04-29 20:52 GMT
Last Modified : 2015-04-29 20:53 GMT
Ended : 2015-04-29 20:53 GMT
CoordAction ID: -
Actions
---------------------------------------------------------------------
---------------
ID Status Ext ID Ext Status Err Code
---------------------------------------------------------------------
---------------
0000001-150424174853048-oozie
-oozi-W@:start: OK - OK -
---------------------------------------------------------------------
---------------
0000001-150424174853048-oozie
-oozi-W@mr-node OK job_1429912013449_0006 SUCCEEDED -
---------------------------------------------------------------------
---------------
0000001-150424174853048-oozie
-oozi-W@end OK - OK -
---------------------------------------------------------------------
---------------
The various steps shown in the output can be related directly to the
workflow.xml mentioned previously. Note that the MapReduce job number
is provided. This job will also be listed in the ResourceManager web user
interface. The application output is located in HDFS under the oozie-
examples/output-data/map-reduce directory.
Figure 7.8 shows the main Oozie console window. Note that a link to Oozie
documentation is available directly from this window.
Suspend a workflow:
$ oozie job -suspend _OOZIE_JOB_ID_
Resume a workflow:
$ oozie job -resume _OOZIE_JOB_ID_
Rerun a workflow:
$ oozie job -rerun _OOZIE_JOB_ID_ -config JOB_PROPERTIES
Kill a job:
$ oozie job -kill _OOZIE_JOB_ID_
The data can be downloaded from Google using the following command. Note
that other stock prices are available by changing the NASDAQ:AAPL argument
to any other valid exchange and stock name (e.g., NYSE:IBM).
Click here to view code image
The Apple stock price database is in comma-separated format (csv) and will
be used to illustrate some basic operations in the HBase shell.
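The table itself is created in the HBase shell before any data are added. A
sketch that matches the description below (a price column family, a volume
column, and the date as the row key) is:
hbase(main):001:0> create 'apple', 'price', 'volume'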
In this case, the table name is apple, and two column families are defined. The
date will be used as the row key. The price family holds four values (open,
close, low, high), and the volume family holds the daily trading volume. The
put command is used to add data to the database from within the shell. For
instance, the values for a single trading day can be entered by using the
following commands:
Click here to view code image
put 'apple','6-May-15','price:open','126.56'
put 'apple','6-May-15','price:high','126.75'
put 'apple','6-May-15','price:low','123.36'
put 'apple','6-May-15','price:close','125.01'
put 'apple','6-May-15','volume','71820387'
Note that these commands can be copied and pasted into HBase shell and are
available from the book download files (see Appendix A). The shell also keeps a
history for the session, and previous commands can be retrieved and edited for
resubmission.
Get a Row
You can use the row key to access an individual row. In the stock price database,
the date is the row key.
Click here to view code image
hbase(main):008:0> get 'apple', '6-May-15'
COLUMN CELL
price:close timestamp=1430955128359, value=125.01
price:high timestamp=1430955126024, value=126.75
price:low timestamp=1430955126053, value=123.36
price:open timestamp=1430955125977, value=126.56
volume: timestamp=1430955141440, value=71820387
5 row(s) in 0.0130 seconds
Delete a Cell
A specific cell can be deleted using the following command:
Click here to view code image
hbase(main):009:0> delete 'apple', '6-May-15' , 'price:low'
If the row is inspected using get, the price:low cell is not listed.
Click here to view code image
Delete a Row
You can delete an entire row by giving the deleteall command as follows:
Click here to view code image
hbase(main):009:0> deleteall 'apple', '6-May-15'
Remove a Table
To remove (drop) a table, you must first disable it. The following two commands
remove the apple table from HBase:
Click here to view code image
hbase(main):009:0> disable 'apple'
hbase(main):010:0> drop 'apple'
Scripting Input
Commands to the HBase shell can be placed in bash scripts for automated
processing. For instance, the following can be placed in a bash script:
Click here to view code image
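A minimal sketch of such a script (assuming the Apple-stock.csv file and its
date,open,high,low,close,volume field order) generates put commands and pipes
them into the HBase shell:
#!/bin/bash
# Convert CSV stock records into HBase shell put commands and feed them
# to the shell (skip the CSV header line).
tail -n +2 Apple-stock.csv | while IFS=, read -r d open high low close vol; do
  echo "put 'apple','$d','price:open','$open'"
  echo "put 'apple','$d','price:high','$high'"
  echo "put 'apple','$d','price:low','$low'"
  echo "put 'apple','$d','price:close','$close'"
  echo "put 'apple','$d','volume','$vol'"
done | hbase shell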
While the script can be easily modified to accommodate other types of data, it
is not recommended for production use because the upload is very inefficient
and slow. Instead, this script is best used to experiment with small data files and
different types of data.
Because ImportTsv expects tab-separated input, the CSV file is first converted
to a TSV file with a helper script:
$ convert-to-tsv.sh Apple-stock.csv
Finally, ImportTsv is run using the following command line. Note the
column designation in the -Dimporttsv.columns option. In the example,
the HBASE_ROW_KEY is set as the first column—that is, the date for the data.
Click here to view code image
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -
Dimporttsv.columns=HBASE_ROW_KEY,price:open,price:high,price:low,price:close,volume
apple /tmp/Apple-stock.tsv
The ImportTsv command will use MapReduce to load the data into HBase.
To verify that the command works, drop and re-create the apple table, as
described previously, before running the import command.
In This Chapter:
The YARN Distributed-Shell is introduced as a non-MapReduce
application.
The Hadoop YARN application and operation structure is explained.
A summary of YARN application frameworks is provided.
The introduction of Hadoop version 2 has drastically increased the number and
scope of new applications. By splitting the version 1 monolithic MapReduce
engine into two parts, a scheduler and the MapReduce framework, Hadoop has
become a general-purpose large-scale data analytics platform. A simple example
of a non-MapReduce Hadoop application is the YARN Distributed-Shell
described in this chapter. As the number of non-MapReduce application
frameworks continues to grow, the user’s ability to navigate the data lake
increases.
YARN Distributed-Shell
The Hadoop YARN project includes the Distributed-Shell application, which is
an example of a Hadoop non-MapReduce application built on top of YARN.
Distributed-Shell is a simple mechanism for running shell commands and scripts
in containers on multiple nodes in a Hadoop cluster. This application is not
meant to be a production administration tool, but rather a demonstration of the
non-MapReduce capability that can be implemented on top of YARN. There are
multiple mature implementations of a distributed shell that administrators
typically use to manage a cluster of machines.
In addition, Distributed-Shell can be used as a starting point for exploring and
building Hadoop YARN applications. This chapter offers guidance on how the
Distributed-Shell can be used to understand the operation of YARN applications.
For the pseudo-distributed install using Apache Hadoop version 2.6.0, the
following environment variable provides the path to the Distributed-Shell
application (assuming $HADOOP_HOME is defined to reflect the location of
Hadoop):
Click here to view code image
$ export YARN_DS=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-
applications-
distributedshell-2.6.0.jar
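The usage options shown next can be printed by running the Client class with
the -help flag, for example:
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar $YARN_DS -help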
The output of this command follows; we will explore some of these options in
the examples illustrated in this chapter.
Click here to view code image
usage: Client
-appname <arg> Application Name. Default
value - DistributedShell
-attempt_failures_validity_interval <arg> when
attempt_failures_validity_
interval in milliseconds
is set to > 0,the failure
number will not take
failures which happen out
of the validityInterval
into failure count. If
failure count reaches to
maxAppAttempts, the
application will be
failed.
-container_memory <arg> Amount of memory in MB to
be requested to run the
shell command
-container_vcores <arg> Amount of virtual cores to
be requested to run the
shell command
-create Flag to indicate whether
to create the domain
specified with -domain.
-debug Dump out debug information
-domain <arg> ID of the timeline domain
where the timeline
entities will be put
-help Print usage
-jar <arg> Jar file containing the
application master
-keep_containers_across_application_attempts Flag to indicate whether
to keep containers across
application attempts. If
the flag is true, running
containers will not be
killed when application
attempt fails and these
containers will be
retrieved by the new
application attempt
-log_properties <arg> log4j.properties file
-master_memory <arg> Amount of memory in MB to
be requested to run the
application master
-master_vcores <arg> Amount of virtual cores to
be requested to run the
application master
-modify_acls <arg> Users and groups that
allowed to modify the
timeline entities in the
given domain
-node_label_expression <arg> Node label expression to
determine the nodes where
all the containers of this
application will be
allocated, "" means
containers can be
allocated anywhere, if you
don't specify the option,
default
node_label_expression of
queue will be used.
-num_containers <arg> No. of containers on which
the shell command needs to
be executed
-priority <arg> Application Priority.
Default 0
-queue <arg> RM Queue in which this
application is to be
submitted
-shell_args <arg> Command line args for the
shell script. Multiple args
can be separated by empty
space.
-shell_cmd_priority <arg> Priority for the shell
command containers
-shell_command <arg> Shell command to be
executed by the
Application Master. Can
only specify either
--shell_command or
--shell_script
-shell_env <arg> Environment for shell
script. Specified as
env_key=env_val pairs
-shell_script <arg> Location of the shell
script to be executed. Can
only specify either
--shell_command or
--shell_script
-timeout <arg> Application timeout in
milliseconds
-view_acls <arg> Users and groups that
allowed to view the
timeline entities in the
given domain
A Simple Example
The simplest use-case for the Distributed-Shell application is to run an arbitrary
shell command in a container. We will demonstrate the use of the uptime
command as an example. This command is run on the cluster using Distributed-
Shell as follows:
Click here to view code image
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -
jar $YARN_DS -shell_command uptime
If the shell command does not work for whatever reason, the following message
will be displayed:
Click here to view code image
15/05/27 14:58:42 ERROR distributedshell.Client: Application failed
to complete
successfully
The next step is to examine the output for the application. Distributed-Shell
redirects the output of the individual shell commands run on the cluster nodes
into the log files, which are found either on the individual nodes or aggregated
onto HDFS, depending on whether log aggregation is enabled.
Assuming log aggregation is enabled, the results for each instance of the
command can be found by using the yarn logs command. For the previous
uptime example, the following command can be used to inspect the logs:
Click here to view code image
$ yarn logs -applicationId application_1432831236474_0001
Note
The applicationId can be found from the program output or by using the
yarn application command (see the “Managing YARN Jobs” section in
Chapter 10, “Basic Hadoop Administration Procedures”).
An abbreviated version of the log output for the shell command container is
shown in the following listing:
LogType:stdout
Log Upload Time:Thu May 28 12:41:59 -0400 2015
LogLength:71
Log Contents:
12:41:56 up 33 days, 19:28, 0 users, load average: 0.08, 0.06, 0.01
Notice that there are two containers. The first container (con..._000001)
is the ApplicationMaster for the job. The second container (con..._000002)
runs the actual shell command. The output for the uptime command is located
in the second container's stdout after the Log Contents: label.
Distributed-Shell can also run a command in several containers at once by
adding the -num_containers option. If, for example, a command is run in four
containers and we examine the results for this job, there will be five containers
in the log. The four command containers (2 through 5) will print the name of
the node on which the container was run.
Arguments can be passed to the shell command with the -shell_args option. For
example, the following command runs ls -l in a container:
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -
jar $YARN_DS -shell_command ls -shell_args -l
As can be seen from the resulting listing, the files are not located in HDFS or in
the user's local directory. When we explore further by giving a pwd command
for Distributed-Shell, the following container working directory is listed; it is
created on the node that ran the shell command:
Click here to view code image
/hdfs2/hadoop/yarn/local/usercache/hdfs/appcache/application_1432831236474_0003/
container_1432831236474_0003_01_000002/
These container files are normally removed soon after an application finishes.
To preserve them for inspection, set the
yarn.nodemanager.delete.debug-delay-sec property to a delay, in seconds, and
remember that all applications will create these files. If you are using Ambari,
look on the YARN Configs tab under the Advanced yarn-site options, make the
change, and restart YARN. (See Chapter 9, “Managing Hadoop with Apache
Ambari,” for more information on Ambari administration.) These files will be
retained on the individual nodes only for the duration of the specified delay.
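If you are editing yarn-site.xml by hand instead, the same change looks similar
to the following (the 600-second value is only an example):
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>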
When debugging or investigating YARN applications, these files—in
particular, launch_container.sh—offer important information about
YARN processes. Distributed-Shell itself can be used to see what this file
contains: the contents of launch_container.sh can be printed with the
following command:
Click here to view code image
$ yarn org.apache.hadoop.yarn.applications.distributedshell.Client -
jar $YARN_DS -shell_command cat -shell_args launch_container.sh
The output (abbreviated here) shows the environment variables that are set for
the container and the final exec line used to run the command:
export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/opt/hadoop/yarn/local/usercache/hdfs/appcache/
application_1432816241597_0004,/hdfs1/hadoop/yarn/local/usercache/hdfs/appcache/
application_1432816241597_0004,/hdfs2/hadoop/yarn/local/usercache/hdfs/appcache/
application_1432816241597_0004"
export JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk.x86_64"
export
NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAA=
"
export HADOOP_YARN_HOME="/usr/hdp/current/hadoop-yarn-client"
export
HADOOP_TOKEN_FILE_LOCATION="/hdfs2/hadoop/yarn/local/usercache/hdfs/
appcache/application_1432816241597_0004/container_1432816241597_0004_01_000002/
container_tokens"
export NM_HOST="limulus"
export JVM_PID="$$"
export USER="hdfs"
export PWD="/hdfs2/hadoop/yarn/local/usercache/hdfs/appcache/
application_1432816241597_0004/container_1432816241597_0004_01_000002"
export CONTAINER_ID="container_1432816241597_0004_01_000002"
export NM_PORT="45454"
export HOME="/home/"
export LOGNAME="hdfs"
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export MALLOC_ARENA_MAX="4"
export LOG_DIRS="/opt/hadoop/yarn/log/application_1432816241597_0004/
container_1432816241597_0004_01_000002,/hdfs1/hadoop/yarn/log/
application_1432816241597_0004/container_1432816241597_0004_01_000002,/hdfs2/
hadoop/yarn/log/application_1432816241597_0004/
container_1432816241597_0004_01_000002"
exec /bin/bash -c "cat launch_container.sh
1>/hdfs2/hadoop/yarn/log/application_1432816241597_0004/
container_1432816241597_0004_01_000002/stdout
2>/hdfs2/hadoop/yarn/log/
application_1432816241597_0004/container_1432816241597_0004_01_000002/stderr
"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
exit $hadoop_shell_errorcode
fi
There are more options for the Distributed-Shell that you can test. The real
value of the Distributed-Shell application is its ability to demonstrate how
applications are launched within the Hadoop YARN infrastructure. It is also a
good starting point when you are creating YARN applications.
Structure of YARN Applications
A full explanation of writing YARN programs is beyond the scope of this book.
The structure and operation of a YARN application are covered briefly in this
section. For further information on writing YARN applications, consult Apache
Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache
Hadoop 2 (see the references listed at the end of this chapter).
As mentioned in Chapter 1, “Background and Concepts,” the central YARN
ResourceManager runs as a scheduling daemon on a dedicated machine and acts
as the central authority for allocating resources to the various competing
applications in the cluster. The ResourceManager has a central and global view
of all cluster resources and, therefore, can ensure fairness, capacity, and locality
are shared across all users. Depending on the application demand, scheduling
priorities, and resource availability, the ResourceManager dynamically allocates
resource containers to applications to run on particular nodes. A container is a
logical bundle of resources (e.g., memory, cores) bound to a particular cluster
node. To enforce and track such assignments, the ResourceManager interacts
with a special system daemon running on each node called the NodeManager.
Communications between the ResourceManager and NodeManagers are
heartbeat based for scalability. NodeManagers are responsible for local
monitoring of resource availability, fault reporting, and container life-cycle
management (e.g., starting and killing jobs). The ResourceManager depends on
the NodeManagers for its “global view” of the cluster.
User applications are submitted to the ResourceManager via a public protocol
and go through an admission control phase during which security credentials are
validated and various operational and administrative checks are performed.
Those applications that are accepted pass to the scheduler and are allowed to run.
Once the scheduler has enough resources to satisfy the request, the application is
moved from an accepted state to a running state. Aside from internal
bookkeeping, this process involves allocating a container for the single
ApplicationMaster and spawning it on a node in the cluster. Often called
container 0, the ApplicationMaster does not have any additional resources at this
point, but rather must request additional resources from the ResourceManager.
The ApplicationMaster is the “master” user job that manages all application
life-cycle aspects, including dynamically increasing and decreasing resource
consumption (i.e., containers), managing the flow of execution (e.g., in case of
MapReduce jobs, running reducers against the output of maps), handling faults
and computation skew, and performing other local optimizations. The
ApplicationMaster is designed to run arbitrary user code that can be written in
any programming language, as all communication with the ResourceManager
and NodeManager is encoded using extensible network protocols (i.e., Google
Protocol Buffers, http://code.google.com/p/protobuf/).
YARN makes few assumptions about the ApplicationMaster, although in
practice it expects most jobs will use a higher-level programming framework. By
delegating all these functions to ApplicationMasters, YARN’s architecture gains
a great deal of scalability, programming model flexibility, and improved user
agility. For example, upgrading and testing a new MapReduce framework can be
done independently of other running MapReduce frameworks.
Typically, an ApplicationMaster will need to harness the processing power of
multiple servers to complete a job. To achieve this, the ApplicationMaster issues
resource requests to the ResourceManager. The form of these requests includes
specification of locality preferences (e.g., to accommodate HDFS use) and
properties of the containers. The ResourceManager will attempt to satisfy the
resource requests coming from each application according to availability and
scheduling policies. When a resource is scheduled on behalf of an
ApplicationMaster, the ResourceManager generates a lease for the resource,
which is acquired by a subsequent ApplicationMaster heartbeat. The
ApplicationMaster then works with the NodeManagers to start the resource.
When the ApplicationMaster presents the container lease to the NodeManager, a
token-based security mechanism guarantees the authenticity of the lease. In a typical
situation, running containers will communicate with the ApplicationMaster
through an application-specific protocol to report status and health information
and to receive framework-specific commands. In this way, YARN provides a
basic infrastructure for monitoring and life-cycle management of containers,
while each framework manages application-specific semantics independently.
This design stands in sharp contrast to the original Hadoop version 1 design, in
which scheduling was designed and integrated around managing only
MapReduce tasks.
Figure 8.1 illustrates the relationship between the application and YARN
components. The YARN components appear as the large outer boxes
(ResourceManager and NodeManagers), and the two applications appear as
smaller boxes (containers), one dark and one light. Each application uses a
different ApplicationMaster; the darker client is running a Message Passing
Interface (MPI) application and the lighter client is running a traditional
MapReduce application.
Figure 8.1 YARN architecture with two clients (MapReduce and MPI). The
darker client (MPI AM2) is running an MPI application, and the lighter client
(MR AM1) is running a MapReduce application. (From Arun C. Murthy, et
al., Apache Hadoop™ YARN, copyright © 2014, p. 45. Reprinted and
electronically reproduced by permission of Pearson Education, Inc., New
York, NY.)
Distributed-Shell
As described earlier in this chapter, Distributed-Shell is an example application
included with the Hadoop core components that demonstrates how to write
applications on top of YARN. It provides a simple method for running shell
commands and scripts in containers in parallel on a Hadoop YARN cluster.
Hadoop MapReduce
MapReduce was the first YARN framework and drove many of YARN’s
requirements. It is integrated tightly with the rest of the Hadoop ecosystem
projects, such as Apache Pig, Apache Hive, and Apache Oozie.
Apache Tez
One great example of a new YARN framework is Apache Tez. Many Hadoop
jobs involve the execution of a complex directed acyclic graph (DAG) of tasks
using separate MapReduce stages. Apache Tez generalizes this process and
enables these tasks to be spread across stages so that they can be run as a single,
all-encompassing job.
Tez can be used as a MapReduce replacement for projects such as Apache
Hive and Apache Pig. No changes are needed to the Hive or Pig applications.
For more information, see https://tez.apache.org.
Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability.
Facebook, Twitter, and LinkedIn use it to create social graphs of users. Giraph
was originally written to run on standard Hadoop V1 using the MapReduce
framework, but that approach proved inefficient and totally unnatural for various
reasons. The native Giraph implementation under YARN provides the user with
an iterative processing model that is not directly available with MapReduce.
Support for YARN has been present in Giraph since its own version 1.0 release.
In addition, using the flexibility of YARN, the Giraph developers plan on
implementing their own web interface to monitor job progress. For more
information, see http://giraph.apache.org.
Dryad on YARN
Similar to Apache Tez, Microsoft’s Dryad provides a DAG as the abstraction of
execution flow. This framework is ported to run natively on YARN and is fully
compatible with its non-YARN version. The code is written completely in native
C++ and C# for worker nodes and uses a thin layer of Java within the
application. For more information, see http://research.microsoft.com/en-
us/projects/dryad.
Apache Spark
Spark was initially developed for applications in which keeping data in memory
improves performance, such as iterative algorithms, which are common in
machine learning, and interactive data mining. Spark differs from classic
MapReduce in two important ways. First, Spark holds intermediate results in
memory, rather than writing them to disk. Second, Spark supports more than just
MapReduce functions; that is, it greatly expands the set of possible analyses that
can be executed over HDFS data stores. It also provides APIs in Scala, Java, and
Python.
Since 2013, Spark has been running on production YARN clusters at Yahoo!.
The advantage of porting and running Spark on top of YARN is the common
resource management and a single underlying file system. For more information,
see https://spark.apache.org.
Apache Storm
Traditional MapReduce jobs are expected to eventually finish, but Apache Storm
continuously processes messages until it is stopped. This framework is designed
to process unbounded streams of data in real time. It can be used in any
programming language. The basic Storm use-cases include real-time analytics,
online machine learning, continuous computation, distributed RPC (remote
procedure calls), ETL (extract, transform, and load), and more. Storm provides
fast performance, is scalable, is fault tolerant, and provides processing
guarantees. It works directly under YARN and takes advantage of the common
data and resource management substrate. For more information, see
http://storm.apache.org.
In This Chapter:
A tour of the Apache Ambari graphical management tool is provided.
The procedure for restarting a stopped Hadoop service is explained.
The procedure for changing Hadoop properties and tracking configurations
is presented.
Managing a Hadoop installation by hand can be tedious and time consuming. In
addition to keeping configuration files synchronized across a cluster, starting,
stopping, and restarting Hadoop services and dependent services in the right
order is not a simple task. The Apache Ambari graphical management tool is
designed to help you easily manage these and other Hadoop administrative
issues. This chapter provides some basic navigation and usage scenarios for
Apache Ambari.
Apache Ambari is an open source graphical installation and management tool
for Apache Hadoop version 2. Ambari was used in Chapter 2, “Installation
Recipes,” to install Hadoop and related packages across a four-node cluster. In
particular, the following packages were installed: HDFS, YARN, MapReduce2,
Tez, Nagios, Ganglia, Hive, HBase, Pig, Sqoop, Oozie, Zookeeper, and Flume.
These packages have been described in other chapters and provide basic Hadoop
functionality. As noted in Chapter 2, other packages are available for installation
(refer to Figure 2.18). Finally, to use Ambari as a management tool, the entire
installation process must be done using Ambari. It is not possible to use Ambari
for Hadoop clusters that have been installed by other means.
Along with being an installation tool, Ambari can be used as a centralized
point of administration for a Hadoop cluster. Using Ambari, the user can
configure cluster services, monitor the status of cluster hosts (nodes) or services,
visualize hotspots by service metric, start or stop services, and add new hosts to
the cluster. All of these features infuse a high level of agility into the processes
of managing and monitoring a distributed computing environment. Ambari also
attempts to provide real-time reporting of important metrics.
Apache Ambari continues to undergo rapid change. The description in this
chapter is based on version 1.7. The major aspects of Ambari, which are
unlikely to change, are explained in the following sections. Further detailed
information can be found at https://ambari.apache.org.
Quick Tour of Apache Ambari
After completing the initial installation and logging into Ambari (as explained in
Chapter 2), a dashboard similar to that shown in Figure 9.1 is presented. The
same four-node cluster as created in Chapter 2 will be used to explore Ambari in
this chapter. If you need to reopen the Ambari dashboard interface, simply enter
the following command (which assumes you are using the Firefox browser,
although other browsers may also be used):
$ firefox localhost:8080
Dashboard View
The Dashboard view provides small status widgets for many of the services
running on the cluster. The actual services are listed on the left-side vertical
menu. These services correspond to what was installed in Chapter 2. You can
move, edit, remove, or add these widgets as follows:
Moving: Click and hold a widget while it is moved about the grid.
Edit: Place the mouse on the widget and click the gray edit symbol in the
upper-right corner of the widget. You can change several different aspects
(including thresholds) of the widget.
Remove: Place the mouse on the widget and click the X in the upper-left
corner.
Add: Click the small triangle next to the Metrics tab and select Add. The
available widgets will be displayed. Select the widgets you want to add
and click Apply.
Some widgets provide additional information when you move the mouse over
them. For instance, the DataNodes widget displays the number of live, dead, and
decommissioning hosts. Clicking directly on a graph widget provides an
enlarged view. For instance, Figure 9.2 provides a detailed view of the CPU
Usage widget from Figure 9.1.
Figure 9.2 Enlarged view of Ambari CPU Usage widget
The Dashboard view also includes a heatmap view of the cluster. Cluster
heatmaps physically map selected metrics across the cluster. When you click the
Heatmaps tab, a heatmap for the cluster will be displayed. To select the metric
used for the heatmap, choose the desired option from the Select Metric pull-
down menu. Note that the scale and color ranges are different for each metric.
The heatmap for percentage host memory used is displayed in Figure 9.3.
Figure 9.3 Ambari heatmap for Host memory usage
Configuration history is the final tab in the dashboard window. This view
provides a list of configuration changes made to the cluster. As shown in Figure
9.4, Ambari enables configurations to be sorted by Service, Configuration
Group, Date, and Author. To find the specific configuration settings, click the
service name. More information on configuration settings is provided later in the
chapter.
Hosts View
Selecting the Hosts menu item provides the information shown in Figure 9.7.
The host name, IP address, number of cores, memory, disk usage, current load
average, and Hadoop components are listed in this window in tabular form.
Admin View
The Administration (Admin) view provides three options. The first, as shown in
Figure 9.9, displays a list of installed software. This Repositories listing
generally reflects the version of Hortonworks Data Platform (HDP) used during
the installation process. The Service Accounts option lists the service accounts
added when the system was installed. These accounts are used to run various
services and tests for Ambari. The third option, Security, sets the security on the
cluster. A fully secured Hadoop cluster is important in many instances and
should be explored if a secure environment is needed. This aspect of Ambari is
beyond the scope of this book.
Figure 9.9 Ambari installed packages with versions, numbers, and
descriptions
Views View
Ambari Views is a framework offering a systematic way to plug in user interface
capabilities that provide for custom visualization, management, and monitoring
features in Ambari. Views allows you to extend and customize Ambari to meet
your specific needs. You can find more information about Ambari Views from
the following source:
https://cwiki.apache.org/confluence/display/AMBARI/Views.
In This Chapter:
Several basic Hadoop YARN administration topics are presented,
including decommissioning YARN nodes, managing YARN applications,
and important YARN properties.
Basic HDFS administration procedures are described, including using the
NameNode UI, adding users, performing file system checks, balancing
DataNodes, taking HDFS snapshots, and using the HDFS NFSv3 gateway.
The Capacity scheduler is discussed.
Hadoop version 2 MapReduce compatibility and node capacity are
discussed.
In Chapter 9, “Managing Hadoop with Apache Ambari,” the Apache Ambari
web management tool was described in detail. Much of the day-to-day
management of a Hadoop cluster can be accomplished using the Ambari
interface. Indeed, whenever possible, Ambari should be used to manage the
cluster because it keeps track of the cluster state.
Hadoop has two main areas of administration: the YARN resource manager
and the HDFS file system. Other application frameworks (e.g., the MapReduce
framework) and tools have their own management files. As mentioned in
Chapter 2, “Installation Recipes,” Hadoop configuration is accomplished
through the use of XML configuration files. The basic files and their function are
as follows:
core-default.xml: System-wide properties
hdfs-default.xml: Hadoop Distributed File System properties
mapred-default.xml: Properties for the YARN MapReduce
framework
yarn-default.xml: YARN properties
You can find a complete list of properties for all these files at
http://hadoop.apache.org/docs/current/ (look at the lower-left side of the page
under “Configuration”). A full discussion of all options is beyond the scope of
this book. The Apache Hadoop documentation does provide helpful comments
and defaults for each property.
If you are using Ambari, you should manage the configuration files through the
interface, rather than editing them by hand. If some other management tool is
employed, it should be used as described. Installation of Hadoop by hand (as in a
pseudo-distributed mode in Chapter 2) requires that you edit the configuration
files by hand and then copy them to all nodes in the cluster if applicable.
The following sections cover some useful administration tasks that may fall
outside of Ambari, require special configuration, or require more explanation.
By no means does this discussion cover all possible topics related to Hadoop
administration; rather, it is designed to help jump-start your administration of
Hadoop version 2.
YARN WebProxy
The Web Application Proxy is a separate proxy server in YARN that addresses
security issues with the cluster web interface on ApplicationMasters. By default,
the proxy runs as part of the Resource Manager itself, but it can be configured to
run in a stand-alone mode by adding the configuration property yarn.web-
proxy.address to yarn-site.xml. (Using Ambari, go to the YARN
Configs view, scroll to the bottom, and select Custom yarn-site.xml/Add
property.) In stand-alone mode, yarn.web-proxy.principal and
yarn.web-proxy.keytab control the Kerberos principal name and the
corresponding keytab, respectively, for use in secure mode. These elements can
be added to the yarn-site.xml if required.
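As a sketch, a stand-alone proxy could be configured with an entry along these
lines (the host name and port are assumptions; choose values appropriate for
your cluster):
<property>
<name>yarn.web-proxy.address</name>
<value>limulus:8089</value>
</property>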
Neither the YARN ResourceManager UI nor the Ambari UI can be used to kill
YARN applications. If a job needs to be killed, give the yarn application
command to find the Application ID and then use the -kill argument.
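For example, using an illustrative application ID:
$ yarn application -list
$ yarn application -kill application_1432831236474_0001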
Several other hdfs fsck options provide more detail, include snapshots and open
files, and help manage corrupted files (see the example after this list):
-move moves corrupted files to /lost+found.
-delete deletes corrupted files.
-files prints out files being checked.
-openforwrite prints out files opened for writes during check.
-includeSnapshots includes snapshot data if the given path indicates a
snapshottable directory or if there are snapshottable directories under it.
-list-corruptfileblocks prints out a list of missing blocks and
the files to which they belong.
-blocks prints out a block report.
-locations prints out locations for every block.
-racks prints out network topology for data-node locations.
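For example, the following command (one possible combination of these
options) checks the entire file system and prints the files, blocks, and block
locations that are examined:
$ hdfs fsck / -files -blocks -locations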
Balancing HDFS
Based on usage patterns and DataNode availability, the number of data blocks
across the DataNodes may become unbalanced. To avoid over-utilized
DataNodes, the HDFS balancer tool rebalances data blocks across the available
DataNodes. Data blocks are moved from over-utilized to under-utilized nodes to
within a certain percent threshold. Rebalancing can be done when new
DataNodes are added or when a DataNode is removed from service. This step
does not create more space in HDFS, but rather improves efficiency.
The HDFS superuser must run the balancer. The simplest way to run the
balancer is to enter the following command:
$ hdfs balancer
By default, the balancer will continue to rebalance the nodes until the number
of data blocks on all DataNodes are within 10% of each other. The balancer can
be stopped, without harming HDFS, at any time by entering a Ctrl-C. Lower or
higher thresholds can be set using the -threshold argument. For example,
giving the following command sets a 5% threshold:
$ hdfs balancer -threshold 5
The lower the threshold, the longer the balancer will run. To ensure the
balancer does not swamp the cluster networks, you can set a bandwidth limit
before running the balancer, as follows:
Click here to view code image
$ hdfs dfsadmin -setBalancerBandwidth newbandwidth
Here, newbandwidth is the maximum amount of network bandwidth, in bytes per
second, that each DataNode can use during the balancing operation.
HDFS may drop into Safe Mode if a major issue arises within the file system
(e.g., a full DataNode). The file system will not leave Safe Mode until the
situation is resolved. To check whether HDFS is in Safe Mode, enter the
following command:
Click here to view code image
$ hdfs dfsadmin -safemode get
SecondaryNameNode
To avoid long NameNode restarts and other issues, the performance of the
SecondaryNameNode should be verified. Recall that the SecondaryNameNode
takes the previous file system image file (fsimage*) and adds the NameNode
file system edits to create a new file system image file for the NameNode to use
when it restarts. The hdfs-site.xml defines a property called
fs.checkpoint.period (called HDFS Maximum Checkpoint Delay in
Ambari). This property provides the time in seconds between the
SecondaryNameNode checkpoints.
When a checkpoint occurs, a new fsimage* file is created in the directory
corresponding to the value of dfs.namenode.checkpoint.dir in the
hdfs-site.xml file. This file is also placed in the NameNode directory
corresponding to the dfs.namenode.name.dir path designated in the
hdfs-site.xml file. To test the checkpoint process, a short time period (e.g.,
300 seconds) can be used for fs.checkpoint.period and HDFS restarted.
After five minutes, two identical fsimage* files should be present in each of
the two previously mentioned directories. If these files are not recent or are
missing, consult the NameNode and SecondaryNameNode logs.
Once the SecondaryNameNode process is confirmed to be working correctly,
reset the fs.checkpoint.period to the previous value and restart HDFS.
(Ambari versioning is helpful with this type of procedure.) If the
SecondaryNameNode is not running, a checkpoint can be forced by running the
following command:
Click here to view code image
$ hdfs secondarynamenode -checkpoint force
HDFS Snapshots
HDFS snapshots are read-only, point-in-time copies of HDFS. Snapshots can be
taken on a subtree of the file system or the entire file system. Some common
use-cases for snapshots are data backup, protection against user errors, and
disaster recovery.
Snapshots can be taken on any directory once the directory has been set as
snapshottable. A snapshottable directory is able to accommodate 65,536
simultaneous snapshots. There is no limit on the number of snapshottable
directories. Administrators may set any directory to be snapshottable, but nested
snapshottable directories are not allowed. For example, a directory cannot be set
to snapshottable if one of its ancestors/descendants is a snapshottable directory.
More details can be found at https://hadoop.apache.org/docs/current/hadoop-
project-dist/hadoop-hdfs/HdfsSnapshots.html.
The following example walks through the procedure for creating a snapshot.
The first step is to declare a directory as “snapshottable” using the following
command:
Click here to view code image
$ hdfs dfsadmin -allowSnapshot /user/hdfs/war-and-peace-input
Allowing snapshot on /user/hdfs/war-and-peace-input succeeded
Once the directory has been made snapshottable, the snapshot can be taken with
the following command. The command requires the directory path and a name
for the snapshot—in this case, wapi-snap-1.
Click here to view code image
$ hdfs dfs -createSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
Created snapshot /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-1
The restoration process is basically a simple copy from the snapshot to the
previous directory (or anywhere else). Note the use of the
~/.snapshot/wapi-snap-1 path to restore the file:
Click here to view code image
$ hdfs dfs -cp /user/hdfs/war-and-peace-input/.snapshot/wapi-snap-
1/war-and-peace
.txt /user/hdfs/war-and-peace-input
Confirmation that the file has been restored can be obtained by issuing the
following command:
Click here to view code image
$ hdfs dfs -ls /user/hdfs/war-and-peace-input/
Found 1 items
-rw-r--r-- 2 hdfs hdfs 3288746 2015-06-24 21:12 /user/hdfs/war-and-
peace-
input/war-and-peace.txt
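For example, the snapshottable directories can be listed, and a snapshot that is
no longer needed can be removed, with the following commands:
$ hdfs lsSnapshottableDir
$ hdfs dfs -deleteSnapshot /user/hdfs/war-and-peace-input wapi-snap-1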
The HDFS NFSv3 gateway requires proxy user properties in core-site.xml.
Using Ambari, these can be added under the Custom core-site section (here for
the root user):
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
The name of the user who will start the Hadoop NFSv3 gateway is placed in
the name field. In the previous example, root is used for this purpose. This
setting can be any user who starts the gateway. If, for instance, user nfsadmin
starts the gateway, then the two names would be
hadoop.proxyuser.nfsadmin.groups and
hadoop.proxyuser.nfsadmin.hosts. The * value, entered in the
preceding lines, opens the gateway to all groups and allows it to run on any host.
Access is restricted by entering groups (comma separated) in the groups
property. Entering a host name in the hosts property restricts which host can
run the gateway.
Next, move to the Advanced hdfs-site.xml section and set the following
property:
Click here to view code image
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>3600000</value>
</property>
This property ensures client mounts with access time updates work properly.
(See the mount default atime option.)
Finally, move to the Custom hdfs-site section, click the Add Property link, and
add the following property:
Click here to view code image
property>
<name>dfs.nfs3.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
The NFSv3 dump directory is needed because the NFS client often reorders
writes. Sequential writes can arrive at the NFS gateway in random order. This
directory is used to temporarily save out-of-order writes before writing to HDFS.
Make sure the dump directory has enough space. For example, if the application
uploads 10 files, each of size 100MB, it is recommended that this directory have
1GB of space to cover a worst-case write reorder for every file.
Once all the changes have been made, click the green Save button and note
the changes you made to the Notes box in the Save confirmation dialog. Then
restart all of HDFS by clicking the orange Restart button.
Next, start the HDFS gateway by using the hadoop-daemon script to start
portmap and nfs3 as follows:
Click here to view code image
# /usr/hdp/2.2.4.2-2/hadoop/sbin/hadoop-daemon.sh start portmap
# /usr/hdp/2.2.4.2-2/hadoop/sbin/hadoop-daemon.sh start nfs3
Output from the gateway daemons is written to log files similar to the
following (the gateway host here is n0):
/var/log/hadoop/root/hadoop-root-nfs3-n0.log
To confirm the gateway is working, issue the rpcinfo command shown here. The
output should look like the following:
Click here to view code image
# rpcinfo -p n0
program vers proto port service
100005 2 tcp 4242 mountd
100000 2 udp 111 portmapper
100000 2 tcp 111 portmapper
100005 1 tcp 4242 mountd
100003 3 tcp 2049 nfs
100005 1 udp 4242 mountd
100005 3 udp 4242 mountd
100005 3 tcp 4242 mountd
100005 2 udp 4242 mountd
Finally, make sure the mount is available by issuing the following command:
# showmount -e n0
Export list for n0:
/ *
If the rpcinfo or showmount command does not work correctly, check the
previously mentioned log files for problems.
The mount command is as follows. Note that the name of the gateway node
will be different on other clusters, and an IP address can be used instead of the
node name.
Click here to view code image
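A typical mount command for this example, assuming the gateway runs on node
n0 and the local mount point is /mnt/hdfs (the NFS options shown are common
choices for the gateway, not taken from this cluster), is:
# mount -t nfs -o vers=3,proto=tcp,nolock n0:/ /mnt/hdfs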
Once the file system is mounted, the files will be visible to the client users.
The following command will list the mounted file system:
Click here to view code image
# ls /mnt/hdfs
app-logs apps benchmarks hdp mapred mr-history system tmp user var
A webpage with code downloads, a question and answer forum, resource links,
and updated information is available from the following link. All of the code and
examples used in this book can be downloaded from this page.
http://www.clustermonkey.net/Hadoop2-Quick-Start-Guide
B. Getting Started Flowchart and Troubleshooting
Guide
The flowchart helps you get to the book content that you need, while the
troubleshooting section walks you through basic rules and tips.
Check the System Logs First: Maybe the Problem Is Not Hadoop
When an issue or error occurs, a quick check of the system logs is a good idea.
There could be a file permission error or some local non-Hadoop service that is
causing problems.
Hadoop console logging is controlled by the HADOOP_ROOT_LOGGER
environment variable. For example, setting the log level to ERROR before
running a job suppresses the usual console messages:
$ export HADOOP_ROOT_LOGGER="ERROR,console"
$ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-
peace-output
Although no messages are printed, the output of the job can be confirmed by
examining the output directory.
Click here to view code image
hdfs dfs -ls war-and-peace-output
Found 2 items
-rw-r--r-- 2 hdfs hdfs 0 2015-07-18 16:30 war-and-peace-
output/_SUCCESS
-rw-r--r-- 2 hdfs hdfs 467839 2015-07-18 16:30 war-and-peace-output/
part-r-00000
The allowable levels are OFF, FATAL, ERROR, WARN, INFO, DEBUG,
TRACE and ALL.
If you try to stop a Hadoop service as a user other than the one who started it
(e.g., as root), the stop script will print a message similar to the following:
no resourcemanager to stop
When you check the system, however, you will see that the ResourceManager
is still running. The reason for this confusion is that the user who started the
Hadoop service must stop that service. Unlike system services, Hadoop services
started by one user cannot be cleanly stopped by another user. Of course, the
root user can kill the Java process running the service, but this is not a clean
method.
There is also a preferred order for starting and stopping Hadoop services.
Although it is possible to start the services in any order, a more ordered method
helps minimize startup issues. The core services should be started as follows.
For HDFS, start in this order (shut down in the opposite order):
1. NameNode
2. All DataNodes
3. SecondaryNameNode
For YARN, start in this order (shut down in the opposite order):
1. ResourceManager
2. All NodeManagers
3. MapReduceHistoryServer
The HDFS and YARN services can be started independently. Apache Ambari
manages the startup/shutdown order automatically.
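As a sketch for a hand-installed system (script locations follow the Apache
Hadoop 2.6 sbin layout used elsewhere in this book), the services can be started
with commands similar to the following:
$ hadoop-daemon.sh start namenode            # on the NameNode host
$ hadoop-daemon.sh start datanode            # on each DataNode
$ hadoop-daemon.sh start secondarynamenode
$ yarn-daemon.sh start resourcemanager       # on the ResourceManager host
$ yarn-daemon.sh start nodemanager           # on each NodeManager
$ mr-jobhistory-daemon.sh start historyserver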
NameNode Reformatting
Like any other file system, the format operation in HDFS deletes all data. If you
choose to reformat a previously installed and running HDFS system, be aware
that the DataNodes and/or SecondaryNameNode will not start with the newly
formatted NameNode. If you examine the DataNode logs, you will see errors
reporting an incompatible clusterID. To resolve this issue, remove the old
DataNode data directory and restart the DataNode:
Click here to view code image
$ rm -r /var/data/hadoop/hdfs/dn/current/
$ /opt/hadoop-2.6.0/sbin/hadoop-daemon.sh start datanode
The path to the DataNode used here is set in the hdfs-site.xml file. This
step must be performed for each of the individual data nodes. In a similar
fashion, the SecondaryNameNode must be reset using the following commands:
Click here to view code image
$ rm -r /var/data/hadoop/hdfs/snn/current
$ /opt/hadoop-2.6.0/sbin/hadoop-daemon.sh start secondarynamenode
NameNode startup errors are usually due to a failed, corrupt, or unavailable local machine
file system. Check that the directory in the hdfs-site.xml file, set in the
following property, is mounted and functioning correctly. As mentioned
previously, this property can include multiple directories for redundancy.
Click here to view code image
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/var/data/hadoop/hdfs/nn</value>
</property>
If the NameNode directory (or directories) is not recoverable, then you can
use the mirrored copy to restore the NameNode state. As a last resort, the latest
checkpoint, if intact, can be used to restore the NameNode. Note that some data
may be lost in this process. The latest checkpoint can be imported to the
NameNode if all other copies of the image and the edits files are lost. The
following steps will restore the latest NameNode checkpoint:
1. Create an empty directory specified in the dfs.namenode.name.dir
configuration variable.
2. Specify the location of the checkpoint directory in the configuration
variable dfs.namenode.checkpoint.dir.
3. Start the NameNode with the -importCheckpoint option:
Click here to view code image
$ hadoop-daemon.sh start namenode -importCheckpoint
If Safe Mode is on, turn it off with the following command. If Safe Mode will
not turn off, there may be larger issues; check the NameNode log.
Click here to view code image
$ hdfs dfsadmin -safemode leave
If there are corrupted blocks or files, delete them with the following
command:
$ hdfs fsck / -delete
Your HDFS file system should be usable, but there is no guarantee that all the
files will be available.
C. Summary of Apache Hadoop Resources by Topic
HDFS
HDFS background
http://hadoop.apache.org/docs/stable1/hdfs_design.html
http://developer.yahoo.com/hadoop/tutorial/module2.html
http://hadoop.apache.org/docs/stable/hdfs_user_guide.html
HDFS user commands
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-
hdfs/HDFSCommands.html
HDFS Java programming
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
HDFS libhdfs programming in C
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-
hdfs/LibHdfs.html
Examples
Pi benchmark
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/pi/package-
summary.html
Terasort benchmark
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-
summary.html
Benchmarking and stress testing a Hadoop cluster
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-
stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-
mrbench (uses Hadoop V1, will work with V2)
MapReduce
https://developer.yahoo.com/hadoop/tutorial/module4.html (based on
Hadoop version 1, but still a good MapReduce background)
http://en.wikipedia.org/wiki/MapReduce
http://research.google.com/pubs/pub36249.html
MapReduce Programming
Apache Hadoop Java MapReduce example
http://hadoop.apache.org/docs/current/hadoop-mapreduce-
client/hadoop-mapreduce-client-
core/MapReduceTutorial.html#Example:_WordCount_v1.0
Apache Hadoop streaming example
http://hadoop.apache.org/docs/r1.2.1/streaming.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-
program-in-python
Apache Hadoop Pipes example
http://wiki.apache.org/hadoop/C++WordCount
https://developer.yahoo.com/hadoop/tutorial/module4.html#pipes
Apache Hadoop Grep example
http://wiki.apache.org/hadoop/Grep
https://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
Debugging MapReduce
http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms
http://hadoop.apache.org/docs/current/hadoop-mapreduce-
client/hadoop-mapreduce-client-
core/MapReduceTutorial.html#Debugging
Essential Tools
Apache Pig scripting language
http://pig.apache.org/
http://pig.apache.org/docs/r0.14.0/start.html
Apache Hive SQL-like query language
https://hive.apache.org/
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
http://grouplens.org/datasets/movielens (data for example)
Apache Sqoop RDBMS import/export
http://sqoop.apache.org
http://dev.mysql.com/doc/world-setup/en/index.html (data for example)
Apache Flume streaming data and transport utility
https://flume.apache.org
https://flume.apache.org/FlumeUserGuide.html
Apache Oozie workflow manager
http://oozie.apache.org
http://oozie.apache.org/docs/4.0.0/index.html
Apache HBase distributed database
http://hbase.apache.org/book.html
http://hbase.apache.org
http://research.google.com/archive/bigtable.html (Google Big Table
paper)
http://www.google.com/finance/historical?
q=NASDAQ:AAPL&authuser=0&output=csv (data for example)
Ambari Administration
https://ambari.apache.org
Hue Installation
For this example, the following software environment is assumed. This Ambari
environment is the same as that used in Chapter 2, “Installation Recipes,” and
Chapter 9, “Managing Hadoop with Apache Ambari.”
OS: Linux
Platform: RHEL 6.6
Hortonworks HDP 2.2.4 with Hadoop version: 2.6
Hue version: 2.6.1-2
Installing Hue requires several configuration steps. In this appendix, the
Hadoop XML configuration files will be changed using Ambari and the
hue.ini files will be edited by hand. More detailed instructions are provided
in the Hortonworks HDP documentation:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-
2.1.7/bk_installing_manually_book/content/rpm-chap-hue.html.
Next, look further down in the HDFS properties form for the Custom core-site
heading and click the small triangle to open the drop-down form. Using the Add
Property... link at the bottom of the form, add the following properties and
values. (Recall that the <name> tag here refers to the Key field in the input
form.) These additions correspond to the settings in the core-site.xml file
located in the /etc/hadoop/conf directory. When you are finished, select
Save and add your notes, but do not restart the services.
Click here to view code image
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hcat.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hcat.hosts</name>
<value>*</value>
</property>
<property>
<name>webhcat.proxyuser.hue.groups</name>
<value>*</value>
</property>
<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
<value>*</value>
</property>
Next, set the time zone variable to your time zone. (See
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for values.)
time_zone=America/New_York
Starting Hue
Hue is started as a service. Once its configuration is complete, it can be started
with the following command:
# service hue start
As Hue starts, the console output should look like the following:
Click here to view code image
Detecting versions of components...
HUE_VERSION=2.6.1-2
HDP=2.2.4
Hadoop=2.6.0
Pig=0.14.0
Hive-Hcatalog=0.14.0
Oozie=4.1.0
Ambari-server=1.7-169
HBase=0.98.4
Knox=0.5.0
Storm=0.9.3
Falcon=0.6.0
Starting hue: [ OK ]
The first time you connect to Hue, the window in Figure D.2 is presented. The
user name and password used for this first login will be the administrator for
Hue. In this example, the user name hue-admin and password admin are
used.
# export SPARK_HOME=/usr/spark-1.4.1-bin-hadoop2.6/
# cd $SPARK_HOME/sbin
# ./start-all.sh
Each of the three worker nodes will have one log file. For instance, on node n0,
a single log file will be created with the following name:
Click here to view code image
spark-root-org.apache.spark.deploy.worker.Worker-1-n0.out
Check the logs for any issues. In particular, if iptables (or some other firewall)
is running, make sure it is not blocking the slave nodes from contacting the
master node. If everything worked correctly, messages similar to the following
should appear in the master log. The successful registration of all four worker
nodes, including the local node, should be listed. The Spark master URL is also
provided in the output (spark://limulus:7077).
Click here to view code image
...
INFO Master: Starting Spark master at spark://limulus:7077
INFO Master: Running Spark version 1.4.1
WARN Utils: Service 'MasterUI' could not bind on port 8080.
Attempting port 8081.
INFO Utils: Successfully started service 'MasterUI' on port 8081.
INFO MasterWebUI: Started MasterWebUI at http://10.0.0.1:8081
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker 10.0.0.1:54856 with 4 cores, 22.5 GB
RAM
INFO Master: Registering worker 10.0.0.11:34228 with 4 cores, 14.6 GB
RAM
INFO Master: Registering worker 10.0.0.12:49932 with 4 cores, 14.6 GB
RAM
INFO Master: Registering worker 10.0.0.10:36124 with 4 cores, 14.6 GB
RAM
Spark provides a web UI, whose address is given as part of the log. As
indicated, the MasterWebUI is http://10.0.0.1:8081. Placing this address in a
browser on the master node displays the interface in Figure E.1. Note that
Ambari uses the default port 8080. In this case, Spark used 8081.
Figure E.1 Spark UI with four worker nodes
4. Inspect the logs to make sure both the master and the worker started
properly. Files similar to the following should appear in the
$SPARK_HOME/logs directory (the host name is norbert).
Click here to view code image
spark-root-org.apache.spark.deploy.master.Master-1-norbert.out
spark-root-org.apache.spark.deploy.worker.Worker-1-norbert.out
5. Open the Spark web GUI as described previously. Note that the
MasterWebUI will use port 8080 because Ambari is not running. Check
the master log for the exact URL.
You can start the Spark shell with the following command:
$ $SPARK_HOME/bin/spark-shell
Finally, to access Hadoop HDFS data from Spark, use the Hadoop NameNode
URL. This URL is typically hdfs://<namenode>:8020/path (for HDP
installations) or hdfs://<namenode>:9000 (for ASF source installations). The
HDFS NameNode URL can be found in the /etc/hadoop/conf/core-
site.xml file.
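Alternatively, assuming the Hadoop client configuration is available on the
node, the NameNode URL can be printed directly with the hdfs getconf command:
$ hdfs getconf -confKey fs.defaultFS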
More information on using Spark can be found at http://spark.apache.org.
Index
A
Action flow nodes, 155–156
Ad-hoc queries. See Hive.
Admin pull-down menu, Ambari, 194
Admin view, Ambari, 193
Ambari
Admin pull-down menu, 194
Admin view, 193
changing the password, 186
console, 45–55
customizing, 193
dashboard, 41
Dashboard view, 186–189
host management, 191–192
Hosts view, 191–192
installing Hadoop. See Installing Hadoop, with Ambari.
listing installed software, 193
management screen, opening, 194
overview, 185–186
progress window, disabling, 194
properties, setting, 189–191
screen shots, 46–55
signing in, 45–46
signing out, 194
status widgets, 186–189
as troubleshooting tool, 234
Views view, 193
Ambari, managing Hadoop services
configuring services, 189–191
listing service accounts, 193
reporting service interruptions, 194–195
resolving service interruptions, 195–198
restarting after configuration change, 200–201
reverting to a previous version, 202–204
Services view, 189–191
starting and stopping services, 190–191
Ambari Installation Guide, 42
ambari-agent, 47–49
ambari-server, 44–45
Apache Hadoop 2 Quick-Start Guide, 226, 230–233
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with
Apache Hadoop 2, 17, 178, 221
Apache projects. See specific projects.
ApplicationHistoryServer, 20
ApplicationMaster
definition, 178–179
enabling restarts, 222
Applications. See also YARN applications.
C language, examples, 82–83
dynamic application management, online resources, 183–184
Hadoop V1, compatibility with MapReduce, 224–225
Java language, examples, 78–81
managing dynamically, 183–184
monitoring, 89–95
parallel, debugging, 124
ASF (Apache Software Foundation), 2
Avro
data transfer format, 150
in the Hadoop ecosystem, 17
B
BackupNode, 71
Backups, HDFS, 71
Balancing
DataNodes, online resources, 214
HDFS, 213–214
Bash scripts, HBase, 167
Benchmarks
terasort test, 95–96
TestDFSIO, 96–97
Big Data
characteristics of, 4
definition, 4
examples, 4–5
storing. See Data lakes.
typical size, 4
“Big Data Surprises,” 4
Block locations, printing, 213
Block replication, 67–68
Block reports, 213
-blocks option, 213
Books and publications. See also Online resources.
Apache Hadoop 2 Quick-Start Guide, 226, 230–233
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing
with Apache Hadoop 2, 17, 178, 221, 243
“Nobody Ever Got Fired for Using Hadoop,” 4
Bottlenecks, resolving, 8
Bulk data, adding, 168
Bundle jobs, 155
C
C++
on MapReduce, 119–121
WordCount example program, 119–121
C application, HDFS programming examples, 82–83
Cafarella, Michael J., 3
Capacity scheduler
administration, online resources, 247
description, 220–221
online resources, 221, 226
-cat command, 145
Cells, deleting, 167
Channel component of Flume, 148
CheckPointNode, 20, 66. See also SecondaryNameNode.
Checkpoints
HDFS, 71
NameNode, recovering from, 240–241
Cloud, installing Hadoop in. See Installing Hadoop in the cloud, with Whirr.
Cloud tools. See Whirr.
Clusters
benchmarking, examples, 245
coordinating services across, 17
executing MapReduce process V1 on, 11–12
heatmaps, 187–188
installing. See Installing Hadoop, with Ambari.
installing Hadoop in the cloud, with Whirr, 59–61
installing Spark on, 257–258
launching the cluster, 59–61
MPI (Message Passing Interface) and Hadoop on the same, 183
nodes. See NodeManagers.
preparing notes for Hadoop installation, 43–44
security settings, 193
starting Spark across, 258–259
stress testing, 245
taking down, 61
Cluster-wide settings
properties, 189–191, 198–204
security, 193
Code samples used in this book, online resources, 227
Command line scripts, compatibility with MapReduce and Hadoop V2, 224
Commands. See also specific commands.
HDFS, 75–77
Hive, 135–136
Oozie, 163
Compatibility, MapReduce and Hadoop V2
ApplicationMaster restarts, enabling, 222
command line scripts, 224
Hive queries under YARN, 225
mradmin function. See rmadmin function.
node capacity, calculating, 222
Oozie workflows under YARN, 225
org.apache.hadoop.mapred APIs, 224
org.apache.hadoop.mapreduce APIs, 224
overview, 222
Pig scripts under YARN, 225
running V1 applications, 223
sample MapReduce settings, 223
Configuration check screen, Hue GUI, 254, 255
Configuration files. See XML configuration files.
Configuration screen, Hue GUI, 254, 255
Configuring
Hadoop services, 189–191
HDFS services and proxy users, 250–251
Hive Webchat service, 251
Hue, with Ambari, 250–252
Hue GUI, 252–253
Configuring an NFSv3 gateway
configuration files, 217–218
current NFSv3 capabilities, 217
mount HDFS, 219–220
overview, 217
starting the gateway, 218–219
user name, specifying, 218
Containers, YARN
cores, setting, 208
definition, 178
memory, setting, 207
monitoring, 184
running commands in, 174–176
Control flow nodes, 155–156
Coordinator jobs, 155
Copying files, 75
core-default.xml file, 205
core-site.xml, configuring, 31–32
Corrupt file blocks, listing, 213
Corrupted files, moving and deleting, 212
create command, HBase, 165–166
CREATE TABLE command, 135–136
Cutting, Doug, 3
D
DAGs (directed acyclic graphs), 154
Dashboard view, Ambari, 186–189
Dashes (--), Pig comment delimiters, 133
Data directories, creating, 31
Data lakes. See also Big Data.
vs. data warehouses, 5–6
definition, 5
schema on read, 5
schema on write, 5
vs. traditional data storage, 5–6
traditional data transform. See ETL (extract, transform, and load).
Data locality. See also Rack awareness.
design aspect of MapReduce, 64
HDFS, 64
levels of, 69
Data streams, acquiring. See Flume.
Data summation. See Hive.
Data transfer format, 150
Data warehouses
analyzing large sets of data. See Hive.
bottlenecks, resolving, 8
vs. data lakes, 5–6
Databases
distributed, online resources, 246
in the Hadoop ecosystem, 16
listing, 144
MySQL, loading, 142–143
non-relational. See HBase.
relational. See Sqoop.
tables, listing, 144
DataNodes, 64–67
Datanodes tab, NameNode UI, 209–210
Debugging. See MapReduce, debugging; Troubleshooting.
Decommissioning
HDFS nodes, 214
YARN nodes, 206
delete command, Flume, 167
-delete command, Sqoop, 148
-delete option, 212
deleteall command, 167
Deleting
all HDFS data, 239
bypassing the Trash directory, 75
cells, 167
corrupted files, 212
data from tables, 148
directories, 75, 77
files, 75
rows, 167
tables. See Dropping tables.
dfs command (deprecated), 72
dfs.webhdfs.enabled property, setting, 250–251
Directed acyclic graphs (DAGs), 154
Directories
deleting, 75, 77
making, 76
disable command, 167
Documentation, online resources, 20, 233
Downloading/uploading files
Flume, 151
Hadoop, 30
HDFS, 72
NFS Gateway, 72
Oozie examples, 156–157
Sandbox, 23
Sqoop examples, 142–143
VirtualBox, 23
drop command, HBase, 167
DROP TABLE command, 135–136
Dropping tables with
HBase, 167
Hive, 135–136
MySQL, 148
Dryad on YARN, 182
Dynamic application management, online resources, 183–184
E
End users
adding to HDFS, 211–212
creating, 31
essential tasks, 7
Errors and messages, Java, 233–234
ETL (extract, transform, and load), 5
Examples. See also Online resources, examples.
Big Data, 4–5
C application, 82–83
of code used in this book, online resources, 227
Flume, 151–154
HBase. See HBase, examples.
Hive, 134–139
Java application, 78–81
MapReduce. See MapReduce, examples.
NameNode Federation, 71
Pig, 132–134
querying movie reviews, 136–139
YARN Distributed-Shell. See YARN Distributed-Shell, examples.
exit command
HBase, 164
Hive, 136
Exporting data from HDFS, 147–148. See also Sqoop.
F
Fair scheduler, online resources, 221
Falcon, in the Hadoop ecosystem, 17
Fault tolerance, 107–108
File system checks, 212–213
Files
configuration. See XML configuration files.
copying, 75
corrupted, moving, 212
deleting, 75
downloading/uploading. See Downloading/uploading files.
listing, 75
printing, 212–213
-files option, 212
Flink, 183
Flume
channel component, 148
components, 148
configuration files, 153–154
downloading, 151
examples, 151–154
flume.ng command, 151–152
in the Hadoop ecosystem, 17
install command, 151
installing Flume, 151
installing Telnet, 151
log, checking, 153
online resources, 246
pipelines, 149–150
sink component, 148
source component, 148
tail command, 153
testing, 151–152
uses for, 148
weblog example, 152–153
website, 153, 170
Flume User Guide, 150
flume.ng command, 151–152
Fork/join nodes, 155–156
Formatting an HDFS, 239–240
fsimage_* file
definition, 66
location, 66
updating, 67
G
-get command, 75
get command, 166
Getting
columns, 166
rows, 166
table cells, 166
GFS (Google File System), 64
Giraph, 181
Google BigTable, 163
Google Protocol Buffers, online resources, 179
Graph processing, 181
grep chaining, 121–124
Grep examples, online resources, 245
grep.java command, 121–124
GroupLens Research website, 135
Groups, creating, 31
grunt> prompt, 133
GUI (graphical user interface). See Ambari; HDFS, web GUI; Hue GUI.
H
Hadoop
core components, 1–2, 19–21. See also NameNode; specific components.
daemons, starting and stopping, 238–239
definition, 1
history of, 3–4
key features, 1–2
name origin, 3
prominent users, 3–4
uses for, 3
Hadoop, V1
design principles, 7–8
MapReduce process, 8–12
Hadoop, V2
design overview, 13–15
MapReduce compatibility. See Compatibility, MapReduce and Hadoop V2.
YARN, operation design, 13–15
Hadoop administration, 7. See also Ambari GUI.
Hadoop database, in the Hadoop ecosystem, 16
Hadoop Distributed File System (HDFS). See HDFS (Hadoop Distributed File
System).
Hadoop ecosystem, components
administration, 17
Avro, 17
Falcon, 17
Flume, 17
Hadoop database, 16
HBase, 16
HCatalog, 16
HDFS, 15
Hive, 16–17
import/export, 17
machine learning, 17
Mahout, 17
MapReduce, 15
MapReduce query tools, 16–17
Oozie, 17
Pig, 16
Sqoop, 17
workflow automation, 17
YARN, 15
YARN application frameworks, 17
Zookeeper, 17
Hadoop services
assigning to hosts, 49–52
core services, 19–21. See also HDFS (Hadoop Distributed File System);
YARN (Yet Another Resource Negotiator).
Hadoop services, managing with Ambari
configuring services, 189–191
listing service accounts, 193
reporting service interruptions, 194–195
resolving service interruptions, 195–198
restarting after configuration change, 200–201
reverting to a previous version, 202–204
Services view, 189–191
starting and stopping services, 190–191
Hadoop User Experience (Hue) GUI. See Hue (Hadoop User Experience) GUI.
Hadoop V1 applications, compatibility with MapReduce, 224–225
HADOOP_HOME, setting, 30–31
Hamster, 183
HBase
data model, 164
distributed databases, online resources, 246
features, 163
in the Hadoop ecosystem, 16
overview, 163
website, 164
on YARN, 181–182
HBase, examples
‘ ‘ (single quotes), argument delimiters, 164
adding bulk data, 168
bash scripts, 167
create command, 165–166
creating a database, 165–166
delete command, 167
deleteall command, 167
deleting cells, 167
deleting rows, 167
disable command, 167
drop command, 167
dropping tables, 167
exit command, 164
exiting HBase, 164
get command, 166
getting a row, 166
getting columns, 166
getting table cells, 166
importing tab-delimited data, 168
ImportTsv command, 168
inspecting a database, 166
required software environment, 164
scan command, 166
scripting input, 167
shell command, 164
starting HBase, 164
web interface, 168–169
HCatalog in the Hadoop ecosystem, 16
HDFS (Hadoop Distributed File System)
BackupNode, 71
backups, 71. See also Checkpoints, HDFS.
block replication, 67–68
checkpoints, 71
components, 64–67. See also DataNodes; NameNode.
data locality, 64. See also Rack awareness.
definition, 2
deleting data, 239
design features, 63–64
in the Hadoop ecosystem, 15
master/slave architecture, 65
NameNode. See NameNode.
NFS Gateway, 72
NFSv3 mounting, 72
permissions, online resources, 211
roles, 67
Safe Mode, 68
snapshots, 71–72. See also Backups, HDFS.
status report, 77
streaming data, 72
uploading/downloading files, 72
version, getting, 73
web GUI, 77, 89–95
HDFS administration
adding users, 211–212
balancing HDFS, 213–214
block reports, 213
checking HDFS health, 212–213
data snapshots, 213
decommissioning HDFS nodes, 214
deleting corrupted files, 212
Hadoop, online resources, 247
listing corrupt file blocks, 213
monitoring HDFS. See NameNode UI.
moving corrupted files, 212
NameNode UI, 208–211
online resources, 226
permissions, online resources, 211
printing block locations, 213
printing files, 212–213
printing network topology, 213
Safe Mode, 214
SecondaryNameNode, verifying, 214–215
snapshots, 215–216
snapshottable directories, 215
HDFS administration, configuring an NFSv3 gateway
configuration files, 217–218
current NFSv3 capabilities, 217
mount HDFS, 219–220
overview, 217
starting the gateway, 218–219
user name, specifying, 218
hdfs command, 72
hdfs fsck command, 212
HDFS in pseudo-distributed mode
configuring, 31–32
starting, 35–36
verifying, 37
HDFS programming examples
C application, 82–83
Java application, 78–81
HDFS user commands
copying files, 75
deleting directories, 77
deleting files, 75
dfs (deprecated), 72
general, 73–75
generic options, 74–75
get, 75
hdfs, 72
HDFS status report, 77
HDFS version, getting, 73
help for, 73
listing, 73
listing files, 75
ls, 75
making a directory, 76
mkdir, 75
overview, 72–73
put, 75
-report, 77
rm, 75
-rm, 77
skipTrash option, 76
version option, 73
hdfs-default.xml file, 205
hdfs-site.xml, configuring, 32
Help
in books and publications. See Books and publications.
for HDFS commands, 73
on websites. See Online resources.
High availability. See NameNode HA (High Availability).
Hive. See also ETL (extract, transform, and load).
; (semicolon), command terminator, 135
creating tables, 135–136
dropping tables, 135–136
examples, 134–139
exiting, 136
features, 134
in the Hadoop ecosystem, 16–17
installing, 39
queries under YARN, compatibility with MapReduce and Hadoop V2, 225
query language, online resources, 170
SQL-like queries, online resources, 246
starting, 135
Webchat service, configuring, 251
hive command, 135
Hive commands
CREATE TABLE, 135–136
DROP TABLE, 135–136
exit, 136
hive> prompt, 135
HiveQL, 134
Hortonworks HDP 2.2 Sandbox. See Sandbox.
Hortonworks Sandbox, online resources, 244
Hosts, assigning services to, 49–52
Hosts view, Ambari, 191–192
Hoya, 181–182
HPC (high-performance computing), 183
Hue (Hadoop User Experience) GUI
configuration check screen, 254, 255
configuration screen, 254, 255
configuring, 252–253
configuring HDFS services and proxy users, 250–251
configuring Hive Webchat service, 251
configuring with Ambari, 250–252
description, 249
initial login screen, 254
installing, 249–253
logging in, 253–254
main icon bar, 254, 256
online resources, 249, 254
Oozie workflow, modifying, 252
starting, 253
user interface, 253–256
Users administration screen, 255, 256
hue start command, 253
I
import command, 145–146
Import/export in the Hadoop ecosystem, 17
Importing data. See also Sqoop.
cleaning up imported files, 148
ImportTsv utility, 168
from MySQL, 144–147
tab-delimited, 168
viewing imported files, 145
ImportTsv command, 168
-includeSnapshots option, 213
-info command, Oozie, 163
INFO messages, turning off, 237
install command, 151, 252
Installation recipes, online resources, 243–244
Installing
Flume, 151
Hive, 39
Hue GUI, 249–253
Pig, 38–39. See also Installing Hadoop, in pseudo-distributed mode.
Sqoop, 143
Telnet, 151
YARN Distributed-Shell, 172–174
Installing Hadoop
in the cloud. See Installing Hadoop in the cloud, with Whirr.
single-machine versions. See Installing Hadoop, in pseudo-distributed mode;
Installing Sandbox
Installing Hadoop, in pseudo-distributed mode. See also Installing, Hive;
Installing, Pig.
configuring a single-node YARN server, 30
core components, 30
core-site.xml, configuring, 31–32
data directories, creating, 31
downloading Hadoop, 30
groups, creating, 31
HADOOP_HOME, setting, 30–31
HDFS, configuring, 31–32
HDFS services, starting, 35–36
HDFS services, verifying, 37
hdfs-site.xml, configuring, 32
Java heap sizes, modifying, 34
JAVA_HOME, setting, 30–31
log directories, creating, 31
mapred-site.xml, configuring, 33
MapReduce, example, 37–38
MapReduce, specifying a framework name, 33
minimal hardware requirements, 30
NameNode, formatting, 34
NodeManagers, configuring, 33
overview, 29–30
users, creating, 31
YARN services, starting, 36
YARN services, verifying, 37
yarn-site.xml, configuring, 33
Installing Hadoop, with Ambari
Ambari console, 45–55
Ambari dashboard, 41
Ambari screen shots, 46–55
ambari-agent, 47–49
ambari-server, 44–45
assigning services to hosts, 49–52
overview, 40–41, 42
requirements, 42–43
undoing the install, 55–56
warning messages, 53, 55
Installing Hadoop in the cloud, with Whirr
benefits of, 56
configuring Whirr, 57–59
installation procedure, 57
overview, 56–57
Installing Sandbox, VirtualBox
connecting to the Hadoop appliance, 28–29
downloading, 23
loading, 25–28
minimum requirements, 23
overview, 23–24
saving, 29
shutting down, 29
starting, 24–25
virtual machine, definition, 24
Internet, as a troubleshooting tool, 235
J
Java
applications, HDFS programming examples, 78–81
errors and messages, 233–234
heap sizes, modifying, 34
and MapReduce, online examples, 245
on MapReduce, 111–116
supported versions, online resources, 62, 243
WordCount example program, 111–116
JAVA_HOME, setting, 30–31
JIRA issues, tracking, 235
JIRAs, online resources, 235
job command, 97–98
Job information, getting, 89–95
Job types, Oozie, 154–155
JobHistoryServer, 20, 207
job.properties file, 157
Jobs, MapReduce. See MapReduce jobs.
K
-kill command, Oozie, 163
-kill option, 97–98, 125, 207
Killing
MapReduce jobs, 97–98, 125
Oozie jobs, 163
YARN applications, 207
L
-list option, 97–98, 125, 207
-list-corruptfileblocks option, 213
Listing
corrupt file blocks, 213
database tables, 144
databases, 144
files, 75
HDFS commands, 73
installed software, 193
MapReduce jobs, 97–98, 125, 207
service accounts, 193
-locations option, 213
Log aggregation, 125–126
Log management
enabling log aggregation, 125–128
online resources, 235
Logging in to Hue GUI, 253–254
Login screen, Hue GUI, 254
Logs
checking, Flume, 153
creating directories, 31
troubleshooting, 234–235
-logs command, Oozie, 163
-ls command, 75
M
Machine learning in the Hadoop ecosystem, 17
Mahout in the Hadoop ecosystem, 17
Main icon bar, Hue GUI, 254, 256
Management screen in Ambari, opening, 194
Managing. See also ResourceManager.
applications, dynamically, 183–184
Hadoop, graphical interface for. See Ambari.
HDFS. See HDFS administration.
hosts, with Ambari, 191–192
log, 125–128
MapReduce jobs, 97–98, 125
workflow, MapReduce, V1, 10
YARN. See YARN administration.
mapred commands. See specific commands.
mapred.child.java.opts property, 208
mapred-default.xml file, 205
mapred-site.xml, configuring, 33, 207
mapred-site.xml file, 222–223
MapReduce
chaining, 121–124
compatibility with Hadoop V2. See Compatibility, MapReduce and Hadoop
V2.
data locality, 64
debugging examples, online resources, 245
example, pseudo-distributed mode, 37–38
fault tolerance, 107–108
in the Hadoop ecosystem, 15
hardware, 108
heap size, setting, 208
log management, 125–127
node capacity, compatibility with MapReduce and Hadoop V2, 222
parallel data flow, 104–107
Pipes interface, 119–121
programming, online resources, 245
programming examples, online resources, 245
properties, setting, 208
query tools in the Hadoop ecosystem, 16–17
replacing with Tez engine, 181
resource limits, setting, 208
vs. Spark, 182
specifying a framework name, 33
speculative execution, 108
streaming interface, 116–119
in the YARN framework, 181
MapReduce, debugging
checking job status, 125
example, online resources, 245
killing jobs, 125
listing jobs, 125
log aggregation, 125–126
log management, 125–128
parallel applications, 124
recommended install types, 124–125
viewing logs, 127–128
MapReduce, examples
computational model, 101–104
files for, 85–86
listing available examples, 86–87
monitoring, 89–95
pi program, 37–38, 87–89
V1, 8–10
War and Peace example, 101–104
word count program, 101–104, 104–107. See also WordCount example
program.
MapReduce, multi-platform support
C++, 119–121
Java, 111–116
Python, 116–119
MapReduce computational model
example, 101–104
important properties, 103
mapping process, 101–104
MapReduce jobs
finding, 97–98, 125
killing, 97–98, 125
listing, 97–98, 125
managing, 97–98, 125
status check, 97–98, 125, 207
MapReduce process, V1
advantages, 10–11
availability, 12
basic aspects, 10
design principles, 7–8
example, 8–10
fault tolerance, 11
Job Tracker, 12
limitations, 12
managing workflow, 10
mapping step, 8–10
master control process, 12
reducing step, 8–10
resource utilization, 12
scalability, 10, 12
support for alternative paradigms and services, 12
Task Tracker, 12
typical job progress, 12
MapReduce process, V2
advantages of, 14–15
Job Tracker, 13
overview, 13
typical job progress, 13–14
MapReduce V2 administration, online resources, 226, 247
mapreduce.am.max-attempts property, 222
mapreduce.map.java.opts property, 223
mapreduce.map.memory.mb property, 208, 223
mapreduce.reduce.java.opts property, 208, 223
mapreduce.reduce.memory.mb property, 208, 223
Master user job, YARN applications, 178–179
Masters. See NameNode.
Master/slave architecture in HDFS, 65
Messages, during an Ambari installation, 53
Messages and errors, Java, 233–234
Metadata, tracking changes to, 66. See also CheckPointNode.
Microsoft projects. See specific projects.
-mkdir command, 75
Monitoring. See also Zookeeper.
applications, 89–95
HDFS. See NameNode UI.
MapReduce examples, 89–95
YARN containers, 184
-move option, 212
Movie reviews, query example, 136–139
MovieLens website, 135
Moving computation vs. moving data, 8
Moving corrupted files, 212
mradmin function. See rmadmin function.
Murthy, Arun C., 13
MySQL databases, loading, 142–143
N
NameNode
description, 64–67
formatting, 34
health, monitoring. See Zookeeper.
periodic checkpoints, 66
NameNode, troubleshooting
checkpoint, recovering from, 240–241
failure and recovery, 239–241
reformatting, 239–240
NameNode Federation
description, 22
example, 71
key benefits, 70
NameNode HA (High Availability), 69–70
NameNode UI
Datanodes tab, 209–210
directory browser, 211
overview, 208–209
Overview tab, 209
Snapshot tab, 209, 216
startup progress, 209–210
Namespace Federation, 70–71
Network topology, printing, 213
newbandwidth option, 213
NFSv3 gateway, configuring
configuration files, 217–218
current NFSv3 capabilities, 217
mounting HDFS, 219–220
overview, 217
starting the gateway, 218–219
user name, specifying, 218
NFSv3 Gateway, mounting an HDFS, 72
“Nobody Ever Got Fired for Using Hadoop,” 4
Node capacity, compatibility with MapReduce and Hadoop V2, 222
Node types, Oozie, 155–156
NodeManagers
configuring, 33
definition, 20
O
Official Hadoop sources, 62
Online resources. See also Books and publications.
balancing HDFS, 214
“Big Data Surprises,” 4
Capacity scheduler administration, 221, 226
downloading Hadoop, 30
Dryad on YARN, 182
dynamic application management, 183–184
essential tools, 245–246
Fair scheduler, 221
Flink, 183
Flume, streaming data and transport, 246
Flume configuration files, 153–154
Flume User Guide, 150
Flume website, 153, 170
Google BigTable, 163
Google Protocol Buffers, 179
GroupLens Research website, 135
HBase distributed database, 246
HBase on YARN, 182
HBase website, 164
HDFS administration, 226
HDFS permissions, 211
Hive query language, 170
Hive SQL-like queries, 246
Hoya, 182
Hue GUI, 254
log management, 235
MovieLens website, 135
MPI (Message Passing Interface), 183
official Hadoop sources, 62
Oozie, 170
Oozie workflow manager, 246
Pig scripting, 134, 170, 245
Pig website, 132
REEF, 183
Sandbox, 62
Secure Mode Hadoop, 22
Slider, 183–184
Spark, 182, 260
SQL-like query language, 170
Sqoop, data import/export, 170
Sqoop RDBMS import/export, 246
Storm, 182
supported Java versions, 62
Tez, 181
VirtualBox, 62
Whirr, 56, 62
XML configuration files, 206
XML configuration files, description, 62
YARN administration, 226, 247
YARN application frameworks, 246
YARN development, 184, 246
Online resources, Ambari
administration, 246
Ambari Installation Guide, 42
Ambari Views, 193
installation guide, 62, 244
project page, 62, 244
project website, 204
troubleshooting guide, 62, 244
Whirr cloud tools, 244
Online resources, examples
debugging, 245
Grep, 245
Java MapReduce, 245
MapReduce programming, 245
Pi benchmark, 244
Pipes, 245
streaming data, 245
Terasort benchmark, 244
Online resources, Hadoop
administration, 247
Capacity scheduler administration, 247
code and examples used in this book, 227
documentation page, 20
general information, 227
HDFS administration, 247
Hortonworks Sandbox, 244
installation recipes, 243–244
JIRAs, 235
MapReduce V2 administration, 247
Oracle VirtualBox, 244
supported Java versions, 243
XML configuration files, 243
YARN administration, 247
Online resources, HDFS
background, 244
Java programming, 244
libhdfs programming in C, 244
user commands, 244
Online resources, MapReduce
background, 245
debugging, example, 245
Grep example, 245
Java MapReduce example, 245
MapReduce V2 administration, 226
Pipes example, 245
programming, 245
streaming data example, 245
Oozie
action flow nodes, 155–156
bundle jobs, 155
command summary, 163
control flow nodes, 155–156
coordinator jobs, 155
DAGs (directed acyclic graphs), 154
fork/join nodes, 155–156
in the Hadoop ecosystem, 17
job types, 154–155
node types, 155–156
online resources, 170
overview, 154
workflow jobs, 154–155
Oozie, examples
demonstration application, 160–162
downloading examples, 156–157
job.properties file, 157
MapReduce, 156–160
Oozie is not allowed to impersonate Oozie error, 159
-oozie option, setting, 159
OOZIE_URL environmental variable, 159
required environment, 156
workflow.xml file, 157
Oozie is not allowed to impersonate Oozie error, 159
-oozie option, setting, 159
Oozie workflow
modifying in the Hue GUI, 252
online resources, 246
under YARN, 225
OOZIE_URL environmental variable, 159
-openforwrite option, 213
Oracle VirtualBox
installing. See Installing Sandbox, VirtualBox.
online resources, 62, 244
org.apache.hadoop.mapred APIs, 224
org.apache.hadoop.mapreduce APIs, 224
Overview tab, NameNode UI, 209
P
Parallel applications, debugging, 124
Parallel data flow, 104–107
Parallelizing SQL queries, 146
Passwords
Ambari, 45, 186
changing, 45, 186
Hadoop appliance, 28
Hue, 253
Sqoop, 143, 145
Performance improvement
HPC (high-performance computing), 183
YARN framework, 182
Permissions for HDFS, online resources, 211
Pi benchmark, online resources, 244
pi program, MapReduce example, 37–38
Pig
-- (dashes), comment delimiters, 133
/* */ (slash asterisk...), comment delimiters, 133
; (semicolon), command terminator, 133
example, 132–134
grunt> prompt, 133
in the Hadoop ecosystem, 16
installing, 38–39
online resources, 132
running from a script, 133–134
Tez engine, 131–134
uses for, 131–132
website, 132
Pig scripting
online resources, 134, 170, 245
under YARN, 225
Pipelines, Flume, 149–150
Pipes
examples, online resources, 245
interface, 119–121
Planning resources. See Resource planning; specific resources.
Prepackaged design elements, 182–183
Printing
block locations, 213
files, 212–213
network topology, 213
Processing unbounded streams of data, 182
Programming MapReduce, online resources, 245
Progress window in Ambari, disabling, 194
Pseudo-distributed mode. See Installing Hadoop, in pseudo-distributed mode.
-put command, 75
Python, 116–119
R
Rack awareness, 68–69. See also Data locality.
-racks option, 213
RDBMS (relational database management system). See Sqoop.
REEF (Retainable Evaluator Execution Framework), 182–183
Reference material
publications. See Books and publications.
on websites. See Online resources.
Reformatting NameNode, 239–240
Relational database management system (RDBMS). See Sqoop.
Relational databases, transferring data between. See Sqoop.
Removing. See Deleting.
-report command, 77
Reporting service interruptions, 194–195
-rerun command, 163
Resolving service interruptions, 195–198
Resource management. See Oozie; YARN.
Resource planning. See also specific resources.
Resource planning, hardware, 21–22
Resource planning, software
building a Windows package, 22
installation procedures. See specific software.
NameNode Federation, 22
NameNode HA (High Availability), 22
official supported versions, 22
Secure Mode Hadoop, 22
ResourceManager. See also NodeManager; Oozie; YARN.
definition, 20
web GUI, monitoring examples, 89–95
Restarting after configuration change, 200–201
-resume command, 163
Retainable Evaluator Execution Framework (REEF), 182–183
Reverting to a previous version of Hadoop, 202–204
-rm command
deleting directories, 77
deleting files, 75
rmadmin function, 224
Rows, deleting, 167
-run command, 163
S
Safe Mode, 68, 214
-safemode enter command, 214
-safemode leave command, 214
Sandbox
installing. See Installing Sandbox.
online resources, 62
Scalable batch and stream processing, 183
scan command, 166
scheduler.minimum-allocation-mb property, 207
Scheduling. See Capacity scheduler; Oozie; ResourceManager; YARN.
Schema on read, 5
Schema on write, 5
Scripting
command line scripts, compatibility with MapReduce and Hadoop V2, 224
HBase input, 167
with Pig, 133–134, 170, 225, 245
SecondaryNameNode, 20, 214–215. See also CheckPointNode.
Secure Mode Hadoop, online resources, 22
Semicolon (;), Pig command terminator, 133
Service interruptions
reporting, 194–195
resolving, 195–198
Services. See Hadoop services.
Services view, Ambari, 189–191
shell command, 164
-shell_args option, 176–178
Simplifying solutions, 235
Single quotes (‘ ‘), HBase argument delimiters, 164
Sink component of Flume, 148
skipTrash option, 76
Slash asterisk... (/* */), Pig comment delimiters, 133
Slaves. See DataNodes.
Slider, online resources, 183–184
-snapshot command, 71
Snapshot tab, NameNode UI, 209, 216
Snapshots, HDFS
description, 71–72
including data from, 213
overview, 215
restoring deleted files, 215–216
sample screen, 216
snapshottable directories, 215
Snapshottable directories, 215
Source component of Flume, 148
Spark
installing, 260
vs. MapReduce, 257
online resources, 260
Speculative execution, 108
--split-by option, 146
SQL queries, parallelizing, 146
SQL-like query language, online resources, 170. See also Hive.
Sqoop
data import/export, online resources, 170
database connectors, 141
in the Hadoop ecosystem, 17
importing/exporting data, 139–142
overview, 139
RDBMS import/export, online resources, 246
version changes, 140–142
Sqoop, examples
-cat command, 145
cleaning up imported files, 148
-delete command, 148
deleting data from tables, 148
downloading Sqoop, 142–143
-drop command, 148
dropping tables, 148
exporting data from HDFS, 147–148
import command, 145–146
importing data from MySQL, 144–147
installing Sqoop, 143
listing database tables, 144
listing databases, 144
loading MySQL database, 142–143
overview, 142
parallelizing SQL queries, 146
setting Sqoop permissions, 143–144
software environment, 142
--split-by option, 146
viewing imported files, 145
wget command, 143
-start command, 163
Starting and stopping. See also specific programs.
Hadoop daemons, 238–239
Hadoop services, 190–191
Status check, MapReduce jobs, 97–98, 125, 207
-status option, 97–98, 125, 207
Status widgets, Ambari, 186–189
stderr, MapReduce log, 125
stdout, MapReduce log, 125
Storm, 182
Streaming data
examples, online resources, 245
from HDFS, 72
Streaming interface, 116–119
-submit command, 163
-suspend command, 163
syslog, MapReduce log, 125
System administration. See HDFS administration; YARN administration.
T
Tables
cells, getting, 166
creating, 135–136
deleting. See Tables, dropping.
deleting data from, 148
listing, 144
Tables, dropping with
HBase, 167
Hive, 135–136
MySQL, 148
tail command, 153
Task neutrality, YARN, 13
Terasort benchmark, 244
terasort test, 95–96
Terminal initialization failed error message, 40
TestDFSIO benchmark, 96–97
Testing Flume, 151–152
Tez engine
Hive, 134
online resources, 181
Pig, 131–134
replacing MapReduce, 181
Tools, online resources, 245–246
Traditional data transform. See ETL (extract, transform, and load).
Transforming data. See Sqoop.
Trash directory, bypassing, 75
Troubleshooting
with Ambari, 234
Apache Hadoop 2 Quick-Start Guide, 229–233
checking the logs, 234–235
debugging a MapReduce example, online resources, 245
deleting all HDFS data, 239
examining job output, 238
formatting HDFS, 239–240
Hadoop documentation online, 233
INFO messages, turning off, 237
Java errors and messages, 233–234
with MapReduce information streams, 235–238
searching the Internet, 235
simplifying solutions, 235
starting and stopping Hadoop daemons, 238–239
this book as a resource, 229–233
tracking JIRA issues, 235
Troubleshooting, NameNode
checkpoint, recovering from, 240–241
failure and recovery, 239–241
reformatting, 239–240
Troubleshooting guide, online resources, 244
U
Undoing an Ambari install, 55–56
Uploading files. See Downloading/uploading files.
uptime command, 174–175
User interface. See HDFS, web GUI; Hue (Hadoop User Experience) GUI.
Users. See End users.
Users administration screen, Hue GUI, 255
V
version option, 73
Views view, Ambari, 193
VirtualBox
installing. See Installing Sandbox, VirtualBox.
online resources, 62, 244
W
War and Peace example, 101–104
Warning messages. See Messages and errors.
Web interface. See HDFS, web GUI; ResourceManager, web GUI.
Webchat service, configuring, 251
WebProxy server, 206
wget command, 143
Whirr
cloud tools for Ambari, online resources, 244
installing Hadoop in the cloud. See Installing Hadoop in the cloud, with
Whirr.
online resources, 56, 62
Word count program, example, 101–104, 104–107
WordCount example program
C++-based, 119–121
grep chaining, 121–124
grep.java command, 121–124
Java-based, 111–116
Python-based, 116–119
WordCount.java program, 111–116
Workflow, managing. See Oozie; ResourceManager; YARN (Yet Another
Resource Negotiator).
Workflow automation in the Hadoop ecosystem, 17
Workflow jobs, 154–155
workflow.xml file, 157
X
XML configuration files
description, 62
editing, 20–21
functions of, 205
list of, 205
location, 20
online resources, 206, 243
properties of, 206
Y
YARN (Yet Another Resource Negotiator).
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing
with Apache Hadoop 2, 17
application frameworks, 13
Capacity scheduler, 220–221
definition, 2
development, online resources, 246
in the Hadoop ecosystem, 15
job information, getting, 89–95
log aggregation, 125–126
operation design, 13–15
scheduling and resource management, 13–15
services, starting, 36
task neutrality, 13
YARN administration
container cores, setting, 208
container memory, setting, 207
decommissioning YARN nodes, 206
JobHistoryServer, 207
managing YARN applications, 178–179
managing YARN jobs, 207
MapReduce properties, setting, 208
online resources, 226, 247
YARN WebProxy server, 206
YARN application frameworks
Dryad on YARN, 182
Flink, 183
Giraph, 181
graph processing, 181
Hadoop ecosystem, components, 17
in the Hadoop ecosystem, 17
Hamster, 183
HBase on YARN, 181–182
Hoya, 181–182
HPC (high-performance computing), 183
MapReduce, 181
online resources, 246
performance improvement, 182
prepackaged design elements, 182–183
processing unbounded streams of data, 182
REEF, 182–183
resource management, 180
scalable batch and stream processing, 183
Spark, 182
Storm, 182
Tez, 181
YARN Distributed-Shell, 180
YARN applications
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing
with Apache Hadoop 2, 178
ApplicationMaster, 178–179
developing, online resources for, 184
killing, 207
managing, 178–179
master user job, 178–179
resource management, 178–179
structure of, 178–179
vs. YARN components, 179–180
YARN applications, containers
definition, 178
monitoring, 184
running commands in, 174–176
YARN components vs. YARN applications, 179–180
YARN Distributed-Shell
description, 171
installing, 172–174
in the YARN framework, 180
YARN Distributed-Shell, examples
running commands in containers, 174–176
with shell arguments, 176–178
-shell_args option, 176–178
uptime command, 174–175
YARN server, single-node configuration, 30
YARN WebProxy server, 206
yarn-default.xml file, 205
yarn.nodemanager.resource.cpu-vcores property, 208
yarn.nodemanager.resource.memory-mb property, 207, 223
yarn.nodemanager.vmem-pmem-ratio property, 223
yarn.resourcemanager.am.max-retries property, 222
yarn.scheduler.maximum-allocation-mb property, 207, 223
yarn.scheduler.maximum-allocation-vcores property, 208
yarn.scheduler.minimum-allocation-mb property, 223
yarn.scheduler.minimum-allocation-vcores property, 208
yarn-site.xml file
configuring, 33
enabling ApplicationMaster restarts, 222–223
important properties, 207
Yet Another Resource Negotiator (YARN). See YARN (Yet Another Resource
Negotiator).
Z
Zookeeper, 17, 70