
Unit III

Introduction to Hadoop: Big Data – Apache Hadoop & Hadoop Eco System –
Moving Data in and out of Hadoop – Understanding inputs and outputs of
MapReduce – Data Serialization.

Introduction to File System


The first storage mechanism used by computers to store data was punch
cards. Each group of related punch cards (punch cards belonging to the same program)
used to be stored in a file, and files were stored in file cabinets.
This is very similar to what we do nowadays to archive papers in
government institutions that still use paperwork on a daily basis. This is where the
term “File System” (FS) comes from. Computer systems have evolved, but the
concept remains the same.

Figure: Storage Mechanism

What is File System?


Instead of storing information on punch cards, we can now store information
/ data in a digital format on a digital storage device such as a hard disk, flash
drive, etc.
Related data are still categorized as files;
related groups of files are stored in folders.
Each file has a name, an extension and an icon. The file name gives an indication
of the content it has, while the file extension indicates the type of information stored
in that file. For example, the EXE extension refers to executable files, TXT refers to
text files, etc.
A file management system is used by the operating system to access the files
and folders stored in a computer or on any external storage device.

What is Distributed File System?


In Big Data, we often deal with multiple clusters (computers). One of the
main advantages of Big Data is that it goes beyond the capabilities of one
single super powerful server with extremely high computing power.
The whole idea of Big Data is to distribute data across multiple clusters and
to make use of the computing power of each cluster (node) to process information.
A distributed file system is a system that can handle accessing data across
multiple clusters (nodes).
DFS has two components:
 Location Transparency: Location transparency is achieved through the
namespace component. Transparent means that each user within the system may
access all the data within all databases as if they were a single database.
 Redundancy: Redundancy is achieved through a file replication component.

Features of DFS
 Transparency
• Structure transparency: There is no need for the client to know
about the number or locations of file servers and storage devices.
Multiple file servers should be provided for performance, adaptability, and
dependability.
• Access transparency: Both local and remote files should be
accessible in the same manner. The file system should automatically
locate the accessed file and send it to the client’s side.
• Naming transparency: There should not be any hint in the name
of the file to the location of the file. Once a name is given to the file, it
should not be changed while being transferred from one node to another.
• Replication transparency: If a file is copied on multiple nodes,
both the copies of the file and their locations should be hidden from one node
to another.

 User mobility: It will automatically bring the user’s home directory to the
node where the user logs in.

 Performance: Performance is based on the average amount of time needed to
satisfy client requests. This time covers the CPU time + the time taken to
access secondary storage + the network access time. It is advisable that the
performance of the Distributed File System be comparable to that of a centralized
file system.

 Simplicity and ease of use: The user interface of a file system should be
simple and the number of commands should be small.

 High availability: A Distributed File System should be able to continue operating in
case of partial failures like a link failure, a node failure, or a storage drive
crash. A highly reliable and adaptable distributed file system should have
different and independent file servers controlling different and
independent storage devices.

How Distributed file system (DFS) works?


A distributed file system works as follows:
• Distribution: Distribute blocks of data sets across multiple nodes.
Each node has its own computing power, which gives DFS the ability
to process data blocks in parallel.

• Replication: The distributed file system also replicates data blocks
on different clusters by copying the same pieces of
information into multiple clusters on different racks.
This helps to achieve the following:
• Fault Tolerance: recover a data block in case of a cluster failure
or rack failure.
Data replication is a good way to achieve fault tolerance and
high concurrency, but it is very hard to maintain frequent
changes. Assume that someone changed a data block on one
cluster; these changes need to be updated on all replicas of
this block.

Figure 4: Fault Tolerance Concept



• High Concurrency: make the same piece of data available to be processed by multiple
clients at the same time. This is done using the
computation power of each node to process
data blocks in parallel.

Advantages of Distributed File System


 Scalability: You can scale up your infrastructure by adding more racks or
clusters to your system.
 Fault Tolerance: Data replication will help to achieve fault tolerance in the
following cases:
• The cluster is down
• A rack is down
• A rack is disconnected from the network
• Job failure or restart
 High Concurrency: utilize the compute power of each node to handle multiple
client requests (in a parallel way) at the same time.
• DFS allows multiple users to access or store data.
• It allows data to be shared remotely.
• It improves file availability, access time and
network efficiency.
• It improves the capacity to change the size of the data and
also improves the ability to exchange data.
• A Distributed File System provides transparency of data
even if a server or disk fails.

Disadvantages of Distributed File System (DFS)


 In a Distributed File System, nodes and connections need to be secured; therefore
we can say that security is at stake.
 There is a possibility of loss of messages and data in the network while
moving from one node to another.
 Database connections in a Distributed File System are complicated.
 Handling of the database is also not as easy in a Distributed File System as
in a single-user system.
 There is a chance of overloading if all nodes try to send data at once.

Introduction to Hadoop:
 Hadoop is an open-source project of the Apache foundation.
 Hadoop is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models.
 It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
 Rather than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may
be prone to failures.
 In simple words, Hadoop is a software library that allows its users to
process large datasets across distributed clusters of computers, thereby
enabling them to gather, store and analyze huge sets of data.
 Hadoop is now a core part of the computing infrastructure for companies
such as Yahoo, Facebook, LinkedIn, Twitter, etc.

Features of Hadoop
1. It is optimized to handle massive quantities of structured, semi-structured
and unstructured data, using commodity hardware, that is, relatively
inexpensive computers.
2. Hadoop has a shared-nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the
data can still be processed from another machine that stores its replica.
4. Hadoop is for high throughput rather than low latency. It is a batch operation
handling massive quantities of data; therefore the response time is not
immediate.
5. It complements On-Line Transaction Processing(OLTP) and On-Line
Analytical Processing(OLAP). However, it is not a replacement for a
relational database management system.
6. It is NOT good when work cannot be parallelized or when there are
dependencies within the data.
7. It is NOT good for processing small files. It works best with huge data files
and datasets.

Key Advantages of Hadoop


 Distributed Storage: Hadoop stores large data sets across multiple
machines, allowing for the storage and processing of extremely large
amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
 Cost-Effective: Owing to its scale-out architecture, Hadoop has a much
lower cost per terabyte of storage and processing.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it
can continue to operate even in the presence of hardware failures.
 Flexible Data Processing: Hadoop’s MapReduce programming model
allows for the processing of data in a distributed fashion, making it easy to
implement a wide variety of data processing tasks.
 Fast: Processing is extremely fast in Hadoop as compared to other
conventional systems owing to the “move code to data” paradigm.

Hadoop Versions:
As the data being stored and processed grows in complexity, so does
Hadoop, where the developers bring out various versions to address issues (bug
fixes) and simplify complex data processes. The updates are automatically
implemented as Hadoop development follows the trunk (base code) – branch
(fix) model.
Hadoop has two versions:
 Hadoop 1.x (Version 1)
 Hadoop 2 (Version 2)
1. Hadoop 1.x
Below are the Components of Hadoop 1.x
1. The Hadoop Common Module is a jar file which acts as the base API on
top of which all the other components work.
2. It no longer receives new updates, since it is the 1st version.
3. Maximum of 4000 nodes only for each cluster
4. The functionality is limited utilizing the slot concept, i.e., the slots are
capable of running a map task or a reduce task.
5. HDFS is used as the distributed storage system, designed to cater to
large data, with a block size of 64 megabytes (64 MB) supporting the
architecture. It is further divided into two components:
 Name Node, which is used to store metadata about the Data Nodes and is
placed with the Master Node. It contains details about the slave
nodes, such as indexing and their respective locations, along
with timestamps for timelining.
 Data Nodes, used for storage of data related to the applications in
use, placed on the Slave Nodes.
6. Hadoop 1 uses Map Reduce (MR) data processing model. It is not capable
of supporting other non-MR tools.
MR has two components:
 Job Tracker is used to assign or reassign MapReduce tasks to an
application called the Task Tracker located in the cluster nodes. It
additionally maintains a log of the status of each Task Tracker.
 Task Tracker is responsible for executing the functions which have
been allocated by the Job Tracker and for sending status reports
of those tasks back to the Job Tracker.
7. The network of the cluster is formed by organizing the master node and
slave nodes.
8. Whenever a large storage operation for big data set is received by the
Hadoop system, the data is divided into decipherable and organized
blocks that are distributed into different nodes.

2. Hadoop Version 2
Version 2 for Hadoop was released to provide improvements over the lags
which the users faced with version 1.
Improvements that the new version provides:
 HDFS Federation: The prior HDFS architecture allowed only a
single namespace for the entire cluster. In that configuration, a single Name Node manages the
namespace. If the Name Node fails, the cluster as a whole is out of
service. The cluster stays unavailable until the Name Node restarts or is
brought up on a separate machine. Federation overcomes this limitation by
adding support for many Name Nodes/Namespaces (a layer responsible for
managing the directories, files and blocks) in HDFS.
 YARN(Yet Another Resource Negotiator)
 Version 2.7.x - Released on 31st May 2018: Provides two major pieces of
functionality, a per-application ApplicationMaster and a
global ResourceManager, thereby improving its overall utility and versatility,
and increasing scalability up to 10,000 nodes for each cluster.
 Version 2.8.x - Released in September 2018: The Capacity Scheduler is
designed to provide multi-tenancy support for processing data over Hadoop,
and it has been made accessible for Windows users so that there is an
increase in the rate of adoption of the software across the industry for
dealing with problems related to big data.

Version 3 (latest running Hadoop version)


 Version 3.1.x – released on 21 October 2019: This update enables Hadoop
to be utilized as a platform to serve a large range of data analytics functions
and utilities performed over event processing, alongside real-time
operations, to give better results.
 It also improves work on the container concept, which
enables Hadoop to perform generic (non-MapReduce) tasks which were earlier not possible with
version 1.
 Version 3.2.1 - released on 22nd September 2019: It addresses issues of
non-functionality (in terms of support) of DataNodes for multi-tenancy and
the need for alternate data storage, which is
needed for real-time processing and graphical analysis.

Hadoop EcoSystem
Apache Hadoop is an open source framework intended to make interaction
with big data easier.
However, for those who are not acquainted with this technology, one
question arises: what is big data?
Big data is a term given to data sets which can’t be processed in an
efficient manner with the help of traditional methodologies such as RDBMS.
Hadoop has made its place in the industries and companies that need to work
on large data sets which are sensitive and need efficient handling.

Being a framework, Hadoop is made up of several modules that are


supported by a large ecosystem of technologies.
Hadoop Ecosystem can be defined as a comprehensive collection of tools
and technologies that can be effectively implemented and deployed to provide Big
Data solutions in a cost-effective manner.
MapReduce and HDFS are two core components of the Hadoop ecosystem
that provide a great starting point to manage Big Data, however they are not
sufficient to deal with the Big Data challenges.
Hadoop Ecosystem is neither a programming language nor a service, it is
a platform or framework which solves big data problems. You can consider it as
a suite which encompasses a number of services (ingesting, storing, analyzing
and maintaining) inside it.
All these elements enable users to process large datasets in real time and
provide tools to support various types of Hadoop projects, schedule jobs and
manage cluster resources.
Sqoop
Sqoop (SQL-to-Hadoop) gets its name because SQL databases are pulled
into the Hadoop system with it: SQL to Hadoop.
Sqoop is a tool used for data transfer between Hadoop
and external datastores, such as relational databases (MS SQL Server, MySQL).

To process data using Hadoop, the data first needs to be loaded into Hadoop
clusters from several sources.
Sqoop is also a command-line interpreter, which sequentially executes
Sqoop commands.
Sqoop can be effectively used by non-programmers as well and relies on
underlying technologies like HDFS and MapReduce.

Sqoop Architecture
1. The client submits the import/ export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise
data warehouse, document-based systems, and a relational database. We
have a connector for each of these; connectors help to work with a range of
accessible databases.
3. Multiple mappers perform map tasks to load the data on to HDFS.

4. Similarly, numerous map tasks will export the data from HDFS on to
RDBMS using the Sqoop export command.

Sqoop Import
The diagram below represents the Sqoop import mechanism.
In this example, a company’s data is present in the RDBMS. All this
metadata is sent to the Sqoop import tool. Sqoop then performs an introspection of the
database to gather metadata (primary key information).
It then submits a map-only job. Sqoop divides the input dataset into splits
and uses individual map tasks to push the splits to HDFS.
Few of the arguments used in Sqoop import are shown below:
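As a hedged sketch (not the original figure), a typical import invocation combines the common arguments --connect, --username, --password-file, --table, --target-dir and --num-mappers; the hostnames, table names and paths below are placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser --password-file /user/hadoop/.dbpass \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4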

Sqoop Export
1. The first step is to gather the metadata through introspection.
2. Sqoop then divides the input dataset into splits and uses individual map tasks
to push the splits to RDBMS.
Few of the arguments used in Sqoop export:
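As a hedged sketch, a typical export invocation uses --connect, --table, --export-dir and --input-fields-terminated-by; the database, table and directory names below are placeholders:

sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser --password-file /user/hadoop/.dbpass \
  --table order_totals \
  --export-dir /user/hadoop/order_totals \
  --input-fields-terminated-by ','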

Flume
 Apache Flume is a tool/service/data-ingestion mechanism for collecting,
aggregating and transporting large amounts of streaming data such as log
files and events from various sources to a centralized data store.
 Flume is also used for collecting data from various social media websites
such as Twitter and Facebook.
 Flume is used for real-time data capturing in Hadoop.
 It can be applied to assemble a wide variety of data such as network traffic,
data generated via social networking, business transaction data and emails.
 Flume is a highly reliable, distributed, and configurable tool. It is principally
designed to copy streaming data (log data) from various web servers to
HDFS.
 It has a simple and very flexible architecture based on streaming data flows.
 It is quite robust and fault tolerant.

The major difference between Flume and Sqoop is that:


 Flume only ingests unstructured data or semi-structured data into HDFS.
While Sqoop can import as well as export structured data from RDBMS or
Enterprise data warehouses to HDFS or vice versa.
Flume Architecture
There is a Flume agent which ingests the streaming data from various data
sources to HDFS.
From the diagram, you can easily understand that the web server indicates the
data source.
Twitter is among one of the famous sources for streaming data.
The Flume agent has three components: source, channel and sink (a small launch sketch follows this list).
1. Source: It accepts the data from the incoming stream and stores the data
in the channel.
2. Channel: It acts as the local storage or the primary storage. A channel is
temporary storage between the source of data and the persistent data in HDFS.
3. Sink: The last component, the sink, collects the data from the channel
and commits or writes the data to HDFS permanently.
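Once those three components are defined in a properties file (a sample configuration appears later, in the section on moving data into Hadoop), the agent is started from the command line. A minimal sketch, assuming the configuration file is named flume.conf and the agent is named agent1:

flume-ng agent --conf ./conf --conf-file flume.conf --name agent1 -Dflume.root.logger=INFO,console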

ZooKeeper
Before Zookeeper, it was very difficult and time consuming to coordinate
between different services in Hadoop Ecosystem. The services earlier had many
problems with interactions like common configuration while synchronizing data.
Even if the services are configured, changes in the configurations of the services
make it complex and difficult to handle. The grouping and naming was also a time-
consuming factor.
Due to the above problems, Zookeeper was introduced. It saves a lot of time
by performing synchronization, configuration maintenance, grouping and naming.
Although it’s a simple service, it can be used to build powerful solutions.

 Apache ZooKeeper is an open source distributed coordination service that


helps to manage a large set of hosts.
 Apache ZooKeeper is a service used by a cluster (group of nodes) to
coordinate between themselves and maintain shared data with robust
synchronization techniques.
 It is a centralized service for maintaining configuration information and naming services.
 Zookeeper is a highly reliable distributed coordination kernel, which can be
used for distributed locking, configuration management, leader election,
work queues, and so on.
 Zookeeper is a replicated service that holds the metadata of distributed
applications.
 It is a central store of key-value using which distributed systems can
coordinate.
The common services provided by ZooKeeper are as follows −
 Naming service − Identifying the nodes in a cluster by name. It is similar to
DNS, but for nodes.
 Configuration management − Latest and up-to-date configuration
information of the system for a joining node.
 Cluster management − Joining / leaving of a node in a cluster and node
status at real time.
 Leader election − Electing a node as leader for coordination purpose.
 Locking and synchronization service − Locking the data while modifying it.
This mechanism helps you in automatic fail recovery while connecting other
distributed applications like Apache HBase.
 Highly reliable data registry − Availability of data even when one or a few
nodes are down.

Architecture of Zookeeper
Client-Server architecture is used by Apache Zookeeper. The five
components that make up the Zookeeper architecture are as follows:
 Server: When any client connects, the server sends an acknowledgment.
The client will automatically forward the message to another server if the
connected server doesn't respond.
 Client: One of the nodes in the distributed application cluster is called
Client. You can access server-side data more easily as a result. Each client
notifies the server that it is still alive regularly with a message.
 Leader: A Leader server is chosen from the group of servers. The client is
informed that the server is still live and is given access to all the data. If any
of the connected nodes failed, automatic recovery would be carried out.
 Follower: A follower is a server node that complies with the instructions of
the leader. Client read requests are handled by the associated Zookeeper
server. The Zookeeper leader responds to client write requests.
 Ensemble/Cluster: A cluster or ensemble is a group of Zookeeper servers.
You can run the ZooKeeper infrastructure in cluster
mode to keep the system functioning at its best.
 ZooKeeperWebUI: You must utilize WebUI if you wish to deal with
ZooKeeper resource management. Instead of utilizing the command line, it
enables using the web user interface to interact with ZooKeeper. It allows
for a quick and efficient connection with the ZooKeeper application.
How does Zookeeper Works?
Hadoop ZooKeeper is a distributed application that uses a simple client-
server architecture, with clients acting as service-using nodes and servers as
service-providing nodes.
The ZooKeeper ensemble is the collective name for several server nodes.
One ZooKeeper client is connected to at least one ZooKeeper server at any one
time. Because a master node is dynamically selected by the ensemble in consensus,
an ensemble of Zookeeper is often an odd number, ensuring a majority vote.
If the master node fails, a new master is quickly selected and replaces the
failed master.
In addition to the master and slaves, Zookeeper also has watchers. Scaling
was a problem, therefore observers were brought in. The performance of writing
will be impacted by the addition of slaves because voting is an expensive
procedure. Therefore, observers are slaves who perform similar tasks to other
slaves but do not participate in voting.
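To make the coordination services concrete, here is a minimal sketch using the ZooKeeper command-line client (zkCli.sh) that ships with ZooKeeper; the server address, znode name and values are assumptions:

zkCli.sh -server zk1.example.com:2181 create /app_config config_v1
zkCli.sh -server zk1.example.com:2181 get /app_config
zkCli.sh -server zk1.example.com:2181 set /app_config config_v2
zkCli.sh -server zk1.example.com:2181 ls /

Every connected client sees the same /app_config znode, which is how distributed applications share configuration through ZooKeeper.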
Finally, another related application is Flume, a distributed, reliable and
available service for efficiently collecting, aggregating and moving large amounts of
log data into the HDFS system; hence it is used for data ingestion.

Oozie
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a
distributed environment.
It allows multiple complex jobs to be combined and run in a sequential order to
achieve a bigger task.
Within a sequence of tasks, two or more jobs can also be programmed to run
in parallel with each other.
One of the main advantages of Oozie is that it is tightly integrated with
Hadoop stack supporting various Hadoop jobs like Hive, Pig, Sqoop as well as
system-specific jobs like Java and Shell.

Oozie detects completion of tasks through callback and polling.


When Oozie starts a task, it provides a unique callback HTTP URL to the
task, and notifies that URL when it is complete. If the task fails to invoke the
callback URL, Oozie can poll the task for completion.

There are three kinds of Oozie jobs:


1. Oozie workflow: These are represented as Directed Acyclic Graphs (DAGs)
to specify a sequence of actions to be executed. These are
sequential set of actions to be executed. You can assume
it as a relay race. Where each athlete waits for the last one
to complete his part.
2. Oozie Coordinator: These consist of workflow jobs triggered by time and
data availability. These are the Oozie jobs which are
triggered when the data is made available to it. Think of
this as the response-stimuli system in our body. In the
same manner as we respond to an external stimulus, an
Oozie coordinator responds to the availability of data
and it rests otherwise.
3. Oozie Bundle: These can be referred to as a package of multiple coordinator
and workflow jobs.
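Regardless of the job kind, submission is typically done through the Oozie command-line client. A hedged sketch, assuming the Oozie server runs on oozie-host at the default port 11000 and that job.properties points to a workflow already uploaded to HDFS:

oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>

The first command returns a job ID; the second (with that ID substituted for the placeholder) reports the status of the running workflow.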
Pig
Pig provides a scripting language on top of Hadoop MapReduce. Instead of going
into the complication of writing a complex MapReduce application program, a simpler
scripting view is provided, and that language is called Pig
Latin; it is useful for data analysis and as a dataflow.
So, it is based on a data flow model, and it was originally developed at
Yahoo in 2006.
It gives you a platform for building data flow for ETL (Extract, Transform
and Load), processing and analyzing huge data sets.

PIG has two parts:


 Pig Latin, the language, which has an SQL-like command structure.
10 lines of Pig Latin ≈ 200 lines of MapReduce Java code
 Pig runtime, the execution environment.

You can better understand it as Java and JVM.


But don’t be shocked when I say that at the back end of a Pig job, a
MapReduce job executes.
 The compiler internally converts pig latin to MapReduce. It produces a
sequential set of MapReduce jobs, and that’s an abstraction (which works
like black box).

How Pig works?


In Pig, the LOAD command first loads the data. Then we perform various
functions on it like grouping, filtering, joining, sorting, etc. At last, either you can
dump the data on the screen or you can store the result back in HDFS. A small
word-count sketch is shown below.
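A minimal sketch of that flow, written as a shell snippet that creates a small Pig Latin word-count script and runs it; the input and output paths are assumptions:

cat > wordcount.pig <<'EOF'
lines  = LOAD '/user/hadoop/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/user/hadoop/wordcount_out';
EOF
pig -x mapreduce wordcount.pig

The compiler turns these five Pig Latin statements into one or more MapReduce jobs, which is the abstraction described above.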

Mahout
 Mahout which is renowned for machine learning.
 Mahout provides an environment for creating scalable machine learning
applications. Machine learning algorithms allow us to build self-
learning machines that evolve by themselves without being explicitly
programmed. Based on user behaviour, data patterns and past experiences, they
make important future decisions.
 You can call it a descendant of Artificial Intelligence (AI).
 Mahout provides a command line to invoke various algorithms. It has a
predefined set of library which already contains different inbuilt algorithms
for different use cases.
What does Mahout do?
It performs collaborative filtering, clustering and classification. Some people
also consider frequent itemset mining as a Mahout function. Let us understand
them individually:
1. Collaborative filtering: Mahout mines user behaviours, their patterns and
their characteristics, and based on these it predicts and makes recommendations
to users. The typical use case is an e-commerce website.
2. Clustering: It organizes a similar group of data together like articles can
contain blogs, news, research papers etc.
3. Classification: It means classifying and categorizing data into various
subdepartments like articles can be categorized into blogs, news, essay,
research papers and other categories.
4. Frequent itemset mining: Here Mahout checks which objects are likely
to appear together and makes suggestions if one of them is missing. For
example, a cell phone and a cover are generally bought together. So, if you
search for a cell phone, it will also recommend the cover and cases.

R Connectors
Oracle R Connector for Hadoop is a collection of R packages that provide:
 Interfaces to work with Hive tables, the Apache Hadoop compute
infrastructure, the local R environment, and Oracle database tables
 Predictive analytic techniques, written in R or Java as Hadoop MapReduce
jobs, that can be applied to data in HDFS files

You install and load this package as you would any other R package. Using
simple R functions, you can perform tasks like these:
 Access and transform HDFS data using a Hive-enabled transparency layer
 Use the R language for writing mappers and reducers
 Copy data between R memory, the local file system, HDFS, Hive, and
Oracle databases
 Schedule R programs to execute as Hadoop MapReduce jobs and return the
results to any of those locations

Several analytic algorithms are available in Oracle R Connector for Hadoop:


linear regression, neural networks for prediction, matrix completion using low rank
matrix factorization, clustering, and non-negative matrix factorization. They are
written in either Java or R.
To use Oracle R Connector for Hadoop, you should be familiar with
MapReduce programming, R programming, and statistical methods.

Oracle R Connector for Hadoop APIs


Oracle R Connector for Hadoop provides access from a local R client to
Apache Hadoop using functions with these prefixes:
 hadoop: Identifies functions that provide an interface to Hadoop MapReduce
 hdfs: Identifies functions that provide an interface to HDFS
 orch: Identifies a variety of functions; orch is a general prefix for ORCH
functions
 ore: Identifies functions that provide an interface to a Hive data store
Oracle R Connector for Hadoop uses data frames as the primary object type, but
it can also operate on vectors and matrices to exchange data with HDFS. The APIs
support the numeric, integer, and character data types in R.
HIVE
 Hive is a distributed data management system for Hadoop.
 It supports an SQL-like query option, HiveQL (HQL), to access big data.
 Basically, it makes the storage and the analysis much
easier.
 It runs on top of Hadoop
 Facebook created HIVE for people who are fluent with SQL. Thus, HIVE
makes them feel at home while working in a Hadoop Ecosystem.
 Basically, HIVE is a data warehousing component which performs reading,
writing and managing large data sets in a distributed environment using
SQL-like interface.
o HIVE + SQL = HQL
 The query language of Hive is called Hive Query Language (HQL), which is
very similar to SQL.
 It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
 The Hive Command line interface is used to execute HQL commands.
 Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC) drivers are used to establish connections to data storage.
 Hive is highly scalable, as it can serve both purposes, i.e. large data set
processing (i.e. batch query processing) and real-time processing (i.e.
interactive query processing).
 It supports all primitive SQL data types.
 You can use predefined functions, or write tailored user-defined functions
(UDFs), to accomplish your specific needs; a short HQL sketch follows.
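A minimal sketch of HQL run through the hive command-line client; the table definition, file path and query are assumptions, not part of any particular deployment:

hive -e "
CREATE TABLE IF NOT EXISTS sales (id INT, product STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hadoop/sales.csv' INTO TABLE sales;
SELECT product, SUM(amount) AS total FROM sales GROUP BY product;
"

Behind the scenes, Hive compiles the SELECT into distributed jobs (MapReduce or similar engines), which is what makes the SQL-like interface scale to large data sets.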
HBASE
 HBase is an open source, non-relational distributed database developed by
Apache Software Foundation. In other words, it is a NoSQL database.
 It supports all types of data and that is why, it’s capable of handling anything
and everything inside a Hadoop ecosystem.
 It was initially modeled on Google’s Bigtable. HBase
can hold extremely large data sets for storage and writable purposes;
it is based on a dynamic data model and it is not a relational DBMS.
 The HBase was designed to run on top of HDFS and provides BigTable like
capabilities.
 HBase can store massive amounts of data from terabytes to petabytes
 HBASE is a key component of the Hadoop stack and its design caters to
applications that require really fast random access to significant data sets.
 HBase is a column-oriented, distributed database management
system, which is based on a key-value store.
 It gives us a fault tolerant way of storing sparse data, which is common in
most Big Data use cases.
 HBase is written in Java, whereas HBase applications can be accessed
through REST, Avro and Thrift APIs.

For better understanding, let us take an example. You have billions of customer
emails and you need to find out the number of customers who have used the word
“complaint” in their emails. The request needs to be processed quickly (i.e. in real
time). So, here we are handling a large data set while retrieving a small amount of
data. HBase was designed for solving these kinds of problems; a brief shell sketch
follows.
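Continuing that example, a minimal sketch using the HBase shell; the table name, column family and values are assumptions:

hbase shell <<'EOF'
create 'emails', 'msg'
put 'emails', 'cust0001', 'msg:body', 'I have a complaint about my order'
get 'emails', 'cust0001'
scan 'emails'
EOF

Each row key (here a customer ID) gives fast random access to that customer's data, which is the access pattern HBase is built for.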
YARN
 Consider YARN as the brain of your Hadoop Ecosystem.
 It performs all your processing activities by allocating resources and
scheduling tasks.
 It has two major components, i.e. Resource Manager and Node Manager.
1. Resource Manager is again a main node in the processing department. It
receives the processing requests, and then passes the parts of requests to
corresponding Node Managers accordingly, where the actual processing
takes place.
2. Node Managers are installed on every Data Node. They are responsible for
the execution of tasks on every single Data Node.
 Schedulers: Based on your application resource requirements, Schedulers
perform scheduling algorithms and allocates the resources.
 Applications Manager: The Applications Manager accepts the job
submission, negotiates the container (i.e. the Data Node environment where the
process executes) for executing the application-specific Application Master,
and monitors its progress. ApplicationMasters are the daemons which
reside on the DataNodes and communicate with containers for the execution of tasks
on each DataNode.

 The ResourceManager has two components: the Scheduler and the Applications
Manager.

MAPREDUCE
 MapReduce is a programming model and an associated implementation for
processing and generating large data sets using distributed and parallel
algorithms inside Hadoop environment.
 Users specify a map function that processes a key/value pair to generate a set
of intermediate key/value pairs, and a reduce function that merges all
intermediate values associated with the same intermediate key. A hedged
word-count sketch using Hadoop Streaming follows.
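The canonical illustration is word count. A minimal sketch using Hadoop Streaming with shell utilities as the map and reduce programs; the streaming jar path and the HDFS input/output paths are assumptions:

cat > mapper.sh <<'EOF'
#!/bin/bash
# map: emit one word per line; each word becomes an intermediate key
tr -s '[:space:]' '\n'
EOF
cat > reducer.sh <<'EOF'
#!/bin/bash
# reduce: input arrives sorted by key, so counting adjacent duplicates
# yields the total count per word
uniq -c
EOF
chmod +x mapper.sh reducer.sh

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.sh,reducer.sh \
  -input /user/hadoop/books \
  -output /user/hadoop/wordcount \
  -mapper mapper.sh \
  -reducer reducer.sh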

HDFS
 Hadoop Distributed File System is the core component or you can say, the
backbone of Hadoop Ecosystem.
 HDFS is the one, which makes it possible to store different types of large
data sets (i.e. structured, unstructured and semi structured data).
 HDFS creates a level of abstraction over the resources, from where we can
see the whole HDFS as a single unit.
 It helps us in storing our data across various nodes and maintaining the log
file about the stored data (metadata).
 HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn’t store the actual data. It
contains metadata, just like a log file or you can say as a table of content.
Therefore, it requires less storage and high computational resources.
2. All your data is stored on the DataNodes and hence it requires more
storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That’s the reason,
why Hadoop solutions are very cost effective.
 You always communicate with the NameNode while writing data. The
NameNode then tells the client on which DataNodes to store and replicate the
data.

Ambari
 Ambari is an Apache Software Foundation project which aims at making the
Hadoop ecosystem more manageable.
 It includes software for provisioning, managing and monitoring Apache
Hadoop clusters.
 The Ambari provides:
1. Hadoop cluster provisioning:
 It gives us a step-by-step process for installing Hadoop services
across a number of hosts.
 It also handles the configuration of Hadoop services over a cluster.
2. Hadoop cluster management:
 It provides a central management service for starting, stopping and
reconfiguring Hadoop services across the cluster.
3. Hadoop cluster monitoring:
 For monitoring health and status, Ambari provides us with a dashboard.
 The Ambari Alert framework is an alerting service which notifies
the user whenever attention is needed, for example if a node
goes down or a node is low on disk space.

The Hadoop Ecosystem owes its success to the whole developer community;


many big organizations like Facebook, Google, Yahoo and the University of California
(Berkeley) have contributed their part to increase Hadoop’s capabilities.
Inside the Hadoop Ecosystem, knowledge about one or two tools (Hadoop
components) would not help in building a solution. You need to learn a set of
Hadoop components which work together to build a solution.
Based on the use cases, we can choose a set of services from Hadoop
Ecosystem and create a tailored solution for an organization.

Hadoop Ecosystem Elements at Various Stages of Data Processing

Figure: Hadoop Ecosystem Elements at various stages of Data Processing
Apache Chukwa
 It is an open-source project under the Apache Hadoop umbrella that aims at
collecting data from large distributed systems and providing tools for data
analysis.
 Chukwa is designed around a flexible and distributed architecture that
allows for easy scalability and robust fault tolerance.
 The primary function of Apache Chukwa lies in system log collection and
analysis, aiding in understanding system behavior, monitoring, and
troubleshooting.
History
Apache Chukwa was initially developed as a sub-project of Hadoop in 2008.
Its creators designed it to monitor large distributed systems, like Hadoop itself. It
graduated to a top-level project in 2015 and has seen several minor and major
updates since then.

Functionality and Features


Apache Chukwa includes a flexible and powerful toolkit for displaying
monitoring and analysis results. Some of its key features include:
 Adaptive clustering: Chukwa can be configured to dynamically resize its
clusters based on the volume of data.
 Flexibility: It can collect data from many different types of systems,
including Hadoop and other distributed systems.
 Large data handling: Use of Hadoop HDFS and MapReduce features for
storing and processing data, making it suitable for very large datasets.

Challenges and Limitations


While Apache Chukwa is powerful, it comes with its share of challenges and
limitations. It is best used in environments where large-scale data collection and
analysis are the norm, and might be overkill for smaller-scale data needs. Its learning
curve can be steep, especially for those unfamiliar with the Hadoop ecosystem.

Security Aspects
Apache Chukwa, given its close integration with Apache Hadoop, adheres to
the same security measures as its parent project. This includes Hadoop’s built-in
security features such as Kerberos for authentication and HDFS transparent encryption.

Performance
Apache Chukwa's performance is tightly linked with the underlying Hadoop
infrastructure, gaining advantage from Hadoop's robust scalability and fault
tolerance. However, the performance can be conditioned by the hardware resources
of the deployment and the overall load of data processing.
Avro
 Avro is an open source project that provides data serialization and data
exchange services for Apache Hadoop. These services can be used together
or independently.
 Avro facilitates the exchange of big data between programs written in any
language. With the serialization service, programs can efficiently serialize
data into files or into messages.
 The data storage is compact and efficient.
 Avro stores both the data definition and the data together in one message or
file.
 Avro stores the data definition in JSON format making it easy to read and
interpret; the data itself is stored in binary format making it compact and
efficient.
 Avro files include markers that can be used to split large data sets into
subsets suitable for Apache MapReduce processing. Some data exchange
services use a code generator to interpret the data definition and produce
code to access the data. Avro doesn't require this step, making it ideal for
scripting languages.
 A key feature of Avro is robust support for data schemas that change over
time — often called schema evolution.
 Avro handles schema changes like missing fields, added fields and changed
fields; as a result, old programs can read new data and new programs can
read old data.
 Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored
using Avro can be passed between programs written in different languages,
even from a compiled language like C to a scripting language like Apache
Pig. A small example schema follows this list.
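As a hedged illustration of a JSON-defined schema, the snippet below writes a small record schema and converts a JSON record to the compact Avro binary format using the avro-tools jar (the schema, file names and avro-tools version are assumptions):

cat > user.avsc <<'EOF'
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
EOF
echo '{"id": 1, "name": "Asha", "email": {"string": "asha@example.com"}}' > user.json
java -jar avro-tools-1.11.1.jar fromjson --schema-file user.avsc user.json > user.avro
java -jar avro-tools-1.11.1.jar tojson user.avro

Adding a new field with a default value to user.avsc would not break readers of the old file, which is the schema-evolution property described above.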

Hadoop Ecosystem for Big Data Computation


Giraph:
Giraph is a graph processing tool, which is, being used by the Facebook, to
analyse the social network's graph that was made simplified, when it was made out
of Map Reduce.
Giraph, Storm, Spark, Flink, do not use, Map Reduce directly, they run over
Yarn and HDFS.

Storm, Spark and Flink


Storm, Spark and Flink are fast streaming-data / stream-processing / real-
time frameworks which use in-memory computation.
So, stream processing, or real-time streaming applications,
are built using Storm, Spark and Flink over YARN and HDFS.

NoSQL
Most of this big data is stored in the form of key-value pairs, and such stores are
also known as NoSQL data stores.
NoSQL data stores are supported by databases like
Cassandra, MongoDB and HBase.
Traditional SQL can be effectively used to handle large amounts of
structured data. But in big data, most of the information is in unstructured
form, so NoSQL is required to handle that information.
A NoSQL database stores unstructured data as well; however, it is not
enforced to follow a particular fixed schema structure, and the schema keeps
changing dynamically. So, each row can have its own set of column values.
NoSQL gives better performance in storing massive amounts of data
compared to the SQL structure.
A NoSQL database is primarily a key-value store. It is also called a ‘column
family’ store because the data is stored column-wise in the form of key-value pairs.

Cassandra:
Another database which supports a NoSQL data model is
Cassandra.
Apache Cassandra is a highly scalable, distributed and high-performance
NoSQL database.
Cassandra is designed to handle huge amounts of information, and it
handles this huge data with its distributed architecture.

Spark
Apache Spark is an open-source cluster computing framework.
Its primary purpose is to handle the real-time generated data.
Spark was built on top of Hadoop MapReduce. It was optimized to
run in memory, whereas alternative approaches like Hadoop’s MapReduce write
data to and from computer hard drives. So, Spark processes data much quicker
than the alternatives.
Spark is a scalable data analytics platform
that supports in-memory computation.
Spark was initiated by Matei Zaharia at UC Berkeley’s AMPLab in
2009. It was open sourced in 2010 under a BSD license.
In 2013, the project was donated to the Apache Software Foundation. In 2014,
Spark emerged as a Top-Level Apache Project.

Features of Apache Spark


 Fast: It provides high performance for both batch and streaming data, using
a state-of-the-art DAG scheduler, a query optimizer, and a physical
execution engine.
 Easy to Use: It allows applications to be written in Java, Scala, Python, R,
and SQL. It also provides more than 80 high-level operators.
 Generality: It provides a collection of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
 Lightweight: It is a light unified analytics engine which is used for large
scale data processing.
 Runs Everywhere: It can easily run on Hadoop, Apache Mesos,
Kubernetes, standalone, or in the cloud (see the submission sketch after this list).
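As a hedged sketch of running Spark on a Hadoop cluster, the bundled SparkPi example can be submitted to YARN; the jar location and Spark version are assumptions:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100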

Usage of Spark
 Data integration: The data generated by systems are not consistent enough
to combine for analysis. To fetch consistent data from systems we can use
processes like Extract, transform, and load (ETL). Spark is used to reduce
the cost and time required for this ETL process.
 Stream processing: It is always difficult to handle real-time generated
data such as log files. Spark is capable enough to operate on streams of data and
to flag potentially fraudulent operations.
 Machine learning: Machine learning approaches become more feasible and
increasingly accurate due to enhancement in the volume of data. As spark is
capable of storing data in memory and can run repeated queries quickly, it
makes it easy to work on machine learning algorithms.
 Interactive analytics: Spark is able to respond rapidly. So,
instead of running only pre-defined queries, we can handle the data interactively.

Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with
storage systems and memory management.
Spark SQL
o It provides support for structured data.
o It allows to query the data via SQL (Structured Query Language) as well as
the Apache Hive variant of SQL called the HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between
Java objects and existing databases, data warehouses and business
intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming
analytics.
o It accepts data in mini-batches and performs RDD transformations on that
data.
o Its design ensures that the applications written for streaming data can be
reused to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time
example of a data stream.
MLlib
o The MLlib is a Machine Learning library that contains various machine
learning algorithms.
o These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache
Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
o It facilitates to create a directed graph with arbitrary properties attached to
each vertex and edge.
o To manipulate graph, it supports various fundamental operators like
subgraph, join Vertices, and aggregate Messages.

Kafka
Apache Kafka is an open source, distributed stream processing software
framework. Through Kafka, data streams can be submitted to Apache Spark
for computation, and this forms a pipeline.
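A hedged sketch of the Kafka side of such a pipeline, using the console tools that ship with Kafka; the broker address, topic name and partition settings are assumptions (older releases use --zookeeper instead of --bootstrap-server):

kafka-topics.sh --create --topic events --bootstrap-server broker1:9092 --partitions 3 --replication-factor 1
kafka-console-producer.sh --topic events --bootstrap-server broker1:9092
kafka-console-consumer.sh --topic events --bootstrap-server broker1:9092 --from-beginning

A stream processor such as Spark Streaming (or a connector writing to HDFS) would normally take the consumer's place in a real pipeline.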

Cloudera Distributed Hadoop


Apache Impala
 Impala is an open-source and native analytics database for Hadoop.
 It is a Massive Parallel Processing (MPP) SQL query engine that processes
massive amounts of data stored in a Hadoop cluster.
 Impala provides high performance and low latency compared to other SQL
engines for Apache Hadoop, such as Hive.
 Impala is written in Java & C++.
 We can query data stored in either HDFS or Apache HBase with Apache
Impala.
 We can perform real-time operations like SELECT, JOIN, and aggregation
functions with Impala.
 Apache Impala uses the Impala Query Language, which is a subset of Hive
Query Language with some functional limitations such as transforms. It uses
the same syntax as Hive Query Language (SQL-like), and the same metadata,
user interface, and ODBC drivers as Apache Hive, providing a familiar and
unified platform for batch-oriented or real-time queries. This allows Hive
users to use Apache Impala with little setup overhead. However, Impala
does not support all SQL queries; some syntax changes may occur. A brief
impala-shell sketch follows this list.
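A minimal sketch of issuing such a query through impala-shell; the daemon host and the table are assumptions (21000 is the usual impala-shell port):

impala-shell -i impalad-host:21000 -q "SELECT product, COUNT(*) AS cnt FROM sales GROUP BY product;"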

Apache Impala Architecture

Apache Impala runs several systems in an Apache Hadoop cluster.


Unlike traditional storage systems, Apache impala is not tied to its storage
core.
It is separate from its storage engine.
Impala has three core components:
1. Impala Daemon (impalad): The Impala daemon is a core component of
Apache Impala. It is physically represented by the impalad process. The Impala daemon
runs on every computer where Impala is installed.
The main functions of the Impala daemon are:
 Reads and writes to data files.
 Accepts queries passed from impala-shell, JDBC, Hue, or ODBC.
 Impala Daemon parallelizes queries and distributes work across the Hadoop
cluster.
 Transmits ongoing query results back to the central coordinator.
 Impala daemons constantly communicate with the StateStore to confirm
which daemons are healthy and ready to accept new work.
 Impala daemons also receive broadcast messages from the catalog daemon
(catalogd) whenever any Impala daemon in the cluster creates, drops, or modifies any
type of object, or when Impala processes an INSERT or LOAD DATA statement.

For Implementing impala, we can use one of these methods:


 Locate HDFS and Impala together, and each Impala daemon should be
running on the same host as the DataNode.
 Deploy Impala alone in a compute cluster that can remotely read data from
HDFS, S3, ADLS, etc.

2. Impala StateStore (statestored): The StateStore is the one that checks the health of all
the Impala daemons in the cluster and continuously communicates its findings to
each of the Impala daemons.
The StateStore is not always critical to the normal operation of an
Impala cluster. If the StateStore is not running, the Impala daemons will
keep running and distributing work among themselves as usual.
3. Impala Catalog Service: The catalog service is another Impala component that
propagates metadata changes from Impala SQL commands to all Impala daemons
in the cluster.

Apache Impala Features


The key features of the Impala are –
 Provides support for in-memory data processing; it can access or analyze
data stored on Hadoop DataNodes without any data movement.
 Using Impala, we can access data using SQL-like queries.
 Apache Impala provides faster access to data stored in the Hadoop
Distributed File System compared to other SQL engines such as Hive.
 Impala helps us to store data in storage systems like Hadoop HBase, HDFS,
and Amazon S3.
 We can easily integrate Impala with business intelligence tools such as
Tableau, MicroStrategy, Pentaho, and Zoomdata.
 Provides support for various file formats such as LZO, Avro, RCFile,
SequenceFile, and Parquet.
 Apache Impala uses the same ODBC driver, user interface, metadata, and SQL
syntax as Apache Hive.

Moving Data In and Out of Hadoop


Moving Data into Hadoop
1. HDFS Command Line Interface (CLI)
 You can use the hdfs dfs commands to move files into HDFS (Hadoop
Distributed File System).
 Example: hdfs dfs -put localfile /user/hadoop/
2. Apache Sqoop
 Useful for importing data from relational databases (like MySQL, Oracle,
etc.) into HDFS.
 Example: sqoop import --connect jdbc:mysql://localhost/db --table mytable --target-dir /user/hadoop/mytable
3. Apache Flume
 Designed for streaming data ingestion into Hadoop.
 You configure sources (e.g., logs from servers), channels (e.g., memory or files),
and sinks (e.g., HDFS) to handle data transfer.
 Example configuration might look like:
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /path/to/input
agent.sources.src1.channels = ch1
agent.channels.ch1.type = memory
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /user/hadoop/logs
agent.sinks.sink1.channel = ch1
4. Apache Kafka
 For real-time data streaming into Hadoop.
 Data can be ingested into Hadoop through Kafka consumers that read from
Kafka topics and write to HDFS.
5. Hadoop Streaming
 Allows you to use any executable or script as a mapper or reducer for
processing data.
 Example command: hadoop jar /path/to/hadoop-streaming.jar -input /input
-output /output -mapper /path/to/mapper -reducer /path/to/reducer
6. Hadoop MapReduce
 Used for processing data in HDFS, but also supports data loading through
custom input formats and combiners.

Moving Data out of Hadoop


1. HDFS CLI
 You can use hdfs dfs -get to move files from HDFS to a local file system.
 Example: hdfs dfs -get /user/hadoop/datafile /localpath/
2. Apache Sqoop
 For exporting data from HDFS back to relational databases.
 Example: sqoop export --connect jdbc:mysql://localhost/db --table mytable --export-dir /user/hadoop/mytable
3. Apache Flume
 Can be configured to move data from HDFS to another system or storage.
 Example configuration might involve setting up a Flume sink to write to
another destination.
4. Apache Kafka
 Similar to ingesting data, Kafka can be used to export data by consuming
from HDFS and producing to other systems.
5. Custom MapReduce Jobs
 You can write custom MapReduce jobs to process data in HDFS and write
results to external systems.
6. Hadoop DistCp
 A tool for copying large amounts of data between HDFS clusters or between
HDFS and other storage systems.
 Example: hadoop distcp hdfs://source-cluster/user/hadoop/data
hdfs://destination-cluster/user/hadoop/data
7. Hive/Impala
 If using Hive or Impala, you can query data and export the results to external
systems using INSERT INTO ... SELECT statements or by using Hive’s data
export capabilities.
8. Spark
 Apache Spark can also be used to process and move data between HDFS and
other storage systems or databases.

Best Practices
 Data Formatting: Ensure data is properly formatted and compatible with the
destination systems.
 Data Compression: Use compression (e.g., Snappy, Gzip) to optimize data
transfer and storage.
 Data Security: Implement appropriate security measures, such as encryption
and access controls, to protect data during transfer.
 Error Handling: Set up monitoring and error handling mechanisms to handle
any issues that arise during data transfer.

Moving Data In and Out of Hadoop


Simple techniques for data movement include using the command line and
Java; more advanced techniques include using NFS and DistCp.
Ingress and egress refer to data movement into and out of a system,
respectively.

Key elements of data movement


1. Idempotence
An idempotent operation produces the same result no matter how many
times it’s executed.
In a relational database, the inserts typically aren’t idempotent, because
executing them multiple times doesn’t produce the same resulting database state.
Alternatively, updates often are idempotent, because they’ll produce the
same end result.
Any time data is being written, idempotence should be a consideration, and
data ingress and egress in Hadoop are no different.

2. Aggregation: The data aggregation process combines multiple data elements.


3. Data format transformation: The data format transformation process converts
one data format into another.
4. Compression: Compression not only helps by reducing the footprint of data at
rest, but also has I/O advantages when reading and writing data.

5. Availability and recoverability: Recoverability allows an ingress or egress tool


to retry in the event of a failed operation

6. Reliable data transfer and data validation: In the context of data


transportation, checking for correctness is how you verify that no data
corruption occurred as the data was in transit. A common method for checking
the correctness of raw data, such as that read from storage devices, is Cyclic Redundancy
Checks (CRCs), which are what HDFS uses internally to maintain block-level
integrity (see the checksum sketch after this list).

7. Resource consumption and performance: Resource consumption and


performance are measures of system resource utilization and system efficiency,
respectively.
8. Monitoring: Monitoring ensures that functions are performing as expected in
automated systems.
9. Speculative execution: MapReduce has a feature called speculative execution
that launches duplicate tasks near the end of a job for tasks that are still
executing. This helps prevent slow hardware from impacting job execution
times.
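To illustrate the data-validation element, a hedged sketch of checking that an ingested file arrived intact; the file names are assumptions:

hdfs dfs -put -f local-file.txt /user/hadoop/local-file.txt
# HDFS's own CRC-based composite checksum for the stored blocks
hdfs dfs -checksum /user/hadoop/local-file.txt
# an end-to-end content check: hash the bytes read back from HDFS
# and compare with a hash of the local copy
hdfs dfs -cat /user/hadoop/local-file.txt | md5sum
md5sum local-file.txt

Note that the value printed by -checksum is a composite of block-level CRCs, so it is compared against another HDFS checksum rather than against a plain md5sum of the local file.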

Moving data into Hadoop


Picking the right ingest tool for the job
The low-level tools in this section work well for one-off file movement
activities, or when working with legacy data sources and destinations that are file-
based. But moving data in this way is quickly being made obsolete by tools such
as Flume and Kafka, which offer automated data-movement pipelines.
Kafka is a much better platform for getting data from A to B (and B can be a
Hadoop cluster) than the old-school “let’s copy files around!” With Kafka, you
only need to pump your data into Kafka, and you have the ability to consume the
data in real time (such as via Storm) or in offline/batch jobs (such as via Camus).

Steps to move data into hadoop


Technique: Using the CLI to load files
Problem: You want to copy files into HDFS using the shell.
Solution: The HDFS command-line interface can be used for one-off moves, or
it can be incorporated into scripts for a series of moves.
Discussion: Copying a file from local disk to HDFS is done with
the hadoop command:
$ hadoop fs -put local-file.txt hdfs-file.txt
The behavior of the put command differs from the Linux cp command—in Linux, if
the destination already exists it is overwritten; in Hadoop the copy fails
with an error:
put: `hdfs-file.txt': File exists
The -f option must be added to force the file to be overwritten:
$ hadoop fs -put -f local-file.txt hdfs-file.txt
Much like with the Linux cp command, multiple files can be copied using
the same command. In this case, the final argument must be the directory in HDFS
into which the local files are copied:
$ hadoop fs -put local-file1.txt local-file2.txt /hdfs/dest/
To test for the existence of a file or directory, use the -test command with
either the -e or -d option to test for file or directory existence, respectively. The
exit code of the command is 0 if the file or directory exists, and 1 if it doesn’t:
$ hadoop fs -test -e hdfs-file.txt
$ echo $?
1
$ hadoop fs -touchz hdfs-file.txt
$ hadoop fs -test -e hdfs-file.txt
$ echo $?
0
$ hadoop fs -test -d hdfs-file.txt
$ echo $?
1
If all you want to do is “touch” a file in HDFS (create a new empty file),
the touchz option is what you’re looking for:
$ hadoop fs -touchz hdfs-file.txt
There are many more operations supported by the fs command—to see the
full list, run the command without any options:
$ hadoop fs
The CLI is designed for interactive HDFS activities, and it can also be
incorporated into scripts for some tasks you wish to automate.
The disadvantage of the CLI is that it’s low-level and doesn’t have any
automation mechanisms built in.
The next technique is more suited to working with HDFS in programming
languages such as Python.

Technique: Using REST to load files


The CLI is handy for quickly running commands and for scripting.
However, it incurs the overhead of forking a separate process for each
command, which is overhead that you’ll probably want to avoid, especially if
you’re interfacing with HDFS in a programming language.
This technique covers working with HDFS in languages other than Java.
Problem: You want to be able to interact with HDFS from a programming
language that doesn’t have a native interface to HDFS.
Solution: Use Hadoop’s WebHDFS interface, which offers a full-featured REST
API for HDFS operations.
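As a rough illustration, the following minimal Java sketch reads an HDFS file over WebHDFS by issuing an HTTP GET with op=OPEN. The NameNode host, the HTTP port (50070 by default in Hadoop 2.x, 9870 in 3.x), the file path, and the user name are assumptions to adapt to your cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode host/port, HDFS path, and user name
        URL url = new URL("http://namenode-host:50070/webhdfs/v1/user/hadoop/hdfs-file.txt"
                + "?op=OPEN&user.name=hadoop");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // OPEN redirects the client to a DataNode
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // print the file contents line by line
            }
        }
    }
}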

Technique: Accessing HDFS from behind a firewall


Problem: You want to write to HDFS, but there’s a firewall restricting access to
the NameNode and/or the DataNodes.
Solution: Use the HttpFS gateway, which is a standalone server that provides
access to HDFS over HTTP. Because it’s a separate service and it’s HTTP, it can
be configured to run on any host that has access to the Hadoop nodes, and you can
open a firewall rule to allow traffic to the service.

Differences between WebHDFS and HttpFS


The primary difference between WebHDFS and HttpFS is the accessibility
of the client to all the data nodes.
If your client has access to all the data nodes, then WebHDFS will work for
you, as reading and writing files involves the client talking directly to the data
nodes for data transfer.
On the other hand, if you’re behind a firewall, your client probably doesn’t
have access to all the data nodes, in which case the HttpFS option will work best
for you. With HttpFS, the server will talk to the data nodes, and your client just
needs to talk to the single HttpFS server.
If you have a choice, pick WebHDFS because there’s an inherent advantage
in clients talking directly to the data nodes—it allows you to easily scale the
number of concurrent clients across multiple hosts without hitting the network
bottleneck of all the data being streamed via the HttpFS server. This is especially
true if your clients are running on the data nodes themselves, as you’ll be using the
data locality benefits of WebHDFS by directly streaming any locally hosted HDFS
data blocks from the local filesystem instead of over the network.
Technique: Mounting Hadoop with NFS
Problem: You want to treat HDFS as a regular Linux filesystem and use standard
Linux tools to interact with HDFS.
Solution: Use Hadoop’s NFS implementation to access data in HDFS.
Discussion: Prior to Hadoop 2.1, the only way to NFS-mount HDFS was with
FUSE. The new NFS implementation in Hadoop addresses all of the shortcomings
with the old FUSE-based system. It’s a proper NFSv3 implementation, and it
allows you to run one or more NFS gateways for increased availability and
throughput. Figure shows the various Hadoop NFS components in action.

Figure: Hadoop NFS


Technique: Using DistCp to copy data within and between clusters
Problem: You want to efficiently copy large amounts of data between Hadoop
clusters and have the ability for incremental copies.
Solution: Use DistCp, a parallel file-copy tool built into Hadoop.

Technique: Using Java to load files


This technique shows how the Java HDFS API can be used to read and write
data in HDFS.
Problem: You want to incorporate writing to HDFS into your Java application.
Solution: Use the Hadoop Java API to access data in HDFS.
Discussion: The HDFS Java API is nicely integrated with Java’s I/O model, which
means you can work with regular InputStreams and OutputStreams for I/O. To
perform filesystem-level operations such as creating, opening, and removing files,
Hadoop has an abstract class called FileSystem, which is extended and
implemented for specific filesystems that can be leveraged in Hadoop.
There are two main parts to writing the code that does this: getting a handle to
the FileSystem and creating the file, and then copying the data from standard input
to the OutputStream:
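A minimal sketch of those two parts (the class name and argument handling here are illustrative, not the book's hip listing):

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyStdinToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the configured filesystem (HDFS)
        OutputStream out = fs.create(new Path(args[0])); // create the destination file in HDFS
        IOUtils.copyBytes(System.in, out, 4096, true);   // stream stdin into HDFS, then close
    }
}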

Technique: Pushing system log messages into HDFS with Flume


Continuous movement of log and binary files into HDFS
A bunch of log files are being produced by multiple applications and
systems across multiple servers. There’s no doubt there’s valuable information to
be mined from these logs, but your first challenge is a logistical one of moving
these logs into your Hadoop cluster so that you can perform some analysis.
Problem: You want to push all of your production server’s system log files into
HDFS.
Solution: For this technique you’ll use Flume, a data collection system, to push a
Linux log file into HDFS.
Discussion: Flume, at its heart, is a log file collection and distribution system, and
collecting system logs and transporting them to HDFS is its bread and butter.
Figure: Flume components illustrated within the context of an agent

Sources
Flume sources are responsible for reading data from external clients or from
other Flume sinks. A unit of data in Flume is defined as an event, which is
essentially a payload and optional set of metadata. A Flume source sends these
events to one or more Flume channels, which deal with storage and buffering.
Flume has an extensive set of built-in sources, including HTTP, JMS, and RPC.
The exec source allows you to execute a Unix command, and each line
emitted in standard output is captured as an event (standard error is ignored by
default).
To conclude our brief dive into Flume sources, let’s summarize some of the
interesting abilities that they provide:
 Transactional semantics, which allow data to be reliably moved with at-
least-once semantics. Not all data sources support this.

The exec source used in this technique is an example of a source that doesn’t
provide any data-reliability guarantees.
 Interceptors, which provide the ability to modify or drop events. They are
useful for annotating events with host, time, and unique identifiers, which
are useful for deduplication.
 Selectors, which allow events to be fanned out or multiplexed in various
ways. You can fan out events by replicating them to multiple channels, or
you can route them to different channels based on event headers.
Channels
Flume channels provide data storage facilities inside an agent. Sources add
events to a channel, and sinks remove events from a channel. Channels provide
durability properties inside Flume, and you pick a channel based on which level
of durability and throughput you need for your application.
There are three channels bundled with Flume:
 Memory channels store events in an in-memory queue. This is very useful
for high-throughput data flows, but they have no durability guarantees,
meaning that if an agent goes down, you’ll lose data.
 File channels persist events to disk. The implementation uses an efficient
write-ahead log and has strong durability properties.
 JDBC channels store events in a database. This provides the strongest
durability and recoverability properties, but at a cost to performance.
Sinks
A Flume sink drains events out of one or more Flume channels and will
either forward these events to another Flume source (in a multihop flow), or handle
the events in a sink-specific manner. There are a number of sinks built into Flume,
including HDFS, HBase, Solr, and Elasticsearch.
One area that Flume isn’t really optimized for is working with binary data. It
can support moving binary data, but it loads the entire binary event into memory,
so moving files that are gigabytes in size or larger won’t work.

Technique: An automated mechanism to copy files into HDFS


You’ve learned how to use log-collecting tools like Flume to automate
moving data into HDFS. But these tools don’t support working with semistructured
or binary data out of the box. In this technique, we'll look at how to automate moving
such files into HDFS.
You need a mechanism to automate the process of copying files of any
format into HDFS, similar to the Linux tool rsync. The mechanism should be able
to compress files written in HDFS and offer a way to dynamically determine the
HDFS destination for data-partitioning purposes.
Existing file transportation mechanisms such as Flume, Scribe, and Chukwa
are geared toward supporting log files.
Problem: You need to automate the process by which files on remote servers are
copied into HDFS.
Solution: The open source HDFS File Slurper project can copy files of any format
into and out of HDFS.
Discussion: You can use the HDFS File Slurper project (which I wrote) to assist
with your automation (https://github.com/alexholmes/hdfs-file-slurper). The HDFS
File Slurper is a simple utility that supports copying files from a local directory
into HDFS and vice versa.
Figure below provides a high-level overview of the Slurper (my nickname for the
project), with an example of how you can use it to copy files. The Slurper reads
any files that exist in a source directory and optionally consults with a script to
determine the file placement in the destination directory. It then writes the file to
the destination, after which there’s an optional verification step. Finally, the
Slurper moves the source file to a completed folder upon successful completion of
all of the previous steps.

Figure: HDFS File Slurper data flow for copying files


With this technique, there are a few challenges you need to make sure to address:
 How do you effectively partition your writes to HDFS so that you don’t
lump everything into a single directory?
 How do you determine that your data in HDFS is ready for processing (to
avoid reading files that are mid-copy)?
 How do you automate regular execution of your utility?

Technique: Scheduling regular ingress activities with Oozie


If your data is sitting on a filesystem, web server, or any other system
accessible from your Hadoop cluster, you’ll need a way to periodically pull that
data into Hadoop. Tools exist to help with pushing log files and pulling from
databases, but if you need to interface with some other system, it’s likely you’ll
need to handle the data ingress process yourself.
There are two parts to this data ingress process: how you import data from
another system into Hadoop, and how you regularly schedule the data transfer.
Problem: You want to automate a daily task to download content from an HTTP
server into HDFS.
Solution: Oozie can be used to move data into HDFS, and it can also be used to
execute post-ingress activities such as launching a MapReduce job to process the
ingested data. Now an Apache project, Oozie started life inside Yahoo!. It’s a
Hadoop workflow engine that manages data processing activities. Oozie also has a
coordinator engine that can start workflows based on data and time triggers.
Discussion: In this technique, you’ll perform a download from a number of URLs
every 24 hours, using Oozie to manage the workflow and scheduling. The flow for
this technique is shown in figure. You’ll use Oozie’s triggering capabilities to kick
off a MapReduce job every 24 hours.
Figure: Data flow for this Oozie technique

Databases
Most organizations’ crucial data exists across a number of OLTP databases.
The data stored in these databases contains information about users, products, and
a host of other useful items. If you wanted to analyze this data, the traditional way
to do so would be to periodically copy that data into an OLAP data warehouse.
Hadoop has emerged to play two roles in this space: as a replacement to data
warehouses, and as a bridge between structured and unstructured data and data
warehouses. Figure shows the first role, where Hadoop is used as a large-
scale joining and aggregation mechanism prior to exporting the data to an OLAP
system (a commonly used platform for business intelligence applications).
Figure: Using Hadoop for data ingress, joining, and egress to OLAP

Facebook is an example of an organization that has successfully utilized Hadoop
and Hive as an OLAP platform for working with petabytes of data. Figure below
shows an architecture similar to that of Facebook’s. This architecture also includes
a feedback loop into the OLTP system, which can be used to push discoveries
made in Hadoop, such as recommendations for users.

Figure: Using Hadoop for OLAP and feedback to OLTP systems


In either usage model, you need a way to bring relational data into Hadoop, and
you also need to export it into relational databases.

Technique: Using Sqoop to import data from MySQL


Sqoop is a project that you can use to move relational data into and out of
Hadoop. It’s a great high-level tool as it encapsulates the logic related to the
movement of the relational data into Hadoop—all you need to do is supply Sqoop
the SQL queries that will be used to determine which data is exported.
Problem: You want to load relational data into your cluster and ensure your writes
are efficient and also idempotent.
Solution: In this technique, we’ll look at how you can use Sqoop as a simple
mechanism to bring relational data into Hadoop clusters. We’ll walk through the
process of importing data from MySQL into Hadoop. We’ll also cover bulk imports
using the fast connector (connectors are database-specific components that provide
database read and write access).
Discussion: Sqoop is a relational database import and export system. It was
created by Cloudera and is currently an Apache project in incubation status.
When you perform an import, Sqoop can write to HDFS, Hive, and HBase, and for
exports it can do the reverse. Importing is divided into two activities: connecting to
the data source to gather some statistics, and then firing off a MapReduce job that
performs the actual import. Figure below shows these steps.
Figure: Sqoop import overview: connecting to the data source and using
MapReduce

Sqoop has the notion of connectors, which contain the specialized logic needed to
read and write to external systems. Sqoop comes with two classes of
connectors: common connectors for regular reads and writes, and fast
connectors that use database-proprietary batch mechanisms for efficient
imports. Figure below shows these two classes of connectors and the databases that
they support.
Figure : Sqoop connectors used to read and write to external systems

MySQL table names


MySQL table names in Linux are case-sensitive. Make sure that the table name
you supply in the Sqoop commands uses the correct case.
By default, Sqoop uses the table name as the destination in HDFS for the
MapReduce job that it launches to perform the import.
Import data formats
Sqoop has imported the data as comma-separated text files. It supports a number of
other file formats, which can be activated with the arguments listed in table.
Table: Sqoop arguments that control the file formats of import commands
Argument Description
--as-avrodatafile Data is imported as Avro files.
--as-sequencefile Data is imported as SequenceFiles.
--as-textfile The default file format; data is imported as CSV text files.
Securing passwords
Up until now you’ve been using passwords in the clear on the command line. This
is a security hole, because other users on the host can easily list the running
processes and see your password. Luckily Sqoop has a few mechanisms that you
can use to avoid leaking your password.
Data splitting
A somewhat even distribution of data within the minimum and maximum keys is
assumed by Sqoop as it divides the delta (the range between the minimum and
maximum keys) by the number of mappers. Each mapper is then fed a unique
query containing a range of the primary key.
By default Sqoop runs with four mappers. The number of mappers can be
controlled with the --num-mappers argument.
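For example (hypothetical values): if the table's primary key ranges from 1 to 1,000,000 and four mappers are used, the delta of 1,000,000 is divided by 4, so each mapper receives a bounded query covering roughly 250,000 keys—the first mapper imports keys 1 through 250,000, the second 250,001 through 500,000, and so on.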

Figure: Sqoop preprocessing steps to determine query splits

Incremental imports
You can also perform incremental imports. Sqoop supports two
types: append works for numerical data that’s incrementing over time, such as
auto-increment keys; last modified works on timestamped data.

Sqoop jobs and the metastore


With incremental imports, Sqoop needs to remember the last value it imported so
that the next execution can continue from where the previous one stopped. How can
you best automate a process that reuses that value? Sqoop has the notion of a job,
which can save this information and reuse it in subsequent executions.
Fast MySQL imports
What if you want to bypass JDBC altogether and use the fast MySQL Sqoop
connector for a high-throughput load into HDFS? This approach uses
the mysqldump utility shipped with MySQL to perform the load. You must make
sure that mysqldump is in the path of the user running the MapReduce job.

What are the disadvantages of fast connectors?


Fast connectors only work with text output files—specifying Avro or
SequenceFile as the output format of the import won’t work.

Importing to Hive
The final step in this technique is to use Sqoop to import your data into a Hive
table. The only difference between an HDFS import and a Hive import is that the
Hive import has a postprocessing step where the Hive table is created and loaded,
as shown in figure below.

Figure: The Sqoop Hive import sequence of events


When data is loaded into Hive from an HDFS file or directory, as in the case of
Sqoop Hive imports (step 4 in the figure), Hive moves the directory into its
warehouse rather than copying the data (step 5) for the sake of efficiency. The
HDFS directory that the Sqoop MapReduce job writes to won’t exist after the
import.
Continuous Sqoop execution
If you need to regularly schedule imports into HDFS, Oozie has Sqoop integration
that will allow you to periodically perform imports and exports.

HBase
Our final foray into moving data into Hadoop involves taking a look at
HBase. HBase is a real-time, distributed, data storage system that’s often either
colocated on the same hardware that serves as your Hadoop cluster or is in close
proximity to a Hadoop cluster. Being able to work with HBase data directly in
MapReduce, or to push it into HDFS, is one of the huge advantages when picking
HBase as a solution.

Technique: HBase ingress into HDFS


What if you had customer data sitting in HBase that you wanted to use in
MapReduce in conjunction with data in HDFS? You could write a MapReduce job
that takes as input the HDFS dataset and pulls data directly from HBase in your
map or reduce code. But in some cases it may be more useful to take a dump of the
data in HBase directly into HDFS, especially if you plan to utilize that data in
multiple Map-Reduce jobs and the HBase data is immutable or changes
infrequently.
Problem: You want to get HBase data into HDFS.
Solution: HBase includes an Export class that can be used to import HBase data
into HDFS in SequenceFile format. This technique also walks through code that
can be used to read the imported HBase data.
Discussion:
To be able to export data from HBase you first need to load some data into HBase.
GitHub
source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch5/hbase/HBaseWriter.java.

Technique: MapReduce with HBase as a data source


The built-in HBase exporter writes out HBase data using SequenceFile, which isn’t
supported by programming languages other than Java and doesn’t support schema
evolution. It also only supports a Hadoop filesystem as the data sink. If you want to
have more control over HBase data extracts, you may have to look beyond the
built-in HBase facilities.
Problem: You want to operate on HBase directly within your MapReduce jobs
without the intermediary step of copying the data into HDFS.
Solution: HBase has a TableInputFormat class that can be used in your
MapReduce job to pull data directly from HBase.
Discussion: HBase provides an InputFormat class called TableInputFormat, which
can use HBase as a data source in MapReduce.
source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch5/hbase/ImportMapReduce.java.
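A minimal sketch of using HBase as a MapReduce source (the table name and the emitted values are hypothetical; TableMapReduceUtil wires TableInputFormat into the job):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseSourceJob {

    // TableMapper fixes the map input types to (ImmutableBytesWritable, Result)
    static class RowMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context context)
                throws IOException, InterruptedException {
            // Emit the row key; a real job would pull column values out of 'result'
            context.write(new Text(row.get()), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-source");
        job.setJarByClass(HBaseSourceJob.class);

        Scan scan = new Scan(); // full-table scan; restrict columns/rows in real jobs
        TableMapReduceUtil.initTableMapperJob(
                "customers",    // hypothetical HBase table name
                scan, RowMapper.class, Text.class, Text.class, job);

        job.setNumReduceTasks(0); // map-only extract straight to HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        job.waitForCompletion(true);
    }
}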

Importing data from Kafka


Kafka, a distributed publish-subscribe system, is quickly becoming a key part of
our data pipelines thanks to its strong distributed and performance properties. It
can be used for many functions, such as messaging, metrics collection, stream
processing, and log aggregation. Another effective use of Kafka is as a vehicle to
move data into Hadoop. This is useful in situations where you have data being
produced in real time that you want to land in Hadoop.
A key reason to use Kafka is that it decouples data producers and consumers. It
notably allows you to have multiple independent producers (possibly written by
different development teams), and, likewise, multiple independent consumers
(again possibly written by different teams). Also, consumption can be real-
time/synchronous or batch/offline/asynchronous. The latter property is a big
differentiator when you’re looking at other pub-sub tools like RabbitMQ.

Kafka has a handful of concepts that you’ll need to understand:


· Topics —A topic is a feed of related messages.
· Partitions —Each topic is made up of one or more partitions, which are ordered
sequences of messages backed by log files.
I’m not talking about logging files here; Kafka employs log files to store data
flowing through Kafka.
· Producers and consumers —Producers and consumers write messages to and
read them from partitions.
· Brokers —Brokers are the Kafka processes that manage topics and partitions and
serve producer and consumer requests.
Kafka does not guarantee “total” ordering for a topic—instead, it only guarantees
that the individual partitions that make up a topic are ordered. It’s up to the
consumer application to enforce, if needed, a “global” per-topic ordering.
The figures below show a conceptual model of how Kafka works and an example of
how partitions could be distributed in an actual Kafka deployment.
Figure: Conceptual Kafka model showing producers, topics, partitions, and consumers
Figure: A physical Kafka model showing how partitions can be distributed across brokers
To support fault tolerance, topics can be replicated, which means that each
partition can have a configurable number of replicas on different hosts. This
provides increased fault tolerance and means that a single server dying isn’t
catastrophic for your data or for the availability of your producers and consumers.

Technique:Using Camus to copy Avro data from Kafka into HDFS


This technique is useful in situations where you already have data flowing in Kafka
for other purposes and you want to land that data in HDFS.
Problem: You want to use Kafka as a data-delivery mechanism to get your data
into HDFS.
Solution: Use Camus, a LinkedIn-developed solution for copying data in Kafka
into HDFS.
Discussion: Camus is an open-source project developed by LinkedIn. Kafka is
heavily deployed at LinkedIn, where Camus is used as a tool to copy data from
Kafka into HDFS.
Out of the box, Camus supports two data formats in Kafka: JSON and Avro. In this
technique we’re going to get Camus working with Avro data. Camus’s built-in
support of Avro requires that Kafka publishers write the Avro data in a proprietary
way, so for this technique we’re going to assume that you want to work with
vanilla Avro-serialized data in Kafka.
There are three parts to getting this technique to work: you’ll first write some Avro
data into Kafka, then you’ll write a simple class to help Camus deserialize your
Avro data, and finally you’ll run a Camus job to perform the data import.

Writing data into Kafka
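As an illustration only (this uses the modern Kafka producer client; the broker address, topic name, and payload are assumptions, and the Avro serialization step itself is not shown), publishing a record to Kafka looks roughly like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleKafkaWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] avroBytes = new byte[0]; // placeholder for Avro-serialized record bytes
            producer.send(new ProducerRecord<>("test-topic", "some-key", avroBytes));
        }
    }
}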


Writing a Camus decoder and schema registry
There are three Camus concepts that you need to understand:
· Decoders —The decoder’s job is to convert raw data pulled from Kafka into a
Camus format.
· Encoders —Encoders serialize decoded data into the format that will be stored in
HDFS.
· Schema registry —The schema registry provides schema information about Avro
data being encoded.
As mentioned earlier, Camus supports Avro data, but it does so in a way that
requires Kafka producers to write data using the Camus
KafkaAvroMessageEncoder class, which prefixes the Avro-serialized binary data
with some proprietary data, presumably so that the decoder in Camus can verify
that it was written by that class.

Figure: A look at how a Camus job executes

As Camus tasks in MapReduce succeed, the Camus OutputCommitter (a
MapReduce construct that allows for custom work to be performed upon task
completion) atomically moves the tasks’ data files to the destination directory. The
OutputCommitter additionally creates the offset files for all the partitions that the
tasks were working on. It’s possible that other tasks in the same job may fail, but
this doesn’t impact the state of tasks that succeed—the data and offset outputs of
successful tasks will still exist, so that subsequent Camus executions will resume
processing from the last-known successful state.
Next, let’s take a look at where Camus writes the imported data and how you can
control the behavior.

Data partitioning
Earlier you saw the location where Camus imported the Avro data sitting in Kafka.
Let’s take a closer look at the HDFS path structure, shown in figure below, and see
what you can do to determine the location.

Figure: Dissecting the Camus output path for exported data in HDFS

The date/time part of the path is determined by the timestamp extracted from the
CamusWrapper. You’ll recall from our earlier discussion that you can extract
timestamps from your records in Kafka in your MessageDecoder and supply them
to the CamusWrapper, which will allow your data to be partitioned by dates that
are meaningful to you, as opposed to the default, which is simply the time at which
the Kafka record is read in MapReduce.
Camus supports a pluggable partitioner, which allows you to control the part of the
path shown in figure below
Figure: The Camus partitioner path

Moving data out of Hadoop


Once you’ve used Hadoop to perform some critical function, be it data
mining or data aggregations, the next step is typically to externalize that data into
other systems in your environment. For example, it’s common to rely on Hadoop
to perform offline aggregations on data that’s pulled from your real-time systems,
and then to feed the derived data back into your real-time systems. A more
concrete example would be building recommendations based on user-behavior
patterns.

Technique: Using the CLI to extract files


Imagine that you’ve run some jobs in Hadoop to aggregate some data, and now
you want to get it out. One method you can use is the HDFS command-line
interface (CLI) to pull out directories and files into your local filesystem. This
technique covers some basic CLI commands that can help you out.
Problem
You want to copy files from HDFS to a local filesystem using the shell.
Solution
The HDFS CLI can be used for one-off moves, or the same commands can be
incorporated into scripts for more regularly utilized moves.
Discussion
Copying a file from HDFS to local disk is achieved via the hadoop command:
$ hadoop fs -get hdfs-file.txt local-file.txt
The behavior of the Hadoop get command differs from the Linux cp command—in
Linux, if the destination already exists it’s overwritten; in Hadoop the copy fails
with an error:
get: `local-file.txt': File exists
The -f option must be added to force the file to be overwritten:
$ hadoop fs -get -f hdfs-file.txt local-file.txt
Much like with the Linux cp command, multiple files can be copied using the same
command. In this case, the final argument must be the directory in the local
filesystem into which the HDFS files are copied:
$ hadoop fs -get hdfs-file1.txt hdfs-file2.txt /local/dest/
Often, one is copying a large number of files from HDFS to local disk—an
example is a MapReduce job output directory that contains a file for each task. If
you’re using a file format that can be concatenated, you can use the -getmerge
command to combine multiple files. By default, a newline is added at the end of
each file during concatenation:
$ hadoop fs -getmerge hdfs-dir/part* /local/output.txt
There are many more operations supported by the fs command—to see the full list,
run the command without any options.
The challenge with using the CLI is that it’s very low-level, and it won’t be able to
assist you with your automation needs. Sure, you could use the CLI within shell
scripts, but once you graduate to more sophisticated programming languages,
forking a process for every HDFS command isn’t ideal. In this situation you may
want to look at using the REST, Java, or C HDFS APIs. The next technique looks
at the REST API.
Technique: Using REST to extract files
Using the CLI is handy for quickly running commands and for scripting, but it
incurs the overhead of forking a separate process for each command, which is
overhead that you’ll probably want to avoid, especially if you’re interfacing with
HDFS in a programming language. This technique covers working with HDFS in
languages other than Java.
Problem
You want to be able to interact with HDFS from a programming language that
doesn’t have a native interface to HDFS.
Solution
Use Hadoop’s WebHDFS interface, which offers a full-featured REST API for
HDFS operations.

Technique: Reading from HDFS when behind a firewall


Production Hadoop environments are often locked down to protect the data
residing in these clusters. Part of the security procedures could include putting your
cluster behind a firewall, and this can be a nuisance if the destination for your
Hadoop cluster is outside of the firewall. This technique looks at using the HttpFS
gateway to provide HDFS access over port 80, which is often opened up on
firewalls.
Problem
You want to pull data out of HDFS, but you’re sitting behind a firewall that’s
restricting access to HDFS.
Solution
Use the HttpFS gateway, which is a standalone server that provides access to
HDFS over HTTP. Because it’s a separate service and it’s HTTP, it can be
configured to run on any host that has access to the Hadoop nodes, and you can
open a firewall rule to allow traffic to the service.

Technique: Mounting Hadoop with NFS


Often it’s a lot easier to work with Hadoop data if it’s accessible as a regular
mount to your file system. This allows you to use existing scripts, tools, and
programming languages and easily interact with your data in HDFS. This section
looks at how you can easily copy data out of HDFS using an NFS mount.
Problem
You want to treat HDFS as a regular Linux filesystem and use standard Linux tools
to interact with HDFS.
Solution
Use Hadoop’s NFS implementation to access data in HDFS.

Technique: Using DistCp to copy data out of Hadoop


Imagine that you have a large amount of data you want to move out of
Hadoop. With most of the techniques in this section, you have a bottleneck because
you’re funneling the data through a single host, which is the host on which you’re
running the process. To optimize data movement as much as possible, you want to
leverage MapReduce to copy data in parallel. This is where DistCp comes into
play, and this technique examines one way you can pull out data to an NFS mount.
Problem
You want to efficiently pull data out of Hadoop and parallelize the copy.
Solution
Use DistCp.
Technique: Using Java to extract files
Let’s say you’ve generated a number of Lucene indexes in HDFS, and you want to
pull them out to an external host. Maybe you want to manipulate the files in some
way using Java. This technique shows how the Java HDFS API can be used to read
data in HDFS.
Problem
You want to copy files in HDFS to the local filesystem.
Solution
Use Hadoop’s filesystem API to copy data out of HDFS.
Discussion
The HDFS Java API is nicely integrated with Java’s I/O model, which
means you can work with regular input streams and output streams for I/O.
To start off, you need to create a file in HDFS using the command line:
$ echo "hello world" | hadoop fs -put - hdfs-file.txt
Now copy that file to the local filesystem using the command line:
$ hadoop fs -get hdfs-file.txt local-file.txt
Let’s explore how you can replicate this copy in Java. There are two main parts to
writing the code to do this—the first part is getting a handle to the FileSystem and
opening the HDFS file, and the second part is copying the data from the HDFS
InputStream to the local file's OutputStream:
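A minimal sketch of those two parts (the class name is illustrative and positional arguments are used here, whereas the book's hip listing takes --input and --output options):

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyHdfsFileToLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        InputStream in = fs.open(new Path(args[0]));      // open the HDFS file
        OutputStream out = new FileOutputStream(args[1]); // local destination file
        IOUtils.copyBytes(in, out, 4096, true);           // copy and close both streams
    }
}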
You can see how this code works in practice by running the following command:
$ echo "the cat" | hadoop fs -put - hdfs-file.txt
$ hip hip.ch5.CopyHdfsFileToLocal \
--input hdfs-file.txt \
--output local-file.txt
$ cat local-file.txt
the cat

Automated file egress


Up until now you’ve seen different options for copying data out of HDFS.
Most of these mechanisms don’t have automation or scheduling capabilities;
they’re ultimately low-level methods for accessing data.
If you’re looking to automate your data copy, you can wrap one of these
low-level techniques inside of a scheduling engine such as cron or Quartz.
However, if you’re looking for out-of-the-box automation, then this section is for
you.
Earlier in this chapter we looked at two mechanisms that can move
semistructured and binary data into HDFS: the open source HDFS File Slurper
project, and Oozie, which triggers a data ingress workflow.
The challenge in using a local filesystem for egress (and ingress for that
matter) is that map and reduce tasks running on clusters won’t have access to the
filesystem on a specific server. You have three broad options for moving data from
HDFS to a filesystem:
 You can host a proxy tier on a server, such as a web server, which you
would then write to using MapReduce.
 You can write to the local filesystem in MapReduce and then, as a
postprocessing step, trigger a script on the remote server to move that data
 You can run a process on the remote server to pull data from HDFS directly.
The third option is the preferred approach because it’s the simplest and most
efficient, and as such it’s the focus of this section. We’ll look at how you can use
the HDFS File Slurper to automatically move files from HDFS out to a local
filesystem.

Technique: An automated mechanism to export files from HDFS


Let’s say you have files being written in HDFS by MapReduce, and you
want to automate their extraction to a local filesystem. This kind of feature isn’t
supported by any Hadoop tools, so you have to look elsewhere.
Problem: You want to automate moving files from HDFS to a local filesystem.
Solution: The HDFS File Slurper can be used to copy files from HDFS to a local
filesystem.

Databases
Databases are usually the target of Hadoop data egress in one of two
circumstances: either when you move data back into production databases to be
used by production systems, or when you move data into OLAP databases to
perform business intelligence and analytics functions.
In this section we’ll use Apache Sqoop to export data from Hadoop to a MySQL
database. Sqoop is a tool that simplifies database imports and exports. Sqoop is
covered in detail in technique above.
We’ll walk through the process of exporting data from HDFS to MySQL using Sqoop.
We’ll also cover methods for using the regular connector, as well as how to
perform bulk imports using the fast connector.

Technique: Using Sqoop to export data to MySQL


Hadoop excels at performing operations at scales that defeat most relational
databases, so it’s common to extract OLTP data into HDFS, perform some
analysis, and then export it back out to a database.
Problem
You want to write data to relational databases, and at the same time ensure
that writes are idempotent.
Solution
This technique covers how Sqoop can be used to export text files to a
relational database and also looks at how Sqoop can be configured to work with
files with custom field and record delimiters. We’ll also cover idempotent exports
to make sure that failed exports don’t leave your database in an inconsistent state.
Idempotent exports
The Sqoop map tasks that perform the exports use multiple transactions for
their database writes. If a Sqoop export MapReduce job fails, your table could
contain partial writes. For idempotent database writes, Sqoop can be instructed to
perform the MapReduce writes to the staging table. After successful job
completion, the staging table is moved to the target table in a single transaction,
which is idempotent. You can see the sequence of events in the figure below.
Figure: Sqoop staging sequence of events, which helps ensure idempotent
writes

Direct exports
You used the fast connector in the import technique, which was an optimization
that used the mysqldump utility. Sqoop exports also support using the fast
connector, which uses the mysqlimport tool. As with mysqldump, all of the nodes
in your cluster need to have mysqlimport installed and available in the path of the
user that’s used to run MapReduce tasks. And, as with the import, the fast connector only works with text output files.

Idempotent exports with mysqlimport


Sqoop doesn’t support using fast connectors in conjunction with a staging table,
which is how you achieve idempotent writes with regular connectors. But it’s still
possible to achieve idempotent writes with fast connectors with a little extra work
at your end. You need to use the fast connector to write to a staging table, and then
trigger the INSERT statement, which atomically copies the data into the target
table.

NoSQL
MapReduce is a powerful and efficient way to bulk-load data into external
systems. So far we’ve covered how Sqoop can be used to load relational data, and
now we’ll look at NoSQL systems, and specifically HBase.
Apache HBase is a distributed key/value, column-oriented data store. Earlier
in this chapter we looked at how to import data from HBase into HDFS, as well as
how to use HBase as a data source for a MapReduce job.
The most efficient way to load data into HBase is via its built-in bulk-
loading mechanism, which is described in detail on the HBase wiki page titled
“Bulk Loading” at https://hbase.apache.org/book/arch.bulk.load.html. But this
approach bypasses the write-ahead log (WAL), which means that the data being
loaded isn’t replicated to slave HBase nodes.
HBase also comes with an org.apache.hadoop.hbase.mapreduce.Import
class, which will load HBase tables from HDFS, similar to how the equivalent
Export class worked earlier in this chapter. But you must have your data in SequenceFile
form, which has disadvantages, including no support for versioning.
You can also use the TableOutputFormat class in your own MapReduce job
to export data to HBase, but this approach is slower than the bulk-loading tool.
We’ve now concluded our examination of Hadoop egress tools. We covered how
you can use the HDFS File Slurper to move data out to a filesystem and how to use
Sqoop for idempotent writes to relational databases, and we wrapped up with a
look at ways to move Hadoop data into HBase.
Understanding inputs and outputs of MapReduce
Big Data Processing employs the Map Reduce Programming Model.
A job means a Map Reduce Program. Each job consists of several smaller
unit, called MapReduce Tasks.
A software execution framework in MapReduce programming defines the
parallel tasks.
The Hadoop MapReduce implementation uses Java framework.

Fig: MapReduce Programming Model

The model defines two important tasks, namely Map and Reduce.
Map takes input data set as pieces of data and maps them on various nodes
for parallel processing.
The reduce task, which takes the output from the maps as an input and
combines those data pieces into a smaller set of data. A reduce task always run
after the map task(s).
Many real-world situations are expressible using this model.
Inner join: It is the default natural join. It refers to two tables that join based
on common columns mentioned using the ON clause. Inner Join returns all rows
from both tables if the columns match.
Node refers to a place for storing data, data block or read or write
computations.
Data center in a DB refers to a collection of related nodes. Many nodes
form a data center or rack.
Cluster refers to a collection of many nodes.
Keyspace means a namespace to group multiple column families, especially
one per partition.
Indexing to a field means providing reference to a field in a document of
collections that support the queries and operations using that index. A DB creates
an index on the _id field of every collection.

The input data is in the form of an HDFS file. The output of the task also
gets stored in the HDFS.
The compute nodes and the storage nodes are the same at a cluster, that is,
the MapReduce program and the HDFS are running on the same set of nodes.
Fig: MapReduce process on client submitting a job

Figure above shows MapReduce process when a client submits a job, and
the succeeding actions by the JobTracker and TaskTracker.
JobTracker and TaskTracker
MapReduce consists of a single master JobTracker and one slave TaskTracker per cluster node.
The master is responsible for scheduling the component tasks in a job onto
the slaves, monitoring them and re-executing the failed tasks.
The slaves execute the tasks as directed by the master.
The data for a MapReduce task is initially at input files. The input files
typically reside in the HDFS. The files may be line-based log files, binary format
file, multiline input records, or something else entirely different.
The MapReduce framework operates entirely on key-value pairs. The
framework views the input to the task as a set of (key, value) pairs and produces a
set of (key, value) pairs as the output of the task, possibly of different types.
Map-Tasks
Map task means a task that implements a map(), which runs user
application code for each key-value pair (k1, v1). Key k1 is a set of keys. Key k1
maps to a group of data values. Values v1 are a large string which is read from the
input file(s).
The output of map() would be zero (when no values are found) or
intermediate key-value pairs (k2, v2). The value v2 is the information for the
transformation operation at the reduce task using aggregation or other reducing
functions.
Reduce task refers to a task which takes the output v2 from the map as an
input and combines those data pieces into a smaller set of data using a combiner.
The reduce task is always performed after the map task.
The Mapper performs a function on individual values in a dataset
irrespective of the data size of the input. That means that the Mapper works on a
single data set.
Fig: Logical view of functioning of map()

The Hadoop Java API includes the Mapper class. An abstract function, map(), is
present in the Mapper class. Any specific Mapper implementation should be a
subclass of this class and override the abstract function map().

The Sample Code for Mapper Class


public class SampleMapper extends Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Context context) throws
        IOException, InterruptedException
    {..}
}
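As a concrete counterpart to the abstract sample above, here is a minimal word-count Mapper sketch (the class name is illustrative); it emits (word, 1) for every token in the input line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line of text
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1) for every token
        }
    }
}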
Individual Mappers do not communicate with each other.
Number of Maps: The number of maps depends on the size of the input
files, i.e., the total number of blocks of the input files.
If the input files are of 1TB in size and the block size is 128 MB, there will
be 8192 maps. The number of map task Nmap can be explicitly set by using
setNumMapTasks(int). Suggested number is nearly 10-100 maps per node. Nmap
can be set even higher.

Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value
pairs as input and output. Data should be first converted into key-value pairs before
it is passed to the Mapper, as the Mapper only understands key-value pairs of data.

Key-value pairs in Hadoop MapReduce are generated as follows:


InputSplit - Defines a logical representation of data and presents a Split data
for processing at individual map().
RecordReader - Communicates with the InputSplit and converts the Split
into records which are in the form of key-value pairs in a format suitable for
reading by the Mapper.
RecordReader uses TextInputFormat by default for converting data into
key-value pairs.
RecordReader communicates with the InputSplit until the file is read.

Fig: Key-value pairing in MapReduce


Figure above shows the steps in MapReduce key-value pairing.
Generation of a key-value pair in MapReduce depends on the dataset and the
required output. Also, the functions use the key-value pairs at four places: map()
input, map() output, reduce() input and reduce() output.
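For example, in a word-count job (a standard illustration): the map() input pair is (byte offset of a line, the line of text), the map() output pairs are (word, 1), the reduce() input is (word, list of 1s collected for that word), and the reduce() output is (word, total count).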

Grouping by Key
When a map task completes, the shuffle process aggregates (combines) all the
Mapper outputs by grouping the key-values of the Mapper output, and the values v2
are appended into a list of values. A "Group By" operation on the intermediate keys creates this list.

Shuffle and Sorting Phase


All pairs with the same group key (k2) collect and group together, creating
one group for each key.
Shuffle output format will be a List of <k2, List (v2)>. Thus, a different
subset of the intermediate key space assigns to each reduce node.
These subsets of the intermediate keys (known as "partitions") are inputs to
the reduce tasks.
Each reduce task is responsible for reducing the values associated with its
partition. The framework sorts each partition on a single node automatically before it
is input to the Reducer.

Partitioning
 The Partitioner does the partitioning. The partitions are the semi-mappers in
MapReduce.
 Partitioner is an optional class. MapReduce driver class can specify the
Partitioner.
 A partition processes the output of map tasks before submitting it to Reducer
tasks.
 Partitioner function executes on each machine that performs a map task.
 Partitioner is an optimization in MapReduce that allows local partitioning
before reduce-task phase.
 The same codes implement the Partitioner, Combiner as well as reduce()
functions.
 Functions for the Partitioner and sorting functions are at the mapping node.
 The main function of a Partitioner is to route all map output records with the
same key to the same partition (and hence to the same reduce task), as illustrated in the sketch below.
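A minimal custom Partitioner sketch (hypothetical: it routes word-count keys by their first letter, so all records with the same key land on the same reducer); it would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (s.isEmpty()) {
            return 0; // guard against empty keys
        }
        // Keys sharing a first letter (and hence identical keys) map to the same partition
        return Character.toLowerCase(s.charAt(0)) % numPartitions;
    }
}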

Combiners
Combiners are semi-reducers in MapReduce. Combiner is an optional class.
MapReduce driver class can specify the combiner.
The combiner() executes on each machine that performs a map task.
Combiners optimize MapReduce task that locally aggregates before the shuffle
and sort phase.
The same codes implement both the combiner and the reduce functions,
combiner() on map node and reducer() on reducer node.
The main function of a Combiner is to consolidate the map output records
with the same key.
The output (key-value collection) of the combiner transfers over the network
to the Reducer task as input.
This limits the volume of data transfer between map and reduce tasks, and
thus reduces the cost of data transfer across the network. Combiners use grouping
by key for carrying out this function.

The combiner works as follows:


 It does not have its own interface and it must implement the interface at
reduce().
 It operates on each map output key. It must have the same input and output
key-value types as the Reducer class.
 It can produce summary information from a large dataset because it replaces
the original Map output with fewer records or smaller records.
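Because a word-count reducer meets these constraints—its input and output types are both (Text, IntWritable)—it can be reused as the combiner. A minimal driver sketch (assuming a TokenizerMapper like the one shown earlier and an IntSumReducer like the one sketched under Reduce Tasks below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}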

Reduce Tasks
The Hadoop Java API includes the Reducer class. An abstract function, reduce(), is
in the Reducer class.
 Any specific Reducer implementation should be subclass of this class and
override the abstract reduce().
 Reduce task implements reduce() that takes the Mapper output (which
shuffles and sorts), which is grouped by key-values (k2, v2) and applies it in
parallel to each group.
 Intermediate pairs are at input of each Reducer in order after sorting using
the key.
 Reduce function iterates over the list of values associated with a key and
produces outputs such as aggregations and statistics.
 The reduce function sends zero or more key-value pairs (k3, v3) to the final
output file. Reduce: {(k2, list(v2))} -> list(k3, v3)

Sample code for Reducer Class


public class ExampleReducer extends Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values, Context context) throws
        IOException, InterruptedException
    {... }
}
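A concrete counterpart to the abstract sample above—a minimal word-count Reducer sketch (the class name is illustrative); it sums the list of counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) { // iterate over all counts for this word
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);      // emit (word, total count)
    }
}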

Details of Map Reduce processing Steps.


Fig: Map Reduce Execution steps

Execution of a MapReduce job does not depend on how the distributed
processing is implemented. Rather, the execution involves the formatting
(transforming) of data at each step.
Figure above shows the execution steps, data flow, splitting, partitioning
and sorting on a map node and reducer on reducer node.

How to write a Hadoop Map class

Subclass from MapReduceBase and implement the Mapper interface.
public class MyMapper extends MapReduceBase implements Mapper { ... }
The Mapper interface provides a single method:
public void map(K key, V val, OutputCollector output, Reporter reporter)
WritableComparable key: the input key for the record
Writable value: the input value for the record
OutputCollector output: has the collect() method to output a (key, value) pair
Reporter reporter: allows the application code to report progress and update status.
• The Hadoop system divides the input data into logical “records” and then
calls map() once for each record.
• For text files, a record is one line of text.
• The key then is the byte-offset and the value is a line from the text file.
• For other input types, it can be defined differently.
• The main method is responsible for setting output key values and value
types.

How to write a Hadoop Reduce class


Subclass from MapReduceBase and implement the Reducer interface.
public class MyReducer extends MapReduceBase implements Reducer {...}
The Reducer interface provides a single method:
public void reduce(K key, Iterator values, OutputCollector output, Reporter
reporter)
WritableComparable key: the key for which values are being reduced
Iterator values: an iterator over all of the values emitted for this key
OutputCollector output: has the collect() method to output a (key, value) pair
Reporter reporter: allows the application code to report progress and update status.
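A minimal driver sketch for this older org.apache.hadoop.mapred API, assuming the MyMapper and MyReducer classes above and word-count-style output types (the job name and paths are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJob.class);
        conf.setJobName("myjob");
        conf.setOutputKeyClass(Text.class);          // main() sets the output key type...
        conf.setOutputValueClass(IntWritable.class); // ...and the output value type
        conf.setMapperClass(MyMapper.class);
        conf.setReducerClass(MyReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                      // submit the job and wait for completion
    }
}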

Coping with Node Failure


The primary way using which Hadoop achieves fault tolerance is through
restarting the tasks.
 Each task nodes (TaskTracker) regularly communicates with the master
node, JobTracker. If a TaskTracker fails to communicate with the
JobTracker for a pre-defined period (by default, it is set to 10 minutes), a
task node failure by the JobTracker is assumed.
 The JobTracker knows which map and reduce tasks were assigned to each
TaskTracker.
 If the job is currently in the mapping phase, then another TaskTracker will
be assigned to re-execute all map tasks previously run by the failed
TaskTracker.
 If the job is in the reducing phase, then another TaskTracker will re-execute
all reduce tasks that were in progress on the failed TaskTracker.
 Once reduce tasks are completed, the output writes back to the HDFS. Thus,
if a TaskTracker has already completed nine out of ten reduce tasks assigned
to it, only the tenth task must execute at a different node.

The failure of JobTracker (if only one master node) can bring the entire
process down; Master handles other failures, and the MapReduce job eventually
completes.
When the Master compute-node at which the JobTracker is executing fails,
then the entire MapReduce job must restart. Following points summarize the
coping mechanism with distinct Node Failures:
 Map TaskTracker failure:
- Map tasks completed or in-progress at TaskTracker, are reset to idle on
failure
- Reduce TaskTracker gets a notice when a task is rescheduled on another
TaskTracker
 Reduce TaskTracker failure:
- Only in-progress tasks are reset to idle
 Master JobTracker failure:
- Map-Reduce task aborts and notifies the client (in case of one master
node).

Data serialization
Data serialization is the process of converting data objects present in
complex data structures into a byte stream for storage, transfer and distribution
purposes on physical devices.
Once the serialized data is transmitted, the reverse process of creating objects
from the byte sequence is called deserialization.

HOW IT WORKS?
 Computer data is generally organized in data structures such as arrays,
tables, trees, classes. When data structures need to be stored or transmitted
to another location, such as across a network, they are serialized.
 Serialization becomes complex for nested data structures and object
references.
What are Data Serialization Storage format?
Storage formats are a way to define how information is stored in the file.
Most of the time, this information can be assumed from the extension of the data.
Both structured and unstructured data can be stored on Hadoop-enabled systems.
Common HDFS file formats are:
• Plain text storage
• Sequence files
• RC files
• AVRO
• Parquet

Why Storage Formats?


• The file format must be able to handle complex data structures.
• HDFS-enabled applications take time to find relevant data in a
particular location and to write data back to another location.
• Dataset is large
• Having schemas
• Having storage constraints
Why choose different File Formats?
Proper selection of file format leads to
• Faster read time
• Faster write time
• Splittable files (for partial data read)
• Schema evolution support (modifying dataset fields)
• Advance compression support
• Snappy compression offers high-speed compression and
decompression with a reasonable compression ratio.
• File formats help to manage Diverse data.
Guide to Data Serialization in Hadoop
• Data serialization is a process to format structured data in such a
way that it can be reconverted back to the original form.
• Serialization is done to translate data structures into a stream of
data. This stream of data can be transmitted over the network or
stored in DB regardless of the system architecture.
• Isn't simply storing information in binary form, as a stream of bytes,
the right approach?
• Serialization does the same, but in a way that isn't dependent on the system architecture.
Consider CSV files: a comma (,) may also appear inside the data itself, so
deserialization can produce wrong output. If, however, the metadata is stored in XML
form, which is a self-describing form of data storage, the data can easily be
deserialized in the future.

Why Data Serialization for Storage Formats?


• To process records faster (Time-bound).
• When a proper data format needs to be maintained and data must be
transmitted to another end that has no schema support.
• If data without structure or format has to be processed in the future,
complex errors may occur.
• Serialization offers data validation over transmission.
Areas of Serialization for Storage Formats
To maintain a proper data format, a serialization system must have the
following four properties –
• Compact – helps in the best use of network bandwidth
• Fast – reduces the performance overhead
• Extensible – can match new requirements
• Inter-operable – not language-specific
Serialization in Hadoop has two areas –
Inter-process communication
When a client calls a function or subroutine on another machine in the
network or on a server, that procedure call is known as a
remote procedure call (RPC).
Persistent storage
It is better than java ‘s inbuilt serialization as java serialization
isn’ t compact Serialization and Deserialization of data helps in
maintaining and managing corporate decisions for effective use of
resources and data available in Data warehouse or any other database
-writable – language specific to java
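Hadoop implements its own Java RPC on top of Writable serialization; the following Python sketch only illustrates the general RPC idea using the standard-library xmlrpc modules, where arguments and results are serialized for transport (the hostname and port are arbitrary):

from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client
import threading

def add(a, b):
    return a + b                      # executes in the "server" process

# Start a toy RPC server in a background thread
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client stub serializes the arguments, sends them over the network,
# and deserializes the returned result
proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
print(proxy.add(2, 3))                # -> 5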

Text-based Data Serialization formats and their key features

Here are some common ones:


 XML (Extensible Markup Language) –
o Nested textual format. Human-readable and editable.
o Schema-based validation.
o Used in metadata applications, web services data transfer, web
publishing.
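A small XML serialization/deserialization sketch in Python, using the standard-library xml.etree.ElementTree module (the element names are made up):

import xml.etree.ElementTree as ET

# Build a small XML document in memory
user = ET.Element("user")
ET.SubElement(user, "name").text = "user1"
ET.SubElement(user, "city").text = "Chennai"

xml_bytes = ET.tostring(user)         # serialize the tree to a byte string
print(xml_bytes.decode())             # <user><name>user1</name><city>Chennai</city></user>

parsed = ET.fromstring(xml_bytes)     # deserialize back into an element tree
print(parsed.find("name").text)       # -> user1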

 CSV (Comma-Separated Values)


o Table structure with delimiters.
o Human-readable textual data.
o Opens as spreadsheet or plaintext.
o Used as a plaintext database.
o CSV is the most commonly used data file format.
o Easy to read, easy to parse, and easy to export from an RDBMS table.
It has three major drawbacks when used for HDFS.
1. Every line in a CSV file is treated as a record, so headers and footers
should not be included. In other words, a CSV file stored in HDFS cannot
carry any metadata.
2. CSV has very limited support for schema evolution. Because the fields of
each record are positional, their order cannot be changed; new fields can
only be appended to the end of each line.
3. It does not support block compression, which many other file formats do.
The whole file has to be compressed and decompressed for reading, adding a
significant read performance cost.
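A short CSV sketch in Python using the standard-library csv module; note how a field that itself contains a comma has to be quoted, which is exactly the ambiguity mentioned earlier (the rows shown are made up):

import csv, io

# Write two records; csv automatically quotes the field containing a comma
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["1", "Hyderabad, Telangana", "2024"])
writer.writerow(["2", "Chennai", "2023"])
print(buf.getvalue())                     # 1,"Hyderabad, Telangana",2024 ...

# Read the records back; the quoted comma is not treated as a delimiter
for row in csv.reader(io.StringIO(buf.getvalue())):
    print(row)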

 JSON (JavaScript Object Notation)


o Short syntax textual format with limited data types.
o Human readable. Derived from JavaScript data formats.
o No need for a separate parser (unlike XML), since JSON maps directly to
JavaScript objects.
o Can be fetched with an XMLHttpRequest call.
o No direct support for DATE data type.
o All data is dynamically processed.
o Popular format for web API parameter passing.
o Mobile apps use this extensively for user interaction and database
services.
o It is a text format that stores metadata (field names) together with the
data, so it fully supports schema evolution and is also splittable.
o Attributes can easily be added to or removed from each record. However,
because it is a text format, it does not support block compression.
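A minimal JSON sketch in Python using the standard-library json module (the record is made up):

import json

record = {"id": 7, "name": "user1", "tags": ["hadoop", "hdfs"]}

text = json.dumps(record)                 # serialize to a JSON string
print(text)

restored = json.loads(text)               # deserialize back into a dictionary
print(restored["name"])                   # -> user1

# Schema evolution is simple: every record carries its own field names,
# so a new attribute can be added without breaking existing readers
record["active"] = True
print(json.dumps(record))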

 YAML (YAML Ain't Markup Language)


o It is a data serialization language designed to be human-friendly and to
work well with other programming languages for everyday tasks.
o Superset of JSON
o Supports complex data types. Maps easily to native data structures.
o Lightweight text format.
o Human-readable.
o Supports comments and thus easily editable.
o Used in configuration settings, document headers, and applications that
need MySQL-style self-references in relational data.
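A small YAML sketch in Python, assuming the third-party PyYAML package (pip install pyyaml); the configuration shown is made up:

import yaml   # third-party PyYAML package

config_text = """
# comments are allowed, unlike JSON
name: user1
ports: [8020, 9870]
options:
  replication: 3
"""

config = yaml.safe_load(config_text)        # deserialize YAML text into Python objects
print(config["options"]["replication"])     # -> 3

print(yaml.safe_dump(config))               # serialize the objects back to YAML text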

Binary Data Serialization formats and their key features


Here are some common ones:
 BSON (Binary JSON)
o It is a binary-encoded serialization of JSON-like documents.
o MongoDB uses BSON when storing documents in collections.
o It deals with attribute–value pairs like JSON.
o Includes datetime, bytearray, and other data types not present in JSON.
o Binary format, not human-readable.
o Used in web apps with rich media data types such as live video.
o Primary use is storage, not network communication.
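A small BSON sketch in Python, assuming the bson package distributed with PyMongo (recent versions expose module-level encode/decode; the document shown is made up):

import datetime
import bson   # assumption: the bson package shipped with PyMongo

# datetime is a native BSON type, unlike in plain JSON
doc = {"user": "user1", "created": datetime.datetime(2024, 1, 1)}

data = bson.encode(doc)          # binary, not human-readable
restored = bson.decode(data)
print(restored["created"])       # -> 2024-01-01 00:00:00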
 MessagePack
o It is designed for data to be transparently converted from/to JSON.
o Supports a rich set of data structures.
o It creates schema-based annotations.
o Primary use is network communication, not storage
o Compressed binary format, not human-readable.
o Supports static typing.
o Supports RPC.
o Used in apps with distributed file systems.
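A minimal MessagePack sketch in Python, assuming the third-party msgpack package (pip install msgpack); the record is made up:

import msgpack   # third-party package

record = {"id": 7, "name": "user1", "scores": [85, 92]}

packed = msgpack.packb(record)       # compact binary encoding
restored = msgpack.unpackb(packed)   # back to a Python dictionary
print(len(packed), restored)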

 protobuf (Protocol Buffers)


o Created by Google.
o It is Google's language-neutral, platform-neutral, extensible mechanism
for serializing structured data.
o Protocol Buffers currently supports generated code in Java, Python,
Objective-C, and C++, among other languages.
o Binary message format that allows programmers to specify a schema
for the data.
o Also includes a set of rules and tools to define and exchange these
messages.
o Transparent data compression.
o Used in multi-platform applications due to easy interoperability
between languages.
o Universal RPC framework.
o Used in performance-critical distributed applications.
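A sketch of Protocol Buffers usage in Python, assuming a hypothetical Person message compiled with protoc into a generated person_pb2 module:

# Assumed message definition (person.proto), compiled with:
#   protoc --python_out=. person.proto
#
#   syntax = "proto3";
#   message Person {
#     string name = 1;
#     int32 id = 2;
#   }
import person_pb2   # hypothetical generated module

person = person_pb2.Person(name="user1", id=7)

data = person.SerializeToString()    # compact binary wire format
restored = person_pb2.Person()
restored.ParseFromString(data)       # schema-aware deserialization
print(restored.name, restored.id)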

• AVRO
o Apache Avro is a language-neutral data serialization system,
developed by Doug Cutting, the father of Hadoop.
o It is also called a schema-based serialization technique.
FEATURES
o Avro uses JSON format to declare the data structures.
o Presently, it supports languages such as Java, C, C++, C#, Python, and
Ruby.
o Avro creates a binary structured format that is both compressible and
splittable. Hence, it can be used efficiently as the input to Hadoop
MapReduce jobs.
o Avro provides rich data structures.
o Avro schemas defined in JSON, facilitate implementation in the
languages that already have JSON libraries.
o Avro creates a self-describing file named Avro Data File, in which it
stores data along with its schema in the metadata section.
o Avro is also used in Remote Procedure Calls (RPCs).
o Thrift and Protocol Buffers are the frameworks most comparable to Avro.
Avro differs from them in the following ways:
o Avro supports both dynamic and static types as per the
requirement. Protocol Buffers and Thrift use Interface
Definition Languages (IDLs) to specify schemas and their
types. These IDLs are used to generate code for serialization
and deserialization.
o Avro is built into the Hadoop ecosystem, while Thrift and Protocol
Buffers are not.
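A short Avro sketch in Python, assuming the third-party avro package; the schema and file name are made up, and the schema-parsing function is spelled parse or Parse depending on the library version:

import json
import avro.schema
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

# Avro schemas are declared in JSON
schema_json = json.dumps({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "favorite_number", "type": "int"}]
})
schema = avro.schema.parse(schema_json)   # may be avro.schema.Parse in some versions

# Write an Avro data file: the schema is stored in the file's metadata
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.close()

# Read it back: the file is self-describing, so no external schema is needed
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()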
APPLICATIONS OF DATA SERIALIZATION
• Serialization allows a program to save the state of an object and recreate it
when needed.
• Persisting data onto files – happens mostly in language-neutral formats
such as CSV or XML. However, most languages also allow objects to be
serialized directly into binary form using APIs.
• Storing data into Databases – when program objects are converted into
byte streams and then stored into DBs, such as in Java JDBC.
• Transferring data through the network – such as web applications and
mobile apps passing on objects from client to server and vice versa.
• Sharing data in a Distributed Object Model – when programs written in
different languages need to share object data over a distributed network.
However, SOAP, REST, and other web services have largely replaced such
applications now.
POTENTIAL RISK DUE TO SERIALIZATION
• It may allow a malicious party with access to the serialization byte stream
to read private data, create objects with illegal or dangerous state, or obtain
references to the private fields of deserialized objects. Workarounds are
tedious and not guaranteed.
• Open formats too have their security issues.
• XML might be tampered with using external entities such as macros or
unverified schema files.
• JSON data is vulnerable to attack when directly passed to a JavaScript
engine due to features like JSONP requests.
PERFORMANCE CHARACTERISTICS
• Speed – Binary formats are faster than textual formats. A late entrant,
protobuf reports the best times. JSON is still preferable when readability
and schema-less flexibility matter.
• Data size – This refers to the physical space in bytes after serialization.
For small data, even compressed JSON occupies more space than binary
formats like protobuf. In general, binary formats occupy less space.
• Usability – Human-readable formats like JSON are naturally preferred over
binary formats. For editing data, YAML is good. Schema definition is easy
in protobuf, with built-in tools.
• Compatibility / Extensibility – JSON is a closed format. XML is average,
with schema versioning. Backward compatibility (extending schemas) is best
handled by protobuf.
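A rough size-comparison sketch in Python, reusing the third-party msgpack package mentioned above; real conclusions should come from measurements on representative data:

import json
import msgpack   # third-party package

record = {"id": 7, "name": "user1", "scores": [85, 92, 78]}

json_bytes = json.dumps(record).encode("utf-8")   # textual encoding
packed_bytes = msgpack.packb(record)              # binary encoding

# On this toy record the binary form is noticeably smaller than the JSON text
print(len(json_bytes), len(packed_bytes))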
