Unit III
Introduction to Hadoop: Big Data – Apache Hadoop & Hadoop Eco System –
Moving Data in and out of Hadoop – Understanding inputs and outputs of
MapReduce - Data Serialization.
Features of DFS
Transparency
• Structure transparency: There is no need for the client to know
about the number or locations of file servers and the storage devices.
Multiple file servers should be provided for performance, adaptability, and
dependability.
• Access transparency: Both local and remote files should be
accessible in the same manner. The file system should automatically
locate the accessed file and send it to the client's side.
• Naming transparency: There should not be any hint in the name
of the file to the location of the file. Once a name is given to the file, it
should not be changed while transferring from one node to another.
• Replication transparency: If a file is copied on multiple nodes,
both the copies of the file and their locations should be hidden from one node
to another.
Simplicity and ease of use: The user interface of a file system should be
simple and the number of commands in the file system should be small.
Introduction to Hadoop:
Hadoop is an open-source project of the Apache foundation.
Hadoop is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may
be prone to failures.
In simple words, Hadoop is a software library that allows its users to
process large datasets across distributed clusters of computers, thereby
enabling them to gather, store and analyze huge sets of data.
Hadoop is now a core part of the computing infrastructure for companies
such as Yahoo, Facebook, LinkedIn, Twitter, etc.
Features of Hadoop
1. It is optimized to handle massive quantities of structured, semi-structured
and unstructured data, using commodity hardware, that is, relatively
inexpensive computers.
2. Hadoop has a shared-nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the
data can still be processed from another machine that stores its replica.
4. Hadoop is built for high throughput rather than low latency. It performs batch
operations handling massive quantities of data; therefore the response time is not
immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line
Analytical Processing (OLAP). However, it is not a replacement for a
relational database management system.
6. It is NOT good when work cannot be parallelized or when there are
dependencies within the data.
7. It is NOT good for processing small files. It works best with huge data files
and datasets.
Hadoop Versions:
As the data being stored and processed grows in volume and complexity, Hadoop
evolves as well: developers bring out new versions to address issues (bug
fixes) and to simplify complex data processes. The updates are incorporated
automatically, as Hadoop development follows the trunk (base code) – branch
(fix) model.
Hadoop has two versions:
Hadoop 1.x (Version 1)
Hadoop 2 (Version 2)
1. Hadoop 1.x
Below are the Components of Hadoop 1.x
1. The Hadoop Common Module is a jar file which acts as the base API on
top of which all the other components work.
2. Being the first version, it receives no new updates.
3. A maximum of 4,000 nodes per cluster.
4. Functionality is limited by the slot concept, i.e., each slot is
capable of running either a map task or a reduce task.
5. HDFS is used as the distributed storage system, designed to cater to
large data, with a block size of 64 Megabytes (64 MB) supporting the
architecture. It is further divided into two components:
Name Node, which stores metadata about the Data Nodes and is
placed with the Master Node. It contains details about the slave
nodes, indexing and their respective locations, along
with timestamps for timelining.
Data Nodes, which store the data related to the applications in
use and are placed in the Slave Nodes.
6. Hadoop 1 uses Map Reduce (MR) data processing model. It is not capable
of supporting other non-MR tools.
MR has two components:
Job Tracker is used to assign or reassign MapReduce tasks to an
application called the Task Tracker located in the cluster nodes. It
additionally maintains a log about the status of each Task Tracker.
Task Tracker is responsible for executing the tasks which have
been allocated by the Job Tracker and for sending the status report
of those tasks back to the Job Tracker.
7. The network of the cluster is formed by organizing the master node and
slave nodes.
8. Whenever a large storage operation for big data set is received by the
Hadoop system, the data is divided into decipherable and organized
blocks that are distributed into different nodes.
2. Hadoop Version 2
Version 2 of Hadoop was released to provide improvements over the
shortcomings that users faced with version 1.
Improvements that the new version provides:
HDFS Federation: The prior HDFS architecture allowed only a single
namespace for the entire cluster, managed by a single Name Node.
If that Name Node failed, the cluster as a whole would be out of
service and remained unavailable until the Name Node was restarted or
brought up on a separate machine. Federation overcomes this limitation by
adding support for multiple Name Nodes/namespaces (the layer responsible for
managing the directories, files and blocks) to HDFS.
YARN(Yet Another Resource Negotiator)
Version 2.7.x - Released on 31st May 2018: Provides two major
components, a per-application Application Master and a global Resource
Manager, thereby improving overall utility and versatility and
increasing scalability up to 10,000 nodes per cluster.
Version 2.8.x - Released in September 2018: The capacity scheduler is
designed to provide multi-tenancy support for processing data over Hadoop,
and it has been made accessible to Windows users so that there is an
increase in the rate of adoption of the software across the industry for
dealing with problems related to big data.
Hadoop EcoSystem
Apache Hadoop is an open source framework intended to make interaction
with big data easier.
However, for those who are not acquainted with this technology, one
question arises: what is big data?
Big data is a term given to data sets which can't be processed in an
efficient manner with the help of traditional methodology such as RDBMS.
Hadoop has made its place in the industries and companies that need to work
on large data sets which are sensitive and need efficient handling.
To process data using Hadoop, the data first needs to be loaded into Hadoop
clusters from several sources.
Sqoop is a tool designed to transfer such data between Hadoop and
relational databases. It is also a command-line interpreter, which sequentially
executes Sqoop commands.
Sqoop can be effectively used by non-programmers as well and relies on
underlying technologies like HDFS and MapReduce.
Sqoop Architecture
1. The client submits the import/ export command to import or export data.
2. Sqoop fetches data from different databases. Here, we have an enterprise
data warehouse, document-based systems, and a relational database. We
have a connector for each of these; connectors help to work with a range of
accessible databases.
3. Multiple mappers perform map tasks to load the data on to HDFS.
4. Similarly, numerous map tasks will export the data from HDFS on to
RDBMS using the Sqoop export command.
Sqoop Import
The diagram below represents the Sqoop import mechanism.
In this example, a company's data is present in the RDBMS. All this
metadata is sent to the Sqoop import. Sqoop then performs an introspection of the
database to gather metadata (primary key information).
It then submits a map-only job. Sqoop divides the input dataset into splits
and uses individual map tasks to push the splits to HDFS.
A few of the arguments commonly used in Sqoop import are --connect,
--username, --password, --table, --target-dir and --num-mappers.
Sqoop Export
1. The first step is to gather the metadata through introspection.
2. Sqoop then divides the input dataset into splits and uses individual map tasks
to push the splits to RDBMS.
A few of the arguments commonly used in Sqoop export are --connect,
--username, --table and --export-dir.
Flume
Apache Flume is a tool/service/data ingestion mechanism for collecting,
aggregating and transporting large amounts of streaming data, such as log
files and events, from various sources to a centralized data store.
Flume is also used for collecting data from various social media websites
such as Twitter and Facebook.
Flume is used for real-time data capturing in Hadoop.
It can be applied to assemble a wide variety of data such as network traffic,
data generated via social networking, business transaction data and emails.
Flume is a highly reliable, distributed, and configurable tool. It is principally
designed to copy streaming data (log data) from various web servers to
HDFS.
It has a simple and very flexible architecture based on streaming data flows.
It is quite robust and fault-tolerant.
ZooKeeper
Before Zookeeper, it was very difficult and time consuming to coordinate
between different services in Hadoop Ecosystem. The services earlier had many
problems with interactions like common configuration while synchronizing data.
Even if the services are configured, changes in the configurations of the services
make it complex and difficult to handle. The grouping and naming was also a time-
consuming factor.
Due to the above problems, Zookeeper was introduced. It saves a lot of time
by performing synchronization, configuration maintenance, grouping and naming.
Although it’s a simple service, it can be used to build powerful solutions.
Architecture of Zookeeper
Client-Server architecture is used by Apache Zookeeper. The
components that make up the Zookeeper architecture are as follows:
Server: When any client connects, the server sends an acknowledgment.
The client will automatically forward the message to another server if the
connected server doesn't respond.
Client: A client is one of the nodes in the distributed application cluster.
It helps you access information from the server. Each client
regularly sends a message to the server to notify it that the client is still alive.
Leader: A Leader server is chosen from the group of servers. It informs
clients that the server is still alive and provides access to all the data. If any
of the connected nodes fails, automatic recovery is carried out.
Follower: A follower is a server node that complies with the instructions of
the leader. Client read requests are handled by the associated Zookeeper
server. The Zookeeper leader responds to client write requests.
Ensemble/Cluster: A cluster or ensemble is a group of Zookeeper servers.
When running Apache ZooKeeper, you can use the ZooKeeper infrastructure in
cluster mode to keep the system functioning at its best.
ZooKeeperWebUI: You must utilize WebUI if you wish to deal with
ZooKeeper resource management. Instead of utilizing the command line, it
enables using the web user interface to interact with ZooKeeper. It allows
for a quick and efficient connection with the ZooKeeper application.
How does ZooKeeper work?
Hadoop ZooKeeper is a distributed application that uses a simple client-
server architecture, with clients acting as service-using nodes and servers as
service-providing nodes.
The ZooKeeper ensemble is the collective name for several server nodes.
One ZooKeeper client is connected to at least one ZooKeeper server at any one
time. Because a master node is dynamically selected by the ensemble in consensus,
an ensemble of Zookeeper is often an odd number, ensuring a majority vote.
If the master node fails, a new master is quickly selected and replaces the
failed master.
In addition to the master and slaves, Zookeeper also has observers. Observers
were brought in because scaling was a problem: the performance of writing
is impacted by the addition of slaves, because voting is an expensive
procedure. Therefore, observers are slaves that perform similar tasks to other
slaves but do not participate in voting.
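To make the client-server interaction concrete, here is a minimal sketch using the official ZooKeeper Java client API; the connection string, znode path and payload are invented for illustration, and a real application would add proper error handling.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Block until the session with the ensemble is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a small piece of shared configuration.
        String path = "/app-config";                       // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=64".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client connected to the ensemble reads the same value back.
        byte[] data = zk.getData(path, false, null);
        System.out.println("Shared config: " + new String(data));
        zk.close();
    }
}

As described above, the follower that the client is connected to serves such reads locally, while write requests are forwarded to the leader and acknowledged only after a majority of the ensemble agrees.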
Finally, another related application is Flume, a distributed, reliable and
available service for efficiently collecting, aggregating and moving large amounts
of log data into HDFS; hence it is used for data ingestion.
Oozie
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a
distributed environment.
It allows multiple complex jobs to be combined and run in a sequential order to
achieve a bigger task.
Within a sequence of tasks, two or more jobs can also be programmed to run
in parallel to each other.
One of the main advantages of Oozie is that it is tightly integrated with the
Hadoop stack, supporting various Hadoop jobs like Hive, Pig and Sqoop, as well as
system-specific jobs like Java and Shell.
Mahout
Mahout is renowned for machine learning.
Mahout provides an environment for creating scalable machine learning
applications. Machine learning algorithms allow us to build self-learning
machines that evolve by themselves without being explicitly programmed.
Based on user behaviour, data patterns and past experiences, they make
important future decisions.
You can call it a descendant of Artificial Intelligence (AI).
Mahout provides a command line to invoke various algorithms. It has a
predefined set of library which already contains different inbuilt algorithms
for different use cases.
What Mahout does?
It performs collaborative filtering, clustering and classification. Some people
also consider frequent itemset mining as Mahout's function. Let us understand
them individually:
1. Collaborative filtering: Mahout mines user behaviours, their patterns and
their characteristics, and based on that it predicts and makes recommendations
to the users. The typical use case is an e-commerce website.
2. Clustering: It organizes a similar group of data together like articles can
contain blogs, news, research papers etc.
3. Classification: It means classifying and categorizing data into various
subdepartments like articles can be categorized into blogs, news, essay,
research papers and other categories.
4. Frequent itemset mining: Here Mahout checks which objects are likely
to appear together and makes suggestions accordingly. For example, a cell
phone and a cover are generally bought together. So, if you search for a cell
phone, it will also recommend you the cover and cases.
R Connectors
Oracle R Connector for Hadoop is a collection of R packages that provide:
Interfaces to work with Hive tables, the Apache Hadoop compute
infrastructure, the local R environment, and Oracle database tables
Predictive analytic techniques, written in R or Java as Hadoop MapReduce
jobs, that can be applied to data in HDFS files
You install and load this package as you would any other R package. Using
simple R functions, you can perform tasks like these:
Access and transform HDFS data using a Hive-enabled transparency layer
Use the R language for writing mappers and reducers
Copy data between R memory, the local file system, HDFS, Hive, and
Oracle databases
Schedule R programs to execute as Hadoop MapReduce jobs and return the
results to any of those locations
HBase
For better understanding, let us take an example. You have billions of customer
emails and you need to find out the number of customers who have used the word
"complaint" in their emails. The request needs to be processed quickly (i.e. in real
time). So, here we are handling a large data set while retrieving a small amount of
data. HBase was designed for solving these kinds of problems.
YARN
Consider YARN as the brain of your Hadoop Ecosystem.
It performs all your processing activities by allocating resources and
scheduling tasks.
It has two major components, i.e. Resource Manager and Node Manager.
1. Resource Manager is again a main node in the processing department. It
receives the processing requests, and then passes the parts of requests to
corresponding Node Managers accordingly, where the actual processing
takes place.
2. Node Managers are installed on every Data Node. It is responsible for
execution of task on every single Data Node
Schedulers: Based on your application's resource requirements, Schedulers
perform scheduling algorithms and allocate the resources.
Applications Manager: The Applications Manager accepts job
submissions, negotiates the container (i.e. the Data Node environment where the
process executes) for executing the application-specific Application Master,
and monitors its progress. ApplicationMasters are the daemons which
reside on the DataNode and communicate with containers for execution of tasks
on each DataNode.
MAPREDUCE
MapReduce is a programming model and an associated implementation for
processing and generating large data sets using distributed and parallel
algorithms inside Hadoop environment.
Users specify a map function that processes a key/value pair to generate a set
of intermediate key/value pairs, and a reduce function that merges all
intermediate values associated with the same intermediate key.
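As a concrete sketch of this model, the classic word-count example is shown below using the standard org.apache.hadoop.mapreduce Java API: the map function emits an intermediate (word, 1) pair for every word, and the reduce function merges all counts that share the same word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map(): for every input line, emit an intermediate (word, 1) pair.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): all values for the same intermediate key arrive together,
    // so summing them yields the total count for that word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class that wires these classes into a job is sketched later in this unit, under Understanding inputs and outputs of MapReduce.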
HDFS
Hadoop Distributed File System is the core component or you can say, the
backbone of Hadoop Ecosystem.
HDFS is the one, which makes it possible to store different types of large
data sets (i.e. structured, unstructured and semi structured data).
HDFS creates a level of abstraction over the resources, from where we can
see the whole HDFS as a single unit.
It helps us in storing our data across various nodes and maintaining the log
file about the stored data (metadata).
HDFS has two core components, i.e. NameNode and DataNode.
1. The NameNode is the main node and it doesn’t store the actual data. It
contains metadata, just like a log file or you can say as a table of content.
Therefore, it requires less storage and high computational resources.
2. All your data is stored on the DataNodes and hence it requires more
storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That is the reason
why Hadoop solutions are very cost-effective.
You always communicate to the NameNode while writing the data. Then, it
internally sends a request to the client to store and replicate data on various
DataNodes.
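The interaction can be sketched with the Hadoop FileSystem Java API (the path and contents below are made up): the client asks the NameNode where to place or find blocks, while the bytes themselves flow between the client and the DataNodes.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (the NameNode address) is read from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        // Write: the NameNode records the metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy it to standard output.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}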
Ambari
Ambari is an Apache Software Foundation project which aims at making the
Hadoop ecosystem more manageable.
It includes software for provisioning, managing and monitoring Apache
Hadoop clusters.
The Ambari provides:
1. Hadoop cluster provisioning:
It gives us step by step process for installing Hadoop services
across a number of hosts.
It also handles configuration of Hadoop services over a cluster.
2. Hadoop cluster management:
It provides a central management service for starting, stopping and
reconfiguring Hadoop services across the cluster.
3. Hadoop cluster monitoring:
For monitoring health and status, Ambari provides us a dashboard.
The Ambari Alert framework is an alerting service which notifies
the user whenever attention is needed, for example, if a node
goes down or a node is running low on disk space.
Security Aspects
Apache Chukwa, given its close integration with Apache Hadoop, adheres to
the same security measures as its parent project. This includes Hadoop’s in-built
security features such as Kerberos for authentication and HDFS for encryption.
Performance
Apache Chukwa's performance is tightly linked with the underlying Hadoop
infrastructure, gaining advantage from Hadoop's robust scalability and fault
tolerance. However, the performance can be conditioned by the hardware resources
of the deployment and the overall load of data processing.
Avro
Avro is an open source project that provides data serialization and data
exchange services for Apache Hadoop. These services can be used together
or independently.
Avro facilitates the exchange of big data between programs written in any
language. With the serialization service, programs can efficiently serialize
data into files or into messages.
The data storage is compact and efficient.
Avro stores both the data definition and the data together in one message or
file.
Avro stores the data definition in JSON format making it easy to read and
interpret; the data itself is stored in binary format making it compact and
efficient.
Avro files include markers that can be used to split large data sets into
subsets suitable for Apache MapReduce processing. Some data exchange
services use a code generator to interpret the data definition and produce
code to access the data. Avro doesn't require this step, making it ideal for
scripting languages.
A key feature of Avro is robust support for data schemas that change over
time — often called schema evolution.
Avro handles schema changes like missing fields, added fields and changed
fields; as a result, old programs can read new data and new programs can
read old data.
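A short sketch of schema evolution with the Avro Java API follows; the record type and field names are invented. A record serialized with an old writer schema is read back with a newer reader schema, and the added field simply takes its declared default.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroEvolutionExample {
    public static void main(String[] args) throws Exception {
        // Old (writer) schema: only a name field.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // New (reader) schema: an added age field with a default value.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

        // Serialize a record using the old schema.
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("name", "alice");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
        encoder.flush();

        // Deserialize with the new schema: the missing field takes its default.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null);
        GenericRecord newRecord = new GenericDatumReader<GenericRecord>(
                writerSchema, readerSchema).read(null, decoder);
        System.out.println(newRecord);   // prints {"name": "alice", "age": -1}
    }
}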
Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored
using Avro can be passed from programs written in different languages,
even from a compiled language like C to a scripting language like Apache
Pig.
NoSQL
Most big data is stored in the form of key-value pairs, in what are also
known as NoSQL data stores.
NoSQL data stores are supported by databases such as Cassandra,
MongoDB and HBase.
Traditional SQL can be effectively used to handle large amounts of
structured data. But in big data, most of the information is in unstructured
form, so NoSQL is required to handle that information.
A NoSQL database stores unstructured data as well; it is not
forced to follow a particular fixed schema structure, and the schema keeps
changing dynamically. So, each row can have its own set of column values.
NoSQL gives better performance in storing massive amounts of data
compared to the SQL structure.
A NoSQL database is primarily a key-value store. It is also called a 'Column
Family' store because the data is stored column-wise in the form of key-value pairs.
Cassandra:
Another database which supports a NoSQL data model is
Cassandra.
Apache Cassandra is a highly scalable, distributed and high-performance
NoSQL database.
Cassandra is designed to handle huge amounts of information, and it
handles this huge data with its distributed architecture.
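As a small sketch of what working with Cassandra looks like from Java (this assumes the DataStax Java driver 3.x API and an invented keyspace and table), CQL reads much like SQL even though the underlying storage model is a distributed column family:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to one contact point; the driver discovers the rest of the ring.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                    + "(id int PRIMARY KEY, name text)");

            // Each row is effectively a set of key-value (column) pairs.
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'alice')");

            ResultSet rs = session.execute("SELECT id, name FROM demo.users");
            for (Row row : rs) {
                System.out.println(row.getInt("id") + " -> " + row.getString("name"));
            }
        }
    }
}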
Spark
Apache Spark is an open-source cluster computing framework.
Its primary purpose is to handle the real-time generated data.
Spark was built on top of Hadoop MapReduce. It was optimized to
run in memory, whereas alternative approaches like Hadoop's MapReduce write
data to and from computer hard drives. So, Spark processes data much quicker
than other alternatives.
Spark is a scalable data analytics platform that supports in-memory
computation.
Spark was initiated by Matei Zaharia at UC Berkeley's AMPLab in
2009. It was open sourced in 2010 under a BSD license.
In 2013, the project was donated to the Apache Software Foundation. In 2014,
Spark emerged as a Top-Level Apache Project.
Usage of Spark
Data integration: The data generated by systems are not consistent enough
to combine for analysis. To fetch consistent data from systems we can use
processes like Extract, transform, and load (ETL). Spark is used to reduce
the cost and time required for this ETL process.
Stream processing: It is always difficult to handle real-time generated
data such as log files. Spark is capable of operating on streams of data and
can reject potentially fraudulent operations.
Machine learning: Machine learning approaches become more feasible and
increasingly accurate due to the enhancement in the volume of data. As Spark is
capable of storing data in memory and can run repeated queries quickly, it
makes it easy to work on machine learning algorithms.
Interactive analytics: Spark is able to generate responses rapidly. So,
instead of running pre-defined queries, we can handle the data interactively.
Spark Components
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and
monitor multiple applications.
Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with
storage systems and memory management.
Spark SQL
o It provides support for structured data.
o It allows the data to be queried via SQL (Structured Query Language) as well
as the Apache Hive variant of SQL called HQL (Hive Query Language); a short
sketch appears after this list of components.
o It supports JDBC and ODBC connections that establish a relation between
Java objects and existing databases, data warehouses and business
intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-
tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming
analytics.
o It accepts data in mini-batches and performs RDD transformations on that
data.
o Its design ensures that the applications written for streaming data can be
reused to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time
example of a data stream.
MLlib
o The MLlib is a Machine Learning library that contains various machine
learning algorithms.
o These include correlations and hypothesis testing, classification and
regression, clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache
Mahout.
GraphX
o The GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached to
each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like
subgraph, joinVertices, and aggregateMessages.
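As referenced under Spark SQL above, the following is a minimal Java sketch of the components working together; the file path, column names and master URL are invented for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // Spark Core provides the execution engine; SparkSession is the entry point.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-sketch")
                .master("local[*]")          // run locally for the example
                .getOrCreate();

        // Spark SQL can load structured sources such as JSON directly.
        Dataset<Row> orders = spark.read().json("hdfs:///data/orders.json");

        // Register the data as a temporary view and query it with plain SQL.
        orders.createOrReplaceTempView("orders");
        Dataset<Row> totals = spark.sql(
                "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer");

        totals.show();
        spark.stop();
    }
}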
Kafka
Apache Kafka is an open-source, distributed stream-processing software
framework. Through Kafka, data streams can be submitted to Apache Spark
for doing the computations, so the two together form a pipeline.
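A minimal sketch of publishing events into such a pipeline with the Kafka Java producer API is shown below; the broker address, topic name and record contents are invented.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // hypothetical broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Each record is appended to a topic; a stream processor such as
        // Spark Streaming can consume the topic at the other end of the pipeline.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
            producer.flush();
        }
    }
}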
Impala
Apache Impala is a SQL query engine for data stored in a Hadoop cluster. Its first
component is the Impala daemon (impalad), which runs on the cluster nodes and
executes the queries; its other components are described below.
2. Impala StateStore: The StateStore checks the health of all
the Impala daemons in the cluster and continuously communicates its findings to
each of the Impala daemons.
The StateStore is not always critical to the normal operation of an
Impala cluster. If the StateStore is not running, the Impala daemons will
keep running and distributing work among themselves as usual.
3. Impala Catalog Service: The catalog service is another Impala component that
propagates metadata changes from Impala SQL commands to all Impala daemons
in the cluster.
Best Practices
Data Formatting: Ensure data is properly formatted and compatible with the
destination systems.
Data Compression: Use compression (e.g., Snappy, Gzip) to optimize data
transfer and storage.
Data Security: Implement appropriate security measures, such as encryption
and access controls, to protect data during transfer.
Error Handling: Set up monitoring and error handling mechanisms to handle
any issues that arise during data transfer.
Sources
Flume sources are responsible for reading data from external clients or from
other Flume sinks. A unit of data in Flume is defined as an event, which is
essentially a payload and optional set of metadata. A Flume source sends these
events to one or more Flume channels, which deal with storage and buffering.
Flume has an extensive set of built-in sources, including HTTP, JMS, and RPC,
and you encountered one of them just a few moments ago.
The exec source allows you to execute a Unix command, and each line
emitted in standard output is captured as an event (standard error is ignored by
default).
To conclude our brief dive into Flume sources, let’s summarize some of the
interesting abilities that they provide:
Transactional semantics, which allow data to be reliably moved with at-
least-once semantics. Not all data sources support this.
The exec source used in this technique is an example of a source that doesn’t
provide any data-reliability guarantees.
Interceptors, which provide the ability to modify or drop events. They are
useful for annotating events with host, time, and unique identifiers, which
are useful for deduplication.
Selectors, which allow events to be fanned out or multiplexed in various
ways. You can fan out events by replicating them to multiple channels, or
you can route them to different channels based on event headers.
Channels
Flume channels provide data storage facilities inside an agent. Sources add
events to a channel, and sinks remove events from a channel. Channels provide
durability properties inside Flume, and you pick a channel based on which level
of durability and throughput you need for your application.
There are three channels bundled with Flume:
Memory channels store events in an in-memory queue. This is very useful
for high-throughput data flows, but they have no durability guarantees,
meaning that if an agent goes down, you’ll lose data.
File channels persist events to disk. The implementation uses an efficient
write-ahead log and has strong durability properties.
JDBC channels store events in a database. This provides the strongest
durability and recoverability properties, but at a cost to performance.
Sinks
A Flume sink drains events out of one or more Flume channels and will
either forward these events to another Flume source (in a multihop flow), or handle
the events in a sink-specific manner. There are a number of sinks built into Flume,
including HDFS, HBase, Solr, and Elasticsearch.
One area that Flume isn’t really optimized for is working with binary data. It
can support moving binary data, but it loads the entire binary event into memory,
so moving files that are gigabytes in size or larger won’t work.
Databases
Most organizations’ crucial data exists across a number of OLTP databases.
The data stored in these databases contains information about users, products, and
a host of other useful items. If you wanted to analyze this data, the traditional way
to do so would be to periodically copy that data into an OLAP data warehouse.
Hadoop has emerged to play two roles in this space: as a replacement to data
warehouses, and as a bridge between structured and unstructured data and data
warehouses. Figure shows the first role, where Hadoop is used as a large-
scale joining and aggregation mechanism prior to exporting the data to an OLAP
system (a commonly used platform for business intelligence applications).
Figure: Using Hadoop for data ingress, joining, and egress to OLAP
Sqoop has the notion of connectors, which contain the specialized logic needed to
read and write to external systems. Sqoop comes with two classes of
connectors: common connectors for regular reads and writes, and fast
connectors that use database-proprietary batch mechanisms for efficient
imports. Figure below shows these two classes of connectors and the databases that
they support.
Figure : Sqoop connectors used to read and write to external systems
Incremental imports
You can also perform incremental imports. Sqoop supports two
types: append works for numerical data that’s incrementing over time, such as
auto-increment keys; last modified works on timestamped data.
Importing to Hive
The final step in this technique is to use Sqoop to import your data into a Hive
table. The only difference between an HDFS import and a Hive import is that the
Hive import has a postprocessing step where the Hive table is created and loaded,
as shown in figure below.
HBase
Our final foray into moving data into Hadoop involves taking a look at
HBase. HBase is a real-time, distributed, data storage system that’s often either
colocated on the same hardware that serves as your Hadoop cluster or is in close
proximity to a Hadoop cluster. Being able to work with HBase data directly in
MapReduce, or to push it into HDFS, is one of the huge advantages when picking
HBase as a solution.
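For a feel of the real-time access pattern described above, here is a small sketch using the HBase Java client API; the table name, column family and values are invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emails"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("msg-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("subject"),
                    Bytes.toBytes("complaint about delivery"));
            table.put(put);

            // Random read by row key, which is where HBase shines.
            Result result = table.get(new Get(Bytes.toBytes("msg-0001")));
            byte[] subject = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("subject"));
            System.out.println(Bytes.toString(subject));
        }
    }
}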
Data partitioning
Earlier you saw the location where Camus imported the Avro data sitting in Kafka.
Let’s take a closer look at the HDFS path structure, shown in figure below, and see
what you can do to determine the location.
Figure: Dissecting the Camus output path for exported data in HDFS
The date/time part of the path is determined by the timestamp extracted from the
CamusWrapper. You’ll recall from our earlier discussion that you can extract
timestamps from your records in Kafka in your MessageDecoder and supply them
to the CamusWrapper, which will allow your data to be partitioned by dates that
are meaningful to you, as opposed to the default, which is simply the time at which
the Kafka record is read in MapReduce.
Camus supports a pluggable partitioner, which allows you to control the part of the
path shown in figure below
Figure: The Camus partitioner path
Databases
Databases are usually the target of Hadoop data egress in one of two
circumstances: either when you move data back into production databases to be
used by production systems, or when you move data into OLAP databases to
perform business intelligence and analytics functions.
In this section we’ll use Apache Sqoop to export data from Hadoop to a MySQL
database. Sqoop is a tool that simplifies database imports and exports. Sqoop is
covered in detail in technique above.
We’ll walk through the process of exporting data from HDFS to Sqoop.
We’ll also cover methods for using the regular connector, as well as how to
perform bulk imports using the fast connector.
Direct exports
You used the fast connector in the import technique, which was an optimization
that used the mysqldump utility. Sqoop exports also support using the fast
connector, which uses the mysqlimport tool. As with mysqldump, all of the nodes
in your cluster need to have mysqlimport installed and available in the path of the
user that’s used to run MapReduce tasks. And as with the import, the
NoSQL
MapReduce is a powerful and efficient way to bulk-load data into external
systems. So far we’ve covered how Sqoop can be used to load relational data, and
now we’ll look at NoSQL systems, and specifically HBase.
Apache HBase is a distributed key/value, column-oriented data store. Earlier
in this chapter we looked at how to import data from HBase into HDFS, as well as
how to use HBase as a data source for a MapReduce job.
The most efficient way to load data into HBase is via its built-in bulk-
loading mechanism, which is described in detail on the HBase wiki page titled
“Bulk Loading” at https://hbase.apache.org/book/arch.bulk.load.html. But this
approach bypasses the write-ahead log (WAL), which means that the data being
loaded isn’t replicated to slave HBase nodes.
HBase also comes with an org.apache.hadoop.hbase.mapreduce.Import
class, which will load HBase tables from HDFS, similar to how the equivalent
import worked earlier in this chapter. But you must have your data in SequenceFile
form, which has disadvantages, including no support for versioning.
You can also use the TableOutputFormat class in your own MapReduce job
to export data to HBase, but this approach is slower than the bulk-loading tool.
We’ve now concluded our examination of Hadoop egress tools. We covered how
you can use the HDFS File Slurper to move data out to a filesystem and how to use
Sqoop for idempotent writes to relational databases, and we wrapped up with a
look at ways to move Hadoop data into HBase.
Understanding inputs and outputs of MapReduce
Big Data Processing employs the Map Reduce Programming Model.
A job means a MapReduce program. Each job consists of several smaller
units, called MapReduce tasks.
A software execution framework in MapReduce programming defines the
parallel tasks.
The Hadoop MapReduce implementation uses Java framework.
The model defines two important tasks, namely Map and Reduce.
Map takes input data set as pieces of data and maps them on various nodes
for parallel processing.
The reduce task takes the output from the maps as an input and
combines those data pieces into a smaller set of data. A reduce task always runs
after the map task(s).
Many real-world situations are expressible using this model.
Inner join: It is the default natural join. It refers to two tables that join based
on common columns mentioned using the ON clause. Inner Join returns all rows
from both tables if the columns match.
Node refers to a place for storing data, data block or read or write
computations.
Data center in a DB refers to a collection of related nodes. Many nodes
form a data center or rack.
Cluster refers to a collection of many nodes.
Keyspace means a namespace to group multiple column families, especially
one per partition.
Indexing to a field means providing reference to a field in a document of
collections that support the queries and operations using that index. A DB creates
an index on the _id field of every collection.
The input data is in the form of an HDFS file. The output of the task also
gets stored in the HDFS.
The compute nodes and the storage nodes are the same at a cluster, that is,
the MapReduce program and the HDFS are running on the same set of nodes.
Fig: MapReduce process on client submitting a job
Figure above shows MapReduce process when a client submits a job, and
the succeeding actions by the JobTracker and TaskTracker.
JobTracker and TaskTracker: MapReduce consists of a single master
JobTracker and one slave TaskTracker per cluster node.
The master is responsible for scheduling the component tasks in a job onto
the slaves, monitoring them and re-executing the failed tasks.
The slaves execute the tasks as directed by the master.
The data for a MapReduce task is initially at input files. The input files
typically reside in the HDFS. The files may be line-based log files, binary format
file, multiline input records, or something else entirely different.
The MapReduce framework operates entirely on (key, value) pairs. The
framework views the input to the task as a set of (key, value) pairs and produces a
set of (key, value) pairs as the output of the task, possibly of different types.
Map-Tasks
Map task means a task that implements map(), which runs user
application code for each key-value pair (k1, v1). Key k1 is a set of keys. Key k1
maps to a group of data values. Values v1 are a large string which is read from the
input file(s).
The output of map() would be zero (when no values are found) or more
intermediate key-value pairs (k2, v2). The value v2 is the information for the
transformation operation at the reduce task using aggregation or other reducing
functions.
Reduce task refers to a task which takes the output v2 from the map as an
input and combines those data pieces into a smaller set of data using a combiner.
The reduce task is always performed after the map task.
The Mapper performs a function on individual values in a dataset
irrespective of the data size of the input. That means that the Mapper works on a
single data set.
Fig: Logical view of functioning of map()
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value
pairs as input and output. Data should be first converted into key-value pairs before
it is passed to the Mapper, as the Mapper only understands key-value pairs of data.
Grouping by Key
When a map task completes, the Shuffle process aggregates (combines) all the
Mapper outputs by grouping the key-values of the Mapper output, and the values v2
are appended into a list of values. A "Group By" operation on the intermediate keys
creates this list of v2 values.
Partitioning
The Partitioner does the partitioning. The partitions are the semi-mappers in
MapReduce.
Partitioner is an optional class. MapReduce driver class can specify the
Partitioner.
A partition processes the output of map tasks before submitting it to Reducer
tasks.
Partitioner function executes on each machine that performs a map task.
Partitioner is an optimization in MapReduce that allows local partitioning
before reduce-task phase.
The same codes implement the Partitioner, Combiner as well as reduce()
functions.
Functions for the Partitioner and for sorting execute on the mapping node.
The main function of a Partitioner is to route all the map output records with the
same key to the same reduce task.
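As a sketch, a custom Partitioner can be written as below; this hypothetical KeyHashPartitioner routes records by the hash of their key, which is essentially what Hadoop's default HashPartitioner already does.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every map output record with the same key to the same reduce task.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}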
Combiners
Combiners are semi-reducers in MapReduce. Combiner is an optional class.
MapReduce driver class can specify the combiner.
The combiner() executes on each machine that performs a map task.
Combiners optimize the MapReduce task by locally aggregating data before the
shuffle and sort phase.
The same codes implement both the combiner and the reduce functions,
combiner() on map node and reducer() on reducer node.
The main function of a Combiner is to consolidate the map output records
with the same key.
The output (key-value collection) of the combiner transfers over the network
to the Reducer task as input.
This limits the volume of data transfer between map and reduce tasks, and
thus reduces the cost of data transfer across the network. Combiners use grouping
by key for carrying out this function.
Reduce Tasks
Java API at Hadoop includes Reducer class. An abstract function, reduce() is
in the Reducer.
Any specific Reducer implementation should be subclass of this class and
override the abstract reduce().
A reduce task implements reduce(), which takes the Mapper output (after it is
shuffled and sorted), grouped by key-values (k2, v2), and applies it in
parallel to each group.
Intermediate pairs are at input of each Reducer in order after sorting using
the key.
Reduce function iterates over the list of values associated with a key and
produces outputs such as aggregations and statistics.
The reduce function sends zero or another set of key-value pairs (k3,
v3) as output to the final output file. Reduce: {(k2, list(v2))} -> list(k3, v3)
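The map, combine, partition and reduce steps are wired together in a driver class. The sketch below assumes the WordCount Mapper and Reducer shown earlier in this unit and the hypothetical KeyHashPartitioner above; as the text describes, the same Reducer class is reused as the Combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);   // map(): (k1, v1) -> (k2, v2)
        job.setCombinerClass(WordCount.SumReducer.class);      // local aggregation on map nodes
        job.setPartitionerClass(KeyHashPartitioner.class);     // same key -> same reducer
        job.setReducerClass(WordCount.SumReducer.class);       // reduce(): (k2, list(v2)) -> (k3, v3)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}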
The failure of JobTracker (if only one master node) can bring the entire
process down; Master handles other failures, and the MapReduce job eventually
completes.
When the Master compute-node at which the JobTracker is executing fails,
then the entire MapReduce job must restart. Following points summarize the
coping mechanism with distinct Node Failures:
Map TaskTracker failure:
- Map tasks completed or in-progress at TaskTracker, are reset to idle on
failure
- Reduce TaskTracker gets a notice when a task is rescheduled on another
TaskTracker
Reduce TaskTracker failure:
- Only in-progress tasks are reset to idle
Master JobTracker failure:
- Map-Reduce task aborts and notifies the client (in case of one master
node).
Data serialization
Data serialization is the process of converting data objects present in
complex data structures into a byte stream for storage, transfer and distribution
purposes on physical devices
Once the serialized data is transmitted, the reverse process of creating objects
from the byte sequence is called deserialization.
HOW DOES IT WORK?
Computer data is generally organized in data structures such as arrays,
tables, trees, classes. When data structures need to be stored or transmitted
to another location, such as across a network, they are serialized.
Serialization becomes complex for nested data structures and object
references.
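A minimal Java sketch of this round trip is shown below; the Employee class is invented for illustration.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationExample {

    // A simple data object; implementing Serializable makes it serializable.
    static class Employee implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int id;
        Employee(String name, int id) { this.name = name; this.id = id; }
    }

    public static void main(String[] args) throws Exception {
        // Serialization: object -> byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Employee("alice", 7));
        }

        // Deserialization: byte stream -> object.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Employee e = (Employee) in.readObject();
            System.out.println(e.name + " / " + e.id);
        }
    }
}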
What are Data Serialization Storage Formats?
Storage formats are a way to define how information is stored in the file.
Most of the time, this information can be assumed from the extension of the data.
Both structured and unstructured data can be stored on HADOOP enabled systems.
Common HDFS file formats are:
• Plain text storage
• Sequence files
• RC files
• AVRO
• Parquet
• AVRO
o Apache Avro is a language-neutral data serialization system,
developed by Doug Cutting, the father of Hadoop.
o It is also called a schema-based serialization technique.
FEATURES
o Avro uses JSON format to declare the data structures.
o Presently, it supports languages such as Java, C, C++, C#, Python, and
Ruby.
o Avro creates a binary structured format that is both
compressible and splittable. Hence it can be efficiently used as the
input to Hadoop MapReduce jobs.
o Avro provides rich data structures.
o Avro schemas defined in JSON, facilitate implementation in the
languages that already have JSON libraries.
o Avro creates a self-describing file named Avro Data File, in which it
stores data along with its schema in the metadata section.
o Avro is also used in Remote Procedure Calls (RPCs).
o Thrift and Protocol Buffers are the libraries most comparable with
Avro. Avro differs from these frameworks in the following ways:
o Avro supports both dynamic and static types as per the
requirement. Protocol Buffers and Thrift use Interface
Definition Languages (IDLs) to specify schemas and their
types. These IDLs are used to generate code for serialization
and deserialization.
o Avro is built in the Hadoop ecosystem. Thrift and Protocol
Buffers are not built in Hadoop ecosystem.
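The "schema plus data in one file" idea can be sketched with the Avro Java API as follows; the schema, field names and file name are invented for illustration.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDataFileExample {
    public static void main(String[] args) throws Exception {
        // The data definition (schema) is declared in JSON.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
          + "{\"name\":\"sensor\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"double\"}]}");

        File file = new File("readings.avro");   // hypothetical file name

        // Write: the schema goes into the file header, the records into binary blocks.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            GenericRecord r = new GenericData.Record(schema);
            r.put("sensor", "s1");
            r.put("value", 21.5);
            writer.append(r);
        }

        // Read: no generated code is needed, because the schema travels with the data.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}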
APPLICATIONS OF DATA SERIALIZATION
• Serialization allows a program to save the state of an object and recreate it
when needed.
• Persisting data onto files – happens mostly in language-neutral formats
such as CSV or XML. However, most languages allow objects to be
serialized directly into binary using APIs
• Storing data into Databases – when program objects are converted into
byte streams and then stored into DBs, such as in Java JDBC.
• Transferring data through the network – such as web applications and
mobile apps passing on objects from client to server and vice versa.
• Sharing data in a Distributed Object Model – When programs written in
different languages need to share object data over a distributed network.
However, SOAP, REST and other web services have replaced these
applications now.
POTENTIAL RISK DUE TO SERIALIZATION
• It may allow a malicious party with access to the serialization byte stream
to read private data, create objects with illegal or dangerous state, or obtain
references to the private fields of deserialized objects. Workarounds are
tedious, not guaranteed.
• Open formats too have their security issues.
• XML might be tampered using external entities like macros or unverified
schema files.
• JSON data is vulnerable to attack when directly passed to a JavaScript
engine due to features like JSONP requests.
PERFORMANCE CHARACTERISTICS
• Speed – Binary formats are faster than textual formats. A late entrant,
protobuf reports the best times. JSON is preferable due to readability and
being schema-less.
• Data size – This refers to the physical space in bytes post serialization. For
small data, compressed JSON data occupies more space compared to binary
formats like protobuf. Generally, binary formats always occupy less space.
• Usability – Human readable formats like JSON are naturally preferred over
binary formats. For editing data, YAML is good. Schema definition is easy
in protobuf, with in-built tools.
• Compatibility-Extensibility – JSON is a closed format. XML is average with
schema versioning. Backward compatibility (extending schemas) is best
handled by protobuf.