BDA Unit-3

Unit III

Introduction to Hadoop: Big Data – Apache Hadoop & Hadoop Eco System – Moving
Data in and out of Hadoop – Understanding inputs and outputs of MapReduce – Data
Serialization.

Introduction to File System


The first storage mechanism used by computers to store data was punch cards.
Each group of related punch cards (punch cards related to the same program) used to be
stored in a file, and files were stored in file cabinets.
This is very similar to how papers are archived today in government institutions
that still rely on paperwork on a daily basis. This is where the term “File System”
(FS) comes from. Computer systems have evolved, but the concept remains the same.

Figure: Storage Mechanism

What is File System?


Instead of storing information on punch cards, we can now store information / data
in a digital format on a digital storage device such as a hard disk, flash drive, etc.
Related data are still categorized as files;
related groups of files are stored in folders.
Each file has a name, an extension and an icon. The file name gives an indication of
the content it has, while the file extension indicates the type of information stored in that file.
For example, the EXE extension refers to executable files, TXT refers to text files, etc.
The file management system is used by the operating system to access the files and
folders stored in a computer or on any external storage device.
What is Distributed File System?
In Big Data, we often deal with multiple clusters (computers). One of the main
advantages of Big Data is that it goes beyond the capabilities of one single, extremely
powerful server with very high computing power.
The whole idea of Big Data is to distribute data across multiple clusters and to make
use of the computing power of each cluster (node) to process information.
A distributed file system is a system that can handle access to data across multiple
clusters (nodes).

Advantages of Distributed File System


 Scalability: You can scale up your infrastructure by adding more racks or
clusters to your system.
 Fault Tolerance: Data replication helps to achieve fault tolerance in the
following cases:
• A cluster is down
• A rack is down
• A rack is disconnected from the network
• Job failure or restart
 High Concurrency: utilize the compute power of each node to handle multiple
client requests in parallel at the same time.
• DFS allows multiple users to access or store data.
• It allows data to be shared remotely.
• It improves the availability of files, access time and network
efficiency.
• It improves the capacity to change the size of the data and also
improves the ability to exchange data.
• A distributed file system provides transparency of data even if a
server or disk fails.

Introduction to Hadoop:
 Hadoop is an open-source project of the Apache foundation.
 Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models.
 It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
 Rather than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be
prone to failures.
 In simple words, Hadoop is a software library that allows its users to process
large datasets across distributed clusters of computers, thereby enabling
them to gather, store and analyze huge sets of data.
 Hadoop is now a core part of the computing infrastructure for companies such as
Yahoo, Facebook, LinkedIn and Twitter.
Features of Hadoop
Hadoop is an open source framework that is meant for storage and processing of big data
in a distributed manner. It is the best solution for handling big data challenges.
Some important features of Hadoop are –
 Open Source – Hadoop is an open source framework which means it is available
free of cost. Also, the users are allowed to change the source code as per their
requirements.
 Distributed Processing – Hadoop supports distributed processing of data i.e.
faster processing. The data in Hadoop HDFS is stored in a distributed manner and
MapReduce is responsible for the parallel processing of data.
 Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each
block (default) at different nodes.
 Reliability – Hadoop stores data on the cluster in a reliable manner that is
independent of machine. So, the data stored in Hadoop environment is not affected
by the failure of the machine.
 Scalability – Hadoop is compatible with commodity hardware, and we can easily
add or remove nodes (hardware) from the cluster as needed.
 High Availability – The data stored in Hadoop is available to access even after the
hardware failure. In case of hardware failure, the data can be accessed from
another node.
 Scale-Out Architecture - Add servers to increase capacity
 Flexible Access – Multiple and open frameworks for serialization and file system
mounts
 Load Balancing - Place data intelligently for maximum efficiency and utilization
 Tunable Replication - Multiple copies of each file provide data protection and
computational performance
 Security - POSIX-based file permissions for users and groups with optional LDAP
integration
The Core Components Of Hadoop Are –

1. HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of
Hadoop. The large data files running on a cluster of commodity hardware are stored in
HDFS. It can store data in a reliable manner even when hardware fails.
The key aspects of HDFS are:
a. Storage component
b. Distributes data across several nodes
c. Natively redundant.
2. Map Reduce: MapReduce is the Hadoop layer that is responsible for data processing.
Applications are written in MapReduce to process unstructured and structured data stored in HDFS.
It is responsible for the parallel processing of high volumes of data by dividing the data into
independent tasks. The processing is done in two phases, Map and Reduce.

The Map is the first phase of processing that specifies complex logic code and the
Reduce is the second phase of processing that specifies light-weight operations.
The key aspects of Map Reduce are:
a. Computational frame work
b. Splits a task across multiple nodes
c. Processes data in parallel
Key Advantages of Hadoop
 Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
 Cost-Effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost / terabyte of storage and processing.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
continue to operate even in the presence of hardware failures.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide
variety of data processing tasks.

 Fast: Processing is extremely fast in Hadoop compared to other conventional
systems owing to the “move code to data” paradigm.
Hadoop Versions:
As the data being stored and processed grows in complexity, so does Hadoop,
with developers bringing out various versions to address issues (bug fixes) and
simplify complex data processes. Updates are rolled out readily because
Hadoop development follows the trunk (base code) – branch (fix) model.
Hadoop has two major versions:
 Hadoop 1.x (Version 1)
 Hadoop 2 (Version 2)

1. Hadoop 1.x
Below are the components of Hadoop 1.x:
1. The Hadoop Common module is a jar file which acts as the base API on top of
which all the other components work.
2. Version 1, being the first to come into existence, is rock solid and receives no
new updates.
3. It has a limitation on scaling, with a maximum of 4000 nodes per cluster.
4. The functionality is limited by the slot concept, i.e., slots are capable of
running either a map task or a reduce task.
5. The next component is the Hadoop Distributed File System, commonly known as
HDFS, which plays the role of a distributed storage system designed to cater to
large data, with a block size of 64 Megabytes (64 MB) supporting the architecture. It is
further divided into two components:
 Name Node, which stores metadata about the Data Nodes and is
placed on the Master Node. It contains details about the slave
nodes, indexing and their respective locations, along with
timestamps for timelining.
 Data Nodes, which store the data related to the applications in use
and are placed on the Slave Nodes.
6. Hadoop 1 uses the MapReduce (MR) data processing model. It is not capable of
supporting other, non-MR tools.
MR has two components:
 The Job Tracker assigns (or reassigns, in case a task fails or shuts
down) MapReduce tasks to an application called the Task Tracker,
which is located in the node clusters. It additionally maintains a log
about the status of each Task Tracker.
 The Task Tracker is responsible for executing the work which has
been allocated by the Job Tracker and for sending status reports of
those tasks back to the Job Tracker.
7. The network of the cluster is formed by organizing the master node and slave
nodes. The cluster is further divided into racks which contain a set of commodity
computers or nodes.
8. Whenever a large storage operation for a big data set is received by the Hadoop
system, the data is divided into decipherable and organized blocks that are distributed
across different nodes.

2. Hadoop Version 2
Version 2 of Hadoop was released to provide improvements over the shortcomings
users faced with version 1. Let’s throw some light on the improvements that the new
version provides:
 HDFS Federation, which provides horizontal scalability for the NameNode.
Moreover, while the NameNode was previously a single point of failure, it is now
available at multiple points. The Hadoop stack has also been extended to include
components such as Hive and Pig, making the platform better equipped to handle
failures pertaining to the NameNode.
 YARN (Yet Another Resource Negotiator) has been improved with the
new ability to process data at larger scales, i.e. terabytes and petabytes, and to make
HDFS available to applications that are not MapReduce based, such as MPI and
Giraph.
 Version 2.7.x – released on 31st May 2018: The update focused on providing
two major functionalities, application management and a global resource
manager, thereby improving its overall utility and versatility and
increasing scalability up to 10000 nodes per cluster.
 Version 2.8.x – released in September 2018: The update’s improvements
include the Capacity Scheduler, which is designed to provide multi-tenancy
support for processing data over Hadoop, and it has been made
accessible to Windows users so that there is an increase in the rate of adoption of
the software across the industry for dealing with problems related to big data.

Version 3
Below is the latest running Hadoop updated version:
 Version 3.1.x – released on 21 October 2019: This update enables Hadoop to
be utilized as a platform serving a big chunk of data analytics functions and
utilities, performed over event processing alongside real-time operations, to
give better results.
 It has improved feature work on the container concept, which enables
Hadoop to perform generic tasks which were earlier not possible with version 1.
 The latest version, 3.2.1, released on 22nd September 2019, addresses issues of
non-functionality (in terms of support) of data nodes for multi-tenancy, the limitation
to MapReduce-only processing, and the biggest problem: the need for an
alternate data storage required for real-time processing and graphical
analysis.
 With the ever-increasing avalanche of data, and Big Data Analytics for
business alone standing at an estimated 169 billion dollars (USD) with predicted growth to
274 billion dollars by 2022, the market seems to be growing rapidly.
 This all the more calls for a system that is integrable in its functioning for the
abundant data which is growing day by day. Hadoop is a great solution that works
to store, process and access this heterogeneous set of data, which can be
unstructured or structured, in an organized manner.
 With the feature of constant updates, which act as tools to rectify the bugs that
developers face while using Hadoop, and with improved versions that increase the scope
of application and improve the dimension and flexibility of using Hadoop, the
chances increase of it being the next biggest tool for all functions related to big data
processing and analytics.

Hadoop EcoSystem
Apache Hadoop is an open source framework intended to make interaction with big
data easier.
However, for those who are not acquainted with this technology, one question
arises: what is big data?
Big data is a term given to data sets which can’t be processed in an
efficient manner with the help of traditional methodologies such as RDBMS. Hadoop
has made its place in the industries and companies that need to work on large data
sets which are sensitive and need efficient handling.

Being a framework, Hadoop is made up of several modules that are supported by a
large ecosystem of technologies. The Hadoop Ecosystem is a framework of various types of
complex and evolving tools and components.
Hadoop Ecosystem can be defined as a comprehensive collection of tools and
technologies that can be effectively implemented and deployed to provide Big Data
solutions in a cost-effective manner.
MapReduce and HDFS are two core components of the Hadoop ecosystem that
provide a great starting point to manage Big Data; however, they are not sufficient to deal
with all the Big Data challenges.
The Hadoop Ecosystem is neither a programming language nor a service; it is a
platform or framework which solves big data problems. You can consider it as a suite
which encompasses a number of services (ingesting, storing, analyzing and maintaining)
inside it.

Hadoop ecosystem and its major components

Sqoop

Sqoop is an application in the Hadoop ecosystem.

Sqoop is short for SQL-to-Hadoop: data held in an SQL database is pulled into the
Hadoop system, hence the name Sqoop. It is the application for efficiently
transferring bulk data between Apache Hadoop and SQL data stores.
Apache HBase

HBase is a key component of the Hadoop stack, and its design caters to
applications that require really fast random access to significant data sets.
HBase is a column-oriented, distributed database management system,
which is based on a key-value store.
The design of HBase is based on Google's original BigTable, and it can
hold extremely large data sets for storage and writing purposes. It is
based on a dynamic data model and is not a relational DBMS.

PIG
Pig is a scripting language on top of Hadoop MapReduce.
Instead of going through the complication of writing a complex MapReduce application
program, a simpler scripting view is provided by a language called Pig Latin,
which is useful for data analysis expressed as a data flow.
It is based on a data-flow model and was originally developed at
Yahoo in 2006.

Apache Hive

The next application is Hive, which provides SQL-like queries. Using SQL-like
queries that are translated into MapReduce jobs, Hive performs storage and
analysis in a much easier manner.
Hive originated and was developed at Facebook.
Apache Oozie
Apache Oozie is a scheduler system to run and manage Hadoop
jobs in a distributed environment.
It allows multiple complex jobs to be combined and run in a sequential
order to achieve a bigger task.
Within a sequence of tasks, two or more jobs can also be
programmed to run parallel to each other.
One of the main advantages of Oozie is that it is tightly integrated
with the Hadoop stack, supporting various Hadoop jobs like Hive, Pig and
Sqoop as well as system-specific jobs like Java and Shell.
Oozie detects completion of tasks through callback and polling.
When Oozie starts a task, it provides a unique callback HTTP
URL to the task, and notifies that URL when it is complete.

If the task fails to invoke the callback URL, Oozie can poll the task
for completion.

The following three types of jobs are common in Oozie −

 Oozie Workflow Jobs − These are represented as Directed
Acyclic Graphs (DAGs) to specify a sequence of actions to be
executed.
 Oozie Coordinator Jobs − These consist of workflow jobs
triggered by time and data availability.
 Oozie Bundle − These can be referred to as a package of
multiple coordinator and workflow jobs.

ZooKeeper
Another coordination service is called ZooKeeper. It provides a
centralized service for maintaining configuration and naming, and it
provides distributed synchronization and group services.
It originated and was developed at Yahoo.

Apache ZooKeeper is an open source distributed coordination service that
helps to manage a large set of hosts.

Apache ZooKeeper is a service used by a cluster (group of nodes)
to coordinate between themselves and maintain shared data with
robust synchronization techniques.

ZooKeeper is itself a distributed application providing services for
writing distributed applications.

The common services provided by ZooKeeper are as follows −

 Naming service − Identifying the nodes in a cluster by
name. It is similar to DNS, but for nodes.
 Configuration management − Latest and up-to-date
configuration information of the system for a joining node.
 Cluster management − Joining / leaving of a node in a
cluster and node status in real time.
 Leader election − Electing a node as leader for coordination
purposes.
 Locking and synchronization service − Locking the data
while modifying it. This mechanism helps in automatic fail
recovery while connecting other distributed applications like
Apache HBase.
 Highly reliable data registry − Availability of data even
when one or a few nodes are down.
How does ZooKeeper work?
Hadoop ZooKeeper is a distributed application that uses a simple client-
server architecture, with clients acting as service-using nodes and servers as
service-providing nodes.
The ZooKeeper ensemble is the collective name for a group of server nodes.

A ZooKeeper client is connected to at least one ZooKeeper server at any
one time. Because a master node is dynamically selected by the ensemble in
consensus, an ensemble of ZooKeeper servers is usually an odd number, ensuring a
majority vote.

If the master node fails, a new master is quickly selected to replace the
failed master. In addition to the master and slaves, ZooKeeper also has
observers.

Observers were brought in because scaling was a problem: the
performance of writes is impacted by the addition of slaves, since
voting is an expensive procedure. Observers are therefore slaves that
perform similar tasks to other slaves but do not participate in voting.

Apache Flume
Another application is called Flume, which is a distributed, reliable and
available service for efficiently collecting, aggregating and moving large
amounts of log data into the HDFS system; hence, it is used
for data ingestion.

Apache Flume is a tool/service/data ingestion mechanism for
collecting, aggregating and transporting large amounts of streaming
data, such as log files and events, from various sources to a
centralized data store.

Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from various
web servers to HDFS.

Apache Impala
Impala is a query engine that runs on top of Apache Hadoop. Impala brings
scalable parallel database technology to Hadoop and allows users to
submit low-latency queries within the system.
Apache Spark
Apache Spark is a fast, general-purpose engine for large-scale data
processing.
Spark is a scalable data analytics platform that supports in-memory
computation, which enhances its performance considerably.
Spark GraphX is another open source Apache component, built over
core Spark, for the computation of large-scale graphs.
Parallel computation of a graph is done using GraphX, which
extends Spark RDDs by introducing a new graph abstraction.
Hadoop Ecosystem for Big Data Computation
Difference between Hadoop version 1.0 and version 2.0
1. In Hadoop version 2.0, YARN is added, which is not present in Hadoop version 1.0.
YARN (Yet Another Resource Negotiator) is a resource manager for Hadoop; it is
also simply called the 'Resource Manager'.
2. Similarly, applications such as Hive and Pig, which simplify the use of MapReduce,
run on top of MapReduce.
3. In Hadoop version 1.0, all applications ran over MapReduce. Now there is a
choice: MapReduce or non-MapReduce applications can run with the help of
YARN and HDFS. So, with Hadoop 2.0 it is now possible to be much more flexible.
More than a hundred projects are available in the Hadoop ecosystem.

Giraph:
Giraph is a graph processing tool, used by Facebook to
analyse the social network's graph; this became much simpler once it was moved off
MapReduce.
It uses YARN and HDFS and is a non-MapReduce application for
computing the large graphs of the social network.
So, Giraph is a tool which now runs over YARN and HDFS and is used for big
graph computations, which we will see later in this part of the course.
Giraph, Storm, Spark and Flink do not use MapReduce directly; they run over YARN
and HDFS.

Storm, Spark and Flink

Fast-data, streaming-data applications can be built using Storm,
Spark or Flink. They rely on in-memory computation, which is faster than
regular computation.
So, stream processing, or real-time streaming applications, are
done using Storm, Spark and Flink over YARN and HDFS.

NoSQL
Most of this big data is stored in the form of key-value pairs, in what are also
known as NoSQL data stores.
NoSQL data stores are supported by databases like Cassandra,
MongoDB and HBase.
Traditional SQL can be effectively used to handle large amounts of structured
data. But in big data, most of the information is in an unstructured form, so
NoSQL is required to handle that information.
A NoSQL database stores unstructured data as well; it is not forced to
follow a particular fixed schema structure, and the schema keeps changing dynamically.
So, each row can have its own set of column values.
NoSQL gives better performance in storing massive amounts of data
compared to a SQL structure.
A NoSQL database is primarily a key-value store. It is also called a 'Column Family' store:
column-wise, the data is stored in the form of key-value pairs.

Moving Data In and Out of Hadoop


 Some simple techniques for data movement use the command line and Java;
more advanced techniques use NFS and DistCp.
 Ingress and egress refer to data movement into and out of a system, respectively.
 Moving data into Hadoop
 The first step in working with data in Hadoop is to make it available to Hadoop.
There are two primary methods that can be used to move data into Hadoop: writing
external data at the HDFS level (a data push), or reading external data at the
MapReduce level (more like a pull). Reading data in MapReduce has advantages
in the ease with which the operation can be parallelized and made fault tolerant.
A Java example of HDFS-level movement follows the list below.
Moving Data into Hadoop
1. HDFS Command Line Interface (CLI)
 You can use the hdfs dfs commands to move files into HDFS (Hadoop Distributed
File System).
 Example: hdfs dfs -put localfile /user/hadoop/
2. Apache Sqoop
 Useful for importing data from relational databases (like MySQL, Oracle, etc.) into
HDFS.
 Example: sqoop import --connect jdbc:mysql://localhost/db --table mytable --target-dir /user/hadoop/mytable
3. Apache Flume
 Designed for streaming data ingestion into Hadoop.
 You configure sources (e.g., logs from servers), channels (e.g., memory or files), and
sinks (e.g., HDFS) to handle data transfer.
4. Apache Kafka
 For real-time data streaming into Hadoop.
 Data can be ingested into Hadoop through Kafka consumers that read from Kafka
topics and write to HDFS.
5. Hadoop Streaming
 Allows you to use any executable or script as a mapper or reducer for processing
data.
 Example command: hadoop jar /path/to/hadoop-streaming -input /input
-output /output -mapper /path/to/mapper -reducer /path/to/reducer
6. Hadoop MapReduce
 Used for processing data in HDFS, but also supports data loading through custom
input formats and combiners.
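
Since the command line and Java were both mentioned above as simple data-movement techniques, here is a minimal Java sketch of the HDFS-level equivalents of hdfs dfs -put and hdfs dfs -get, using the standard org.apache.hadoop.fs.FileSystem API. The NameNode URI and the file paths are illustrative placeholders, not values from this course.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: programmatic equivalents of "hdfs dfs -put" and "hdfs dfs -get".
public class HdfsCopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed NameNode address
    FileSystem fs = FileSystem.get(conf);

    // Push a local file into HDFS (data ingress).
    fs.copyFromLocalFile(new Path("/tmp/localfile"), new Path("/user/hadoop/localfile"));

    // Pull a file from HDFS back to the local file system (data egress).
    fs.copyToLocalFile(new Path("/user/hadoop/datafile"), new Path("/tmp/datafile"));

    fs.close();
  }
}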

Moving Data out of Hadoop


1. HDFS CLI
 You can use hdfs dfs -get to move files from HDFS to a local file system.
 Example: hdfs dfs -get /user/hadoop/datafile /localpath/
2. Apache Sqoop
 For exporting data from HDFS back to relational databases.
 Example: sqoop export --connect jdbc:mysql://localhost/db --table mytable --export-dir /user/hadoop/mytable
3. Apache Flume
 Can be configured to move data from HDFS to another system or storage.
 Example configuration might involve setting up a Flume sink to write to another
destination.
4. Apache Kafka
 Similar to ingesting data, Kafka can be used to export data by consuming from
HDFS and producing to other systems.
5. Custom MapReduce Jobs
 You can write custom MapReduce jobs to process data in HDFS and write results
to external systems.
6. Hadoop DistCp
 A tool for copying large amounts of data between HDFS clusters or between HDFS
and other storage systems.
 Example: hadoop distcp hdfs://source-cluster/user/hadoop/data hdfs://destination-cluster/user/hadoop/data
7. Hive/Impala
 If using Hive or Impala, you can query data and export the results to external
systems using INSERT INTO ... SELECT statements or by using Hive's data export
capabilities.
8. Spark
 Apache Spark can also be used to process and move data between HDFS and
other storage systems or databases.

Understanding Inputs and Outputs of MapReduce


Your data might be XML files sitting behind a number of FTP servers, text log files sitting
on a central web server, or Lucene indexes in HDFS. How does MapReduce support
reading and writing to these different serialization structures across the various storage
mechanisms? You'll need to know the answer in order to support a specific serialization
format.
Data input:-
The two classes that support data input in MapReduce are InputFormat and
RecordReader. The InputFormat class is consulted to determine how the input data should be
partitioned for the map tasks, and the RecordReader performs the reading of data from
the inputs.

INPUT FORMAT:-
Every job in MapReduce must define its inputs according to contracts specified in the
InputFormat abstract class. InputFormat implementers must fulfill three contracts: first,
they describe type information for map input keys and values; next, they specify how the
input data should be partitioned; and finally, they indicate the RecordReader instance that
should read the data from the source.
RECORD READER:-
The RecordReader class is used by MapReduce in the map tasks to read data from an
input split and provide each record in the form of a key/value pair for use by mappers. A
task is commonly created for each input split, and each task has a single RecordReader
that’s responsible for reading the data for that input split.
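
As a concrete illustration of these contracts, the sketch below shows a hypothetical custom InputFormat written against the new org.apache.hadoop.mapreduce API. FileInputFormat already provides getSplits() (the partitioning contract), the generic parameters declare the map input key/value types, and createRecordReader() names the reader; delegating to Hadoop's LineRecordReader reproduces the usual byte-offset / line-of-text behaviour.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: keys are byte offsets, values are lines of text.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // Delegate record reading for each input split to the standard line reader.
    return new LineRecordReader();
  }
}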
DATA OUTPUT:-
MapReduce uses a similar process for supporting output data as it does for input
data. Two classes must exist: an OutputFormat and a RecordWriter. The OutputFormat
performs some basic validation of the data sink properties, and the RecordWriter writes
each reducer output to the data sink.
OUTPUT FORMAT:-
Much like the InputFormat class, the OutputFormat class defines the contracts that
implementers must fulfill, including checking the information related to the job output,
providing a RecordWriter, and specifying an output committer, which allows writes to be
staged and then made “permanent” upon task and/or job success.
RECORD WRITER:-
You'll use the RecordWriter to write the reducer outputs to the destination data sink. It's a
simple class.
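
To make the output side concrete as well, here is a hypothetical OutputFormat sketch in the same new API. FileOutputFormat supplies the output validation (checkOutputSpecs) and an output committer, so the subclass only has to provide the RecordWriter that writes each reducer output record to the data sink; the tab-separated line format shown here is just an example choice.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical OutputFormat: writes each <key, value> as "key<TAB>value" per line.
public class TabSeparatedOutputFormat extends FileOutputFormat<Text, IntWritable> {

  @Override
  public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    Path file = getDefaultWorkFile(context, ".txt");          // per-task output file
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    final FSDataOutputStream out = fs.create(file, false);

    return new RecordWriter<Text, IntWritable>() {
      @Override
      public void write(Text key, IntWritable value) throws IOException {
        out.writeBytes(key.toString() + "\t" + value.get() + "\n");
      }

      @Override
      public void close(TaskAttemptContext ctx) throws IOException {
        out.close();
      }
    };
  }
}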

Hadoop Map-Reduce Inputs and Outputs

 The Map/Reduce framework operates exclusively on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types.
 The key and value classes have to be serializable by the framework and hence
need to implement the Writable interface.
 Additionally, the key classes have to implement the WritableComparable interface
to facilitate sorting by the framework.
 The user needs to implement a Mapper class as well as a Reducer class.
 Optionally, the user can also write a Combiner class.
 (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
How to write a Hadoop Map class
Subclass from MapReduceBase and implement the Mapper interface (a sketch follows the list below).

 public class MyMapper extends MapReduceBase implements Mapper { ... }
 The Mapper interface provides a single method:
 public void map(K key, V val, OutputCollector output, Reporter reporter)
 WritableComparable key:
 Writable value:
 OutputCollector output: this has the collect method to output a <key, value> pair
 Reporter reporter: allows the application code to report progress and update its status.
 The Hadoop system divides the input data into logical "records" and then calls
map() once for each record.
 For text files, a record is one line of text.
 The key then is the byte offset and the value is a line from the text file.
 For other input types, it can be defined differently.
 The main method is responsible for setting output key and value types.
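
A minimal word-count mapper written against this older org.apache.hadoop.mapred API might look as follows; the class and field names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Word-count mapper: key = byte offset of the line, value = the line of text.
public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // emit <word, 1> for each token
    }
  }
}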
How to write a Hadoop Reduce class

 Subclass from MapReduceBase and implement the Reducer interface (a sketch follows below).
 public class MyReducer extends MapReduceBase implements Reducer {...}
 The Reducer interface provides a single method:
public void reduce(K key, Iterator values, OutputCollector output, Reporter reporter)

 WritableComparable key:
 Iterator values:
 OutputCollector output:
 Reporter reporter:
Given all the values for the key, the Reduce code typically iterates over all the
values and either concatenates the values together in some way to make a large
summary object, or combines and reduces the values in some way to yield a short
summary value.
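
The matching word-count reducer for the sketch above sums the counts for each key; again, the names are illustrative.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Word-count reducer: receives <word, [1, 1, ...]> and emits <word, total>.
public class MyReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();              // combine all counts for this word
    }
    output.collect(key, new IntWritable(sum)); // short summary value
  }
}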

Inputs and Outputs (Java Perspective)

 The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
 The key and the value classes should be serializable by the
framework and hence need to implement the Writable interface.
 Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework.
 Input and Output types of a MapReduce job –
 (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3>(Output).

 Phase: Input → Output
 Map: <k1, v1> → list(<k2, v2>)
 Reduce: <k2, list(v2)> → list(<k3, v3>)


 When moving data in and out of Hadoop using MapReduce, both input and
output are handled as "key-value" pairs, meaning the data is structured as sets
of keys associated with corresponding values. The MapReduce framework
processes these pairs through the Map and Reduce phases to perform operations
on large datasets distributed across the cluster; essentially, data is "pulled" into the
MapReduce job during processing rather than being explicitly pushed in at the
beginning.
Key points about MapReduce inputs and outputs:

 Input Format:
The input data is divided into "splits" which are chunks of data processed by
individual mappers, and the "InputFormat" class defines how these splits are read
and interpreted as key-value pairs.

 Map Phase:
Each mapper takes a set of key-value pairs from an input split, performs
operations on them, and generates a new set of key-value pairs as
intermediate output.

 Shuffle and Sort:
The intermediate key-value pairs are shuffled and sorted based on their keys,
ensuring all values with the same key are sent to the same reducer.

 Reduce Phase:
Each reducer receives a set of key-value pairs with the same key, performs
aggregations or other calculations on the values, and produces a final key-value
pair as the output.

How to move data in and out:

 Moving Data into Hadoop (Input):

o HDFS File System: Most commonly, data is loaded into the Hadoop
Distributed File System (HDFS) before being processed by a MapReduce
job.
o InputFormat Class: The specific "InputFormat" class is chosen based on the
data format (text, CSV, etc.) to define how the data is read from HDFS.
o Custom Input Formats: Users can create custom InputFormat classes to
handle specialized data sources or formats.
 Moving Data out of Hadoop (Output):
o Output Path: The MapReduce job specifies an output directory in HDFS
where the final processed key-value pairs will be written.
o RecordWriter: The "RecordWriter" class controls how the output key-value
pairs are written to the output file.
o Data Retrieval: Once the MapReduce job is complete, the processed data
can be retrieved from the HDFS output directory and transferred to other
systems if needed.
 Important Considerations:
Data Types:

Both keys and values in MapReduce need to be serializable objects that
implement the "Writable" interface.

Partitioning:

The "Partitioner" class is used to determine which reducer receives which set of
key-value pairs for efficient processing.

Job Configuration:

The MapReduce job configuration specifies parameters like the number of
mappers and reducers, input and output paths, and other options for
optimizing the processing.

Data Types In Hadoop

While programming for a distributed system, we cannot use standard data types. This is
because they do not know how to read from / write to the disk, i.e. they are not serializable.

 Serialization is the process of converting object data into byte stream data for
transmission over a network across different nodes in a cluster or for persistent
data storage.

 De-serialization is the reverse process of serialization and converts byte stream
data into object data for reading data from HDFS.
Why are Writables Introduced in Hadoop?

The Hadoop framework needs the Writable type of interface in order to perform the
following tasks:

 Implement serialization and transfer data between clusters and networks

 Store the deserialized data on the local disk of the system

This can be done simply by writing the keyword 'implements' and overriding the default
Writable methods.

Writable is a strong interface in Hadoop which, while serializing the data, reduces the
data size enormously, so that data can be exchanged easily within the networks.

It has separate read and write methods to read data from the network and write data to the local
disk respectively.

All data inside Hadoop should accept the Writable and Comparable interface properties.

Hadoop provides Writable interface based data types for serialization and de-serialization
of data stored in HDFS and used in MapReduce computations.

Serialization is not the only concern of the Writable interface; it also has to support compare
and sort operations in Hadoop.

Why use Hadoop Writable(s)?

 As we already know, data needs to be transmitted between different nodes in a
distributed computing environment.
 This requires serialization and deserialization of data to convert data in a
structured format to a byte stream and vice-versa.
 Hadoop therefore uses a simple and efficient serialization protocol to serialize data
between the map and reduce phases, and these are called Writable(s).
 Hadoop provides Writable wrappers for almost all Java primitive types and some
other types.
 However, we might sometimes need to create our own wrappers for custom
objects.

All the Writable wrapper classes have a get() and a set() method for retrieving and
storing the wrapped value.
Hadoop also provides another interface called WritableComparable.

The WritableComparable interface is a sub-interface of Hadoop's Writable and Java's
Comparable interfaces.

As we know, data flows from mappers to reducers in the form of (key, value) pairs. It is
important to note that any data type used for the key must implement the
WritableComparable interface (along with the Writable interface) so that keys of
this type can be compared with each other for sorting purposes, and any data type used for the value
must implement the Writable interface.

Data Types in Hadoop

Writable Classes

Primitive Writable Classes-These are Writable wrappers for Java primitive data types and
they hold a single primitive value.

Below is the list of primitive writable data types available in Hadoop:

Primitive Writable Classes:

1. BooleanWritable
2. ByteWritable
3. IntWritable
4. VIntWritable
5. FloatWritable
6. LongWritable
7. VLongWritable
8. DoubleWritable

Note that the serialized sizes of the above primitive Writable data types are the same as
the size of the actual Java data types (except VIntWritable and VLongWritable, which use variable-length encodings).

Array Writable Classes:

Hadoop provides two types of array Writable classes: one for single-dimensional and
another for two-dimensional arrays:

 ArrayWritable

 TwoDArrayWritable
The elements of these arrays must be other Writable objects like IntWritable or
FloatWritable only, not Java native data types like int or float.

Map Writable Classes:

Hadoop provides the following MapWritable data types which implement the java.util.Map
interface:

 AbstractMapWritable: this is the abstract or base class for other MapWritable
classes

 MapWritable: this is a general purpose map, mapping Writable keys to Writable
values

 SortedMapWritable: this is a specialization of the MapWritable class that also
implements the SortedMap interface

Other Writable Classes:

1. NullWritable: it is a special type of Writable representing a null value.
 No bytes are read or written when a data type is specified as NullWritable.
 So, in MapReduce, a key or a value can be declared as a NullWritable when
we don't need to use that field.
2. ObjectWritable: this is a general-purpose generic object wrapper which can store
any objects like Java primitives, String, Enum, Writable, null, or arrays.
3. Text: it can be used as the Writable equivalent of java.lang.String and its max size
is 2 GB. Unlike Java's String data type, Text is mutable in Hadoop.
4. BytesWritable: it is a wrapper for an array of binary data.
5. GenericWritable: it is similar to ObjectWritable but supports only a few types. The
user needs to subclass this GenericWritable class and specify the types
to support.

Programmers mainly specify two functions:


map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
 All values with the same key are sent to the same reducer.

Programmers may also specify partitions for keys:

partition (k', number of partitions) → partition for k'

 This is often a simple hash of the key, e.g. hash(k') mod n (a sketch follows this list).
 The aim is to divide up the key space for parallel reduce operations, i.e. partitioners
control which reducers process which keys.
 All values with the same key are sent to the same reducer.
 This ensures that the outputs of all the reducers will be disjoint/unique, i.e. there
will be no common/overlapping tuples.
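
As a sketch of hash(k') mod n in the newer MapReduce API (Hadoop's built-in HashPartitioner behaves the same way; the Text/IntWritable key and value types are just example choices):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// hash(k') mod n: assigns each intermediate key to one of the n reducers.
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the result is non-negative before taking mod n.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}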

combine (k’, v’) → <k’, v’>*

 Combiners are mini-reducers that run in-memory immediately after the map
phase.
 They reside on the same machines as the mappers.
 They are used for local aggregation of intermediate results before the tuples are
sent to the reducers.
 The aim of using combiners is to reduce network traffic and minimize the load on the
reducers.
 Note that the combiners have the exact same code as the reducers, so they
perform the exact same operation as the reducers, but locally.
 Another component, known as the reporter, allows us to report the status to
Hadoop.

The execution framework handles everything else, apart from the above operations,
including:

 Scheduling: assigns workers to map and reduce tasks
 Data distribution: moves processes to data
 Synchronization: gathers, sorts, and shuffles intermediate data
 Errors and faults: detects worker failures and restarts

We don't know:

 Where mappers and reducers run
 When a mapper or reducer begins or finishes
 Which input a particular mapper is processing
 Which intermediate key a particular reducer is processing

Hadoop uses the class definition to create objects.

So, while submitting a job to Hadoop, we must provide class definitions for the mapper,
reducer, combiner etc. to Hadoop so that Hadoop can create as many instances
(objects) of those classes at runtime as and when required.

Need for Combiners:

Combiners reside on the same machine as the mappers. So, data doesn't have to
be written to the disk while being moved between the mappers and the combiners.

Combiners run the same code that the reducers run, but they run this code locally
i.e. on the data that is generated by their corresponding mappers.

This is called local aggregation. This not only speeds up processing, but also
reduces the workload on the reducers, as the reducers will now receive locally
aggregated tuples.

A Simple Overview of a MapReduce Job

1. Configure the Job: Specify Input, Output, Mapper, Reducer and Combiner

2. Implement the Mapper: For example, to count the number of occurrences of words in a
line, tokenize the text and emit the words with a count of 1 i.e. <word, 1>

3. Implement the Reducer: Sum up counts for each word and write the result to HDFS

4. Run the Job

The entire workflow of a MapReduce job exclusively uses <k, v> pairs and can be
denoted as follows:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3>
(output)

1. Configure the Job

i. The Job class encapsulates information about the job and handles the execution of
the job.
ii. A job is packaged within a jar file and this file is distributed among nodes by the
Hadoop framework. We need to specify our jar file to Hadoop using the
Job.setJarByClass() function.
iii. Specify the Input: Input is specified by implementing InputFormat, for example
TextInputFormat. The input can be a file/directory/file pattern. InputFormat is
responsible for creating the splits (InputSplits) and a RecordReader. It also controls
the input types of the (key, value) pairs. Mappers receive input one line at a time.
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
iv. Specify the Output: Output is specified by implementing OutputFormat, for example
TextOutputFormat. It basically defines the specification for the output returned by
the MapReduce job. We must define the output folder. However, the MapReduce
program will not work if the output folder exists in advance. By default, a
MapReduce job uses a single reducer. However, if we use multiple reducers, there
will be multiple output files, one per reducer, and we must manually concatenate
these files to get the expected output of the MapReduce job.
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
v. We must also set the output types for the (key, value) pairs for both mappers and
reducers:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
vi. Usually we use the same output types for both mappers and reducers, but if we
need to set different types, we can use setMapOutputKeyClass(),
setMapOutputValueClass(), etc.

2. Implement the Mapper

i. The Mapper class has 4 parameters: input key, input value, output key, output
value
ii. It makes use of Hadoop's IO framework for input/output operations
iii. We must define the map() function that takes some input (key, value) pair and
outputs another (key, value) pair, depending on the problem at hand

3. Implement the Reducer

i. The Reducer class also has 4 parameters: input key, input value, output key,
output value
ii. The (key, value) pairs generated by the mappers are grouped by key, sorted and
sent to the reducers; each reducer receives all values corresponding to a certain
key
iii. We must implement the reduce() function that takes some input (key, <set of
values>) pair and outputs (key, value) pairs
iv. The output types of the map function must match the input type of the reduce
function
We can choose to use combiners immediately after the mappers, provided that their
output type matches that of the mappers.

This will reduce the number of tuples that will be sent to the reducers, due to local
aggregation by the combiners.

4. Run the Job

In this stage, we run the MapReduce job and the output(s) get(s) saved in the
output folder.
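
The four steps above can be tied together in a single driver class. Below is a minimal word-count sketch using the new org.apache.hadoop.mapreduce API; the class names WordCount, TokenizerMapper and IntSumReducer are illustrative, and the combiner simply reuses the reducer code, as discussed earlier.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: tokenize each line and emit <word, 1>.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word and emit <word, total>.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);                 // step 1: configure the job
    job.setMapperClass(TokenizerMapper.class);          // step 2: the mapper
    job.setCombinerClass(IntSumReducer.class);          // combiner reuses the reducer code
    job.setReducerClass(IntSumReducer.class);           // step 3: the reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextInputFormat.addInputPath(job, new Path(args[0]));
    TextOutputFormat.setOutputPath(job, new Path(args[1]));   // must not exist in advance
    System.exit(job.waitForCompletion(true) ? 0 : 1);   // step 4: run the job
  }
}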

Data Serialization
In distributed systems like Hadoop, the concept of serialization is used especially for
interprocess communication and persistent storage.

Interprocess Communication

1. The RPC technique is used to establish interprocess communication
between the nodes connected in a network.
2. RPC uses internal serialization to convert the message into binary format before
sending it to the remote node via the network. Further, the remote system
deserializes the binary stream into the original message at the other end.
Persistent Storage
Persistent storage is digital storage which does not lose its data if power is lost.
Examples include magnetic disks and hard disk drives.

Data Serialization:
Serialization is the process of turning structured objects into a byte stream for
transmission over a network or for writing to persistent storage.
Deserialization is the process of turning a byte stream back into a series of structured
objects.
In Hadoop, interprocess communication between nodes in the system is implemented
using remote procedure calls(RPCs).
In general, it is desirable that an RPC serialization format is:
 Compact: A compact format makes the best use of network bandwidth
 Fast: Interprocess communication forms the backbone for a distributed
system, so it is essential that there is as little performance overhead as
possible for the serialization and deserialization process.
 Extensible: Protocols change over time to meet new requirements, so it
should be straightforward to evolve the protocol in a controlled manner for
clients and servers.
 Interoperable: For some systems, it is desirable to be able to support
clients that are written in different languages to the server.
It consists of
a) The Writable Interface
b) Writable Classes
c) Implementing a Custom Writable
d) Serialization Frameworks
e) Avro
a) The Writable Interface: The Writable interface defines two methods: one for
writing an object's state to a DataOutput binary stream, and one for reading its state from a
DataInput binary stream.

c) Implementing a Custom Writable: Hadoop comes with a useful set of Writable
implementations that serve most purposes; however, on occasion, you may need to write
your own custom implementation. With a custom Writable, you have full
control over the binary representation and the sort order.
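
As a sketch of what such a custom implementation can look like, the hypothetical IntPairWritable below serializes two ints and defines a sort order, so it could be used as a MapReduce key; the class name and field layout are illustrative, not a standard Hadoop type.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type: a pair of ints with full control over
// the binary representation (write/readFields) and the sort order (compareTo).
public class IntPairWritable implements WritableComparable<IntPairWritable> {
  private int first;
  private int second;

  public IntPairWritable() { }                 // no-arg constructor required by Hadoop

  public IntPairWritable(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {      // serialize
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {   // deserialize
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPairWritable o) {                   // sort order for keys
    int cmp = Integer.compare(first, o.first);
    return cmp != 0 ? cmp : Integer.compare(second, o.second);
  }

  @Override
  public int hashCode() { return 31 * first + second; }       // used by the default partitioner

  @Override
  public boolean equals(Object obj) {
    if (!(obj instanceof IntPairWritable)) return false;
    IntPairWritable p = (IntPairWritable) obj;
    return first == p.first && second == p.second;
  }

  @Override
  public String toString() { return first + "\t" + second; }
}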
d) Serialization Frameworks: Hadoop has an API for pluggable serialization frameworks.
A serialization framework is represented by an implementation of Serialization (in the
org.apache.hadoop.io. serializer package). WritableSerialization, for example, is the
implementation of Serialization for Writable types.
A Serialization defines a mapping from types to Serializer instances (for turning an object
into a byte stream) and Deserializer instances (for turning a byte stream into an object).

e) Avro: Apache Avro is a language-neutral data serialization system. The project was
created by Doug Cutting (the creator of Hadoop) to address the major downside of
Hadoop Writables: lack of language portability.
The Avro specification precisely defines the binary format that all implementations must
support.

What Are Common Languages for Data Serialization?


A number of popular object-oriented programming languages provide either native support
for serialization or have libraries that add non-native capabilities for serialization to their
feature set. Java, .NET, C++, Node.js, Python, and Go, for example, all either have native
serialization support or integrate with libraries for serialization.

Data formats such as JSON and XML are often used as the format for storing serialized
data.
Custom binary formats are also used, which tend to be more space-efficient due to less
markup/tagging in the serialization.

What Is Data Serialization in Big Data?


Big data systems often include technologies/data that are described as “schemaless.”
This means that the managed data in these systems are not structured in a strict format, as
defined by a schema. Serialization provides several benefits in this type of environment:
• Structure: By inserting some schema or criteria for a data structure through
serialization on read, we can avoid reading data that misses mandatory fields, is
incorrectly classified, or lacks some other quality control requirement.
• Portability: Big data comes from a variety of systems and may be written in a variety
of languages. Serialization can provide the necessary uniformity to transfer such data
to other enterprise systems or applications.
• Versioning: Big data is constantly changing. Serialization allows us to apply version
numbers to objects for lifecycle management.

Data serialization is the process of converting data objects present in complex data
structures into a byte stream for storage, transfer and distribution purposes on physical
devices.
Computer systems may vary in their hardware architecture, OS and addressing mechanisms.
Internal binary representations of data also vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform- and
language-neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence, called deserialization, is
carried out. Reconstructed objects are clones of the original object.
The choice of data serialization format for an application depends on factors such as data
complexity, need for human readability, speed and storage space constraints. XML, JSON,
BSON, YAML, MessagePack, and protobuf are some commonly used data serialization
formats.
Computer data is generally organized in data structures such as arrays, tables, trees and
classes. When data structures need to be stored or transmitted to another location, such
as across a network, they are serialized.

What are the applications of Data Serialization?


Serialization allows a program to save the state of an object and recreate it when needed.
Its common uses are:
Persisting data onto files – happens mostly in language-neutral formats such as CSV or
XML. However, most languages allow objects to be serialized directly into binary using APIs
such as the Serializable interface in Java, the fstream class in C++, or the Pickle module in
Python.
Storing data into databases – when program objects are converted into byte streams
and then stored into DBs, such as in Java JDBC.
Transferring data through the network – such as web applications and mobile apps
passing objects from client to server and vice versa.
Remote Method Invocation (RMI) – by passing serialized objects as parameters to
functions running on a remote machine as if invoked on a local machine. This data can be
transmitted across domains through firewalls.
Sharing data in a Distributed Object Model – when programs written in different
languages (running on diverse platforms) need to share object data over a distributed
network using frameworks such as COM and CORBA. However, SOAP, REST and other
web services have replaced these applications now.

Could you list some text-based Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
XML (Extensible Markup Language) - Nested textual format. Human-readable and
editable. Schema-based validation. Used in metadata applications, web services data transfer
and web publishing.
CSV (Comma-Separated Values) - Table structure with delimiters. Human-readable
textual data. Opens as a spreadsheet or plaintext. Used as a plaintext database.
JSON (JavaScript Object Notation) - Short-syntax textual format with limited data types.
Human-readable. Derived from JavaScript data formats. No need for a separate parser (like
XML) since they map to JavaScript objects. Can be fetched with an XMLHttpRequest call.
No direct support for the DATE data type. All data is dynamically processed. Popular format
for web API parameter passing. Mobile apps use this extensively for user interaction and
database services.
YAML (YAML Ain't Markup Language) - Lightweight text format. Human-readable.
Supports comments and thus is easily editable. Superset of JSON. Supports complex data
types. Maps easily to native data structures. Used in configuration settings, document
headers, and apps needing MySQL-style self-references in relational data.

Could you list some binary Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
BSON (Binary JSON) - Created and internally used by MongoDB. Binary format, not
human-readable. Deals with attribute-value pairs like JSON. Includes datetime, bytearray
and other data types not present in JSON. Used in web apps with rich media data types
such as live video. Primary use is storage, not network communication.
MessagePack - Designed for data to be transparently converted from/to JSON.
Compressed binary format, not human-readable. Supports static typing. Supports RPC.
Better JSON compatibility than BSON. Primary use is network communication, not
storage. Used in apps with distributed file systems.
protobuf (Protocol Buffers) - Created by Google. Binary message format that allows
programmers to specify a schema for the data. Also includes a set of rules and tools to
define and exchange these messages. Transparent data compression. Used in
multi-platform applications due to easy interoperability between languages. Universal RPC
framework. Used in performance-critical distributed applications.

Data serialization in Hadoop refers to the process of converting data into a format that can
be efficiently stored, transmitted, and reconstructed.

Serialization is critical in Hadoop because it is responsible for transferring data between
various components, such as between mappers and reducers in MapReduce, or between
HDFS and the Hadoop ecosystem.

Efficient serialization mechanisms ensure the smooth functioning and performance of
Hadoop by minimizing the overhead of data movement.

What is Serialization?

Serialization is the process of translating the state of data structures or objects into binary or
textual form, to transport the data over a network or to store it on some persistent storage.
Once the data is transported over the network or retrieved from persistent storage, it
needs to be deserialized again.
Serialization is termed marshalling and deserialization is termed unmarshalling.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and
deserialization.

The following table describes the methods −

S.No. Methods and Description
1 void readFields(DataInput in) : This method is used to deserialize the fields of the given object.
2 void write(DataOutput out) : This method is used to serialize the fields of the given object.

Writable Comparable Interface

It is the combination of the Writable and Comparable interfaces. This interface
inherits the Writable interface of Hadoop as well as the Comparable interface of Java. Therefore
it provides methods for data serialization, deserialization, and comparison.

S.No. Methods and Description
1 int compareTo(class obj) : This method compares the current object with the given object obj.

In addition to these classes, Hadoop supports a number of wrapper classes that
implement the WritableComparable interface.

Each class wraps a Java primitive type.

The class hierarchy of Hadoop serialization is given below –


These classes are useful to serialize various types of data in Hadoop.
For instance, let us consider the IntWritable class and see how this class is used to
serialize and deserialize data in Hadoop.

IntWritable Class

This class implements the Writable, Comparable, and WritableComparable interfaces. It
wraps an integer data type. This class provides the methods used to serialize and
deserialize integer data.

Constructors
S.No. Summary
1 IntWritable()
2 IntWritable(int value)

Methods
S.No. Summary
1 int get() : Using this method you can get the integer value present in the current object.
2 void readFields(DataInput in) : This method is used to deserialize the data in the given DataInput object.
3 void set(int value) : This method is used to set the value of the current IntWritable object.
4 void write(DataOutput out) : This method is used to serialize the data in the current object to the
given DataOutput object.
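
To see these methods in action, here is a small stand-alone round trip (the value 163 and the class name are arbitrary): write() serializes an IntWritable into a byte stream and readFields() rebuilds it.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

// Serialize an IntWritable to bytes, then deserialize it back.
public class IntWritableDemo {
  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(163);

    // Serialization: write() pushes the wrapped int into a DataOutput stream.
    ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
    original.write(new DataOutputStream(bytesOut));
    byte[] serialized = bytesOut.toByteArray();   // 4 bytes, same size as a Java int

    // Deserialization: readFields() repopulates an empty IntWritable.
    IntWritable restored = new IntWritable();
    restored.readFields(new DataInputStream(new ByteArrayInputStream(serialized)));

    System.out.println(restored.get());           // prints 163
  }
}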
