BDA Unit-3
Introduction to Hadoop: Big Data – Apache Hadoop & Hadoop Eco System – Moving
Data in and out of Hadoop – Understanding inputs and outputs of MapReduce – Data
Serialization.
Introduction to Hadoop:
Hadoop is an open-source project of the Apache foundation.
Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be
prone to failures.
In simple words, Hadoop is a software library that allows its users to process
large datasets across distributed clusters of computers, thereby enabling
them to gather, store and analyze huge sets of data.
Hadoop is now a core part of the computing infrastructure for companies such as
Yahoo, Facebook, LinkedIn, Twitter, etc.
Features of Hadoop
Hadoop is an open source framework that is meant for storage and processing of big data
in a distributed manner. It is the best solution for handling big data challenges.
Some important features of Hadoop are –
Open Source – Hadoop is an open source framework which means it is available
free of cost. Also, the users are allowed to change the source code as per their
requirements.
Distributed Processing – Hadoop supports distributed processing of data i.e.
faster processing. The data in Hadoop HDFS is stored in a distributed manner and
MapReduce is responsible for the parallel processing of data.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each
block (default) at different nodes.
Reliability – Hadoop stores data on the cluster in a reliable manner that is
independent of machine. So, the data stored in Hadoop environment is not affected
by the failure of the machine.
Scalability – Hadoop is compatible with commodity hardware, and we can easily
add or remove nodes from the cluster as needed.
High Availability – The data stored in Hadoop is available to access even after the
hardware failure. In case of hardware failure, the data can be accessed from
another node.
Scale-Out Architecture - Add servers to increase capacity
Flexible Access – Multiple and open frameworks for serialization and file system
mounts
Load Balancing - Place data intelligently for maximum efficiency and utilization
Tunable Replication - Multiple copies of each file provide data protection and
computational performance
Security - POSIX-based file permissions for users and groups with optional LDAP
integration
The Core Components Of Hadoop Are –
1. HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of
Hadoop. The large data files running on a cluster of commodity hardware are stored in
HDFS. It can store data in a reliable manner even when hardware fails.
The key aspects of HDFS are:
a. Storage component
b. Distributes data across several nodes
c. Natively redundant.
2. Map Reduce: MapReduce is the Hadoop layer that is responsible for data processing.
Applications written using the MapReduce model process unstructured and structured data stored in HDFS.
It is responsible for the parallel processing of high volume of data by dividing data into
independent tasks. The processing is done in two phases Map and Reduce.
The Map is the first phase of processing that specifies complex logic code and the
Reduce is the second phase of processing that specifies light-weight operations.
The key aspects of Map Reduce are:
a. Computational framework
b. Splits a task across multiple nodes
c. Processes data in parallel
Key Advantages of Hadoop
Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
Cost-Effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost / terabyte of storage and processing.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
continue to operate even in the presence of hardware failures.
Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide
variety of data processing tasks.
1. Hadoop 1.x
Below are the Components of Hadoop 1.x
1. The Hadoop Common Module is a jar file which acts as the base API on top of
which all the other components work.
2. Version 1, being the first to come into existence, is rock solid but no longer receives
new updates.
3. It has a limitation on the scaling nodes with just a maximum of 4000 nodes for
each cluster
4. The functionality is limited by the slot concept, i.e., each slot is capable of
running either a map task or a reduce task.
5. The next component is the Hadoop Distributed File System, commonly known as
HDFS, which plays the role of a distributed storage system designed to cater to
large data, with a block size of 64 MegaBytes (64MB) supporting the architecture. It is
further divided into two components:
Name Node, which stores metadata about the Data Nodes and is placed
with the Master Node. It contains details about the slave nodes, indexing
and their respective locations, along with timestamps for timelining.
Data Nodes, used for storage of data related to the applications in use,
placed in the Slave Nodes.
6. Hadoop 1 uses the MapReduce (MR) data processing model. It is not capable of
supporting other non-MR tools.
MR has two components:
The Job Tracker assigns (and, in case a task fails or a node shuts down,
reassigns) MapReduce tasks to an application called the Task Tracker, which
is located on the cluster nodes. It additionally maintains a log
of the status of each Task Tracker.
The Task Tracker is responsible for executing the tasks which have
been allocated to it by the Job Tracker and for sending status reports of
those tasks back to the Job Tracker.
7. The network of the cluster is formed by organizing the master node and slave
nodes. The cluster is further divided into racks, each of which contains a set of commodity
computers or nodes.
8. Whenever a large storage operation for a big data set is received by the Hadoop
system, the data is divided into decipherable and organized blocks that are distributed onto
different nodes.
2. Hadoop Version 2
Version 2 of Hadoop was released to provide improvements over the shortcomings which
users faced with version 1. Let’s throw some light on the improvements that the new
version provides:
HDFS Federation, which provides horizontal scalability for the NameNode.
Moreover, whereas the NameNode was previously a single point of failure,
it is now available at multiple points. The Hadoop stack has also been
extended to include components such as Hive and Pig, which make the platform well
equipped to handle failures pertaining to the NameNode.
YARN (Yet Another Resource Negotiator) adds the ability to process data at
terabyte and petabyte scale in HDFS while using applications which are not
MapReduce based. These include applications like MPI and Giraph.
Version 2.7.x – Released on 31st May 2018: The update focused on providing
two major functionalities: per-application resource management (an ApplicationMaster
per application) and a global ResourceManager, thereby improving its overall utility and
versatility, and increasing scalability up to 10,000 nodes per cluster.
Version 2.8.x – Released in September 2018: The update provided
improvements including the Capacity Scheduler, which is designed to provide multi-
tenancy support for processing data over Hadoop, and it has been made
accessible for Windows users so that there is an increase in the rate of adoption of
the software across the industry for dealing with problems related to big data.
Version 3
Below are the latest updates in the running Hadoop Version 3.
Version 3.1.x – released on 21 October 2019: This update enables Hadoop to
be utilized as a platform to serve a big chunk of data analytics functions and
utilities, performed over event processing alongside real-time operations, to
give better results.
It has improved work on the container concept, which enables
Hadoop to perform generic (non-MapReduce) tasks which were earlier not possible with version 1.
The latest version, 3.2.1, released on 22nd September 2019, addresses the
lack of support for multi-tenancy on data nodes, the limitation of supporting
only MapReduce processing, and the biggest problem, the need for an
alternative data store, which is needed for real-time processing and graph
analysis.
With the ever-increasing avalanche of data, and with Big Data Analytics for
business standing at an estimated 169 billion dollars (USD) and predicted to grow to
274 billion dollars by 2022, the market seems to be growing rapidly.
This all the more calls for a system that can integrate the abundant data
which is growing day by day. Hadoop is a great solution that works to store, process
and access this heterogeneous set of data, which can be unstructured or structured, in an
organized manner.
With constant updates that act as tools to rectify the bugs that
developers face while using Hadoop, and with improved versions that increase the scope
of application and improve the dimension and flexibility of using Hadoop, the chances
increase of it being the next biggest tool for all functions related to big data
processing and analytics.
Hadoop EcoSystem
Apache Hadoop is an open source framework intended to make interaction with big
data easier.
However, for those who are not acquainted with this technology, one question
arises: what is big data?
Big data is a term given to data sets which can’t be processed in an
efficient manner with the help of traditional methodology such as an RDBMS. Hadoop
has made its place in the industries and companies that need to work on large data
sets which are sensitive and need efficient handling.
Sqoop
Sqoop’s full form is basically SQL-to-Hadoop: the SQL database is pulled into the
Hadoop system, hence the name Sqoop, that is, SQL on Hadoop. It is an
application for efficiently transporting bulk data between Apache Hadoop
and SQL data stores.
Apache HBase – a distributed, column-oriented NoSQL store that runs on top of HDFS (see the NoSQL discussion below).
PIG
It is a scripting language on top of Hadoop MapReduce.
Instead of going into the complication of writing a complex MapReduce application
program, a rather simple view is provided by this scripting language, which is
called Pig Latin, and it is useful for data analysis expressed as data flow.
So, it is based on a data-flow model, and it was originally developed at
Yahoo in 2006.
Apache Hive
The next application is Hive, which provides SQL-style queries. Using SQL queries
that are translated into MapReduce, Hive performs storage and
analysis in a much easier manner.
Hive originated and was developed at Facebook.
Apache Oozie
Apache Oozie is a scheduler system to run and manage Hadoop
jobs in a distributed environment.
It allows multiple complex jobs to be combined and run in sequential
order to achieve a bigger task.
Within a sequence of tasks, two or more jobs can also be
programmed to run parallel to each other.
One of the main advantages of Oozie is that it is tightly integrated
with Hadoop stack supporting various Hadoop jobs like Hive, Pig,
Sqoop as well as system-specific jobs like Java and Shell.
Oozie detects completion of tasks through callback and polling.
When Oozie starts a task, it provides a unique callback HTTP
URL to the task, and notifies that URL when it is complete.
If the task fails to invoke the callback URL, Oozie can poll the task
for completion.
ZooKeeper
Another coordination service is called ZooKeeper. It provides a centralized
service for maintaining configuration and naming, and it provides distributed
synchronization and group services.
It originated and was developed at Yahoo.
If the master node fails, a new master is quickly selected and replaces the
failed master. In addition to the master and slaves, Zookeeper also has
watchers.
Apache Flume
Finally, another application is called Flume, which is a distributed, reliable and
available service for efficiently collecting, aggregating and moving large
amounts of log data into the HDFS system; hence, it is used
for data ingestion.
Giraph:
Giraph is a graph processing tool, which is being used by Facebook to
analyse the social network’s graph; this was made much simpler than doing it with
MapReduce.
It uses YARN and HDFS, and it is a non-MapReduce application for
computing the large graphs of a social network.
So, Giraph is a tool which now runs over YARN and HDFS and is used for big
graph computations, which we will see later in this part of the course.
Giraph, Storm, Spark and Flink do not use MapReduce directly; they run over YARN
and HDFS.
NoSQL
Most of this big data is stored in the form of key-value pairs, and such stores are also
known as NoSQL data stores.
These NoSQL data stores are supported by databases like Cassandra,
MongoDB and HBase.
Traditional SQL can be effectively used to handle large amounts of structured
data. But in big data, most of the information is in unstructured form, so
basically NoSQL is required to handle that information.
A NoSQL database stores unstructured data as well; it is not enforced to
follow a particular fixed schema structure, and the schema keeps on changing dynamically.
So, each row can have its own set of column values.
NoSQL gives better performance in storing massive amounts of data
compared to the SQL structure.
A NoSQL database is primarily a key-value store. It is also called a 'Column Family'
store: column-wise, the data is stored in the form of key-value pairs.
INPUT FORMAT:-
Every job in MapReduce must define its inputs according to contracts specified in the
InputFormat abstract class. InputFormat implementers must fulfill three contracts: first,
they describe type information for map input keys and values; next, they specify how the
input data should be partitioned; and finally, they indicate the RecordReader instance that
should read the data from the data source.
RECORD READER:-
The RecordReader class is used by MapReduce in the map tasks to read data from an
input split and provide each record in the form of a key/value pair for use by mappers. A
task is commonly created for each input split, and each task has a single RecordReader
that’s responsible for reading the data for that input split.
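As a sketch of how these two pieces fit together (a minimal illustration assuming the newer org.apache.hadoop.mapreduce API; the class name MyTextInputFormat is hypothetical), an InputFormat can reuse Hadoop’s built-in LineRecordReader:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// A minimal InputFormat: FileInputFormat already knows how to partition the
// input files into splits; we only supply the RecordReader that turns each
// split into <byte offset, line of text> key/value pairs for the mappers.
public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new LineRecordReader();  // reads one line per record
    }
}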
DATA OUTPUT:-
MapReduce uses a similar process for supporting output data as it does for input
data. Two classes must exist: an OutputFormat and a RecordWriter. The OutputFormat
performs some basic validation of the data sink properties, and the RecordWriter writes
each reducer output to the data sink.
OUTPUT FORMAT:-
Much like the InputFormat class, the OutputFormat class defines the contracts that
implementers must fulfill, including checking the information related to the job output,
providing a RecordWriter, and specifying an output committer, which allows writes to be
staged and then made “permanent” upon task and/or job success.
RECORD WRITER:-
You’ll use the RecordWriter to write the reducer outputs to the destination data sink. It’s a
simple class.
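A hedged sketch of a custom OutputFormat/RecordWriter pair (the class names TabSeparatedOutputFormat and TabSeparatedRecordWriter are hypothetical; it assumes the org.apache.hadoop.mapreduce API) that writes each reducer output record as a key<TAB>value line:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TabSeparatedOutputFormat extends FileOutputFormat<Text, IntWritable> {

    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // FileOutputFormat validates the output directory and provides the
        // per-task work file; we only describe how records are written to it.
        Path file = getDefaultWorkFile(context, ".txt");
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(file, false);
        return new TabSeparatedRecordWriter(out);
    }

    // The RecordWriter writes each reducer output record to the data sink.
    private static class TabSeparatedRecordWriter extends RecordWriter<Text, IntWritable> {
        private final DataOutputStream out;

        TabSeparatedRecordWriter(DataOutputStream out) { this.out = out; }

        @Override
        public void write(Text key, IntWritable value) throws IOException {
            out.writeBytes(key.toString() + "\t" + value.get() + "\n");
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            out.close();
        }
    }
}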
The Map/Reduce framework operates exclusively on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value>
pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence
need to implement the Writable interface.
Additionally, the key classes have to implement the WritableComparable interface
to facilitate sorting by the framework.
The user needs to implement a Mapper class as well as a Reducer class.
Optionally, the user can also write a Combiner class.
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
How to write a Hadoop Map class: subclass MapReduceBase and implement
the Mapper interface (of the older org.apache.hadoop.mapred API). A Reduce class is
written the same way, by subclassing MapReduceBase and implementing the Reducer
interface, whose reduce() method receives the following parameters:
WritableComparable key
Iterator values
OutputCollector output
Reporter reporter
Given all the values for the key, the Reduce code typically iterates over all the
values and either concatenates the values together in some way to make a large
summary object, or combines and reduces the values in some way to yield a short
summary value.
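The parameters above belong to the older org.apache.hadoop.mapred API. A minimal sketch of such a Reducer (the class name SumReducer is hypothetical) that iterates over all the values for a key and emits one summary value:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Iterates over all values for a key and emits one short summary value.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}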
The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and the value classes should be serializable by the
framework and hence need to implement the Writable interface.
Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job –
(Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3>(Output).
          Input             Output
map       <k1, v1>          <k2, v2>
reduce    <k2, list(v2)>    <k3, v3>
Input Format:
The input data is divided into "splits" which are chunks of data processed by
individual mappers, and the "InputFormat" class defines how these splits are read
and interpreted as key-value pairs.
Map Phase:
Each mapper takes a set of key-value pairs from an input split, performs
operations on them, and generates a new set of key-value pairs as
intermediate output.
Reduce Phase:
Each reducer receives a set of key-value pairs with the same key, performs
aggregations or other calculations on the values, and produces a final key-value
pair as the output.
Partitioning:
The "Partitioner" class is used to determine which reducer receives which set of
key-value pairs for efficient processing.
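As an illustration (assuming the org.apache.hadoop.mapreduce API; the class name WordPartitioner is hypothetical), a custom Partitioner decides which reducer receives each intermediate pair; the sketch below mirrors the behaviour of Hadoop's default hash-based partitioning:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives each intermediate <key, value> pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).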
Job Configuration:
The Job object ties these pieces together: it specifies the input and output paths, the
InputFormat and OutputFormat, and the Mapper, Combiner and Reducer classes (see the
configuration steps described below).
While programming for a distributed system, we cannot use the standard Java data types
as keys and values. This is because they do not know how to read from/write to the disk or
network, i.e., they are not serializable in Hadoop's sense.
Serialization is the process of converting object data into byte stream data for
transmission over a network across different nodes in a cluster or for persistent
data storage.
The Hadoop framework therefore needs its key and value types to implement the Writable
interface in order to perform the tasks described below.
This can be done by simply writing the keyword ‘implements Writable’ in the class
declaration and overriding the default writable methods.
Writable is a strong interface in Hadoop which, while serializing the data, reduces the
data size enormously, so that data can be exchanged easily within the network.
It has separate read and write methods (readFields() and write()) to read data from the
network and write data to the local disk, respectively.
Every data type inside Hadoop should accept the Writable and Comparable interface properties.
Hadoop provides Writable interface based data types for serialization and de-serialization
of data storage in HDFS and MapReduce computations.
Serialization is not the only concern of the Writable interface; it also has to support compare
and sort operations in Hadoop.
All the Writable wrapper classes have a get() and a set() method for retrieving and
storing the wrapped value.
Hadoop also provides another interface called WritableComparable.
As we know, data flows from mappers to reducers in the form of (key, value) pairs. It is
important to note that any data type used for the key must implement the
WritableComparable interface along with Writable interface to compare the keys of
this type with each other for sorting purposes, and any data type used for the value
must implement the Writable interface.
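A hedged sketch of a custom key type (the class name YearMonthKey is hypothetical) that implements WritableComparable: write() and readFields() make it serializable, and compareTo() lets the framework sort keys of this type.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() { }                        // required no-arg constructor

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {                 // sorting order
        int cmp = Integer.compare(year, other.year);
        return (cmp != 0) ? cmp : Integer.compare(month, other.month);
    }

    @Override
    public int hashCode() { return 31 * year + month; }        // used when partitioning by hash

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthKey)) return false;
        YearMonthKey k = (YearMonthKey) o;
        return year == k.year && month == k.month;
    }
}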
Writable Classes
Primitive Writable Classes-These are Writable wrappers for Java primitive data types and
they hold a single primitive value.
1. BooleanWritable
2. ByteWritable
3. IntWritable
4. VIntWritable
5. FloatWritable
6. LongWritable
7. VLongWritable
8. DoubleWritable
Note that the serialized sizes of the above primitive Writable data types are the same as
the sizes of the actual Java data types (except for VIntWritable and VLongWritable, which
use a variable-length encoding).
Hadoop provides two types of array Writable classes: one for single-dimensional and
another for two-dimensional arrays:
ArrayWritable
TwoDArrayWritable
The elements of these arrays must be other Writable objects like IntWritable or
FloatWritable only, not Java native data types like int or float.
Hadoop provides the following MapWritable data types, which implement the java.util.Map
interface:
MapWritable
SortedMapWritable
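For example (a small illustrative snippet), a MapWritable holds Writable keys mapped to Writable values and can itself be used as a MapReduce value:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapWritableExample {
    public static void main(String[] args) {
        // MapWritable implements java.util.Map<Writable, Writable>.
        MapWritable counts = new MapWritable();
        counts.put(new Text("hadoop"), new IntWritable(2));
        counts.put(new Text("hdfs"), new IntWritable(5));

        Writable value = counts.get(new Text("hdfs"));
        System.out.println(value);   // prints 5
    }
}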
Combiners are mini-reducers that run in-memory immediately after the map
phase.
They reside on the same machines as the mappers.
They are used for local aggregation of intermediate results before the tuples are
sent to the reducers.
The aim of using combiners is to reduce network traffic and minimize the load on the
reducers.
Note that the combiners have the exact same code as the reducers, so they
perform the exact same operation as the reducers, but locally.
Another component, known as the Reporter, allows us to report the status of a task to
Hadoop.
The execution framework handles everything else apart from the above operations,
including scheduling, data distribution, synchronization, and handling of errors and faults.
We don’t know where a particular mapper or reducer runs, when it begins or ends, or
which intermediate key-value pairs a particular reducer will process.
Combiners reside on the same machine as the mappers. So, data doesn't have to
be written to the disk while being moved between the mappers and the combiners.
Combiners run the same code that the reducers run, but they run this code locally
i.e. on the data that is generated by their corresponding mappers.
This is called local aggregation. This not only speeds up processing, but also
reduces the workload on the reducers, as the reducers will now receive locally
aggregated tuples.
1. Configure the Job: Specify Input, Output, Mapper, Reducer and Combiner
2. Implement the Mapper: For example, to count the number of occurrences of words in a
line, tokenize the text and emit the words with a count of 1 i.e. <word, 1>
3. Implement the Reducer: Sum up counts for each word and write the result to HDFS
The entire workflow of a MapReduce job exclusively uses <k, v> pairs and can be
denoted as follows:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3>
(output)
i. The Job class encapsulates information about the job and handles the execution of
the job.
ii. A job is packaged within a jar file and this file is distributed among nodes by the
Hadoop framework. We need to specify our jar file to Hadoop using the
Job.setJarByClass() function.
iii. Specify the Input: Input is specified by implementing InputFormat, for example
TextInputFormat. The input can be a file/directory/file pattern. InputFormat is
responsible for creating splits (InputSplits) and a RecordReader. It also controls
the input types of the (key, value) pairs. With TextInputFormat, mappers receive the
input one line at a time.
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
iv. Specify the Output: Output is specified by implementing OutputFormat, for example
TextOutputFormat. It basically defines the specification for the output returned by
the MapReduce job. We must define the output folder. However, the MapReduce
program will not work if the output folder exists in advance. By default, a
MapReduce job uses a single reducer. However, if we use multiple reducers, there
will be multiple output files, one per reducer, and we must manually concatenate
these files to get the expected output of the MapReduce job.
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
v. We must also set the output types for the (key, value) pairs for both mappers and
reducers:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
vi. Usually we use the same output types for both mappers and reducers, but if we
need to set different types, we can use setMapOutputKeyClass(),
setMapOutputValueClass(), etc.
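Putting steps (i)–(vi) together, here is a hedged sketch of a word-count driver class (the class names WordCount, WordCountMapper and WordCountReducer are hypothetical; the Mapper and Reducer sketches follow in the next subsections):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);            // ii. ship this jar to the nodes

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional local aggregation
        job.setReducerClass(WordCountReducer.class);

        // iii. input specification
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));

        // iv. output specification (the output folder must not already exist)
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // v. output (key, value) types for mappers and reducers
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}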
i. The Mapper class has 4 parameters: input key, input value, output key, output
value
ii. It makes use of Hadoop's IO framework for input/output operations
iii. We must define the map() function that takes some input (key, value) pair and
outputs another (key, value) pair, depending on the problem at hand
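A minimal sketch of such a Mapper for the word-count problem (the class name WordCountMapper is hypothetical), emitting <word, 1> for every token in the input line:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Four type parameters: input key, input value, output key, output value.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);      // emit <word, 1>
        }
    }
}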
i. The Reducer class also has 4 parameters: input key, input value, output key,
output value
ii. The (key, value) pairs generated by the mappers are grouped by key, sorted and
sent to the reducers; each reducer receives all values corresponding to a certain
key
iii. We must implement the reduce() function that takes some input (key, <set of
values>) pair and outputs (key, value) pairs
iv. The output types of the map function must match the input type of the reduce
function
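A matching sketch of the Reducer (the class name WordCountReducer is hypothetical); note that its input types (Text, IntWritable) match the Mapper's output types:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Four type parameters: input key, input value, output key, output value.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts for one word arrive together; sum them up.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // emit <word, total count>
    }
}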
We can choose to use combiners immediately after the mappers, provided that their
output type matches that of the mappers.
This will reduce the number of tuples that will be sent to the reducers, due to local
aggregation by the combiners.
In this stage, we run the MapReduce job and the output(s) get(s) saved in the
output folder.
Data Serialization
In distributed systems like Hadoop, the concept of serialization is used especially for
interprocess communication and for persistent storage.
Serialization is the process of turning structured objects into a byte stream for
transmission over a network or for writing to persistent storage.
Deserialization is the process of turning a byte stream back into a series of structured
objects.
In Hadoop, interprocess communication between nodes in the system is implemented
using remote procedure calls (RPCs).
In general, it is desirable that an RPC serialization format is:
Compact: A compact format makes the best use of network bandwidth.
Fast: Interprocess communication forms the backbone for a distributed
system, so it is essential that there is as little performance overhead as
possible for the serialization and deserialization process.
Extensible: Protocols change over time to meet new requirements, so it
should be straightforward to evolve the protocol in a controlled manner for
clients and servers.
Interoperable: For some systems, it is desirable to be able to support
clients that are written in different languages to the server.
Hadoop’s serialization support consists of
a) The Writable Interface
b) Writable Classes
c) Implementing a Custom Writable
d) Serialization Frameworks
e) Avro
a) The Writable Interface: The Writable interface defines two methods: one for
writing the object's state to a DataOutput binary stream, and one for reading its state
from a DataInput binary stream.
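For reference, the Writable interface in org.apache.hadoop.io is just these two methods (shown here for illustration):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// org.apache.hadoop.io.Writable
public interface Writable {
    void write(DataOutput out) throws IOException;     // serialize the object's state
    void readFields(DataInput in) throws IOException;  // deserialize the object's state
}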
e) Avro: Apache Avro is a language-neutral data serialization system. The project was
created by Doug Cutting (the creator of Hadoop) to address the major downside of
Hadoop Writables: lack of language portability.
The Avro specification precisely defines the binary format that all implementations must
support.
Data formats such as JSON and XML are often used as the format for storing serialized
data.
Custom binary formats are also used, which tend to be more space-efficient due to less
markup/tagging in the serialization.
Data serialization is the process of converting data objects present in complex data
structures into a byte stream for storage, transfer and distribution purposes on physical
devices.
Computer systems may vary in their hardware architecture, OS, and addressing mechanisms.
Internal binary representations of data also vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform- and
language-neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination
machine,
the reverse process of creating objects from the byte sequence called deserialization is
carried out. Reconstructed objects are clones of the original object.
Choice of data serialization format for an application depends on factors such as data
complexity, need for human readability, speed and storage space constraints. XML,
JSON,
BSON, YAML, MessagePack, and protobuf are some commonly used data serialization
formats.
Computer data is generally organized in data structures such as arrays, tables, trees, and
classes. When data structures need to be stored or transmitted to another location, such
as across a network, they are serialized.
Could you list some text-based Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
XML (Extensible Markup Language) - Nested textual format. Human-readable and
editable.
Schema based validation. Used in metadata applications, web services data transfer, web
publishing.
CSV (Comma-Separated Values) - Table structure with delimiters. Human-readable
textual data. Opens as a spreadsheet or plaintext. Used as a plaintext database.
JSON (JavaScript Object Notation) - Short syntax textual format with limited data types.
Human-readable. Derived from JavaScript data formats. No need of a separate parser (like
XML) since they map to JavaScript objects. Can be fetched with an XMLHttpRequest call.
No direct support for DATE data type. All data is dynamically processed. Popular format
for web API parameter passing. Mobile apps use this extensively for user interaction and
database services.
YAML (YAML Ain't Markup Language) - Lightweight text format. Human-readable.
Supports comments and thus easily editable. Superset of JSON. Supports complex data
types. Maps easily to native data structures. Used in configuration settings, document
headers, Apps with need for MySQL style self-references in relational data.
Could you list some binary Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
BSON (Binary JSON) - Created and internally used by MongoDB. Binary format, not
human-readable. Deals with attribute-value pairs like JSON. Includes datetime, bytearray
and other data types not present in JSON. Used in web apps with rich media data types
such as live video. Primary use is storage, not network communication.
MessagePack - Designed for data to be transparently converted from/to JSON.
Compressed binary format, not human-readable. Supports static typing. Supports RPC.
Better JSON compatibility than BSON. Primary use is network communication, not
storage. Used in apps with distributed file systems.
protobuf (Protocol Buffers) - Created by Google. Binary message format that allows
programmers to specify a schema for the data. Also includes a set of rules and tools to
define and exchange these messages. Transparent data compression. Used in multi-
platform applications due to easy interoperability between languages. Universal RPC
framework. Used in performance-critical distributed applications.
Data serialization in Hadoop refers to the process of converting data into a format that can
be efficiently stored, transmitted, and reconstructed.
What is Serialization?
Serialization is the process of translating data structures or object state into binary or
textual form to transport the data over a network or to store it on some persistent storage.
Once the data is transported over network or retrieved from the persistent storage, it
needs to be deserialized again.
Serialization is termed as marshalling and deserialization is termed as unmarshalling.
Writable Interface
This is the interface in Hadoop which provides methods for serialization and
deserialization.
IntWritable Class
Methods:
1. int get() – Using this method you can get the integer value present in the current object.
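A short illustrative snippet using get(), together with the set() method that, as noted earlier, every Writable wrapper class provides (the class name IntWritableExample is hypothetical):
import org.apache.hadoop.io.IntWritable;

public class IntWritableExample {
    public static void main(String[] args) {
        IntWritable iw = new IntWritable(42);     // wrap a Java int
        System.out.println(iw.get());             // 42 -- read the wrapped value

        iw.set(100);                              // store a new value
        System.out.println(iw.get());             // 100
    }
}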