BDT Unit 2 Textbook
The Hadoop ecosystem is a platform, or framework, that helps in solving big data problems. It
comprises different components and services (for ingesting, storing, analyzing, and maintaining data).
Most of the services available in the Hadoop ecosystem supplement the four core components of
Hadoop: HDFS, YARN, MapReduce, and Common.
The Hadoop ecosystem includes both Apache open-source projects and a wide variety of commercial
tools and solutions. Some of the well-known open-source examples include
Spark, Hive, Pig, Sqoop, and Oozie.
Now that we have some idea of what the Hadoop ecosystem is, what it does, and what its components are,
let's discuss each concept in detail.
Mentioned below are the Hadoop components which together construct a Hadoop ecosystem.
Let's get into the details.
NameNode:
• NameNode is a daemon which maintains and operates all DataNodes (slave nodes).
• It acts as the recorder of metadata for all blocks, and it contains information such as size,
location, source, and hierarchy.
• It records all changes that happen to the metadata.
• If any file gets deleted in HDFS, the NameNode automatically records it in the EditLog.
• NameNode frequently receives heartbeats and block reports from the DataNodes in the cluster to
ensure they are live and working.
DataNode:
• DataNode is a daemon that runs on each slave machine and stores the actual data blocks of HDFS.
• It serves read and write requests from clients and carries out block creation, deletion, and
replication as instructed by the NameNode.
• It regularly sends heartbeats and block reports to the NameNode.
YARN:
YARN (Yet Another Resource Negotiator) acts as the brain of the Hadoop ecosystem. It takes
responsibility for providing the computational resources needed for application execution.
YARN consists of two essential components: the Resource Manager and the Node Manager.
Resource Manager:
• It works at the cluster level and takes responsibility for running the master machine.
• It keeps track of the heartbeats from the Node Managers.
• It takes job submissions and negotiates the first container for executing an application.
• It consists of two components: the Application Manager and the Scheduler.
Node manager:
• It works at the node level and runs on every slave machine in the cluster.
• It manages the containers in which application tasks run and monitors their resource usage
(CPU, memory, disk).
• It sends heartbeats and resource-usage reports to the Resource Manager.
MapReduce
MapReduce acts as a core component of the Hadoop ecosystem because it provides the processing logic. To
put it simply, MapReduce is a software framework which enables us to write applications that
process large data sets using distributed and parallel algorithms in a Hadoop environment.
The parallel-processing feature of MapReduce plays a crucial role in the Hadoop ecosystem. It helps in
performing big data analysis using multiple machines in the same cluster.
In a MapReduce program, we have two functions; one is Map, and the other is Reduce.
Map function: It converts one set of data into another, where individual elements are broken down
into tuples (key/value pairs).
Reduce function: It takes the output of the Map function as its input. The Reduce function aggregates
and summarizes the results produced by the Map function. A minimal word-count sketch follows.
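As an illustration (not part of the original description), here is a minimal word-count sketch using the Hadoop MapReduce Java API. The class names are hypothetical, and the snippet assumes the standard hadoop-mapreduce-client libraries are on the classpath; a driver class would still be needed to submit the job.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map function: breaks each input line into (word, 1) key/value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reduce function: aggregates and summarizes the counts emitted by the Map function.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
    }
}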
Apache Spark
Apache Spark is an essential project of the Apache Software Foundation, and it is considered
a powerful data processing engine. Spark powers big data applications around the world. It
all started with the increasing needs of enterprises, which MapReduce was unable to handle.
The growth of huge amounts of unstructured data, the increased need for speed, and the demand for
real-time analytics led to the invention of Apache Spark.
Spark Features
Spark is equipped with high-level libraries and supports R, Python, Scala, Java, etc. These standard
libraries make data processing seamless and highly reliable. Spark can process enormous
amounts of data with ease, while Hadoop was designed to store the unstructured data that must be
processed. When we combine the two, we get the desired results.
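To give a flavour of the Spark API, here is a minimal sketch in Java that counts words in a file stored on HDFS. The application name and the input/output paths are hypothetical, and the snippet assumes the spark-core library is available; in a real deployment the master would normally be supplied by spark-submit.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // setMaster("local[*]") is only for a quick local test; drop it when using spark-submit.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read a text file from HDFS (hypothetical path).
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
            // Split lines into words, map each word to (word, 1), and sum the counts.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///data/wordcount-output");
        }
    }
}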
Hive
Apache Hive is open-source data warehouse software built on Apache Hadoop for performing data
query and analysis. Hive mainly does three functions: data summarization, query, and analysis. Hive
uses a language called HiveQL (HQL), which is similar to SQL. HiveQL works as a translator: it
translates the SQL-like queries into MapReduce jobs, which are executed on Hadoop.
Metastore- It serves as the storage location for the metadata. This metadata holds the information
about each table, such as its location and schema. The metastore keeps track of the data, replicates
it, and acts as a backup store in case of data loss.
Driver- The driver receives the HiveQL statements and acts as a controller. It observes the progress
and life cycle of the various executions by creating sessions. Whenever HiveQL executes a statement,
the driver stores the metadata generated by that action.
Compiler- The compiler is given the task of converting the HiveQL query into MapReduce input. It is
designed to carry out the steps and functions needed to produce the HiveQL output in the form
required by MapReduce.
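A common way to run HiveQL from an application is through HiveServer2's JDBC interface. The sketch below is illustrative only: the server host, credentials, and the employees table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to a (hypothetical) HiveServer2 instance.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; Hive compiles it into jobs that run on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) AS emp_count " +
                "FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}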
HBase
HBase is considered a Hadoop database because it is a scalable, distributed NoSQL database that runs
on top of Hadoop. Apache HBase is designed to store structured data in a table format that can have
millions of columns and billions of rows. HBase gives real-time read/write access to data stored
in HDFS.
HBase features
Components of HBase
There are two major components in HBase: the HBase Master and the Region Server.
a) HBase Master: It is not part of the actual data storage, but it manages load-balancing activities
across all Region Servers.
b) Region Server: It is a worker node. It handles read, write, and delete requests from clients. A
Region Server runs on every node of the Hadoop cluster, on top of the HDFS DataNodes.
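The sketch below shows how a client might write and read a cell through the HBase Java API. The table name, column family, and row key are hypothetical; the hbase-client library and a reachable cluster (configured via hbase-site.xml) are assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back in (near) real time.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}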
HCatalog
HCatalog is a table and storage management tool for Hadoop. It exposes the tabular metadata stored
in the Hive metastore to all other Hadoop applications. HCatalog enables Hadoop components such as
Hive, Pig, and MapReduce to quickly read and write data on the cluster. HCatalog is a crucial
feature of Hive which allows users to store their data in any format and structure.
By default, HCatalog supports the CSV, JSON, RCFile, ORC, and SequenceFile formats.
Benefits of HCatalog
• It assists integration with the other Hadoop tools and lets them read data from and write data
into a Hadoop cluster. It allows notifications of data availability.
• It provides APIs and web services to access the metadata from the Hive metastore.
• It gives visibility to data archiving and data cleaning tools.
Apache Pig
Apache Pig is a high-level language platform for analyzing and querying large data sets that are stored
in HDFS. Pig works as an alternative to Java programming for MapReduce and generates
MapReduce functions automatically. Pig comes with Pig Latin, a scripting language. Pig
translates Pig Latin scripts into MapReduce jobs which run on YARN and process data in the HDFS
cluster.
Pig is best suited for solving complex use cases that require multiple data operations. It is more of
a data-flow processing language than a query language such as SQL. Pig is considered highly
customizable because users can write their own functions in their preferred scripting language.
We use the 'load' command to load data into Pig. Then, we can perform various operations such as
grouping, filtering, joining, and sorting the data. Finally, you can dump the data on the screen, or
you can store the result back in HDFS according to your requirement, as in the sketch below.
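As a rough sketch (the data path, field layout, and filter condition are all hypothetical), the following Java snippet embeds a few Pig Latin statements and runs them through Pig's PigServer API; the same statements could equally be typed into the Grunt shell.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements programmatically in MapReduce execution mode.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load tab-separated data from a hypothetical HDFS path.
        pig.registerQuery("logs = LOAD '/data/access_logs' AS (user:chararray, url:chararray, bytes:long);");
        // Filter, group, and aggregate.
        pig.registerQuery("big = FILTER logs BY bytes > 1024;");
        pig.registerQuery("grouped = GROUP big BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group AS user, SUM(big.bytes) AS total_bytes;");
        // Store the result back into HDFS.
        pig.store("totals", "/data/bytes_per_user");
        pig.shutdown();
    }
}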
Apache Sqoop
Sqoop works as a front-end loader for big data. Sqoop is a front-end interface that enables moving
bulk data between Hadoop and relational databases and other structured data stores such as data marts.
Sqoop replaces the need to develop custom scripts to import and export data. It mainly helps in
moving data from an enterprise database into a Hadoop cluster for ETL processing. A small usage
sketch follows the task list below.
Apache Sqoop undertakes the following tasks to integrate bulk data movement between Hadoop and
structured databases.
• Sqoop fulfills the growing need to transfer data from the mainframe to HDFS.
• Sqoop helps in achieving improved compression and lightweight indexing for advanced query
performance.
• It transfers data in parallel for effective performance and optimal system utilization.
• Sqoop creates fast data copies from an external source into Hadoop.
• It acts as a load balancer by offloading extra storage and processing loads to other devices.
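A minimal sketch of a Sqoop import, under assumptions: the JDBC URL, credentials, table, and target directory are hypothetical, and the snippet assumes a Sqoop 1.x client jar is on the classpath, using org.apache.sqoop.Sqoop.runTool (the same entry point used by the sqoop command-line launcher) to pass the familiar import options.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        // Connection details and table name below are hypothetical.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",
            "--table", "orders",
            "--target-dir", "/data/raw/orders",
            "--num-mappers", "4"          // transfer the data in parallel with 4 map tasks
        };
        // Assumption: Sqoop.runTool is available on this Sqoop version and returns an exit code.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}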
Oozie
Apache Oozie is a tool in which all sorts of programs can be pipelined in a required order to work in
Hadoop's distributed environment. Oozie works as a scheduler system to run and manage Hadoop
jobs.
Oozie allows multiple complex jobs to be combined and run in a sequential order to achieve the desired
output. It is strongly integrated with the Hadoop stack, supporting various jobs such as Pig, Hive, and
Sqoop, as well as system-specific jobs such as Java and shell actions. Oozie is an open-source Java web
application; a minimal client-side submission sketch appears after the list below.
1. Oozie workflow: It is a collection of actions arranged to perform the jobs one after another. It is
just like a relay race where one has to start right after one finishes, to complete the race.
2. Oozie Coordinator: It runs workflow jobs based on data availability and predefined schedules.
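The sketch below shows roughly how an application could submit a workflow through the Oozie Java client. The Oozie server URL, the HDFS path of the workflow application, and the property names other than the application path are hypothetical placeholders; the workflow.xml itself is assumed to already exist in HDFS.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Point the client at a (hypothetical) Oozie server.
        OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

        // Job properties: where the workflow.xml lives and values that the workflow refers to.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/etl/workflows/daily-ingest");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "resourcemanager.example.com:8032");

        // Submit and start the workflow, then check its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}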
Avro
Apache Avro is a part of the Hadoop ecosystem, and it works as a data serialization system. It is an
open-source project which helps Hadoop with data serialization and data exchange. Avro enables the
exchange of big data between programs written in different languages. It serializes data into files
or messages.
Avro Schema: The schema allows Avro to serialize and deserialize data without code generation. Avro
needs a schema to read and write data. Whenever we store data in a file, its schema is stored along
with it, so the file can be processed later by any program.
Dynamic typing: This means serializing and deserializing data without generating any code. Code
generation for statically typed languages is available only as an optional optimization.
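A small sketch of this round trip with the Avro Java API, under assumptions: the User schema, field names, and file name are hypothetical, and the avro library is assumed to be on the classpath. Note that the schema travels with the data file, and no generated classes are needed to read it back.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Define a schema (normally kept in a .avsc file).
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize: the schema is written into the file along with the data.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: no generated code is needed (dynamic typing).
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " is " + record.get("age"));
            }
        }
    }
}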
Avro features
Apache Drill
The primary purpose of the Hadoop ecosystem is to process large sets of data, whether structured or
unstructured. Apache Drill is a low-latency distributed query engine that is designed to scale to
several thousands of nodes and query petabytes of data. Drill also has the ability to discard cached
data and release the space.
Features of Drill
Apache Zookeeper
Apache Zookeeper is an open-source project designed to coordinate multiple services in the Hadoop
ecosystem. Organizing and maintaining a service in a distributed environment is a complicated task.
Zookeeper solves this problem with its simple APIs and Architecture. Zookeeper allows developers to
focus on core applications instead of concentrating on a distributed environment of the application.
Features of Zookeeper
• Zookeeper is fast with workloads where reads of the data are more common than writes.
• Zookeeper is disciplined in the sense that it maintains a record of all transactions.
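Building on the description above, here is a minimal client sketch using the ZooKeeper Java API: it connects to an ensemble and stores a small piece of coordination data in a znode. The ensemble address, znode path, and value are hypothetical, and the zookeeper client library is assumed.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble and wait until the session is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of coordination data in an ephemeral znode, then read it back.
        // The ephemeral node is removed automatically when this session ends.
        String path = "/active-master";
        zk.create(path, "node-1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        byte[] data = zk.getData(path, false, null);
        System.out.println("active master: " + new String(data));

        zk.close();
    }
}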
Apache Flume
Flume collects, aggregates, and moves large sets of data from their origin into HDFS. It works in a
fault-tolerant manner. It helps in streaming data from a source into the Hadoop environment. Flume
enables its users to get data from multiple servers into Hadoop immediately.
Apache Ambari
Ambari is open-source software from the Apache Software Foundation. It makes Hadoop manageable. It
consists of software that is capable of provisioning, managing, and monitoring Apache Hadoop
clusters. Let's discuss each of these capabilities.
Hadoop cluster provisioning: It guides us with a step-by-step procedure on how to install Hadoop
services across many hosts. Ambari handles the configuration of Hadoop services across all clusters.
Hadoop Cluster management: It acts as a central management system for starting, stopping, and
reconfiguring of Hadoop services across all clusters.
Hadoop cluster monitoring: Ambari provides us with a dashboard for monitoring the health and status
of the cluster. The Ambari framework acts as an alerting system that notifies us when anything goes
wrong. For example, if a node goes down or a node runs low on disk space, it informs us through a
notification.
RECORD READER:-
The RecordReader class is used by MapReduce in the map tasks to read data from an input split
and provide each record in the form of a key/value pair for use by mappers. A task is commonly
created for each input split, and each task has a single RecordReader that’s responsible for reading
the data for that input split.
DATA OUTPUT:-
MapReduce uses a similar process for supporting output data as it does for input data. Two classes
must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some basic
validation of the data sink properties, and the RecordWriter writes each reducer output to the data
sink.
OUTPUT FORMAT:-
Much like the InputFormat class, the OutputFormat class defines the
contracts that implementers must fulfill, including checking the information related to the job
output, providing a RecordWriter, and specifying an output committer, which allows writes to be
staged and then made “permanent” upon task and/or job success.
RECORD WRITER:-
You’ll use the RecordWriter to write the reducer outputs to the destination data sink.
It’s a simple class.
Hadoop's MapReduce is not restricted to processing textual data; it has support for binary formats,
too. Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files are
well suited as a format for MapReduce data because they are splittable (they have sync points so that
readers can synchronize with record boundaries from an arbitrary point in the file, such as the start
of a split), they support compression as a part of the format, and they can store arbitrary types
using a variety of serialization frameworks.
SequenceFileAsTextInputFormat – a variant of SequenceFileInputFormat that converts the sequence
file's keys and values to Text objects. The conversion is performed by calling toString() on the keys
and values. This format makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat – a variant of SequenceFileInputFormat that retrieves the sequence
file's keys and values as opaque binary objects. They are encapsulated as BytesWritable objects, and
the application is free to interpret the underlying bytes as it sees fit.
FixedLengthInputFormat – used for reading fixed-width binary records from a file, when the records
are not separated by delimiters. The record size must be set via
fixedlengthinputformat.record.length.
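As a small illustration (the job names, paths, and record length are hypothetical), this is roughly how a driver would select these input formats through the Hadoop Java API; FixedLengthInputFormat.setRecordLength sets the fixedlengthinputformat.record.length property mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;

public class InputFormatConfigExample {
    public static void main(String[] args) throws Exception {
        // A job reading a sequence file with keys/values converted to Text:
        Job seqJob = Job.getInstance(new Configuration(), "read-sequence-file-as-text");
        seqJob.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        FileInputFormat.addInputPath(seqJob, new Path("/data/events.seq"));

        // A job reading fixed-width binary records of 64 bytes each:
        Configuration fixedConf = new Configuration();
        FixedLengthInputFormat.setRecordLength(fixedConf, 64);  // fixedlengthinputformat.record.length
        Job fixedJob = Job.getInstance(fixedConf, "read-fixed-length-records");
        fixedJob.setInputFormatClass(FixedLengthInputFormat.class);
        FileInputFormat.addInputPath(fixedJob, new Path("/data/records.bin"));
        // ... mapper/reducer classes and output settings would be configured here ...
    }
}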
4. Data Serialization
In some distributed systems, data and its replicas are stored in different partitions on multiple
cluster members. If data is not present on the local member, the system will retrieve that data
from another member. This requires serialization for use cases such as:
• Adding key/value objects to a map
• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic
Data formats such as JSON and XML are often used as the format for storing serialized data.
Custom binary formats are also used; they tend to be more space-efficient due to less
markup/tagging in the serialization.
Big data systems often include technologies/data that are described as “schemaless.” This
means that the managed data in these systems are not structured in a strict format, as defined
by a schema. Serialization provides several benefits in this type of environment:
Computer systems may vary in their hardware architecture, operating system, and addressing
mechanisms. Internal binary representations of data vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform- and language-
neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence called deserialization is carried
out. Reconstructed objects are clones of the original object.
Choice of data serialization format for an application depends on factors such as data
complexity, need for human readability, speed and storage space constraints. XML, JSON,
BSON, YAML, MessagePack, and protobuf are some commonly used data serialization
formats.
Computer data is generally organized in data structures such as arrays, tables, trees, and classes.
When data structures need to be stored or transmitted to another location, such as across a network,
they are serialized.
How does data serialization and deserialization work?
For simple, linear data (a number or a string) there is nothing special to do. Serialization becomes
complex for nested data structures and object references. When objects are nested over multiple
levels, as in trees, the structure is collapsed into a series of bytes, and enough information (such
as the traversal order) is included to aid reconstruction of the original tree structure on the
destination side.
When objects with pointer references to other member variables are serialized, the referenced
objects are tracked and serialized, ensuring that the same object is not serialized more than once.
However, all nested objects must be serializable too.
Finally, the serialized data stream is persisted in a byte sequence using a standard format. ISO-
8859-1 is a popular format for 1-byte representation of English characters and numerals. UTF-8 is
the world standard for encoding multilingual, mathematical and scientific data; each character may
take 1-4 bytes of data in Unicode.
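A minimal sketch of this round trip in Java, using the built-in Serializable mechanism (the Order and Customer classes are hypothetical): a nested object graph is collapsed to a byte sequence and then reconstructed as a clone of the original.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationExample {
    // Nested structure: an Order refers to a Customer; both must be serializable.
    static class Customer implements Serializable {
        String name;
        Customer(String name) { this.name = name; }
    }
    static class Order implements Serializable {
        int id;
        Customer customer;   // the referenced object is serialized along with the Order
        Order(int id, Customer customer) { this.id = id; this.customer = customer; }
    }

    public static void main(String[] args) throws Exception {
        Order original = new Order(42, new Customer("Alice"));

        // Serialization: collapse the object graph into a byte sequence.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // Deserialization: rebuild a clone of the original object from the bytes.
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Order copy = (Order) in.readObject();
            System.out.println("order " + copy.id + " for " + copy.customer.name);
        }
    }
}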
What are the applications of Data Serialization?
Serialization allows a program to save the state of an object and recreate it when needed. Its
common uses are:
Persisting data onto files – happens mostly in language-neutral formats such as CSV or XML.
However, most languages allow objects to be serialized directly into binary using APIs such as the
Serializable interface in Java, fstream class in C++, or Pickle module in Python.
Storing data into Databases – when program objects are converted into byte streams and then
stored into DBs, such as in Java JDBC.
Transferring data through the network – such as web applications and mobile apps passing on
objects from client to server and vice versa.
Remote Method Invocation (RMI) – by passing serialized objects as parameters to functions
running on a remote machine as if invoked on a local machine. This data can be transmitted across
domains through firewalls.
Sharing data in a Distributed Object Model – when programs written in different languages
(running on diverse platforms) need to share object data over a distributed network using
frameworks such as COM and CORBA. However, SOAP, REST and other web services have
replaced these applications now.
Could you list some text-based Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
XML (Extensible Markup Language) - Nested textual format. Human-readable and editable.
Schema based validation. Used in metadata applications, web services data transfer, web
publishing.
CSV (Comma-Separated Values) - Table structure with delimiters. Human-readable textual data.
Opens as spreadsheet or plaintext. Used as plaintext Database.
JSON (JavaScript Object Notation) - Short syntax textual format with limited data types. Human-
readable. Derived from JavaScript data formats. No need of a separate parser (like XML) since
they map to JavaScript objects. Can be fetched with an XMLHttpRequest call. No direct support
for DATE data type. All data is dynamically processed. Popular format for web API parameter
passing. Mobile apps use this extensively for user interaction and database services.
YAML (YAML Ain't Markup Language) - Lightweight text format. Human-readable. Supports
comments and thus easily editable. Superset of JSON. Supports complex data types. Maps easily
to native data structures. Used in configuration settings, document headers, Apps with need for
MySQL style self-references in relational data.
Could you list some binary Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
BSON (Binary JSON) - Created and internally used by MongoDB. Binary format, not human-
readable. Deals with attribute-value pairs like JSON. Includes datetime, bytearray and other data
types not present in JSON. Used in web apps with rich media data types such as live video. Primary
use is storage, not network communication.
MessagePack - Designed for data to be transparently converted from/to JSON. Compressed binary
format, not human-readable. Supports static typing. Supports RPC. Better JSON compatibility
than BSON. Primary use is network communication, not storage. Used in apps with distributed
file systems.
protobuf (Protocol Buffers) - Created by Google. Binary message format that allows programmers
to specify a schema for the data. Also includes a set of rules and tools to define and exchange these
messages. Transparent data compression. Used in multi-platform applications due to easy
interoperability between languages. Universal RPC framework. Used in performance-critical
distributed applications.