BDT Unit 2 Textbook
The Hadoop ecosystem is a platform, or framework, that helps in solving big data problems. It
comprises different components and services (for ingesting, storing, analyzing, and maintaining data).
Most of the services available in the Hadoop ecosystem supplement the four core components of
Hadoop: HDFS, YARN, MapReduce, and Common.
The Hadoop ecosystem includes both Apache open-source projects and a wide variety of commercial
tools and solutions. Some of the well-known open-source examples include
Spark, Hive, Pig, Sqoop, and Oozie.
Now that we have some idea of what the Hadoop ecosystem is, what it does, and what its components are,
let's discuss each concept in detail.
Mentioned below are the Hadoop components which together construct a Hadoop ecosystem.
Let's get into the details.
NameNode:
• NameNode is a daemon which maintains and operates all DataNodes (slave nodes).
• It acts as the recorder of metadata for all blocks, and it contains information such as size,
location, source, and hierarchy.
• It records all changes that happen to the metadata.
• If any file gets deleted in HDFS, the NameNode automatically records it in the EditLog.
• NameNode frequently receives heartbeats and block reports from the DataNodes in the cluster to
ensure they are live and working.
DataNode:
• DataNode is a daemon that runs on each slave machine and stores the actual data blocks of HDFS.
• It serves read and write requests from clients and carries out block creation, deletion, and
replication as instructed by the NameNode.
• It regularly sends heartbeats and block reports to the NameNode.
YARN:
YARN (Yet Another Resource Negotiator) acts as the brain of the Hadoop ecosystem. It takes
responsibility for providing the computational resources needed for application execution.
YARN consists of two essential components: the Resource Manager and the Node Manager.
Resource Manager:
• It works at the cluster level and takes responsibility for running the master machine.
• It keeps track of the heartbeats from the Node Managers.
• It takes job submissions and negotiates the first container for executing an application.
• It consists of two components: the Application Manager and the Scheduler.
Node manager:
• It works at the node level and runs on every slave machine in the cluster.
• It manages the containers in which application tasks run and monitors their resource usage
(CPU, memory, disk).
• It sends heartbeats and resource-usage reports to the Resource Manager.
MapReduce
MapReduce acts as a core component of the Hadoop ecosystem because it provides the processing logic. To
put it simply, MapReduce is a software framework which enables us to write applications that
process large data sets using distributed and parallel algorithms in a Hadoop environment.
The parallel-processing feature of MapReduce plays a crucial role in the Hadoop ecosystem. It helps in
performing big data analysis using multiple machines in the same cluster.
In a MapReduce program, we have two functions; one is Map, and the other is Reduce.
Map function: It converts one set of data into another, where individual elements are broken down
into tuples (key/value pairs).
Reduce function: It takes the output of the Map function as its input. The Reduce function aggregates
and summarizes the results produced by the Map function. A minimal word-count sketch follows.
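As an illustration (not part of the original description), here is a minimal word-count sketch using the Hadoop MapReduce Java API. The class names are hypothetical, and the snippet assumes the standard hadoop-mapreduce-client libraries are on the classpath; a driver class would still be needed to submit the job.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map function: breaks each input line into (word, 1) key/value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reduce function: aggregates and summarizes the counts emitted by the Map function.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
    }
}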
Apache Spark
Apache Spark is an essential project of the Apache Software Foundation, and it is considered
a powerful data processing engine. Spark powers big data applications around the world. It
all started with the increasing needs of enterprises, which MapReduce was unable to handle.
The growth of huge amounts of unstructured data, the increased need for speed, and the demand for
real-time analytics led to the invention of Apache Spark.
Spark Features
Spark is equipped with high-level libraries and supports R, Python, Scala, Java, etc. These standard
libraries make data processing seamless and highly reliable. Spark can process enormous
amounts of data with ease, while Hadoop was designed to store the unstructured data that must be
processed. When we combine the two, we get the desired results.
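To give a flavour of the Spark API, here is a minimal sketch in Java that counts words in a file stored on HDFS. The application name and the input/output paths are hypothetical, and the snippet assumes the spark-core library is available; in a real deployment the master would normally be supplied by spark-submit.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // setMaster("local[*]") is only for a quick local test; drop it when using spark-submit.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read a text file from HDFS (hypothetical path).
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
            // Split lines into words, map each word to (word, 1), and sum the counts.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///data/wordcount-output");
        }
    }
}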
Hive
Apache Hive is open-source data warehouse software built on Apache Hadoop for performing data
query and analysis. Hive mainly does three functions: data summarization, query, and analysis. Hive
uses a language called HiveQL (HQL), which is similar to SQL. HiveQL works as a translator: it
translates the SQL-like queries into MapReduce jobs, which are executed on Hadoop.
Metastore- It serves as the storage location for the metadata. This metadata holds the information
about each table, such as its location and schema. The metastore keeps track of the data, replicates
it, and acts as a backup store in case of data loss.
Driver- The driver receives the HiveQL statements and acts as a controller. It observes the progress
and life cycle of the various executions by creating sessions. Whenever HiveQL executes a statement,
the driver stores the metadata generated by that action.
Compiler- The compiler is given the task of converting the HiveQL query into MapReduce input. It is
designed to carry out the steps and functions needed to produce the HiveQL output in the form
required by MapReduce.
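A common way to run HiveQL from an application is through HiveServer2's JDBC interface. The sketch below is illustrative only: the server host, credentials, and the employees table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to a (hypothetical) HiveServer2 instance.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; Hive compiles it into jobs that run on the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) AS emp_count " +
                "FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}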
HBase
HBase is considered a Hadoop database because it is a scalable, distributed NoSQL database that runs
on top of Hadoop. Apache HBase is designed to store structured data in a table format that can have
millions of columns and billions of rows. HBase gives real-time read/write access to data stored
in HDFS.
HBase features
Components of HBase
There are two major components in HBase: the HBase Master and the Region Server.
a) HBase Master: It is not part of the actual data storage, but it manages load-balancing activities
across all Region Servers.
b) Region Server: It is a worker node. It handles read, write, and delete requests from clients. A
Region Server runs on every node of the Hadoop cluster, on top of the HDFS DataNodes.
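The sketch below shows how a client might write and read a cell through the HBase Java API. The table name, column family, and row key are hypothetical; the hbase-client library and a reachable cluster (configured via hbase-site.xml) are assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back in (near) real time.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}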
HCatalog
HCatalog is a table and storage management tool for Hadoop. It exposes the tabular metadata stored
in the Hive metastore to all other Hadoop applications. HCatalog enables Hadoop components such as
Hive, Pig, and MapReduce to quickly read and write data on the cluster. HCatalog is a crucial
feature of Hive which allows users to store their data in any format and structure.
By default, HCatalog supports the CSV, JSON, RCFile, ORC, and SequenceFile formats.
Benefits of HCatalog
• It assists integration with the other Hadoop tools and lets them read data from and write data
into a Hadoop cluster. It allows notifications of data availability.
• It provides APIs and web services to access the metadata from the Hive metastore.
• It gives visibility to data archiving and data cleaning tools.
Apache Pig
Apache Pig is a high-level language platform for analyzing and querying large data sets that are stored
in HDFS. Pig works as an alternative to Java programming for MapReduce and generates
MapReduce functions automatically. Pig comes with Pig Latin, a scripting language. Pig
translates Pig Latin scripts into MapReduce jobs which run on YARN and process data in the HDFS
cluster.
Pig is best suited for solving complex use cases that require multiple data operations. It is more of
a data-flow processing language than a query language such as SQL. Pig is considered highly
customizable because users can write their own functions in their preferred scripting language.
We use the 'load' command to load data into Pig. Then, we can perform various operations such as
grouping, filtering, joining, and sorting the data. Finally, you can dump the data on the screen, or
you can store the result back in HDFS according to your requirement, as in the sketch below.
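As a rough sketch (the data path, field layout, and filter condition are all hypothetical), the following Java snippet embeds a few Pig Latin statements and runs them through Pig's PigServer API; the same statements could equally be typed into the Grunt shell.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements programmatically in MapReduce execution mode.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Load tab-separated data from a hypothetical HDFS path.
        pig.registerQuery("logs = LOAD '/data/access_logs' AS (user:chararray, url:chararray, bytes:long);");
        // Filter, group, and aggregate.
        pig.registerQuery("big = FILTER logs BY bytes > 1024;");
        pig.registerQuery("grouped = GROUP big BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group AS user, SUM(big.bytes) AS total_bytes;");
        // Store the result back into HDFS.
        pig.store("totals", "/data/bytes_per_user");
        pig.shutdown();
    }
}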
Apache Sqoop
Sqoop works as a front-end loader for big data. Sqoop is a front-end interface that enables moving
bulk data between Hadoop and relational databases and other structured data stores such as data marts.
Sqoop replaces the need to develop custom scripts to import and export data. It mainly helps in
moving data from an enterprise database into a Hadoop cluster for ETL processing. A small usage
sketch follows the task list below.
Apache Sqoop undertakes the following tasks to integrate bulk data movement between Hadoop and
structured databases.
• Sqoop fulfills the growing need to transfer data from the mainframe to HDFS.
• Sqoop helps in achieving improved compression and lightweight indexing for advanced query
performance.
• It transfers data in parallel for effective performance and optimal system utilization.
• Sqoop creates fast data copies from an external source into Hadoop.
• It acts as a load balancer by offloading extra storage and processing loads to other devices.
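A minimal sketch of a Sqoop import, under assumptions: the JDBC URL, credentials, table, and target directory are hypothetical, and the snippet assumes a Sqoop 1.x client jar is on the classpath, using org.apache.sqoop.Sqoop.runTool (the same entry point used by the sqoop command-line launcher) to pass the familiar import options.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        // Connection details and table name below are hypothetical.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",
            "--table", "orders",
            "--target-dir", "/data/raw/orders",
            "--num-mappers", "4"          // transfer the data in parallel with 4 map tasks
        };
        // Assumption: Sqoop.runTool is available on this Sqoop version and returns an exit code.
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}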
Oozie
Apache Oozie is a tool in which all sorts of programs can be pipelined in a required order to work in
Hadoop's distributed environment. Oozie works as a scheduler system to run and manage Hadoop
jobs.
Oozie allows multiple complex jobs to be combined and run in a sequential order to achieve the desired
output. It is strongly integrated with the Hadoop stack, supporting various jobs such as Pig, Hive, and
Sqoop, as well as system-specific jobs such as Java and shell actions. Oozie is an open-source Java web
application; a minimal client-side submission sketch appears after the list below.
1. Oozie workflow: It is a collection of actions arranged to perform the jobs one after another. It is
just like a relay race where one has to start right after one finishes, to complete the race.
2. Oozie Coordinator: It runs workflow jobs based on data availability and predefined schedules.
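The sketch below shows roughly how an application could submit a workflow through the Oozie Java client. The Oozie server URL, the HDFS path of the workflow application, and the property names other than the application path are hypothetical placeholders; the workflow.xml itself is assumed to already exist in HDFS.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Point the client at a (hypothetical) Oozie server.
        OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

        // Job properties: where the workflow.xml lives and values that the workflow refers to.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/etl/workflows/daily-ingest");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "resourcemanager.example.com:8032");

        // Submit and start the workflow, then check its status.
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}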
Avro
Apache Avro is a part of the Hadoop ecosystem, and it works as a data serialization system. It is an
open-source project which helps Hadoop with data serialization and data exchange. Avro enables the
exchange of big data between programs written in different languages. It serializes data into files
or messages.
Avro Schema: The schema allows Avro to serialize and deserialize data without code generation. Avro
needs a schema to read and write data. Whenever we store data in a file, its schema is stored along
with it, so the file can be processed later by any program.
Dynamic typing: This means serializing and deserializing data without generating any code. Code
generation for statically typed languages is available only as an optional optimization.
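A small sketch of this round trip with the Avro Java API, under assumptions: the User schema, field names, and file name are hypothetical, and the avro library is assumed to be on the classpath. Note that the schema travels with the data file, and no generated classes are needed to read it back.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Define a schema (normally kept in a .avsc file).
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize: the schema is written into the file along with the data.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: no generated code is needed (dynamic typing).
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " is " + record.get("age"));
            }
        }
    }
}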
Avro features
Apache Drill
The primary purpose of the Hadoop ecosystem is to process large sets of data, whether structured or
unstructured. Apache Drill is a low-latency distributed query engine that is designed to scale to
several thousands of nodes and query petabytes of data. Drill also has the ability to discard cached
data and release the space.
Features of Drill
Apache Zookeeper
Apache Zookeeper is an open-source project designed to coordinate multiple services in the Hadoop
ecosystem. Organizing and maintaining a service in a distributed environment is a complicated task.
Zookeeper solves this problem with its simple APIs and Architecture. Zookeeper allows developers to
focus on core applications instead of concentrating on a distributed environment of the application.
Features of Zookeeper
• Zookeeper is fast with workloads where reads of the data are more common than writes.
• Zookeeper is disciplined in the sense that it maintains a record of all transactions.
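Building on the description above, here is a minimal client sketch using the ZooKeeper Java API: it connects to an ensemble and stores a small piece of coordination data in a znode. The ensemble address, znode path, and value are hypothetical, and the zookeeper client library is assumed.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble and wait until the session is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of coordination data in an ephemeral znode, then read it back.
        // The ephemeral node is removed automatically when this session ends.
        String path = "/active-master";
        zk.create(path, "node-1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        byte[] data = zk.getData(path, false, null);
        System.out.println("active master: " + new String(data));

        zk.close();
    }
}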
Apache Flume
Flume collects, aggregates, and moves large sets of data from their origin into HDFS. It works in a
fault-tolerant manner. It helps in streaming data from a source into the Hadoop environment. Flume
enables its users to get data from multiple servers into Hadoop immediately.
Apache Ambari
Ambari is open-source software from the Apache Software Foundation. It makes Hadoop manageable. It
consists of software that is capable of provisioning, managing, and monitoring Apache Hadoop
clusters. Let's discuss each of these capabilities.
Hadoop cluster provisioning: It guides us with a step-by-step procedure on how to install Hadoop
services across many hosts. Ambari handles the configuration of Hadoop services across all clusters.
Hadoop Cluster management: It acts as a central management system for starting, stopping, and
reconfiguring of Hadoop services across all clusters.
Hadoop cluster monitoring: Ambari provides us with a dashboard for monitoring the health and status
of the cluster. The Ambari framework acts as an alerting system that notifies us when anything goes
wrong. For example, if a node goes down or a node runs low on disk space, it informs us through a
notification.
RECORD READER:-
The RecordReader class is used by MapReduce in the map tasks to read data from an input split
and provide each record in the form of a key/value pair for use by mappers. A task is commonly
created for each input split, and each task has a single RecordReader that’s responsible for reading
the data for that input split.
DATA OUTPUT:-
MapReduce uses a similar process for supporting output data as it does for input data. Two classes
must exist, an OutputFormat and a RecordWriter. The OutputFormat performs some basic
validation of the data sink properties, and the RecordWriter writes each reducer output to the data
sink.
OUTPUT FORMAT:-
Much like the InputFormat class, the OutputFormat class defines the
contracts that implementers must fulfill, including checking the information related to the job
output, providing a RecordWriter, and specifying an output committer, which allows writes to be
staged and then made “permanent” upon task and/or job success.
RECORD WRITER:-
You’ll use the RecordWriter to write the reducer outputs to the destination data sink.
It’s a simple class.
Hadoop's MapReduce is not restricted to processing textual data; it has support for binary formats,
too. Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files are
well suited as a format for MapReduce data because they are splittable (they have sync points so that
readers can synchronize with record boundaries from an arbitrary point in the file, such as the start
of a split), they support compression as a part of the format, and they can store arbitrary types
using a variety of serialization frameworks.
SequenceFileAsTextInputFormat – a variant of SequenceFileInputFormat that converts the sequence
file's keys and values to Text objects. The conversion is performed by calling toString() on the keys
and values. This format makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat – a variant of SequenceFileInputFormat that retrieves the sequence
file's keys and values as opaque binary objects. They are encapsulated as BytesWritable objects, and
the application is free to interpret the underlying bytes as it sees fit.
FixedLengthInputFormat – used for reading fixed-width binary records from a file, when the records
are not separated by delimiters. The record size must be set via
fixedlengthinputformat.record.length.
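As a small illustration (the job names, paths, and record length are hypothetical), this is roughly how a driver would select these input formats through the Hadoop Java API; FixedLengthInputFormat.setRecordLength sets the fixedlengthinputformat.record.length property mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;

public class InputFormatConfigExample {
    public static void main(String[] args) throws Exception {
        // A job reading a sequence file with keys/values converted to Text:
        Job seqJob = Job.getInstance(new Configuration(), "read-sequence-file-as-text");
        seqJob.setInputFormatClass(SequenceFileAsTextInputFormat.class);
        FileInputFormat.addInputPath(seqJob, new Path("/data/events.seq"));

        // A job reading fixed-width binary records of 64 bytes each:
        Configuration fixedConf = new Configuration();
        FixedLengthInputFormat.setRecordLength(fixedConf, 64);  // fixedlengthinputformat.record.length
        Job fixedJob = Job.getInstance(fixedConf, "read-fixed-length-records");
        fixedJob.setInputFormatClass(FixedLengthInputFormat.class);
        FileInputFormat.addInputPath(fixedJob, new Path("/data/records.bin"));
        // ... mapper/reducer classes and output settings would be configured here ...
    }
}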
4. Data Serialization
In some distributed systems, data and its replicas are stored in different partitions on multiple
cluster members. If data is not present on the local member, the system will retrieve that data
from another member. This requires serialization for use cases such as:
• Adding key/value objects to a map
• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic
Data formats such as JSON and XML are often used as the format for storing serialized data.
Custom binary formats are also used; they tend to be more space-efficient due to less
markup/tagging in the serialization.
Big data systems often include technologies/data that are described as “schemaless.” This
means that the managed data in these systems are not structured in a strict format, as defined
by a schema. Serialization provides several benefits in this type of environment:
Computer systems may vary in their hardware architecture, operating system, and addressing
mechanisms. Internal binary representations of data vary accordingly in every environment. Storing
and exchanging data between such varying environments requires a platform- and language-
neutral data format that all systems understand.
Once the serialized data is transmitted from the source machine to the destination machine,
the reverse process of creating objects from the byte sequence called deserialization is carried
out. Reconstructed objects are clones of the original object.
Choice of data serialization format for an application depends on factors such as data
complexity, need for human readability, speed and storage space constraints. XML, JSON,
BSON, YAML, MessagePack, and protobuf are some commonly used data serialization
formats.
Computer data is generally organized in data structures such as arrays, tables, trees, and classes.
When data structures need to be stored or transmitted to another location, such as across a network,
they are serialized.
How does data serialization and deserialization work?
For simple, linear data (a number or a string) there is nothing special to do. Serialization becomes
complex for nested data structures and object references. When objects are nested over multiple
levels, as in trees, the structure is collapsed into a series of bytes, and enough information (such
as the traversal order) is included to aid reconstruction of the original tree structure on the
destination side.
When objects with pointer references to other member variables are serialized, the referenced
objects are tracked and serialized, ensuring that the same object is not serialized more than once.
However, all nested objects must be serializable too.
Finally, the serialized data stream is persisted in a byte sequence using a standard format. ISO-
8859-1 is a popular format for 1-byte representation of English characters and numerals. UTF-8 is
the world standard for encoding multilingual, mathematical and scientific data; each character may
take 1-4 bytes of data in Unicode.
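A minimal sketch of this round trip in Java, using the built-in Serializable mechanism (the Order and Customer classes are hypothetical): a nested object graph is collapsed to a byte sequence and then reconstructed as a clone of the original.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationExample {
    // Nested structure: an Order refers to a Customer; both must be serializable.
    static class Customer implements Serializable {
        String name;
        Customer(String name) { this.name = name; }
    }
    static class Order implements Serializable {
        int id;
        Customer customer;   // the referenced object is serialized along with the Order
        Order(int id, Customer customer) { this.id = id; this.customer = customer; }
    }

    public static void main(String[] args) throws Exception {
        Order original = new Order(42, new Customer("Alice"));

        // Serialization: collapse the object graph into a byte sequence.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // Deserialization: rebuild a clone of the original object from the bytes.
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Order copy = (Order) in.readObject();
            System.out.println("order " + copy.id + " for " + copy.customer.name);
        }
    }
}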
What are the applications of Data Serialization?
Serialization allows a program to save the state of an object and recreate it when needed. Its
common uses are:
Persisting data onto files – happens mostly in language-neutral formats such as CSV or XML.
However, most languages allow objects to be serialized directly into binary using APIs such as the
Serializable interface in Java, fstream class in C++, or Pickle module in Python.
Storing data into Databases – when program objects are converted into byte streams and then
stored into DBs, such as in Java JDBC.
Transferring data through the network – such as web applications and mobile apps passing on
objects from client to server and vice versa.
Remote Method Invocation (RMI) – by passing serialized objects as parameters to functions
running on a remote machine as if invoked on a local machine. This data can be transmitted across
domains through firewalls.
Sharing data in a Distributed Object Model – when programs written in different languages
(running on diverse platforms) need to share object data over a distributed network using
frameworks such as COM and CORBA. However, SOAP, REST and other web services have
replaced these applications now.
Could you list some text-based Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
XML (Extensible Markup Language) - Nested textual format. Human-readable and editable.
Schema based validation. Used in metadata applications, web services data transfer, web
publishing.
CSV (Comma-Separated Values) - Table structure with delimiters. Human-readable textual data.
Opens as spreadsheet or plaintext. Used as plaintext Database.
JSON (JavaScript Object Notation) - Short syntax textual format with limited data types. Human-
readable. Derived from JavaScript data formats. No need of a separate parser (like XML) since
they map to JavaScript objects. Can be fetched with an XMLHttpRequest call. No direct support
for DATE data type. All data is dynamically processed. Popular format for web API parameter
passing. Mobile apps use this extensively for user interaction and database services.
YAML (YAML Ain't Markup Language) - Lightweight text format. Human-readable. Supports
comments and thus easily editable. Superset of JSON. Supports complex data types. Maps easily
to native data structures. Used in configuration settings, document headers, Apps with need for
MySQL style self-references in relational data.
Could you list some binary Data Serialization formats and their key features?
Without being exhaustive, here are some common ones:
BSON (Binary JSON) - Created and internally used by MongoDB. Binary format, not human-
readable. Deals with attribute-value pairs like JSON. Includes datetime, bytearray and other data
types not present in JSON. Used in web apps with rich media data types such as live video. Primary
use is storage, not network communication.
MessagePack - Designed for data to be transparently converted from/to JSON. Compressed binary
format, not human-readable. Supports static typing. Supports RPC. Better JSON compatibility
than BSON. Primary use is network communication, not storage. Used in apps with distributed
file systems.
protobuf (Protocol Buffers) - Created by Google. Binary message format that allows programmers
to specify a schema for the data. Also includes a set of rules and tools to define and exchange these
messages. Transparent data compression. Used in multi-platform applications due to easy
interoperability between languages. Universal RPC framework. Used in performance-critical
distributed applications.