BDA Unit 1
Architecture
Hadoop Architecture
Hadoop has a master-slave topology. In this topology, we have one master node and
multiple slave nodes. The master node's function is to assign tasks to the various slave
nodes and manage resources. The slave nodes do the actual computing. Slave nodes store
the real data, whereas the master holds the metadata.
HDFS
(Hadoop Distributed File System)
• It provides the data storage layer of Hadoop.
• HDFS splits the data unit into smaller units called blocks and stores them in
a distributed manner.
• It has two daemons: one for the master node – NameNode, and one for the slave nodes – DataNode.
(1) NameNode and DataNode
NameNode
• The daemon called NameNode runs on the master server.
• It manages the file system namespace and regulates clients' access to files – actions like
opening, closing and renaming files or directories.
DataNode
• DataNode daemon runs on slave nodes.
• Internally, a file gets split into a number of data blocks and stored on a
group of slave machines.
• These DataNodes serve read/write requests from the file system's clients.
(2) Block in HDFS
• Block is nothing but the smallest unit of storage on a computer system.
• It is the smallest contiguous storage allocated to a file.
• In Hadoop, we have a default block size of 128 MB or 256 MB.
(3) Replication Management
• To provide fault tolerance, HDFS uses a replication technique.
• It makes copies of the blocks and stores them on different DataNodes.
• The replication factor decides how many copies of the blocks get stored.
• It is 3 by default, but we can configure it to any value.
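A minimal Java sketch (assuming a running HDFS and a hypothetical file /data/sample.txt) showing how the Hadoop FileSystem API exposes a file's block size and replication factor, and lets you override the replication for a single file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS

        Path file = new Path("/data/sample.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size  : " + status.getBlockSize());    // e.g. 128 MB
        System.out.println("Replication : " + status.getReplication());  // e.g. 3

        // Override the replication factor for this one file; the cluster default stays unchanged
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}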
Block Replication (figure)
(4) Rack Awareness
• A rack contains many DataNode machines, and there are several such racks in a production cluster.
• HDFS follows a rack awareness algorithm to place the replicas of the
blocks in a distributed fashion.
• This rack awareness algorithm provides for low latency and fault tolerance.
• Suppose the replication factor configured is 3.
• The rack awareness algorithm will place the first replica on the local rack.
• It will keep the other two replicas on a different rack.
• If possible, it does not store more than two replicas on the same rack.
MapReduce
• MapReduce is the data processing layer of Hadoop.
• It is a software framework that allows you to write applications for
processing a large amount of data.
• MapReduce runs these applications in parallel on a cluster of low-end
machines.
• A MapReduce job comprises a number of map tasks and reduce tasks.
• Each task works on a part of data.
• This distributes the load across the cluster.
• The function of Map tasks is to load, parse, transform and filter data.
• Each reduce task works on the sub-set of output from the map tasks.
• Reduce task applies grouping and aggregation to this intermediate data
from the map tasks.
• The input file for the MapReduce job exists on HDFS.
• The input format decides how to split the input file into input splits.
• Input split is nothing but a byte-oriented view of the chunk of the input file.
• This input split gets loaded by the map task.
• The map task runs on the node where the relevant data is present, so the data need not
move over the network and is processed locally.
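As an illustration (a minimal sketch of the classic WordCount example, not code from these notes), the map task below parses each line of its input split and emits intermediate (word, 1) pairs, and the reduce task aggregates the counts for each key:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse one line of the input split and emit (word, 1) pairs
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: aggregate all the counts for one key (word)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}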
MapReduce phases
• Map – produces zero or multiple intermediate key-value pairs.
• Combiner – a localized reducer which groups the data in the map phase.
• Partitioner – pulls the intermediate key-value pairs from the mapper.
• Reduce – the reducer performs the reduce function once per key grouping.
• OutputFormat – the final step.
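A hedged driver sketch tying these phases together for the WordCount classes from the previous example; HashPartitioner and TextOutputFormat are Hadoop's defaults and are set explicitly here only to show where each phase is configured:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);        // Map phase
        job.setCombinerClass(IntSumReducer.class);        // Combiner: local reduce on the map side
        job.setPartitionerClass(HashPartitioner.class);   // Partitioner: routes keys to reducers (default)
        job.setReducerClass(IntSumReducer.class);         // Reduce phase
        job.setOutputFormatClass(TextOutputFormat.class); // OutputFormat: final step, writes results

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}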
Features of MapReduce
• Simplicity – MapReduce jobs are easy to run. Applications can be written
in any language such as Java, C++, and Python.
YARN
(Yet Another Resource Negotiator)
• YARN is the resource management layer of Hadoop.
• The basic principle behind YARN is to separate resource management and job
scheduling/monitoring functions into separate daemons.
• The ResourceManager arbitrates resources among all the competing applications in the
system.
• The job of the NodeManager is to monitor the resource usage by the containers and report it
to the ResourceManager.
• The resources are like CPU, memory, disk, network and so on.
i. Scheduler
The Scheduler is responsible for allocating resources to the various applications, based on
the requirements of the applications.
Features of Yarn
• Multi-tenancy
YARN allows a variety of access engines (open-source or proprietary) on the same Hadoop
data set. These access engines can be batch processing, real-time processing, iterative
processing and so on.
• Cluster Utilization
With the dynamic allocation of resources, YARN allows for good use of the cluster, as
compared to the static map-reduce rules in previous versions of Hadoop, which provided
lower utilization of the cluster.
• Scalability
Any data center's processing power keeps on expanding. YARN's ResourceManager focuses
on scheduling and copes with the ever-expanding cluster, processing petabytes of data.
• Compatibility
MapReduce programs developed for Hadoop 1.x can still run on YARN, without any
disruption to processes that already work.
Analysing Big Data using Hadoop
(Hadoop Ecosystem)
Hive
Hive is an open source data warehouse system for querying and analysing large datasets
stored in Hadoop files.
Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL.
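As a hedged illustration, Hive can be queried from Java through its JDBC driver (HiveServer2); the host, port, credentials and the sales table in this sketch are assumptions, not part of these notes:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");      // register the Hive JDBC driver
        String url = "jdbc:hive2://localhost:10000/default";   // placeholder HiveServer2 URL
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL: summarize sales per region (hypothetical table)
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}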
Pig
Apache Pig is a high-level language platform for analysing and querying huge datasets that are
stored in HDFS. Pig uses the PigLatin language, which is similar to SQL. It loads the data, applies the
required filters and dumps the data in the required format. For program execution, Pig requires a
Java runtime environment.
Features of Apache Pig:
Extensibility – For carrying out special-purpose processing, users can create their own functions.
Optimization opportunities – Pig allows the system to optimize execution automatically. This
allows the user to pay attention to semantics instead of efficiency.
Handles all kinds of data – Pig analyses both structured as well as unstructured data.
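A hedged sketch, using Pig's embedded Java API (PigServer), of running a couple of PigLatin statements; the input file, its field layout and the output path are assumptions:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFilterExample {
    public static void main(String[] args) throws Exception {
        // Run PigLatin statements from Java; LOCAL mode avoids needing a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load a colon-separated file, filter it, and store the result (paths are placeholders)
        pig.registerQuery("users = LOAD '/tmp/passwd' USING PigStorage(':') "
                + "AS (name:chararray, pw:chararray, uid:int);");
        pig.registerQuery("admins = FILTER users BY uid < 100;");
        pig.store("admins", "/tmp/admins_out");

        pig.shutdown();
    }
}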
HBase
Apache HBase is a distributed database that was designed to store structured data in tables
that could have billions of rows and millions of columns.
HBase is a scalable, distributed, NoSQL database that is built on top of HDFS.
HBase provides real-time access to read or write data in HDFS.
Components of HBase
i. HBase Master
It is not part of the actual data storage but negotiates load balancing across all
RegionServers. It maintains and monitors the Hadoop cluster and performs administration
(an interface for creating, updating and deleting tables). It controls failover, and HMaster
handles DDL operations.
ii. RegionServer
It is the worker node which handles read, write, update and delete requests from clients.
The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNode.
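A hedged sketch of real-time read/write access through the HBase Java client API; the users table and its info column family are assumptions and must already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Write one cell: row "u100", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("u100"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Read it back in real time
            Result result = table.get(new Get(Bytes.toBytes("u100")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}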
HCatalog
HCatalog is a table and storage management layer for Hadoop.
Benefits of HCatalog:
• Enables notifications of data availability.
• With the table abstraction, HCatalog frees the user from the overhead of data storage.
• Provides visibility for data cleaning and archiving tools.
Avro
• Avro is an open source project that provides data serialization and data exchange
services for Hadoop. These services can be used together or independently.
• Programs written in different languages can exchange big data using Avro.
• Using the serialization service, programs can serialize data into files or messages.
• It stores data definition and data together in one message or file making it easy for
programs to dynamically understand information stored in Avro file or message.
Features of Avro:
• Rich data structures.
• Remote procedure call.
• Compact, fast, binary data format.
• Container file, to store persistent data.
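A hedged Java sketch of Avro's serialization service: the schema (data definition) is written into the container file together with the data, so a reader can discover it dynamically; the User schema and the users.avro file are assumptions:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: the data definition travels with the data in the container file
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);   // stores schema + data together
            writer.append(user);
        }

        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            while (reader.hasNext()) {
                System.out.println(reader.next());   // schema is discovered from the file itself
            }
        }
    }
}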
Thrift
• It is a software framework for scalable cross-language services development.
Drill
• Drill is the first distributed SQL query engine that has a schema-free model.
• Drill has a specialized memory management system to eliminate garbage collection and optimize
memory allocation and usage. Drill plays well with Hive by allowing developers to reuse their
existing Hive deployment.
• Extensibility – Drill provides an extensible architecture at all layers, including query layer, query
optimization, and client API. We can extend any layer for the specific need of an organization.
• Flexibility – Drill provides a hierarchical columnar data model that can represent complex, highly
dynamic data and allow efficient processing.
• Dynamic schema discovery – Apache Drill does not require a schema or type specification for the data in
order to start the query execution process. Instead, Drill starts processing the data in units called
record batches and discovers the schema on the fly during processing.
• Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not have a
centralized metadata requirement. Drill users do not need to create and manage tables in metadata in
order to query data.
Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and
a data mining library. Once data is stored in HDFS, Mahout provides the data science tools
to automatically find meaningful patterns in those big data sets.
• Clustering – It takes the items in a particular class and organizes them into naturally
occurring groups, such that items belonging to the same group are similar to each other.
• Collaborative filtering – It mines user behavior and makes product
recommendations (e.g. Amazon recommendations)
• Classifications – It learns from existing categorization and then assigns
unclassified items to the best category.
• Frequent pattern mining – It analyzes items in a group (e.g. items in a
shopping cart or terms in query session) and then identifies which items
typically appear together.
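As a hedged illustration of collaborative filtering, a sketch using Mahout's legacy "Taste" recommender API; the ratings.csv file and its userID,itemID,rating layout are assumptions:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,rating (hypothetical file)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 product recommendations for user 42, based on similar users' behaviour
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}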
Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components like
HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop
works with relational databases such as Teradata, Netezza, Oracle and MySQL.
Features of Sqoop:
• Import sequential datasets from mainframe – Sqoop satisfies the growing need to move data
from the mainframe to HDFS.
• Import direct to ORC files – Improves compression and lightweight indexing, and improves
query performance.
• Parallel data transfer – For faster performance and optimal system utilization.
• Efficient data analysis – Improves the efficiency of data analysis by combining structured and
unstructured data in a schema-on-read data lake.
• Fast data copies – from an external system into Hadoop.
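A hedged command-line sketch of a typical Sqoop import from a relational database into HDFS with parallel mappers; the connection string, credentials, table and target directory are placeholders:

# Import one table into HDFS using 4 parallel map tasks (-P prompts for the password)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4

# Export processed results from HDFS back to the database
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table daily_summary \
  --export-dir /user/hadoop/summary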
Flume
• Flume efficiently collects, aggregates and moves a large amount of data from its origin and
sends it to HDFS.
• It is a fault-tolerant and reliable mechanism.
• This Hadoop Ecosystem component allows the data to flow from the source into the
Hadoop environment.
• It uses a simple extensible data model that allows for the online analytic application.
• Using Flume, we can get the data from multiple servers immediately into Hadoop.
Ambari
Ambari is a web-based tool for provisioning, managing and monitoring Hadoop clusters.
Features of Ambari:
Simplified installation, configuration, and management – Ambari easily and efficiently creates and
manages clusters at scale.
Centralized security setup – Ambari reduces the complexity of administering and configuring cluster
security across the entire platform.
Highly extensible and customizable – Ambari is highly extensible for bringing custom services under
management.
Full visibility into cluster health – Ambari ensures that the cluster is healthy and available with a
holistic approach to monitoring.
Zookeeper
Zookeeper is a centralized service for maintaining configuration information, naming, and providing
distributed synchronization across the cluster.
Features of Zookeeper:
Fast – Zookeeper is fast with workloads where reads to data are more common than writes.
The ideal read/write ratio is 10:1.
Ordered – Zookeeper maintains a record of all transactions.
Oozie
Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie
framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural
centre, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
Oozie Workflow – It stores and runs workflows composed of Hadoop jobs, e.g.
MapReduce, Pig, Hive.
Oozie Coordinator – It runs workflow jobs based on predefined schedules and
availability of data.
Hadoop Cluster
• A collection of nodes is what we call the cluster.
• A node is a point of intersection/connection within a network, i.e. a server.
Hadoop clusters have two types of machines: Master and Slave.
• Master: HDFS NameNode, YARN ResourceManager.
• Slaves: HDFS DataNodes, YARN NodeManager.
It is recommended to separate the master and slave nodes, because:
• Task/application workloads on the slave nodes should be isolated from the
masters.
• Slave nodes are frequently decommissioned for maintenance.
Advantages of a Hadoop Cluster
• The cluster helps in increasing the speed of the analysis process.
• It is inexpensive.
• These clusters are failure resilient.
• One more benefit of Hadoop clusters is scalability: Hadoop offers scalable and flexible
data storage. Scalability means we can scale a Hadoop cluster by adding new servers to
the cluster if needed.
• Hadoop Clusters deal with data from many sources and formats in a very
quick, easy manner.
• It is possible to deploy Hadoop using a single-node installation, for
evaluation purposes.