Big Data Analytics - Unit 4
Hadoop
Hadoop is an open-source framework that enables
processing of large data sets stored in a distributed
manner across clusters of computers.
Data is initially divided into directories and files. Files are divided
into uniform-sized blocks of 64 MB or 128 MB (128 MB is preferred and is
the default in Hadoop 2.x and later).
These blocks are then distributed across the cluster nodes for
further processing.
HDFS, sitting on top of the local file system of each node, supervises the
processing.
Blocks are replicated to handle hardware failure (a short sketch after this
list shows how to inspect a file's block size and replication factor from Java).
Hadoop checks that the code was executed successfully.
It performs the sort that takes place between the map and reduce stages.
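As an illustration (a sketch, not part of the original notes): the block size and replication factor of a file already stored in HDFS can be read from Java through Hadoop's FileSystem API. The path /user/demo/input.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used only for illustration
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));

        // Block size (128 MB by default in Hadoop 2.x) and replication factor
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());
    }
}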
Advantages of Hadoop
Fast: In HDFS, data is distributed over the cluster and mapped,
which helps in faster retrieval. Even the tools used to process the data are
often on the same servers, thus reducing processing time. Hadoop is
able to process terabytes of data in minutes and petabytes in hours.
MapReduce
MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large
amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-
tolerant manner. MapReduce programs run on Hadoop, which is
an Apache open-source framework.
Hadoop Architecture
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the
Google File System (GFS) and provides a distributed file
system that is designed to run on commodity hardware.
It is highly fault-tolerant and is designed to be deployed on
low-cost hardware.
It provides high throughput access to application data and is
suitable for applications having large datasets.
Apart from the above-mentioned two core components,
Hadoop framework also includes the following two modules −
Hadoop Common − These are Java libraries and utilities
required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and
cluster resource management.
Hadoop Ecosystem
The ecosystem and stack
The Hadoop stack is the framework that provides the functionality to process
huge amounts of data in a distributed manner.
Depending on the requirement or use case, we choose a different Hadoop
stack.
For the batch process, we will use the HDP (Hortonworks Data Platform )
stack. (handles data at rest)
For the live data processing, we will use the HDF (Hortonworks Data Flow)
stack. (handles data in motion)
In the HDP stack, we get HDFS, YARN, Oozie, MapReduce, Spark,
Atlas, Ranger, Zeppelin, Hive, HBase, etc. In the HDF stack, we get
Kafka, NiFi, the schema registry, and so on.
Syntax:
As such, there is no specific syntax for the Hadoop stack. Depending on
the requirement, we use the necessary components of the HDP
or HDF environment.
Components of Hadoop
HDFS: Hadoop Distributed File System
It is the most important component of Hadoop Ecosystem.
HDFS is the primary storage system of Hadoop.
The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable and cost-efficient data
storage for Big Data.
HDFS is a distributed file system that runs on commodity hardware.
HDFS ships with a default configuration that is sufficient for many
installations.
Large clusters, however, usually need additional configuration.
HDFS Components:
Name Node stores the file system metadata (the directory tree and the mapping
of files to blocks) and directs the Data Nodes.
Data Node performs operations like block replica creation, deletion, and
replication according to the instructions of the Name Node.
Data Node manages the data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
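A minimal sketch (assuming a running HDFS cluster and the Hadoop Java client on the classpath) of how an application writes and reads a file through the HDFS FileSystem API: the Name Node resolves the path to blocks while Data Nodes serve the actual bytes. The path below is hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write: the client streams data; HDFS splits it into blocks and replicates them
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the Name Node supplies block locations, Data Nodes supply the data
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}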
Map-Reduce
MapReduce is the data processing component of Hadoop.
MapReduce consists of two distinct tasks – Map and Reduce.
As the name MapReduce suggests, the reducer phase takes
place after the mapper phase has been completed.
So, the first is the map job, where a block of data is read and
processed to produce key-value pairs as intermediate outputs.
The output of a Mapper or map job (key-value pairs) is input
to the Reducer.
The reducer receives the key-value pair from multiple map
jobs.
Then, the reducer aggregates those intermediate data tuples
(intermediate key-value pairs) into a smaller set of tuples or
key-value pairs, which is the final output.
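A compact word-count sketch illustrating the key-value flow described above: the Mapper emits (word, 1) pairs and the Reducer sums the counts per word. Class names are illustrative, not from the notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: read a block of text and emit (word, 1) intermediate pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receive all values for a key (after the shuffle and sort) and aggregate
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class would configure a Job with these two classes and the HDFS input and output paths; the sort between the map and reduce phases is handled by the framework itself.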
MapReduce
MapReduce is a programming framework
that allows us to perform distributed and
parallel processing on large data sets in a
distributed environment.
Yarn
It is like the Operating System of Hadoop.
It mainly monitors and manages the resources.
There are two main components –
a. Node Manager: It monitors the resource usage
(CPU, memory, etc.) of the local node and
reports it to the Resource Manager.
b. Resource Manager: It is responsible for tracking
the resources in the cluster and scheduling tasks
such as map-reduce jobs.
YARN also includes an Application Master and a
Scheduler.
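A hedged sketch of how a client program can ask the Resource Manager, through the YARN client API, for the applications it is currently tracking; it assumes yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the Resource Manager configured in yarn-site.xml
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // The Resource Manager tracks every application (e.g. map-reduce jobs) in the cluster
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}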
Hive
It is a data warehouse project built on top of
Hadoop which provides data query and
analysis.
Queries written in Hive Query Language (HQL) are
translated into map-reduce jobs.
Main parts of the Hive are:
Meta Store – Stores metadata
Driver – Manages the lifecycle of HQL
statement
Query Compiler – Compiles HQL into a directed
acyclic graph (DAG)
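A minimal sketch of submitting an HQL query from Java over Hive's JDBC interface. It assumes HiveServer2 is running on localhost:10000 and the hive-jdbc driver is on the classpath; the table name is hypothetical. Hive compiles the query into map-reduce jobs before executing it.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, port and table name are illustrative only
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Hive translates this HQL statement into map-reduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}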
Pig
Pig is a SQL-like language used to query the data stored in HDFS.
Features of Pig are:
Extensibility
Optimization opportunities
Handles all kinds of data
The LOAD command loads the data.
In the backend, the Pig Latin compiler converts the script into a sequence of
map-reduce jobs.
We can then perform various functions like join, sort, group, etc.
The output can be dumped to the screen or stored in an HDFS file.
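A sketch of the same load, transform and store flow driven from Java through Pig's embedded PigServer API; the file paths and field layout are hypothetical. The Pig Latin compiler turns the registered statements into map-reduce jobs.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlow {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode runs the compiled jobs on the Hadoop cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // LOAD the data from HDFS (hypothetical path and schema)
        pig.registerQuery("logs = LOAD '/user/demo/logs' AS (user:chararray, bytes:long);");

        // GROUP and aggregate, in the spirit of the join/sort/group functions mentioned above
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // STORE the result back into an HDFS file (hypothetical output path)
        pig.store("totals", "/user/demo/output");

        pig.shutdown();
    }
}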
HBase
It is a NoSQL database built on top of HDFS.
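A brief hedged sketch of how a client writes and reads a cell through the HBase Java API; the table name, column family and row key are hypothetical. HBase persists its data files on HDFS underneath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write one cell: row key "u1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}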
Ambari gives: