Big Data Analytics – Unit 4

Hadoop
It is an open-source framework that enables processing of large data sets stored in a distributed manner across clusters of computers.

Hadoop framework applications work in an environment that provides distributed storage and computation across clusters of computers.
Hadoop Framework
Apache Hadoop is the most important framework for working with Big Data.
Hadoop's biggest strength is its scalability.
It scales seamlessly from a single node to thousands of nodes.
History of Hadoop
How does Hadoop work?
 It is quite expensive to build bigger servers with heavy configurations to handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput.

 Hadoop runs code across a cluster of computers.


 This process includes the following core tasks that Hadoop
performs −

 Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (128 MB by default in Hadoop 2 and later); a short sketch of inspecting block placement follows this list.
 These files are then distributed across various cluster nodes for
further processing.
 HDFS, being on top of the local file system, supervises the
processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
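A minimal sketch, in Java, of how an application can ask the NameNode where the blocks of a file have been placed, using the HDFS FileSystem API. The NameNode address and file path are hypothetical and would have to match a real deployment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/input/sample.txt");   // hypothetical file

    // Ask the NameNode which blocks make up the file and where their replicas live.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}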
Advantages of Hadoop
 Fast: In HDFS, data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that process the data are often on the same servers, reducing processing time. Hadoop can process terabytes of data in minutes and petabytes in hours.

 Scalable: A Hadoop cluster can be extended simply by adding nodes to the cluster.

 Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is very cost-effective compared to a traditional relational database management system.

 Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other failure happens, Hadoop uses another copy of the data. Normally data is replicated three times, but the replication factor is configurable (a small configuration sketch follows this list).
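A small sketch of setting a per-file replication factor through the Java FileSystem API, assuming the same hypothetical cluster address as above; the cluster-wide default normally lives in hdfs-site.xml as dfs.replication.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical address
    FileSystem fs = FileSystem.get(conf);

    // Request 3 replicas for an existing file (the usual default).
    fs.setReplication(new Path("/data/input/sample.txt"), (short) 3);
    fs.close();
  }
}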
Modules of Hadoop
Hadoop Common: Includes the common utilities that support the other Hadoop modules.
HDFS: The Hadoop Distributed File System provides high-throughput access to application data. Files are broken into blocks and stored on nodes across the distributed architecture.
Hadoop YARN: This technology is used for job scheduling and efficient management of cluster resources.
MapReduce: This is a highly efficient methodology for parallel processing of huge volumes of data. The map task takes input data and converts it into a data set that can be computed as key-value pairs.
Hadoop Architecture
 Hadoop has two major layers namely −
 Processing/Computation layer (MapReduce)
 Storage layer (Hadoop Distributed File System).

 MapReduce
 MapReduce is a parallel programming model, devised at Google, for writing distributed applications that efficiently process large amounts of data (multi-terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.
Hadoop Architecture
 Hadoop Distributed File System
 The Hadoop Distributed File System (HDFS) is based on the
Google File System (GFS) and provides a distributed file
system that is designed to run on commodity hardware.
 It is highly fault-tolerant and is designed to be deployed on
low-cost hardware.
 It provides high throughput access to application data and is
suitable for applications having large datasets.
 Apart from the above-mentioned two core components,
Hadoop framework also includes the following two modules −
 Hadoop Common − These are Java libraries and utilities
required by other Hadoop modules.
 Hadoop YARN − This is a framework for job scheduling and
cluster resource management.
Hadoop Ecosystem
The ecosystem and stack
 The Hadoop stack is a framework that provides the functionality to process huge amounts of data, or data sets, in a distributed manner.
 Depending on the requirement or use case, we choose a different Hadoop stack.
 For batch processing, we use the HDP (Hortonworks Data Platform) stack. (handles data at rest)
 For live data processing, we use the HDF (Hortonworks Data Flow) stack. (handles data in motion)
 The HDP stack includes HDFS, YARN, Oozie, MapReduce, Spark, Atlas, Ranger, Zeppelin, Hive, HBase, etc. The HDF stack includes Kafka, NiFi, Schema Registry, and related tools.

 Syntax:

 As such, there is no specific syntax for the Hadoop stack. Depending on the requirement, we use the necessary components of the HDP or HDF environment.
Components of Hadoop
 HDFS: Hadoop Distributed File System
 It is the most important component of the Hadoop ecosystem.
 HDFS is the primary storage system of Hadoop.
 The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
 HDFS is a distributed file system that runs on commodity hardware.
 HDFS ships with a default configuration that suits many installations.
 Large clusters usually require additional configuration.

 HDFS Components:

 There are two major components of Hadoop HDFS –
 NameNode and
 DataNode.
HDFS
i. NameNode
It is also known as the Master node.
The NameNode does not store the actual data or dataset.
The NameNode stores metadata: the number of blocks, their locations, which rack and which DataNode the data is stored on, and other details. The namespace it maintains consists of files and directories.
Tasks of HDFS NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing and opening files and directories.
HDFS
 ii. DataNode
 It is also known as the Slave node. It is responsible for storing the actual data in HDFS.
 The DataNode performs read and write operations as per the requests of clients.
 Each block replica on a DataNode consists of two files on the local file system.
 The first file holds the data and the second records the block's metadata.
 At startup, each DataNode connects to its NameNode and performs a handshake.
 The handshake verifies the namespace ID and the software version of the DataNode.
 If a mismatch is found, the DataNode shuts down automatically.

 Tasks of HDFS Data Node

 The DataNode performs operations such as block replica creation, deletion, and replication according to the instructions of the NameNode.
 Data Node manages data storage of the system.
 This was all about HDFS as a Hadoop Ecosystem component.
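A minimal sketch of writing and reading a file through the Java FileSystem API: the client asks the NameNode for block locations and streams the bytes to or from DataNodes. The NameNode address and path are hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical NameNode
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/greeting.txt");   // hypothetical path

    // Write: the NameNode allocates blocks, the client streams to DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode, bytes from the DataNodes.
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}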
Map-Reduce
 MapReduce is the data processing component of Hadoop.
 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes
place after the mapper phase has been completed.
 So, the first is the map job, where a block of data is read and
processed to produce key-value pairs as intermediate outputs.
 The output of a mapper, or map job (key-value pairs), is the input to the reducer.
 The reducer receives key-value pairs from multiple map jobs.
 The reducer then aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which forms the final output (a word-count sketch follows below).
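A compact sketch of the classic word-count job, illustrating the map and reduce phases described above; the class name and input/output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: read a line, emit (word, 1) for every word in it.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: receive (word, [1, 1, ...]) and emit (word, total count).
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}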
MapReduce
MapReduce is a programming framework that allows us to perform distributed, parallel processing on large data sets across a cluster of machines.
Yarn
It is like the operating system of Hadoop.
It mainly monitors and manages the cluster's resources.
There are two main components –
a. NodeManager: It monitors the resource usage (CPU, memory, etc.) of its local node and reports it to the ResourceManager.
b. ResourceManager: It is responsible for tracking the resources in the cluster and scheduling tasks such as MapReduce jobs.
YARN also includes the ApplicationMaster and the Scheduler.
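A hedged sketch of querying the ResourceManager for the cluster's NodeManagers with the YARN client API; it assumes a recent Hadoop 2.8+/3.x release and that yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
    yarn.start();

    // Each report describes one NodeManager and the resources it offers.
    List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.printf("%s memoryMB=%d vcores=%d%n",
          node.getNodeId(),
          node.getCapability().getMemorySize(),
          node.getCapability().getVirtualCores());
    }
    yarn.stop();
  }
}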
Hive
It is a data warehouse project built on top of Hadoop which provides data query and analysis.
Queries written in Hive Query Language (HQL) are translated into map-reduce jobs.
The main parts of Hive are:
Metastore – stores metadata
Driver – manages the lifecycle of an HQL statement
Query Compiler – compiles HQL into a directed acyclic graph (DAG) of tasks
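A hedged sketch of submitting an HQL query to HiveServer2 over JDBC from Java; the host, database, credentials, and table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical HiveServer2 endpoint and database.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Hive compiles this HQL into MapReduce (or Tez/Spark) jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT department, COUNT(*) FROM employees GROUP BY department");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}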

Pig
 Pig is a SQL-like language used to query the data stored in HDFS.
 Features of Pig are:
 Extensibility
 Optimization opportunities
 Handles all kinds of data
 The LOAD command loads the data.
 At the back end, the Pig Latin compiler converts the script into a sequence of map-reduce jobs.
 We can perform various operations such as join, sort, group, etc.
 The output can be dumped to the screen or stored in an HDFS file (see the sketch after this list).
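A hedged sketch of running Pig Latin statements from Java through PigServer; the input file, schema, and output path are hypothetical, and the same statements could be typed into the Grunt shell.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // "mapreduce" mode compiles the script into a sequence of MapReduce jobs.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // LOAD, GROUP and STORE on a hypothetical comma-separated employee file in HDFS.
    pig.registerQuery("emp = LOAD '/data/employees.csv' USING PigStorage(',') "
        + "AS (name:chararray, dept:chararray, salary:int);");
    pig.registerQuery("by_dept = GROUP emp BY dept;");
    pig.registerQuery("avg_sal = FOREACH by_dept GENERATE group, AVG(emp.salary);");

    pig.store("avg_sal", "/data/avg_salary_by_dept");
    pig.shutdown();
  }
}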
HBase
It is a NoSQL database built on top of HDFS.
It is an open-source, non-relational, distributed database.
It provides real-time read/write access to large datasets.
It consists of the components
A. HBase Master
B. RegionServer
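A hedged sketch of a real-time write and read with the HBase Java client; the table, column family, row key, and ZooKeeper quorum are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");  // hypothetical quorum

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "user1", column family "info", qualifier "city".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
      table.put(put);

      // Read it back immediately (real-time random access, unlike batch MapReduce).
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
    }
  }
}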
MAHOUT
Mahout provides a platform for creating
machine learning applications which are
scalable.
Mahout performs collaborative filtering,
clustering and classification.
Collaborative filtering – based on user
behavior patterns
Clustering – grouping similar types of data
Classification – categorizing data into sub-categories
Frequent Itemset Mining – generally gives suggestions for items that are bought together.
Zookeeper
It coordinates between the various services in the Hadoop ecosystem.
Its features are speed, organization, simplicity and reliability.
ZooKeeper solves the problem of deadlock via synchronization.
It also solves race conditions, which occur when a machine tries to perform two or more operations at a time; this is solved using serialization (see the sketch below).
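A hedged sketch of the coordination primitive behind such locking: each client creates an ephemeral sequential znode, and the client holding the lowest sequence number owns the lock. The server address and znode paths are hypothetical, and the /locks parent znode is assumed to exist already.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // The watcher fires on session events; the first is normally SyncConnected.
    ZooKeeper zk = new ZooKeeper("zookeeper:2181", 5000,
        event -> connected.countDown());
    connected.await();

    // Ephemeral + sequential: the znode disappears if this client dies,
    // and ZooKeeper appends a monotonically increasing sequence number.
    String node = zk.create("/locks/job-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("created " + node);

    // The client whose node has the smallest sequence number holds the lock;
    // the others would watch the node just below theirs (omitted here).
    zk.close();
  }
}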
Oozie
Apache Oozie is a server-based workflow scheduling
system to manage Hadoop jobs.
There are three basic types of Oozie jobs:
 Workflow – stores and runs a workflow of Hadoop jobs.
 Coordinator – runs jobs based on predefined schedules and the availability of data.
 Bundle – a package of many coordinator and workflow jobs.

 There are two types of nodes in Oozie –
 Action Node and
 Control Flow Node.
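A hedged sketch of submitting a workflow through the Oozie Java client; the Oozie URL, application path, and properties are hypothetical, and the workflow.xml is assumed to already exist in HDFS.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Oozie server endpoint.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties: where the workflow.xml lives and values it references.
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/apps/wordcount-wf");
    props.setProperty("nameNode", "hdfs://namenode:9000");
    props.setProperty("jobTracker", "resourcemanager:8032");

    String jobId = oozie.run(props);      // submit and start the workflow
    WorkflowJob job = oozie.getJobInfo(jobId);
    System.out.println(jobId + " -> " + job.getStatus());
  }
}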


Sqoop
Sqoop imports data from external sources into the Hadoop ecosystem.
It also exports data from Hadoop back to external sources.
It works mainly with structured data stores such as relational databases (a sketch follows below).
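A hedged sketch of an import: the same arguments that would be passed to the sqoop import command line, invoked here through Sqoop's Java entry point. The database URL, credentials, table, and target directory are hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
  public static void main(String[] args) {
    // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",   // hypothetical RDBMS
        "--username", "etl", "--password", "secret",
        "--table", "orders",
        "--target-dir", "/data/sqoop/orders",
        "--num-mappers", "4"
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}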
Flume
This is a service which helps ingest structured and semi-structured data into HDFS.
It works on the principle of distributed processing.
It helps in the aggregation and movement of huge amounts of data.
The three components of Flume are:
Source (accepts data from the incoming stream and places it in the channel)
Sink (collects the data from the channel and writes it to HDFS)
Channel (a medium of temporary storage between the source of data and the persistent storage of HDFS)
Ambari
It is responsible for provisioning, managing, monitoring and securing a Hadoop cluster.

Ambari provides:
Hadoop cluster provisioning
Hadoop cluster management
Hadoop cluster monitoring

Design of HDFS
HDFS is designed for:
1. Very large files
2. Streaming data access
3. Commodity hardware
It is not a good fit for:
4. Low-latency data access
5. Lots of small files
Hadoop Distributed File Systems
HDFS Architecture
Examples of HDFS
HDFS Data Replication
