Big Data Analytics - Unit 4
Hadoop
Hadoop is an open-source framework that enables
processing of large data sets stored in a distributed
manner across clusters of computers.
Data is initially divided into directories and files. Files are divided
into uniform-sized blocks of 64 MB or 128 MB (128 MB is preferred and is
the default in Hadoop 2.x and later).
These blocks are then distributed across the cluster nodes for
further processing.
HDFS, sitting on top of the local file system of each node, supervises the
processing.
Blocks are replicated to handle hardware failure (a short sketch after this
list shows how to inspect a file's block size and replication factor from Java).
Hadoop checks that the code was executed successfully.
It performs the sort that takes place between the map and reduce stages.
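As an illustration (a sketch, not part of the original notes): the block size and replication factor of a file already stored in HDFS can be read from Java through Hadoop's FileSystem API. The path /user/demo/input.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used only for illustration
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));

        // Block size (128 MB by default in Hadoop 2.x) and replication factor
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("Replication factor: " + status.getReplication());
    }
}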
Advantages of Hadoop
Fast: In HDFS, data is distributed over the cluster and mapped,
which helps in faster retrieval. Even the tools used to process the data are
often on the same servers, thus reducing processing time. Hadoop is
able to process terabytes of data in minutes and petabytes in hours.
MapReduce
MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large
amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-
tolerant manner. MapReduce programs run on Hadoop, which is
an Apache open-source framework.
Hadoop Architecture
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the
Google File System (GFS) and provides a distributed file
system that is designed to run on commodity hardware.
It is highly fault-tolerant and is designed to be deployed on
low-cost hardware.
It provides high throughput access to application data and is
suitable for applications having large datasets.
Apart from the above-mentioned two core components,
Hadoop framework also includes the following two modules −
Hadoop Common − These are Java libraries and utilities
required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and
cluster resource management.
Hadoop Ecosystem
The ecosystem and stack
The Hadoop stack is the framework that provides the functionality to process
huge amounts of data in a distributed manner.
Depending on the requirement or use case, we choose a different Hadoop
stack.
For the batch process, we will use the HDP (Hortonworks Data Platform )
stack. (handles data at rest)
For the live data processing, we will use the HDF (Hortonworks Data Flow)
stack. (handles data in motion)
In the HDP stack, we get HDFS, YARN, Oozie, MapReduce, Spark,
Atlas, Ranger, Zeppelin, Hive, HBase, etc. In the HDF stack, we get
Kafka, NiFi, the schema registry, and so on.
Syntax:
As such, there is no specific syntax for the Hadoop stack. Depending on
the requirement, we use the necessary components of the HDP
or HDF environment.
Components of Hadoop
HDFS: Hadoop Distributed File System
It is the most important component of Hadoop Ecosystem.
HDFS is the primary storage system of Hadoop.
The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable and cost-efficient data
storage for Big Data.
HDFS is a distributed file system that runs on commodity hardware.
HDFS ships with a default configuration that is sufficient for many
installations.
Large clusters, however, usually need additional configuration.
HDFS Components:
Name Node stores the file system metadata (the directory tree and the mapping
of files to blocks) and directs the Data Nodes.
Data Node performs operations like block replica creation, deletion, and
replication according to the instructions of the Name Node.
Data Node manages the data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
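A minimal sketch (assuming a running HDFS cluster and the Hadoop Java client on the classpath) of how an application writes and reads a file through the HDFS FileSystem API: the Name Node resolves the path to blocks while Data Nodes serve the actual bytes. The path below is hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write: the client streams data; HDFS splits it into blocks and replicates them
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the Name Node supplies block locations, Data Nodes supply the data
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}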
Map-Reduce
MapReduce is the data processing component of Hadoop.
MapReduce consists of two distinct tasks – Map and Reduce.
As the name MapReduce suggests, the reducer phase takes
place after the mapper phase has been completed.
So, the first is the map job, where a block of data is read and
processed to produce key-value pairs as intermediate outputs.
The output of a Mapper or map job (key-value pairs) is input
to the Reducer.
The reducer receives the key-value pair from multiple map
jobs.
Then, the reducer aggregates those intermediate data tuples
(intermediate key-value pairs) into a smaller set of tuples or
key-value pairs, which is the final output.
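A compact word-count sketch illustrating the key-value flow described above: the Mapper emits (word, 1) pairs and the Reducer sums the counts per word. Class names are illustrative, not from the notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: read a block of text and emit (word, 1) intermediate pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: receive all values for a key (after the shuffle and sort) and aggregate
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class would configure a Job with these two classes and the HDFS input and output paths; the sort between the map and reduce phases is handled by the framework itself.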
MapReduce
MapReduce is a programming framework
that allows us to perform distributed and
parallel processing on large data sets in a
distributed environment.
Yarn
It is like the Operating System of Hadoop.
It mainly monitors and manages the resources.
There are two main components –
a. Node Manager: It monitors the resource usage
(CPU, memory, etc.) of the local node and
reports it to the Resource Manager.
b. Resource Manager: It is responsible for tracking
the resources in the cluster and scheduling tasks
such as map-reduce jobs.
YARN also includes an Application Master and a
Scheduler.
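A hedged sketch of how a client program can ask the Resource Manager, through the YARN client API, for the applications it is currently tracking; it assumes yarn-site.xml is on the classpath.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Connects to the Resource Manager configured in yarn-site.xml
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // The Resource Manager tracks every application (e.g. map-reduce jobs) in the cluster
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}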
Hive
It is a data warehouse project built on top of
Hadoop which provides data query and
analysis.
Queries written in Hive Query Language (HQL) are
translated into map-reduce jobs.
Main parts of the Hive are:
Meta Store – Stores metadata
Driver – Manages the lifecycle of HQL
statement
Query Compiler – Compiles HQL into a directed
acyclic graph (DAG)
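A minimal sketch of submitting an HQL query from Java over Hive's JDBC interface. It assumes HiveServer2 is running on localhost:10000 and the hive-jdbc driver is on the classpath; the table name is hypothetical. Hive compiles the query into map-reduce jobs before executing it.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, port and table name are illustrative only
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Hive translates this HQL statement into map-reduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}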
Pig
Pig is a SQL-like language used to query the data stored in HDFS.
Features of Pig are:
Extensibility
Optimization opportunities
Handles all kinds of data
The LOAD command loads the data.
In the backend, the Pig Latin compiler converts the script into a sequence of
map-reduce jobs.
We can then perform various functions like join, sort, group, etc.
The output can be dumped to the screen or stored in an HDFS file.
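A sketch of the same load, transform and store flow driven from Java through Pig's embedded PigServer API; the file paths and field layout are hypothetical. The Pig Latin compiler turns the registered statements into map-reduce jobs.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigFlow {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode runs the compiled jobs on the Hadoop cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // LOAD the data from HDFS (hypothetical path and schema)
        pig.registerQuery("logs = LOAD '/user/demo/logs' AS (user:chararray, bytes:long);");

        // GROUP and aggregate, in the spirit of the join/sort/group functions mentioned above
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // STORE the result back into an HDFS file (hypothetical output path)
        pig.store("totals", "/user/demo/output");

        pig.shutdown();
    }
}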
HBase
It is a NoSQL database built on top of HDFS.
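A brief hedged sketch of how a client writes and reads a cell through the HBase Java API; the table name, column family and row key are hypothetical. HBase persists its data files on HDFS underneath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write one cell: row key "u1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}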
Ambari gives: