
UNIT II

NoSQL

NoSQL is a non-relational DBMS that does not require a fixed schema, avoids joins, and is easy to
scale. NoSQL databases are used for distributed data stores with humongous data storage needs.
NoSQL is used for Big Data and real-time web apps, for example by companies like Twitter,
Facebook, and Google that collect terabytes of user data every single day.

SQL

Structured Query Language (SQL), pronounced "S-Q-L" or sometimes "See-Quel", is the
standard language for dealing with relational databases. A relational database defines
relationships in the form of tables.

SQL programming can be effectively used to insert, search, update, and delete database records.
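
As a minimal sketch of these four operations, the following Python snippet uses the standard-library sqlite3 module; the database, table name, and columns are hypothetical and chosen only for illustration.

import sqlite3

# Hypothetical in-memory database and table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert a record.
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Asha", "Pune"))

# Search for records.
rows = conn.execute("SELECT id, name, city FROM users WHERE city = ?", ("Pune",)).fetchall()
print(rows)

# Update a record.
conn.execute("UPDATE users SET city = ? WHERE name = ?", ("Mumbai", "Asha"))

# Delete a record.
conn.execute("DELETE FROM users WHERE name = ?", ("Asha",))
conn.commit()
conn.close()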

Comparison of SQL and NoSQL

Definition
  SQL: SQL databases are primarily called RDBMS or relational databases.
  NoSQL: NoSQL databases are primarily called non-relational or distributed databases.

Designed for
  SQL: Traditional RDBMS uses SQL syntax and queries to analyze data and get further insights. They are used for OLAP systems.
  NoSQL: NoSQL database systems consist of various kinds of database technologies. These databases were developed in response to the demands of modern application development.

Query language
  SQL: Structured Query Language (SQL).
  NoSQL: No declarative query language.

Type
  SQL: SQL databases are table-based.
  NoSQL: NoSQL databases can be document-based, key-value pairs, or graph databases.

Schema
  SQL: SQL databases have a predefined schema.
  NoSQL: NoSQL databases use a dynamic schema for unstructured data.

Ability to scale
  SQL: SQL databases are vertically scalable.
  NoSQL: NoSQL databases are horizontally scalable.

Examples
  SQL: Oracle, Postgres, and MS SQL.
  NoSQL: MongoDB, Redis, Neo4j, Cassandra, HBase.

Best suited for
  SQL: An ideal choice for complex, query-intensive environments.
  NoSQL: Not a good fit for complex queries.

Hierarchical data storage
  SQL: SQL databases are not suitable for hierarchical data storage.
  NoSQL: More suitable for hierarchical data stores, since they support the key-value pair method.

Variations
  SQL: One type with minor variations.
  NoSQL: Many different types, including key-value stores, document databases, and graph databases.

Development year
  SQL: Developed in the 1970s to deal with the problems of flat-file storage.
  NoSQL: Developed in the late 2000s to overcome the issues and limitations of SQL databases.

Open source
  SQL: A mix of open source (e.g., Postgres, MySQL) and commercial (e.g., Oracle Database).
  NoSQL: Open source.

Consistency
  SQL: Should be configured for strong consistency.
  NoSQL: Depends on the DBMS; some offer strong consistency (e.g., MongoDB), whereas others offer only eventual consistency (e.g., Cassandra).

Best used for
  SQL: An RDBMS is the right option for solving ACID problems.
  NoSQL: NoSQL is best used for solving data availability problems.

Importance
  SQL: Use when data validity is of prime importance.
  NoSQL: Use when it is more important to have fast data than correct data.

Best option
  SQL: When you need to support dynamic queries.
  NoSQL: When you need to scale based on changing requirements.

Hardware
  SQL: Specialized DB hardware (Oracle Exadata, etc.).
  NoSQL: Commodity hardware.

Network
  SQL: Highly available network (InfiniBand, FabricPath, etc.).
  NoSQL: Commodity network (Ethernet, etc.).

Storage type
  SQL: Highly available storage (SAN, RAID, etc.).
  NoSQL: Commodity drive storage (standard HDDs, JBOD).

Best features
  SQL: Cross-platform support, secure and free tools.
  NoSQL: Easy to use, high performance, and flexible.

Top companies using
  SQL: Hootsuite, CircleCI, Gauges.
  NoSQL: Airbnb, Uber, Kickstarter.

Average salary
  SQL: The average salary for a professional SQL developer is about $84,328 per year in the U.S.A.
  NoSQL: The average salary for a NoSQL developer is approximately $72,174 per year.

ACID vs. BASE model
  SQL: ACID (Atomicity, Consistency, Isolation, Durability) is the standard for RDBMS.
  NoSQL: BASE (Basically Available, Soft state, Eventually consistent) is the model of many NoSQL systems.
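
To make the schema row of this comparison concrete, here is a small, illustrative Python sketch: the SQL side requires a predefined table structure, while the NoSQL (document-style) side stores records as free-form documents. The records and field names are hypothetical.

import json
import sqlite3

# SQL: a predefined schema; every row must fit the declared columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, user TEXT, text TEXT)")
conn.execute("INSERT INTO tweets (user, text) VALUES (?, ?)", ("zeba", "hello"))

# Document style (as in MongoDB or CouchDB): each record is a self-describing
# document, and different records may carry different fields.
documents = [
    {"user": "zeba", "text": "hello"},
    {"user": "ravi", "text": "hi", "hashtags": ["bigdata"], "geo": {"lat": 18.5, "lon": 73.8}},
]
print(json.dumps(documents, indent=2))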


RDBMS Versus Hadoop

Distributed Computing Challenges

Designing a distributed system is not easy or straightforward. A number
of challenges need to be overcome in order to get the ideal system. The major challenges in
distributed systems are listed below:

1. Heterogeneity:
The Internet enables users to access services and run applications over a heterogeneous collection
of computers and networks. Heterogeneity (that is, variety and difference) applies to all of the
following:

o Hardware devices: computers, tablets, mobile phones, embedded devices, etc.
o Operating systems: MS Windows, Linux, macOS, Unix, etc.
o Network: Local network, the Internet, wireless network, satellite links, etc.
o Programming languages: Java, C/C++, Python, PHP, etc.
o Different roles of software developers, designers, system managers
Different programming languages use different representations for characters and data structures
such as arrays and records. These differences must be addressed if programs written in different
languages are to be able to communicate with one another. Programs written by different
developers cannot communicate with one another unless they use common standards, for example,
for network communication and the
representation of primitive data items and data structures in messages. For this to happen, standards
need to be agreed upon and adopted, as the Internet protocols have been.
Middleware: The term middleware applies to a software layer that provides a programming
abstraction as well as masking the heterogeneity of the underlying networks, hardware, operating
systems and programming languages. Most middleware is implemented over the Internet
protocols, which themselves mask the differences of the underlying networks, but all middleware
deals with the differences in operating systems and hardware.
Heterogeneity and mobile code: The term mobile code is used to refer to program code that can
be transferred from one computer to another and run at the destination – Java applets are an
example. Code suitable for running on one computer is not necessarily suitable for running on
another because executable programs are normally specific both to the instruction set and to the
host operating system.
2. Transparency:
Transparency is defined as the concealment from the user and the application programmer of the
separation of components in a distributed system, so that the system is perceived as a whole rather
than as a collection of independent components. In other words, distributed systems designers must
hide the complexity of the systems as much as they can. Some terms of transparency in distributed
systems are:
Access: hide differences in data representation and how a resource is accessed.
Location: hide where a resource is located.
Migration: hide that a resource may move to another location.
Relocation: hide that a resource may be moved to another location while in use.
Replication: hide that a resource may be copied in several places.
Concurrency: hide that a resource may be shared by several competing users.
Failure: hide the failure and recovery of a resource.
Persistence: hide whether a (software) resource is in memory or on disk.
3. Openness
The openness of a computer system is the characteristic that determines whether the system can
be extended and re-implemented in various ways. The openness of distributed systems is
determined primarily by the degree to which new resource-sharing services can be added and be
made available for use by a variety of client programs. If the well-defined interfaces for a system
are published, it is easier for developers to add new features or replace sub-systems in the future.
Example: Twitter and Facebook have APIs that allow developers to develop their own software
that interacts with them.

4. Concurrency
Both services and applications provide resources that can be shared by clients in a distributed
system. There is therefore a possibility that several clients will attempt to access a shared resource
at the same time. For example, a data structure that records bids for an auction may be accessed
very frequently when it gets close to the deadline time. For an object to be safe in a concurrent
environment, its operations must be synchronized in such a way that its data remains consistent.
This can be achieved by standard techniques such as semaphores, which are used in most operating
systems.
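
As a small illustration of the synchronization mentioned above, the following Python sketch uses a lock (a simple counterpart of a semaphore) so that concurrent bids on a shared auction record stay consistent; the auction record and bid values are hypothetical.

import threading

auction = {"highest_bid": 0}
lock = threading.Lock()          # plays the role of a semaphore/mutex

def place_bid(amount):
    # Without the lock, two threads could read the same old value and
    # overwrite each other's bids, leaving the record inconsistent.
    with lock:
        if amount > auction["highest_bid"]:
            auction["highest_bid"] = amount

threads = [threading.Thread(target=place_bid, args=(bid,)) for bid in (100, 250, 175)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(auction["highest_bid"])    # 250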

5. Security
Many of the information resources that are made available and maintained in distributed systems
have a high intrinsic value to their users. Their security is therefore of considerable importance.
Security for information resources has three components:
o Confidentiality: protection against disclosure to unauthorized individuals.
o Integrity: protection against alteration or corruption.
o Availability for the authorized: protection against interference with the means to access the
resources.
6. Scalability
Distributed systems must remain scalable as the number of users increases. Scalability is defined by
B. Clifford Neuman as follows:

A system is said to be scalable if it can handle the addition of users and resources without suffering
a noticeable loss of performance or an increase in administrative complexity.
Scalability has 3 dimensions:

o Size: the number of users and resources to be processed. The associated problem is overloading.
o Geography: the distance between users and resources. The associated problem is communication
reliability.
o Administration: as the size of a distributed system increases, many of its components need to be
controlled. The associated problem is administrative mess.
7. Failure Handling
Computer systems sometimes fail. When faults occur in hardware or software, programs may
produce incorrect results or may stop before they have completed the intended computation. The
handling of failures is particularly difficult.


Hadoop Overview

Hadoop is an Apache open-source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework works in an environment that provides distributed storage and computation across
clusters of computers. Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

 Processing/Computation layer (MapReduce), and


 Storage layer (Hadoop Distributed File System).

MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at
Google for efficient processing of large amounts of data (multi-terabyte datasets) on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The
MapReduce program runs on Hadoop, which is an Apache open-source framework.
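
MapReduce programs for Hadoop are normally written in Java, but the programming model itself can be sketched in a few lines of Python. The word-count example below only simulates the map, shuffle/sort, and reduce phases in a single process; it is a conceptual sketch, not how Hadoop actually distributes the work across nodes.

from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Sum all counts emitted for the same key.
    return (word, sum(counts))

lines = ["Hadoop stores data", "Hadoop processes data in parallel"]

# Shuffle/sort: group intermediate values by key.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in sorted(grouped.items())]
print(results)   # [('data', 2), ('hadoop', 2), ('in', 1), ('parallel', 1), ('processes', 1), ('stores', 1)]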

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other distributed
file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high throughput access to application data and is suitable for applications
having large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also includes the
following two modules −
 Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
 Hadoop YARN − This is a framework for job scheduling and cluster resource
management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations to handle large-scale
processing. As an alternative, you can tie together many commodity computers, each with a single
CPU, into a single functional distributed system; practically, the clustered machines can read
the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one
high-end server. So the first motivational factor behind using Hadoop is that it runs across
clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs −
 Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB); a small sketch of this block arithmetic appears after this list.
 These files are then distributed across various cluster nodes for further processing.
 HDFS, being on top of the local file system, supervises the processing.
 Blocks are replicated for handling hardware failure.
 Checking that the code was executed successfully.
 Performing the sort that takes place between the map and reduce stages.
 Sending the sorted data to a certain computer.
 Writing the debugging logs for each job.
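
The block and replication arithmetic in the list above can be illustrated with a short Python sketch; the file size, block size, and replication factor below are just example values (128 MB blocks and a replication factor of 3 are common defaults).

import math

file_size_mb = 500          # hypothetical input file
block_size_mb = 128         # common HDFS default
replication_factor = 3      # common HDFS default

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(f"{num_blocks} blocks of up to {block_size_mb} MB each")          # 4 blocks
print(f"{raw_storage_mb} MB of raw cluster storage after replication")  # 1500 MB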

Advantages of Hadoop

 Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.

 Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the
application layer.
 Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
 Another big advantage of Hadoop is that, apart from being open source, it is compatible with
all platforms since it is Java-based.
Processing Data with Hadoop - Managing Resources and Applications with Hadoop YARN

YARN divides the tasks of resource management and job scheduling/monitoring into separate
daemons. There is one global ResourceManager and one per-application ApplicationMaster. An
application can be either a single job or a DAG of jobs.
The ResourceManager has two components: the Scheduler and the ApplicationsManager.

The Scheduler is a pure scheduler, i.e., it does not track the status of running applications. It only
allocates resources to the various competing applications. Also, it does not restart a job after a failure
due to hardware or application errors. The Scheduler allocates resources based on an abstract
notion of a container. A container is simply a fraction of resources such as CPU, memory, disk,
and network.
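
To make the container abstraction concrete, the following small Python sketch models a container request as a bundle of resources; the class and field names are purely illustrative and are not the actual YARN API (which is Java-based).

from dataclasses import dataclass

@dataclass
class ContainerRequest:
    # Purely illustrative model of a YARN container: a fraction of a node's resources.
    memory_mb: int
    vcores: int

# An ApplicationMaster might ask the Scheduler for several such containers.
requests = [ContainerRequest(memory_mb=2048, vcores=2) for _ in range(4)]
total_memory = sum(r.memory_mb for r in requests)
print(f"Requesting {len(requests)} containers, {total_memory} MB in total")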
The following are the tasks of the ApplicationsManager:

 Accepts job submissions from clients.
 Negotiates the first container for a specific ApplicationMaster.
 Restarts the ApplicationMaster container after an application failure.
Below are the responsibilities of the ApplicationMaster:

 Negotiates containers from the Scheduler.
 Tracks container status and monitors progress.
YARN supports the concept of resource reservation via the ReservationSystem. With it, a user can
reserve resources for the execution of a particular job over time, subject to temporal constraints. The
ReservationSystem makes sure that the resources are available to the job until its completion. It
also performs admission control for reservations.

YARN can scale beyond a few thousand nodes via YARN Federation, which allows multiple
sub-clusters to be wired into a single massive cluster. Many independent clusters can then be used
together for a single large job, achieving a very large-scale system.

Let us summarize how Hadoop works step by step:


 Input data is broken into blocks of 128 MB and then the blocks are moved to different nodes.
 Once all the blocks of the data are stored on data-nodes, the user can process the data.
 Resource Manager then schedules the program (submitted by the user) on individual nodes.
 Once all the nodes process the data, the output is written back to HDFS.


Interacting with the Hadoop Ecosystem
Hadoop has an ecosystem that has evolved from its three core components: processing,
resource management, and storage. In this topic, you will learn the components of the
Hadoop ecosystem and how they perform their roles during Big Data processing. The
Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the
following twelve components:

 HDFS (Hadoop Distributed File System)
 HBase
 Sqoop
 Flume
 Spark
 Hadoop MapReduce
 Pig
 Impala
 Hive
 Cloudera Search
 Oozie
 Hue.

Let us understand the role of each component of the Hadoop ecosystem.

Components of Hadoop Ecosystem

Let us start with the first component HDFS of Hadoop Ecosystem.

HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

 HDFS is a storage layer for Hadoop.


 HDFS is suitable for distributed storage and processing, that is, while the data is being
stored, it first gets distributed and then it is processed.
 HDFS provides Streaming access to file system data.
 HDFS provides file permission and authentication.
 HDFS uses a command line interface to interact with Hadoop.

So what stores data in HDFS? HBase is the component that stores its data in HDFS.

HBase
 HBase is a NoSQL database, or non-relational database.
 HBase is important and mainly used when you need random, real-time read or write
access to your Big Data.
 It provides support to a high volume of data and high throughput.
 In an HBase, a table can have thousands of columns.
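
As a hedged sketch of this random read/write access, the snippet below uses the third-party happybase client against HBase's Thrift gateway; the host, table name, and column family are assumptions for illustration, and the table is presumed to already exist.

import happybase  # third-party Python client for HBase's Thrift API

# Assumes an HBase Thrift server is reachable on localhost and that a table
# named 'users' with column family 'info' already exists.
connection = happybase.Connection("localhost")
table = connection.table("users")

# Random, real-time write: put a single row keyed by user id.
table.put(b"user-001", {b"info:name": b"Zeba", b"info:city": b"Pune"})

# Random, real-time read: fetch that row back by key.
row = table.row(b"user-001")
print(row[b"info:name"])

connection.close()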
