
Unit 4

Big data technology landscape


Two important technologies:
NoSQL and Hadoop
Course objective
• To study two technologies: 1. NoSQL and 2. Hadoop.
• Note that processing does not mean analytics.
Storing deals with how to divide big data into chunks (blocks)
of data, distribute the blocks, and store them, while
processing integrates these distributed blocks again so that
the data can be properly presented to data analytic tools.
Analysis, again, is different from analytics: analysis converts
data into information in the form of reports and visualizations,
while analytics converts information into knowledge using
statistical and mathematical techniques.
Syllabus
• Course objective: study NoSQL and Hadoop technologies
• 1. Distributed computing challenges
• 2. NoSQL
• 3. Hadoop: consisting of HDFS and MapReduce
3.1. History of Hadoop
3.2. Hadoop overview
3.3. Use cases of Hadoop
3.4. Hadoop distributors
• 4. HDFS:
4.1. HDFS daemons: NameNode, DataNode, Secondary
NameNode
4.2. File read, file write, replica processing of data with
Hadoop
4.3. Managing resources and applications with Hadoop YARN
4.1. Distributed computing challenges
1. In a distributed system, since several servers are
networked together, there can be failures of
hardware.
Example: a hard disk failure creates a data retrieval
problem.
2. In a distributed system the data is spread across several machines.
How can it be integrated prior to processing?
Solution: two important technologies, NoSQL and
Hadoop, which we study in this Unit 4.
2. NoSQL
• MySQL is the world's most widely used RDBMS, and runs as a
server providing multi-user access to a number of
databases.
• Oracle Database is an object-relational database
management system (ORDBMS).
• The main difference between Oracle and MySQL is that
MySQL is open source, while Oracle is not.
• SQL stands for Structured Query Language. It is a standard
language for accessing and manipulating databases.
• SQL Server, Oracle, Informix, Postgres, etc. are RDBMSs.
2.1. Introduction to NoSQL
• NoSQL is a distributed database model, while Hadoop is not a
database (Hadoop is a framework).
• NoSQL is open source, non-relational, and scalable.
• There are several databases which follow the NoSQL model.
• NoSQL databases are used in big data and real-time web
applications, such as social media.
• They do not restrict the data to adhere to any schema at the
time of storage.
• They structure the unstructured input data into different
formats, viz. key-value pairs, document oriented, column
oriented, and graph based data, besides structured data.
• They adhere to the CAP theorem and compromise on C (consistency)
in favor of A (availability) and P (partition tolerance).
• They do not support the ACID properties of transactions
(Atomicity, Consistency, Isolation, and Durability).
2.2. Types of NoSQL databases
They can be broadly classified into:
1. Key-value (big hash table) type: they maintain a big hash table
of keys and values (see the sketch below).
Sample key-value pairs:
key           value
first name    Robert
last name     Williams
2. Document type: maintain data as a collection of documents.
A document is the equivalent of a record in an RDBMS, and a collection is
the equivalent of a table in an RDBMS. Sample document:
{"Book Name": "Fundamentals .. ",
"Publisher": "Wiley India",
"year": "2011"
}
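The two shapes above can be illustrated in a few lines of plain Java. This is only a minimal sketch of the idea, not any particular product's API: the in-memory HashMap stands in for a key-value store, and the JSON string stands in for a stored document; the class and field names are purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class NoSqlShapes {
    public static void main(String[] args) {
        // Key-value ("big hash table") style: the store only understands
        // opaque keys and values, much like a distributed HashMap.
        Map<String, String> keyValueStore = new HashMap<>();
        keyValueStore.put("first name", "Robert");
        keyValueStore.put("last name", "Williams");

        // Document style: the value is a self-describing document (JSON here);
        // a collection of such documents plays the role of an RDBMS table.
        String bookDocument = "{\"Book Name\": \"Fundamentals .. \", "
                + "\"Publisher\": \"Wiley India\", \"year\": \"2011\"}";

        System.out.println(keyValueStore.get("first name"));
        System.out.println(bookDocument);
    }
}
```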
2.2. Types of NoSQL databases (continued)
3. Column type: each storage block has data
from only one column.
4. Graph type: also called a network database. A graph stores
data in nodes.
Sample graph: ID, name, age stored in each node;
the arrows (edges) carry labels like "member", "member since
2002", "knows since 2002", etc.
2.3. Popular NoSQL databases
They fall into two broad groups:
1. Key-value (big hash table)
2. Schema-less
1. Key-value (big hash table) type NoSQL databases (some schema is
followed):
Amazon S3 (Dynamo), Scalaris, Redis, Riak
2. Schema-less (no schema, not even key-value):
2.1 Column based: Cassandra, HBase
2.2 Document based: Apache CouchDB, MongoDB,
MarkLogic
2.3 Graph based: Neo4j, HyperGraphDB
2.4. Advantages of NoSQL
• Dynamic schema: since it allows insertion of data without a
predefined schema, it facilitates application changes in real
time, i.e. faster code development and integration and less
database administration.
• Auto sharding: it automatically spreads data across an
arbitrary number of servers while balancing the load and
queries on the servers. If a server fails, it is replaced
without disruption (see the sketch after this slide).
• Replication: multiple copies of data are stored across the
cluster and even across data centers. This promises high
availability and fault tolerance.
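The auto-sharding idea above can be made concrete with a tiny routing sketch. This is not how any particular NoSQL product is implemented; it only shows the core mechanism of hashing a key to pick one of N servers (the server names and the CRC32 hash are illustrative assumptions). Real stores add rebalancing when nodes join or fail.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

public class ShardRouter {
    private final List<String> servers;

    public ShardRouter(List<String> servers) {
        this.servers = servers;
    }

    // Hash the key and map it onto one of the servers; this is the essence of
    // spreading data across an arbitrary number of nodes.
    public String serverFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return servers.get((int) (crc.getValue() % servers.size()));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("node-a", "node-b", "node-c"));
        System.out.println(router.serverFor("cart:user-42"));
    }
}
```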
2.4. Advantages of NoSQL (continued)
• Rapid and elastic scalability: allows scaling to the cloud
with the following capacities:
Cluster scale: allows distribution of a database across more than 100
nodes among multiple data centers.
Performance scale: supports over 100,000 database read
and write operations per second.
Data scale: supports storing more than 1 billion documents in the
database.
• Cheap and easy to implement.
• Adheres to CAP; relaxes the consistency requirement.
2.5. Disadvantages of NoSQL
• Does not support joins
• No support for ACID
• No standard query language interface, except
in the case of MongoDB and Cassandra (CQL)
• No easy integration with other applications
that support SQL
2.6. NoSQL applications in industry
• Key-value type databases: used for shopping carts and web user
data analysis (Amazon, LinkedIn)
• Column type databases: used by Facebook, Twitter, eBay, Netflix
• Document type databases: used for logging and archive management
• Graph type databases: used in network modeling (e.g. Walmart)
• NoSQL vendors:
1. Amazon (Dynamo): used by LinkedIn, Mozilla
2. Facebook (Cassandra): used by Netflix, Twitter, eBay, i.e. a column type
database
3. Google (BigTable): used by Adobe Photoshop
2.7. NewSQL
• A database that has the same scalable
performance as NoSQL, supports OLTP, and
maintains the ACID guarantees of a traditional
database.
• It is a new class of RDBMS supporting the relational data
model and using SQL as its interface.
2.8. Comparison

Property                 SQL                            NoSQL                    NewSQL
Transactions             ACID                           CAP                      ACID
Data model               Relational (RDB)               Non-relational           Relational (RDB)
OLTP/OLAP                Supported                      Not supported            Supported
Schema                   Predefined schema              No schema rigidity       May have schema rigidity
Scalability              Vertical (scale up by          Scale-out (horizontal)   Scale-out (horizontal)
                         increasing system resources)
Distributed computing    Fully supported                Increasing support       Growing support
ACID
• In databases, a transaction is a small unit of a program that
may contain several low-level tasks.
• A transaction in a database system must maintain Atomicity, Consistency,
Isolation, and Durability − commonly known as the ACID properties − in order to
ensure accuracy, completeness, and data integrity.
• For example, a transfer of funds from one bank account to another, even involving
multiple changes such as debiting one account and crediting another, is a single
transaction (see the sketch below).
• These four properties describe the major guarantees of the transaction paradigm,
which has influenced many aspects of development in database systems.
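The funds-transfer example can be sketched with plain JDBC. This is a minimal illustration, assuming a relational database reachable at the given JDBC URL and a hypothetical accounts(id, balance) table. The point is that commit applies the debit and the credit together, and rollback discards both.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferFunds {
    // Debit one account and credit another inside a single transaction:
    // either both updates are applied (commit) or neither is (rollback).
    public static void transfer(String jdbcUrl, int fromId, int toId, double amount)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);                       // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setDouble(1, amount);
                debit.setInt(2, fromId);
                debit.executeUpdate();
                credit.setDouble(1, amount);
                credit.setInt(2, toId);
                credit.executeUpdate();
                conn.commit();                               // both changes become visible together
            } catch (SQLException e) {
                conn.rollback();                             // any failure undoes the partial work
                throw e;
            }
        }
    }
}
```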
Atomicity
• An atomic transaction is an indivisible and irreducible series
of database operations such that either all occur, or nothing
occurs.
• Transactions are often composed of multiple statements.
• A guarantee of atomicity prevents updates to
the database occurring only partially, which can cause greater
problems than rejecting the whole series outright.
• Atomicity guarantees that each transaction is treated as a
single "unit", which either succeeds completely or fails
completely:
• if any of the statements in a transaction fails to complete, the
entire transaction fails and the database is left unchanged.
• An atomic system must guarantee atomicity in each and every
situation, including power failures, errors, and crashes.
Consistency
• Consistency ensures that a transaction can only bring the
database from one valid state to another valid state,
maintaining database invariants:
• any data written to the database must be valid according
to all defined rules, including constraints, cascades,
triggers, and any combination thereof.
• This prevents database corruption by an illegal transaction,
but does not guarantee that a transaction is correct.
Isolation
• Transactions are often executed concurrently (e.g., reading
and writing to multiple tables at the same time)
• Isolation ensures that concurrent execution of transactions
leaves the database in the same state that would have been
obtained if the transactions were executed sequentially.
• Isolation is the main goal of concurrency control;
• depending on the method used, the effects of an
incomplete transaction might not even be visible to other
transactions.
Durability
• Durability guarantees that once a transaction
has been committed, it will remain committed
even in the case of a system failure (e.g.,
power outage or crash).
• This usually means that completed
transactions (or their effects) are recorded
in non-volatile memory.
HADOOP
Hadoop
• 3. Hadoop:
3.1. History of Hadoop
3.2. Hadoop overview
3.3. Use cases of Hadoop
3.4. Hadoop distributors
• 4. HDFS:
4.1. HDFS daemons: NameNode, DataNode, Secondary
NameNode
4.2. File read, file write, replica processing of data with Hadoop
4.3. Managing resources and applications with Hadoop YARN
1. Hadoop overview
• Hadoop is for: 1. massive data storage
                 2. faster data processing
Key aspects:
1. Open-source software (OSS)
2. Framework: programs, tools, etc. are provided to
develop and execute applications. It is not a database
like NoSQL.
3. Distributed: data is distributed across multiple
computers and processed in parallel.
4. Massive data storage and faster processing
Hadoop distributors
• The following companies supply Hadoop
products and distributions:
• Apache Hadoop (open source)
• Cloudera
• Hortonworks
• Amazon Web Services Elastic MapReduce Hadoop Distribution
• Microsoft
• MapR
• IBM InfoSphere BigInsights
4. HDFS:
• HDFS is one of the two core components of
Hadoop, the second being MapReduce.
4.1. HDFS daemons: NameNode, DataNode,
Secondary NameNode
4.2. File read, file write, replica processing of
data with Hadoop
4.3. Managing resources and applications with
Hadoop YARN
4.1. HDFS daemons
1. NameNode:
• There is a single NameNode per cluster.
• It manages file-related operations like read, write, create, and
delete.
• The NameNode stores the HDFS namespace.
• It manages the file system namespace, which is the collection of files in
the cluster.
• The file system namespace includes the mapping of blocks to files and file
properties, and is stored in a file called FsImage.
• It uses an EditLog to record every transaction.
• A rack is a collection of DataNodes within a cluster.
• The NameNode uses a rackID to identify the DataNodes in a rack.
HDFS daemons
• When the NameNode starts, it reads the FsImage and
EditLog from disk and applies all transactions
from the EditLog to its in-memory representation of the FsImage.
• Then it flushes out a new version of the FsImage to
disk and truncates the old EditLog, because its
changes have now been applied to the FsImage.
HDFS daemons
2. DataNode
• There are multiple DataNodes per cluster.
• During pipeline reads and writes, DataNodes communicate
with each other.
• A DataNode also sends heartbeat messages to the
NameNode to ensure connectivity between the NameNode
and the DataNodes.
• If no heartbeat is received, the NameNode re-replicates that
DataNode's blocks on other DataNodes within the cluster, and the
cluster keeps running.
HDFS daemons

3. Secondary NameNode
• It takes a snapshot of the HDFS metadata at
intervals specified in the configuration.
• It occupies the same amount of memory as the NameNode.
• Therefore the two are run on different machines.
• In case of NameNode failure, the Secondary NameNode can be
configured to take over.
4.2. File read, file write, replica processing of data
with Hadoop
• File read:
• 1. The client opens the file it wants to read by calling open()
on the DFS (DistributedFileSystem).
• 2. The DFS communicates with the NameNode to get the locations
of the data blocks.
• 3. The NameNode returns the addresses of the DataNodes
containing the data blocks.
• 4. The DFS returns an FSDataInputStream to the client.
• 5. The client calls read() on the FSDataInputStream, which
holds the addresses of the DataNodes for the first few
blocks of the file, and connects to the nearest DataNode for the
first block of the file.
• 6. The client calls read() repeatedly to stream the data
from the DataNode.
• 7. When the end of a block is reached, the FSDataInputStream closes
the connection with that DataNode.
• 8. It repeats these steps to find the best DataNode for
the next block.
• 9. The client calls close() (see the sketch below).
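From the client's point of view, all of the steps above are hidden behind open(), read(), and close(). A minimal sketch using the Hadoop Java FileSystem API follows; the NameNode address and file path are illustrative assumptions, and the hadoop-client libraries are assumed to be on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);                  // step 1: the client-side DFS

        // open() triggers steps 2-4; read() calls stream data block by block (steps 5-8).
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }                                                      // step 9: close()
        fs.close();
    }
}
```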
File write
• 1. The client calls create() to create the file.
• 2. An RPC call is initiated to the NameNode.
• 3. The NameNode creates the file after a few checks.
• 4. An FSDataOutputStream is returned for the client to write to.
• 5. As the client writes data, the data is split into packets, which are then
written to a data queue.
• 6. The DataStreamer asks the NameNode to allocate blocks by selecting a
list of suitable DataNodes for storing the replicas (by default 3).
• 7. This list of DataNodes forms a pipeline, with 3 nodes in the pipeline
for the first block.
File write (continued)
• 8. The DataStreamer streams the packets to the first DataNode in the
pipeline, which stores them and then forwards them to the other DataNodes in the
pipeline.
• 9. The DFSOutputStream manages an "ack queue" of packets that are
waiting for acknowledgement; a packet is removed from the queue only when
it has been acknowledged by all the DataNodes in the pipeline.
• 10. When the client finishes writing the file, it calls close() on the
stream.
• 11. This flushes all the remaining packets to the DataNode pipeline
and waits for acknowledgements before contacting the
NameNode to signal that the creation of the file is
complete (see the sketch below).
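Again, the packet queue, DataStreamer, and ack queue are internal to the client library; user code only sees create(), write(), and close(). A minimal sketch with an assumed NameNode address and output path:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // create() performs the NameNode RPC (steps 1-4); writes are split into
        // packets and pushed down the DataNode pipeline internally (steps 5-9).
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }                                                      // close() flushes and waits for acks (steps 10-11)
        fs.close();
    }
}
```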
Replica processing of data with Hadoop
• Replica placement strategy:
• By default, 3 replicas are created for each data block:
the 1st replica is placed on the same node as the client;
the 2nd replica is placed on a node in a different rack;
the 3rd replica is placed on the same rack as the second, but on a different node in that rack.
• Then a data pipeline is built. The client application writes a block to the 1st
DataNode in the pipeline;
next, this DataNode takes over and forwards the data to the next node in the pipeline.
• This process repeats for all the data blocks.
• Subsequently all the data blocks are written to disk.
• The client application need not track all blocks of data; HDFS directs the
client to the nearest replica (a configuration sketch follows).
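The replication factor behind this placement strategy is a per-file setting. The short sketch below shows two common ways to influence it from client code (the NameNode address and path are assumptions); cluster-wide defaults normally live in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        conf.set("dfs.replication", "3");                 // default replication for files this client creates
        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing file; the NameNode then
        // schedules the extra replica using the rack-aware placement described above.
        fs.setReplication(new Path("/data/sample.txt"), (short) 4);
        fs.close();
    }
}
```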
Why Hadoop 2.x?
• Because of the following limitations of Hadoop 1.0:
• In Hadoop 1.0, HDFS and MapReduce are the core components, while other
components are built around them.
• 1. A single NameNode holds the entire namespace of a cluster. It keeps all its
file metadata in main memory, which puts a limit on the number of
objects the NameNode can store.
• 2. Restricted to processing batch-oriented MapReduce jobs.
• 3. MapReduce handles both cluster resource management and data processing; it is not
suitable for interactive analysis.
• 4. Hadoop 1.0 is not suitable for machine learning, graphs, and other
memory-intensive algorithms.
• 5. Map slots may become full while reduce slots are empty, and vice
versa, leading to inefficient resource utilization.
How Hadoop 2.x helps
• HDFS 2, used in Hadoop 2.0, consists of 2 major components:
• 1. Namespace service: takes care of file-related operations (create,
read, write).
• 2. Block storage service: handles DataNode cluster
management and replication.
• HDFS 2 uses:
• 1. Multiple independent NameNodes: the DataNodes act as
common block storage shared by all NameNodes. All
DataNodes register with every NameNode in the cluster.
• 2. A passive standby NameNode.
Managing resources and applications
with Hadoop YARN
• YARN is a sub-project of Hadoop 2.x.
• It is a general processing platform.
• YARN is not constrained to MapReduce alone.
• Multiple applications can be run in Hadoop 2.x,
provided all the applications share the same resource
(memory, CPU, network, etc.) management.
• With YARN, Hadoop can do not only batch processing but
also interactive, online, streaming, graph, and other types of
processing.
Daemons of YARN
1. Global ResourceManager: distributes resources among the various
applications. It has 2 components:
1.1. Scheduler: decides the allocation of resources to running applications.
It does no monitoring.
1.2. ApplicationsManager: accepts jobs and negotiates resources for
executing the ApplicationMaster, which is specific to an application.
• 2. NodeManager: monitors the usage of resources and reports that usage
to the Global ResourceManager. It launches 'application containers' for
the execution of applications.
• Every machine has one NodeManager.
• 3. Per-application ApplicationMaster: every application has one. It
negotiates the required resources for execution from the Resource
Manager. It works along with the NodeManagers to execute and
monitor the component tasks.
• An application is a job submitted to the
framework, e.g. a MapReduce job.
• Container:
a basic unit of allocation across multiple
resource types, e.g.:
container_0 = 2 GB, 1 CPU
container_1 = 1 GB, 6 CPUs
Containers replace the fixed map/reduce slots (see the sketch below).
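In the YARN Java API, such a container specification is simply memory plus virtual cores. A tiny illustration of the two example containers above (the values are the ones from the slide; org.apache.hadoop.yarn.api.records.Resource is the real record type, the class name here is made up):

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class ContainerSpecs {
    public static void main(String[] args) {
        // A container is described by memory (in MB) and virtual cores; YARN
        // allocates whatever mix is requested instead of fixed map/reduce slots.
        Resource container0 = Resource.newInstance(2048, 1); // roughly 2 GB, 1 vCPU
        Resource container1 = Resource.newInstance(1024, 6); // roughly 1 GB, 6 vCPUs
        System.out.println(container0 + " and " + container1);
    }
}
```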
YARN architecture: steps
• 1. The client program submits the application, which contains the specifications to
launch the application-specific 'ApplicationMaster'.
• 2. The ResourceManager launches the 'ApplicationMaster' by assigning it a
container.
• 3. The 'ApplicationMaster' registers with the ResourceManager so that the client
can query the ResourceManager for details.
• 4. The ApplicationMaster negotiates appropriate resource containers via the
resource-request protocol.
• 5. After container allocation, the ApplicationMaster launches each container
by providing its specification to the NodeManager.
• 6. The NodeManager executes the application code and provides status to the
ApplicationMaster via an application-specific protocol.
• 7. On completion of the application, the 'ApplicationMaster' deregisters with the
ResourceManager and shuts down. Its containers can then be reused
(a client-side submission sketch follows).
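Steps 1-2 of this flow correspond to the client-side YarnClient API. The sketch below covers only the submission half, under stated assumptions: a reachable ResourceManager, default YARN configuration on the classpath, and a placeholder launch command standing in for a real ApplicationMaster; a real application would also package the AM code and poll for its status.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the ResourceManager for a new application and describe the
        // container that should launch the ApplicationMaster.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("demo-app");
        context.setResource(Resource.newInstance(1024, 1));    // 1 GB, 1 vCore for the AM container
        context.setAMContainerSpec(ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                        // local resources (none in this sketch)
                Collections.emptyMap(),                        // environment variables
                Collections.singletonList("echo ApplicationMaster placeholder"),
                Collections.emptyMap(),                        // service data
                null,                                          // security tokens
                Collections.emptyMap()));                      // ACLs

        // Step 2: the ResourceManager allocates a container and launches the AM in it.
        yarnClient.submitApplication(context);
        yarnClient.stop();
    }
}
```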
