
TECHNOLOGIES FOR HANDLING BIG DATA

Prepared by: Saidatul Rahah Hamidi


Overview
 Distributed File System
 Hadoop
 HDFS Architecture
 Programming Models for Big Data
 NoSQL Database
 Cloud Storage System
 Graph Data
Distributed File System
 Holds a large amount of data
 Clients are distributed across a network
 Network File System (NFS)
o Straightforward design
o Remote access to a single machine
o Constraints
Why is Hadoop able to compete?

Database vs. Hadoop

Database:
- Performance (tons of indexing, tuning, and data organization techniques)
- Features: provenance tracking, annotation management, ….
- Mainly for structured data

Hadoop:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Efficient and simple fault-tolerance mechanism
- Commodity, inexpensive hardware
- Used for structured, semi-structured, and unstructured data
Hadoop vs RDBMS

Feature          | RDBMS                                | Hadoop
Data Variety     | Mainly for structured data           | Used for structured, semi-structured and unstructured data
Data Storage     | Average-size data (GB)               | Used for large data sets (TB and PB)
Querying         | SQL language                         | HQL (Hive Query Language)
Schema           | Required on write (static schema)    | Required on read (dynamic schema)
Speed            | Reads are fast                       | Both reads and writes are fast
Cost             | License                              | Free
Use Case         | OLTP (online transaction processing) | Analytics (audio, video, logs, etc.), data discovery
Data Objects     | Works on relational tables           | Works on key/value pairs
Throughput       | Low                                  | High
Scalability      | Vertical                             | Horizontal
Hardware Profile | High-end servers                     | Commodity/utility hardware
Integrity        | High (ACID)                          | Low
Hadoop’s Developers

 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. “Hadoop” is named after Cutting’s son's toy elephant.
 The project was funded by Yahoo.
 2006: Yahoo gave the project to the Apache Software Foundation.

https://www.coursera.org/learn/hadoop/lecture/1BPj6/the-hadoop-zoo
What is Hadoop

 Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
 Large datasets  terabytes or petabytes of data
 Large clusters  hundreds or thousands of nodes
 Hadoop is an open-source implementation of Google MapReduce and GFS (Google File System)
 Hadoop is based on a simple programming model called MapReduce
 Hadoop is based on a simple data model: any data will fit
Apache Hadoop In Big Data

• Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.

• The Hadoop framework transparently provides applications with both reliability and data motion.

• Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

• In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.

• Both MapReduce and HDFS are designed so that node failures are automatically handled by the framework.

http://hadoop.apache.org
Apache Hadoop In Big Data

List of modules:
• Hadoop Common: contains libraries and utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a programming model for large-scale data processing.
Architecture (Hadoop)
https://dzone.com/articles/ecosystem-hadoop-animal-zoo-0
Hadoop Components / Tools
 Apache Avro: designed for communication between Hadoop nodes through data serialization
 Cassandra and HBase: non-relational databases designed for use with Hadoop
 Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop
 Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for analysis and exploration
 Pig Latin: a data-flow language and execution framework for parallel computation
 HBase: a database model built on top of Hadoop
 Flume: designed for large-scale data movement
 ZooKeeper: keeps all the parts coordinated and working together
https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Main Properties of HDFS

 HDFS is the primary data storage system used by Hadoop applications
 Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
 Replication: each data block is replicated many times (the default is 3; see the sketch after this list)
 Failure: failure is the norm rather than the exception
 Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
 The Namenode consistently checks the Datanodes
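To make the replication property concrete, here is a minimal sketch of loading a file into HDFS and adjusting its replication factor from Python. It assumes a running Hadoop cluster with the standard `hdfs` command-line tool on the PATH; the file and directory names are hypothetical:

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Copy a local file into HDFS (paths are illustrative).
hdfs("-put", "weblogs.txt", "/data/weblogs.txt")

# Ask for 3 replicas of each block (3 is also the HDFS default),
# waiting (-w) until re-replication completes.
hdfs("-setrep", "-w", "3", "/data/weblogs.txt")

# The second column of the listing shows the replication factor.
print(hdfs("-ls", "/data/weblogs.txt"))
```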
HDFS is good for
 Very large files
 Streaming data access
 Commodity hardware

HDFS is not good for
 Low-latency data access
 Lots of small files
 Multiple writers, arbitrary file modifications
Architecture
Technologies
Traditional Data Management Architecture
New Data Management Architecture
Bigger Picture: Hadoop vs. Other Systems

Feature             | Distributed Databases                  | Hadoop
Computing Model     | Notion of transactions; a transaction is the unit of work; ACID properties, concurrency control | Notion of jobs; a job is the unit of work; no concurrency control
Data Model          | Structured data with known schema; read/write mode | Any data will fit in any format: (un)(semi)structured; read-only mode
Cost Model          | Expensive servers                      | Cheap commodity machines
Fault Tolerance     | Failures are rare; recovery mechanisms | Failures are common over thousands of machines; simple yet efficient fault tolerance
Key Characteristics | Efficiency, optimizations, fine-tuning | Scalability, flexibility, fault tolerance
Bigger Picture: Hadoop vs. Other Systems (continued)

 Cloud Computing
 A computing model where any computing infrastructure can run on the cloud
 Hardware & software are provided as remote services
 Elastic: grows and shrinks based on the user’s demand
 Example: Amazon EC2
Storage???
 Storing of the Data
 The Memory Hierarchy
 Redundant Array of Independent Disks (RAID)
 Disk Space Management
An Example Memory Hierarchy
(Levels near the top are smaller, faster, and costlier per byte; levels near the bottom are larger, slower, and cheaper per byte.)

L0: CPU registers hold words retrieved from the L1 cache
L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache
L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory
L3: main memory (DRAM) holds disk blocks retrieved from local disks
L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers
L5: remote secondary storage (distributed file systems, Web servers)
RAID
 Redundant Array of Inexpensive Disks (RAID)
 RAID arrays write data across multiple disks as a way of storing data redundantly (to achieve fault tolerance) or to stripe data across multiple disks to get better performance than any one disk could provide on its own.
 Typically, a RAID array will appear to the operating system as a single disk.

- Idea: use many disks in parallel to increase storage bandwidth and improve reliability
- Files are striped across disks
- Each stripe portion is read/written in parallel
- Bandwidth increases with more disks

 Problems:
- Small writes (writes smaller than a full stripe): need to read the entire stripe, apply the small write, then write the entire segment back out to the disks
- Reliability: more disks increase the chance of media failure (lower MTBF)
- Turn the reliability problem into a feature: use one disk to store parity data, the XOR of all data blocks in the stripe
- Any data block can then be recovered from all the others plus the parity block, hence "redundant" (see the sketch after this list)
- Overhead
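The parity idea above can be shown in a few lines. This is a minimal sketch, assuming fixed-size blocks represented as Python bytes; the block contents are invented for illustration:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A stripe of three data blocks (contents are illustrative).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing disk 1: XOR the surviving data blocks with the
# parity block to reconstruct the lost block.
recovered = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert recovered == data_blocks[1]
```

Because XOR is its own inverse, any single missing block equals the XOR of everything else in the stripe, which is exactly the property RAID 5 relies on.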
Common RAID Levels
 RAID 0: Striping
- Splits data among two or more disks
- Good for random access (but no reliability)

 RAID 1: Mirroring
- Two disks, write data to both (expensive: 1x storage overhead)

 RAID 5: Floating parity
- Parity blocks for different stripes are written to different disks
- No single parity disk, hence no bottleneck at that disk
- "Distributed parity" is the key phrase here
- An ideal combination of good performance, good fault tolerance, and high capacity and storage efficiency

 RAID 10: Striping plus mirroring
- Higher bandwidth, but still a large storage overhead
- Good performance and good failover handling
- Also called "nested RAID"
JBOD (Just a Bunch of Disks)

 The disks in a JBOD array can function as their own individual volumes or can be connected, or spanned, to form a single logical volume.
 Just a disk
 No replication
 No data integrity
 Less cost
 Commodity hardware
RAID vs JBOD
Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.
searchaws.techtarget.com/definition/data-lake
Data Lake
Data Lake vs Data Warehouse
Programming Models
Programming Models for Big Data
MapReduce: Word Count
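Word count is the canonical MapReduce example. Below is a minimal sketch of the map, shuffle, and reduce phases in plain Python, with no Hadoop dependency, so the data flow is visible; on a real cluster, Hadoop would run many map and reduce tasks in parallel across nodes:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

documents = ["big data needs big clusters", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, ...}
```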
NoSQL
 NoSQL is not “No-SQL”  Not Only SQL
 It does not aim to provide the ACID properties
 It originated as “no SQL”, though
 Scalability is horizontal, i.e., tuples (rows) can be placed across distributed machines
 Flexibility to model any kind of data
 A natural way of modeling data
 Distribution support is built-in
Taxonomy of NoSQL

• Key-value
• Graph database
• Document-oriented (key-value and document models are contrasted in the sketch below)
  • Popular doc formats: XML, JSON, BSON, YAML
• Column family
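To make the first categories concrete, here is a small sketch of the same record modeled as an opaque key-value pair and as a document; all names and fields are invented for illustration:

```python
import json

# Key-value: the store sees only an opaque key and an opaque value.
# The client must fetch the whole value and parse it itself.
kv_store = {}
kv_store["user:1001"] = json.dumps({"name": "Aina", "city": "Shah Alam"})
print(json.loads(kv_store["user:1001"])["name"])

# Document-oriented: the store understands the document's structure
# (here JSON-like), so individual fields can be queried and indexed.
document = {
    "_id": 1001,
    "name": "Aina",
    "city": "Shah Alam",
    "orders": [{"item": "laptop", "qty": 1}],
}
print(document["orders"][0]["item"])
```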
CAP theorem for NoSQL
What the CAP theorem really says:
• If you cannot limit the number of faults, and requests can be directed to any server, and you insist on serving every request you receive, then you cannot possibly be consistent (Eric Brewer, 2001)

How it is interpreted:
• You must always give something up: consistency, availability, or tolerance to failure and reconfiguration
Theory of NoSQL: CAP

GIVEN:
• Many nodes
• Nodes contain replicas of partitions of the data

• Consistency
  • All replicas contain the same version of the data
  • A client always has the same view of the data, no matter which node it reaches
• Availability
  • The system remains operational even with failing nodes
  • All clients can always read and write
• Partition tolerance
  • Multiple entry points
  • The system remains operational on a system split (communication malfunction)
  • The system works well across physical network partitions

CAP Theorem: satisfying all three at the same time is impossible
Available, Partition-Tolerant (AP) systems achieve "eventual consistency" through replication and verification.

Consistent, Available (CA) systems have trouble with partitions and typically deal with them via replication.

Consistent, Partition-Tolerant (CP) systems have trouble with availability while keeping data consistent across partitioned nodes.

http://blog.nahurst.com/visual-guide-to-nosql-systems
BASE properties
 Basically Available: the system guarantees availability
 Soft State: the state of the system is soft; it may change without input in order to maintain consistency
 Eventually Consistent: data will become consistent eventually, without any interim perturbation (see the sketch below)
 Sacrifices consistency
 A counterpoint to ACID
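A minimal sketch of BASE behavior, under the assumption that a write lands on one replica and is propagated to the others asynchronously; reads may briefly see stale data (soft state) before all replicas converge (eventual consistency):

```python
import threading
import time

replicas = [{"x": 0}, {"x": 0}, {"x": 0}]  # three replicas of one record

def write(key, value):
    """Write to one replica now; propagate to the rest in the background."""
    replicas[0][key] = value

    def propagate():
        time.sleep(0.1)  # simulated replication lag
        for replica in replicas[1:]:
            replica[key] = value

    threading.Thread(target=propagate).start()

write("x", 42)
print([r["x"] for r in replicas])  # soft state: likely [42, 0, 0] just after the write
time.sleep(0.2)
print([r["x"] for r in replicas])  # eventually consistent: [42, 42, 42]
```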
RDB ACID to NoSQL BASE

ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basically Available, Soft-state (the state of the system may change over time), Eventually consistent (asynchronous propagation)

Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)


Types of NoSQL Data Stores
 Four main types of NoSQL data stores:
 Key-value stores
 Columnar families (Bigtable-style systems)
 Document databases
 Graph databases

 http://nosql-database.org/
NoSQL Systems
 The three most popular are (a short MongoDB example follows below):
 HBase
 Cassandra
 MongoDB
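As a taste of one of these systems, here is a minimal document-store sketch using MongoDB's Python driver, pymongo. It assumes a MongoDB server running locally on the default port; the database, collection, and field names are made up for illustration:

```python
from pymongo import MongoClient  # pip install pymongo

# Assumes a local MongoDB server; the database and collection
# names ("demo_db", "users") are hypothetical.
client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

# Documents in one collection need not share a schema.
users.insert_one({"name": "Aina", "city": "Shah Alam"})
users.insert_one({"name": "Omar", "interests": ["hadoop", "nosql"]})

# Queries go through the driver API rather than SQL.
print(users.find_one({"name": "Omar"}))
```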
How does NoSQL vary from RDBMS?
• Looser schema definition
• Applications written to deal with specific documents/data
• Applications are aware of the schema definition, as opposed to the data
• Designed to handle distributed, large databases
• Trade-offs:
• No strong support for ad hoc queries, but designed for speed and growth of the database
• Query language through the API
• Relaxation of the ACID properties
Benefits of NoSQL

Elastic Scaling
• RDBMS scale up: bigger load, bigger server
• NoSQL scales out: distribute data across multiple hosts seamlessly

Big Data
• Huge increases in data volume push RDBMS capacity and data-volume constraints to their limits
• NoSQL is designed for big data

DBA Specialists
• RDBMS require highly trained experts to monitor the DB
• NoSQL requires less management: automatic repair and simpler data models
Benefits of NoSQL

Flexible Data Models
• Schema change management for an RDBMS has to be handled carefully
• NoSQL databases are more relaxed about the structure of data
• Database schema changes do not have to be managed as one complicated change unit
• Applications are already written to address an amorphous schema

Economics
• RDBMS rely on expensive proprietary servers to manage data
• NoSQL uses clusters of cheap commodity servers to manage the data and transaction volumes
• Cost per gigabyte or per transaction/second for NoSQL can be lower than the cost for an RDBMS
Drawbacks of NoSQL

Support
• RDBMS vendors provide a high level of support to clients
• Stellar reputations
• NoSQL systems are open source projects with startups supporting them
• Their reputations are not yet established

Maturity
• RDBMS are mature products: stable and dependable
• That also means old: no longer cutting edge, nor interesting
• NoSQL systems are still implementing their basic feature set
Drawbacks of NoSQL

Administration
• The RDBMS administrator is a well-defined role
• NoSQL's goal is to need no administrator; however, NoSQL still requires effort to maintain

Lack of Expertise
• There is a whole workforce of trained and seasoned RDBMS developers
• The NoSQL camp is still recruiting developers

Analytics and Business Intelligence
• RDBMS are designed to address this niche
• NoSQL is designed to meet the needs of a Web 2.0 application, not for ad hoc querying of the data
• Tools are being developed to address this need
