Technologies For Handling Big Data: Prepared By: Saidatul Rahah Hamidi
BIG DATA
Database
vs.
https://www.coursera.org/learn/hadoop/lecture/1BPj6/the-hadoop-zoo
What is Hadoop
• In addition, it provides a distributed file system (HDFS) that stores data on the
compute nodes, providing very high aggregate bandwidth across the cluster.
• Both MapReduce and the HDFS are designed so that node failures are
automatically handled by the framework.
http://hadoop.apache.org
Apache Hadoop In Big Data
List of modules:
• Hadoop Common: contains libraries and utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a programming model for large-scale data processing.
Architecture (Hadoop)
https://dzone.com/articles/ecosystem-hadoop-animal-zoo-0
Hadoop Components / Tools
Apache Avro: a data serialization system designed for communication between Hadoop nodes
Cassandra and HBase: non-relational databases designed for use with Hadoop
Hive: provides HiveQL, a query language similar to SQL that runs on Hadoop
Mahout: a machine learning library; that is, it assists with filtering data for analysis and exploration
Pig Latin: a data-flow language and execution framework for parallel computation
HBase: a database model built on top of Hadoop
Flume: designed for large-scale data movement
ZooKeeper: keeps all the parts coordinated and working together
https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Main Properties of HDFS
Commodity hardware
Cloud Computing
A computing model where any computing infrastructure can run on the
cloud
Hardware & Software are provided as remote services
Elastic: grows and shrinks based on the user’s demand
Example: Amazon EC2
Storage….
Storage???
Storing of the Data
The Memory Hierarchy
Redundant Array of Independent Disks (RAID)
- Idea: use many disks in parallel to increase storage bandwidth and improve reliability
  - Files are striped across disks
  - Each stripe portion is read/written in parallel
  - Bandwidth increases with more disks
Problems:
- Small files (small writes of less than a full stripe)
  - Need to read the entire stripe, update it with the small write, then write the entire segment back out to disks
- Reliability: more disks increase the chance of a media failure (lower MTBF for the array)
  - Turn the reliability problem into a feature
  - Use one disk to store parity data: XOR of all data blocks in the stripe
  - Can recover any data block from all the others + the parity block ("redundant")
  - Overhead
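The parity idea above can be illustrated with a minimal Python sketch. It assumes all blocks in a stripe have the same size and that exactly one block is lost; the function names (`xor_blocks`, `make_stripe`, `recover`) are hypothetical and chosen only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three data blocks plus one parity block; pretend the second disk failed.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
assert recover(stripe, 1) == b"BBBB"
```

The same `recover` call rebuilds the parity block itself if the parity disk is the one that fails, because XOR is its own inverse. The "overhead" is the one extra block stored per stripe.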
Common RAID Levels
RAID 0: Striping
Good for random access (no reliability)
It splits data among two or more disks.
RAID 1: Mirroring
- Two disks, write data to both (expensive, 1X storage overhead)
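As a rough illustration of the two levels above, the sketch below models each disk as a Python list of chunks. The chunk size, the disk count, and the function names are assumptions made for this example only; real RAID controllers work on fixed-size sectors.

```python
def stripe_raid0(data, num_disks, chunk_size):
    """RAID 0: deal fixed-size chunks round-robin across the disks."""
    disks = [[] for _ in range(num_disks)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for n, chunk in enumerate(chunks):
        disks[n % num_disks].append(chunk)
    return disks


def read_raid0(disks):
    """Reassemble the data by reading the chunks back in round-robin order."""
    out = []
    depth = max(len(disk) for disk in disks)
    for row in range(depth):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


def mirror_raid1(data):
    """RAID 1: write the same data to both disks."""
    return [data, data]


disks = stripe_raid0(b"abcdefghij", num_disks=3, chunk_size=2)
assert read_raid0(disks) == b"abcdefghij"
```

With striping, losing any one disk loses part of every large file, which is why RAID 0 offers no reliability. With mirroring, either copy alone is enough to read the data, at the cost of storing everything twice.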
No data integrity
Less cost
Commodity
RAID vs JBOD
Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its native
format until it is needed. While a hierarchical data warehouse stores data in files
or folders, a data lake uses a flat architecture to store data.
searchaws.techtarget.com/definition/data-lake
Data Lake
Data Lake vs Data Warehouse
Programming Models
Programming Models for Big Data
MapReduce: Word Count
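A minimal single-process sketch of word count in the MapReduce style is given below. It only imitates the map, shuffle, and reduce phases in plain Python and does not use the Hadoop API; the function names are hypothetical, and in a real cluster each phase would run in parallel across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    """Run the three phases over a list of input documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


print(word_count(["big data big", "Data lake"]))
# {'big': 2, 'data': 2, 'lake': 1}
```

The map and reduce functions never share state, which is what lets a framework run many copies of each on different nodes.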
NoSQL
NoSQL
NoSQL does not mean "No SQL"; it stands for "Not Only SQL"
It does not aim to provide the ACID properties
The name did originate as "no SQL", though
Scalability is horizontal, i.e., tuples (rows) can be spread
across distributed machines
Flexibility to model any kind of data
Natural way of modeling data
Distribution support is built in
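Horizontal scaling, as described above, means deciding which machine holds each row. One simple way to sketch this is hash partitioning on the row key. The node names and helper functions below are hypothetical, and this is not how any particular NoSQL product is implemented.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node from a stable hash of the row key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(rows, nodes):
    """Place each (key, row) pair on the node chosen by its key."""
    placement = {node: {} for node in nodes}
    for key, row in rows.items():
        placement[node_for(key, nodes)][key] = row
    return placement


nodes = ["node-a", "node-b", "node-c"]  # hypothetical machine names
rows = {"user:1": {"name": "Ali"}, "user:2": {"name": "Mei"}, "user:3": {"name": "Sam"}}
placement = distribute(rows, nodes)
```

Any client can recompute `node_for` to find a row without a central lookup table. A plain modulo scheme like this moves most keys when a node is added, so systems often use schemes such as consistent hashing instead.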
Taxonomy of NoSQL
• Key-value
• Graph database
• Document-oriented
  • Popular document formats: XML, JSON, BSON, YAML
• Column family
CAP theorem for NoSQL
What the CAP theorem really says:
• If you cannot limit the number of faults, and requests can be directed to any
server, and you insist on serving every request you receive, then you cannot
possibly be consistent (Eric Brewer, 2001)
How it is interpreted:
• You must always give something up: consistency, availability, or
tolerance to failure and reconfiguration
Theory of NoSQL: CAP
GIVEN:
• Many nodes
• Nodes contain replicas of partitions of the data
• Consistency (C)
  • All replicas contain the same version of data
  • Client always has the same view of the data (no matter what node)
• Availability (A)
  • System remains operational on failing nodes
  • All clients can always read and write
• Partition tolerance (P)
  • Multiple entry points
  • System remains operational on system split (communication malfunction)
  • System works well across physical network partitions
CAP Theorem: satisfying all three at the same time is impossible
• Available, Partition-Tolerant (AP) systems achieve "eventual consistency" through replication and verification
• Consistent, Available (CA) systems have trouble with partitions and typically deal with it with replication
• Consistent, Partition-Tolerant (CP) systems have trouble with availability while keeping data consistent across partitioned nodes
http://blog.nahurst.com/visual-guide-to-nosql-systems
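The CP versus AP trade-off can be made concrete with a toy two-replica store. This is a deliberately simplified sketch: the class names, the `mode` labels, and the `partitioned` flag are assumptions for illustration, and real systems detect partitions and coordinate replicas in far more involved ways.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class TinyStore:
    """Two replicas; mode is "CP" or "AP" (labels used only in this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, replica_id, key, value):
        if not self.partitioned:
            for replica in self.replicas:  # replicate synchronously
                replica.data[key] = value
            return True
        if self.mode == "CP":
            return False  # refuse the write: consistent but not available
        self.replicas[replica_id].data[key] = value  # AP: accept locally
        return True

    def read(self, replica_id, key):
        return self.replicas[replica_id].data.get(key)


cp, ap = TinyStore("CP"), TinyStore("AP")
for store in (cp, ap):
    store.write(0, "x", 1)
    store.partitioned = True
    store.write(0, "x", 2)

assert cp.read(0, "x") == cp.read(1, "x") == 1   # CP: still consistent
assert (ap.read(0, "x"), ap.read(1, "x")) == (2, 1)  # AP: replicas diverge
```

During the partition the CP store rejects the second write, while the AP store accepts it and must reconcile the two replicas later.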
BASE properties
Basically Available – the system guarantees availability
Soft State – the state of the system is soft; it may change
without input in order to maintain consistency
Eventual Consistency – data will eventually become consistent,
provided no further updates arrive in the interim
Sacrifices consistency
To counter ACID
RDB ACID to NoSQL BASE
ACID:
• Atomicity
• Consistency
• Isolation
• Durability
BASE:
• Basically Available
• Soft-state (state of system may change over time)
• Eventually consistent (asynchronous propagation)
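Asynchronous propagation can be sketched with a queue of pending updates: a write is applied to one replica immediately and reaches the others only when the queue is drained. The class and method names below are hypothetical, and the sketch ignores conflict resolution between concurrent writes.

```python
from collections import deque


class EventuallyConsistentStore:
    """Replicas that converge only after queued updates are delivered."""

    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = deque()  # updates not yet propagated

    def write(self, replica_id, key, value):
        """Apply locally, then queue the update for every other replica."""
        self.replicas[replica_id][key] = value
        for other in range(len(self.replicas)):
            if other != replica_id:
                self.pending.append((other, key, value))

    def read(self, replica_id, key):
        return self.replicas[replica_id].get(key)

    def propagate(self):
        """Deliver all queued updates (stands in for background replication)."""
        while self.pending:
            target, key, value = self.pending.popleft()
            self.replicas[target][key] = value


store = EventuallyConsistentStore(3)
store.write(0, "k", "v1")
assert store.read(1, "k") is None      # stale read: soft state
store.propagate()
assert store.read(1, "k") == "v1"      # replicas have converged
```

Between `write` and `propagate` the store stays available for reads on every replica, but different replicas may answer differently.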
Document databases
Graph database
http://nosql-database.org/
NoSQL Systems
Three most popular are:
HBase
Cassandra
MongoDB
How does NoSQL vary from RDBMS?
• Looser schema definition
• Applications written to deal with specific documents/ data
• Applications aware of the schema definition as opposed to the data
• Designed to handle distributed, large databases
• Trade offs:
• No strong support for ad hoc queries but designed for speed and
growth of database
• Query language through the API
• Relaxation of the ACID properties
Benefits of NoSQL
Elastic Scaling
• RDBMS scale up – bigger load, bigger server
• NoSQL scale out – distribute data across multiple hosts seamlessly
Big Data
• Huge increase in data
• RDBMS: capacity and constraints of data volumes at its limits
• NoSQL designed for big data
DBA Specialists
• RDBMS require highly trained experts to monitor the DB
• NoSQL require less management, with automatic repair and simpler data models
Benefits of NoSQL
Flexible data models
• Change management to schema for RDBMS has to be carefully managed
• NoSQL databases are more relaxed in structure of data
• Database schema changes do not have to be managed as one complicated change unit
• Application already written to address an amorphous schema
Economics
• RDBMS rely on expensive proprietary servers to manage data
• NoSQL: clusters of cheap commodity servers to manage the data and transaction volumes
• Cost per gigabyte or transaction/second for NoSQL can be lower than the cost for an RDBMS
Drawbacks of NoSQL
Support
• RDBMS vendors provide a high level of support to clients
  • Stellar reputation
• NoSQL systems are open source projects with startups supporting them
  • Reputation not yet established
Maturity
• RDBMS are mature products: means stable and dependable
  • Also means old, no longer cutting edge nor interesting
• NoSQL systems are still implementing their basic feature set
Drawbacks of NoSQL
Administration
• RDBMS administrator is a well-defined role
• NoSQL's goal: no administrator necessary; however, NoSQL still requires effort to maintain
Lack of Expertise
• Whole workforce of trained and seasoned RDBMS developers
• Still recruiting developers to the NoSQL camp
Analytics and Business Intelligence
• RDBMS designed to address this niche
• NoSQL designed to meet the needs of a Web 2.0 application, not designed for ad hoc query of the data
• Tools are being developed to address this need