
TECHNOLOGIES FOR HANDLING BIG DATA

Prepared by: Saidatul Rahah Hamidi


Overview
 Distributed File System
 Hadoop
 HDFS Architecture
 Programming Models for Big Data
 NoSQL Database
 Cloud Storage System
 Graph Data
Distributed File System
 Holds a large amount of data
 Clients are distributed across a network
 Network File System (NFS)
o Straightforward design
o Remote access to a single machine
o Constraints
Why is Hadoop able to compete?

Database vs. Hadoop

Database:
- Performance (tons of indexing, tuning, and data organization techniques)
- Features: provenance tracking, annotation management, ….
- Mainly for structured data

Hadoop:
- Scalability (petabytes of data, thousands of machines)
- Flexibility in accepting all data formats (no schema)
- Efficient and simple fault-tolerance mechanism
- Commodity, inexpensive hardware
- Used for structured, semi-structured, and unstructured data
Hadoop vs RDBMS

Feature          | RDBMS                                | Hadoop
Data Variety     | Mainly for structured data           | Used for structured, semi-structured and unstructured data
Data Storage     | Average-size data (GB)               | Used for large data sets (TB and PB)
Querying         | SQL language                         | HQL (Hive Query Language)
Schema           | Required on write (static schema)    | Required on read (dynamic schema)
Speed            | Reads are fast                       | Both reads and writes are fast
Cost             | License                              | Free
Use Case         | OLTP (online transaction processing) | Analytics (audio, video, logs, etc.), data discovery
Data Objects     | Works on relational tables           | Works on key/value pairs
Throughput       | Low                                  | High
Scalability      | Vertical                             | Horizontal
Hardware Profile | High-end servers                     | Commodity/utility hardware
Integrity        | High (ACID)                          | Low
Hadoop’s Developers

 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. “Hadoop” is named after Cutting’s son's toy elephant.
 The project was funded by Yahoo.
 2006: Yahoo gave the project to the Apache Software Foundation.

https://www.coursera.org/learn/hadoop/lecture/1BPj6/the-hadoop-zoo
What is Hadoop

 Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
 Large datasets  terabytes or petabytes of data
 Large clusters  hundreds or thousands of nodes
 Hadoop is an open-source implementation of Google MapReduce and GFS (Google File System)
 Hadoop is based on a simple programming model called MapReduce
 Hadoop is based on a simple data model: any data will fit
Apache Hadoop In Big Data

• Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.

• The Hadoop framework transparently provides applications with both reliability and data motion.

• Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.

• In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.

• Both MapReduce and HDFS are designed so that node failures are automatically handled by the framework.

http://hadoop.apache.org
Apache Hadoop In Big Data

List of modules:
• Hadoop Common: contains libraries and utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop MapReduce: a programming model for large-scale data processing.
Architecture (Hadoop)
https://dzone.com/articles/ecosystem-hadoop-animal-zoo-0
Hadoop Components / Tools
 Apache Avro: designed for communication between Hadoop nodes through data serialization
 Cassandra and HBase: non-relational databases designed for use with Hadoop
 Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop
 Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for analysis and exploration
 Pig Latin: a data-flow language and execution framework for parallel computation
 HBase: a database model built on top of Hadoop
 Flume: designed for large-scale data movement
 ZooKeeper: keeps all the parts coordinated and working together
https://opensource.com/life/14/8/intro-apache-hadoop-big-data
Main Properties of HDFS

 HDFS is the primary data storage system used by Hadoop applications
 Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
 Replication: each data block is replicated many times (the default is 3; see the sketch after this list)
 Failure: failure is the norm rather than the exception
 Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
 The Namenode consistently checks the Datanodes
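To make the replication property concrete, here is a minimal sketch of loading a file into HDFS and adjusting its replication factor from Python. It assumes a running Hadoop cluster with the standard `hdfs` command-line tool on the PATH; the file and directory names are hypothetical:

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Copy a local file into HDFS (paths are illustrative).
hdfs("-put", "weblogs.txt", "/data/weblogs.txt")

# Ask for 3 replicas of each block (3 is also the HDFS default),
# waiting (-w) until re-replication completes.
hdfs("-setrep", "-w", "3", "/data/weblogs.txt")

# The second column of the listing shows the replication factor.
print(hdfs("-ls", "/data/weblogs.txt"))
```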
HDFS is good for
 Very large files
 Streaming data access
 Commodity hardware

HDFS is not good for
 Low-latency data access
 Lots of small files
 Multiple writers, arbitrary file modifications
Architecture
Technologies
Traditional Data Management Architecture
New Data Management Architecture
Bigger Picture: Hadoop vs. Other Systems

Feature             | Distributed Databases                  | Hadoop
Computing Model     | Notion of transactions; a transaction is the unit of work; ACID properties, concurrency control | Notion of jobs; a job is the unit of work; no concurrency control
Data Model          | Structured data with known schema; read/write mode | Any data will fit in any format: (un)(semi)structured; read-only mode
Cost Model          | Expensive servers                      | Cheap commodity machines
Fault Tolerance     | Failures are rare; recovery mechanisms | Failures are common over thousands of machines; simple yet efficient fault tolerance
Key Characteristics | Efficiency, optimizations, fine-tuning | Scalability, flexibility, fault tolerance
Bigger Picture: Hadoop vs. Other Systems (continued)

 Cloud Computing
 A computing model where any computing infrastructure can run on the cloud
 Hardware & software are provided as remote services
 Elastic: grows and shrinks based on the user’s demand
 Example: Amazon EC2
Storage???
 Storing of the Data
 The Memory Hierarchy
 Redundant Array of Independent Disks (RAID)
 Disk Space Management
An Example Memory Hierarchy
(Levels near the top are smaller, faster, and costlier per byte; levels near the bottom are larger, slower, and cheaper per byte.)

L0: CPU registers hold words retrieved from the L1 cache
L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache
L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory
L3: main memory (DRAM) holds disk blocks retrieved from local disks
L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers
L5: remote secondary storage (distributed file systems, Web servers)
RAID
 Redundant Array of Inexpensive Disks (RAID)
 RAID arrays write data across multiple disks as a way of storing data redundantly (to achieve fault tolerance) or to stripe data across multiple disks to get better performance than any one disk could provide on its own.
 Typically, a RAID array will appear to the operating system as a single disk.

- Idea: use many disks in parallel to increase storage bandwidth and improve reliability
- Files are striped across disks
- Each stripe portion is read/written in parallel
- Bandwidth increases with more disks

 Problems:
- Small writes (writes smaller than a full stripe): need to read the entire stripe, apply the small write, then write the entire segment back out to the disks
- Reliability: more disks increase the chance of media failure (lower MTBF)
- Turn the reliability problem into a feature: use one disk to store parity data, the XOR of all data blocks in the stripe
- Any data block can then be recovered from all the others plus the parity block, hence "redundant" (see the sketch after this list)
- Overhead
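The parity idea above can be shown in a few lines. This is a minimal sketch, assuming fixed-size blocks represented as Python bytes; the block contents are invented for illustration:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A stripe of three data blocks (contents are illustrative).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing disk 1: XOR the surviving data blocks with the
# parity block to reconstruct the lost block.
recovered = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert recovered == data_blocks[1]
```

Because XOR is its own inverse, any single missing block equals the XOR of everything else in the stripe, which is exactly the property RAID 5 relies on.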
Common RAID Levels
 RAID 0: Striping
- Splits data among two or more disks
- Good for random access (but no reliability)

 RAID 1: Mirroring
- Two disks, write data to both (expensive: 1x storage overhead)

 RAID 5: Floating parity
- Parity blocks for different stripes are written to different disks
- No single parity disk, hence no bottleneck at that disk
- "Distributed parity" is the key phrase here
- An ideal combination of good performance, good fault tolerance, and high capacity and storage efficiency

 RAID 10: Striping plus mirroring
- Higher bandwidth, but still a large storage overhead
- Good performance and good failover handling
- Also called "nested RAID"
JBOD (Just a Bunch of Disks)

 The disks in a JBOD array can function as their own individual volumes or can be connected, or spanned, to form a single logical volume.
 Just a disk
 No replication
 No data integrity
 Less cost
 Commodity hardware
RAID vs JBOD
Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.
searchaws.techtarget.com/definition/data-lake
Data Lake
Data Lake vs Data Warehouse
Programming Models
Programming Models for Big Data
MapReduce: Word Count
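Word count is the canonical MapReduce example. Below is a minimal sketch of the map, shuffle, and reduce phases in plain Python, with no Hadoop dependency, so the data flow is visible; on a real cluster, Hadoop would run many map and reduce tasks in parallel across nodes:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

documents = ["big data needs big clusters", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, ...}
```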
NoSQL
 NoSQL is not “No-SQL”  Not Only SQL
 It does not aim to provide the ACID properties
 It originated as “no SQL”, though
 Scalability is horizontal, i.e., tuples (rows) can be placed across distributed machines
 Flexibility to model any kind of data
 A natural way of modeling data
 Distribution support is built-in
Taxonomy of NoSQL

• Key-value
• Graph database
• Document-oriented (key-value and document models are contrasted in the sketch below)
  • Popular doc formats: XML, JSON, BSON, YAML
• Column family
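To make the first categories concrete, here is a small sketch of the same record modeled as an opaque key-value pair and as a document; all names and fields are invented for illustration:

```python
import json

# Key-value: the store sees only an opaque key and an opaque value.
# The client must fetch the whole value and parse it itself.
kv_store = {}
kv_store["user:1001"] = json.dumps({"name": "Aina", "city": "Shah Alam"})
print(json.loads(kv_store["user:1001"])["name"])

# Document-oriented: the store understands the document's structure
# (here JSON-like), so individual fields can be queried and indexed.
document = {
    "_id": 1001,
    "name": "Aina",
    "city": "Shah Alam",
    "orders": [{"item": "laptop", "qty": 1}],
}
print(document["orders"][0]["item"])
```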
CAP theorem for NoSQL
What the CAP theorem really says:
• If you cannot limit the number of faults, and requests can be directed to any server, and you insist on serving every request you receive, then you cannot possibly be consistent (Eric Brewer, 2001)

How it is interpreted:
• You must always give something up: consistency, availability, or tolerance to failure and reconfiguration
Theory of NoSQL: CAP

GIVEN:
• Many nodes
• Nodes contain replicas of partitions of the data

• Consistency
  • All replicas contain the same version of the data
  • A client always has the same view of the data, no matter which node it reaches
• Availability
  • The system remains operational even with failing nodes
  • All clients can always read and write
• Partition tolerance
  • Multiple entry points
  • The system remains operational on a system split (communication malfunction)
  • The system works well across physical network partitions

CAP Theorem: satisfying all three at the same time is impossible
Available, Partition-Tolerant (AP) systems achieve "eventual consistency" through replication and verification.

Consistent, Available (CA) systems have trouble with partitions and typically deal with them via replication.

Consistent, Partition-Tolerant (CP) systems have trouble with availability while keeping data consistent across partitioned nodes.

http://blog.nahurst.com/visual-guide-to-nosql-systems
BASE properties
 Basically Available: the system guarantees availability
 Soft State: the state of the system is soft; it may change without input in order to maintain consistency
 Eventually Consistent: data will become consistent eventually, without any interim perturbation (see the sketch below)
 Sacrifices consistency
 A counterpoint to ACID
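A minimal sketch of BASE behavior, under the assumption that a write lands on one replica and is propagated to the others asynchronously; reads may briefly see stale data (soft state) before all replicas converge (eventual consistency):

```python
import threading
import time

replicas = [{"x": 0}, {"x": 0}, {"x": 0}]  # three replicas of one record

def write(key, value):
    """Write to one replica now; propagate to the rest in the background."""
    replicas[0][key] = value

    def propagate():
        time.sleep(0.1)  # simulated replication lag
        for replica in replicas[1:]:
            replica[key] = value

    threading.Thread(target=propagate).start()

write("x", 42)
print([r["x"] for r in replicas])  # soft state: likely [42, 0, 0] just after the write
time.sleep(0.2)
print([r["x"] for r in replicas])  # eventually consistent: [42, 42, 42]
```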
RDB ACID to NoSQL BASE

ACID: Atomicity, Consistency, Isolation, Durability
BASE: Basically Available, Soft-state (the state of the system may change over time), Eventually consistent (asynchronous propagation)

Pritchett, D.: BASE: An Acid Alternative (queue.acm.org/detail.cfm?id=1394128)


Types of NoSQL Data Stores
 Four main types of NoSQL data stores:
 Key-value stores
 Columnar families (Bigtable-style systems)
 Document databases
 Graph databases

 http://nosql-database.org/
NoSQL Systems
 The three most popular are (a short MongoDB example follows below):
 HBase
 Cassandra
 MongoDB
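As a taste of one of these systems, here is a minimal document-store sketch using MongoDB's Python driver, pymongo. It assumes a MongoDB server running locally on the default port; the database, collection, and field names are made up for illustration:

```python
from pymongo import MongoClient  # pip install pymongo

# Assumes a local MongoDB server; the database and collection
# names ("demo_db", "users") are hypothetical.
client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

# Documents in one collection need not share a schema.
users.insert_one({"name": "Aina", "city": "Shah Alam"})
users.insert_one({"name": "Omar", "interests": ["hadoop", "nosql"]})

# Queries go through the driver API rather than SQL.
print(users.find_one({"name": "Omar"}))
```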
How does NoSQL vary from RDBMS?
• Looser schema definition
• Applications written to deal with specific documents/data
• Applications are aware of the schema definition, as opposed to the data
• Designed to handle distributed, large databases
• Trade-offs:
• No strong support for ad hoc queries, but designed for speed and growth of the database
• Query language through the API
• Relaxation of the ACID properties
Benefits of NoSQL

Elastic Scaling
• RDBMS scale up: bigger load, bigger server
• NoSQL scales out: distribute data across multiple hosts seamlessly

Big Data
• Huge increases in data volume push RDBMS capacity and data-volume constraints to their limits
• NoSQL is designed for big data

DBA Specialists
• RDBMS require highly trained experts to monitor the DB
• NoSQL requires less management: automatic repair and simpler data models
Benefits of NoSQL

Flexible Data Models
• Schema change management for an RDBMS has to be handled carefully
• NoSQL databases are more relaxed about the structure of data
• Database schema changes do not have to be managed as one complicated change unit
• Applications are already written to address an amorphous schema

Economics
• RDBMS rely on expensive proprietary servers to manage data
• NoSQL uses clusters of cheap commodity servers to manage the data and transaction volumes
• Cost per gigabyte or per transaction/second for NoSQL can be lower than the cost for an RDBMS
Drawbacks of NoSQL

Support
• RDBMS vendors provide a high level of support to clients
• Stellar reputations
• NoSQL systems are open source projects with startups supporting them
• Their reputations are not yet established

Maturity
• RDBMS are mature products: stable and dependable
• That also means old: no longer cutting edge, nor interesting
• NoSQL systems are still implementing their basic feature set
Drawbacks of NoSQL

Administration
• The RDBMS administrator is a well-defined role
• NoSQL's goal is to need no administrator; however, NoSQL still requires effort to maintain

Lack of Expertise
• There is a whole workforce of trained and seasoned RDBMS developers
• The NoSQL camp is still recruiting developers

Analytics and Business Intelligence
• RDBMS are designed to address this niche
• NoSQL is designed to meet the needs of a Web 2.0 application, not for ad hoc querying of the data
• Tools are being developed to address this need
