Cloud Data Storage
Introduction
• Relational database
• Default data storage and retrieval mechanism since the 1980s
• Efficient in transaction processing
• Examples: System R, Ingres
• Replaced hierarchical and network databases
• For scalable web search service:
• Google File System (GFS): massively parallel and fault-tolerant distributed file system
• BigTable
• Organizes data
• Similar to a column-oriented database (e.g. Vertica)
• MapReduce
• Parallel programming paradigm
Introduction Contd…
• Suitable for:
• Large volume massively parallel text processing
• Enterprise analytics
• Similar to BigTable data model are:
• Google App Engine's Datastore
• Amazon's SimpleDB
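The MapReduce paradigm named above can be illustrated with a small single-process sketch. It is not Google's or Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Group emitted values by key, as a framework would do between the phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each map call depends only on its own document, so the calls could
    # run in parallel; here they run one after another for simplicity.
    mapped = [map_phase(doc) for doc in documents]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the cloud stores data", "the data is large"])
```

Because no map call shares state with another, the same structure scales to large-volume text processing by assigning documents to different workers.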
Relational Databases
• Users/application programs interact with an RDBMS through SQL
• RDBMS parser:
• Transforms queries into memory and disk-level operations
• Optimizes execution time
• Disk-space management layer:
• Stores data records on pages of contiguous memory blocks
• Pages are fetched from disk into memory as requested, using pre-fetching and page replacement policies
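The way pages are fetched on request and replaced when memory is full can be sketched with a small buffer pool. The sketch assumes a least-recently-used (LRU) replacement policy, which the slides do not specify, and omits pre-fetching; `BufferPool` and `read_page_from_disk` are hypothetical names.

```python
from collections import OrderedDict


class BufferPool:
    """Keeps at most `capacity` pages in memory, evicting the least recently used."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # hypothetical disk reader
        self.pages = OrderedDict()  # page_id -> page contents, oldest first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self.pages:
            # Hit: mark the page as most recently used.
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            # Miss with a full pool: release the least recently used page.
            self.pages.popitem(last=False)
        self.disk_reads += 1
        page = self.read_page_from_disk(page_id)
        self.pages[page_id] = page
        return page


pool = BufferPool(capacity=2, read_page_from_disk=lambda pid: f"contents of page {pid}")
for pid in [1, 2, 1, 3, 2]:
    pool.get_page(pid)
```

Only the second request for page 1 is served from memory in this access pattern; the other four requests go to disk.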
Relational Databases Contd…
• Database file system layer:
• Independent of OS file system
• Reason:
• To have full control on retaining or releasing
a page in memory
• Files used by the DB may span multiple disks to handle large storage
• Uses parallel I/O systems, viz. RAID disk arrays or multi-processor clusters
Data Storage Techniques
• Row-oriented storage
• Optimal for write-oriented operations, viz. transaction processing applications
• Relational records: stored on contiguous disk pages
• Accessed through indexes (primary index) on specified columns
• Example: B+-tree-like storage
• Column-oriented storage
• Efficient for data-warehouse workloads
• Aggregation of measure columns needs to be performed based on values from dimension columns
• A projection of a table is stored sorted by dimension values
• Requires multiple "join indexes" if different projections are to be indexed in sorted order
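The difference between the two layouts can be sketched with in-memory Python structures standing in for disk pages. The table, its column names, and the function names are made up for illustration: `region` plays the role of a dimension column and `sales` the role of a measure column.

```python
# Row-oriented layout: each record is stored whole, which suits writes
# that insert or update one complete record at a time.
rows = [
    {"region": "east", "product": "disk", "sales": 100},
    {"region": "west", "product": "disk", "sales": 250},
    {"region": "east", "product": "tape", "sales": 50},
]

# Column-oriented layout: each column is stored separately, so an
# aggregation reads only the columns it needs.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sales_by_region_rows(rows):
    """Aggregate by scanning full records (every column is touched)."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]
    return totals


def sales_by_region_columns(columns):
    """Aggregate by scanning only the dimension and measure columns."""
    totals = {}
    for region, sales in zip(columns["region"], columns["sales"]):
        totals[region] = totals.get(region, 0) + sales
    return totals


row_totals = sales_by_region_rows(rows)
column_totals = sales_by_region_columns(columns)
```

Both functions return the same totals; the column version never reads the `product` column, which is the saving a data-warehouse workload benefits from.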
Data Storage Techniques Contd…
Parallel Database Architectures
• Shared memory
• Suitable for servers with multiple CPUs
• Memory address space is shared and managed by a symmetric multi-processing (SMP) operating system
• SMP: schedules processes in parallel, exploiting all the processors
• Shared nothing
• Cluster of independent servers, each with its own disk space
• Connected by a network
• Shared disk
• Hybrid architecture
• Independent server clusters share storage through high-speed network storage, viz. NAS (network-attached storage) or SAN (storage area network)
• Clusters are connected to storage via standard Ethernet, or faster Fibre Channel or InfiniBand connections
Parallel Database Architecture Contd…
Advantages of Parallel DB over Relational DB
• Efficient execution of SQL queries by
exploiting multiple processors
• For shared nothing architecture:
• Tables are partitioned and distributed across
multiple processing nodes
• SQL optimizer handles distributed joins
• Distributed two-phase commit locking for transaction
isolation between processors
• Fault tolerant
• System failure handled by transferring control to
“stand-by” system [for transaction processing]
• Restoring computations [for data warehousing
applications]
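The partitioning of tables across processing nodes in a shared-nothing architecture can be sketched as follows. This is a toy model, not the behaviour of any particular product: the nodes are plain Python lists, rows are assigned by key modulo the node count, and the names `partition`, `local_sum`, and `parallel_sum` are hypothetical.

```python
NUM_NODES = 3  # assumed cluster size, for illustration only


def partition(rows, key, num_nodes=NUM_NODES):
    """Distribute rows across nodes by the value of `key` modulo the node count."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[key] % num_nodes].append(row)
    return nodes


def local_sum(node_rows, column):
    """Work done independently on one node, using only its own partition."""
    return sum(row[column] for row in node_rows)


def parallel_sum(nodes, column):
    """Combine per-node partial results into the final answer."""
    # In a real cluster the local sums would be computed concurrently.
    return sum(local_sum(node_rows, column) for node_rows in nodes)


orders = [{"order_id": i, "amount": 10 * i} for i in range(1, 7)]
nodes = partition(orders, "order_id")
total = parallel_sum(nodes, "amount")
```

Each node scans only its own partition, so the scan cost per node shrinks as nodes are added; only the small partial results cross the network.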
Advantages of Parallel DB over Relational DB Contd…
• Examples of databases capable of handling parallel processing:
• Traditional transaction processing databases: Oracle, DB2, SQL Server
• Data warehousing databases: Netezza, Vertica, Teradata
Cloud File Systems
• Google File System (GFS)
• Designed to manage relatively large files using a
very large distributed cluster of commodity servers
connected by a high-speed network
• Handles:
• Failures even during reading or writing of individual files
• Fault tolerant: a necessity
• $P(\text{system failure}) = 1 - (1 - p)^N \to 1$ for large $N$, where $p$ is the probability of a single component failing and $N$ is the number of components
• Supports parallel reads, writes and appends by multiple simultaneous client programs
• Hadoop Distributed File System (HDFS)
• Open-source implementation of the GFS architecture
• Available on the Amazon EC2 cloud platform
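The failure-probability argument given for GFS above can be checked numerically. The sketch assumes components fail independently with the same probability; the value 0.001 is an illustrative assumption, not a measured figure.

```python
def system_failure_probability(p_component, n_components):
    """Probability that at least one of N independent components fails."""
    return 1 - (1 - p_component) ** n_components


# Illustrative per-component failure probability (an assumption).
P_COMPONENT = 0.001

for n in (1, 100, 10_000):
    probability = system_failure_probability(P_COMPONENT, n)
    print(f"N = {n:>6}: P(system failure) = {probability:.5f}")
```

Even with a small per-component probability, the result approaches 1 as the number of components grows, which is why fault tolerance is treated as a necessity rather than an option.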
GFS Architecture
[Figure: GFS architecture, showing the Master (GFS) / Name node (HDFS)]
GFS Architecture Contd…
• A single master controls the file namespace
• Large files are broken up into chunks (GFS) or blocks (HDFS)
• Typical size of each chunk: 64 MB
• Stored on commodity (Linux) servers called chunk servers (GFS) or data nodes (HDFS)
• Each chunk is replicated three times, on different:
• Physical racks
• Network segments
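The chunking and replication described above can be sketched as follows. The round-robin placement policy, the rack and server names, and the function names are assumptions made for illustration; the actual GFS and HDFS placement logic is more involved.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # typical chunk size: 64 MB
REPLICAS = 3                   # each chunk is stored three times


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return an (offset, length) pair for each chunk of a file."""
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def place_replicas(chunk_index, servers_by_rack, replicas=REPLICAS):
    """Pick one server on each of `replicas` different racks.

    Round-robin over racks is a hypothetical policy, chosen only so that
    no two replicas of a chunk share a rack.
    """
    racks = sorted(servers_by_rack)
    if len(racks) < replicas:
        raise ValueError("need at least as many racks as replicas")
    chosen = []
    for i in range(replicas):
        rack = racks[(chunk_index + i) % len(racks)]
        servers = servers_by_rack[rack]
        chosen.append(servers[chunk_index % len(servers)])
    return chosen


# Hypothetical cluster: four racks with two chunk servers each.
servers_by_rack = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
    "rack-d": ["d1", "d2"],
}

chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
placement = {i: place_replicas(i, servers_by_rack) for i in range(len(chunks))}
```

Placing the replicas on different racks means that losing one rack, or the network segment serving it, still leaves two copies of every chunk reachable.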
Read Operation in GFS