
Bigtable: A Distributed Storage System for Structured Data


Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon
[email protected]

Fall 2012: CSE 704 Web-scale Data Management (10/22/2012)
Motivation and Design Goal
• Distributed storage system for structured data
  – Scalability
    • Petabytes of data across thousands of commodity machines
  – Wide applicability
    • Workloads ranging from throughput-oriented batch jobs to latency-sensitive serving
  – High performance
  – High availability



Data Model

Data Model
• Not a full relational data model
• Provides a simple data model
  – Supports dynamic control over data layout
  – Allows clients to reason about the locality properties of the data



Data Model – A Big Table
• A Table in Bigtable is a:
– Sparse
– Distributed
– Persistent
– Multidimensional
– Sorted map



Data Model
• Data is indexed by row key, column key, and timestamp
• Values are treated as uninterpreted strings
  – (row:string, column:string, time:int64) → string (a toy sketch of this map is given below)
• Data locality can be controlled through careful choices in the schema
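A minimal C++ sketch of this map, using hypothetical type names, just to make the shape of the model concrete (the real system distributes and persists this map rather than holding it in a single std::map):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <tuple>

    // (row key, column key, timestamp) -> uninterpreted string value
    using CellKey = std::tuple<std::string, std::string, std::int64_t>;
    using BigtableModel = std::map<CellKey, std::string>;

    int main() {
      BigtableModel webtable;
      // Two timestamped versions of the "contents:" column for one row.
      webtable[CellKey{"com.cnn.www", "contents:", 3}] = "<html>... v3 ...";
      webtable[CellKey{"com.cnn.www", "contents:", 5}] = "<html>... v5 ...";
      // A column in the "anchor" family, keyed as family:qualifier.
      webtable[CellKey{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
      return 0;
    }

Note that Bigtable stores a cell's versions in decreasing timestamp order so the most recent version is read first; the sketch above ignores that detail.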



Data Model
• Rows
  – Data is maintained in lexicographic order by row key
  – Tablet: a range of rows with consecutive keys
    • The unit of distribution and load balancing
• Columns
  – Grouped into column families
    • Column keys are named family:qualifier
• Cells
• Timestamps
Data Model – WebTable Example

A large collection of web pages and related information



Data Model – WebTable Example

Row key – Bigtable maintains data in lexicographic order by row key
Tablet – a group of rows with consecutive keys; the unit of distribution and load balancing



Data Model – WebTable Example

Column family – the unit of access control



Data Model – WebTable Example

Column key – specified as “family:qualifier”



Data Model – WebTable Example

A column can be added to a column family once the family has been created
Data Model – WebTable Example

Cell – the storage referenced by a particular row key, column key, and timestamp
Data Model – WebTable Example

Different cells in a table can contain multiple versions of the same data, indexed by timestamp



API

API
• Write or Delete values in Bigtable
• Look up values from individual rows
• Iterate over a subset of the data in a table



API – Update a Row
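The next few slides annotate the short C++ write example from the Bigtable paper. The snippet is reproduced below (from memory, so treat the exact identifiers as approximate) so that the captions on the following slides have code to point at:

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);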



API – Update a Row

Opens a Table



API – Update a Row

We’re going to mutate the row



API – Update a Row

Store a new item under the column key “anchor:www.c-span.org”



API – Update a Row

Delete an item under the column key “anchor:www.abc.com”



API – Update a Row

Apply performs an atomic mutation to the row



API – Iterate over a Table

Create a Scanner instance
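The next few slides annotate the paper's C++ scan example, reproduced below (again from memory, so treat the exact identifiers as approximate). It prints all anchors stored for a single row:

    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }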



API – Iterate over a Table

Access “anchor” column family



API – Iterate over a Table

Specify “return all versions”



API – Iterate over a Table

Specify a row key



API – Iterate over a Table

Iterate over the entries returned by the scan


API – Other Features
• Single-row transactions
• Execution of client-supplied scripts in the address space of the servers
• Can be used as an input source and as an output target for MapReduce jobs



A Typical Google Machine



A Google Cluster





Building Blocks
• Chubby
  – A highly available and persistent distributed lock service
• GFS
  – Stores commit logs and data files
• SSTable
  – Google's immutable file format for the data files
  – A persistent, ordered, immutable map from keys to values (see the sketch below)
  – An open-source format in the same spirit: http://code.google.com/p/leveldb/
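A minimal C++ sketch of the SSTable abstraction described above (hypothetical names, not Google's actual interface): point lookups plus ordered iteration over a key range.

    #include <memory>
    #include <optional>
    #include <string>

    // An SSTable is a persistent, ordered, immutable map from string keys to
    // string values: once written it is never modified, only read or deleted.
    class SSTableIterator {
     public:
      virtual ~SSTableIterator() = default;
      virtual bool Done() const = 0;
      virtual void Next() = 0;
      virtual const std::string& Key() const = 0;
      virtual const std::string& Value() const = 0;
    };

    class SSTable {
     public:
      virtual ~SSTable() = default;
      // Look up the value stored under `key`, if any.
      virtual std::optional<std::string> Get(const std::string& key) const = 0;
      // Iterate, in key order, over all entries with start <= key < limit.
      virtual std::unique_ptr<SSTableIterator> Scan(const std::string& start,
                                                    const std::string& limit) const = 0;
    };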



Chubby
• Highly available and persistent distributed lock service
  – Five active replicas, one of which is elected as the master
  – Uses Paxos to keep the replicas consistent in the face of failure
  – Provides a namespace that consists of directories and small files



Implementation
• Client Library
• Master
– one and only one!
• Tablet Servers
– Many



Implementation - Master
• Responsible for assigning tablets to tablet servers
  – Detects the addition and removal of tablet servers
  – Balances tablet-server load
  – Garbage-collects files in GFS
• Handles schema changes (e.g., table and column family creation)
• Single-master design, as in GFS



Tablet Server
• Manages a set of tablets
• Handles read and write requests to the tablets
• Splits tablets that have grown too large



How Does a Client Find a Tablet?
• A three-level hierarchy, analogous to a B+-tree, stores tablet location information
  – A file in Chubby stores the location of the root tablet
  – The root tablet holds the locations of all tablets of the METADATA table
  – Each METADATA tablet holds the locations of a set of user tablets
• The client library caches tablet locations and walks back up the hierarchy on a miss or a stale entry


Tablet Assignment
• Each tablet is assigned to at most one tablet server at a time
• When a tablet is unassigned and a tablet server with sufficient room is available, the master assigns the tablet by sending it a tablet load request
• Bigtable uses Chubby to keep track of tablet servers



Tablet Assignment
• Detecting a tablet server that is no longer serving its tablets (a sketch of this check follows below)
  – The master periodically asks each tablet server for the status of its Chubby lock
  – If a tablet server reports that it has lost its lock, or if the master cannot reach the server, the master attempts to acquire an exclusive lock on the server's file in Chubby
  – If the lock acquisition succeeds, Chubby is alive, so the tablet server itself must have a problem
  – The master then deletes the server's file in Chubby to ensure the tablet server can never serve again
  – Finally, the master moves all the tablets that were previously assigned to that server into the set of unassigned tablets
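A compact sketch of that check, with hypothetical stand-in types (this is not Chubby's real API, only the control flow above restated as code):

    #include <string>
    #include <vector>

    struct TabletServerStatus { bool reachable; bool holds_chubby_lock; };

    // Hypothetical stand-in for the Chubby operations the master relies on.
    struct Chubby {
      bool TryAcquireExclusiveLock(const std::string& server_file) { return false; }  // stub
      void Delete(const std::string& server_file) {}                                  // stub
    };

    // One round of the master's health check for a single tablet server.
    void CheckTabletServer(const TabletServerStatus& status, Chubby& chubby,
                           const std::string& server_file,
                           const std::vector<std::string>& its_tablets,
                           std::vector<std::string>& unassigned) {
      if (status.reachable && status.holds_chubby_lock) return;  // server is healthy
      if (!chubby.TryAcquireExclusiveLock(server_file)) return;  // Chubby unreachable, retry later
      // Chubby is fine, so the tablet server is the problem: retire it for good.
      chubby.Delete(server_file);
      unassigned.insert(unassigned.end(), its_tablets.begin(), its_tablets.end());
    }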



Tablet Assignment
• When a master starts, it:
  – Grabs a unique master lock in Chubby (preventing concurrent master instantiations)
  – Scans the servers directory in Chubby to find the live tablet servers
  – Communicates with every live tablet server to discover the current tablet assignments
  – Scans the METADATA table and adds any tablet that is not already assigned to the set of unassigned tablets





Tablet Serving
• Memtable
  – A sorted in-memory buffer holding the recently committed updates
  – Maintains the updates on a row-by-row basis
  – Each row is copy-on-write to maintain row-level consistency
  – Older updates are stored in a sequence of SSTables (in GFS)





Tablet Serving - Write
• Write operation (sketched below)
  – The server checks that the request is well-formed and that the sender is authorized
  – A valid mutation is written to the commit log (group commit is used for throughput)
  – After the write has been committed, its contents are inserted into the memtable
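A toy sketch of that write path, with hypothetical stand-in types (not Google's interfaces):

    #include <string>

    struct Mutation  { std::string row, column, value; };
    struct CommitLog { void Append(const Mutation&) { /* group-committed to a log file in GFS */ } };
    struct Memtable  { void Insert(const Mutation&) { /* sorted, in-memory, copy-on-write rows */ } };

    // Validate, make the mutation durable in the commit log, then make it
    // visible to reads by inserting it into the memtable.
    bool ApplyWrite(const Mutation& m, CommitLog& log, Memtable& mem) {
      if (m.row.empty() || m.column.empty()) return false;  // stand-in for the well-formedness check
      log.Append(m);   // 1. durable
      mem.Insert(m);   // 2. visible to subsequent reads
      return true;
    }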





Tablet Serving - Read
• Read operation (a toy sketch of the merged view follows below)
  – The server checks that the request is well-formed and that the sender is authorized
  – A valid operation is executed on a merged view of the sequence of SSTables and the memtable
  – The merged view can be formed efficiently because the SSTables and the memtable are lexicographically sorted data structures
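A toy sketch of the merged view, with hypothetical types (the real implementation merges iterators lazily rather than materializing a map):

    #include <map>
    #include <string>
    #include <vector>

    // Stand-in: each source (the memtable or one SSTable) is a sorted map from
    // cell key to value, so a read is a merge of sorted maps.
    using SortedSource = std::map<std::string, std::string>;

    // Sources are passed newest first; std::map::insert keeps the entry that is
    // already present, so the newest value for each key wins.
    SortedSource MergedView(const std::vector<const SortedSource*>& newest_first) {
      SortedSource merged;
      for (const SortedSource* source : newest_first) {
        merged.insert(source->begin(), source->end());
      }
      return merged;
    }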





Tablet Serving - Recover
• Recovering a tablet
  – The tablet server reads the tablet's metadata from the METADATA table
  – The metadata contains the list of SSTables that comprise the tablet and a set of redo points (pointers into the commit logs)
  – The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points



Compaction
• Minor compaction (a toy sketch of the trigger follows below)
  – When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS
• Major compaction
  – Rewrites all of a tablet's SSTables into exactly one SSTable, discarding deleted data along the way
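A toy sketch of the minor-compaction trigger, with hypothetical stand-in types (the threshold value is illustrative, not the real one):

    #include <cstddef>

    struct Memtable {
      std::size_t SizeBytes() const { return 0; }  // stub
    };

    struct Tablet {
      Memtable* memtable_ = new Memtable();
      static constexpr std::size_t kThresholdBytes = 64u << 20;  // illustrative 64 MiB

      void MaybeMinorCompact() {
        if (memtable_->SizeBytes() < kThresholdBytes) return;
        Memtable* frozen = memtable_;   // freeze the current memtable
        memtable_ = new Memtable();     // incoming writes go to a fresh memtable
        WriteAsSSTableToGFS(frozen);    // persist the frozen memtable as an SSTable
        delete frozen;
      }

      void WriteAsSSTableToGFS(Memtable*) { /* write an SSTable file into GFS */ }
    };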



Compaction – diagram walkthrough
• Writes are appended to the tablet's commit log in GFS and applied to the memtable in memory; the tablet's older data lives in a set of SSTables in GFS
• Threshold reached: the memtable has grown past the size threshold
• Minor compaction: the frozen memtable is written out as a new SSTable in GFS
• A new, empty memtable receives the subsequent writes
• Major compaction: the tablet's SSTables are merged and rewritten into a single SSTable


Schema Management
• Bigtable schemas are stored in Chubby
• The master updates a schema by rewriting the corresponding schema file in Chubby



Optimization
• Locality groups
  – Defined by clients: groups of column families that are typically accessed together
  – An abstraction that enables clients to control their data's storage layout
  – A separate SSTable is generated for each locality group in each tablet during compaction
  – A locality group can be declared to be in-memory



Optimization
• Compression
– Clients can control whether the SSTables for a locality group are compressed



Optimization
• Two-level caching for read performance
  – Scan cache (higher level): caches the key/value pairs returned by the SSTable interface to the tablet server code
  – Block cache (lower level): caches SSTable blocks read from GFS



Optimization
• Commit-log implementation
  – A single commit log per tablet server, with mutations for different tablets co-mingled in the same file
  – Recovery example: suppose a tablet server hosting 100 tablets fails and its tablets are reassigned to 100 other machines
    • Naively, each of those 100 machines would have to read the full commit log to pick out its tablet's mutations
    • Instead, the log is first sorted by <table, row name, log sequence number> so each recovering server reads one contiguous range (a small sketch of the sort key follows below)
  – Writing commit logs: two log-writer threads, each with its own log file, so logging can switch files if writes to the active file become slow
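A small sketch of that sort key (a hypothetical struct, only to show why the sort makes each tablet's mutations contiguous):

    #include <cstdint>
    #include <string>
    #include <tuple>

    // Sorting log entries by (table, row, sequence number) groups all mutations
    // for one tablet's rows together, so a recovering tablet server performs one
    // sequential read instead of scanning the entire commit log.
    struct LogEntryKey {
      std::string table;
      std::string row;
      std::int64_t sequence_number;

      bool operator<(const LogEntryKey& other) const {
        return std::tie(table, row, sequence_number) <
               std::tie(other.table, other.row, other.sequence_number);
      }
    };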



Performance Evaluation
• Sequential writes/reads
  – Row keys named 0 to R-1, partitioned into 10N equal-sized ranges (N = number of tablet servers)
  – A single string value written under each row key
  – About 1 GB of data per tablet server
• Scan
  – Uses the Bigtable scan API rather than issuing a separate lookup per row
• Random writes/reads
  – Similar to the sequential benchmarks, but the row key is hashed first
• Random reads (mem)
  – About 100 MB of data per tablet server, with the locality group marked as in-memory



Single Tablet Server Performance

Aggregate Throughput

Real Applications



Lessons Learned
• Failures: large distributed systems are vulnerable to many more kinds of failure than just network partitions and fail-stop crashes
• Delay adding new features until it is clear how they will be used
• The importance of proper system-level monitoring
• The value of simple designs



Acknowledgement
• Jeff Dean, “Handling Large Datasets at Google: Current Systems and Future Directions”

