26 Distributed DBMS and NOSQL
Introduction to Databases
CompSci 316 Spring 2017
Announcements (Mon., Apr. 24)
• Homework #4 due today (Monday, April 24, 11:55 pm)
• Project
• final report draft due today -- Monday, April 24, 11:59 pm
• code due on Wednesday -- April 26, 11:59 pm
• See all announcements about project report and demo on piazza
• Please bring a computer to class on Wednesday
• We will take a 5-10 minute break to fill out course evaluations, as
advised by the Office of Assessments
• Google Cloud code
• Please redeem your code asap, by May 11
• Final Exam
• May 2 (next Tuesday), 2-5 pm, in class
• Everything covered in the class (up to last lecture) is included
• If you need special accommodation, please email me
Announcements (Mon., Apr. 24)
• Final Project Report
• should be “comprehensive”
• if you had more material in MS1 or MS2, please include it
• not the “work in progress” part
Architecture
Data Storage
Query Execution
Transactions
Two desired properties and recent trends
• Data is stored at several sites, each managed by a DBMS that can run
independently
1. Distributed Data Independence
• Users should not have to know where data is located
2. Distributed Transaction Atomicity
• Users should be able to write transactions accessing multiple sites just
like local transactions
• These two properties are in general desirable, but not always efficiently
achievable
• e.g. when sites are connected by a slow long-distance network
• Sometimes not even desirable for globally distributed sites
• too much administrative overhead in making the location of data
transparent (not visible to the user)
• Therefore not always supported
• Users have to be aware of where data is located
Distributed DBMS Architecture
• Three alternative approaches
1. Client-Server
• One or more clients (e.g. personal computers) and one or more
server processes (e.g. a mainframe)
• Clients are responsible for user interfaces; the server manages
data and executes queries
2. Collaborating Server
• Queries can be submitted and can span multiple sites
• No distinction between clients and servers
3. Middleware
• need just one db server (called middleware) capable of
managing queries and transactions spanning multiple servers
• the remaining servers can handle only the local queries and
transactions
• can integrate legacy systems with limited flexibility and power
Storing Data in a Distributed DBMS
• Vertical:
• Identified by projection queries
• Typically unique TIDs added to each tuple
• TIDs replicated in each fragment
• Ensures that we have a Lossless Join
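A minimal sketch of the vertical fragmentation idea above, assuming a small made-up Sailors relation (the rows and the helper names `vertical_fragment` and `rejoin` are illustrative, not from the slides). Each fragment keeps the TID, so joining the fragments on TID gives back the original tuples:

```python
# Hypothetical Sailors tuples; every tuple carries a unique TID.
sailors = [
    {"tid": "T1", "sid": 22, "sname": "dustin", "rating": 4, "age": 45.0},
    {"tid": "T2", "sid": 31, "sname": "lubber", "rating": 5, "age": 55.5},
    {"tid": "T3", "sid": 58, "sname": "rusty", "rating": 9, "age": 35.0},
]


def vertical_fragment(rows, columns):
    """Project rows onto `columns`, always keeping the TID in the fragment."""
    return [{"tid": r["tid"], **{c: r[c] for c in columns}} for r in rows]


def rejoin(frag_a, frag_b):
    """Join two vertical fragments on TID to rebuild the full tuples."""
    by_tid = {r["tid"]: r for r in frag_b}
    return [{**a, **by_tid[a["tid"]]} for a in frag_a]


frag1 = vertical_fragment(sailors, ["sid", "sname"])   # e.g. stored at one site
frag2 = vertical_fragment(sailors, ["rating", "age"])  # e.g. stored at another site
reconstructed = rejoin(frag1, frag2)

# The join on TID loses nothing and adds nothing.
print(reconstructed == sailors)
```

Without the replicated TID there would be no common attribute to join on, so the original relation could not be recovered from the fragments.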
Replication
• Storing several copies of a relation or of relation
fragments
• a relation or fragment can be replicated at one or more sites
• e.g. R is fragmented into R1, R2, R3; one copy each of R2 and R3, but two
copies of R1 at two sites
• Advantages
• Gives increased availability – e.g. when a site or communication
link goes down
• Faster query evaluation – e.g. using a local copy
• Synchronous and Asynchronous (later)
• Vary in how current different copies are when a relation is modified
[Figure: Site A stores R1 and R3; Site B stores R1 and R2]
Distributed Query Processing:
Non-Join Distributed Queries
SELECT AVG(S.age)
FROM Sailors S
WHERE S.rating > 3
AND S.rating < 7
[Figure: Sailors(tid, sid, sname, rating, age). T1 (4) is stored at Shanghai; T2 (5) and T3 (9) are stored at Tokyo.]
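One way to evaluate the AVG query above when the Sailors tuples are split between Shanghai and Tokyo is to have each site return a partial (SUM, COUNT) and combine them at the query site. The sketch below assumes the values 4, 5, 9 in the figure are ratings; the ages and the function names are made up for illustration.

```python
# Hypothetical fragments; ratings follow the figure, ages are invented.
shanghai = [{"tid": "T1", "rating": 4, "age": 30.0}]
tokyo = [
    {"tid": "T2", "rating": 5, "age": 40.0},
    {"tid": "T3", "rating": 9, "age": 50.0},
]


def local_sum_count(fragment, low=3, high=7):
    """Run at each site: SUM(age) and COUNT(*) over qualifying local tuples."""
    ages = [r["age"] for r in fragment if low < r["rating"] < high]
    return sum(ages), len(ages)


def global_avg(partials):
    """Run at the query site: combine the per-site (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


partials = [local_sum_count(shanghai), local_sum_count(tokyo)]
avg_age = global_avg(partials)
print(partials, avg_age)  # [(30.0, 1), (40.0, 1)] 35.0
```

Shipping (SUM, COUNT) rather than a local average matters: averaging the per-site averages is wrong in general when sites hold different numbers of qualifying tuples.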
• Optional reading:
• Cattell’s paper (2010-11)
• Warning! some info will be outdated
• see webpage http://cattell.net/datastores/ for
updates and more pointers
So far -- RDBMS
• Relational Data Model
• Relational Database Systems (RDBMS)
• RDBMSs have
• a complete pre-defined fixed schema
• a SQL interface
• and ACID transactions
NOSQL
• Many of the new systems are referred to as “NoSQL” data
stores
• MongoDB, CouchDB, VoltDB, Dynamo, Membase, ….
• Basically Available
• Soft state
• Eventually consistent
ACID vs. BASE
• The idea is that by giving up ACID constraints, one
can achieve much higher performance and
scalability
1. Concurrency Control
2. Data Storage Medium
3. Replication
4. Transactions
Choices in NOSQL systems:
1. Concurrency Control
a) Locks
• some systems provide one-user-at-a-time read or
update locks
• MongoDB provides locking at a field level
b) MVCC (multi-version concurrency control)
c) None
• do not provide atomicity
• multiple users can edit in parallel
• no guarantee which version you will read
d) ACID
• pre-analyze transactions to avoid conflicts
• no deadlocks and no waits on locks
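To make option (b) concrete, here is a toy sketch of multi-version concurrency control: every write appends a new timestamped version, and a reader sees only the versions committed at or before its snapshot. The class `MVCCStore` and its methods are hypothetical and heavily simplified (single process, no write-write conflict detection, no garbage collection of old versions).

```python
class MVCCStore:
    """Toy multi-version store: writes never overwrite, they add versions."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), ascending by ts
        self._clock = 0      # logical commit timestamp

    def begin(self):
        """Start a reader: its snapshot is the current commit timestamp."""
        return self._clock

    def write(self, key, value):
        """Commit a new version of `key` with the next timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before the snapshot."""
        visible = None
        for ts, value in self._versions.get(key, []):
            if ts <= snapshot_ts:
                visible = value
        return visible


store = MVCCStore()
store.write("x", 1)
snap = store.begin()                       # reader starts here
store.write("x", 2)                        # a later writer does not block the reader
old_view = store.read("x", snap)           # reader still sees its snapshot
new_view = store.read("x", store.begin())  # a fresh reader sees the new version
print(old_view, new_view)  # 1 2
```

Readers never wait on writers in this scheme, which is the main contrast with option (a), where a lock admits one user at a time.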
Choices in NOSQL systems:
2. Data Storage Medium
a) Storage in RAM
• snapshots or replication to disk
• poor performance when data overflows RAM
b) Disk storage
• caching in RAM
Choices in NOSQL systems:
3. Replication
• whether mirror copies are always in sync
a) Synchronous
b) Asynchronous
• faster, but updates may be lost in a crash
c) Both
• local copies synchronously, remote copies
asynchronously
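The three replication choices above can be sketched with a toy primary that updates some replicas inside the write call (synchronous) and queues the update for others (asynchronous). The classes `Primary` and `Replica` are hypothetical and model replicas as in-memory dictionaries, purely to show when each copy becomes current.

```python
from collections import deque


class Replica:
    """A copy of the data held at some site."""

    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, sync_replicas, async_replicas):
        self.data = {}
        self.sync_replicas = sync_replicas    # updated before write() returns
        self.async_replicas = async_replicas  # updated later, by flush()
        self.pending = deque()                # updates not yet shipped

    def write(self, key, value):
        self.data[key] = value
        for replica in self.sync_replicas:    # synchronous: always in sync
            replica.data[key] = value
        self.pending.append((key, value))     # asynchronous: ship later

    def flush(self):
        """Ship queued updates to the asynchronous replicas."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.async_replicas:
                replica.data[key] = value


local_copy, remote_copy = Replica(), Replica()
primary = Primary(sync_replicas=[local_copy], async_replicas=[remote_copy])

primary.write("k", "v")
before_flush = dict(remote_copy.data)  # remote copy is stale at this point
primary.flush()
after_flush = dict(remote_copy.data)   # remote copy has caught up
print(local_copy.data, before_flush, after_flush)
```

If the primary crashed between `write` and `flush`, the queued update would never reach the remote copy, which is the "updates may be lost in a crash" risk of the asynchronous option.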
Choices in NOSQL systems:
4. Transaction Mechanisms
a) support
b) do not support
c) in between
• support local transactions only within a single object or
“shard”
• shard = a horizontal partition of data in a database
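A small sketch of sharding as horizontal partitioning: each key is mapped to one shard, so every row lives in exactly one partition. Hash partitioning with `zlib.crc32` is an assumption made here for illustration (range partitioning is another common choice), and the `put`/`get` helpers and key names are hypothetical.

```python
import zlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # one dict stands in for one node


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    return shards[shard_for(key)].get(key)


for i in range(10):
    put(f"user:{i}", {"id": i})

# Each row is stored in exactly one shard; a lookup touches only that shard.
total_rows = sum(len(s) for s in shards)
print(total_rows, get("user:3"))
```

An operation that touches a single key involves a single shard, which is why the "in between" systems can offer transactions within a shard without coordinating across nodes.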
Comparison from Cattell’s paper (2011)
Data Store Categories
• The data stores are grouped according to their data model
• Key-value Stores:
• store values and an index to find them based on a programmer-
defined key
• e.g. Project Voldemort, Riak, Redis, Scalaris, Tokyo Cabinet,
Memcached/Membrain/Membase
• Document Stores:
• store documents, which are indexed, with a simple query mechanism
• e.g. Amazon SimpleDB, CouchDB, MongoDB, Terrastore
• Extensible Record Stores:
• store extensible records that can be partitioned vertically and
horizontally across nodes (“wide column stores”)
• e.g. HBase, Hypertable, Cassandra, Yahoo’s PNUTS
• Relational Databases:
• store (and index and query) tuples, e.g. the new RDBMSs that provide
horizontal scaling
• e.g. MySQL Cluster, VoltDB, Clustrix, ScaleDB, ScaleBase, NimbusDB,
Google Megastore (a layer on BigTable)
RDBMS benefits
• Relational DBMSs have taken and retained majority market
share over other competitors in the past 30 years
• If you only require a lookup of objects based on a single key, then a key-
value/document store may be adequate and probably easier to
understand than a relational DBMS
• Row store
• store all attributes of a tuple together
• storage like “row-major order” in a matrix
• Column store
• store all rows for an attribute (column) together
• storage like “column-major order” in a matrix
• e.g.
• MonetDB, Vertica (earlier, C-store), SAP/Sybase IQ,
Google Bigtable (with column groups)
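The row-major versus column-major analogy above can be illustrated with a tiny made-up table (the attribute names and values are assumptions for the example). The same tuples are laid out once tuple by tuple and once attribute by attribute:

```python
# Hypothetical table with three attributes.
names = ("sid", "sname", "rating")
rows = [
    (1, "dustin", 7),
    (2, "lubber", 8),
    (3, "rusty", 10),
]

# Row store: all attributes of a tuple are adjacent ("row-major order").
row_major = [value for row in rows for value in row]

# Column store: all values of one attribute are adjacent ("column-major order").
column_store = {name: [row[i] for row in rows] for i, name in enumerate(names)}
column_major = [value for name in names for value in column_store[name]]

print(row_major)     # [1, 'dustin', 7, 2, 'lubber', 8, 3, 'rusty', 10]
print(column_major)  # [1, 2, 3, 'dustin', 'lubber', 'rusty', 7, 8, 10]

# An aggregate over one attribute reads only that column in the column layout.
rating_sum = sum(column_store["rating"])
print(rating_sum)    # 25
```

In the column layout a query over one attribute scans a single contiguous list, while fetching one whole tuple has to touch every column; the row layout has the opposite trade-off.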
Ack: Slide from VLDB 2009 tutorial on Column store