UNIT 2 - Part1
CREDITS: Jimmy Lin (University of Maryland)
Why NoSQL?
In the last fifty years relational databases have been the default
choice for serious data storage.
An architect starting a new project:
your only choice is likely to be which relational database to use.
often not even that, if your company has a dominant vendor.
In the past, other proposals for database technology:
deductive databases in the 1980s
object databases in the 1990s
XML databases in the 2000s
these alternatives never got anywhere.
The Value of Relational Databases
Effective and efficient management of persistent data
Concurrency control
Data integration
A standard data model
A standard query language
Impedance Mismatch
Difference between the persistent data model and the in-memory
data structures
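As a rough illustration of the mismatch, the sketch below assumes a hypothetical order with two line items (the data and the function names `to_rows` and `from_rows` are invented for this example). In memory the order is one nested structure; for relational storage it has to be flattened into rows for two tables and reassembled on the way back.

```python
# In-memory view: one nested structure (hypothetical example data).
order = {
    "id": 99,
    "customer": "Martin",
    "items": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring Databases", "price": 40.00},
    ],
}


def to_rows(order):
    """Flatten the nested order into relational-style tuples, one list per table."""
    order_rows = [(order["id"], order["customer"])]
    item_rows = [(order["id"], i["product"], i["price"]) for i in order["items"]]
    return order_rows, item_rows


def from_rows(order_rows, item_rows):
    """Rebuild the nested structure from the flat rows."""
    order_id, customer = order_rows[0]
    items = [
        {"product": product, "price": price}
        for (oid, product, price) in item_rows
        if oid == order_id
    ]
    return {"id": order_id, "customer": customer, "items": items}


order_rows, item_rows = to_rows(order)
```

The two translation functions are the extra work the mismatch imposes on the application.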
A proposal to solve the problem (1990s)
Databases that replicate the in-memory data structures to disk
Object-oriented databases!
NoSQL
Term appeared in the late 90s
open-source relational database [Strozzi NoSQL]
tables as ASCII files, without SQL
Current interpretation
June 11, 2009: meetup in San Francisco
Open-source, distributed, non-relational databases
Hashtag chosen: #NoSQL
Main features:
Not using SQL and the relational model
Open-source projects (mostly)
Running on clusters
Schemaless
However, no accepted precise definitions
Most people say that NoSQL means "Not Only SQL"
Key Points
Relational databases have been a successful technology for twenty years,
providing persistence, concurrency control, and an integration mechanism
Application developers have been frustrated with the impedance mismatch
between the relational model and the in-memory data structures
There is a movement away from using databases as integration points towards
encapsulating databases within applications and integrating through services
The vital factor for a change in data storage was the need to support large
volumes of data by running on clusters. Relational databases are not designed to
run efficiently on clusters.
NoSQL is an accidental neologism. There is no prescriptive definition—all you
can make is an observation of common characteristics.
The common characteristics of NoSQL databases are:
Not using the relational model
Running well on clusters
Popularity
Open-source
Schemaless
The non-relational world
NoSQL Data Models
A data model is a set of constructs for representing the
information
Relational model: tables, columns and rows
Storage model: how the DBMS stores and manipulates the data
internally
A data model is usually independent of the storage model
Data models for NoSQL systems:
aggregate models
key-value,
document,
column-family
graph-based models
Aggregates
Data as atomic units that have a complex structure
more structure than just a set of tuples
example:
complex record with: simple fields, arrays, records nested inside
Aggregate in Domain-Driven Design
a collection of related objects that we treat as a unit
a unit for data manipulation and management of consistency
Advantages of aggregates:
easier for application programmers to work with
easier for database systems to handle operating on a cluster
Example
Relational implementation
A possible aggregation
Aggregate representation
Aggregate implementation
Another possible aggregation
Aggregate representation (2)
Aggregate implementation (2)
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
Design strategy
No universal answer for how to draw aggregate boundaries
It depends entirely on how you tend to manipulate data!
Accesses on a single order at a time: first solution
Accesses on customers with all orders: second solution
Context-specific
some applications will prefer one or the other
even within a single system
Focus on the unit of interaction with the data storage
Pros:
it helps greatly with running on a cluster: data will be manipulated together,
and thus should live on the same node!
Cons:
an aggregate structure may help with some data interactions but be an
obstacle for others
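The trade-off can be sketched with plain dictionaries standing in for the data store. This is only an illustration under assumed names (`customers_v1`, `orders_v1`, `customers_v2` and the helper functions are invented here): option 1 keeps orders as separate aggregates, option 2 embeds them in the customer.

```python
# Option 1: orders are separate aggregates that reference the customer by ID.
customers_v1 = {1: {"id": 1, "name": "Martin"}}
orders_v1 = {
    99: {"id": 99, "customerId": 1, "orderItems": [{"productId": 27, "price": 32.45}]}
}

# Option 2: each customer aggregate embeds all of its orders.
customers_v2 = {
    1: {
        "id": 1,
        "name": "Martin",
        "orders": [{"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]}],
    }
}


def get_order_v1(order_id):
    # A single order is one key lookup.
    return orders_v1[order_id]


def get_order_v2(order_id):
    # A single order requires scanning the customer aggregates.
    for customer in customers_v2.values():
        for order in customer["orders"]:
            if order["id"] == order_id:
                return order
    raise KeyError(order_id)


def customer_with_orders_v1(customer_id):
    # A customer with all orders requires collecting the matching order aggregates.
    customer = dict(customers_v1[customer_id])
    customer["orders"] = [
        o for o in orders_v1.values() if o["customerId"] == customer_id
    ]
    return customer


def customer_with_orders_v2(customer_id):
    # A customer with all orders is one key lookup.
    return customers_v2[customer_id]
```

Each layout makes one access pattern a single lookup and the other a scan, which is why the boundary should follow the dominant access pattern.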
Transactions?
Relational databases do have ACID transactions!
Aggregate-oriented databases:
don’t have ACID transactions that span multiple aggregates
they support atomic manipulation of a single aggregate at a time
Part of the consideration for deciding how to aggregate data
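A minimal sketch of single-aggregate atomicity, assuming a toy in-memory store (the class `AggregateStore` and its methods are hypothetical and do not model any particular product). An update works on a private copy and is installed only if it completes, so a failure midway leaves the stored aggregate untouched. Nothing in the store coordinates changes across two keys.

```python
import copy
import threading


class AggregateStore:
    """Toy in-memory store: each update to ONE aggregate is all-or-nothing."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, mutate):
        """Apply mutate() to a private copy; install it only if mutate() succeeds."""
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            mutate(draft)  # an exception here leaves the stored value unchanged
            self._data[key] = draft


agg_store = AggregateStore()
agg_store.put("order:99", {"status": "new", "items": []})


def add_item_then_fail(order):
    order["items"].append({"productId": 27})
    raise ValueError("simulated failure mid-update")


try:
    agg_store.update("order:99", add_item_then_fail)
except ValueError:
    pass  # the half-applied change was discarded

after_failure = agg_store.get("order:99")
agg_store.update("order:99", lambda order: order.update(status="paid"))
```

If an operation must change both an order and a customer atomically, this argues for placing them in the same aggregate.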
Key-Value Databases
Strongly aggregate-oriented
Lots of aggregates
Each aggregate has a key
Data model:
A set of <key,value> pairs
Value: an aggregate instance
The aggregate is opaque to the database
just a big blob of mostly meaningless bits
Access to an aggregate:
lookup based on its key
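The data model above can be sketched as follows. The class `KeyValueStore` is a hypothetical stand-in, not a real client library; the point is that the store only ever sees bytes, and interpreting them (here as JSON) is left to the application.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes the store never inspects."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, value: bytes) -> None:
        self._blobs[key] = value

    def get(self, key: str) -> bytes:
        return self._blobs[key]


kv = KeyValueStore()
customer = {"id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}]}

# The application chooses the encoding; the store cannot query inside the blob.
kv.put("customer:1", json.dumps(customer).encode("utf-8"))
loaded = json.loads(kv.get("customer:1").decode("utf-8"))
```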
Popular key-value databases
Document databases
Strongly aggregate-oriented
Lots of aggregates
Each aggregate has a key
Data model:
A set of <key,document> pairs
Document: an aggregate instance
Structure of the aggregate visible
limits on what we can place in it
Access to an aggregate:
queries based on the fields in the aggregate
Popular document databases
Key-Value vs Document stores
Key-value database
A key plus a big blob of mostly meaningless bits
We can store whatever we like in the aggregate
We can only access an aggregate by lookup based on its key
Document database
A key plus a structured aggregate
More flexibility in access
we can submit queries to the database based on the fields in the aggregate
we can retrieve part of the aggregate rather than the whole thing
Indexes based on the contents of the aggregate
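By contrast with an opaque blob, a document store can look inside the aggregate. The sketch below is a toy model (the class `DocumentStore`, its methods, and the second customer record are invented for illustration; indexes are omitted and queries simply scan all documents).

```python
class DocumentStore:
    """Toy document store: documents are dicts whose fields the store can see."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = document

    def find(self, **criteria):
        """Return documents whose top-level fields equal all given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]

    def project(self, key, *fields):
        """Return only the selected top-level fields of one document."""
        doc = self._docs[key]
        return {field: doc[field] for field in fields if field in doc}


docs = DocumentStore()
docs.insert(1, {"id": 1, "name": "Martin", "city": "Chicago"})
docs.insert(2, {"id": 2, "name": "Pramod", "city": "Boston"})  # hypothetical record

in_chicago = docs.find(city="Chicago")   # query on a field, not on the key
name_only = docs.project(1, "name")      # retrieve part of the aggregate
```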
Column(-Family) Stores
Strongly aggregate-oriented
Lots of aggregates
Each aggregate has a key
Data model: a two-level map structure:
A set of <row-key, aggregate> pairs
Each aggregate is a group of pairs
<column-key,value>
Structure of the aggregate visible
Columns can be organized in families
Data usually accessed together
Access to an aggregate:
accessing the row as a whole
picking out a particular column
Properties of Column Stores
Operations also allow picking out a particular column
get('1234', 'name')
Each column:
has to be part of a single column family
acts as a unit for access
You can add any column to any row, and rows can have very different
columns
You can model a list of items by making each item a separate column.
Two ways to look at data:
Row-oriented
Each row is an aggregate
Column families represent useful chunks of data within that aggregate.
Column-oriented:
Each column family defines a record type
Row as the join of records in all column families
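The two-level map can be sketched with nested dictionaries. This is a toy model under assumed names (the class `ColumnFamilyStore`, the families `profile` and `orders`, and the sample values are invented here), not the API of any real column store; it mirrors the `get('1234', 'name')` operation mentioned above.

```python
class ColumnFamilyStore:
    """Toy model: row key -> column family -> column key -> value."""

    def __init__(self, families):
        self._families = set(families)
        self._rows = {}

    def put(self, row_key, family, column, value):
        # Every column must belong to one of the declared column families.
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        self._rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, column):
        """Pick one column out of a row, searching the row's families."""
        for columns in self._rows.get(row_key, {}).values():
            if column in columns:
                return columns[column]
        raise KeyError((row_key, column))

    def get_family(self, row_key, family):
        """Fetch all columns of one family for a row (data usually accessed together)."""
        return dict(self._rows.get(row_key, {}).get(family, {}))


cf = ColumnFamilyStore(families=["profile", "orders"])
cf.put("1234", "profile", "name", "Martin")
cf.put("1234", "profile", "billingAddress", "Chicago")
# A list of items modeled as one column per item.
cf.put("1234", "orders", "order-99", "NoSQL Distilled")
cf.put("1234", "orders", "order-100", "Refactoring Databases")
# Rows may have very different columns.
cf.put("5678", "profile", "name", "Pramod")
```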
Cassandra
Skinny row
few columns
same columns used by many different rows
each row is a record and each column is a field
Wide row
many columns (perhaps thousands)
rows having very different columns
models a list, with each column being one element in that list
Popular column stores
Key Points
An aggregate is a collection of data that we interact with as a unit.
Aggregates form the boundaries for ACID operations with the
database
Key-value, document, and column-family databases can all be seen
as forms of aggregate-oriented database
Aggregates make it easier for the database to manage data storage
over clusters
Aggregate-oriented databases work best when most data
interaction is done with the same aggregate
Aggregate-ignorant databases are better when interactions use data
organized in many different formations
Relationships
Relationship between different aggregates:
Put the ID of one aggregate within the data of the other
Join: write a program that uses the ID to link data
The database is ignorant of the relationship in the data
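Such an application-side join might look like the sketch below. The dictionaries stand in for two sets of aggregates, and the function name and sample totals are assumptions made for this example.

```python
customers = {1: {"id": 1, "name": "Martin"}}
orders = {
    99: {"id": 99, "customerId": 1, "total": 32.45},
    100: {"id": 100, "customerId": 1, "total": 40.00},
}


def orders_with_customer_name(orders, customers):
    """Follow each order's customerId to the customer aggregate, in application code."""
    return [
        {"orderId": o["id"], "customerName": customers[o["customerId"]]["name"]}
        for o in orders.values()
    ]


joined = orders_with_customer_name(orders, customers)
```

The store sees `customerId` as just another value, so it cannot check that the referenced customer exists.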
Relationship management
Many NoSQL databases provide ways to make relationships visible
to the database
Document stores make use of indexes
Riak (key-value store) allows you to put link information in metadata
But what about updates?
Aggregate-oriented databases treat the aggregate as the unit of data
retrieval.
Atomicity is only supported within the contents of a single aggregate.
Updating multiple aggregates at once is the programmer's
responsibility!
In contrast, relational databases provide ACID guarantees while
altering many rows through transactions
Graph Databases
Graph databases are motivated by a different frustration with
relational databases
Complex relationships require complex joins
Goal:
Capture data consisting of complex relationships
Data naturally modelled as graphs
Examples: Social networks, Web data, maps, networks.
A graph database
Data model of graph databases
Basic characteristic: nodes connected by edges (also called arcs).
Beyond this: a lot of variation in data models
Neo4j lets you attach Java objects as properties to nodes and edges in a schemaless fashion
InfiniteGraph stores Java objects, which are subclasses of built-in
types, as nodes and edges.
FlockDB is simply nodes and edges with no mechanism for additional
attributes
Queries
Navigation through the network of edges
You do need a starting place
Nodes can be indexed by an attribute such as ID.
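A minimal sketch of this query style, assuming a toy adjacency-list graph (the class `Graph`, the relationship names `friend` and `likes`, and the people are invented for illustration). The node dictionary plays the role of the index on ID that gives the traversal its starting place.

```python
from collections import deque


class Graph:
    """Toy graph: nodes indexed by ID, edges carry a relationship name."""

    def __init__(self):
        self._nodes = {}  # node id -> properties (acts as the index on ID)
        self._edges = {}  # node id -> list of (relationship name, target id)

    def add_node(self, node_id, **props):
        self._nodes[node_id] = props
        self._edges.setdefault(node_id, [])

    def add_edge(self, source, name, target):
        self._edges[source].append((name, target))

    def traverse(self, start, name, max_depth):
        """Breadth-first: IDs reachable from start via `name` edges within max_depth hops."""
        seen, found = {start}, []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the hop limit
            for edge_name, target in self._edges[node]:
                if edge_name == name and target not in seen:
                    seen.add(target)
                    found.append(target)
                    queue.append((target, depth + 1))
        return found


g = Graph()
for person in ("anna", "barbara", "carol", "dawn"):
    g.add_node(person)
g.add_edge("anna", "friend", "barbara")
g.add_edge("barbara", "friend", "carol")
g.add_edge("carol", "friend", "dawn")
g.add_edge("anna", "likes", "dawn")

friends_of_friends = g.traverse("anna", "friend", max_depth=2)
```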
Graph vs Relational databases
Relational databases
implement relationships using foreign keys
navigating around requires joins, which can get quite expensive
Graph databases
make traversal along the relationships very cheap
performance is better for highly connected data
shift most of the work from query time to insert time
good when querying performance is more important than insert
speed
Graph vs Aggregate-oriented databases
Very different data models
Aggregate-oriented databases
distributed across clusters
simple query languages
no ACID guarantees across multiple aggregates
Graph databases
more likely to run on a single server
graph-based query languages
transactions maintain consistency over multiple nodes and edges
Schemaless Databases
Key-value store allows you to store any data you like under a key
Document databases make no restrictions on the structure of the
documents you store
Column-family databases allow you to store any data under any
column you like
Graph databases allow you to freely add new edges and freely add
properties to nodes and edges as you wish
Pros and cons of schemaless data
Pros:
More freedom and flexibility
You can easily change your data organization
You can deal with non-uniform data
Cons:
A program that accesses data:
almost always relies on some form of implicit schema
it assumes that certain fields are present
The implicit schema is shifted into the application code that accesses data
To understand what data is present you have to look at the application code
The schema cannot be used to:
decide how to store and retrieve data efficiently
ensure data consistency
Problems if multiple applications, developed by different people, access the
same database.
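The implicit schema can be made concrete with a small sketch. The records, the field list `REQUIRED_FIELDS`, and the function names are all hypothetical; the store itself would accept every one of these records without complaint.

```python
# All three records are accepted by a schemaless store (hypothetical data).
records = [
    {"id": 1, "name": "Martin", "city": "Chicago"},
    {"id": 2, "name": "Pramod"},                      # no "city" field
    {"id": 3, "fullName": "Ann", "city": "Boston"},   # different field name
]

# The implicit schema lives in the application code, not in the database.
REQUIRED_FIELDS = ("id", "name", "city")


def missing_fields(record):
    return [field for field in REQUIRED_FIELDS if field not in record]


def mailing_labels(records):
    """Only records that satisfy the implicit schema can be processed."""
    return [f'{r["name"]}, {r["city"]}' for r in records if not missing_fields(r)]


labels = mailing_labels(records)
problems = {r["id"]: missing_fields(r) for r in records if missing_fields(r)}
```

A second application written with a different `REQUIRED_FIELDS` list would disagree about which records are valid, which is the multi-application problem noted above.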
Materialized Views
A relational view is a table defined by computation over the base tables
Materialized views: computed in advance and cached on disk
NoSQL databases:
do not have views
have precomputed and cached queries, usually called “materialized views”
Strategies for building a materialized view
Eager approach
the materialized view is updated at the same time as the base data
good when you have more frequent reads than writes
Detached approach
batch jobs update the materialized views at regular intervals
good when you don’t want to pay an overhead on each update
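The two strategies can be compared in a small sketch, assuming a toy store whose view maps each product ID to the orders containing it (the class `OrderStore` and its methods are invented for this example, and replacing or deleting an existing order is deliberately not handled).

```python
from collections import defaultdict


class OrderStore:
    """Toy store with a materialized view: productId -> IDs of orders containing it."""

    def __init__(self, eager):
        self.eager = eager
        self.orders = {}                            # base data
        self.orders_by_product = defaultdict(set)   # the materialized view

    def put_order(self, order):
        self.orders[order["id"]] = order
        if self.eager:
            self._index(order)  # eager: pay the view cost on every write

    def _index(self, order):
        for item in order["orderItems"]:
            self.orders_by_product[item["productId"]].add(order["id"])

    def rebuild_view(self):
        """Detached: a batch job recomputes the whole view from the base data."""
        self.orders_by_product = defaultdict(set)
        for order in self.orders.values():
            self._index(order)


eager_store = OrderStore(eager=True)
eager_store.put_order({"id": 99, "orderItems": [{"productId": 27}]})

detached_store = OrderStore(eager=False)
detached_store.put_order({"id": 99, "orderItems": [{"productId": 27}]})
stale_view = set(detached_store.orders_by_product.get(27, set()))  # not yet refreshed
detached_store.rebuild_view()
```

With the detached approach, reads between batch runs may see a stale view; with the eager approach, every write does the extra indexing work.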
Data Accesses in key-value store
The application can read all of a customer’s information by using the key
Splitting aggregates
We can now find the orders independently from the Customer, and with the
orderID reference in the Customer we can find all Orders for the Customer.
Aggregates for analytics
A view may store which Orders have a given Product in them
Useful for real-time analytics
Data Accesses in document stores
We can query inside documents:
“find all orders that include the Refactoring Databases product”
Removing references to Orders from the Customer object is possible
We do not need to update the Customer object when new orders are placed by
the Customer
Data Accesses in column-family stores
We can query inside rows:
“find all orders whose price is greater than $20”
The columns are ordered
We can choose columns that are frequently used so that they are fetched first
Splitting data into different column families can improve performance
Data Accesses in graph databases
We start from a node (or a set of nodes)
Each node has independent relationships with other nodes
The relationships have names
Relationship names let you traverse the graph.
Key Points
Aggregate-oriented databases make inter-aggregate relationships
more difficult to handle than intra-aggregate relationships.
Graph databases organize data into node and edge graphs; they
work best for data that has complex relationship structures.
Schemaless databases allow you to freely add fields to records, but
there is usually an implicit schema expected by users of the data.
Aggregate-oriented databases often compute materialized views to
provide data organized differently from their primary aggregates.
This is often done with MapReduce-like computations.