06 BigDataAndBigDataDesign
1
Knowledge objectives
1. Define the impedance mismatch
2. Identify applications handling different kinds of data
3. Name four different kinds of NOSQL systems
4. Explain three consequences of schema variability
5. Explain the consequences of physical independence
6. Explain the two dimensions to classify NOSQL systems according to
how they manage schema
7. Explain the three elements of the RUM conjecture
2
Understanding objectives
1. Decide whether two NOSQL designs have a more or less explicit/fixed schema
3
Application objectives
1. Given a relatively small UML conceptual diagram, translate it into a logical representation of the data, considering a flexible schema representation
4
From SQL to NOSQL
The need for alternative families of database technologies
5
Law of the instrument
“Over-reliance on a familiar tool.”
Wikipedia
If the only tool you have is a hammer, everything looks like a nail.
A. Maslow
6
Law of the Relational Database
• Since we only know relational databases, every time we want to model a new domain we will automatically think about how to represent it as columns and rows (Ireland et al.)
7
The end of an architectural era
WEB 2.0 – Write Era
Real-time communication
Big Data
8
RDBMS: why aren’t they enough?
Michael Stonebraker and Ugur Çetintemel. One size fits all: an idea whose time has come and gone. ACM Books, pp. 441-462, 2019.

RDBMS
• Generic architecture that can be tuned according to the needs:
  • Mainly-write OLTP Systems
    • Normalization
    • Indexes: B+, Hash
    • Joins: BNL, RNL, Hash-Join, Merge Join
  • Read-only DW (OLAP) Systems
    • Denormalized data
    • Indexes: Bitmaps
    • Joins: Star-join
    • Materialized views

Designed for consistency and integrity, making them excellent for applications with complex queries and transaction processing.
We need to deal with massive reads and writes at the same time!
Data fragmentation + Data replication → Distributed RDBMS
9
Distributed RDBMS: limitations
• ACID Transactions
• Relational databases follow ACID (Atomicity, Consistency, Isolation, Durability) properties
to ensure data integrity. Maintaining them across distributed nodes introduces
complexity and performance bottlenecks
• Locking and Contention
• To enforce data integrity, relational databases use locking mechanisms to manage
simultaneous transactions, which is very costly in a distributed environment.
• Schema Rigidity
• Relational databases rely on predefined schemas, which makes it difficult to adapt to
changes in the data structure without significant overhead
• Joins across Nodes
• Queries in relational models often involve complex joins between tables. Distributing
tables across multiple nodes makes these joins inefficient and costly
11
New challenges for data management
Volume, Veracity, Velocity, Variability, Variety
12
NOSQL Goals
• Schemaless: Allow flexible (even runtime) schema definition
• Reliability / availability: Keep delivering service even if its software or
hardware components fail
• Scalability: Continuously evolve to support a growing amount of work
• Efficiency: How well the system performs, usually measured in terms of
response time (latency) and throughput (bandwidth)
13
Aggregate data models
• The relational model, with simple record structures, referential integrity, transactions, … is not suitable for distribution → it was not designed to run on clusters
• We need to operate on data in units that have a more complex structure
(complex records)
• Think of a complex record as a structure that allows lists and other
record structures to be nested inside it
• These complex records are sometimes referred to as aggregates
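As a sketch of what such a complex record looks like, the nested structure below (plain Python, illustrative field names, values taken from the Orders example used later in these slides) bundles a customer, its payment data and its line items into a single aggregate:

```python
# A single order stored as one aggregate: a complex record that nests
# lists and other records inside it (illustrative field names).
order_aggregate = {
    "id": 1001,
    "customer": {"name": "Ann", "phone": "234"},
    "payment": {"card": "Amex", "cc_number": "12345", "expiry": "04/28"},
    "line_items": [                       # nested list of nested records
        {"product": "03214", "price": 50, "qty": 3, "total": 150},
        {"product": "03222", "price": 40, "qty": 1, "total": 40},
    ],
}

# The aggregate is read and written as a unit; no joins are needed to
# reassemble the order.
total_of_listed_items = sum(i["total"] for i in order_aggregate["line_items"])
```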
14
Aggregate data models
• Aggregate orientation fits well with scaling out (i.e., use lots of small
machines in a cluster)
• The aggregate is a natural unit for distribution
• The aggregate makes a natural unit for replication and sharding
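A minimal sketch of why the aggregate is a natural unit for distribution: hashing the aggregate key decides which node of the cluster stores (and replicates) the whole order. The node names and helper function are hypothetical, not any specific product’s API.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]   # hypothetical cluster nodes

def node_for(aggregate_key: str) -> str:
    """Pick the shard for an aggregate by hashing its key."""
    digest = hashlib.md5(aggregate_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every read and write of order 1001 is routed to the same node,
# and that node's content can be replicated as a whole.
print(node_for("1001"))
```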
15
Aggregate data models: Orders and Aggregates (1)
Example of Relations
Relational database perspective: no aggregates
23
One size does not fit all
Michael Stonebraker and Ugur Çetintemel. One size fits all: an idea whose time has come and gone. ACM Books, pp. 441-462, 2019.
24
Evolution of different data models
25
Schema definition
26
Schema variability
• CREATE TABLE Students(id int, name varchar(50), surname varchar(50), enrolment date);
• INSERT INTO Students VALUES (1, 'Sergi', 'Nadal', '01/01/2012', true, 'Igualada'); WRONG
• INSERT INTO Students VALUES (1, 'Sergi', 'Nadal', NULL); OK
• INSERT INTO Students VALUES (1, 'Sergi', 'Nadal', '01/01/2012'); OK
• Consequences
• Gain flexibility
• Lose semantics (also consistency)
• The data independence principle is lost (!)
• The ANSI / SPARC architecture is not followed → Implicit schema
• Applications can access and manipulate the database internal structures
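A minimal sketch of the implicit-schema consequence, reusing the Students example above (a plain Python list stands in for a schemaless collection; no particular NOSQL product is assumed):

```python
# A schemaless store accepts records with different attributes.
students = []   # stands in for a schemaless collection

# The tuple rejected by the relational table above is accepted here ...
students.append({"id": 1, "name": "Sergi", "surname": "Nadal",
                 "enrolment": "01/01/2012", "active": True,
                 "city": "Igualada"})
# ... and so is a record that simply omits the enrolment date.
students.append({"id": 1, "name": "Sergi", "surname": "Nadal"})

# The schema now lives in the application code: every reader must check
# which attributes are actually present (lost semantics / consistency).
for s in students:
    print(s.get("enrolment", "enrolment date missing"))
```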
27
ANSI/SPARC (recap)
Physical independence
28
ANSI/SPARC
Physical independence
29
Database models

RELATIONAL
• Based on the relational model
• Tables, rows and columns
• Sets, instances and attributes
• Constraints are allowed
  • PK, FK, Check, …
• When creating the tables you MUST specify their schema (i.e., columns and constraints)
• Data is restructured when brought into memory (impedance mismatch)

NOSQL
• No single reference model
• Key-value, document, stream, graph
• Ideally, the schema should be defined at insertion time and not at definition time (schemaless databases)
• The closer the data model in use looks to the way data is stored internally, the better (read/write throughput)
30
Considered database models
• Relational
city(name, population, region) VALUES ('BCN', '2,000,000', 'CAT')
• Key-Value
['BCN', '2,000,000;CAT']
• Document
{id: 'BCN', population: '2,000,000', region: 'CAT'}
• Wide-Column (Column-Family)
['BCN', population: {value: '2,000,000'}, region: {value: 'CAT'}]
['BCN', all: {value: '2,000,000;CAT'}]
['BCN', all: {population: '2,000,000'; region: 'CAT'}]
31
Relevant schema dimensions
Some new models lack an explicit schema (declared by the user)
• An implicit schema (hidden in the application code) always remains
• May reduce the impedance mismatch
32
Key-value and Document
Data Models
33
Key-value and Document Data Models
Key-value and document databases are strongly aggregate-oriented
• In a key-value database, the aggregate is opaque to the database: just some big
blob of bits. The advantage of opacity is that we can store whatever we like in the
aggregate. It is the responsibility of the application to understand what was stored.
Since key-value stores always use primary-key access, they generally offer great performance.
• In contrast, a document database is able to see a structure in the aggregate, but
imposes limits on what we can place in it, defining allowable structures and types.
In return, however, we get more flexibility when accessing data.
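The contrast can be sketched with plain Python (dictionaries stand in for the stores; no particular key-value or document product is assumed):

```python
import json

# Key-value store: the aggregate is an opaque blob of bits.
# The only access path is a lookup by key; the application must
# understand what it stored.
kv_store = {"1001": json.dumps({"customer": "Ann", "price": 210}).encode()}
order = json.loads(kv_store["1001"])            # lookup by primary key only

# Document store: the database sees the structure of the aggregate,
# so (in a real document database) we can also query by inner fields.
doc_store = [{"_id": "1001", "customer": "Ann", "price": 210},
             {"_id": "1002", "customer": "Dan", "price": 230}]
ann_orders = [d for d in doc_store if d["customer"] == "Ann"]
```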
34
Aggregate data models: Orders example (1NF)

Customer(CustKey, Name, Phone)
  1, Ann, 234
  2, Dan, 211

CreditCard(CustKey, CCNum, Expiry)
  1, 02345, 04/28
  2, 01221, 05/24

Orders(…) and LineItem(…) hold the orders and their order lines (line items)
35
Key-values: Orders Example

Key → Value
1001 → 03214_$50_3_$150, 03222_$40_1_$40 …

Compare with the relational LineItem(OrderKey, LineNum, PriceItem, Qty, TotPrice): (1001, 03214, $50, 3, $150), …

In a key-value store, we can only access an aggregate by lookup based on its key.
36
The Document data model – Orders Example

.json
  ID: 1001
  customer: Ann
  line items: …

Compare with the relational tables:
Orders(OrderID, CustKey, Price): (1001, 1, $210), (1002, 2, $230)
Customer(CustKey, Name, Phone): (1, Ann, 234), (2, Dan, 211)
38
The Column-Family data model – Example

Column-Families are organized in terms of distributed maps. The column-family model can be seen as a two-level aggregate structure:
• As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest
• This row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns, each being a key-value pair
• Columns are organized into column families. Each column has to be part of a single column family (data for a particular column family will usually be accessed together)
• Each row identifier (i.e., first-level key) is unique

{
  "1001" : {                        ← row identifier
    "profile" : {                   ← column family
      "customer": "Ann",            ← columns
      "card": "Amex",
      "cc_number": "12345",
      "expiry": "04/28" },
    "line-items" : {
      "items": "[[03214,$50,3,$150],…]" }
  },
  "1002" : {
    "profile" : "…",
    "line-items" : "…"
  }
}
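The access pattern implied by this two-level map can be sketched with nested Python dictionaries (no specific wide-column system assumed):

```python
# row identifier -> column family -> columns (key-value pairs)
rows = {
    "1001": {
        "profile": {"customer": "Ann", "card": "Amex",
                    "cc_number": "12345", "expiry": "04/28"},
        "line-items": {"items": "[[03214,$50,3,$150],...]"},
    },
}

# Pick the row aggregate by its unique identifier, then read only the
# column family (and the columns) that the query needs.
profile = rows["1001"]["profile"]          # one column family
customer_name = profile["customer"]        # one column -> "Ann"
```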
39
The Column-Family data model - Characteristics
• Column Family Structure: Data is organized into rows and columns grouped into column families,
allowing efficient data retrieval
• Sparse Data Handling: Optimized for storing sparse datasets, where rows can have a variable
number of columns, saving storage space
• High Write Performance: Designed for high-throughput write operations, making it suitable for
write-heavy workloads
• Scalability: Supports horizontal scaling across distributed nodes, ideal for handling large volumes of
data
• Efficient Data Access: Queries are optimized to read only the necessary columns within a column
family, improving performance
• Data Locality: Related columns are stored together on disk, making access patterns efficient for
certain types of queries
• Flexible Schema: Allows easy addition of new columns to existing rows without schema changes,
supporting evolving data requirements
40
Design choices
• Denormalization
• Partitioning/Fragmenting
• Horizontal
• Vertical
• Data placement
• Distribution
• Clustering
43
Alternative storage structures
44
The problem is not SQL
• Relational systems are too generic…
• OLTP: stored procedures and simple queries
• OLAP: ad-hoc complex queries
• Documents: large objects
• Streams: time windows with volatile data
• Scientific: uncertainty and heterogeneity
• …but the overhead of RDBMS has nothing to do with SQL
• Low-level, record-at-a-time interface is not the solution
Michael Stonebraker
SQL Databases vs. NoSQL Databases
Communications of the ACM, 53(4), 2010
45
The RUM conjecture
“Designing access methods that set an upper bound for two of the RUM
overheads, leads to a hard lower bound for the third overhead which
cannot be further reduced.”
M. Athanassoulis et al.
46
Example of RUM conjecture
• Unsorted (append-only) storage: Update(x) is a cheap append, but Find(x) is hard (no order to exploit); adding an index to speed up Find(x) introduces a space overhead, and the index must be updated on every write
• Sorted storage: Find(x) is a binary search, but Update(x) means reordering the data
M. Athanassoulis et al.
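The two storage layouts of the example can be mimicked with Python lists to make the trade-off concrete (a sketch, not an actual access-method implementation):

```python
import bisect

# Append-only (unsorted) storage: Update is a cheap append,
# but Find(x) must scan the whole list.
unsorted_data = []
unsorted_data.append(42)                     # cheap write
found = 42 in unsorted_data                  # expensive read: linear scan

# Sorted storage: Find(x) is a binary search,
# but Update(x) has to keep the order (elements are shifted).
sorted_data = []
bisect.insort(sorted_data, 42)               # expensive write: reordering
pos = bisect.bisect_left(sorted_data, 42)
found = pos < len(sorted_data) and sorted_data[pos] == 42

# An auxiliary index would make both faster, at the price of the third
# RUM overhead: extra memory and the cost of keeping the index up to date.
```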
47
RUM classification space
Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, Mark Callaghan: Designing Access Methods: The RUM Conjecture. EDBT 2016.
48
Data Storage

RELATIONAL
• Generic architecture that can be tuned according to the needs:
  • Mainly-write OLTP Systems
    • Normalization
    • Indexes: B+, Hash
    • Joins: BNL, RNL, Hash-Join, Merge Join
  • Read-only DW Systems
    • Denormalized data
    • Indexes: Bitmaps
    • Joins: Star-join
    • Materialized Views

NOSQL
• Specific architectures for a specific need:
  • Primary indexes
  • Sequential reads
  • Vertical partitioning
  • Compression
  • Fixed-size values
  • In-memory processing
• Very specific, but very good, at solving a particular problem
49
Different internal structures
Aina Montalban
50
51
Closing
52
Summary
• NOSQL systems
• Schemaless databases
• Impedance mismatch
• Aggregate data models
• Key-value data models
• Document data models
• Column-family data models
53
References
54