06 BigDataAndBigDataDesign

Data Bases


Introduction to Big Data and Big Data Design


BDA GCED

1
Knowledge objectives
1. Define the impedance mismatch
2. Identify applications handling different kinds of data
3. Name four different kinds of NOSQL systems
4. Explain three consequences of schema variability
5. Explain the consequences of physical independence
6. Explain the two dimensions to classify NOSQL systems according to
how they manage schema
7. Explain the three elements of the RUM conjecture

2
Understanding objectives
1. Decide whether two NOSQL designs have a more or less explicit/fixed
schema

3
Application objectives
1. Given a relatively small UML conceptual diagram, translate it into a
logical representation of data considering flexible schema
representation

4
From SQL to NOSQL
The need for alternative families of database technologies

5
Law of the instrument
“Over-reliance on a familiar tool.”
Wikipedia

If the only tool you have is a hammer, everything looks like a nail.
A. Maslow

• Golden hammer anti-pattern: “A familiar technology or concept applied
obsessively to many software problems.”

6
Law of the Relational Database
• Since we only know relational databases, every time we want to model a
new domain we automatically think about how to represent it as columns
and rows

Object-relational impedance mismatch is “… one in which a program written
using an object-oriented language uses a relational database for storage.”
Ireland et al.
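
The mismatch can be made concrete with a small sketch. It assumes a hypothetical `Order` class and two hypothetical tables (`orders`, `order_items`) in an in-memory SQLite database; none of these names come from the slides. One in-memory object has to be flattened into rows of two tables on save and reassembled on load.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Order:
    """Hypothetical application object: one order with a nested list of items."""
    order_id: int
    customer: str
    items: list = field(default_factory=list)  # list of (product, qty) tuples


def save(conn, order):
    """Flatten one object into rows of two tables."""
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order.order_id, order.customer))
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(order.order_id, product, qty) for product, qty in order.items],
    )


def load(conn, order_id):
    """Reassemble the object from rows of both tables."""
    (customer,) = conn.execute(
        "SELECT customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT product, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    return Order(order_id, customer, [tuple(row) for row in rows])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, product TEXT, qty INTEGER)")

original = Order(1001, "Ann", [("03214", 3), ("03222", 1)])
save(conn, original)
restored = load(conn, 1001)
```

The `save` and `load` functions are exactly the translation code that an object-oriented program has to carry when its storage is relational.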

7
The end of an architectural era
Big Data
• Web 2.0 – Write Era
• Real-time communication
• Semi-structured & unstructured data
• Ubiquitous and concurrent

Maria Belen Bianchi

8
RDBMS: why aren’t they enough?

Michael Stonebraker and Ugur Çetintemel. One size fits all: an idea whose time has come and gone. ACM Books, p: 441-462, 2019.

RDBMS
• Generic architecture that can be tuned according to the needs:
  • Mainly-write OLTP Systems
    • Normalization
    • Indexes: B+, Hash
    • Joins: BNL, RNL, Hash-Join, Merge Join
  • Read-only DW (OLAP) Systems
    • Denormalized data
    • Indexes: Bitmaps
    • Joins: Star-join
    • Materialized views

Designed for consistency and integrity, making them excellent for applications with complex queries and transaction processing.

We need to deal with massive reads and writes at the same time!
• Data fragmentation
• Data replication

Distributed RDBMS
9
Distributed RDBMS: limitations
• ACID Transactions
• Relational databases follow ACID (Atomicity, Consistency, Isolation, Durability) properties
to ensure data integrity. Maintaining them across distributed nodes introduces
complexity and performance bottlenecks
• Locking and Contention
• To enforce data integrity, relational databases use locking mechanisms to manage
simultaneous transactions, which is very costly in a distributed environment.
• Schema Rigidness
• Relational databases rely on predefined schemas, which makes it difficult to adapt to
changes in the data structure without significant overhead
• Joins across Nodes
• Queries in relational models often involve complex joins between tables. Distributing
tables across multiple nodes makes these joins inefficient and costly

We need alternative data models and architectures!

10


NOSQL
Different problems entail different solutions

11
New challenges for data management

• Volume
• Velocity
• Variety
• Variability
• Veracity
12
NOSQL Goals
• Schemaless: Allow flexible (even runtime) schema definition
• Reliability / availability: Keep delivering service even if its software or
hardware components fail
• Scalability: Continuously evolve to support a growing number of tasks
• Efficiency: How well the system performs, usually measured in terms of
response time (latency) and throughput (bandwidth)

13
Aggregate data models
• The relational model with simple record structures, referential integrity,
transactions … is not suitable for distribution -> not designed to run on
clusters
• We need to operate on data in units that have a more complex structure
(complex records)
• Think of a complex record as a structure that allows lists and other
record structures to be nested inside it
• These complex records are sometimes referred to as aggregates

14
Aggregate data models
• Aggregate orientation fits well with scaling out (i.e., use lots of small
machines in a cluster)
• The aggregate is a natural unit for distribution
• The aggregate makes a natural unit for replication and sharding

• Key-value, document, and column-family databases all make use of
this more complex record
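
As a minimal sketch of why the aggregate is a convenient unit for sharding and replication, the code below places whole aggregates on nodes by hashing their key. The node names, the key format and the choice of two copies are assumptions made for illustration; real systems use more elaborate placement schemes.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster members


def node_for(key: str) -> str:
    """Pick the primary node for an aggregate from a stable hash of its key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def replicas_for(key: str, copies: int = 2) -> list:
    """Primary node plus the next nodes in the list, so each aggregate has `copies` replicas."""
    start = NODES.index(node_for(key))
    return [NODES[(start + i) % len(NODES)] for i in range(copies)]


# Each aggregate travels as a whole: the nested data is never split across nodes.
aggregates = {
    "order:1001": {"customer": "Ann", "line_items": [["03214", 3], ["03222", 1]]},
    "order:1002": {"customer": "Dan", "line_items": [["05512", 4]]},
}
placement = {key: replicas_for(key) for key in aggregates}
```

Because everything needed to serve one order lives inside one aggregate, a lookup by key touches a single node.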

15
Example of Relations and Aggregates (1)

Relational database perspective: no aggregates

source: Martin Fowler, NoSQL Distilled

16


Example of Relations and Aggregates (2)

Relational data model: everything is properly normalized

source: Martin Fowler, NoSQL Distilled

17


Example of Relations and Aggregates (3)

Two main aggregates: Customer and Order

source: Martin Fowler, NoSQL Distilled

18


Example of Relations and Aggregates (4)
• There are two main aggregates: customer and
order
• The customer contains a list of billing
addresses and a name; the order contains a list
of order items, a shipping address, and a list
of payments. Each payment contains a billing
address for that payment.
• A single logical address record appears three
times in the example data, but, instead of using
IDs, it is treated as a value and copied each time.
• The link between the customer and the order is
not an aggregation.
• The product name is part of the order to
minimize the number of aggregates we access
during a data interaction
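
The two aggregates described above can be sketched as nested Python structures. All field names and values here are illustrative assumptions, not the slide's actual data: the address is copied by value three times, the order points to the customer only through an ID, and the product name is duplicated inside the order item.

```python
import copy

address = {"city": "Chicago"}  # illustrative value

# Customer aggregate: a name and a list of billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [copy.deepcopy(address)],
}

# Order aggregate: order items, a shipping address and a list of payments.
order = {
    "id": 99,
    "customer_id": customer["id"],  # a link, not an aggregation
    "order_items": [
        # product_name is copied into the order to avoid reading a product aggregate
        {"product_id": 27, "price": 32.45, "product_name": "NoSQL Distilled"},
    ],
    "shipping_address": copy.deepcopy(address),
    "payments": [
        {"txn_id": "txn-1", "billing_address": copy.deepcopy(address)},
    ],
}

# The customer moves: only the customer aggregate changes, because the
# addresses inside the order are independent copies.
customer["billing_addresses"][0]["city"] = "Boston"
```

Treating the address as a value means an old order keeps the address that was valid when it was placed.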

source: Martin Fowler, NoSQL Distilled

19


Example of Relations and Aggregates (4)

An alternative way of aggregating data!

source: Martin Fowler, NoSQL Distilled

20


Consequences of Aggregate Orientation (1)
• The fact that an order consists of order items, a shipping address, and a payment can
be expressed in the relational model in terms of foreign key relationships but there
is nothing to distinguish relationships that represent aggregations from those
that don’t. As a result, the database can’t use the knowledge about an aggregate
structure to help it store and distribute the data
• Aggregation is, however, not a logical data property: it is all about how the data is
being used by applications, a concern that is often outside the boundary of data
modeling
• Also, an aggregate structure may help with some data interactions but be an
obstacle for others (in our example, to get to product sales history, you’ll have to
dig into every aggregate in the database)
• The clinching reason for aggregate orientation is that it helps greatly with running
on a cluster!
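
The product sales history case can be sketched as follows, assuming orders are stored as aggregates with embedded line items (field names are hypothetical). Answering a per-product question means opening every order aggregate, because nothing is keyed by product.

```python
from collections import defaultdict

# Order aggregates with embedded line items (illustrative field names).
orders = [
    {"id": 1001, "line_items": [
        {"product": "03214", "qty": 3, "total": 150},
        {"product": "03222", "qty": 1, "total": 40},
        {"product": "04114", "qty": 2, "total": 20},
    ]},
    {"id": 1002, "line_items": [
        {"product": "05512", "qty": 4, "total": 200},
        {"product": "03711", "qty": 2, "total": 30},
    ]},
]


def sales_by_product(all_orders):
    """Full scan: every aggregate is read, even if only one product is of interest."""
    totals = defaultdict(int)
    aggregates_read = 0
    for order in all_orders:
        aggregates_read += 1
        for item in order["line_items"]:
            totals[item["product"]] += item["total"]
    return dict(totals), aggregates_read


sales, aggregates_read = sales_by_product(orders)
```

Fetching one order by its ID reads one aggregate; the sales history reads all of them.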

source: Martin Fowler, NoSQL Distilled

21


Consequences of Aggregate Orientation (2)
Aggregates have an important consequence for transactions
• Relational databases allow you to manipulate any combination of rows from any
tables in a single (ACID) transaction (i.e., Atomic, Consistent, Isolated, and Durable)
• It’s often said that NoSQL databases don’t support ACID transactions and thus
sacrifice consistency. This is, however, not true for graph databases (which are, like
relational databases, aggregate-agnostic)
• In general, it’s true that aggregate-oriented databases don’t have ACID transactions
that span multiple aggregates (rows). Instead, they support atomic manipulation of
a single aggregate (row) at a time: this means that if we need to manipulate
multiple aggregates in an atomic way, we have to manage that ourselves in the
application code!
• In practice, we find that most of the time we are able to keep our atomicity needs to
within a single aggregate; indeed, that is part of the consideration for deciding
how to divide up our data into aggregates
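
A toy in-memory store can illustrate single-aggregate atomicity. The `AggregateStore` class and its methods are hypothetical and only model the idea: a change to one aggregate is applied to a private copy and published only if it completes.

```python
import copy
import threading


class AggregateStore:
    """Toy store: each update touches exactly one aggregate, atomically."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, change):
        """Apply change(draft) to a copy; publish it only if no exception is raised."""
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            change(draft)
            self._data[key] = draft


store = AggregateStore()
store.put("order:1001", {"status": "open", "line_items": [{"product": "03214", "qty": 3}]})


def add_item_and_close(order):
    # Two modifications inside one aggregate: both become visible together.
    order["line_items"].append({"product": "03222", "qty": 1})
    order["status"] = "closed"


def failing_change(order):
    order["status"] = "cancelled"
    raise ValueError("simulated failure halfway through")


store.update("order:1001", add_item_and_close)
try:
    store.update("order:1001", failing_change)
except ValueError:
    pass  # the half-applied change was never published

final_order = store.get("order:1001")
```

A change that spans two keys would need two separate `update` calls, and handling a failure between them is left to the application.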

source: Martin Fowler, NoSQL Distilled

22


Different data models
• Relational (OLTP)
• Multidimensional (OLAP)
• Key-Value
• Wide-Column (Column-family)
• Document
• Graph

By Aina Montalban, inspired by Daniel G. McCreary and Ann M. Kelly

23
One size does not fit all

Michael Stonebraker and Ugur Çetintemel. One size fits all: an idea whose time has come and gone. ACM Books, p: 441-462, 2019.

Different problems entail different solutions

• OLTP
  • VoltDB, HANA, Hekaton
• Data warehousing and OLAP
  • Vertica, Redshift, Sybase IQ
• Scientific data
  • R, Matlab, SciDB
• Stream processing
  • Spark Streaming, Flink, Storm
• Semantic Web and Open Data
  • GraphDB, Stardog, Virtuoso
• Text
  • ElasticSearch, Google File Syst.
• Documents (XML, JSON)
  • MongoDB, CouchDB

24
Evolution of different data models

R. Angles and C. Gutierrez

25
Schema definition

26
Schema variability
• CREATE TABLE Students(id int, name varchar(50), surname varchar(50), enrolment date);
• INSERT INTO Students VALUES (1,'Sergi','Nadal','01/01/2012',true,'Igualada'); WRONG (six values for four columns)
• INSERT INTO Students VALUES (1,'Sergi','Nadal',NULL); OK
• INSERT INTO Students VALUES (1,'Sergi','Nadal','01/01/2012'); OK

• Schemaless → INSERT INTO Students (1, {'Sergi', 'Nadal', '01/01/2012', true});

• Consequences
• Gain flexibility
• Lose semantics (also consistency)
• The data independence principle is lost (!)
• The ANSI / SPARC architecture is not followed → Implicit schema
• Applications can access and manipulate the database internal structures

27
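ANSI/SPARC (recap)

• External schemas
• Conceptual schema
• Internal schema

Logical independence: between the external schemas and the conceptual schema
Physical independence: between the conceptual schema and the internal schema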

28
Database models

RELATIONAL
• Based on the relational model
  • Tables, rows and columns
  • Sets, instances and attributes
• Constraints are allowed
  • PK, FK, Check, …
• When creating the tables you MUST specify their schema (i.e., columns and constraints)
• Data is restructured when brought into memory (impedance mismatch)

NOSQL
• No single reference model
  • Key-value, document, stream, graph
• Ideally, the schema should be defined at insertion time and not at definition time (schemaless databases)
• The closer the data model in use looks to the way data is stored internally the better (read/write throughput)
30
Considered database models
• Relational
city(name, population, region) VALUES ('BCN', '2,000,000', 'CAT')
• Key-Value
['BCN', '2,000,000;CAT']
• Document
{id:'BCN', population:'2,000,000', region:'CAT'}
• Wide-Column (Column-Family)
['BCN', population:{value:'2,000,000'}, region:{value:'CAT'}]
['BCN', all:{value:'2,000,000;CAT'}]
['BCN', all:{population:'2,000,000';region:'CAT'}]
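
The same city can be written down as Python structures to compare how much of the schema each model exposes. This is only a sketch: the variable and function names are assumptions, and nested dictionaries stand in for the real storage engines.

```python
# Relational: a tuple whose meaning comes from the declared columns
# (name, population, region).
relational_row = ("BCN", "2,000,000", "CAT")

# Key-value: the value is one opaque string; only the application knows the ';' layout.
key_value = {"BCN": "2,000,000;CAT"}

# Document: field names travel with the data.
document = {"id": "BCN", "population": "2,000,000", "region": "CAT"}

# Wide-column: three alternative designs for the same row key.
wide_one_family_per_attribute = {
    "BCN": {"population": {"value": "2,000,000"}, "region": {"value": "CAT"}}
}
wide_single_opaque_column = {"BCN": {"all": {"value": "2,000,000;CAT"}}}
wide_single_family = {"BCN": {"all": {"population": "2,000,000", "region": "CAT"}}}


def population_from_key_value(store, key):
    """The application must parse the value: the schema is implicit."""
    population, _region = store[key].split(";")
    return population


def population_from_document(doc):
    """The store can see the field: the schema is explicit in each document."""
    return doc["population"]
```

The second wide-column design is as opaque as the key-value one, while the first and third let the store address the population on its own.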

31
Relevant schema dimensions
Some new models lack an explicit schema (declared by the user)
• An implicit schema (hidden in the application code) always remains
• May reduce the impedance mismatch

The schema is in the mind of the developer/program

In the same collection I can’t have different schemas

32
Key-value and Document
Data Models

33
Key-value and Document Data Models
Key-value and document databases are strongly aggregate-oriented
• In a key-value database, the aggregate is opaque to the database: just some big
blob of bits. The advantage of opacity is that we can store whatever we like in the
aggregate. It is the responsibility of the application to understand what was stored.
Since key-value stores always use primary-key access, they generally have great
performance.
• In contrast, a document database is able to see a structure in the aggregate, but
imposes limits on what we can place in it, defining allowable structures and types.
In return, however, we get more flexibility when accessing data.
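
A minimal sketch of the difference, with a dictionary of bytes standing in for a key-value store and a list of dictionaries standing in for a document collection. The underscore-separated value encoding, the field names and the helper functions are assumptions for illustration.

```python
# Key-value: the store sees only bytes; the encoding is known to the application alone.
kv_store = {"1001": b"03214_$50_3_$150,03222_$40_1_$40"}


def parse_line_items(blob: bytes):
    """Application-side decoding of the opaque value."""
    items = []
    for chunk in blob.decode("utf-8").split(","):
        line_num, price, qty, total = chunk.strip().split("_")
        items.append({
            "line": line_num,
            "price": int(price.lstrip("$")),
            "qty": int(qty),
            "total": int(total.lstrip("$")),
        })
    return items


# Document: the store sees the structure, so it can filter on inner fields.
doc_store = [
    {"id": 1001, "customer": "Ann", "line_items": [
        {"line": "03214", "price": 50, "qty": 3, "total": 150},
        {"line": "03222", "price": 40, "qty": 1, "total": 40},
    ]},
    {"id": 1002, "customer": "Dan", "line_items": [
        {"line": "05512", "price": 50, "qty": 4, "total": 200},
    ]},
]


def find_order_ids(collection, predicate):
    """Stand-in for a document query: select documents by their content."""
    return [doc["id"] for doc in collection if predicate(doc)]


decoded = parse_line_items(kv_store["1001"])
big_orders = find_order_ids(
    doc_store, lambda doc: any(item["total"] >= 200 for item in doc["line_items"])
)
```

With the key-value store the only server-side operation is the lookup of key "1001"; everything else happens in `parse_line_items`.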

34
Orders example (1NF)

Tables: Customer, Credit card, Orders, Order lines (Line items)

Customer
CustKey  Name  Phone
1        Ann   234
2        Dan   211

CreditCard
CustKey  CCNum  Expiry
1        02345  04/28
2        01221  05/24

Orders
OrderID  CustKey  Price
1001     1        $210
1002     2        $230

LineItem
OrderKey  LineNum  PriceItem  Qty  TotPrice
1001      03214    $50        3    $150
1001      03222    $40        1    $40
1001      04114    $10        2    $20
1002      05512    $50        4    $200
1002      03711    $15        2    $30

35
Key-values: Orders Example

Key   Value
1001  03214_$50_3_$150, 03222_$40_1_$40 …

In a key-value store, we can only access an aggregate by lookup based on its key

LineItem
OrderKey  LineNum  PriceItem  Qty  TotPrice
1001      03214    $50        3    $150
1001      03222    $40        1    $40
1001      04114    $10        2    $20
1002      05512    $50        4    $200
1002      03711    $15        2    $30

36
The Document data model – Orders Example

Document (.json)
ID: 1001
customer: Ann
line items:
  03214  $50  3  $150
  03222  $40  1  $40
  04114  $10  2  $20
payment details:
  card: Amex
  cc number: 12345
  expiry: 04/28

Orders
OrderID  CustKey  Price
1001     1        $210
1002     2        $230

Customer
CustKey  Name  Phone
1        Ann   234
2        Dan   211

LineItem
OrderKey  LineNum  PriItem  Qty  TotPrice
1001      03214    $50      3    $150
1001      03222    $40      1    $40
1001      04114    $10      2    $20
1002      05512    $50      4    $200
1002      03711    $15      2    $30

CreditCard
CustKey  CCNum  Expiry
1        02345  04/28
2        01221  05/24

An order, which looks like a single document
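
How such a document could be assembled from the relational tables can be sketched as below. The tables are held as plain Python structures, and the function name `order_document` and the document field names are assumptions; the card number is taken from the CreditCard table.

```python
import json

# The four tables as plain Python structures.
customers = {1: {"name": "Ann", "phone": "234"}, 2: {"name": "Dan", "phone": "211"}}
credit_cards = {
    1: {"cc_num": "02345", "expiry": "04/28"},
    2: {"cc_num": "01221", "expiry": "05/24"},
}
orders = [(1001, 1, 210), (1002, 2, 230)]  # (OrderID, CustKey, Price)
line_items = [  # (OrderKey, LineNum, PriceItem, Qty, TotPrice)
    (1001, "03214", 50, 3, 150),
    (1001, "03222", 40, 1, 40),
    (1001, "04114", 10, 2, 20),
    (1002, "05512", 50, 4, 200),
    (1002, "03711", 15, 2, 30),
]


def order_document(order_id):
    """Embed the rows related to one order into a single nested document."""
    oid, cust_key, price = next(o for o in orders if o[0] == order_id)
    return {
        "id": oid,
        "customer": customers[cust_key]["name"],
        "price": price,
        "line_items": [
            {"line_num": ln, "price_item": p, "qty": q, "tot_price": t}
            for (order_key, ln, p, q, t) in line_items
            if order_key == oid
        ],
        "payment_details": dict(credit_cards[cust_key]),
    }


doc_1001 = order_document(1001)
as_json = json.dumps(doc_1001, indent=2)
```

The joins are paid once, when the document is built; reading the order afterwards needs no join at all.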


37
The Document data model - Characteristics
• Document Structure: Data is stored in formats like JSON, BSON, or XML, making it
easy to represent hierarchical and complex data
• Embedded Data: Allows embedding of related data within a single document to reduce the need
for joins and improve performance
• Schema Flexibility: Supports dynamic, schema-less structures, allowing for easy
storage and retrieval of unstructured or semi-structured data
• Scalability and Reliability: Designed for horizontal scaling, making it easy to distribute and
replicate data across multiple nodes or servers
• Efficient Data Access: Optimized for fast read and write operations, especially for applications with
frequently changing data requirements
• Rich Query Capabilities: Supports complex queries, indexing, and aggregations tailored to
document structures

38
The Column-Family data model – Example
Column-Families are organized in terms of distributed maps

{
  "1001" : {
    "profile" : {
      "customer": "Ann",
      "card": "Amex",
      "cc_number": "12345",
      "expiry": "04/28"},
    "line-items" : {
      "items": "[[03214,$50,3,$150],…]"}
  },
  "1002" : {
    "profile" : "…",
    "line-items" : "…"
  }
}

Legend: "1001" is a row identifier; "profile" is a column family; "customer", "card", "cc_number" and "expiry" are columns.

The column-family model can be seen as a two-level aggregate structure
• As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest
• This row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns, each being a key-value pair
• Columns are organized into column families. Each column has to be part of a single column family (data for a particular column family will usually be accessed together)
• Each row identifier (i.e., first-level key) is unique
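
The two-level structure can be mimicked with nested dictionaries. This is a sketch only: the helper functions are hypothetical and do not correspond to the API of any particular column-family store, and only row "1001" is filled in.

```python
# row identifier -> column family -> column -> value
store = {
    "1001": {
        "profile": {
            "customer": "Ann",
            "card": "Amex",
            "cc_number": "12345",
            "expiry": "04/28",
        },
        "line-items": {"items": "[[03214,$50,3,$150],…]"},
    },
}


def get_family(table, row_id, family):
    """Read one column family of one row: the columns that are usually accessed together."""
    return dict(table[row_id][family])


def get_column(table, row_id, family, column):
    """Read a single column: row identifier first, then family, then column name."""
    return table[row_id][family][column]


def put_column(table, row_id, family, column, value):
    """Add a column to a row without declaring it anywhere (rows may be sparse)."""
    table.setdefault(row_id, {}).setdefault(family, {})[column] = value


profile = get_family(store, "1001", "profile")
put_column(store, "1001", "profile", "phone", "234")
```

Reading `profile` does not touch `line-items`, which is the point of grouping columns into families.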

39
The Column-Family data model - Characteristics
• Column Family Structure: Data is organized into rows and columns grouped into column families,
allowing efficient data retrieval
• Sparse Data Handling: Optimized for storing sparse datasets, where rows can have a variable
number of columns, saving storage space
• High Write Performance: Designed for high-throughput write operations, making it suitable for
write-heavy workloads
• Scalability: Supports horizontal scaling across distributed nodes, ideal for handling large volumes of
data
• Efficient Data Access: Queries are optimized to read only the necessary columns within a column
family, improving performance
• Data Locality: Related columns are stored together on disk, making access patterns efficient for
certain types of queries
• Flexible Schema: Allows easy addition of new columns to existing rows without schema changes,
supporting evolving data requirements

40
Design choices
• Denormalization

• Partitioning/Fragmenting
• Horizontal
• Vertical

• Data placement
• Distribution
• Clustering
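
Horizontal and vertical partitioning can be sketched on a tiny table. The function names, the modulo placement rule and the choice of column groups are assumptions for illustration.

```python
rows = [
    {"id": 1, "name": "Ann", "phone": "234"},
    {"id": 2, "name": "Dan", "phone": "211"},
]


def horizontal_partition(table, n_parts, key="id"):
    """Split by rows: each fragment holds complete rows for a subset of the keys."""
    parts = [[] for _ in range(n_parts)]
    for row in table:
        parts[row[key] % n_parts].append(row)
    return parts


def vertical_partition(table, column_groups, key="id"):
    """Split by columns: each fragment holds the key plus one group of columns."""
    return [
        [{key: row[key], **{col: row[col] for col in group}} for row in table]
        for group in column_groups
    ]


def rebuild(fragments, key="id"):
    """Rejoin vertical fragments on the key to recover the original rows."""
    merged = {}
    for fragment in fragments:
        for piece in fragment:
            merged.setdefault(piece[key], {}).update(piece)
    return [merged[k] for k in sorted(merged)]


by_rows = horizontal_partition(rows, 2)
by_columns = vertical_partition(rows, [["name"], ["phone"]])
```

Every vertical fragment repeats the key, which is what makes the original rows recoverable.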

43
Alternative storage structures

44
The problem is not SQL
• Relational systems are too generic…
• OLTP: stored procedures and simple queries
• OLAP: ad-hoc complex queries
• Documents: large objects
• Streams: time windows with volatile data
• Scientific: uncertainty and heterogeneity
• …but the overhead of RDBMS has nothing to do with SQL
• Low-level, record-at-a-time interface is not the solution

Michael Stonebraker
SQL Databases vs. NoSQL Databases
Communications of the ACM, 53(4), 2010

45
The RUM conjecture
“Designing access methods that set an upper bound for two of the RUM
overheads, leads to a hard lower bound for the third overhead which
cannot be further reduced.”

M. Athanassoulis et al.

46
Example of RUM conjecture
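• Index: Find(x) goes through the index, at the cost of space overhead and having to update the index
• Sorted data: Find(x) uses binary search, but updating means reordering
• Append-only data: Update(x) is an append, at the cost of space overhead (appends) and x being hard to find

M. Athanassoulis et al.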
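
The trade-off can be observed on two toy access methods that count their own work. This is a rough sketch under simplifying assumptions: the classes are hypothetical, cost is counted in slots written or probed rather than in I/O, and the workload is chosen to be unfavourable to the sorted array.

```python
import bisect


class AppendLog:
    """Cheap updates (append), expensive reads (scan), space wasted on old versions."""

    def __init__(self):
        self.entries = []

    def update(self, key, value):
        self.entries.append((key, value))
        return 1  # one slot written

    def find(self, key):
        probes = 0
        for k, v in reversed(self.entries):  # newest version first
            probes += 1
            if k == key:
                return v, probes
        return None, probes

    def slots(self):
        return len(self.entries)


class SortedArray:
    """Cheap reads (binary search), no wasted space, expensive updates (shifting)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
            return 1  # overwrite in place
        self.keys.insert(i, key)
        self.values.insert(i, value)
        return len(self.keys) - i  # the new slot plus every entry shifted right

    def find(self, key):
        lo, hi, probes = 0, len(self.keys), 0
        while lo < hi:
            mid = (lo + hi) // 2
            probes += 1
            if self.keys[mid] == key:
                return self.values[mid], probes
            if self.keys[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return None, probes

    def slots(self):
        return len(self.keys)


def run(structure):
    """Insert keys 100..1, then overwrite keys 1..100, then look up key 1."""
    written = sum(structure.update(k, "v1") for k in range(100, 0, -1))
    written += sum(structure.update(k, "v2") for k in range(1, 101))
    value, probes = structure.find(1)
    return {"written": written, "probes": probes, "slots": structure.slots(), "value": value}


log_cost = run(AppendLog())
array_cost = run(SortedArray())
```

On this workload the log writes little and reads a lot while keeping stale versions, and the sorted array does the opposite; neither is cheap on all three counts.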

47
RUM classification space

Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, Mark Callaghan: Designing Access Methods: The RUM Conjecture. EDBT 2016.

LSM M. Athanassoulis et al.

48
Data Storage

RELATIONAL
• Generic architecture that can be tuned according to the needs:
  • Mainly-write OLTP Systems
    • Normalization
    • Indexes: B+, Hash
    • Joins: BNL, RNL, Hash-Join, Merge Join
  • Read-only DW Systems
    • Denormalized data
    • Indexes: Bitmaps
    • Joins: Star-join
    • Materialized Views

NOSQL
• Specific architectures for a specific need:
  • Primary indexes
  • Sequential reads
  • Vertical partitioning
  • Compression
  • Fixed-size values
  • In-memory processing
• Very specific, and good (very good) at solving a particular problem

49
Different internal structures

• B-tree: MongoDB, Riak
• LSM: HBase, Cassandra, RocksDB
• Vertical Partitioning: Parquet, Hana

Aina Montalban

50
51
Closing

52
Summary
• NOSQL systems
• Schemaless databases
• Impedance mismatch
• Aggregate data models
• Key-value data models
• Document data models
• Column-family data models

53
References

• Pramod J. Sadalage and Martin Fowler. NoSQL Distilled: A Brief Guide to
the Emerging World of Polyglot Persistence. Addison Wesley. 2013
• M. Stonebraker et al. The End of an Architectural Era (It's Time for a
Complete Rewrite). VLDB, 2007
• R. Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record 39(4),
2010
• M. Stonebraker. SQL Databases vs. NoSQL Databases. Communications
of the ACM, 53(4), 2010

54
