Webinar Day 1
Intro
When did it start?
«Visualization provides an interesting
challenge for computer systems: data sets are generally quite large,
taxing the capacities of main memory, local disk, and even remote
disk. We call this the problem of big data.»
Characteristics requiring
• Multi-region availability
• Very fast and reliable response
• No single point of failure
Big Data - Intro
Why not relational data?
Relational model provides
NoSQL - Intro
A Brief History of Databases
• Formally, a "database" refers to a set of related
data and the way it is organized
• Database management systems are
often classified according to the database model that
they support; the most popular database systems since
the 1980s have all supported the relational model
• In 1970, Edgar Codd (IBM) wrote a number of papers
that outlined a new approach to database
construction that eventually culminated in the
groundbreaking “A Relational Model of Data for Large
Shared Data Banks”
• Codd's idea was to use a "table" of fixed-length records, with each
table used for a different type of entity
• Earlier linked-list systems stored sparse data inefficiently, because
fields left empty still took up space in every record. The relational
model solved this by splitting the data into a series of normalized
tables, with optional elements moved out of the main table
• IBM started working on a prototype system loosely
based on Codd's concepts as System R in the early
1970s. The first version was ready in 1975
• Larry Ellison's Oracle started from a
different chain, based on IBM's papers on
System R. Oracle V1 was completed in 1978, and
Oracle Version 2 beat IBM to market in 1979
• Codd, after his extensive research on the
Relational Model of database systems, came
up with twelve rules of his own, which
according to him, a database must obey in
order to be regarded as a
true relational database
1 Information Rule
2 Guaranteed Access Rule
3 Systematic Treatment of NULL Values
4 Active Online Catalog
5 Comprehensive Data Sub-Language Rule
6 View Updating Rule
7 High-Level Insert, Update, and Delete Rule
8 Physical Data Independence
9 Logical Data Independence
10 Integrity Independence
11 Distribution Independence
12 Non-Subversion Rule
RDBMS Theory and Architecture
• Database normalization, or simply normalization, is the
process of organizing the columns (attributes) and
tables (relations) of a relational database
to reduce data redundancy and improve data integrity
• Normalization is accomplished through applying some
formal rules either by a process
of synthesis or decomposition
• Most 3NF tables are free of insertion, update,
and deletion anomalies
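A minimal sketch of normalization by decomposition, assuming a hypothetical orders table (the column names and values are illustrative, not taken from the slides). In the flat table the customer's city is repeated in every order row, so an update can leave rows inconsistent. After the split each fact is stored once.

```python
# Hypothetical denormalized rows: the customer's city is repeated per order.
orders_flat = [
    {"order_id": 1, "customer": "alice", "city": "Rome", "total": 30},
    {"order_id": 2, "customer": "alice", "city": "Rome", "total": 12},
    {"order_id": 3, "customer": "bob", "city": "Oslo", "total": 7},
]


def decompose(rows):
    """Split the flat rows into a customers table and an orders table."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer's city is stored once, keyed by customer name.
        customers[row["customer"]] = {"city": row["city"]}
        # The order keeps only a reference (foreign key) to the customer.
        orders.append({"order_id": row["order_id"],
                       "customer": row["customer"],
                       "total": row["total"]})
    return customers, orders


customers, orders = decompose(orders_flat)
# Changing the city now touches a single record instead of every order row.
customers["alice"]["city"] = "Milan"
print(customers)
print(orders)
```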
• Structured Query Language (SQL) is a domain-
specific language used in programming and designed for
managing data held in a relational database management system
• SQL consists of a data definition language (DDL), data
manipulation language (DML), and data control language (DCL)
• DDL Example
• DML Example
• DCL Example
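The original slide examples did not survive in this text. As an illustrative sketch only, the statements below run DDL and DML against an in-memory SQLite database through Python's standard sqlite3 module. The table, columns, and user name are assumptions. SQLite has no user accounts, so the DCL statement is shown as text and is not executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema.
conn.execute(
    "CREATE TABLE employee ("
    "emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary INTEGER)"
)

# DML: insert, update, and query rows.
conn.execute(
    "INSERT INTO employee (emp_id, name, salary) VALUES (?, ?, ?)",
    (1, "Smith", 40000),
)
conn.execute("UPDATE employee SET salary = salary + 5000 WHERE emp_id = ?", (1,))
rows = conn.execute("SELECT emp_id, name, salary FROM employee").fetchall()
print(rows)

# DCL: not executed here, because SQLite does not implement GRANT or REVOKE.
dcl_example = "GRANT SELECT ON employee TO report_user;"
print(dcl_example)
conn.close()
```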
ACID capabilities
• Atomicity requires that each transaction be "all
or nothing": if one part of the transaction fails, the
entire transaction fails, and the database state is
left unchanged
• The Consistency property ensures that
any transaction will bring the database from
one valid state to another
o Any data written to the database must be
valid according to all defined rules,
including constraints, cascades, triggers, and any
combination thereof
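Atomicity and consistency can be seen together in a small sketch, assuming a hypothetical account table in SQLite with a CHECK constraint as the "defined rule". The second transfer violates the constraint, so the whole transaction is rolled back, including the update that had already succeeded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(conn, "a", "b", 30)
second_ok = transfer(conn, "a", "b", 500)  # would make the balance negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(first_ok, second_ok, balances)
```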
• The Isolation property ensures that the concurrent
execution of transactions results in a system state that
would be obtained if transactions were
executed serially, i.e., one after the other
• Write Ahead Logging (WAL) is a family of techniques for
providing atomicity and durability
o In a system using WAL, all modifications are written
to a log before they are applied
• If a Write Ahead Log is used, the program can check this
log and compare what it was supposed to be doing
when it unexpectedly lost power to what was actually
done
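The log-then-apply idea can be sketched as a toy key-value store. This is a simplified, redo-only illustration under stated assumptions: the class name and the JSON log format are hypothetical, and real WAL implementations also handle checkpoints, torn writes, and undo records.

```python
import json
import os
import tempfile


class TinyWalStore:
    """A toy key-value store that logs each change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self.recover()

    def recover(self):
        """Rebuild the in-memory state by replaying the log."""
        self.data = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the intended change to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change to the actual data.
        self.data[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyWalStore(log_path)
store.put("x", 1)
store.put("x", 2)

# Simulate a crash: the in-memory state is lost, but the log survives.
recovered = TinyWalStore(log_path)
print(recovered.data)
```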
• Another way to implement atomic updates is
shadow paging, a copy-on-write technique that does not modify pages in place
• When the page is ready to become durable, all pages
that referred to the original are updated to refer to the
new replacement page instead.
• Two-Phase Commit protocol (2PC) is a type
of atomic commitment protocol (ACP). It is
a distributed algorithm that coordinates all the processes
that participate in a distributed atomic transaction
Two-Phase Commit (2 PC)
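The slide diagram is not reproduced here. As a rough sketch of the protocol, the coordinator first collects votes and then commits only if every vote is yes. The Participant class and its method names are hypothetical, and the sketch leaves out timeouts, crash recovery, and logging, which a real implementation needs.

```python
class Participant:
    """A hypothetical resource manager taking part in the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only on a unanimous yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


good = [Participant("db1"), Participant("db2")]
bad = [Participant("db1"), Participant("db2", can_commit=False)]
good_result = two_phase_commit(good)
bad_result = two_phase_commit(bad)
print(good_result, [p.state for p in good])
print(bad_result, [p.state for p in bad])
```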
• Three-Phase Commit protocol (3PC) is another
distributed algorithm which lets all nodes in a
distributed system agree to commit a transaction
Three-Phase Commit (3 PC)
Shared Nothing Architecture
• A shared nothing architecture (SN) is a
distributed computing architecture in which
each node is independent and self-sufficient, and there
is no single point of contention across the system
Brewer's theorem (CAP)
• Consistency: All database clients will read the same
value for the same query, even given concurrent
updates
• In February 2012, Eric Brewer provided
an updated perspective on his CAP theorem in the
article “CAP Twelve Years Later: How the ‘Rules’ Have
Changed”
https://www.researchgate.net/publication/220476881_CAP_Twelve_years_later_How_the_Rules_have_Changed
Database Sharding
• A database shard is a horizontal partition of data in a
database
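One common way to decide which shard holds a row is to hash its key. The sketch below assumes four in-memory shards and a hypothetical user id as the shard key. Hash-modulo routing is only one option; range-based or directory-based schemes are also used.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Each shard holds a horizontal slice of the rows.
shards = [dict() for _ in range(4)]


def put(key, row):
    shards[shard_for(key, len(shards))][key] = row


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


for user_id in ("u1", "u2", "u3", "u4", "u5"):
    put(user_id, {"id": user_id})
print(get("u3"), [len(s) for s in shards])
```

With plain modulo routing, changing the number of shards remaps most keys, which is why production systems often prefer consistent hashing or a lookup directory.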
Security Approaches
NoSQL Approach
• NoSQL database, also called Not Only SQL, is an
approach to data management and database
design that's useful for very large
sets of distributed data
• A basic classification based on data model:
❑ Key-value
❑ Document-oriented
❑ Columnar
❑ Graph-oriented
❑ Other (mixed)
• A key-value store, or key-value database, is a
data storage paradigm designed for storing, retrieving,
and managing associative arrays (also known as dictionaries or hash tables)
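A minimal in-memory sketch of the paradigm, assuming JSON-serializable values and a hypothetical cart key. The store treats each value as an opaque blob and offers only put, get, and delete by key.

```python
import json


class KeyValueStore:
    """A toy key-value store: values are opaque bytes looked up by key."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # The store does not inspect the value; it only keeps the bytes.
        self._items[key] = json.dumps(value).encode("utf-8")

    def get(self, key, default=None):
        raw = self._items.get(key)
        return default if raw is None else json.loads(raw)

    def delete(self, key):
        self._items.pop(key, None)


store = KeyValueStore()
store.put("cart:42", {"items": ["sku-1", "sku-2"], "total": 19.5})
cart = store.get("cart:42")
print(cart)
store.delete("cart:42")
print(store.get("cart:42"))
```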
Key-Value Store (JSON)
Horizontal scaling
Key-Value Store
Sample Retailers Web Application
NoSQL Approach
• Document-oriented databases are inherently
a subclass of the key-value store
Document-oriented Store
• In an RDBMS, data is first categorized into a
number of predefined types, and tables are created to
hold individual records of each type
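For contrast with predefined tables, a document store keeps each object as one self-describing document, and documents in the same collection may have different fields. The sketch below is a hypothetical in-memory collection, not the API of any particular product.

```python
class Collection:
    """A toy document collection with no predefined schema."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)  # store a copy of the document
        return doc_id

    def find(self, **criteria):
        """Return documents whose fields equal all the given criteria."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


products = Collection()
# Two documents in one collection, each with its own set of fields.
products.insert({"name": "T-shirt", "size": "M", "color": "blue"})
products.insert({"name": "Laptop", "ram_gb": 16, "ports": ["usb-c", "hdmi"]})
laptops = products.find(name="Laptop")
print(laptops)
```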
Horizontal scaling
Sample Retailers Web Application
Column-oriented Store
• A column-oriented DBMS is a database
management system that stores data tables as sections
of columns of data rather than as rows of data
• This simple table includes an employee identifier (EmpId) and some
other fields
• Row-based systems are designed to efficiently return data
for an entire row, or record, in as few operations as possible. This
matches the common use-case where the system is attempting to
retrieve information about a particular object
• Row-based systems are less efficient at set-wide operations over the whole
table. For instance, to find all the records in the example table that
have salaries between 40,000 and 50,000, the DBMS would have
to scan the entire data set looking for matching records
• To improve the performance of these sorts of operations, most
DBMSs support the use of database indexes
• A column-oriented database serializes all of the values of a column
together, then the values of the next column, and so on.
• For our example table, the data would be stored in this fashion:
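The slide's table did not survive in this text, so the sketch below uses hypothetical employee rows (EmpId, last name, first name, salary) to contrast the two layouts. In the column layout a salary range query reads one contiguous list instead of every record.

```python
# Hypothetical rows: (EmpId, last name, first name, salary).
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
]


def row_oriented(rows):
    """Serialize record by record: all fields of row 1, then row 2, ..."""
    return [list(row) for row in rows]


def column_oriented(rows):
    """Serialize column by column: all EmpIds, then all last names, ..."""
    return [list(column) for column in zip(*rows)]


columns = column_oriented(rows)
emp_ids, salaries = columns[0], columns[3]
# The range query touches only the salary column.
matching_ids = [emp_ids[i] for i, s in enumerate(salaries) if 40000 <= s <= 45000]
print(columns)
print(matching_ids)
```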
• Bigtable is a compressed, high performance, and proprietary data
storage system built on Google File System (GFS)
• Bigtable maps two arbitrary string values (row key and column key)
and a timestamp (hence three-dimensional mapping) into an
associated arbitrary byte array.
Bigtable Data Model
• Table is a collection of rows
• Row is a collection of column families
• Column family is a collection of columns
• Column is a collection of key-value pairs
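The nesting above can be modeled with plain dictionaries. This is an illustrative sketch, not Bigtable's API: the class name, the "family:qualifier" column key format, and the sample values are assumptions. It shows the mapping (row key, column key, timestamp) to a byte array.

```python
from collections import defaultdict


class TinyBigtable:
    """Toy model: row -> column family -> column -> {timestamp: bytes}."""

    def __init__(self):
        self._rows = defaultdict(
            lambda: defaultdict(lambda: defaultdict(dict))
        )

    def put(self, row_key, column_key, timestamp, value):
        family, qualifier = column_key.split(":", 1)
        self._rows[row_key][family][qualifier][timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the value at the timestamp, or the most recent version."""
        family, qualifier = column_key.split(":", 1)
        versions = self._rows[row_key][family][qualifier]
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


table = TinyBigtable()
table.put("com.example", "contents:html", 1, b"<html>v1</html>")
table.put("com.example", "contents:html", 2, b"<html>v2</html>")
latest = table.get("com.example", "contents:html")
older = table.get("com.example", "contents:html", timestamp=1)
print(latest, older)
```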
Sample Retailers Web Application
Graph-oriented Store
• A graph database is a database that uses graph structures for
semantic queries with nodes, edges and properties to represent and
store data
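A small property-graph sketch, assuming hypothetical customers, products, and a BOUGHT relationship. Nodes and edges carry properties, and the query is answered by following edges instead of joining tables.

```python
class Graph:
    """A toy property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, label, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, props))

    def neighbors(self, node_id, label):
        return [dst for src, lab, dst, _ in self.edges
                if src == node_id and lab == label]


def also_bought(graph, customer):
    """Products bought by customers who share a purchase with `customer`."""
    mine = set(graph.neighbors(customer, "BOUGHT"))
    others = {src for src, lab, dst, _ in graph.edges
              if lab == "BOUGHT" and dst in mine and src != customer}
    found = {p for other in others for p in graph.neighbors(other, "BOUGHT")}
    return sorted(found - mine)


g = Graph()
g.add_node("alice", kind="customer")
g.add_node("bob", kind="customer")
g.add_node("laptop", kind="product")
g.add_node("mouse", kind="product")
g.add_edge("alice", "BOUGHT", "laptop", quantity=1)
g.add_edge("bob", "BOUGHT", "laptop", quantity=1)
g.add_edge("bob", "BOUGHT", "mouse", quantity=2)
recommendations = also_bought(g, "alice")
print(recommendations)
```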
Column-oriented Store
Sample Retailers Web Application
Best Practices
• NoSQL data modeling often starts from the application-
specific queries as opposed to relational modeling
Design Patterns
• The Data Access Object pattern (or DAO pattern) is
used to separate low-level data access APIs
or operations from high-level business services
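A minimal DAO sketch, assuming a hypothetical product table in SQLite. All SQL lives in ProductDao, while PricingService holds the business rule and never touches the storage API, so the storage engine could be swapped without changing the service.

```python
import sqlite3


class ProductDao:
    """Data access object: hides the storage API behind a small interface."""

    def __init__(self, conn):
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS product (sku TEXT PRIMARY KEY, price REAL)"
        )

    def save(self, sku, price):
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO product VALUES (?, ?)", (sku, price)
            )

    def find(self, sku):
        row = self._conn.execute(
            "SELECT sku, price FROM product WHERE sku = ?", (sku,)
        ).fetchone()
        return None if row is None else {"sku": row[0], "price": row[1]}


class PricingService:
    """Business service: knows nothing about SQL."""

    def __init__(self, dao):
        self._dao = dao

    def apply_discount(self, sku, percent):
        product = self._dao.find(sku)
        if product is None:
            raise KeyError(sku)
        new_price = round(product["price"] * (100 - percent) / 100, 2)
        self._dao.save(sku, new_price)
        return new_price


dao = ProductDao(sqlite3.connect(":memory:"))
dao.save("sku-1", 20.0)
service = PricingService(dao)
discounted = service.apply_discount("sku-1", 25)
print(discounted, dao.find("sku-1"))
```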
• CQRS is a simple pattern that
strictly segregates the responsibility of
handling command input into an autonomous
system from the responsibility of handling side-
effect-free query / read access on the
same system
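The split can be sketched with two small classes, assuming a hypothetical shopping cart. Commands go through CartCommandHandler, which validates input and changes state. Queries go through CartReadModel, which only reads a precomputed view. In this toy version the view is refreshed synchronously; real systems often update it asynchronously.

```python
class CartReadModel:
    """Query side: side-effect-free reads from a denormalized view."""

    def __init__(self):
        self._summaries = {}

    def project(self, cart_id, items):
        # Called by the write side to refresh the view after each change.
        self._summaries[cart_id] = {"item_count": sum(items.values())}

    def summary(self, cart_id):
        return dict(self._summaries.get(cart_id, {"item_count": 0}))


class CartCommandHandler:
    """Command side: validates input and changes state."""

    def __init__(self, read_model):
        self._carts = {}
        self._read_model = read_model

    def add_item(self, cart_id, sku, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        items = self._carts.setdefault(cart_id, {})
        items[sku] = items.get(sku, 0) + quantity
        self._read_model.project(cart_id, items)


read_model = CartReadModel()
commands = CartCommandHandler(read_model)
commands.add_item("cart-1", "sku-1", 2)
commands.add_item("cart-1", "sku-2", 1)
print(read_model.summary("cart-1"))
```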
• The fundamental idea of Event Sourcing is that
of ensuring every change to the state of an
application is captured in an event object
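A minimal sketch of the idea, assuming a hypothetical bank account with two event types. The current state is never stored directly; it is derived by replaying the event log, which also makes it possible to reconstruct any earlier state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrew" (hypothetical event types)
    amount: int


def apply_event(balance, event):
    """Return the new state after applying one event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, initial=0):
    """Rebuild the state by applying every event in order."""
    balance = initial
    for event in events:
        balance = apply_event(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrew", 30), Event("deposited", 5)]
current = replay(log)
after_two = replay(log[:2])  # state as of the second event
print(current, after_two)
```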
Apache Cassandra
Data Model and CQL
Relational Model
• We have tables with columns, a primary key (PK), and
more and more foreign keys (FK)
• A row can have some static columns (shared by all rows of the same partition)
• Not surprising when you recall tables and relations, columns and attributes, rows
and tuples in relational databases
• A TIMEUUID is a version 1 UUID, generated using time (60 bits), a clock
sequence number (14 bits), and a MAC address (48 bits)
• 1be43390-9fe4-11e3-8d05-425861b86ab6
• CQL function now() generates a new TIMEUUID
• Time can be extracted from TIMEUUID
• CQL function dateOf() extracts the embedded timestamp as a date
• TIMEUUID values in clustering columns or in column names are
ordered based on time
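The same structure can be explored outside Cassandra with Python's standard uuid module. This is a sketch of the version 1 UUID layout, not of the CQL functions themselves: uuid.uuid1() stands in for now(), and the helper name timeuuid_to_datetime is hypothetical.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals since 1582-10-15.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(value):
    """Extract the embedded timestamp from a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)


fresh = uuid.uuid1()  # comparable to what CQL now() generates
sample = uuid.UUID("1be43390-9fe4-11e3-8d05-425861b86ab6")  # value from the slide
print(timeuuid_to_datetime(fresh))
print(timeuuid_to_datetime(sample))
```

Time ordering follows the embedded timestamp, not the textual form: the first group of hex digits holds the low bits of the time, so sorting the strings does not sort by time.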
• The query must only include the aggregate function itself, but no
columns.
• The state function is called once for each row, and the value returned
by the state function becomes the new state.
• After all rows are processed, the optional final function is executed with
the last state value as its argument. Aggregation is performed by the
coordinator
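The state and final function mechanics can be sketched in plain Python. This mirrors the semantics described above and is not Cassandra code: the function names and the average example are assumptions.

```python
def avg_state(state, value):
    """State function: called once per row; the result is the new state."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turns the last state into the aggregate result."""
    count, total = state
    return total / count if count else None


def aggregate(values, state_func, init_state, final_func=None):
    """Fold the state function over all rows, then apply the final function."""
    state = init_state
    for value in values:
        state = state_func(state, value)
    return final_func(state) if final_func else state


salaries = [40000, 50000, 45000]
average = aggregate(salaries, avg_state, (0, 0), avg_final)
raw_state = aggregate(salaries, avg_state, (0, 0))  # no final function
print(average, raw_state)
```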
Secondary indexes
Do not use:
• On high-cardinality columns
• On counter column tables
• On frequently updated or deleted columns
• To look for a row in a large partition unless narrowly queried
• e.g., search on both a partition key and an indexed column
Secondary Index - SASI
• SASI is significantly less resource intensive, using less memory, disk,
and CPU. It enables prefix and contains queries on
strings, similar to the SQL LIKE 'foo%' or LIKE
'%foo%'.
• The SASI index data structures are built in memory as the SSTable is
written and flushed to disk as sequential writes before the SSTable
writing completes. One index file is written for each indexed column.
now()
• generates a new unique timeuuid
uuid()
• generates a unique id
token() function