Webinar Day 1

Big Data

Intro
When did it start?
«Visualization provides an interesting
challenge for computer systems: data sets are generally quite large,
taxing the capacities of main memory, local disk, and even remote
disk. We call this the problem of big data.»

Michael Cox and David Ellsworth


@ NASA - July 1997

Big Data - Intro


New data sources: IoT
The term “Internet of Things” was used for the first time in 1999

«Today computers - and, therefore, the Internet - are almost wholly


dependent on human beings for information. [ … ]
If we had computers that knew everything there was to know about things
- using data they gathered without any help from us - we would be able to
track and count everything, and greatly reduce waste, loss and cost.»

Kevin Ashton, Co-Founder of the Auto-ID Center


@ MIT

Big Data - Intro


Data is growing

Big Data - Intro


What is Big Data
90% of the world's data has been generated in the last two years

Big Data - Intro


What are big data systems?
Applications involving the "three V's"

• Volume: Gigabytes growing to terabytes and beyond


• Velocity: sensor data, click streams, financial transactions
• Variety: data must be ingested from many different
formats

Characteristics required:

• Multi-region availability
• Very fast and reliable response
• No single point of failure
Big Data - Intro
Why not relational data?
The relational model provides:

• Normalized table schema


• Cross table joins
• ACID compliance

But at a very high cost:

• Big data table joins – billions of rows, or more – require


massive overhead
• Sharding tables across systems is complex and fragile

Big Data - Intro


Why not relational data?
Modern applications have different priorities

• The need for speed and availability trumps "always on" consistency
• Commodity server racks trump massive high-end systems
• The real-world need for transactional guarantees is limited

Big Data - Intro


Why Big Data?
Cost reduction

• Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can also identify more efficient ways of doing business

Big Data - Intro


Why Big Data?
Faster, better decision making

• With the speed of Hadoop and in-memory analytics,


combined with the ability to analyze new sources of data,
businesses are able to analyze information immediately
and make decisions based on what they've learned.

Big Data - Intro


Why Big Data?
New products and services

• With the ability to gauge customer needs and satisfaction through analytics comes the power to give consumers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers' needs.

Big Data - Intro


Data value

Big Data - Intro


Data chain

Data Generation → Data Ingestion → Data Aggregation → Data Elaboration → Data Visualization → Data Retail

Big Data - Intro


Big Data - Intro
NoSQL
Intro
A Brief History of Databases
• A database management system (DBMS) is
a computer software application that interacts with
the user, other applications, and the database itself
to capture and analyze data

• A database is not generally portable across different


DBMSs, but different DBMSs can interoperate by
using standards such as SQL and ODBC or JDBC to
allow a single application to work with more than
one DBMS

NoSQL - Intro
A Brief History of Databases
• Formally, a "database" refers to a set of related
data and the way it is organized

• Access to this data is usually provided by


a "database management system" (DBMS) consisting
of an integrated set of computer software that allows
users to interact with one or more databases and
provides access to all of the data contained in the
database

NoSQL - Intro
A Brief History of Databases
• Database management systems are
often classified according to the database model that
they support; the most popular database systems since
the 1980s have all supported the relational model

• The development of database technology can


be divided into three eras based on data
model or structure:
o Navigational
o SQL/relational
o Post-relational (NoSQL)
NoSQL - Intro
A Brief History of Databases
• 1960s: The two main early navigational data models were
the hierarchical model, epitomized by IBM's IMS system, and
the CODASYL model implemented in a number of products such
as IDMS (CA)

NoSQL - Intro
A Brief History of Databases
• In 1970, Edgar Codd (IBM) wrote a number of papers
that outlined a new approach to database
construction that eventually culminated in the
groundbreaking “A Relational Model of Data for Large
Shared Data Banks”

• He was unhappy with the navigational model of the


CODASYL approach, notably the lack of a "search"
facility

NoSQL - Intro
A Brief History of Databases
• Codd's idea was to use a "table" of fixed-length records, with each
table used for a different type of entity

• A linked-list system would be very inefficient when storing


"sparse" databases where some of the data for any one record
could be left empty

• The relational model solved this by splitting the data into a series
of normalized tables with optional elements being moved out of
the main table

NoSQL - Intro
A Brief History of Databases
• IBM started working on a prototype system loosely
based on Codd's concepts as System R in the early
1970s. The first version was ready in 1975

• Codd's ideas were establishing themselves


as both workable and superior to
CODASYL, pushing IBM to develop a true
production version of System R, known as SQL/DS,
and, later, Database 2 (DB2)

NoSQL - Intro
A Brief History of Databases
• Larry Ellison's Oracle started from a
different chain, based on IBM's papers on
System R, and beat IBM to market when
the first version was released in 1978

• The entity–relationship model emerged


in 1976 and gained popularity for database
design as it emphasized a more familiar
description than the earlier relational model

NoSQL - Intro
A Brief History of Databases
• Codd, after his extensive research on the
Relational Model of database systems, came
up with twelve rules of his own, which
according to him, a database must obey in
order to be regarded as a
true relational database

NoSQL - Intro
A Brief History of Databases
1 Information Rule

The data stored in a database, be it user data or metadata, must be a value of some table cell. Everything in a database must be stored in a table format.

NoSQL - Intro
A Brief History of Databases
2 Guaranteed Access Rule

Every single data element (value) is guaranteed


to be accessible logically with a combination of
table-name, primary-key (row value), and
attribute-name (column value). No other means,
such as pointers, can be used to access data.

NoSQL - Intro
A Brief History of Databases
3 Systematic Treatment of NULL Values

The NULL values in a database must be given a


systematic and uniform treatment. This is a very important rule because a NULL can be interpreted as one of the following: data is missing, data is not known, or data is not applicable.

NoSQL - Intro
A Brief History of Databases
4 Active Online Catalog

The structure description of the entire


database must be stored in an online catalog, known as the data dictionary, which can be
accessed by authorized users. Users can use the
same query language to access the catalog
which they use to access the database itself.

NoSQL - Intro
A Brief History of Databases
5 Comprehensive Data Sub-Language Rule

A database can only be accessed using a language


having linear syntax that supports data definition, data
manipulation, and transaction management operations.
This language can be used directly or by means of some
application. If the database allows access to data without
any help of this language, then it is considered a violation.

NoSQL - Intro
A Brief History of Databases
6 View Updating Rule

All the views of a database, which


can theoretically be updated, must also be
updatable by the system.

NoSQL - Intro
A Brief History of Databases
7 High-Level Insert, Update, and Delete Rule

A database must support high-level


insert, update, and delete operations. This must not
be limited to a single row, that is, it must also
support union, intersection and minus operations
to yield sets of data records.

NoSQL - Intro
A Brief History of Databases
8 Physical Data Independence

The data stored in a database must


be independent of the applications that access
the database. Any change in the physical
structure of a database must not have any
impact on how the data is being accessed by
external applications.

NoSQL - Intro
A Brief History of Databases
10 Integrity Independence

A database must be independent of


the application that uses it. All its integrity
constraints can be independently modified
without the need of any change in the
application. This rule makes a database
independent of the front-end application and its
interface.

NoSQL - Intro
A Brief History of Databases
11 Distribution Independence

The end-user must not be able to see that


the data is distributed over various locations.
Users should always get the impression that the
data is located at one site only. This rule has
been regarded as the foundation of
distributed database systems.

NoSQL - Intro
A Brief History of Databases
12 Non-Subversion Rule

If a system has an interface that provides


access to low-level records, then the interface
must not be able to subvert the system and
bypass security and integrity constraints.

NoSQL - Intro
RDBMS Theory and Architecture
• Database normalization, or simply normalization, is the
process of organizing the columns (attributes) and
tables (relations) of a relational database
to reduce data redundancy and improve data integrity

• Normalization involves arranging attributes in tables


based on dependencies between attributes, ensuring
that the dependencies are properly enforced by
database integrity constraints

NoSQL - Intro
RDBMS Theory and Architecture
• Normalization is accomplished through applying some
formal rules either by a process
of synthesis or decomposition

• Codd introduced the concept of normalization and what


is now known as the First normal form (1NF) in 1970.
Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and together with Raymond F. Boyce defined the Boyce–Codd Normal Form (BCNF) in 1974

NoSQL - Intro
RDBMS Theory and Architecture
• 3NF tables are free of insertion, update,
and deletion anomalies

• The biggest problem with normalization is that you end


up with multiple tables representing what is
conceptually a single item

• Joining tables is slow and


it limits horizontal scaling (partitioning)

NoSQL - Intro
RDBMS Theory and Architecture

NoSQL - Intro
RDBMS Theory and Architecture
• Structured Query Language (SQL) is a domain-
specific language used in programming and designed for
managing data held in a relational database management system

• Despite not entirely adhering to the relational model as described


by Codd, it is the most widely used database language

• SQL became a standard of the American National Standards


Institute (ANSI) in 1986, and of the International Organization for
Standardization (ISO) in 1987

NoSQL - Intro
RDBMS Theory and Architecture
• SQL consists of a data definition language (DDL), data
manipulation language (DML), and data control language (DCL)

• DDL Example
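The DDL example on this slide is an image in the original deck; a minimal SQL sketch of what a typical DDL statement looks like (table and column names are illustrative, not taken from the slides):

CREATE TABLE employees (
    emp_id    INT PRIMARY KEY,
    last_name VARCHAR(50) NOT NULL,
    salary    DECIMAL(10,2),
    dept_id   INT
);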

NoSQL - Intro
RDBMS Theory and Architecture
• DML Example

• DCL Example
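These examples are also images in the original deck; minimal SQL sketches (the table and role names are illustrative):

-- DML: manipulating rows
INSERT INTO employees (emp_id, last_name, salary, dept_id)
VALUES (10, 'Smith', 42000.00, 3);

UPDATE employees SET salary = 45000.00 WHERE emp_id = 10;

SELECT last_name, salary FROM employees WHERE dept_id = 3;

DELETE FROM employees WHERE emp_id = 10;

-- DCL: granting and revoking privileges
GRANT SELECT, INSERT ON employees TO analyst_role;
REVOKE INSERT ON employees FROM analyst_role;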

NoSQL - Intro
ACID capabilities
• Atomicity requires that each transaction be "all
or nothing": if one part of the transaction fails, the
entire transaction fails, and the database state is
left unchanged
• The Consistency property ensures that
any transaction will bring the database from
one valid state to another
o Any data written to the database must be
valid according to all defined rules,
including constraints, cascades, triggers, and any
combination thereof
NoSQL - Intro
ACID capabilities
• The Isolation property ensures that the concurrent
execution of transactions results in a system state that
would be obtained if transactions were
executed serially, i.e., one after the other

• The Durability property ensures that once a transaction


has been committed, it will remain so, even in the event
of power loss, crashes, or errors

NoSQL - Intro
ACID capabilities
• Write Ahead Logging (WAL) is a family of techniques for
providing atomicity and durability
o In a system using WAL, all modifications are written
to a log before they are applied

• Imagine a program that is in the middle of


performing some operation when the machine it is
running on loses power. Upon restart, that program
might well need to know whether the operation it
was performing succeeded, half-succeeded, or failed

NoSQL - Intro
ACID capabilities
• If a Write Ahead Log is used, the program can check this
log and compare what it was supposed to be doing
when it unexpectedly lost power to what was actually
done

• File systems typically use a variant of WAL for at least


file system metadata called journaling

NoSQL - Intro
ACID capabilities
• Another way to implement atomic updates is
with shadow paging, which is not in-place

• Shadow paging is a copy-on-write technique


for avoiding in-place updates of pages. Instead, when a
page is to be modified, a shadow page is allocated

• Since the shadow page has no references, it can be modified liberally, without concern for consistency constraints, etc.

NoSQL - Intro
ACID capabilities
• When the page is ready to become durable, all pages
that referred to the original are updated to refer to the
new replacement page instead.

• Because the page is "activated" only when it is ready, the update is atomic

NoSQL - Intro
ACID capabilities
• Two-Phase Commit protocol (2PC) is a type
of atomic commitment protocol (ACP). It is
a distributed algorithm that coordinates all the processes
that participate in a distributed atomic transaction

• To accommodate recovery from failure the protocol's participants


use logging of the protocol's states

• Log records, which are typically slow to generate but survive


failures, are used by the protocol's recovery procedures

NoSQL - Intro
Two-Phase Commit (2PC)

NoSQL - Intro
ACID capabilities
• Three-Phase Commit protocol (3PC) is another
distributed algorithm which lets all nodes in a
distributed system agree to commit a transaction

• Unlike the two-phase commit however, 3PC is non-


blocking. Specifically, 3PC places an upper bound on
the amount of time required before a transaction
either commits or aborts

NoSQL - Intro
Three-Phase Commit (3PC)

NoSQL - Intro
Shared Nothing Architecture
• A shared nothing architecture (SN) is a
distributed computing architecture in which
each node is independent and self-sufficient, and there
is no single point of contention across the system

• More specifically, none of the nodes share memory


or disk storage

• A shared nothing architecture is popular in cloud environments because of its scalability. A pure SN system can scale almost infinitely simply by adding nodes in the form of inexpensive computers
NoSQL - Intro
Brewer's theorem (CAP)
• The CAP theorem, also known as
Brewer's theorem, states that it is impossible for
a distributed computer system to
simultaneously provide all three of the following
guarantees

NoSQL - Intro
Brewer's theorem (CAP)
• Consistency: All database clients will read the same
value for the same query, even given concurrent
updates

• Availability: All database clients will always be able


to read and write data

• Partition tolerance: The database can be split into


multiple machines; it can continue functioning in the face
of network segmentation breaks

NoSQL - Intro
Brewer's theorem (CAP)

NoSQL - Intro
Brewer's theorem (CAP)
• In February 2012, Eric Brewer provided
an updated perspective on his CAP theorem in the
article “CAP Twelve Years Later: How the ‘Rules’ Have
Changed”

https://www.researchgate.net/publication/220476881_CAP_Twelve_years_later_How_the_Rules_have_Changed

NoSQL - Intro
Database Sharding
• A database shard is a horizontal partition of data in a
database

• Each shard is commonly held on a separate database


server instance (node), to spread load

• Since the tables are divided and distributed across multiple


servers, the total number of rows in each table in each
database is reduced. This reduces index size too, which
generally improves search performance

NoSQL - Intro
Database Sharding

NoSQL - Intro
Database Sharding

NoSQL - Intro
Security Approaches

NoSQL - Intro
NoSQL Approach
• NoSQL database, also called Not Only SQL, is an
approach to data management and database
design that's useful for very large
sets of distributed data

• A NoSQL database provides a mechanism


for storage and retrieval of data that is modeled
in means other than the tabular relations used
in relational databases
NoSQL - Intro
NoSQL Approach
• NoSQL, which encompasses a wide range
of technologies and architectures, seeks to
solve the scalability and big data performance
issues that relational databases weren’t designed
to address

• There have been various approaches to


classify NoSQL databases, each with
different categories and subcategories, some of which
overlap

NoSQL - Intro
NoSQL Approach
• A basic classification based on data model:

❑ Key-value

❑ Document-oriented

❑ Columnar

❑ Graph-oriented

❑ Other (mixed)

NoSQL - Intro
NoSQL Approach
• A key-value store, or key-value database, is a
data storage paradigm designed for storing, retrieving,
and managing associative arrays (or dictionary or hash)

• These records are stored and retrieved using


a key that uniquely identifies the record, and is used
to quickly find the data within the database

• Key-value systems treat the data as a


single opaque collection which may have different fields
for every record
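As a concrete illustration, a key-value style table sketched in CQL (the query language used later in this deck); the table and key names are illustrative:

CREATE TABLE kv_store (
    key   text PRIMARY KEY,   -- unique key used to locate the record
    value blob                -- the value is opaque to the database
);

INSERT INTO kv_store (key, value)
VALUES ('user:42', textAsBlob('{"name":"Ada"}'));

SELECT value FROM kv_store WHERE key = 'user:42';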
NoSQL - Intro
Key-Value Store

NoSQL - Intro
Key-Value Store (JSON)

NoSQL - Intro
NoSQL Approach

Pros: simple data model, ease of use, horizontal scaling
Cons: no complex queries

NoSQL - Intro
Key-Value Store

NoSQL - Intro
Key-Value Store
Sample Retailers Web Application

NoSQL - Intro
Key-Value Store
Sample Retailers Web Application

NoSQL - Intro
Key-Value Store
Sample Retailers Web Application

NoSQL - Intro
NoSQL Approach
• Document-oriented databases are inherently
a subclass of the key-value store

• The difference lies in the way the data is processed; in a


key-value store the data is considered to be inherently
opaque to the database, whereas a document-oriented
system relies on internal structure in the document in order
to extract metadata that the database engine uses for
further optimization

NoSQL - Intro
Document-oriented Store

NoSQL - Intro
Document-oriented Store
• In a RDBMS, data is first categorized into a
number of predefined types, and tables are created to
hold individual records of each type

• In contrast, in a document-oriented database there may


be no internal structure that maps directly onto the
concept of a table

• A key difference between the document-oriented and


relational models is that the data formats are not
predefined in the document case
NoSQL - Intro
Document-oriented Store

Pros: schema free, ease of use, horizontal scaling
Cons: slow complex queries, not ACID

NoSQL - Intro
Document-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Document-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Column-oriented Store
• A column-oriented DBMS is a database
management system that stores data tables as sections
of columns of data rather than as rows of data

• It has advantages for data warehouses and other ad hoc


inquiry systems where aggregates are computed over
large numbers of similar data items

• Column-oriented storage is closely related to


database normalization due to the way it restricts the
database schema design
NoSQL - Intro
Column-oriented Store
A relational database management system provides data
that represents a two-dimensional table, of columns and
rows.

For example, a database might have this table:

NoSQL - Intro
Column-oriented Store
• This simple table includes an employee identifier (EmpId), and some
fields

• This two-dimensional format exists only in theory. In practice,


storage hardware requires the data to be serialized into one form or
another.

• The common solution to the storage problem is to serialize each


row of data, like this:

NoSQL - Intro
Column-oriented Store
• Row-based systems are designed to efficiently return data
for an entire row, or record, in as few operations as possible. This
matches the common use-case where the system is attempting to
retrieve information about a particular object

• Row-based systems are NOT efficient at performing operations that


apply to the entire data set, as opposed to a specific record.

• For instance, in order to find all the records in the example table that
have salaries between 40,000 and 50,000, the DBMS would have
to seek through the entire data set looking for matching records

NoSQL - Intro
Column-oriented Store
• To improve the performance of these sorts of operations, most
DBMSs support the use of database indexes

• An index on the salary column would look something like this:

NoSQL - Intro
Column-oriented Store
• A column-oriented database serializes all of the values of a column
together, then the values of the next column, and so on.

• For our example table, the data would be stored in this fashion:

• In this layout, any one of the columns more closely matches


the structure of an index in a row-based system.

NoSQL - Intro
Column-oriented Store
• Bigtable is a compressed, high performance, and proprietary data
storage system built on Google File System (GFS)

• Bigtable development began in 2004 and is now used by a number


of Google applications, such as web indexing, MapReduce, which is
often used for generating and modifying data stored in
Bigtable, Google Maps, Google Book Search, "My Search History",
Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail

NoSQL - Intro
Column-oriented Store
• Bigtable maps two arbitrary string values (row key and column key)
and timestamp (hence three-dimensional mapping) into an
associated arbitrary byte array.

• It is not a relational database and can be better defined as a sparse,


distributed multi-dimensional sorted map

• HBase is an open source, non-relational, distributed database


modeled after Google's BigTable and written in Java. It is developed
as part of Apache Software Foundation's Apache Hadoop project
and runs on top of HDFS

NoSQL - Intro
Bigtable Data Model
• Table is a collection of rows
• Row is a collection of column families
• Column family is a collection of columns
• Column is a collection of key-value pairs

NoSQL - Intro
Column-oriented Store

Pros: OLAP performance, horizontal scaling
Cons: OLTP performance, handling single records

NoSQL - Intro
Column-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Column-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Graph-oriented Store
• A graph database is a database that uses graph structures for
semantic queries with nodes, edges and properties to represent and
store data

• Compared with relational databases, graph databases are


often faster for associative data and map more directly to the
structure of object-oriented applications

• They can scale more naturally to large data sets as they do


not typically require expensive join operations

NoSQL - Intro
Graph-oriented Store

NoSQL - Intro
Graph-oriented Store

Pros: handling relations, optimal for some use cases (maps, social, …)
Cons: complex queries, not ACID

NoSQL - Intro
Graph-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Graph-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Graph-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Best Practices

NoSQL - Intro
Best Practices
• NoSQL data modeling often starts from the application-
specific queries as opposed to relational modeling

• NoSQL data modeling often


requires a
deeper understanding
of data structures and algorithms than relational
database modeling does

• Data duplication and denormalization are first-class


citizens

NoSQL - Intro
Design Patterns
• Data Access Object Pattern (or DAO pattern) is
used to separate low level data accessing API
or operations from high level business services

• Data access object (DAO) is an object


that provides an abstract interface to some
type of database or other
persistence mechanism

NoSQL - Intro
Design Patterns

NoSQL - Intro
Design Patterns
• CQRS is a simple pattern that
strictly segregates the responsibility of
handling command input into an autonomous
system from the responsibility of handling side-
effect-free query / read access on the
same system

NoSQL - Intro
Design Patterns

NoSQL - Intro
Design Patterns
• The fundamental idea of Event Sourcing is that
of ensuring every change to the state of an
application is captured in an event object

• These event objects are themselves stored in


the sequence they were applied for the same
lifetime as the application state itself

NoSQL - Intro
Design Patterns

NoSQL - Intro
Apache Cassandra
Data Model and CQL
Relational Model
• We have tables with columns, a primary key, and more and more foreign keys

• The relational schema helps us identify the actors (Cart, Article, User, …)

• There are join tables for M-N relationships

• We use a model-first approach; queries are always the last task

Apache Cassandra – Data Model and CQL


Data Model
• Column family as a way to store and organize data

• Table as a two-dimensional view of a multi-dimensional


column family

• Operations on tables using the Cassandra Query


Language (CQL)

Apache Cassandra – Data Model and CQL


Data Model
• Data modeling in Cassandra uses a query-driven approach

• The fewer partitions that must be queried to get an


answer to a question, the faster the response

• The maximum number of columns per row is two billion

• Rule of thumb is to keep the maximum number of values


below 100,000 items and the disk size under 100 MB per
partition

Apache Cassandra – Data Model and CQL


Data Model

Apache Cassandra – Data Model and CQL


Data Model
• Sometimes the row key is also called the partition key

• A row can have some static columns (shared by all rows in the same partition)

• A row can have some clustering keys

Apache Cassandra – Data Model and CQL


Data Model

Apache Cassandra – Data Model and CQL


Data Model

Apache Cassandra – Data Model and CQL


Data Model
Row is the smallest unit that stores related data in Apache
Cassandra

• Rows – individual rows constitute a column family


• Row key – uniquely identifies a row in a column family
• Row – stores pairs of column keys and column values
• Column key – uniquely identifies a column value in a row
• Column value – stores one value or a collection of values

Apache Cassandra – Data Model and CQL


Column Family
• Set of rows with a similar structure
• Distributed
• Sparse
• Sorted columns
• Multidimensional

Apache Cassandra – Data Model and CQL


Column Family
• Size of a column family is limited only by the size of the cluster
• Data from one row must fit on one node
• Maximum columns per row is 2 billion
• Maximum data size per cell (column value) is 2 GB

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL
A CQL table is a column family

• It provides a two-dimensional view of a column family which contains potentially


multi-dimensional data, due to composite keys and collections

CQL table and column family are largely interchangeable terms

• Not surprising when you recall tables and relations, columns and attributes, rows and tuples in relational databases

Supported by declarative language CQL

• DDL is a subset of CQL


• SQL-like syntax, but with somewhat different semantics
• Convenient for defining and expressing Cassandra database schemas
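The CQL examples on the following slides are images in the original deck; a minimal CQL sketch of a keyspace and table definition (the keyspace, table, and column names are illustrative and reused in the sketches below):

CREATE KEYSPACE IF NOT EXISTS retail
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE retail.users_by_id (
  user_id uuid,
  name    text,
  email   text,
  PRIMARY KEY (user_id)
);

USE retail;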

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL

Apache Cassandra – Data Model and CQL


Simple Primary Key
• Just one column in the primary key
• It is very fast for writes and point reads
• Keep in mind that only the primary key can be specified when retrieving data from the table
• In the WHERE clause you can use only the equality and IN operators (<= and >= are available only with the ByteOrderedPartitioner)
• It uses a TALL NARROW pattern
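A minimal sketch of point reads on a simple primary key, using the users_by_id table sketched earlier (the UUID literals are illustrative):

SELECT name, email
  FROM users_by_id
 WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- IN is also allowed on the partition key
SELECT name
  FROM users_by_id
 WHERE user_id IN (62c36092-82a1-3a00-93d1-46196ee77204,
                   7d444840-9dc0-11d1-b245-5ffdce74fad2);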

Apache Cassandra – Data Model and CQL


Composite Partition Key
• There are multiple columns in the partition key
• Can reduce hotspots in time series
• Keep in mind that to retrieve data from the table, values for all columns defined in the partition key have to be supplied
• It uses a TALL NARROW pattern
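A minimal sketch of a composite partition key that buckets a time series by day (table and column names are illustrative):

CREATE TABLE readings_by_sensor_day (
  sensor_id  uuid,
  day        date,
  reading_ts timestamp,
  value      double,
  PRIMARY KEY ((sensor_id, day), reading_ts)  -- (sensor_id, day) is the composite partition key
);

-- values for all partition key columns must be supplied
SELECT reading_ts, value
  FROM readings_by_sensor_day
 WHERE sensor_id = 62c36092-82a1-3a00-93d1-46196ee77204
   AND day = '2017-03-01';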

Apache Cassandra – Data Model and CQL


Compound Primary Key
• Clustering columns
• Clustering is a storage engine process that sorts data within each partition based on the definition of the clustering columns
• Use clustering instead of JOINs in the Cassandra data model
• Pay attention to full scans
• Under the hood, it leverages super-columns
• It uses a FLAT WIDE pattern
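A minimal sketch of a compound primary key with clustering columns (names are illustrative; this table is reused in later sketches):

CREATE TABLE albums_by_artist (
  artist text,
  year   int,
  album  text,
  label  text,
  PRIMARY KEY (artist, year, album)  -- artist = partition key, (year, album) = clustering columns
) WITH CLUSTERING ORDER BY (year DESC, album ASC);

-- rows within the partition come back sorted by year, then album
SELECT year, album FROM albums_by_artist WHERE artist = 'The Beatles';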

Apache Cassandra – Data Model and CQL


Materialized View
• A materialized view is a table that is built from another table's
data with a new primary key and new properties

• Usage is suggested on high-cardinality data. It causes hotspots when low-cardinality data is inserted

• Secondary Indexes are suited for low cardinality data. Queries of


high cardinality columns on secondary indexes require Apache
Cassandra to access all nodes in a cluster, causing high read
latency.

Apache Cassandra – Data Model and CQL


Materialized View
• All columns of the source table's primary key must be part of the materialized view's primary key

• Only one new column can be added to the materialized view's primary key

• Static columns are not allowed
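A minimal sketch of a materialized view over the users_by_id table sketched earlier (names are illustrative):

CREATE MATERIALIZED VIEW users_by_email AS
  SELECT user_id, name, email
    FROM users_by_id
   WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);

SELECT user_id, name FROM users_by_email WHERE email = 'ada@example.com';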

Apache Cassandra – Data Model and CQL


Materialized View

Apache Cassandra – Data Model and CQL


Materialized View
• Materialized views do not have the same write performance characteristics that normal table writes have

• The materialized view requires an additional read-before-write, as well as data consistency checks on each replica before creating the view updates. These additions add overhead and may increase the latency of writes

• If the rows are to be combined before being placed in the view, materialized views will not work. Materialized views create a CQL row in the view for each CQL row in the base table

Apache Cassandra – Data Model and CQL


DDL – Primary Key Summary
Simple partition key, no clustering columns
PRIMARY KEY ( partition_key_column )

Composite partition key, no clustering columns


PRIMARY KEY ( ( partition_key_col1, …, partition_key_colN ) )

Simple partition key and clustering columns


PRIMARY KEY ( partition_key_column, clustering_column1,
…, clustering_columnM )

Composite partition key and clustering columns


PRIMARY KEY ( ( partition_key_col1, …, partition_key_colN),
clustering_column1, …, clustering_columnM)

Apache Cassandra – Data Model and CQL


DDL – Types

Apache Cassandra – Data Model and CQL


UUID & TIMEUUID
TimeUUID embeds a time value within a UUID

• Generated using time (60 bits), a clock sequence number (14 bits), and
MAC address (48 bits)

• 1be43390-9fe4-11e3-8d05-425861b86ab6
• CQL function now() generates a new TIMEUUID
• Time can be extracted from TIMEUUID
• CQL function dateOf() extracts the embedded timestamp as a date
• TIMEUUID values in clustering columns or in column names are
ordered based on time
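A minimal sketch using TIMEUUID clustering for a time-ordered event log (names are illustrative):

CREATE TABLE events_by_stream (
  stream_id uuid,
  event_id  timeuuid,
  payload   text,
  PRIMARY KEY (stream_id, event_id)
);

INSERT INTO events_by_stream (stream_id, event_id, payload)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, now(), 'order created');

SELECT dateOf(event_id) AS occurred_at, payload
  FROM events_by_stream
 WHERE stream_id = 62c36092-82a1-3a00-93d1-46196ee77204;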

Apache Cassandra – Data Model and CQL


Distributed Counters
Apache Cassandra supports distributed counters

• Useful for tracking a count


• Counter column stores a number that can only be updated
• Incremented or decremented
• Cannot assign an initial value to a counter (initial value is 0)
• Counter column cannot be part of a primary key
• If a table has a counter column, all non-counter columns must be part of the primary key
• Read is as efficient as for non-counter columns
• Update slightly slower: a read is required before a write can be
performed
• Counter update is not an idempotent operation
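A minimal counter table sketch (names are illustrative):

CREATE TABLE page_views (
  page  text PRIMARY KEY,
  views counter
);

-- counters can only be incremented or decremented, never set directly
UPDATE page_views SET views = views + 1 WHERE page = '/home';

SELECT views FROM page_views WHERE page = '/home';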

Apache Cassandra – Data Model and CQL


Clustering Order By
CLUSTERING ORDER BY defines how data values in clustering columns
are ordered (ASC or DESC) in a table

• ASC is the default order for all clustering columns


• When retrieving data, the default order or the order specified by a
CLUSTERING ORDER BY clause is used

• The order can be reversed in a query using the ORDER BY clause

Apache Cassandra – Data Model and CQL


Collection columns
Set – typed collection of unique values
keywords SET<VARCHAR>
• Ordered by values
• No duplicates
List – typed collection of non-unique values
Songwriters LIST<VARCHAR>
• Ordered by position
• Duplicates are allowed
Map – typed collection of key-value pairs
Tracks MAP<INT,VARCHAR>
• Ordered by keys
• Unique keys but not values
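A minimal sketch of the three collection types, reusing the column names from this slide (table name and values are illustrative):

CREATE TABLE albums (
  title       text PRIMARY KEY,
  keywords    set<varchar>,
  songwriters list<varchar>,
  tracks      map<int, varchar>
);

UPDATE albums SET keywords    = keywords + {'rock'}       WHERE title = 'Abbey Road';
UPDATE albums SET songwriters = songwriters + ['Lennon']  WHERE title = 'Abbey Road';
UPDATE albums SET tracks[1]   = 'Come Together'           WHERE title = 'Abbey Road';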

Apache Cassandra – Data Model and CQL


User Defined Types
User-defined types group related fields of information

Represents related data in a single table, instead of multiple, separate


tables
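A minimal sketch of a user-defined type (names are illustrative):

CREATE TYPE address (
  street text,
  city   text,
  zip    text
);

CREATE TABLE users_with_address (
  user_id uuid PRIMARY KEY,
  name    text,
  home    frozen<address>
);

INSERT INTO users_with_address (user_id, name, home)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada',
        { street: '10 Main St', city: 'Turin', zip: '10100' });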

Apache Cassandra – Data Model and CQL


User Defined Functions
To calculate values, you can declare Java or Javascript functions

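A minimal sketch of a user-defined function in Java (assumes UDFs are enabled in cassandra.yaml; the function itself is illustrative):

CREATE OR REPLACE FUNCTION fahrenheit_to_celsius (temp double)
  RETURNS NULL ON NULL INPUT
  RETURNS double
  LANGUAGE java
  AS 'return (temp - 32.0) * 5.0 / 9.0;';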

Apache Cassandra – Data Model and CQL


User Defined Aggregate Functions
• Supported languages are: Java, Javascript, Python, Scala
• To enable them you must set the enable_user_defined_functions attribute in cassandra.yaml

• The query must only include the aggregate function itself, but no
columns.

• The state function is called once for each row, and the value returned
by the state function becomes the new state.

• After all rows are processed, the optional final function is executed with
the last state value as its argument. Aggregation is performed by the
coordinator
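A minimal sketch of a user-defined aggregate that counts non-null values (illustrative; assumes UDFs/UDAs are enabled):

-- state function: called once per row, returns the new state
CREATE OR REPLACE FUNCTION count_state (state bigint, val text)
  CALLED ON NULL INPUT
  RETURNS bigint
  LANGUAGE java
  AS 'return val == null ? state : state + 1;';

-- aggregate: initial state 0; no final function, so the last state is the result
CREATE OR REPLACE AGGREGATE count_non_null (text)
  SFUNC count_state
  STYPE bigint
  INITCOND 0;

-- the query must include only the aggregate, no other columns
SELECT count_non_null(email) FROM users_by_id;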

Apache Cassandra – Data Model and CQL


User Defined Aggregate Functions

Apache Cassandra – Data Model and CQL


Tuple
• Tuples hold fixed-length sets of typed positional fields
• Convenient alternative to creating a user-defined type
• Accommodates up to 32768 fields, but generally only a few are used
• Useful when prototyping
• Must use the frozen keyword in C* 2.1
• Tuples can be nested in other tuples

Apache Cassandra – Data Model and CQL


Secondary Index
Tables are indexed on columns in a primary key

• Search on a partition key is very efficient


• Search on a partition key and clustering
columns is very efficient
• Search on other columns is not supported

Secondary indexes

• Can index additional columns to enable


searching by those columns
• One column per index

Apache Cassandra – Data Model and CQL


Secondary Index
Secondary indexes are for searching convenience
• Use with low-cardinality columns
• Columns that may contain a relatively small set of distinct values
• For example, there are many artists but only a few dozen music styles
• Allows searching for all artists for a specified style (a potentially expensive
query because it may return a large result set)

Use with smaller datasets or when prototyping

Do not use:
• On high-cardinality columns
• On counter column tables
• On frequently updated or deleted columns
• To look for a row in a large partition unless narrowly queried
• e.g., search on both a partition key and an indexed column
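A minimal sketch of the artists/music-styles example above (table and index names are illustrative):

CREATE TABLE artists (
  name  text PRIMARY KEY,
  style text
);

CREATE INDEX artists_by_style ON artists (style);

-- convenient, but potentially expensive: it may return a large result set
SELECT name FROM artists WHERE style = 'Reggae';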
Apache Cassandra – Data Model and CQL
Secondary Index - SASI
• SASI is significantly less resource intensive, using less memory, disk, and CPU. It enables querying with prefix and contains on strings, similar to SQL's LIKE 'foo%' or LIKE '%foo%'

• The SASI index data structures are built in memory as the SSTable is
written and flushed to disk as sequential writes before the SSTable
writing completes. One index file is written for each indexed column.

• SASI supports all queries already supported by CQL, and supports


the LIKE operator using PREFIX, CONTAINS, and SPARSE
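A minimal SASI sketch (table, index, and values are illustrative):

CREATE TABLE tracks_by_id (
  track_id uuid PRIMARY KEY,
  title    text
);

CREATE CUSTOM INDEX tracks_title_sasi ON tracks_by_id (title)
  USING 'org.apache.cassandra.index.sasi.SASIIndex'
  WITH OPTIONS = { 'mode': 'CONTAINS' };

SELECT * FROM tracks_by_id WHERE title LIKE '%love%';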

Apache Cassandra – Data Model and CQL


Secondary Index - SASI

Apache Cassandra – Data Model and CQL


DML - Select

Apache Cassandra – Data Model and CQL


DML - Where
Equality search – one partition
• To retrieve one partition, values for all partition key columns must be specified
Equality search – one row
• To retrieve one row, values for all primary key columns must be specified
Equality search – subset of rows
• To retrieve a subset of rows in a partition, values for all partition key columns and one
or more but not all clustering columns must be specified

Range search >, >=, <, <=


• Can only do a range search on a partition key using the token() function
• Can only search on clustering columns if a partition key filter has been specified
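Minimal sketches of these cases, using the albums_by_artist table sketched earlier (values are illustrative):

-- one partition: all partition key columns specified
SELECT * FROM albums_by_artist WHERE artist = 'The Beatles';

-- one row: all primary key columns specified
SELECT * FROM albums_by_artist
 WHERE artist = 'The Beatles' AND year = 1969 AND album = 'Abbey Road';

-- subset of rows in a partition: partition key plus a leading clustering column
SELECT * FROM albums_by_artist
 WHERE artist = 'The Beatles' AND year >= 1965 AND year <= 1969;

-- range search across partitions is only possible with token()
SELECT * FROM albums_by_artist WHERE token(artist) > token('The Beatles');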

Apache Cassandra – Data Model and CQL


DML – Allow Filtering
Allows scanning over all partitions

• Predicate does not specify values for partition key columns


• Relaxes the requirement that a partition key must be specified
• Potentially expensive queries that may return large results
• LIMIT clause is recommended
• Predicate can have equality or inequality relations on clustering
columns

Do NOT use in production
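A minimal sketch (reusing albums_by_artist; without ALLOW FILTERING this query would be rejected):

-- scans all partitions; LIMIT is recommended, avoid in production
SELECT * FROM albums_by_artist WHERE year = 1969 LIMIT 100 ALLOW FILTERING;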

Apache Cassandra – Data Model and CQL


Query UDT – User Defined Type

Apache Cassandra – Data Model and CQL


Order by
ORDER BY specifies how query results must be sorted

• Allowed only on clustering columns

• Default order is ASC or as defined by WITH CLUSTERING ORDER

• Default order can be reversed for all clustering columns at once

Apache Cassandra – Data Model and CQL


Functions
dateOf()
• extracts the timestamp as a date of a timeuuid column

now()
• generates a new unique timeuuid

uuid()
• generates a unique id

minTimeuuid() and maxTimeuuid()


• return a UUID-like result given a conditional time component as an
argument

Apache Cassandra – Data Model and CQL


Functions
unixTimestampOf()
• extracts the “raw” timestamp of a timeuuid column as a 64-bit
integer

typeAsBlob() and blobAsType()

token() function

Apache Cassandra – Data Model and CQL


Lightweight Transactions
Handles the read-before-write pattern at the coordinator level.

• Works well only on a single partition

• Transactions on different partitions are isolated

• It does not lock the record
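A minimal lightweight-transaction sketch (compare-and-set) on the users_by_id table sketched earlier:

-- the row is inserted only if it does not already exist
INSERT INTO users_by_id (user_id, name, email)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada', 'ada@example.com')
IF NOT EXISTS;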

Apache Cassandra – Data Model and CQL


Batch Statement
Batch statement combines multiple INSERT, UPDATE, and DELETE statements
into a single logical operation

• Saves on client-server and coordinator-replica communication


• Atomic operation
• If any statement in the batch succeeds, all will
• No batch isolation
• Other “transactions” can read and write data being affected by a
partially executed batch
• Lightweight transactions in batch
• Batch will execute only if conditions for all lightweight transactions are
met
• All operations in the batch will execute serially, with increased performance overhead

Apache Cassandra – Data Model and CQL


Batch Statement
• It uses a BatchLog system table to consume statements
• You can disable it using the keyword UNLOGGED
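A minimal batch sketch on the users_by_id table sketched earlier (values are illustrative):

BEGIN BATCH
  INSERT INTO users_by_id (user_id, name)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada');
  UPDATE users_by_id SET email = 'ada@example.com'
   WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
APPLY BATCH;

-- BEGIN UNLOGGED BATCH ... APPLY BATCH; skips the BatchLog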

Apache Cassandra – Data Model and CQL


Batch Conditional Update
• Leverages Lightweight Transactions
• Updates must occur in the same partition

Apache Cassandra – Data Model and CQL


Relational Model

Use Case – Hotel Reservation


Query First Approach
Query first is the right approach to designing a NoSQL solution

• Q1: Find Hotels near a given POI


• Q2: Find hotel information by name and location
• Q3: Find POIs near a given hotel
• Q4: Find an available room by day
• Q5: Find a rate and amenities for a room
• Q6: Lookup a reservation by confirmation number
• Q7. Lookup a reservation by hotel, date, and guest name
• Q8. Lookup all reservations by guest name
• Q9. View guest details
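As an example of the query-first approach, a table sketch that could serve Q1 (hypothetical; the actual schema is shown in the Chebotko and physical model diagrams that follow):

CREATE TABLE hotels_by_poi (
  poi_name text,
  hotel_id text,
  name     text,
  phone    text,
  address  text,
  PRIMARY KEY ((poi_name), hotel_id)
);

SELECT hotel_id, name, phone FROM hotels_by_poi WHERE poi_name = 'Colosseum';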

Use Case – Hotel Reservation


Application queries

Use Case – Hotel Reservation


Chebotko Diagram

Use Case – Hotel Reservation


Chebotko Diagram – Use Case

Use Case – Hotel Reservation


Chebotko Diagram – Use Case

Use Case – Hotel Reservation


Physical Model Diagram

Use Case – Hotel Reservation


Physical Model - Use Case

Use Case – Hotel Reservation


Physical Model - Use Case

Use Case – Hotel Reservation


Materialized View

Use Case – Hotel Reservation


Thanks for your attention!
