Webinar Day 1

Big Data

Intro
When did it start?
«Visualization provides an interesting
challenge for computer systems: data sets are generally quite large,
taxing the capacities of main memory, local disk, and even remote
disk. We call this the problem of big data.»

Michael Cox and David Ellsworth


@ NASA - July 1997

Big Data - Intro


New data sources: IoT
The term “Internet of Things” was used for the first time in 1999

«Today computers - and, therefore, the Internet - are almost wholly


dependent on human beings for information. [ … ]
If we had computers that knew everything there was to know about things
- using data they gathered without any help from us - we would be able to
track and count everything, and greatly reduce waste, loss and cost.»

Kevin Ashton, Co-Founder of the Auto-ID Center


@ MIT

Big Data - Intro


Data is growing

Big Data - Intro


What is Big Data
90% of the world's data has been generated in the last two years

Big Data - Intro


What are big data systems?
Applications involving the "three V's"

• Volume: Gigabytes growing to terabytes and beyond


• Velocity: sensor data, click streams, financial transactions
• Variety: data must be ingested from many different
formats

Characteristics required:

• Multi-region availability
• Very fast and reliable response
• No single point of failure
Big Data - Intro
Why not relational data?
The relational model provides:

• Normalized table schema


• Cross table joins
• ACID compliance

But at a very high cost:

• Big data table joins – billions of rows, or more – require


massive overhead
• Sharding tables across systems is complex and fragile

Big Data - Intro


Why not relational data?
Modern applications have different priorities

• The need for speed and availability trumps "always on" consistency
• Commodity server racks trump massive high-end systems
• The real-world need for transactional guarantees is limited

Big Data - Intro


Why Big Data?
Cost reduction

• Big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can also identify more efficient ways of doing business

Big Data - Intro


Why Big Data?
Faster, better decision making

• With the speed of Hadoop and in-memory analytics,


combined with the ability to analyze new sources of data,
businesses are able to analyze information immediately
and make decisions based on what they've learned.

Big Data - Intro


Why Big Data?
New products and services

• With the ability to gauge customer needs and satisfaction through analytics comes the power to give consumers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers' needs.

Big Data - Intro


Data value

Big Data - Intro


Data chain

Data Generation → Data Ingestion → Data Aggregation → Data Elaboration → Data Visualization → Data Retail

Big Data - Intro


Big Data - Intro
NoSQL
Intro
A Brief History of Databases
• A database management system (DBMS) is
a computer software application that interacts with
the user, other applications, and the database itself
to capture and analyze data

• A database is not generally portable across different


DBMSs, but different DBMSs can interoperate by
using standards such as SQL and ODBC or JDBC to
allow a single application to work with more than
one DBMS

NoSQL - Intro
A Brief History of Databases
• Formally, a "database" refers to a set of related
data and the way it is organized

• Access to this data is usually provided by


a "database management system" (DBMS) consisting
of an integrated set of computer software that allows
users to interact with one or more databases and
provides access to all of the data contained in the
database

NoSQL - Intro
A Brief History of Databases
• Database management systems are
often classified according to the database model that
they support; the most popular database systems since
the 1980s have all supported the relational model

• The development of database technology can


be divided into three eras based on data
model or structure:
o Navigational
o SQL/relational
o Post-relational (NoSQL)
NoSQL - Intro
A Brief History of Databases
• 1960s: The two main early navigational data models were
the hierarchical model, epitomized by IBM's IMS system, and
the CODASYL model implemented in a number of products such
as IDMS (CA)

NoSQL - Intro
A Brief History of Databases
• In 1970, Edgar Codd (IBM) wrote a number of papers
that outlined a new approach to database
construction that eventually culminated in the
groundbreaking “A Relational Model of Data for Large
Shared Data Banks”

• He was unhappy with the navigational model of the


CODASYL approach, notably the lack of a "search"
facility

NoSQL - Intro
A Brief History of Databases
• Codd's idea was to use a "table" of fixed-length records, with each
table used for a different type of entity

• A linked-list system would be very inefficient when storing


"sparse" databases where some of the data for any one record
could be left empty

• The relational model solved this by splitting the data into a series
of normalized tables with optional elements being moved out of
the main table

NoSQL - Intro
A Brief History of Databases
• IBM started working on a prototype system loosely
based on Codd's concepts as System R in the early
1970s. The first version was ready in 1975

• Codd's ideas were establishing themselves


as both workable and superior to
CODASYL, pushing IBM to develop a true
production version of System R, known as SQL/DS,
and, later, Database 2 (DB2)

NoSQL - Intro
A Brief History of Databases
• Larry Ellison's Oracle started from a
different chain, based on IBM's papers on
System R, and beat IBM to market when
the first version was released in 1978

• The entity–relationship model emerged


in 1976 and gained popularity for database
design as it emphasized a more familiar
description than the earlier relational model

NoSQL - Intro
A Brief History of Databases
• Codd, after his extensive research on the
Relational Model of database systems, came
up with twelve rules of his own, which
according to him, a database must obey in
order to be regarded as a
true relational database

NoSQL - Intro
A Brief History of Databases
1 Information Rule

The data stored in a database, be it user data or metadata, must be a value of some table cell. Everything in a database must be stored in a table format.

NoSQL - Intro
A Brief History of Databases
2 Guaranteed Access Rule

Every single data element (value) is guaranteed


to be accessible logically with a combination of
table-name, primary-key (row value), and
attribute-name (column value). No other means,
such as pointers, can be used to access data.

NoSQL - Intro
A Brief History of Databases
3 Systematic Treatment of NULL Values

The NULL values in a database must be given a


systematic and uniform treatment. This is a very important rule because a NULL can be interpreted as one of the following: data is missing, data is not known, or data is not applicable.

NoSQL - Intro
A Brief History of Databases
4 Active Online Catalog

The structure description of the entire


database must be stored in an online catalog, known as the data dictionary, which can be
accessed by authorized users. Users can use the
same query language to access the catalog
which they use to access the database itself.

NoSQL - Intro
A Brief History of Databases
5 Comprehensive Data Sub-Language Rule

A database can only be accessed using a language


having linear syntax that supports data definition, data
manipulation, and transaction management operations.
This language can be used directly or by means of some
application. If the database allows access to data without
any help of this language, then it is considered a violation.

NoSQL - Intro
A Brief History of Databases
6 View Updating Rule

All the views of a database, which


can theoretically be updated, must also be
updatable by the system.

NoSQL - Intro
A Brief History of Databases
7 High-Level Insert, Update, and Delete Rule

A database must support high-level


insert, update, and delete operations. This must not
be limited to a single row, that is, it must also
support union, intersection and minus operations
to yield sets of data records.

NoSQL - Intro
A Brief History of Databases
8 Physical Data Independence

The data stored in a database must


be independent of the applications that access
the database. Any change in the physical
structure of a database must not have any
impact on how the data is being accessed by
external applications.

NoSQL - Intro
A Brief History of Databases
10 Integrity Independence

A database must be independent of


the application that uses it. All its integrity
constraints can be independently modified
without the need of any change in the
application. This rule makes a database
independent of the front-end application and its
interface.

NoSQL - Intro
A Brief History of Databases
11 Distribution Independence

The end-user must not be able to see that


the data is distributed over various locations.
Users should always get the impression that the
data is located at one site only. This rule has
been regarded as the foundation of
distributed database systems.

NoSQL - Intro
A Brief History of Databases
12 Non-Subversion Rule

If a system has an interface that provides


access to low-level records, then the interface
must not be able to subvert the system and
bypass security and integrity constraints.

NoSQL - Intro
RDBMS Theory and Architecture
• Database normalization, or simply normalization, is the
process of organizing the columns (attributes) and
tables (relations) of a relational database
to reduce data redundancy and improve data integrity

• Normalization involves arranging attributes in tables


based on dependencies between attributes, ensuring
that the dependencies are properly enforced by
database integrity constraints

NoSQL - Intro
RDBMS Theory and Architecture
• Normalization is accomplished through applying some
formal rules either by a process
of synthesis or decomposition

• Codd introduced the concept of normalization and what


is now known as the First normal form (1NF) in 1970.
Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and together with Raymond F. Boyce defined the Boyce–Codd Normal Form (BCNF) in 1974

NoSQL - Intro
RDBMS Theory and Architecture
• 3NF tables are free of insertion, update,
and deletion anomalies

• The biggest problem with normalization is that you end


up with multiple tables representing what is
conceptually a single item

• Joining tables is slow and


it limits horizontal scaling (partitioning)

NoSQL - Intro
RDBMS Theory and Architecture

NoSQL - Intro
RDBMS Theory and Architecture
• Structured Query Language (SQL) is a domain-
specific language used in programming and designed for
managing data held in a relational database management system

• Despite not entirely adhering to the relational model as described


by Codd, it is the most widely used database language

• SQL became a standard of the American National Standards


Institute (ANSI) in 1986, and of the International Organization for
Standardization (ISO) in 1987

NoSQL - Intro
RDBMS Theory and Architecture
• SQL consists of a data definition language (DDL), data
manipulation language (DML), and data control language (DCL)

• DDL Example
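The DDL example on this slide is an image in the original deck; a minimal SQL sketch of what a typical DDL statement looks like (table and column names are illustrative, not taken from the slides):

CREATE TABLE employees (
    emp_id    INT PRIMARY KEY,
    last_name VARCHAR(50) NOT NULL,
    salary    DECIMAL(10,2),
    dept_id   INT
);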

NoSQL - Intro
RDBMS Theory and Architecture
• DML Example

• DCL Example
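These examples are also images in the original deck; minimal SQL sketches (the table and role names are illustrative):

-- DML: manipulating rows
INSERT INTO employees (emp_id, last_name, salary, dept_id)
VALUES (10, 'Smith', 42000.00, 3);

UPDATE employees SET salary = 45000.00 WHERE emp_id = 10;

SELECT last_name, salary FROM employees WHERE dept_id = 3;

DELETE FROM employees WHERE emp_id = 10;

-- DCL: granting and revoking privileges
GRANT SELECT, INSERT ON employees TO analyst_role;
REVOKE INSERT ON employees FROM analyst_role;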

NoSQL - Intro
ACID capabilities
• Atomicity requires that each transaction be "all
or nothing": if one part of the transaction fails, the
entire transaction fails, and the database state is
left unchanged
• The Consistency property ensures that
any transaction will bring the database from
one valid state to another
o Any data written to the database must be
valid according to all defined rules,
including constraints, cascades, triggers, and any
combination thereof
NoSQL - Intro
ACID capabilities
• The Isolation property ensures that the concurrent
execution of transactions results in a system state that
would be obtained if transactions were
executed serially, i.e., one after the other

• The Durability property ensures that once a transaction


has been committed, it will remain so, even in the event
of power loss, crashes, or errors

NoSQL - Intro
ACID capabilities
• Write Ahead Logging (WAL) is a family of techniques for
providing atomicity and durability
o In a system using WAL, all modifications are written
to a log before they are applied

• Imagine a program that is in the middle of


performing some operation when the machine it is
running on loses power. Upon restart, that program
might well need to know whether the operation it
was performing succeeded, half-succeeded, or failed

NoSQL - Intro
ACID capabilities
• If a Write Ahead Log is used, the program can check this
log and compare what it was supposed to be doing
when it unexpectedly lost power to what was actually
done

• File systems typically use a variant of WAL for at least


file system metadata called journaling

NoSQL - Intro
ACID capabilities
• Another way to implement atomic updates is
with shadow paging, which is not in-place

• Shadow paging is a copy-on-write technique


for avoiding in-place updates of pages. Instead, when a
page is to be modified, a shadow page is allocated

• Since the shadow page has no references, it can be modified liberally, without concern for consistency constraints, etc.

NoSQL - Intro
ACID capabilities
• When the page is ready to become durable, all pages
that referred to the original are updated to refer to the
new replacement page instead.

• Because the page is "activated" only when it is ready, the update is atomic

NoSQL - Intro
ACID capabilities
• Two-Phase Commit protocol (2PC) is a type
of atomic commitment protocol (ACP). It is
a distributed algorithm that coordinates all the processes
that participate in a distributed atomic transaction

• To accommodate recovery from failure the protocol's participants


use logging of the protocol's states

• Log records, which are typically slow to generate but survive


failures, are used by the protocol's recovery procedures

NoSQL - Intro
Two-Phase Commit (2PC)

NoSQL - Intro
ACID capabilities
• Three-Phase Commit protocol (3PC) is another
distributed algorithm which lets all nodes in a
distributed system agree to commit a transaction

• Unlike the two-phase commit however, 3PC is non-


blocking. Specifically, 3PC places an upper bound on
the amount of time required before a transaction
either commits or aborts

NoSQL - Intro
Three-Phase Commit (3PC)

NoSQL - Intro
Shared Nothing Architecture
• A shared nothing architecture (SN) is a
distributed computing architecture in which
each node is independent and self-sufficient, and there
is no single point of contention across the system

• More specifically, none of the nodes share memory


or disk storage

• A shared nothing architecture is popular in cloud environments because of its scalability. A pure SN system can scale almost infinitely simply by adding nodes in the form of inexpensive computers
NoSQL - Intro
Brewer's theorem (CAP)
• The CAP theorem, also known as
Brewer's theorem, states that it is impossible for
a distributed computer system to
simultaneously provide all three of the following
guarantees

NoSQL - Intro
Brewer's theorem (CAP)
• Consistency: All database clients will read the same
value for the same query, even given concurrent
updates

• Availability: All database clients will always be able


to read and write data

• Partition tolerance: The database can be split into


multiple machines; it can continue functioning in the face
of network segmentation breaks

NoSQL - Intro
Brewer's theorem (CAP)

NoSQL - Intro
Brewer's theorem (CAP)
• In February 2012, Eric Brewer provided
an updated perspective on his CAP theorem in the
article “CAP Twelve Years Later: How the ‘Rules’ Have
Changed”

https://www.researchgate.net/publication/220476881_CAP_Twelve_years_later_How_the_Rules_have_Changed

NoSQL - Intro
Database Sharding
• A database shard is a horizontal partition of data in a
database

• Each shard is commonly held on a separate database


server instance (node), to spread load

• Since the tables are divided and distributed across multiple


servers, the total number of rows in each table in each
database is reduced. This reduces index size too, which
generally improves search performance

NoSQL - Intro
Database Sharding

NoSQL - Intro
Database Sharding

NoSQL - Intro
Security Approaches

NoSQL - Intro
NoSQL Approach
• NoSQL database, also called Not Only SQL, is an
approach to data management and database
design that's useful for very large
sets of distributed data

• A NoSQL database provides a mechanism


for storage and retrieval of data that is modeled
in means other than the tabular relations used
in relational databases
NoSQL - Intro
NoSQL Approach
• NoSQL, which encompasses a wide range
of technologies and architectures, seeks to
solve the scalability and big data performance
issues that relational databases weren’t designed
to address

• There have been various approaches to


classify NoSQL databases, each with
different categories and subcategories, some of which
overlap

NoSQL - Intro
NoSQL Approach
• A basic classification based on data model:

❑ Key-value

❑ Document-oriented

❑ Columnar

❑ Graph-oriented

❑ Other (mixed)

NoSQL - Intro
NoSQL Approach
• A key-value store, or key-value database, is a
data storage paradigm designed for storing, retrieving,
and managing associative arrays (or dictionary or hash)

• These records are stored and retrieved using


a key that uniquely identifies the record, and is used
to quickly find the data within the database

• Key-value systems treat the data as a


single opaque collection which may have different fields
for every record
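As a concrete illustration, a key-value style table sketched in CQL (the query language used later in this deck); the table and key names are illustrative:

CREATE TABLE kv_store (
    key   text PRIMARY KEY,   -- unique key used to locate the record
    value blob                -- the value is opaque to the database
);

INSERT INTO kv_store (key, value)
VALUES ('user:42', textAsBlob('{"name":"Ada"}'));

SELECT value FROM kv_store WHERE key = 'user:42';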
NoSQL - Intro
Key-Value Store

NoSQL - Intro
Key-Value Store (JSON)

NoSQL - Intro
NoSQL Approach

Pros: simple data model, ease of use, horizontal scaling
Cons: no complex queries

NoSQL - Intro
Key-Value Store

NoSQL - Intro
Key-Value Store
Sample Retailers Web Application

NoSQL - Intro
Key-Value Store
Sample Retailers Web Application

NoSQL - Intro
Key-Value Store
Sample Retailers Web Application

NoSQL - Intro
NoSQL Approach
• Document-oriented databases are inherently
a subclass of the key-value store

• The difference lies in the way the data is processed; in a


key-value store the data is considered to be inherently
opaque to the database, whereas a document-oriented
system relies on internal structure in the document in order
to extract metadata that the database engine uses for
further optimization

NoSQL - Intro
Document-oriented Store

NoSQL - Intro
Document-oriented Store
• In a RDBMS, data is first categorized into a
number of predefined types, and tables are created to
hold individual records of each type

• In contrast, in a document-oriented database there may


be no internal structure that maps directly onto the
concept of a table

• A key difference between the document-oriented and


relational models is that the data formats are not
predefined in the document case
NoSQL - Intro
Document-oriented Store

Pros: schema free, ease of use, horizontal scaling
Cons: slow complex queries, not ACID

NoSQL - Intro
Document-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Document-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Column-oriented Store
• A column-oriented DBMS is a database
management system that stores data tables as sections
of columns of data rather than as rows of data

• It has advantages for data warehouses and other ad hoc


inquiry systems where aggregates are computed over
large numbers of similar data items

• Column-oriented storage is closely related to


database normalization due to the way it restricts the
database schema design
NoSQL - Intro
Column-oriented Store
A relational database management system provides data
that represents a two-dimensional table, of columns and
rows.

For example, a database might have this table:

NoSQL - Intro
Column-oriented Store
• This simple table includes an employee identifier (EmpId), and some
fields

• This two-dimensional format exists only in theory. In practice,


storage hardware requires the data to be serialized into one form or
another.

• The common solution to the storage problem is to serialize each


row of data, like this:

NoSQL - Intro
Column-oriented Store
• Row-based systems are designed to efficiently return data
for an entire row, or record, in as few operations as possible. This
matches the common use-case where the system is attempting to
retrieve information about a particular object

• Row-based systems are NOT efficient at performing operations that


apply to the entire data set, as opposed to a specific record.

• For instance, in order to find all the records in the example table that
have salaries between 40,000 and 50,000, the DBMS would have
to seek through the entire data set looking for matching records

NoSQL - Intro
Column-oriented Store
• To improve the performance of these sorts of operations, most
DBMSs support the use of database indexes

• An index on the salary column would look something like this:

NoSQL - Intro
Column-oriented Store
• A column-oriented database serializes all of the values of a column
together, then the values of the next column, and so on.

• For our example table, the data would be stored in this fashion:

• In this layout, any one of the columns more closely matches


the structure of an index in a row-based system.

NoSQL - Intro
Column-oriented Store
• Bigtable is a compressed, high performance, and proprietary data
storage system built on Google File System (GFS)

• Bigtable development began in 2004 and is now used by a number


of Google applications, such as web indexing, MapReduce, which is
often used for generating and modifying data stored in
Bigtable, Google Maps, Google Book Search, "My Search History",
Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail

NoSQL - Intro
Column-oriented Store
• Bigtable maps two arbitrary string values (row key and column key)
and timestamp (hence three-dimensional mapping) into an
associated arbitrary byte array.

• It is not a relational database and can be better defined as a sparse,


distributed multi-dimensional sorted map

• HBase is an open source, non-relational, distributed database


modeled after Google's BigTable and written in Java. It is developed
as part of Apache Software Foundation's Apache Hadoop project
and runs on top of HDFS

NoSQL - Intro
Bigtable Data Model
• Table is a collection of rows
• Row is a collection of column families
• Column family is a collection of columns
• Column is a collection of key-value pairs

NoSQL - Intro
Column-oriented Store

Pros: OLAP performance, horizontal scaling
Cons: OLTP performance, handling single records

NoSQL - Intro
Column-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Column-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Graph-oriented Store
• A graph database is a database that uses graph structures for
semantic queries with nodes, edges and properties to represent and
store data

• Compared with relational databases, graph databases are


often faster for associative data and map more directly to the
structure of object-oriented applications

• They can scale more naturally to large data sets as they do


not typically require expensive join operations

NoSQL - Intro
Graph-oriented Store

NoSQL - Intro
Graph-oriented Store

Pros: handling relations, optimal for some use cases (maps, social, …)
Cons: complex queries, not ACID

NoSQL - Intro
Graph-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Graph-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Graph-oriented Store
Sample Retailers Web Application

NoSQL - Intro
Best Practices

NoSQL - Intro
Best Practices
• NoSQL data modeling often starts from the application-
specific queries as opposed to relational modeling

• NoSQL data modeling often


requires a
deeper understanding
of data structures and algorithms than relational
database modeling does

• Data duplication and denormalization are first-class


citizens

NoSQL - Intro
Design Patterns
• Data Access Object Pattern (or DAO pattern) is
used to separate low level data accessing API
or operations from high level business services

• Data access object (DAO) is an object


that provides an abstract interface to some
type of database or other
persistence mechanism

NoSQL - Intro
Design Patterns

NoSQL - Intro
Design Patterns
• CQRS is a simple pattern that
strictly segregates the responsibility of
handling command input into an autonomous
system from the responsibility of handling side-
effect-free query / read access on the
same system

NoSQL - Intro
Design Patterns

NoSQL - Intro
Design Patterns
• The fundamental idea of Event Sourcing is that
of ensuring every change to the state of an
application is captured in an event object

• These event objects are themselves stored in


the sequence they were applied for the same
lifetime as the application state itself

NoSQL - Intro
Design Patterns

NoSQL - Intro
Apache Cassandra
Data Model and CQL
Relational Model
• We have tables with columns, a primary key, and more and more foreign keys

• The relational schema helps us identify the actors (Cart, Article, User, …)

• There are join tables for M-N relationships

• We use a model-first approach; queries are always the last task

Apache Cassandra – Data Model and CQL


Data Model
• Column family as a way to store and organize data

• Table as a two-dimensional view of a multi-dimensional


column family

• Operations on tables using the Cassandra Query


Language (CQL)

Apache Cassandra – Data Model and CQL


Data Model
• Data modeling in Cassandra uses a query-driven approach

• The fewer partitions that must be queried to get an


answer to a question, the faster the response

• The maximum number of columns per row is two billion

• Rule of thumb is to keep the maximum number of values


below 100,000 items and the disk size under 100 MB per
partition

Apache Cassandra – Data Model and CQL


Data Model

Apache Cassandra – Data Model and CQL


Data Model
• Sometimes the row key is also called the partition key

• A row can have some static columns (shared by all rows in the same partition)

• A row can have some clustering keys

Apache Cassandra – Data Model and CQL


Data Model

Apache Cassandra – Data Model and CQL


Data Model

Apache Cassandra – Data Model and CQL


Data Model
Row is the smallest unit that stores related data in Apache
Cassandra

• Rows – individual rows constitute a column family


• Row key – uniquely identifies a row in a column family
• Row – stores pairs of column keys and column values
• Column key – uniquely identifies a column value in a row
• Column value – stores one value or a collection of values

Apache Cassandra – Data Model and CQL


Column Family
• Set of rows with a similar structure
• Distributed
• Sparse
• Sorted columns
• Multidimensional

Apache Cassandra – Data Model and CQL


Column Family
• Size of a column family is limited only by the size of the cluster
• Data from one row must fit on one node
• Maximum columns per row is 2 billion
• Maximum data size per cell (column value) is 2 GB

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL
A CQL table is a column family

• It provides a two-dimensional view of a column family which contains potentially


multi-dimensional data, due to composite keys and collections

CQL table and column family are largely interchangeable terms

• Not surprising when you recall tables and relations, columns and attributes, rows and tuples in relational databases

Supported by declarative language CQL

• DDL is a subset of CQL


• SQL-like syntax, but with somewhat different semantics
• Convenient for defining and expressing Cassandra database schemas
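The CQL examples on the following slides are images in the original deck; a minimal CQL sketch of a keyspace and table definition (the keyspace, table, and column names are illustrative and reused in the sketches below):

CREATE KEYSPACE IF NOT EXISTS retail
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE retail.users_by_id (
  user_id uuid,
  name    text,
  email   text,
  PRIMARY KEY (user_id)
);

USE retail;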

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL

Apache Cassandra – Data Model and CQL


Cassandra Query Language - CQL

Apache Cassandra – Data Model and CQL


Simple Primary Key
• Just one column in the primary key
• It is very fast for writes and point reads
• Keep in mind that only the primary key can be specified when retrieving data from the table
• In the WHERE clause you can use only the equality and IN operators (<= and >= are available only with the ByteOrderedPartitioner)
• It uses a TALL NARROW pattern
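A minimal sketch of point reads on a simple primary key, using the users_by_id table sketched earlier (the UUID literals are illustrative):

SELECT name, email
  FROM users_by_id
 WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- IN is also allowed on the partition key
SELECT name
  FROM users_by_id
 WHERE user_id IN (62c36092-82a1-3a00-93d1-46196ee77204,
                   7d444840-9dc0-11d1-b245-5ffdce74fad2);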

Apache Cassandra – Data Model and CQL


Composite Partition Key
• There are multiple columns in the partition key
• Can reduce hotspots in time series
• Keep in mind that to retrieve data from the table, values for all columns defined in the partition key have to be supplied
• It uses a TALL NARROW pattern
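A minimal sketch of a composite partition key that buckets a time series by day (table and column names are illustrative):

CREATE TABLE readings_by_sensor_day (
  sensor_id  uuid,
  day        date,
  reading_ts timestamp,
  value      double,
  PRIMARY KEY ((sensor_id, day), reading_ts)  -- (sensor_id, day) is the composite partition key
);

-- values for all partition key columns must be supplied
SELECT reading_ts, value
  FROM readings_by_sensor_day
 WHERE sensor_id = 62c36092-82a1-3a00-93d1-46196ee77204
   AND day = '2017-03-01';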

Apache Cassandra – Data Model and CQL


Compound Primary Key
• Clustering columns
• Clustering is a storage engine process that sorts data within each partition based on the definition of the clustering columns
• Use clustering instead of JOINs in the Cassandra data model
• Pay attention to full scans
• Under the hood, it leverages super-columns
• It uses a FLAT WIDE pattern
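A minimal sketch of a compound primary key with clustering columns (names are illustrative; this table is reused in later sketches):

CREATE TABLE albums_by_artist (
  artist text,
  year   int,
  album  text,
  label  text,
  PRIMARY KEY (artist, year, album)  -- artist = partition key, (year, album) = clustering columns
) WITH CLUSTERING ORDER BY (year DESC, album ASC);

-- rows within the partition come back sorted by year, then album
SELECT year, album FROM albums_by_artist WHERE artist = 'The Beatles';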

Apache Cassandra – Data Model and CQL


Materialized View
• A materialized view is a table that is built from another table's
data with a new primary key and new properties

• Usage is suggested on high-cardinality data. It causes hotspots when low-cardinality data is inserted

• Secondary Indexes are suited for low cardinality data. Queries of


high cardinality columns on secondary indexes require Apache
Cassandra to access all nodes in a cluster, causing high read
latency.

Apache Cassandra – Data Model and CQL


Materialized View
• All columns of the source table's primary key must be part of the materialized view's primary key

• Only one new column can be added to the materialized view's primary key

• Static columns are not allowed
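A minimal sketch of a materialized view over the users_by_id table sketched earlier (names are illustrative):

CREATE MATERIALIZED VIEW users_by_email AS
  SELECT user_id, name, email
    FROM users_by_id
   WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);

SELECT user_id, name FROM users_by_email WHERE email = 'ada@example.com';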

Apache Cassandra – Data Model and CQL


Materialized View

Apache Cassandra – Data Model and CQL


Materialized View
• Materialized views do not have the same write performance characteristics that normal table writes have

• The materialized view requires an additional read-before-write, as well as data consistency checks on each replica before creating the view updates. These additions add overhead and may increase the latency of writes

• If the rows are to be combined before being placed in the view, materialized views will not work. Materialized views create a CQL row in the view for each CQL row in the base table

Apache Cassandra – Data Model and CQL


DDL – Primary Key Summary
Simple partition key, no clustering columns
PRIMARY KEY ( partition_key_column )

Composite partition key, no clustering columns


PRIMARY KEY ( ( partition_key_col1, …, partition_key_colN ) )

Simple partition key and clustering columns


PRIMARY KEY ( partition_key_column, clustering_column1,
…, clustering_columnM )

Composite partition key and clustering columns


PRIMARY KEY ( ( partition_key_col1, …, partition_key_colN),
clustering_column1, …, clustering_columnM)

Apache Cassandra – Data Model and CQL


DDL – Types

Apache Cassandra – Data Model and CQL


UUID & TIMEUUID
TimeUUID embeds a time value within a UUID

• Generated using time (60 bits), a clock sequence number (14 bits), and
MAC address (48 bits)

• 1be43390-9fe4-11e3-8d05-425861b86ab6
• CQL function now() generates a new TIMEUUID
• Time can be extracted from TIMEUUID
• CQL function dateOf() extracts the embedded timestamp as a date
• TIMEUUID values in clustering columns or in column names are
ordered based on time
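A minimal sketch using TIMEUUID clustering for a time-ordered event log (names are illustrative):

CREATE TABLE events_by_stream (
  stream_id uuid,
  event_id  timeuuid,
  payload   text,
  PRIMARY KEY (stream_id, event_id)
);

INSERT INTO events_by_stream (stream_id, event_id, payload)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, now(), 'order created');

SELECT dateOf(event_id) AS occurred_at, payload
  FROM events_by_stream
 WHERE stream_id = 62c36092-82a1-3a00-93d1-46196ee77204;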

Apache Cassandra – Data Model and CQL


Distributed Counters
Apache Cassandra supports distributed counters

• Useful for tracking a count


• Counter column stores a number that can only be updated
• Incremented or decremented
• Cannot assign an initial value to a counter (initial value is 0)
• Counter column cannot be part of a primary key
• If a table has a counter column, all non-counter columns must be part of the primary key
• Read is as efficient as for non-counter columns
• Update slightly slower: a read is required before a write can be
performed
• Counter update is not an idempotent operation
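A minimal counter table sketch (names are illustrative):

CREATE TABLE page_views (
  page  text PRIMARY KEY,
  views counter
);

-- counters can only be incremented or decremented, never set directly
UPDATE page_views SET views = views + 1 WHERE page = '/home';

SELECT views FROM page_views WHERE page = '/home';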

Apache Cassandra – Data Model and CQL


Clustering Order By
CLUSTERING ORDER BY defines how data values in clustering columns
are ordered (ASC or DESC) in a table

• ASC is the default order for all clustering columns


• When retrieving data, the default order or the order specified by a
CLUSTERING ORDER BY clause is used

• The order can be reversed in a query using the ORDER BY clause

Apache Cassandra – Data Model and CQL


Collection columns
Set – typed collection of unique values
keywords SET<VARCHAR>
• Ordered by values
• No duplicates
List – typed collection of non-unique values
Songwriters LIST<VARCHAR>
• Ordered by position
• Duplicates are allowed
Map – typed collection of key-value pairs
Tracks MAP<INT,VARCHAR>
• Ordered by keys
• Unique keys but not values
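A minimal sketch of the three collection types, reusing the column names from this slide (table name and values are illustrative):

CREATE TABLE albums (
  title       text PRIMARY KEY,
  keywords    set<varchar>,
  songwriters list<varchar>,
  tracks      map<int, varchar>
);

UPDATE albums SET keywords    = keywords + {'rock'}       WHERE title = 'Abbey Road';
UPDATE albums SET songwriters = songwriters + ['Lennon']  WHERE title = 'Abbey Road';
UPDATE albums SET tracks[1]   = 'Come Together'           WHERE title = 'Abbey Road';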

Apache Cassandra – Data Model and CQL


User Defined Types
User-defined types group related fields of information

Represents related data in a single table, instead of multiple, separate


tables
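A minimal sketch of a user-defined type (names are illustrative):

CREATE TYPE address (
  street text,
  city   text,
  zip    text
);

CREATE TABLE users_with_address (
  user_id uuid PRIMARY KEY,
  name    text,
  home    frozen<address>
);

INSERT INTO users_with_address (user_id, name, home)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada',
        { street: '10 Main St', city: 'Turin', zip: '10100' });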

Apache Cassandra – Data Model and CQL


User Defined Functions
To calculate values, you can declare Java or Javascript functions

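A minimal sketch of a user-defined function in Java (assumes UDFs are enabled in cassandra.yaml; the function itself is illustrative):

CREATE OR REPLACE FUNCTION fahrenheit_to_celsius (temp double)
  RETURNS NULL ON NULL INPUT
  RETURNS double
  LANGUAGE java
  AS 'return (temp - 32.0) * 5.0 / 9.0;';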

Apache Cassandra – Data Model and CQL


User Defined Aggregate Functions
• Supported languages are: Java, Javascript, Python, Scala
• To enable them you must set the enable_user_defined_functions attribute in cassandra.yaml

• The query must only include the aggregate function itself, but no
columns.

• The state function is called once for each row, and the value returned
by the state function becomes the new state.

• After all rows are processed, the optional final function is executed with
the last state value as its argument. Aggregation is performed by the
coordinator
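A minimal sketch of a user-defined aggregate that counts non-null values (illustrative; assumes UDFs/UDAs are enabled):

-- state function: called once per row, returns the new state
CREATE OR REPLACE FUNCTION count_state (state bigint, val text)
  CALLED ON NULL INPUT
  RETURNS bigint
  LANGUAGE java
  AS 'return val == null ? state : state + 1;';

-- aggregate: initial state 0; no final function, so the last state is the result
CREATE OR REPLACE AGGREGATE count_non_null (text)
  SFUNC count_state
  STYPE bigint
  INITCOND 0;

-- the query must include only the aggregate, no other columns
SELECT count_non_null(email) FROM users_by_id;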

Apache Cassandra – Data Model and CQL


User Defined Aggregate Functions

Apache Cassandra – Data Model and CQL


Tuple
• Tuples hold fixed-length sets of typed positional fields
• Convenient alternative to creating a user-defined type
• Accommodates up to 32768 fields, but generally only a few are used
• Useful when prototyping
• Must use the frozen keyword in C* 2.1
• Tuples can be nested in other tuples

Apache Cassandra – Data Model and CQL


Secondary Index
Tables are indexed on columns in a primary key

• Search on a partition key is very efficient


• Search on a partition key and clustering
columns is very efficient
• Search on other columns is not supported

Secondary indexes

• Can index additional columns to enable


searching by those columns
• One column per index

Apache Cassandra – Data Model and CQL


Secondary Index
Secondary indexes are for searching convenience
• Use with low-cardinality columns
• Columns that may contain a relatively small set of distinct values
• For example, there are many artists but only a few dozen music styles
• Allows searching for all artists for a specified style (a potentially expensive
query because it may return a large result set)

Use with smaller datasets or when prototyping

Do not use:
• On high-cardinality columns
• On counter column tables
• On frequently updated or deleted columns
• To look for a row in a large partition unless narrowly queried
• e.g., search on both a partition key and an indexed column
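A minimal sketch of the artists/music-styles example above (table and index names are illustrative):

CREATE TABLE artists (
  name  text PRIMARY KEY,
  style text
);

CREATE INDEX artists_by_style ON artists (style);

-- convenient, but potentially expensive: it may return a large result set
SELECT name FROM artists WHERE style = 'Reggae';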
Apache Cassandra – Data Model and CQL
Secondary Index - SASI
• SASI is significantly less resource intensive, using less memory, disk, and CPU. It enables querying with prefix and contains on strings, similar to SQL's LIKE 'foo%' or LIKE '%foo%'

• The SASI index data structures are built in memory as the SSTable is
written and flushed to disk as sequential writes before the SSTable
writing completes. One index file is written for each indexed column.

• SASI supports all queries already supported by CQL, and supports


the LIKE operator using PREFIX, CONTAINS, and SPARSE
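A minimal SASI sketch (table, index, and values are illustrative):

CREATE TABLE tracks_by_id (
  track_id uuid PRIMARY KEY,
  title    text
);

CREATE CUSTOM INDEX tracks_title_sasi ON tracks_by_id (title)
  USING 'org.apache.cassandra.index.sasi.SASIIndex'
  WITH OPTIONS = { 'mode': 'CONTAINS' };

SELECT * FROM tracks_by_id WHERE title LIKE '%love%';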

Apache Cassandra – Data Model and CQL


Secondary Index - SASI

Apache Cassandra – Data Model and CQL


DML - Select

Apache Cassandra – Data Model and CQL


DML - Where
Equality search – one partition
• To retrieve one partition, values for all partition key columns must be specified
Equality search – one row
• To retrieve one row, values for all primary key columns must be specified
Equality search – subset of rows
• To retrieve a subset of rows in a partition, values for all partition key columns and one
or more but not all clustering columns must be specified

Range search >, >=, <, <=


• Can only do a range search on a partition key using the token() function
• Can only search on clustering columns if a partition key filter has been specified
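Minimal sketches of these cases, using the albums_by_artist table sketched earlier (values are illustrative):

-- one partition: all partition key columns specified
SELECT * FROM albums_by_artist WHERE artist = 'The Beatles';

-- one row: all primary key columns specified
SELECT * FROM albums_by_artist
 WHERE artist = 'The Beatles' AND year = 1969 AND album = 'Abbey Road';

-- subset of rows in a partition: partition key plus a leading clustering column
SELECT * FROM albums_by_artist
 WHERE artist = 'The Beatles' AND year >= 1965 AND year <= 1969;

-- range search across partitions is only possible with token()
SELECT * FROM albums_by_artist WHERE token(artist) > token('The Beatles');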

Apache Cassandra – Data Model and CQL


DML – Allow Filtering
Allows scanning over all partitions

• Predicate does not specify values for partition key columns


• Relaxes the requirement that a partition key must be specified
• Potentially expensive queries that may return large results
• LIMIT clause is recommended
• Predicate can have equality or inequality relations on clustering
columns

Do NOT use in production
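A minimal sketch (reusing albums_by_artist; without ALLOW FILTERING this query would be rejected):

-- scans all partitions; LIMIT is recommended, avoid in production
SELECT * FROM albums_by_artist WHERE year = 1969 LIMIT 100 ALLOW FILTERING;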

Apache Cassandra – Data Model and CQL


Query UDT – User Defined Type

Apache Cassandra – Data Model and CQL


Order by
ORDER BY specifies how query results must be sorted

• Allowed only on clustering columns

• Default order is ASC or as defined by WITH CLUSTERING ORDER

• Default order can be reversed for all clustering columns at once

Apache Cassandra – Data Model and CQL


Functions
dateOf()
• extracts the timestamp as a date of a timeuuid column

now()
• generates a new unique timeuuid

uuid()
• generates a unique id

minTimeuuid() and maxTimeuuid()


• return a UUID-like result given a conditional time component as an
argument

Apache Cassandra – Data Model and CQL


Functions
unixTimestampOf()
• extracts the “raw” timestamp of a timeuuid column as a 64-bit
integer

typeAsBlob() and blobAsType()

token() function

Apache Cassandra – Data Model and CQL


Lightweight Transactions
Handles the read-before-write pattern at the coordinator level.

• Works well only on a single partition

• Transactions on different partitions are isolated

• It does not lock the record
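A minimal lightweight-transaction sketch (compare-and-set) on the users_by_id table sketched earlier:

-- the row is inserted only if it does not already exist
INSERT INTO users_by_id (user_id, name, email)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada', 'ada@example.com')
IF NOT EXISTS;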

Apache Cassandra – Data Model and CQL


Batch Statement
Batch statement combines multiple INSERT, UPDATE, and DELETE statements
into a single logical operation

• Saves on client-server and coordinator-replica communication


• Atomic operation
• If any statement in the batch succeeds, all will
• No batch isolation
• Other “transactions” can read and write data being affected by a
partially executed batch
• Lightweight transactions in batch
• Batch will execute only if conditions for all lightweight transactions are
met
• All operations in the batch will execute serially, with increased performance overhead

Apache Cassandra – Data Model and CQL


Batch Statement
• It uses a BatchLog system table to consume statements
• You can disable it using the keyword UNLOGGED
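A minimal batch sketch on the users_by_id table sketched earlier (values are illustrative):

BEGIN BATCH
  INSERT INTO users_by_id (user_id, name)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada');
  UPDATE users_by_id SET email = 'ada@example.com'
   WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
APPLY BATCH;

-- BEGIN UNLOGGED BATCH ... APPLY BATCH; skips the BatchLog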

Apache Cassandra – Data Model and CQL


Batch Conditional Update
• Leverages Lightweight Transactions
• Updates must occur in the same partition

Apache Cassandra – Data Model and CQL


Relational Model

Use Case – Hotel Reservation


Query First Approach
Query first is the right approach to designing a NoSQL solution

• Q1: Find Hotels near a given POI


• Q2: Find hotel information by name and location
• Q3: Find POIs near a given hotel
• Q4: Find an available room by day
• Q5: Find a rate and amenities for a room
• Q6: Lookup a reservation by confirmation number
• Q7. Lookup a reservation by hotel, date, and guest name
• Q8. Lookup all reservations by guest name
• Q9. View guest details
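As an example of the query-first approach, a table sketch that could serve Q1 (hypothetical; the actual schema is shown in the Chebotko and physical model diagrams that follow):

CREATE TABLE hotels_by_poi (
  poi_name text,
  hotel_id text,
  name     text,
  phone    text,
  address  text,
  PRIMARY KEY ((poi_name), hotel_id)
);

SELECT hotel_id, name, phone FROM hotels_by_poi WHERE poi_name = 'Colosseum';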

Use Case – Hotel Reservation


Application queries

Use Case – Hotel Reservation


Chebotko Diagram

Use Case – Hotel Reservation


Chebotko Diagram – Use Case

Use Case – Hotel Reservation


Chebotko Diagram – Use Case

Use Case – Hotel Reservation


Physical Model Diagram

Use Case – Hotel Reservation


Physical Model - Use Case

Use Case – Hotel Reservation


Physical Model - Use Case

Use Case – Hotel Reservation


Materialized View

Use Case – Hotel Reservation


Thanks for your attention!
