Ado Lecture III 2024-26
Uploaded by thehorizon2026

Advance Data Organization

Lecture III
MBA(DSDA) 2024-26, SCIT
BigTable
ADO
• BigTable
• Amazon DynamoDB
• HBase
• Cassandra
Revision of Last Lecture
ADO

Journey From
RDBMS to NoSQL (II)
ADO
• BigTable
• Amazon DynamoDB
• HBase
• Cassandra
BigTable
Amazon DynamoDB
• Amazon DynamoDB is a fully managed, multi-region, auto-scaling key-value and document database, so you don't have to worry about the infrastructure or datacenter.
• Dynamo Paper: published by a group of Amazon.com engineers in 2007, the Dynamo Paper described a new kind of database.
• Other main NoSQL solutions, such as Apache Cassandra (2008) and MongoDB (2009), emerged in the same wave.
• Single-Table Design: put all your entities in the same table.
• DynamoDB itself was released by AWS as a fully managed NoSQL database in 2012.
Amazon DynamoDB
• Primary keys for your items are composed of a partition key and, optionally, a sort key.
• You can use just the partition key as the primary key, but in most cases you will also want to leverage a sort key.
• DynamoDB allows two kinds of primary keys:
– Simple primary keys, made of a single element called the partition key (PK).
– Composite primary keys, made of a partition key (PK) and a sort key (SK).
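As a minimal sketch of the two key shapes described above, the dictionary below mirrors the parameters a DynamoDB CreateTable call would take for a composite primary key. The table and attribute names ("Music", "Artist", "SongTitle") are illustrative, not from the lecture.

```python
# Sketch of CreateTable-style parameters for a composite primary key
# (partition key + sort key). Names are hypothetical examples.

def composite_key_table_params(table_name, pk_name, sk_name):
    """Build CreateTable-style parameters for a composite primary key."""
    return {
        "TableName": table_name,
        "KeySchema": [
            {"AttributeName": pk_name, "KeyType": "HASH"},   # partition key (PK)
            {"AttributeName": sk_name, "KeyType": "RANGE"},  # sort key (SK)
        ],
        "AttributeDefinitions": [
            {"AttributeName": pk_name, "AttributeType": "S"},  # S = string
            {"AttributeName": sk_name, "AttributeType": "S"},
        ],
    }

params = composite_key_table_params("Music", "Artist", "SongTitle")
print(params["KeySchema"])
```

For a simple primary key, the KeySchema list would contain only the HASH entry.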
Amazon DynamoDB
• Partitioning- Horizontal Scaling

Table named Pets, which spans multiple partitions. The table's primary key is AnimalType.
Amazon DynamoDB
• Partitioning- Horizontal Scaling

Pets table has a composite primary key consisting of AnimalType (partition key) and Name (sort key).
Amazon DynamoDB
Partitioning- Horizontal Scaling
• Each partition is roughly 10GB in size, so
DynamoDB will add additional partitions to
your table as it grows. A small table may only
have 2-3 partitions, while a large table could
have thousands of partitions.
• The great part about this setup is how well it
scales. The request router’s work of finding
the proper node has time complexity of O(1).
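The O(1) routing above can be sketched as follows: hash the partition key and map the hash to a partition. The modulo scheme below is a deliberate simplification (DynamoDB's real router maps the hash into partition key ranges), but it shows why the lookup is constant time.

```python
import hashlib

# Simplified sketch of request routing: hash the partition key, then map
# the hash to one of the partitions. Constant-time, regardless of table size.

def route(partition_key: str, num_partitions: int) -> int:
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions  # O(1) lookup

# Every request for the same key lands on the same partition.
p1 = route("Dog", 3)
p2 = route("Dog", 3)
print(p1, p2)
```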
Amazon DynamoDB
Partitioning- Horizontal Scaling
When thinking about data access in DynamoDB, remember this picture: the partition key is hashed to locate the partition that holds the item.
Amazon DynamoDB
Partitioning- Horizontal Scaling
• RCU (read capacity unit) and WCU (write capacity unit) values are spread across a number of partitions.
Amazon DynamoDB
Partitioning- Horizontal Scaling
• One partition can handle 10GB of data, 3000 read capacity units (RCU), and 1000 write capacity units (WCU), indicating a direct relationship between the amount of data stored in a table and its performance requirements.
• A new partition will be added when more than 10GB of data is stored in a table, or RCUs are greater than 3000, or WCUs are greater than 1000. The data will then be spread across these partitions.
Amazon DynamoDB
Partitioning- Horizontal Scaling
A formula can be created to calculate the desired
number of partitions:
• Partitions for desired read performance =
Desired RCU / 3000 RCU
• Partitions for desired write performance =
Desired WCU / 1000 WCU
• Total partitions for desired performance =
(Desired RCU / 3000 RCU) + (Desired WCU / 1000 WCU)
Amazon DynamoDB
Partitioning- Horizontal Scaling
A formula can be created to calculate the
desired number of partitions:
• Total partitions for desired storage = Desired
capacity in GB / 10GB
• Total partitions =
MAX(Total partitions for desired performance,
Total partitions for desired capacity)
Amazon DynamoDB
Partitioning- Horizontal Scaling
As an example, consider the following requirements:
• RCU Capacity: 7500
• WCU Capacity: 4000
• Storage Capacity: 100GB
Amazon DynamoDB
Partitioning- Horizontal Scaling
• The required number of partitions for performance can be calculated as:
(7500/3000) + (4000/1000) = 2.5 + 4 = 6.5, rounded up to 7
• The required number of partitions for capacity is:
100/10 = 10
• So the total number of partitions required is:
MAX(7, 10) = 10
Amazon DynamoDB
Partitioning- Horizontal Scaling
• RCU per partition= 7500/10=750
• WCU per partition= 4000/10=400
Amazon DynamoDB
Partitioning- Horizontal Scaling
• DynamoDB adaptive capacity
Amazon DynamoDB
Partitioning- Horizontal Scaling
• DynamoDB’s lack of support for joins is mostly
due to the partitioning scheme.
Amazon DynamoDB
Query Language
• PartiQL (a SQL-compatible query language)

/* Return a single song, by primary key (in SQL) */
SELECT * FROM Music
WHERE Artist = 'No One You Know' AND SongTitle = 'Call Me Today';

// Return a single song, by primary key (in DynamoDB)
{
  TableName: "Music",
  KeyConditionExpression: "Artist = :a and SongTitle = :t",
  ExpressionAttributeValues: {
    ":a": "No One You Know",
    ":t": "Call Me Today"
  }
}

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.ReadData.Query.html
Amazon DynamoDB
• Useful Links:
• https://medium.com/swlh/data-modeling-in-aws-dynamodb-dcec6798e955
• https://blog.theodo.com/2021/04/introduction-to-dynamo-db-modeling/
BigTable
• Bigtable + MapReduce -> HBase (on HDFS)
• Bigtable + Amazon Dynamo -> Cassandra
HBase
• Distributed
• Big Data Store
• Non-Relational
• Flexible Data Model
• Scalable
HBase
• Apache HBase is a wide-column data store
based on Apache Hadoop and on BigTable
concepts.
• The basic unit of storage in HBase is a table. A table consists of one or more column families, which in turn consist of columns.
• Columns are grouped into column families. Data is stored in rows.
HBase
• A row is a collection of key/value pairs.
• Each row is uniquely identified by a row key.
• The row keys are created when table data is added; they determine the sort order and are used for data sharding, which is splitting a large table and distributing data across the cluster.
HBase
• HBase provides a flexible schema model in which
columns may be added to a table column family
as required without predefining the columns.
• Only the table and column family/ies are
required to be defined in advance.
• No two rows in a table are required to have the
same column/s.
• All columns in a column family are stored in
close proximity.
HBase
• HBase does not support multi-row transactions; operations are atomic only at the row level.
• HBase is not eventually consistent but is strongly consistent at the record level.
• Strong consistency implies that the latest data
is always served but at the cost of increased
latency.
• In contrast, eventual consistency can return
out-of-date data.
HBase
• HBase does not have the notion of data types; all data is stored as arrays of bytes.
• Rows in a table are sorted lexicographically by
row key, a design feature that makes it feasible
to store related rows (or rows that will be read
together) together for optimized scan.
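Because row keys are compared as byte arrays, lexicographic ordering can be sketched with plain byte strings. The key names below are hypothetical; note the classic pitfall that unpadded numbers sort as strings, so "user10" sorts before "user2".

```python
# HBase sorts rows lexicographically by row key (byte-wise comparison).
# Rows sharing a key prefix end up adjacent, so a scan reads them together.

row_keys = [b"user2#profile", b"user10#profile", b"user1#profile",
            b"user1#settings"]
sorted_keys = sorted(row_keys)  # byte-wise, i.e. lexicographic order
print(sorted_keys)
```

This is why row-key design matters: keys meant to be read together should share a prefix, and numeric components should be zero-padded to preserve numeric order.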
HBase
• Revision

A column-oriented database serializes all of the values of a column together, then the
values of the next column, and so on.
HBase
• HBase Blocks
Key:Value
HBase
• HBase Blocks
Key:Value

Cell Example
HBase
• HBase Blocks
Key:Value

Physical Representation of HBase Table
HBase
• HBase Blocks

Physical Representation of HBase Table
HBase
• HBase Blocks

Physical Representation of First Row of the HBase Table


HBase
• HBase Blocks

Physical Representation of HBase Table
HBase
• HBase
HBase
• HBase Query Language
• put(), get(), scan()

More at: HBase Query Example: put(), get(), scan() Command in HBase
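To make the semantics of put(), get(), and scan() concrete without a running cluster, here is a toy in-memory model; this is not the real HBase shell or client API, and the class and column names are illustrative only.

```python
# Toy in-memory model of HBase-style put/get/scan semantics.
# Rows are kept sorted by row key; scan returns rows in a [start, stop) range.

class ToyTable:
    def __init__(self):
        self.rows = {}  # row_key -> {"cf:qualifier": value}

    def put(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value

    def get(self, row):
        return self.rows.get(row, {})

    def scan(self, start=None, stop=None):
        for key in sorted(self.rows):       # lexicographic row-key order
            if start is not None and key < start:
                continue
            if stop is not None and key >= stop:
                break
            yield key, self.rows[key]

t = ToyTable()
t.put("row1", "cf:name", "Rex")
t.put("row1", "cf:type", "dog")
t.put("row2", "cf:name", "Tom")
print(t.get("row1"))
print(list(t.scan(start="row1", stop="row2")))
```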
Cassandra
History
• Cassandra was developed at Facebook for
inbox search.
• It was open-sourced by Facebook in July 2008.
• Cassandra was accepted into Apache
Incubator in March 2009.
• It became an Apache top-level project in February 2010.
Cassandra
Imp Websites
• http://cassandra.apache.org/

• https://academy.datastax.com/planet-cassandra/what-is-apache-cassandra

• https://www.credera.com/blog/technology-insights/java/cassandra-explained-5-minutes-less/
Cassandra
Keyspace
Cassandra
Column Family
Cassandra
Column

INSERT INTO KeyspaceName.TableName (ColumnNames) VALUES (ColumnValues) USING TTL TimeInSeconds;
Cassandra
• No Join
• No Foreign Key
• No Sequences

Example of Sequence (in SQL)
CREATE SEQUENCE sequence_name START WITH initial_value
INCREMENT BY increment_value MINVALUE minimum_value
MAXVALUE maximum_value CYCLE|NOCYCLE;
Cassandra
• Although Cassandra falls under the column-oriented database type, it is really a partitioned data store: "partitioned" refers to the fact that the database uses a unique key for each row to distribute the rows across multiple nodes. It stores data in sparse hash tables, with "sparse" alluding to the fact that not all rows need to have the same columns.
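The partitioning idea above can be sketched in a few lines: hash each row's partition key into a token, and let the token decide which node owns the row. A real cluster uses a token ring with the Murmur3 partitioner; the modulo scheme and node names below are simplifications for illustration.

```python
import hashlib

# Sketch of Cassandra-style partitioning: the partition key is hashed and
# the resulting token determines which node owns the row.

def owner_node(partition_key: str, nodes: list) -> str:
    token = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return nodes[token % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
placement = {k: owner_node(k, nodes) for k in ["alice", "bob", "carol"]}
print(placement)
```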
Cassandra
• Cassandra handles various types of data, such as
structured, semi-structured, and unstructured
data.
• Cassandra is fully distributed.
• Cassandra is designed as a decentralized database,
meaning that all nodes are the same; there’s no
concept of master/slave nodes. This also means
that there is no single point of failure since there
are no special hosts, and the cluster continues
operations regardless of node failures.
Cassandra
• Cassandra uses an efficient log–structured
engine that turns updates into sequential I/O.
Cassandra
• It is scalable, fault-tolerant, and consistent.
• It is a column-oriented database.
• Its distribution design is based on Amazon’s Dynamo and its
data model on Google’s Bigtable.
• Created at Facebook, it differs sharply from relational database
management systems.
• Cassandra implements a Dynamo-style replication model with
no single point of failure, but adds a more powerful “column
family” data model.
• Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Cassandra Features
• Elastic scalability − Cassandra is highly scalable; it allows you to add more hardware to accommodate more customers and more data as per requirement.
• Always on architecture − Cassandra has no single
point of failure and it is continuously available for
business-critical applications that cannot afford a
failure.
• Fast linear-scale performance − Cassandra is linearly
scalable, i.e., it increases your throughput as you
increase the number of nodes in the cluster.
Therefore it maintains a quick response time.
Cassandra Features
• Flexible data storage − Cassandra accommodates all possible
data formats including: structured, semi-structured, and
unstructured. It can dynamically accommodate changes to
your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to
distribute data where you need by replicating data across
multiple data centers.
• Transaction support − Cassandra supports ACID-like properties: writes are atomic, isolated, and durable at the row level, with tunable consistency (it does not offer full multi-row ACID transactions).
• Fast writes − Cassandra was designed to run on cheap
commodity hardware. It performs blazingly fast writes and
can store hundreds of terabytes of data, without sacrificing
the read efficiency.
Cassandra Architecture
• Node − The place where data is stored; the basic component of Cassandra.
• Data Center − A collection of nodes is called a data center.
• Cluster − A cluster is a collection of many data centers.
• Commit Log − Every write operation is written to the commit log, which is used for crash recovery.
• Mem-table − After data is written to the commit log, it is written to the mem-table, where it is held temporarily in memory.
• SSTable − When the mem-table reaches a certain threshold, data is flushed to an SSTable disk file.
Cassandra Architecture
• Cassandra places replicas of data on different nodes based on two factors:
• Where to place the next replica is determined by the Replication Strategy.
• The total number of replicas placed on different nodes is determined by the Replication Factor.
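The interplay of the two factors can be sketched as SimpleStrategy-style placement: the first replica goes to the node owning the token, and the remaining (replication factor − 1) replicas go to the next nodes walking clockwise around the ring. Node names here are hypothetical.

```python
# Sketch of SimpleStrategy-style replica placement on a ring of nodes.

def place_replicas(ring, first_index, replication_factor):
    """Return the nodes holding the replicas, walking clockwise from first_index."""
    return [ring[(first_index + i) % len(ring)]
            for i in range(replication_factor)]

ring = ["node-a", "node-b", "node-c", "node-d"]
replicas = place_replicas(ring, first_index=2, replication_factor=3)
print(replicas)  # wraps around the end of the ring
```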
Cassandra Architecture
Keyspace
The basic attributes of a Keyspace in Cassandra
are −
• Replication factor − It is the number of
machines in the cluster that will receive copies of
the same data.
• Replica placement strategy − The strategy used to place replicas in the ring. The available strategies are SimpleStrategy (rack-unaware), OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (datacenter-aware).
Cassandra Architecture
Keyspace
The basic attributes of a Keyspace in Cassandra
are −
• Column families − Keyspace is a container for a
list of one or more column families. A column
family, in turn, is a container of a collection of
rows. Each row contains ordered columns.
Column families represent the structure of your
data. Each keyspace has at least one and often
many column families.
Cassandra Architecture
Keyspace
CREATE KEYSPACE keyspace_name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Cassandra Architecture
• Replication Strategy
– SimpleStrategy
– NetworkTopologyStrategy
Cassandra Architecture
• Write Operation
1. When a write request comes to the node, it is first logged in the commit log.
2. Then Cassandra writes the data to the mem-table. Data written to the mem-table on each write request is also written to the commit log separately. The mem-table holds data temporarily in memory, while the commit log records the transactions for backup purposes.
3. When the mem-table is full, data is flushed to the SSTable data file.
SSTable stands for Sorted Strings Table.
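The three steps above can be sketched as a toy write path: append to a commit log, write to a mem-table, and flush the mem-table to a sorted SSTable when it reaches a threshold. Everything here is in memory; real Cassandra persists the commit log and SSTables to disk, and the class name is illustrative.

```python
# Toy model of the log-structured write path: commit log -> mem-table -> SSTable.

class WritePath:
    def __init__(self, flush_threshold=3):
        self.commit_log = []       # sequential, append-only log (crash recovery)
        self.memtable = {}         # in-memory key -> value
        self.sstables = []         # flushed, sorted, immutable tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write
        self.memtable[key] = value            # 2. update the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush when full

    def flush(self):
        sstable = sorted(self.memtable.items())  # SSTable = Sorted Strings Table
        self.sstables.append(sstable)
        self.memtable = {}

wp = WritePath(flush_threshold=2)
wp.write("b", 1)
wp.write("a", 2)   # triggers a flush: keys come out sorted
print(wp.sstables, wp.memtable)
```

Turning random-access updates into sequential appends (commit log) plus batched sorted flushes (SSTables) is what makes this engine's writes fast.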
Cassandra Architecture
• Write Operation
Cassandra
Native Data Types
Cassandra
Native Data Types
BigTable- Case Study
Moving from Cassandra to Auto-Scaling Bigtable at
Spotify (Cloud Next '19)
https://www.youtube.com/watch?v=Hfd3VZOYXNU
Cassandra
Downloading Cassandra:
https://cassandra.apache.org/_/download.html
Installing Cassandra on Windows
https://phoenixnap.com/kb/install-cassandra-on-windows
https://www.javatpoint.com/cassandra-setup-and-installation
Installing Cassandra on Linux
https://www.hostinger.in/tutorials/set-up-and-install-cassandra-ubuntu/
BigTable
MongoDB
Installation:
• Community Server 8.0.3
https://www.mongodb.com/try/download/community-edition
Tools
• MongoDB Shell
• MongoDB Compass
• MongoDB Database Tools
MongoDB
Each Row (in Column Store)
MongoDB
Mind Mapping
MongoDB
Documents
MongoDB
Documents
• The document model is a superset of other data models, including
key-value pairs, relational, objects, graph, and geospatial.
• Key-value pairs can be modeled with fields and values in a
document. Any field in a document can be indexed, providing
developers with additional flexibility in how to query the data.
• Relational data can be modeled differently (and some would argue
more intuitively) by keeping related data together in a single
document using embedded documents and arrays. Related data
can also be stored in separate documents, and database
references can be used to connect the related data.
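The two modeling options above (embedding versus referencing) can be sketched with plain Python dicts standing in for MongoDB documents; the field names and values are illustrative only.

```python
# Option 1: embed related data in a single document.
order_embedded = {
    "_id": 1,
    "customer": "Ada",
    "items": [                       # embedded array of sub-documents
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Option 2: keep related data in separate documents linked by a reference.
customer_doc = {"_id": 7, "name": "Ada"}
order_referenced = {"_id": 1, "customer_id": 7}  # reference to customer_doc

print(len(order_embedded["items"]), order_referenced["customer_id"])
```

Embedding keeps data that is read together in one document; referencing avoids duplication when the related data is shared or updated independently.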
MongoDB

Documents
• Documents map to objects in most popular
programming languages.
• Graph nodes and/or edges can be modeled as
documents. Edges can also be modeled
through database references. Graph queries can be run
using operations like $graphLookup.
• Geospatial data can be modeled as arrays in documents.
ADO

MySQL Document Store (MySQL v8.0 onwards)

NoSQL + SQL = MySQL

You might also like