SQL Vs NoSQL - Full
SQL Databases - 2
• Transaction
• A transaction is a unit of work that you want to treat as "a whole." It has to
either happen in full or not at all.
• Example
• Transferring money from one bank account to another.
• To do that you have first to withdraw the amount from the source account,
• and then deposit it to the destination account.
• The operation has to succeed in full. If you stop halfway, the money will be lost, and that
is Very Bad
https://stackoverflow.com/questions/974596/what-is-a-database-transaction/974615
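The bank transfer above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the `accounts` table, the account names and the helper functions are assumptions made up for the example.

```python
import sqlite3

def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn

def transfer(conn, source, destination, amount):
    """Withdraw and deposit as one unit of work: both happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, source))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (source,)).fetchone()
            if balance < 0:
                # Stopping halfway: the withdrawal above is undone by the rollback.
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, destination))
        return True
    except ValueError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A failed transfer leaves both balances exactly as they were, which is the "in full or not at all" behaviour described above.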
SQL Databases - 3
• ACID Properties
• Atomicity
• All changes to data are performed as if they are a single operation. That is, all the
changes are performed, or none of them are.
• Example: in an application that transfers funds from one account to another, the
atomicity property ensures that, if a debit is made successfully from one account, the
corresponding credit is made to the other account.
• Consistency
• Data is in a consistent state when a transaction starts and when it ends.
• Example: in an application that transfers funds from one account to another, the
consistency property ensures that the total value of funds in both the accounts is the
same at the start and end of each transaction
https://www.ibm.com/docs/en/cics-ts/5.4?topic=processing-acid-properties-transactions
SQL Databases - 4
• ACID Properties
• Isolation
• If multiple transactions are running concurrently, they should not be affected by each
other; i.e., the result should be the same as the result obtained if the transactions were
running sequentially.
• Example: Let B_bal be 100 initially. If transaction T1 executes B_bal *= 1.2 and a context switch to transaction T2 occurs before T1 commits, the change should become visible to T2 only once T1 commits.
https://www.educative.io/edpresso/what-are-acid-properties-in-a-database
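The B_bal example can be tried out with two sqlite3 connections playing the roles of T1 and T2. This is only a sketch under assumptions: the `accounts` table and the temporary database file are invented for the demonstration, and other database systems expose isolation through different mechanisms and levels.

```python
import os
import sqlite3
import tempfile

def isolation_demo():
    """Return the balance T2 sees before and after T1 commits."""
    path = os.path.join(tempfile.mkdtemp(), "bank.db")
    t1 = sqlite3.connect(path)  # connection used by transaction T1
    t2 = sqlite3.connect(path)  # connection used by transaction T2
    t1.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    t1.execute("INSERT INTO accounts VALUES ('B', 100)")
    t1.commit()

    query = "SELECT balance FROM accounts WHERE name = 'B'"
    # T1 multiplies the balance by 1.2 but has not committed yet.
    t1.execute("UPDATE accounts SET balance = balance * 1.2 WHERE name = 'B'")
    seen_before_commit = t2.execute(query).fetchall()[0][0]
    t1.commit()
    seen_after_commit = t2.execute(query).fetchall()[0][0]
    t1.close()
    t2.close()
    return seen_before_commit, seen_after_commit
```

T2 keeps seeing the old balance until T1 commits, and sees the updated balance afterwards.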
SQL Databases - 5
• ACID Properties
• Durability
• After a transaction successfully completes, changes to data persist and are not undone,
even in the event of a system failure.
• Example: in an application that transfers funds from one account to another, the
durability property ensures that the changes made to each account will not be reversed.
https://www.ibm.com/docs/en/cics-ts/5.4?topic=processing-acid-properties-transactions
SQL Databases - 6
• Example Databases
• Oracle RDBMS
• Microsoft SQL Server
• MySQL
• …..
NoSQL Databases - 1
• “NoSQL” stands for “non SQL” or “not
only SQL.”
• History
• Emerged in the late 2000s as the cost of
storage dramatically decreased
• Cheap storage removed the need to create a complex, difficult-to-manage data model just to avoid data duplication
• As storage costs rapidly decreased, the
amount of data that applications
needed to store and query increased.
https://www.mongodb.com/nosql-explained
NoSQL Databases - 2
• Features of NoSQL databases
• Large Data Volumes
• Scalable Replication
• Distributed Databases
• Queries need to return answers quickly
• Mostly queries, few updates
• Schema-less
• Simpler and Faster
https://www.mongodb.com/nosql-explained
NoSQL Databases - 3
• BASE Properties
• Basically Available
• Rather than enforcing immediate consistency, BASE-modelled NoSQL databases will ensure availability of
data by spreading and replicating it across the nodes of the database cluster.
• Soft State
• Due to the lack of immediate consistency, data values may change over time
• Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the
time.
• Eventually Consistent
• The fact that BASE does not enforce immediate consistency does not mean that it never achieves it.
However, until it does, data reads are still possible (even though they might not reflect the reality).
• Stores exhibit consistency at some later point (e.g., lazily at read time).
https://phoenixnap.com/kb/acid-vs-base
https://neo4j.com/blog/acid-vs-base-consistency-models-explained/
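The BASE properties can be illustrated with a toy replicated store. This is a simplified sketch, not how any particular NoSQL product replicates data: the class names, the single write replica and the explicit `sync()` step are all assumptions for illustration.

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # Basically available: the write is accepted without waiting for all replicas.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # Soft state: a replica may answer with an older value (or None).
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Eventually consistent: pending writes are applied to every other replica.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()
```

Between `write()` and `sync()` a read from another replica still succeeds but may not reflect the latest write; after `sync()` all replicas agree.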
NoSQL Databases - 4
• Types of NoSQL databases
• Document databases: store data in documents similar to JSON (JavaScript Object Notation)
objects. Each document contains pairs of fields and values.
• Example: MongoDB
• Key-value databases: a simpler type of database where each item contains keys and values.
• Example: Redis
• Graph databases: store data in nodes and edges. Nodes typically store information about people,
places, and things, while edges store information about the relationships between the nodes.
• Example: Neo4j (Gremlin is the graph traversal language of Apache TinkerPop rather than a database itself)
https://www.mongodb.com/nosql-explained
NoSQL Databases - 5
• Data Modelling Difference between RDBMS and NoSQL databases
• Example: storing information about a user and their hobbies
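As a rough sketch of this modelling difference, the same user can be stored relationally (two tables joined on a key) or as a single document. The table names, column names and sample values below are hypothetical and chosen only for the example.

```python
import sqlite3

def build_relational():
    """Relational model: the user and the hobbies live in separate tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE hobbies (user_id INTEGER REFERENCES users(id), hobby TEXT);
        INSERT INTO users VALUES (1, 'Asha');
        INSERT INTO hobbies VALUES (1, 'hockey');
        INSERT INTO hobbies VALUES (1, 'chess');
    """)
    return conn

def user_from_relational(conn, user_id):
    """Reassemble the user with a join across both tables."""
    rows = conn.execute(
        "SELECT u.name, h.hobby FROM users u "
        "JOIN hobbies h ON h.user_id = u.id "
        "WHERE u.id = ? ORDER BY h.rowid", (user_id,)).fetchall()
    return {"id": user_id, "name": rows[0][0],
            "hobbies": [hobby for _, hobby in rows]}

# Document model: everything about the user is kept together in one document.
USER_DOCUMENT = {"id": 1, "name": "Asha", "hobbies": ["hockey", "chess"]}
```

The relational version needs a join to rebuild what the document version stores in one place; the price of the document version is that the hobby names may be duplicated across users.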
Partition Tolerance
https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
High Consistency
• All nodes see the same data at the same time
• Performing a read operation will return the value of the most recent write operation, causing all nodes to return the same data
https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
High Availability
• Achieving availability in a distributed system requires that the system remains operational 100% of the time.
• Every client gets a response, regardless of the state of any individual node in the system
https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
RDBMS
• Great at Consistency
• Okay at Availability
• Not so great at partitioning
Mathematical Model of RDBMS
• Based on “Relational Algebra” which is an extension of “set theory”
• But not every problem is naturally a set problem, for example:
• Shortest path from point a to point x
• Friend of a Friend (FOAF) problem
• Friend recommendation based on similar features
• I play hockey, he plays soccer, so we both have an interest in sports, so we can be suggested to each other as friends
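The first two problems above are graph traversals rather than set operations. A minimal sketch with breadth-first search over a made-up friendship graph follows; the graph contents and function names are assumptions for illustration.

```python
from collections import deque

# Hypothetical undirected friendship graph stored as adjacency sets.
FRIENDS = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "x"},
    "x": {"d"},
}

def friends_of_friends(graph, person):
    """People exactly two hops away: friends of my friends who are not my friends."""
    direct = graph[person]
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return candidates - direct - {person}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None
```

Expressing the same traversals in SQL typically needs repeated self-joins or recursive queries, which is one reason graph databases exist.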
SQL vs NoSQL: Architectural Difference
• Architecture of Storage Media
• Data Placement on Disc in Row and Column Oriented Databases
• Query Execution
Types of Storage Media
• Databases typically stored on magnetic disks
• Primary storage
• CPU main memory, cache memory
• Secondary storage
• Magnetic disks (HDDs), Solid-State Drives (SSDs)
• Tertiary storage
• Removable media
• Cache memory
• Static RAM
• DRAM
• Mass storage
• Magnetic disks
• CD-ROM, DVD, tape drives
Types of Storage with Capacity, Access Time, Max Bandwidth (Transfer Speed), and Commodity Cost
Fundamentals of Database Systems, Chapter 16, 7th Edition, Ramez Elmasri
Storage Organization of Databases
• Persistent data
• Most databases
• Transient data
• Exists only during program execution
• File organization
• how records are physically placed on the disk
• how records are accessed
• Disk packs with multiple surfaces are controlled by several read/write heads—one for
each surface
• All arms are connected to an actuator attached to another electrical motor, which moves
the read/write heads and positions them precisely over the cylinder of tracks specified in
a block address.
• Rotational delay (latency): once the read/write head is on the correct track, there is a further wait while the beginning of the desired block rotates into position under the read/write head. It depends on the rpm of the disk.
• For example, at 15,000 rpm, the time per rotation is 4 msec and the average rotational delay is the time per half revolution, or 2 msec.
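The arithmetic behind the 15,000 rpm example can be written out as a small helper. The function name is an assumption; the formula is simply one minute divided by the number of rotations per minute, halved for the average delay.

```python
def rotational_delay_ms(rpm):
    """Return (time per rotation, average rotational delay) in milliseconds."""
    ms_per_rotation = 60_000 / rpm          # 60,000 ms in one minute
    average_delay = ms_per_rotation / 2     # on average, half a revolution
    return ms_per_rotation, average_delay
```

For 15,000 rpm this gives 4 msec per rotation and an average rotational delay of 2 msec, matching the figures above.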
Average rotational delay: R = 1/2 revolution
• “Typical” Value: 0
Data Placement on Disc in Row and Column Oriented Databases
• Data Placement on Disc in Row and Column Oriented Databases
• Query Access
Explaining Row Stores and Column Stores-1
Row ID CNIC Name Gender Dept Salary
1001 10 A M CS 10500
1002 11 B M CS 20400
1003 12 C F CS 8000
1004 13 D F AI 15000
1005 14 E F AI 18000
1006 15 F M AI 10300
1007 16 G M AI 5980
1008 17 H M AI 4000
1009 18 I M CS 30900
1010 19 J F CS 5000
1011 20 K F CS 3000
1012 21 L F CS 2080
Row Oriented Data Placement on Disc
Block-1: 1001,10,A,M,CS,10500 ||| 1002,11,B,M,CS,20400 ||| 1003,12,C,F,CS,8000
Block-2: 1004,13,D,F,AI,15000 ||| 1005,14,E,F,AI,18000 ||| 1006,15,F,M,AI,10300
Block-3: 1007,16,G,M,AI,5980 ||| 1008,17,H,M,AI,4000 ||| 1009,18,I,M,CS,30900
Block-4: 1010,19,J,F,CS,5000 ||| 1011,20,K,F,CS,3000 ||| 1012,21,L,F,CS,2080
Row Oriented Storage - Facts
• Tables are stored row by row on disc
• A single block I/O on the table retrieves multiple rows with all their columns
• A table scan may need several I/Os to find a particular row, but once the row is found all of its columns are available
Row Oriented Storage – Behavior of Select Queries
Block-1: 1001,10,A,M,CS,10500 ||| 1002,11,B,M,CS,20400 ||| 1003,12,C,F,CS,8000
Block-2: 1004,13,D,F,AI,15000 ||| 1005,14,E,F,AI,18000 ||| 1006,15,F,M,AI,10300
Block-3: 1007,16,G,M,AI,5980 ||| 1008,17,H,M,AI,4000 ||| 1009,18,I,M,CS,30900
Block-4: 1010,19,J,F,CS,5000 ||| 1011,20,K,F,CS,3000 ||| 1012,21,L,F,CS,2080
Executing Queries 1/4
1. Select Name from table where CNIC = 20
Block I/O: 1
Status: Not Found
Block I/O: 2
Status: Not Found
Block I/O: 3
Status: Not Found
Block I/O: 4
Status: Found
The entire data of the first 3 blocks was read without finding a match
The required data was found in the 4th block
All the columns were read even though only ‘Name’ was required
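The block-by-block scan for this query can be simulated in a few lines. This is a sketch that assumes three rows per block, no index and a sequential scan, as in the slides; the function and variable names are made up for the example.

```python
# Rows of the example table: (RowID, CNIC, Name, Gender, Dept, Salary).
ROWS = [
    (1001, 10, "A", "M", "CS", 10500), (1002, 11, "B", "M", "CS", 20400),
    (1003, 12, "C", "F", "CS", 8000),  (1004, 13, "D", "F", "AI", 15000),
    (1005, 14, "E", "F", "AI", 18000), (1006, 15, "F", "M", "AI", 10300),
    (1007, 16, "G", "M", "AI", 5980),  (1008, 17, "H", "M", "AI", 4000),
    (1009, 18, "I", "M", "CS", 30900), (1010, 19, "J", "F", "CS", 5000),
    (1011, 20, "K", "F", "CS", 3000),  (1012, 21, "L", "F", "CS", 2080),
]
ROWS_PER_BLOCK = 3
BLOCKS = [ROWS[i:i + ROWS_PER_BLOCK] for i in range(0, len(ROWS), ROWS_PER_BLOCK)]

def select_name_by_cnic(cnic):
    """Scan the blocks in order; return (name, block I/Os, column values read)."""
    block_reads = 0
    values_read = 0
    for block in BLOCKS:
        block_reads += 1                                  # one I/O per block
        values_read += sum(len(row) for row in block)     # every column comes along
        for row in block:
            if row[1] == cnic:
                return row[2], block_reads, values_read
    return None, block_reads, values_read
```

For CNIC = 20 the scan performs 4 block I/Os and reads all 72 column values of the table in order to return a single name.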
Executing Queries 2/4
2. Select * from table where CNIC = 12
Executing Queries 3/4
3. Select sum(Salary) from table
1. We only need the Salary values, but we also read all the remaining unwanted values
2. There is also the overhead of going to different blocks, because the table is too large to fit in a single block
Executing Queries 4/4
4. Select Dept, sum (Salary) from table group by Dept
Holistic view…
Column Oriented Data Placement on Disc
Executing Queries 1/3
• Select Name from table where CNIC = 20
Executing Queries - Note
• The DBMS only keeps track of which row ID is in which block on disc, and which block belongs to which column.
• But it does not know the values.
• The values are mapped using indexes, but in this example we are not using indexes.
Executing Queries 2/3
• Select * from table where CNIC = 10
An analytical query such as Select sum(Salary) from table requires only 1 I/O because all the salary information is in 1 block.
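The same table can be laid out column by column to count I/Os in a column store. This sketch assumes that each column fits in exactly one block and that no index is used; the `ColumnStore` class and its methods are hypothetical names for illustration.

```python
COLUMNS = ("RowID", "CNIC", "Name", "Gender", "Dept", "Salary")
ROWS = [
    (1001, 10, "A", "M", "CS", 10500), (1002, 11, "B", "M", "CS", 20400),
    (1003, 12, "C", "F", "CS", 8000),  (1004, 13, "D", "F", "AI", 15000),
    (1005, 14, "E", "F", "AI", 18000), (1006, 15, "F", "M", "AI", 10300),
    (1007, 16, "G", "M", "AI", 5980),  (1008, 17, "H", "M", "AI", 4000),
    (1009, 18, "I", "M", "CS", 30900), (1010, 19, "J", "F", "CS", 5000),
    (1011, 20, "K", "F", "CS", 3000),  (1012, 21, "L", "F", "CS", 2080),
]
# Assumption: one disk block per column, values kept in row order.
COLUMN_BLOCKS = {name: [row[i] for row in ROWS] for i, name in enumerate(COLUMNS)}

class ColumnStore:
    def __init__(self, blocks):
        self.blocks = blocks
        self.block_reads = 0

    def read_column(self, name):
        self.block_reads += 1          # one I/O fetches the whole column block
        return self.blocks[name]

    def select_name_by_cnic(self, cnic):
        # Read the CNIC block to find the position, then the Name block.
        # list.index raises ValueError if the CNIC is absent.
        position = self.read_column("CNIC").index(cnic)
        return self.read_column("Name")[position]

    def sum_salary(self):
        return sum(self.read_column("Salary"))
```

Under these assumptions the lookup by CNIC costs 2 block reads and the salary sum costs 1, compared with 4 block reads for either query in the row-oriented layout.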
Pros and Cons/Features
Row Based
• Optimal for read/write
• OLTP
• Compression is not efficient
• Aggregation is not efficient
• Efficient queries when accessing multiple columns
• Horizontal Partitioning
Column Based
• Writes are Slower
• OLAP
• Compression is great
• Efficient Aggregation
• Inefficient queries when accessing multiple columns
• Vertical Partitioning
Column Oriented Storage
• Better suited to OLAP and analytical queries
Real World Application Scenarios for Analytical Queries
• Trend Analysis
• Dashboards
• Sales Forecasting
Basics: Behind the Dashboards and OLAP Queries
• Group operations in SQL
• Roll-Up, Cube
Roll Up (1/2)
• Definition
• ROLLUP enables a SELECT statement to calculate multiple levels of subtotals
across a specified group of dimensions.
• It also calculates a grand total.
• Syntax
• ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ... GROUP BY
ROLLUP (grouping_column_reference_list)
Roll Up (2/2)
Calculating Subtotals without ROLLUP
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY Time, Region, Department
UNION
SELECT Time, Region, '', SUM(Profit)
FROM Sales
GROUP BY Time, Region
UNION
SELECT Time, '', '', SUM(Profit)
FROM Sales
GROUP BY Time
Calculating Subtotals with ROLLUP
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY ROLLUP(Time, Region, Department)
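What ROLLUP computes can be emulated in plain Python to make the grouping levels explicit. This is a sketch, not how a DBMS implements ROLLUP: the `rollup` function, the use of None for a rolled-up column and the sample sales rows are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical sales rows.
SALES = [
    {"Time": 2023, "Region": "East", "Department": "Toys", "Profit": 10},
    {"Time": 2023, "Region": "East", "Department": "Books", "Profit": 5},
    {"Time": 2023, "Region": "West", "Department": "Toys", "Profit": 7},
    {"Time": 2024, "Region": "East", "Department": "Toys", "Profit": 3},
]

def rollup(rows, dimensions, measure):
    """Total the measure for each prefix of the dimension list, down to the grand total."""
    totals = defaultdict(int)
    for level in range(len(dimensions), -1, -1):
        kept = dimensions[:level]
        padding = (None,) * (len(dimensions) - level)   # None marks a rolled-up column
        for row in rows:
            key = tuple(row[d] for d in kept) + padding
            totals[key] += row[measure]
    return dict(totals)
```

With n dimensions there are n + 1 grouping levels: the full detail, one subtotal level per dropped trailing column, and the grand total under the key `(None, None, None)`.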
Another Roll-Up Example
Cube
• Definition
• CUBE enables a SELECT statement to calculate subtotals for all possible
combinations of a group of dimensions.
• It also calculates a grand total
• Syntax
SELECT ... GROUP BY
CUBE (grouping_column_reference_list)
• Calculating subtotals without CUBE
• multiple SELECT statements combined with UNION statements could provide
the same information gathered through CUBE
• for an n-dimensional cube, 2^n SELECT statements are needed.
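The 2^n count comes from choosing, for each of the n dimensions, whether to keep it or aggregate it away. A minimal Python sketch of that enumeration follows; the `cube` function, the None marker and the sample rows are assumptions for illustration, not a DBMS implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with three dimensions and one measure.
FACTS = [
    {"Country": "PK", "Month": "Jan", "Color": "red", "Units": 4},
    {"Country": "PK", "Month": "Feb", "Color": "blue", "Units": 6},
    {"Country": "UK", "Month": "Jan", "Color": "red", "Units": 1},
]

def cube(rows, dimensions, measure):
    """Return (number of groupings, totals) over every subset of the dimensions."""
    groupings = [set(subset)
                 for size in range(len(dimensions) + 1)
                 for subset in combinations(dimensions, size)]
    totals = defaultdict(int)
    for kept in groupings:
        for row in rows:
            # None marks a dimension that has been aggregated away.
            key = tuple(row[d] if d in kept else None for d in dimensions)
            totals[key] += row[measure]
    return len(groupings), dict(totals)
```

Each grouping corresponds to one of the SELECT ... GROUP BY statements that would otherwise have to be combined with UNION.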
Data Cube Lattice
[Figure: data cube lattice; recovered node labels: (Country, Month, Color) and Total]
Cube
Fact table view:
sale
prodId storeId amt
p1 c1 12
p2 c1 11
p1 c3 50
p2 c2 8
Multi-dimensional cube (Dim1 = store, Dim2 = product):
     c1   c2   c3
p1   12        50
p2   11   8
dimensions = 2
Cube
Fact table view:
sale
prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
Multi-dimensional cube:
day 2
     c1   c2   c3
p1   44   4
p2
day 1
     c1   c2   c3
p1   12        50
p2   11   8
dimensions = 3
Aggregates
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, prodId, sum(amt) FROM SALE
GROUP BY date, prodId
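The query can be run against the six-row fact table from the earlier slide using sqlite3. This is a small sketch; the helper name is an assumption and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# Fact table from the slides: (prodId, storeId, date, amt).
SALE_ROWS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def amounts_by_day_and_product():
    """Add up amounts by day and product with GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
    conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALE_ROWS)
    rows = conn.execute(
        "SELECT date, prodId, SUM(amt) FROM sale "
        "GROUP BY date, prodId ORDER BY date, prodId").fetchall()
    conn.close()
    return rows
```

The result has one row per (day, product) pair that occurs in the data: day 1 gives 62 for p1 and 19 for p2, and day 2 gives 48 for p1.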
Cube Aggregation
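Example: computing sums
day 2
     c1   c2   c3
p1   44   4
p2
day 1
     c1   c2   c3
p1   12        50
p2   11   8
Sum over days:
     c1   c2   c3
p1   56   4    50
p2   11   8
Sum over days and products:
      c1   c2   c3
sum   67   12   50
Sum over days and stores:
     sum
p1   110
p2   19
Sum over everything: 129
rollup: moving from the detailed tables toward the aggregated ones
drill-down: moving from the aggregated tables back toward the detailed ones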
Cube Operators
day 2
     c1   c2   c3
p1   44   4
p2
day 1
     c1   c2   c3
p1   12        50
p2   11   8
Sum over days:
     c1   c2   c3
p1   56   4    50
p2   11   8
Sum over days and products:
      c1   c2   c3
sum   67   12   50
Sum over days and stores:
     sum
p1   110
p2   19
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129
Cube
* (all days)
     c1   c2   c3   *
p1   56   4    50   110
p2   11   8         19
*    67   12   50   129
day 2
     c1   c2   c3   *
p1   44   4         48
p2
*    44   4         48
day 1
     c1   c2   c3   *
p1   12        50   62
p2   11   8         19
*    23   8    50   81
sale(*,p2,*) = 19
Aggregation Using Hierarchies
day 2
     c1   c2   c3
p1   44   4
p2
day 1
     c1   c2   c3
p1   12        50
p2   11   8
Hierarchy: store → region → country
Aggregated by region:
     region A   region B
p1   56         54
p2   11         8
(store c1 in Region A; store c2, c3 in Region B)
Data Selection Performance Analysis
• DBMS: MariaDB
• Row Oriented Storage: InnoDB
• Column Oriented Storage: ColumnStore
CREATE TABLE table_row (ID BIGINT, JobTitle VARCHAR(500), salary INT) engine = InnoDB;
Row Oriented Vs Column Oriented –
Row Oriented Storage – Behavior of Select Queries – Time Comparison - 2
• select salary from table_row where id = 99998;
Row Oriented Storage – Behavior of Select Queries – Time Comparison - 3
• select sum(salary) from table_col where id > 2000 and id<90000;
Row Oriented Storage – Behavior of Insert
Queries
New row is appended to the end of each block
Block-1 Block-2 Block-3 Block-4
- Block Creation
- Memory Reservation
Late Tuple Materialization
Data Compression
• Row oriented
• Compression is low because the data in each block is not homogeneous, i.e. different columns have different data types
• Column oriented
• Compression is higher because the data in each block is homogeneous: a block stores values of a single column, which all share one data type.
• If the cardinality of a column is low (e.g. gender has only 2 distinct values), even more compression is possible
Data Compression: Run length Encoding (RLE)
• Run length encoding is an algorithm for performing lossless data compression.
• Lossless data compression refers to compressing the data in such a way that the original form
of the data can then be derived from it.
• Runs of data (sequences in which the same data value occurs in many
consecutive data elements) are stored as a single data value and count, rather
than as the original run.
• When a character occurs a large number of times consecutively in a sequence,
then we can represent the same consecutive subsequence using only a single
occurrence of that character and its count.
• Using run length encoding, we can save memory space while transmitting data
and preserving its original form.
https://www.pythonpool.com/run-length-encoding-python/
Data Compression: Run length Encoding (RLE)
• Example:
• Consider a sequence: AACCCBBBBBAAAAFFFFFFFF
• RLE representation: A2C3B5A4F8
• A sequence of length 22 is compressed to a sequence of length 10.
• To avoid confusion, use a flag plus an appearance counter
• Example: ABCCCCCCCCDEFGGG
• Becomes: ABC!8DEFGGG (short runs such as GGG are left unencoded)
• ! is the flag
https://www.pythonpool.com/run-length-encoding-python/
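Both RLE variants from the examples can be sketched with itertools.groupby. The function names and the `min_run=4` threshold are assumptions: the threshold is chosen because a flagged run costs three characters, so only runs of four or more save space, which is consistent with GGG staying unencoded in the example.

```python
from itertools import groupby

def rle_simple(text):
    """Encode every run as <character><count>, e.g. AAB -> A2B1."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))

def rle_flagged(text, flag="!", min_run=4):
    """Encode only runs of at least min_run as <character><flag><count>."""
    parts = []
    for ch, run in groupby(text):
        length = len(list(run))
        if length >= min_run:
            parts.append(f"{ch}{flag}{length}")
        else:
            parts.append(ch * length)   # short runs are copied unchanged
    return "".join(parts)
```

The flagged variant assumes the flag character does not occur in the data itself; otherwise the encoded output would be ambiguous.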
Relevant Readings
C-Store
Mining MSN Messenger Data
Data Cube
• Must read this paper !!