SQL Vs NoSQL - Full

The document discusses SQL and NoSQL databases. SQL databases store data in tables with rows and columns and support ACID properties, while NoSQL databases are non-tabular, support BASE properties and include document, key-value, wide-column and graph databases. The document also covers database storage, CAP theorem and differences between SQL and NoSQL architectures.

Uploaded by

rsumaira80

Big Data Analytics

Topic 2: SQL vs NoSQL

“There is no alternative to patience and hard work”

“Effort does not betray you”


SQL Databases - 1
• Data stored as Rows and Columns in Tables
• Relationships among the tables
• In the form of data connectivity
• Data Manipulation Language (DML)
• Insert, Update, Delete, Select
• Data Definition Language (DDL)
• Create Table, Indexes, Views, …
• Alter, Drop
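The DDL and DML statements listed above can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and values are assumptions made for illustration; they are not from the slides.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure.
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)")
cur.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

# DML: change and read the data.
cur.execute("INSERT INTO employee (id, name, dept, salary) VALUES (1001, 'A', 'CS', 10500)")
cur.execute("INSERT INTO employee (id, name, dept, salary) VALUES (1004, 'D', 'AI', 15000)")
cur.execute("UPDATE employee SET salary = salary + 500 WHERE id = 1001")
cur.execute("DELETE FROM employee WHERE id = 1004")
rows = cur.execute("SELECT id, name, dept, salary FROM employee").fetchall()
print(rows)

# DDL again: alter and drop the structure.
cur.execute("ALTER TABLE employee ADD COLUMN gender TEXT")
cur.execute("DROP TABLE employee")
conn.close()
```

Insert, Update, Delete, and Select operate on the rows, while Create, Alter, and Drop operate on the table definition itself.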
SQL Databases - 2
• Transaction
• A transaction is a unit of work that you want to treat as "a whole." It has to
either happen in full or not at all.
• Example
• Transferring money from one bank account to another.
• To do that you have first to withdraw the amount from the source account,
• and then deposit it to the destination account.
• The operation has to succeed in full. If you stop halfway, the money will be lost, and that
is Very Bad

https://stackoverflow.com/questions/974596/what-is-a-database-transaction/974615
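The bank transfer above can be sketched with Python's built-in sqlite3 module. The account table, the transfer function, and the fail_midway switch are hypothetical names introduced only to simulate stopping halfway; this is an illustration, not a production pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('source', 100), ('destination', 0)")
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Withdraw and deposit as one unit of work: both happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src))
            if fail_midway:
                # Hypothetical failure between the withdraw and the deposit.
                raise RuntimeError("simulated crash halfway through the transfer")
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except RuntimeError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

failed = transfer(conn, "source", "destination", 30, fail_midway=True)
after_failure = balances(conn)   # the withdraw was rolled back, no money is lost
succeeded = transfer(conn, "source", "destination", 30)
after_success = balances(conn)
print(failed, after_failure, succeeded, after_success)
```

Without the transaction, the simulated failure would leave the source account debited and the destination account unchanged.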
SQL Databases - 3
• ACID Properties
• Atomicity
• All changes to data are performed as if they are a single operation. That is, all the
changes are performed, or none of them are.
• Example: in an application that transfers funds from one account to another, the
atomicity property ensures that, if a debit is made successfully from one account, the
corresponding credit is made to the other account.
• Consistency
• Data is in a consistent state when a transaction starts and when it ends.
• Example: in an application that transfers funds from one account to another, the
consistency property ensures that the total value of funds in both the accounts is the
same at the start and end of each transaction

https://www.ibm.com/docs/en/cics-ts/5.4?topic=processing-acid-properties-transactions
SQL Databases - 4
• ACID Properties
• Isolation
• If multiple transactions are running concurrently, they should not be affected by each
other; i.e., the result should be the same as the result obtained if the transactions were
running sequentially.
• Example: Let B_bal be 100 initially, and let transaction T1 execute B_bal *= 1.2. If a context switch to transaction T2 occurs right after that statement, the change should become visible to T2 only once T1 commits.

https://www.educative.io/edpresso/what-are-acid-properties-in-a-database
SQL Databases - 5
• ACID Properties
• Durability
• After a transaction successfully completes, changes to data persist and are not undone,
even in the event of a system failure.
• Example: in an application that transfers funds from one account to another, the
durability property ensures that the changes made to each account will not be reversed.

https://www.ibm.com/docs/en/cics-ts/5.4?topic=processing-acid-properties-transactions
SQL Databases - 6
• Example Databases
• Oracle RDBMS
• Microsoft SQL Server
• MySQL
• …..

7
NoSQL Databases - 1
• “NoSQL” stands for “non SQL” or “not only SQL.”
• History
• Emerged in the late 2000s as the cost of storage dramatically decreased
• There was no longer a need to create a complex, difficult-to-manage data model in order to avoid data duplication
• As storage costs rapidly decreased, the amount of data that applications needed to store and query increased.

https://www.mongodb.com/nosql-explained
NoSQL Databases - 2
• Features of NoSQL databases
• Large Data Volumes
• Scalable Replication
• Distributed Databases
• Queries need to return answers quickly
• Mostly queries, few updates
• Schema-less
• Simpler and Faster

https://www.mongodb.com/nosql-explained
NoSQL Databases - 3
• BASE Properties
• Basically Available
• Rather than enforcing immediate consistency, BASE-modelled NoSQL databases will ensure availability of data by spreading and replicating it across the nodes of the database cluster.

• Soft State
• Due to the lack of immediate consistency, data values may change over time
• Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.

• Eventually Consistent
• The fact that BASE does not enforce immediate consistency does not mean that it never achieves it. However, until it does, data reads are still possible (even though they might not reflect the reality).
• Stores exhibit consistency at some later point (e.g., lazily at read time).

https://phoenixnap.com/kb/acid-vs-base
https://neo4j.com/blog/acid-vs-base-consistency-models-explained/
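The three BASE properties can be illustrated with a toy model of replicas that converge lazily at read time. The Replica class, the version numbers, and the read_with_repair helper are assumptions made for this sketch; real NoSQL systems use more elaborate replication and conflict-resolution schemes.

```python
class Replica:
    """One node holding a (version, value) pair per key."""
    def __init__(self):
        self.data = {}

    def put(self, key, version, value):
        # Keep only the newest version of each key.
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key)

def write(replicas, key, version, value):
    # Basically available: the write is accepted by a single replica;
    # the other replicas stay stale for a while (soft state).
    replicas[0].put(key, version, value)

def read_with_repair(replicas, key):
    # Eventually consistent: take the newest version found on any replica
    # and lazily push it to the replicas that lag behind.
    found = [r.get(key) for r in replicas if r.get(key) is not None]
    if not found:
        return None
    newest = max(found, key=lambda pair: pair[0])
    for r in replicas:
        r.put(key, *newest)
    return newest[1]

replicas = [Replica(), Replica(), Replica()]
write(replicas, "x", 1, "old")
read_with_repair(replicas, "x")            # spreads version 1 to every replica
write(replicas, "x", 2, "new")
stale = replicas[2].get("x")               # still version 1: replicas disagree
latest = read_with_repair(replicas, "x")   # the read repairs the lagging replicas
converged = all(r.get("x") == (2, "new") for r in replicas)
print(stale, latest, converged)
```

Between the second write and the repairing read, a client that reads only the third replica would see the old value, which is the "might not reflect the reality" case described above.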
NoSQL Databases - 4
• Types of NoSQL databases
• Document databases: store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values.
• Example: MongoDB

• Key-value databases: a simpler type of database where each item contains keys and values.
• Example: Redis

• Wide-column stores: store data in tables, rows, and dynamic columns.
• Example: Cassandra, HBase, Bigtable

• Graph databases: store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes.
• Example: Neo4j; Gremlin is the graph traversal language of Apache TinkerPop, used to query graph databases

https://www.mongodb.com/nosql-explained
NoSQL Databases - 5
• Data Modelling Difference between RDBMS and NoSQL databases
• Example: storing information about a user and their hobbies

Data Storage in MongoDB


In a relational database, two tables are needed
https://www.mongodb.com/nosql-explained
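The modelling difference for a user and their hobbies can be sketched as follows. The field names, table names, and values are hypothetical; the document side is shown as a plain Python dictionary rather than through a MongoDB driver, and the relational side uses the built-in sqlite3 module.

```python
import json
import sqlite3

# Document model: the user and the hobbies live together in one nested record.
user_document = {"_id": 1, "first_name": "Alice", "hobbies": ["hockey", "reading"]}
print(json.dumps(user_document))

# Relational model: two tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute(
    "CREATE TABLE hobbies (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id), hobby TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (?, ?)",
    (user_document["_id"], user_document["first_name"]),
)
conn.executemany(
    "INSERT INTO hobbies (user_id, hobby) VALUES (?, ?)",
    [(user_document["_id"], hobby) for hobby in user_document["hobbies"]],
)

# Reading the user back with the hobbies requires a join across both tables.
joined = conn.execute(
    "SELECT u.first_name, h.hobby FROM users u JOIN hobbies h ON h.user_id = u.id ORDER BY h.id"
).fetchall()
print(joined)
```

The document keeps related data in one place at the cost of possible duplication, while the relational form avoids duplication at the cost of a join.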
Why we don’t use RDBMS for all types of
Data/Systems
• CAP Theorem
• A distributed system can offer three properties (consistency, availability, and partition tolerance), but it cannot guarantee all three at the same time.

13
Partition Tolerance

• System continues to run, despite the number of messages being delayed by the network between nodes

https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
High Consistency
• All nodes see the same data at the same time
• Performing a read operation will return the value of the most recent write operation, causing all nodes to return the same data

https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
High Availability
• Achieving availability in a distributed system requires that the system remains operational 100% of the time.
• Every client gets a response, regardless of the state of any individual node in the system

https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
RDBMS
• Great at Consistency
• Okay at Availability
• Not so great at partitioning

17
Mathematical Model of RDBMS
• Based on “Relational Algebra” which is an extension of “set theory”
• But not every problem is a set problem, for example:
• Shortest path from point a to point x
• Friend of a Friend (FOAF) problem
• Friend recommendation based on similar features
• I play hockey, he plays soccer, so we both have an interest in sports, so we can be suggested as friends

• Writing these in SQL is possible, but the query is not straightforward
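The shortest-path and friend-of-a-friend problems are naturally expressed as graph traversals. The sketch below uses a hypothetical friendship graph stored as an adjacency list; the names and the two helper functions are assumptions made for illustration, not part of any particular graph database.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
friends = {
    "ali": {"sara", "omar"},
    "sara": {"ali", "zain"},
    "omar": {"ali", "zain", "hina"},
    "zain": {"sara", "omar"},
    "hina": {"omar"},
}

def friends_of_friends(graph, person):
    """People exactly two hops away: friends of my friends who are not already my friends."""
    direct = graph[person]
    result = set()
    for friend in direct:
        result |= graph[friend]
    return result - direct - {person}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

foaf = friends_of_friends(friends, "ali")
path = shortest_path(friends, "ali", "hina")
print(foaf, path)
```

In SQL the same questions need self-joins on a friendship table (one join per hop) or a recursive query, which is why they are less direct to write.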
SQL vs NoSQL: Architectural Difference
• Architecture of Storage Media
• Data Placement on Disc in Row and Column Oriented Databases
• Query Execution

19
Types of Storage Media
• Databases typically stored on magnetic disks
• Primary storage
• CPU main memory, cache memory
• Secondary storage
• Magnetic disks (HDDs), Solid-State Drives (SSDs)
• Tertiary storage
• Removable media

• Cache memory
• Static RAM
• DRAM
• Mass storage
• Magnetic disks
• CD-ROM, DVD, tape drives

Fundamentals of Database Systems, Chapter 16, 7th Edition, Ramez Elmasri


Storage Types and Characteristics

Types of Storage with Capacity, Access Time, Max Bandwidth (Transfer Speed), and Commodity Cost
Storage Organization of Databases
• Persistent data
• Most databases

• Transient data
• Exists only during program execution

• File organization
• how records are physically placed on the disk
• how records are accessed



Single-Sided Disk and Disk Pack

(a) A single-sided disk with read/write hardware
(b) A disk pack with read/write hardware


Sectors on a Disk



Top View



Explaining Disc – Tracks
• A disk is single-sided if it stores information on one of its surfaces only and
double-sided if both surfaces are used.
• To increase storage capacity, disks are assembled into a disk pack

• Information is stored on a disk surface in concentric circles of small width, each having a distinct diameter.

• Each circle is called a track

• In disk packs, tracks with the same diameter on the various surfaces are called a cylinder because of the shape they would form if connected in space
• The concept of a cylinder is important because data stored on one cylinder can be retrieved much faster than if it were distributed among different cylinders.
Explaining Disc – Tracks and Sectors
• The number of tracks on a disk ranges from a few thousand to 152,000 on typical disk drives
• The capacity of each track typically ranges from tens of kilobytes to 150 Kbytes

• Because a track usually contains a large amount of information, it is divided into smaller sectors.
• The division of a track into sectors is hard-coded on the disk surface and cannot be changed.


Explaining Disc – Blocks
• The division of a track into equal-sized disk blocks (or pages) is set by the operating system during disk formatting

• Block size is fixed during initialization and cannot be changed dynamically
• Typical disk block sizes range from 512 to 8192 bytes.


Explaining Disc – Blocks
• A disk with hard-coded sectors often has the sectors subdivided or combined into blocks during initialization.

• Blocks are separated by fixed-size interblock gaps

• Gaps store control information written during disk initialization
• This information determines which block on the track follows each interblock gap


Explaining Disc – Storage Space
Disc Space = # of Surfaces * # of Tracks * # of Blocks * Size of each block

• # of Surfaces: 10
• # of Tracks: 152,000 (capacity of each track = 150 Kbytes)
• # of Blocks: ~18 per track
• Size of each block: 8192 Bytes

Disc Space = 10 * 152,000 * 18 * 8192 Bytes ≈ 224 GB
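The storage-space arithmetic above can be checked with a few lines of Python. The variable names are illustrative, and the sketch assumes 1 Kbyte = 1024 bytes and decimal gigabytes (10^9 bytes).

```python
# Figures taken from the slide; the variable names are illustrative.
surfaces = 10
tracks_per_surface = 152_000
track_capacity_bytes = 150 * 1024   # 150 Kbytes per track (assuming 1 Kbyte = 1024 bytes)
block_size_bytes = 8192

# Only whole blocks fit on a track: 153,600 / 8192 = 18.75, so 18 blocks.
blocks_per_track = track_capacity_bytes // block_size_bytes

disk_bytes = surfaces * tracks_per_surface * blocks_per_track * block_size_bytes
disk_gb = disk_bytes / 10**9        # decimal gigabytes

print(blocks_per_track, disk_bytes, round(disk_gb, 1))
```

This gives 18 blocks per track and 224,133,120,000 bytes, which matches the slide's figure of roughly 224 GB.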
Data Access
• A disk is a random access addressable device
• Transfer of data between main memory and disk takes place in units
of disk blocks
• The hardware address of a block is a combination of a
• cylinder number +
• track number (surface number within the cylinder on which the track is
located) +
• block number (within the track)



Data Access
• The actual hardware mechanism that reads or writes a block is the disk read/write head,
which is part of a system called a disk drive
• A disk or disk pack is mounted in the disk drive, which includes a motor that rotates the disks.

• A read/write head includes an electronic component attached to a mechanical arm

• Disk packs with multiple surfaces are controlled by several read/write heads—one for
each surface

• All arms are connected to an actuator attached to another electrical motor, which moves
the read/write heads and positions them precisely over the cylinder of tracks specified in
a block address.



Disc Access Time
Request: “I want block x”. Result: block x in memory. How long does this take?

Time = Seek Time + Rotational Delay + Transfer Time + Other
Seek Time
• To transfer a disk block, given its address, the disk controller must first mechanically position the read/write head on the correct track.

• The time required to position the read/write head on the correct track is called the seek time.

• Typical seek times are 5 to 10 msec on desktops and 3 to 8 msec on servers


Rotational Delay or Latency

• The rotational delay (latency) is the time spent waiting while the beginning of the desired block rotates into position under the read/write head. It depends on the rpm of the disk.
• For example, at 15,000 rpm, the time per rotation is 4 msec and the average rotational delay is the time per half revolution, or 2 msec.

R = 1/2 revolution

“typical” R = 8.33 ms (3600 RPM)

Figure labels: “Head Here” (current head position), “Block I Want” (target block)
Transfer Time
• Some additional time is needed to transfer the data
• The seek time and rotational delay are usually much larger than the block transfer time.

“typical” Transfer Rate ‘t’: 1 to 3 MB/second

transfer time = block size / t


Other Delays
• CPU time to issue I/O
• Contention for controller
• Contention for bus, memory

• “Typical” Value: 0
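The access-time formula (seek time + rotational delay + transfer time + other) can be put together in a short sketch. The function names and the sample inputs (5 msec seek, 15,000 rpm, an 8192-byte block, 3 MB/second with 1 MB taken as 10^6 bytes) are assumptions chosen from the ranges quoted on the slides.

```python
def rotational_delay_ms(rpm):
    """Average rotational delay: the time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

def transfer_time_ms(block_size_bytes, transfer_rate_bytes_per_s):
    """Transfer time = block size / transfer rate, in milliseconds."""
    return block_size_bytes / transfer_rate_bytes_per_s * 1000

def access_time_ms(seek_ms, rpm, block_size_bytes, transfer_rate_bytes_per_s, other_ms=0.0):
    """Time = seek time + rotational delay + transfer time + other."""
    return (
        seek_ms
        + rotational_delay_ms(rpm)
        + transfer_time_ms(block_size_bytes, transfer_rate_bytes_per_s)
        + other_ms
    )

delay_fast = rotational_delay_ms(15_000)   # 4 msec per rotation, so 2 msec on average
delay_slow = rotational_delay_ms(3_600)    # about 8.33 msec
total = access_time_ms(seek_ms=5, rpm=15_000, block_size_bytes=8192,
                       transfer_rate_bytes_per_s=3_000_000)
print(delay_fast, round(delay_slow, 2), round(total, 2))
```

With these inputs the two mechanical terms (seek and rotation) together account for about 7 of the roughly 9.7 msec total.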

37
Data Placement on Disc in Row and Column
Oriented Databases
• Data Placement on Disc in Row and Column Oriented Databases
• Query Access

38
Explaining Row Stores and Column Stores-1
Row ID CNIC Name Gender Dept Salary
1001 10 A M CS 10500
1002 11 B M CS 20400
1003 12 C F CS 8000
1004 13 D F AI 15000
1005 14 E F AI 18000
1006 15 F M AI 10300
1007 16 G M AI 5980
1008 17 H M AI 4000
1009 18 I M CS 30900
1010 19 J F CS 5000
1011 20 K F CS 3000
1012 21 L F CS 2080
39
Row Oriented Data Placement on Disc
Block-1: 1001,10,A,M,CS,10500 ||| 1002,11,B,M,CS,20400 ||| 1003,12,C,F,CS,8000

Block-2: 1004,13,D,F,AI,15000 ||| 1005,14,E,F,AI,18000 ||| 1006,15,F,M,AI,10300

Block-3: 1007,16,G,M,AI,5980 ||| 1008,17,H,M,AI,4000 ||| 1009,18,I,M,CS,30900

Block-4: 1010,19,J,F,CS,5000 ||| 1011,20,K,F,CS,3000 ||| 1012,21,L,F,CS,2080
Row Oriented Storage - Facts
• Tables are stored as rows on the disc

• A single block I/O to the table retrieves multiple rows with all their columns

• More I/Os are needed to find a particular row in a table scan, but each I/O provides all the columns for that row

• Suitable for OLTP applications

• Aggregation is a bottleneck: aggregating one column still reads every column of every row
Query Processing on Row Oriented Storage
• DB Setup
• We are not creating indexes on the columns
• Queries
1. Select Name from table where CNIC = 20
2. Select * from table where CNIC = 10
3. Select sum(Salary) from table
4. Select Dept, sum (Salary) from table group by Dept

42
Row Oriented Storage – Behavior of Select Queries

Block-1: 1001,10,A,M,CS,10500 ||| 1002,11,B,M,CS,20400 ||| 1003,12,C,F,CS,8000

Block-2: 1004,13,D,F,AI,15000 ||| 1005,14,E,F,AI,18000 ||| 1006,15,F,M,AI,10300

Block-3: 1007,16,G,M,AI,5980 ||| 1008,17,H,M,AI,4000 ||| 1009,18,I,M,CS,30900

Block-4: 1010,19,J,F,CS,5000 ||| 1011,20,K,F,CS,3000 ||| 1012,21,L,F,CS,2080
Executing Queries 1/4
1. Select Name from table where CNIC = 20

Block I/O: 1, Status: Not Found
Block I/O: 2, Status: Not Found
Block I/O: 3, Status: Not Found
Block I/O: 4, Status: Found

Entire data of the first 3 blocks was read, but no luck
Required data was found in the 4th block
All the columns were read but only ‘Name’ was required
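The block-by-block scan can be simulated with the example table from the slides, three rows per block and no index. The function names and the I/O counter are assumptions made for this sketch; a real storage engine is considerably more involved.

```python
# Rows from the example table: (row_id, cnic, name, gender, dept, salary)
ROWS = [
    (1001, 10, "A", "M", "CS", 10500), (1002, 11, "B", "M", "CS", 20400),
    (1003, 12, "C", "F", "CS", 8000),  (1004, 13, "D", "F", "AI", 15000),
    (1005, 14, "E", "F", "AI", 18000), (1006, 15, "F", "M", "AI", 10300),
    (1007, 16, "G", "M", "AI", 5980),  (1008, 17, "H", "M", "AI", 4000),
    (1009, 18, "I", "M", "CS", 30900), (1010, 19, "J", "F", "CS", 5000),
    (1011, 20, "K", "F", "CS", 3000),  (1012, 21, "L", "F", "CS", 2080),
]
ROWS_PER_BLOCK = 3
blocks = [ROWS[i:i + ROWS_PER_BLOCK] for i in range(0, len(ROWS), ROWS_PER_BLOCK)]

def select_name_by_cnic(blocks, cnic):
    """Full scan without an index: read blocks one by one until the CNIC is found."""
    io_count = 0
    for block in blocks:
        io_count += 1  # one block I/O brings in every column of every row in the block
        for row_id, row_cnic, name, gender, dept, salary in block:
            if row_cnic == cnic:
                return name, io_count
    return None, io_count

def sum_salary(blocks):
    """Aggregation must read every block even though only one column is needed."""
    total, io_count = 0, 0
    for block in blocks:
        io_count += 1
        total += sum(row[5] for row in block)
    return total, io_count

q1 = select_name_by_cnic(blocks, 20)
q2 = select_name_by_cnic(blocks, 10)
q3 = sum_salary(blocks)
print(q1, q2, q3)
```

The lookup for CNIC 20 needs all four block reads, the lookup for CNIC 10 needs one, and the salary sum always needs four.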
Executing Queries 2/4
2. Select * from table where CNIC = 10

By chance, the required data is in 1st block

46
Executing Queries 3/4
3. Select sum(Salary) from table

1. We only need the salary values, but we also read all the remaining unwanted values
2. There is also the overhead of going to different blocks, because the entire table is too large to fit in a single block
47
Executing Queries 4/4
4. Select Dept, sum (Salary) from table group by Dept

The same behavior as that of query # 3


48
Column Oriented Storage
• Tables are stored as columns on the disc
• A single block I/O to the table retrieves the values of one column for many rows
• Fewer I/Os are required to retrieve a column
• However, the cost grows with the number of columns a query touches
• Better suited to OLAP and analytical queries
• Ad hoc queries
Holistic view…
Column Oriented Data Placement on Disc

CNIC: 10:1001, 11:1002, 12:1003, 13:1004, 14:1005, 15:1006, 16:1007, 17:1008, 18:1009, 19:1010, 20:1011, 21:1012

Name: A:1001, B:1002, C:1003, D:1004, E:1005, F:1006

Name: G:1007, H:1008, I:1009, J:1010, K:1011, L:1012

Gender: M:1001, M:1002, F:1003, F:1004, F:1005, M:1006, M:1007, M:1008, M:1009, F:1010, F:1011, F:1012

Dept: CS:1001, CS:1002, CS:1003, AI:1004, AI:1005, AI:1006

Dept: AI:1007, AI:1008, CS:1009, CS:1010, CS:1011, CS:1012

Salary: 10500:1001, 20400:1002, 8000:1003, 15000:1004, 18000:1005, 10300:1006, 5980:1007, 4000:1008, 30900:1009, 5000:1010, 3000:1011, 2080:1012
Query Processing on Column Oriented
Storage – Behavior of Select Queries
• DB Setup
• We are not creating indexes on the columns
• Queries
1. Select Name from table where CNIC = 20
2. Select * from table where CNIC = 10
3. Select sum(Salary) from table

53
Executing Queries 1/3
• Select Name from table where CNIC = 20

DBMS knows which block is for which column, so to find CNIC, it directly goes to the block containing CNIC.

The required CNIC is found in the first block

Then, using its rowId, which is 1011, it directly goes to the 3rd block of the disc, which contains the Name associated with rowId 1011.

The DBMS maintains the information about which rowId is in which block on disc
Executing Queries - Note
• DBMS only maintains the information about which rowId is in which block on disc and which block belongs to which column.
• But it does not know the values.
• The values are mapped using indexes, but in this example we are not using indexes.
Executing Queries 2/3
• Select * from table where CNIC = 10

Using the information from the DBMS, it goes to the first block, which contains CNIC.

It finds CNIC 10 and its associated rowId, which is 1001

Using 1001, it directly goes to all the other blocks to fetch * columns

But this costs 7 I/Os, and 8 I/Os in total.

That means that a * query does not suit column stores.
Executing Queries 3/3
• Select sum(Salary) from table

This analytical query requires only 1 I/O because all the salary
information is in 1 block.
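The three queries can be simulated against a toy column layout. This sketch makes a simplifying assumption that differs from the slide's figure: every column fits in exactly one block, so the I/O counts below (2, 5, and 1) illustrate the trend rather than reproduce the slide's numbers. All function names are hypothetical.

```python
# Rows from the example table: (row_id, cnic, name, gender, dept, salary)
ROWS = [
    (1001, 10, "A", "M", "CS", 10500), (1002, 11, "B", "M", "CS", 20400),
    (1003, 12, "C", "F", "CS", 8000),  (1004, 13, "D", "F", "AI", 15000),
    (1005, 14, "E", "F", "AI", 18000), (1006, 15, "F", "M", "AI", 10300),
    (1007, 16, "G", "M", "AI", 5980),  (1008, 17, "H", "M", "AI", 4000),
    (1009, 18, "I", "M", "CS", 30900), (1010, 19, "J", "F", "CS", 5000),
    (1011, 20, "K", "F", "CS", 3000),  (1012, 21, "L", "F", "CS", 2080),
]
COLUMNS = ["cnic", "name", "gender", "dept", "salary"]

# Assumption: one block per column, stored as (value, row_id) pairs.
column_blocks = {
    col: [(row[i + 1], row[0]) for row in ROWS]
    for i, col in enumerate(COLUMNS)
}

def read_block(col, io_log):
    io_log.append(col)  # one block I/O per column touched
    return column_blocks[col]

def select_name_by_cnic(cnic):
    io_log = []
    row_id = next(rid for value, rid in read_block("cnic", io_log) if value == cnic)
    name = next(value for value, rid in read_block("name", io_log) if rid == row_id)
    return name, len(io_log)

def select_star_by_cnic(cnic):
    io_log = []
    row_id = next(rid for value, rid in read_block("cnic", io_log) if value == cnic)
    row = {"cnic": cnic}
    for col in COLUMNS[1:]:
        row[col] = next(value for value, rid in read_block(col, io_log) if rid == row_id)
    return row, len(io_log)

def sum_salary():
    io_log = []
    total = sum(value for value, _rid in read_block("salary", io_log))
    return total, len(io_log)

q1 = select_name_by_cnic(20)
q2 = select_star_by_cnic(10)
q3 = sum_salary()
print(q1, q2, q3)
```

The aggregate touches a single column, while the * query has to visit every column, which is the opposite of the row-oriented behaviour.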

57
Pros and Cons/Features
Row Based                                      Column Based
• Optimal for read/write                       • Writes are slower
• OLTP                                         • OLAP
• Compression is not efficient                 • Compression is great
• Aggregation is not efficient                 • Efficient aggregation
• Efficient queries when accessing             • Inefficient queries when accessing
  multiple columns                               multiple columns
• Horizontal partitioning                      • Vertical partitioning
Column Oriented Storage
• Better suited to OLAP and analytical queries

59
Real World Application Scenarios for
Analytical Queries
• Trend Analysis
• Dashboards
• Sales Forecasting

Basics: Behind the Dashboards and OLAP
Queries
• Group operations in SQL
• Roll-Up, Cube

62
Roll Up (1/2)
• Definition
• ROLLUP enables a SELECT statement to calculate multiple levels of subtotals
across a specified group of dimensions.
• It also calculates a grand total.
• Syntax
• ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:

SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
Roll Up (2/2)
Calculating Subtotals without ROLLUP

SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY Time, Region, Department
UNION
SELECT Time, Region, '', SUM(Profit)
FROM Sales
GROUP BY Time, Region
UNION
SELECT Time, '', '', SUM(Profit)
FROM Sales
GROUP BY Time
UNION
SELECT '', '', '', SUM(Profit)
FROM Sales

Calculating Subtotals with ROLLUP

SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY ROLLUP(Time, Region, Department)
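What ROLLUP computes can be emulated in plain Python: one subtotal per prefix of the grouping columns, down to the empty prefix, which is the grand total. The rollup function and the sample sales rows are hypothetical; None marks a column that has been rolled up.

```python
from collections import defaultdict

def rollup(rows, dims, measure):
    """Emulate GROUP BY ROLLUP(dims): subtotals for every prefix of dims,
    from the full list down to the empty prefix (the grand total)."""
    totals = defaultdict(int)
    for row in rows:
        for level in range(len(dims), -1, -1):
            key = tuple(row[d] for d in dims[:level]) + (None,) * (len(dims) - level)
            totals[key] += row[measure]
    return dict(totals)

# Hypothetical sales rows, for illustration only.
sales = [
    {"time": 2023, "region": "East", "department": "Toys", "profit": 10},
    {"time": 2023, "region": "East", "department": "Books", "profit": 5},
    {"time": 2023, "region": "West", "department": "Toys", "profit": 7},
    {"time": 2024, "region": "East", "department": "Toys", "profit": 3},
]

result = rollup(sales, ["time", "region", "department"], "profit")
for key in sorted(result, key=str):
    print(key, result[key])
```

For three grouping columns there are four levels of aggregation (n + 1), which is why the UNION version needs four SELECT statements.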
Another Roll-Up Example

65
Cube
• Definition
• CUBE enables a SELECT statement to calculate subtotals for all possible
combinations of a group of dimensions.
• It also calculates a grand total
• Syntax
SELECT ... GROUP BY
CUBE (grouping_column_reference_list)
• Calculating subtotals without CUBE
• multiple SELECT statements combined with UNION statements could provide the same information gathered through CUBE
• for an n-dimensional cube, 2^n SELECT statements are needed.
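The 2^n groupings behind CUBE can be enumerated directly. This sketch uses the small sale fact table that appears later in these slides; the cube function and the use of "*" for an aggregated dimension are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Emulate GROUP BY CUBE(dims): one grouping per subset of dims (2^n groupings)."""
    groupings = [set(c) for k in range(len(dims) + 1) for c in combinations(dims, k)]
    totals = defaultdict(int)
    for row in rows:
        for kept in groupings:
            # "*" marks a dimension that has been aggregated away.
            key = tuple(row[d] if d in kept else "*" for d in dims)
            totals[key] += row[measure]
    return groupings, dict(totals)

# The sale fact table from the slides: prodId, storeId, date, amt.
sale = [
    {"prodId": "p1", "storeId": "c1", "date": 1, "amt": 12},
    {"prodId": "p2", "storeId": "c1", "date": 1, "amt": 11},
    {"prodId": "p1", "storeId": "c3", "date": 1, "amt": 50},
    {"prodId": "p2", "storeId": "c2", "date": 1, "amt": 8},
    {"prodId": "p1", "storeId": "c1", "date": 2, "amt": 44},
    {"prodId": "p1", "storeId": "c2", "date": 2, "amt": 4},
]

groupings, totals = cube(sale, ["storeId", "prodId", "date"], "amt")
print(len(groupings))
print(totals[("c1", "*", "*")], totals[("*", "p2", "*")], totals[("*", "*", "*")])
```

With three dimensions there are 2^3 = 8 groupings, compared with the 4 produced by ROLLUP over the same columns.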
Data Cube Lattice
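Figure: data cube lattice, from the most detailed level to the most aggregated:
• Country, Month, Color
• Country, Month | Country, Color | Country, Color
• State | Month | Color
• Total
Drill Down moves toward the more detailed levels; Roll Up moves toward Total.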
Cube
Fact table view:

sale  prodId  storeId  amt
      p1      c1       12
      p2      c1       11
      p1      c3       50
      p2      c2       8

Multi-dimensional cube (dimensions = 2; Dim1 = store, Dim2 = product):

      c1   c2   c3
p1    12        50
p2    11   8
Cube
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube (dimensions = 3):

day 2   c1   c2   c3
p1      44   4
p2

day 1   c1   c2   c3
p1      12        50
p2      11   8

Q: How would you represent a four-dimensional cube?
Aggregates

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Result: 81
Aggregates

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   date  sum
      1     81
      2     48
Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

sale  prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

Cube operations: rollup, drill-down
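The three aggregate queries can be run against the sale table with Python's built-in sqlite3 module. ORDER BY clauses are added here only to make the output order deterministic; they are not part of the slide queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total, by_day, by_day_product)
```

Going from the third query to the second drops the product dimension (rollup); going the other way adds it back (drill-down).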
Cube Aggregation
Example: computing sums

day 2   c1   c2   c3
p1      44   4
p2

day 1   c1   c2   c3
p1      12        50
p2      11   8

Month (day 1 and day 2 …) by products by store:

        c1   c2   c3
p1      56   4    50
p2      11   8

Sums by store:

        c1   c2   c3
sum     67   12   50

Sums by product:

        sum
p1      110
p2      19

Overall sum: 129

rollup moves toward the sums; drill-down moves back toward the detailed cells
Cube Operators
day 2   c1   c2   c3
p1      44   4
p2

day 1   c1   c2   c3
p1      12        50
p2      11   8

Summed over days:

        c1   c2   c3
p1      56   4    50
p2      11   8
sum     67   12   50

        sum
p1      110
p2      19

Overall sum: 129

• sale(c1,*,*) = 67
• sale(c2,p2,*) = 8
• sale(*,*,*) = 129
Cube

Cube with aggregates (* = sum over that dimension):

all days (*)   c1   c2   c3   *
p1             56   4    50   110
p2             11   8         19
*              67   12   50   129

day 2          c1   c2   c3   *
p1             44   4         48
p2
*              44   4         48

day 1          c1   c2   c3   *
p1             12        50   62
p2             11   8         19
*              23   8    50   81

sale(*,p2,*) = 19
Aggregation Using Hierarchies

Hierarchy: store → region → country

day 2   c1   c2   c3
p1      44   4
p2

day 1   c1   c2   c3
p1      12        50
p2      11   8

Aggregated by region (store c1 in Region A; store c2, c3 in Region B):

        region A   region B
p1      56         54
p2      11         8
Data Selection Performance Analysis
• DBMS: MariaDB
• Row Oriented Storage: InnoDB
• Column Oriented Storage: ColumnStore

• Synthetically Generated Dataset
• 100,000 rows
• 3 Columns
• No Index on any column
• Dataset shared on GCR
• https://www.onlinedatagenerator.com/
• System Specs
• MacBook Pro 2.3 GHz, Intel Core i5
• 8 GB RAM

CREATE TABLE table_row (ID BIGINT, JobTitle VARCHAR(500), salary INT) engine = InnoDB;

CREATE TABLE table_col (ID BIGINT, JobTitle VARCHAR(500), salary INT) engine = ColumnStore;
Row Oriented Vs Column Oriented – Behavior of Select Queries – Time Comparison - 1

• select sum(salary) from table_row;

1 row in set (0.035 sec)

• select sum(salary) from table_col;

1 row in set (0.018 sec)

79
Row Oriented Vs Column Oriented – Behavior of Select Queries – Time Comparison - 2
• select salary from table_row where id = 99998;

1 row in set (0.033 sec)

• select salary from table_col where id = 99998;

1 row in set (0.034 sec)

80
Row Oriented Vs Column Oriented – Behavior of Select Queries – Time Comparison - 3
• select sum(salary) from table_col where id > 2000 and id<90000;

1 row in set (0.035 sec)

• select sum(salary) from table_row where id > 2000 and id<90000;

1 row in set (0.044 sec)

81
Row Oriented Storage – Behavior of Insert
Queries
New row is appended to the end of each block

Steps shown in the figure: block creation, memory reservation, data writing

Block-1: 1001,10,A,M,CS,10500 ||| 1002,11,B,M,CS,20400 ||| 1003,12,C,F,CS,8000
Block-2: 1004,13,D,F,AI,15000 ||| 1005,14,E,F,AI,18000 ||| 1006,15,F,M,AI,10300
Block-3: 1007,16,G,M,AI,5980 ||| 1008,17,H,M,AI,4000 ||| 1009,18,I,M,CS,30900
Block-4: (no rows yet)
Column Oriented Storage – Behavior of Insert
Queries
New row must be added to each file

10:1001 A:1001 M:1001 CS:1001 10500:1001
11:1002 B:1002 M:1002 CS:1002 20400:1002
12:1003 C:1003 F:1003 CS:1003 8000:1003
13:1004 D:1004 F:1004 AI:1004 15000:1004
14:1005 E:1005 F:1005 AI:1005 18000:1005
15:1006 F:1006 M:1006 AI:1006 10300:1006
16:1007 G:1007 M:1007 AI:1007 5980:1007
17:1008 H:1008 M:1008 AI:1008 4000:1008
18:1009 I:1009 M:1009 CS:1009 30900:1009
19:1010 J:1010 F:1010 CS:1010 5000:1010
20:1011 K:1011 F:1011 CS:1011 3000:1011
21:1012 L:1012 F:1012 CS:1012 2080:1012
Data Insertion Performance Analysis
• DBMS: MariaDB
• Row Oriented Storage: InnoDB
• Column Oriented Storage: ColumnStore
• Synthetically Generated Dataset
• 100,000 rows
• 3 Columns
• No Index on any column
• Dataset shared on GCR
• System Specs
• MacBook Pro 2.3 GHz, Intel Core i5
• 8 GB RAM

CREATE TABLE table_row (ID BIGINT, JobTitle VARCHAR(500), salary INT) engine = InnoDB;

CREATE TABLE table_col (ID BIGINT, JobTitle VARCHAR(500), salary INT) engine = ColumnStore;

Insertion time:
• InnoDB = Less than 15 minutes
• ColumnStore = Around 3 hr and 15 mins
Behavior of Delete Query
Row oriented: a row is deleted from 1 block after it is located
Column oriented: each column value must be deleted from its respective file, hence slower

10:1001 A:1001 M:1001 CS:1001 10500:1001
11:1002 B:1002 M:1002 CS:1002 20400:1002
12:1003 C:1003 F:1003 CS:1003 8000:1003
13:1004 D:1004 F:1004 AI:1004 15000:1004
14:1005 E:1005 F:1005 AI:1005 18000:1005
15:1006 F:1006 M:1006 AI:1006 10300:1006
16:1007 G:1007 M:1007 AI:1007 5980:1007
17:1008 H:1008 M:1008 AI:1008 4000:1008
18:1009 I:1009 M:1009 CS:1009 30900:1009
19:1010 J:1010 F:1010 CS:1010 5000:1010
20:1011 K:1011 F:1011 CS:1011 3000:1011
Behavior of Update Query
Row oriented: a value is updated in 1 block after it is located
Column oriented: a value is replaced in 1 block after it is located

10:1001 A:1001 M:1001 CS:1001 10500:1001
11:1002 B:1002 M:1002 CS:1002 20400:1002
12:1003 C:1003 F:1003 CS:1003 8000:1003
13:1004 D:1004 F:1004 AI:1004 15000:1004
14:1005 E:1005 F:1005 AI:1005 18000:1005
15:1006 F:1006 M:1006 AI:1006 10300:1006
16:1007 G:1007 M:1007 AI:1007 5980:1007
17:1008 H:1008 M:1008 AI:1008 4000:1008
18:1009 I:1009 M:1009 CS:1009 30900:1009
19:1010 J:1010 F:1010 CS:1010 5000:1010
20:1011 K:1011 F:1011 CS:1011 3000:1011
Column-Store Optimizations
• Compression (10X Improvement)
• Late tuple materialization (3X improvement)
• Operate on columns as long as possible
• Merge columns into complete tuples as late as possible

Early Materialization: create rows at beginning of query plan
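The difference between the two strategies can be sketched on a toy column store, using the first six rows of the example table. The query (sum of salary for one department) and both function names are assumptions made for illustration; real engines apply the same idea to compressed blocks and position lists.

```python
# Toy column store: one Python list per column, aligned by position.
columns = {
    "dept":   ["CS", "CS", "CS", "AI", "AI", "AI"],
    "gender": ["M", "M", "F", "F", "F", "M"],
    "salary": [10500, 20400, 8000, 15000, 18000, 10300],
}

def early_materialization(columns, dept):
    # Build complete rows at the start of the plan, then filter and aggregate rows.
    rows = list(zip(columns["dept"], columns["gender"], columns["salary"]))
    return sum(salary for d, _gender, salary in rows if d == dept)

def late_materialization(columns, dept):
    # Operate on single columns as long as possible: find the matching positions
    # in the dept column, then fetch only those positions from the salary column.
    positions = [i for i, d in enumerate(columns["dept"]) if d == dept]
    return sum(columns["salary"][i] for i in positions)

early = early_materialization(columns, "AI")
late = late_materialization(columns, "AI")
print(early, late)
```

Both give the same answer, but the late version never touches the gender column and never builds tuples for rows that are filtered out.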

87
Late Tuple Materialization

Early materialization Late materialization

88
Data Compression
• Row oriented
• Compression is low because we do not have homogeneous data in each block, i.e. different columns have different data types
• Column oriented
• Compression is higher because we have homogeneous data in each block: each block stores values of a single column, which all share one data type.
• If the cardinality of a column is low (e.g. gender has only 2 distinct values), even more compression is possible

89
Data Compression: Run length Encoding (RLE)
• Run length encoding is an algorithm for performing lossless data compression.
• Lossless data compression refers to compressing the data in such a way that the original form
of the data can then be derived from it.
• Runs of data (sequences in which the same data value occurs in many
consecutive data elements) are stored as a single data value and count, rather
than as the original run.
• When a character occurs a large number of times consecutively in a sequence,
then we can represent the same consecutive subsequence using only a single
occurrence of that character and its count.
• Using run length encoding, we can save memory space while transmitting data
and preserving its original form.

https://www.pythonpool.com/run-length-encoding-python/
Data Compression: Run length Encoding (RLE)

• Example:
• Consider a sequence: AACCCBBBBBAAAAFFFFFFFF
• RLE representation: A2C3B5A4F8
• 22 length sequence compressed to a 10 length sequence.
• To avoid confusion, use flags + appearance counter
• Example: ABCCCCCCCCDEFGGG
• Becomes: ABC!8DEFGGG
• ! is flag

https://www.pythonpool.com/run-length-encoding-python/
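Both RLE variants from the example can be sketched in Python. The function names are hypothetical, the decoder assumes the symbols themselves are never digits, and the min_run threshold of 4 in the flagged variant is an assumption chosen so that short runs such as GGG stay unencoded, as in the slide.

```python
from itertools import groupby

def rle_encode(text):
    """Plain RLE: every run becomes <symbol><count>."""
    return "".join(f"{symbol}{len(list(run))}" for symbol, run in groupby(text))

def rle_decode(encoded):
    """Inverse of rle_encode; assumes the symbols themselves are not digits."""
    out, i = [], 0
    while i < len(encoded):
        symbol = encoded[i]
        i += 1
        digits = ""
        while i < len(encoded) and encoded[i].isdigit():
            digits += encoded[i]
            i += 1
        out.append(symbol * int(digits))
    return "".join(out)

def rle_encode_flagged(text, flag="!", min_run=4):
    """Flagged RLE: only runs of at least min_run symbols become <symbol><flag><count>."""
    parts = []
    for symbol, run in groupby(text):
        n = len(list(run))
        parts.append(f"{symbol}{flag}{n}" if n >= min_run else symbol * n)
    return "".join(parts)

plain = rle_encode("AACCCBBBBBAAAAFFFFFFFF")
restored = rle_decode(plain)
flagged = rle_encode_flagged("ABCCCCCCCCDEFGGG")
print(plain, restored, flagged)
```

The flag avoids the problem of plain RLE on data with few repeats, where writing a count after every single symbol would make the output longer than the input.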
Relevant Readings

92
C-Store

93
Mining MSN Messenger Data

94
Data Cube
• Must read this paper !!

95
