0% found this document useful (0 votes)
17 views14 pages

03 Database

Uploaded by

farzi account
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views14 pages

03 Database

Uploaded by

farzi account
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 14

Table of Contents

1. Database - Choosing a database.............................................................................................................................1


2. Database - Relational – ACID and Isolation levels....................................................................................................2
3. Database - Relational – ACID in distributed environment, 2PC?..............................................................................3
4. Database - Relational – Indexes, Faster writes........................................................................................................3
5. Database - Relational – What is N+1 problem.........................................................................................................4
6. Database - Non-relational – What is NoSQL............................................................................................................5
7. Database - Non-relational – Types of NoSQL...........................................................................................................5
8. Database - Non-relational - MongoDB....................................................................................................................6
9. Database - Non-relational - Can we have ACID in NoSQL?.......................................................................................6
10. Database Optimization/Performance - how to scale DB?....................................................................................7
11. Database Optimization/Performance - Database Partitioning.............................................................................7
12. Database Optimization/Performance - Sharding.................................................................................................7
13. Database Backup – DB Replication......................................................................................................................8
14. Database - Database Performance - Caching.......................................................................................................9
15. Database - Common Databases (AWS, Azure).....................................................................................................9
16. Database - Common NoSQL databases................................................................................................................9
17. Database - No SQL Performance – Lucene Index, Inverted index?......................................................................11
18. Datawarehouse - OLAP vs OLTP........................................................................................................................11
19. Datawarehouse - Row based vs Column based..................................................................................................12
20. Datawarehouse - OLAP – AWS Redshift.............................................................................................................12
1. Database - Choosing a database
Functional considerations
 Query pattern
 Search by key (Redis/ DynamoDB /s3)
 Search by key + sometimes other key (DynamoDB /Cassandra)
 Search by fuzzy (elastic)
 Search by joins (rdbms), graph
 Nature of data
 Structured or unstructured
 Data models
 Data models need joins, relationship
 Null values
 No fixed columns
Non-Functional considerations
 Consistency or Availability
 Strong Consistency(rdbms)
 Eventual Consistency (NoSQL, focus on HA), multiple datacenter and compromise on SC.
 Volume of growth
 TB/PB per month?
 Object storage if infinite growth like S3
 Performance
 Scalability
 RW separation, sharding
 Distribute data horizontally
 Scalability
 RW separation, sharding
 Distribute data horizontally
 License cost
 Switch to Aurora
2. Database - 4 main differentiator (Schema, Txn, Analytics,Scaling)
 Fixed schema (Fixed vs Flexibility)
 RDBMS follows fixed schema
 (-) NoSQL, the responsibility lies with the application to maintain data integrity

 Transaction Management (ACID vs Base)


 (-) NoSQL does not follow ACID strictly, rather it follows BASE model (Basically Available, Soft state, Eventual consistency)
 Soft state means the state of the system can be in a flux and change over time, even without input.
 However, different NoSQL db offer different transaction management.
 MongoDB
- Mongo supports ACID transaction in "document level". However, in multi document level, it started giving
ACID support since v4.0, that is complex and has performance limitations. With v4.2, MongoDB extended
multi-document transaction support to sharded clusters, enabling ACID transactions across distributed data.
 Cassandra:
- Cassandra supports lightweight transactions for operations like compare-and-set (CAS), but it does not
support full ACID transactions. Cassandra follows the BASE model and focuses on eventual consistency
rather than strong consistency.
 Redis:
- Transactions: Redis supports transactions with commands like MULTI, EXEC, DISCARD, and WATCH.
However, Redis transactions are not fully ACID-compliant and do not support rollbacks.
 Modern NoSQL databases increasingly support ACID transactions, though the scope and implementation vary
widely. While single-document ACID compliance is standard, more complex multi-document or multi-record
transactions are now possible in some systems, but they may have limitations in distributed scenarios. The
choice depends on the specific use case, required transactionality, and trade-offs in performance and scalability.

 Analytics
- SQL databases are generally better suited for traditional analytics with structured data and complex queries,
JOINS, while NoSQL databases shine in scenarios requiring scalability, flexibility, and real-time analytics.
Often, a hybrid approach using both types of databases, along with specialized solutions, can provide the
best of both worlds.

 2 main advantages for NoSQL


 Flexibility
 Scalability

3. Database - Relational – ACID and Isolation levels


ACID
 4 essentials of a relational transaction i.e., a unit of work
 Atomic
 All or none
 Consistency
 One valid state to another
 Correctness, maintain DB integrity and constraints
 Isolation
 Tx1 unaware of Tx2 local changes
 Dirty Read
 Txn1 cannot read of Tx2 unless changes are committed.
 Changes local till committed.
 Solution: READ_COMMITTED
 Non repeatable Read
 Txn1 reads two different values of Tx2 for the “same row” data
 Txn1 reads 100EUR for CustA-> Tx2 updates CustA balance to 50EUR -> Txn1 reads 50EUR, cant find 100
 Solution: REPEATABLE_READ by using versions, timestamp
 Phantom Read
 Reading additional or missing rows in query i.e., essentially not same rows
 Txn1 reads 2 customers with balance less than 100 EUR -> Tx2 has updated CustC balance -> Txn1 reads
again and sees 3 customers.
 Solution: SERIALIZABLE
 While both involve inconsistencies caused by concurrent transactions, the key distinction is whether the
inconsistency relates to row count (phantom read) or data values (non-repeatable read).
4 isolation levels
 READ_UNCOMMITTED
 Lowest level, dirty read possible
 READ_COMMITTED
 Resolves dirty read issue, i.e., Txn1 will read only committed values of Tx2
 REPEATABLE_READ
 Resolves unrepeatable read issue
 Txn1 will work on their respective “snapshot” version and modify locally
 SERIALIZABLE
 Highest level of isolation
 Resolves all issues, but Locks and performance
ACID or BASE?
 ACID
 Consistency (Strong) over Availability
 Focus on correctness
 Feature of Relational
 BASE
 Basically Available, Soft state, Eventual consistency
 Availability over consistency
 Soft state i.e., state may change eventually
 + No blocking, + No ACID overhead
 New NoSQL focusing on ACID as well
Primary vs Unique
 Primary attribute in a table to identify row vs Multiple attributes
 Not changed once set vs Multiple updates
 Not null vs nullable
 Clustered index vs non-clustered index.

4. Database - Relational – ACID in distributed environment, 2PC?

 Difficult to achieve.
 Careful coordination between nodes in 2 phases – Prepare + Commit
 Negatives
- Blocking, records blocked until all nodes consent to execute transaction
System unavailable unless finished.
- Network latency
- Difficult to achieve scalability, 2PC is antithesis of Scalability
5. Database - Relational – Indexes, Faster writes

 Table scan
 Scan all rows by rows
 Index Scan
 Query optimizer finds a particular page.
 Index -> page -> row
 Select * from employee where eid<10
S1 = Find rows stored in index
S2 = Go to heap and find 10 records sequentially
 Index Only Scan
 More efficient, faster
 Get column information from index only, avoid going to heap; No need to go to actual table data
All necessary information in index itself, no need to go to table for any additional info.
 Index -> page -> row
 Select * from employee where eid<10
S1 = find rows stored in index
No S2

 Why write is faster in row-based database? But not aggregation?


 Row based data is stored next to each other, if write one more, append it.
 Read is fine, aggregation is the issue.
 Select sum(salary) from employee.
 If employee is in different disks, load to memory first.
 Columnar database
 Columns are stored next to each other
 Sum(age), data stored in single disk access
 OLAP, compression, aggregation

 How are tables stored and how is index stored?


 Row -> Page -> Heap
 Rowid
 Pseudo id, system maintained
 Page
 Rows stored in pages
 Fixed size, e.g., 8kb for postgress, 16kb in MySQL
 Select * from employee.
Doesn’t read rows but returns page
 Heap
 Data structure holding pages, actual data storage
 Heap = Page0 + Page1 + Page2 + Page3 ….. + PageN
 Indexes used to find what part of heap we need to read
 Select * from employee where eid=10
//no index
//scan page0=no result
//scan page1=no result
//scan page3=find
//index
//scan page3=find , index tells which page to hit

 Clustered vs non clustered index?


 Clustered index
 Index which stores data based on PK
 Determines the physical order of data rows on disk.
 One CI per table, always stored sorted
 Usually CI is PK or the column which is most frequently used in queries
 Non-Clustered index
 NCI are used for query optimization purpose, e.g., employee is mostly searched using name
 No affect on the physical order of data on disk.

 Data Structure of index


 BTree
 Self-balanced tree
 They automatically maintain their balance during insertions and deletions by redistributing keys between
nodes. This ensures that the height of the tree remains relatively constant, which in turn guarantees
efficient search and access operations, making them suitable for database indexing
 Stores data in both internal and leaf nodes,
 BTree+
 Self-balanced tree
 They automatically maintain their balance during insertions and deletions by redistributing keys between
nodes. This ensures that the height of the tree remains relatively constant, which in turn guarantees
efficient search and access operations, making them suitable for database indexing
 Stores data in only leaf nodes, optimizing range queries and sequential access.

6. Database - Relational – What is N+1 problem


 Problem issue while querying relational database
 Initial query returns 1 row for all primary entities
 Subsequent N query for associated values with each N primary entities.
 Example
 2 students, and each having 3 course each
StudentA = History,Science,Maths
StudentB = Geography,Humanities,English
 N = primary entities i.e. 2.
 If we want to retrieve all students and their respective courses? N+1 i.e. 3 times
- 1 time to fetch all students initially
- N time to fetch respective courses for each student, select 2 times iteratively for each student
 Solution
 Use eager Fetching, fetch all in one go, don’t go for second query
 Use join

7. Database - Non-relational – What is NoSQL


 Important properties
1. Non-relational (no select * from employee)
2. Unstructured data
3. Schema less, flexible
4. Follows CAP and not ACID
5. Usually used for narrow access pattern
 Positives
 Scale horizontally
 Performance
 Flexible schema
When to choose SQL When to choose No-SQL
 Data is structured  Data is un-structured; Needs flexiblity
 ACID is required  No ACID is required
 Complex joins  Scalability and performance
 Volume growth stable  Huge volume of data
 Access patterns in database – Narrow vs wide
 How query retrieves data
 Narrow using some key like ID or wide using some range like in (20,30,40)
 NoSQL has narrow predefined access pattern to have predictable performance.
If access pattern is not well defined – use SQL
If access patten is narrow – use NoSQL
 Narrow access patten
SELECT * FROM employee WHERE eid=10;
 Wide access patten
SELECT * FROM employee WHERE age BETWEEN 25 AND 35;

8. Database - Non-relational – Types of NoSQL


1. Key Value
 Stored as KV, query by a field
 Stores data which rarely changes
 Use cases
 Session management, user preferences, profile
 Positives
 Horizontal scalability
 Example
 Redis, Memcached
2. Document DB
 Data represented as document e.g., json
 Use cases
 Content Management
 Example
 Elastic search, AWS DocumentDB

3. Graph DB
 Stores data as entity and relationship
 Use cases
 Social networking, fraud management
 Example
 Neo4J, AWS Neptune
4. In memory
 Stores data on RAM storage, no disk access
 Positives
 Minimum latency, no disk access
 Negatives
 Data loss on crash
 Use cases
 Realtime gaming, analytics
 Example
 Redis, Memcached , AWS Elasticache
5. Time Series
 Collect store, process data by timestamp sequence
 Use cases
 Event tracking
6. Columnar DB
 Traditional DB, all columns of a row stored together
 Columnar, data stored in column oriented
 Column wise compression, aggregation, OLAP
 Use cases
 Reporting, data warehouse, OLAP
 Positives
 Faster instead of reading rows by row; directly from column
 Negatives
 Slow writes vs fast writes of row based
 Example
 Cassandra, HBase

9. Database - Non-relational - MongoDB


 Requirement of storing unstructured or semi structured data
 E.g., product catalogs, descriptions, images etc
 E.g., IoT data, large volume sent by IoT devices, sensor data, telemetry data
 E.g., Geospatial Data: MongoDB has built-in support for geospatial data and is used in applications that require
location-based services, such as mapping and location tracking.
 Negatives
 Lack of Transactions: Until recent versions, MongoDB lacked support for multi-document transactions, which
made it challenging to maintain data consistency in certain scenarios. While transactions are now supported,
they may not perform as efficiently as in traditional relational databases.
 Memory Usage: MongoDB relies heavily on memory for performance. Large datasets or indexes can consume a
significant amount of RAM, potentially leading to memory pressure on the system.
 Complexity in Shard Configuration: Horizontal scalability in MongoDB involves sharding, which can be complex
to set up and manage, especially in large-scale deployments.

10. Database - Non-relational - Can we have ACID in NoSQL?


 NoSQL prioritize horizontal scalability and availability over consistency.
 Hence, not usually; BASE and eventual consistency preferred over ACID
 MongoDB supports distributed multi document acid transactions starting v4
 Perform multiple operations on multiple documents within a transaction, MongoDB ensures ACID maintained.
 Keep in mind the performance trade-offs. Prefer non-transactional operations for simple use case.
 RAFT algorithm
 Raft is a distributed consensus algorithm designed to ensure that a distributed system remains consistent
even in the presence of failures.
 A designated transaction coordinator manages the process, ensuring that write operations are committed
only when acknowledged by majority of nodes in the replica set, providing strong consistency and durability
even in the face of failures.
 Like Master-Slave, there is a Primary node responsible to handle all W and coordinate.
 Consensus among secondary nodes before commit. Elect primary in case of failure.
 Choose Relational if we want to have ACID, NoSQL is not made to be ACID compliant

11. Database Optimization/Performance - how to scale DB?


1. Hardware upgrade, simplest
2. Sharding or horizontally partition data across multiple servers. Done usually for NoSQL
DB Partitioning or vertically partition data in single server. Done usually for RDBMS
3. Read/Write replicas, different DB for read and write. Not easy for RDBMS

12. Database Optimization/Performance - Database Partitioning


 More common for Relational DB. Although can be applied for both.
 Large tables split into several partitions based on some range like date range.
 Partitioning = aim is query performance
 Partitioning = single server
 Vertical partitioning
 Divides a database table into multiple tables, each containing a subset of columns (attributes).
 Each partitioned table represents a distinct aspect or subset of the original data.
 E.g employee table with 100 records divided into 2 tables => EMP_BASIC, EMP_CONTACT etc
 Horizontal partitioning
 Sharding
 Horizontal partitioning divides a database table into multiple tables, each containing a subset of rows (records
or tuples). Each partitioned table represents a segment of the original data, often based on a defined criterion
like a range of values or a specific attribute.
 E.g employee table with 100 records divided into EMP_EU, EMP_IN etc

13. Database Optimization/Performance - Sharding


 More common for NoSQL. Although can be applied for both.
 Sharding is a database architecture strategy used to horizontally partition data across multiple servers or nodes to
improve scalability and performance.
 A large database is divided into smaller, more manageable pieces called "shards," and each shard is stored on a
separate server or node. This allows data to be distributed evenly and reduces the load on any single server,
enabling the database to handle high volumes of data and requests.
 Positives
 Scalability
 Performance
 High Availability, No SPOF
 Fault tolerance
 Negatives
 Complexity of data distribution, data consistency etc.
 Query routing
 Sharding key selection
 Data migration and rebalancing shards.
Some shards overutilized, some underutilized.
 For which kind of Database sharding is done?
 Both kind of databases
 Relational database
 Sharding is less common in traditional relational databases due to inherent complexity of splitting data.
 Non-Relational database
 Sharding is more common in NoSQL databases like MongoDB, Cassandra, Couchbase etc.
 NoSQL are made with horizontal scalability in mind.
 Factors to consider while sharding
 Data size, is volume expected to grow?
 Performance, too many RW?
 Latency, is it multi geo app?
 How to shard data, selection of sharding key?
 Geo Sharding, if column like location, region exists in database
 Create consistent hashing on some keys.
 How to know which shard to query?
 Each query will have keys/attributes from which SHARD key will be derived.
 Key should be predictable, immutable and part of request e.g. /search/country=IN&name=abcd
 Based on those keys, create hasing key
ConHash(IN, abcd) = K1
ConHash(UK, xyz) = K2
 Client library request -> Master node -> Calculate hash based on request -> Redirect to respective node

14. Database Backup – DB Replication


Different types of replications
1. Master-Slave (Primary/Secondary)
 Master writes
 Slave copies, replicates from Master
+ No write conflicts
+ High R availability
+ Sync or Async
- Consistency
 Client writes ----> Master --Unidirectional Copy--> Slave
Client reads ----> Slave

2. Master-Master (Multi Master)


 Scale write operations
 Complex and conflicts
+ High R/W availability
- Complex and conflicts
 Client writes ----> Any Node --Bidireactional Copy --> Other Nodes
Client reads ----> Any Node
 DynamoDB

RAID (Redundant Array of Independent Disks)


 Object storage principle
 Store same data in multiple places (HD)
 Commonly used in DB to protect data from failure, performance or fault tolerance.
 RAID CONTROLLER, handles the operations
 Different raid levels (6+), different offers of data loss, capacity
 RAID 0 (Striping)
 data stripped across multiple drives.
 (+) Fastest e.g., 100 rows = (50 rows) HD1 + (50 rows) HD2
 (-) No redundancy, if one drive fails, data lost
 RAID 1 (Mirroring)
 data duplicated in 2+ drives.
 (+) Redundancy, if one drive fails, data exists in another
 RAID 3 (Byte level Stripping with Dedicated parity)
 data is striped at the byte level across multiple data drives, and a dedicated parity drive is used to store
parity information.
 (+) High Data transfer rates because of byte level stripping
 (-) No redundancy for parity drive, if its lost, entire array is unusable.
 Other RAID etc.

15. Database - Database Performance - Caching


Negative: Cache miss storm
 Cache used for database slowness
 But what if cache itself brings down application performance
 Multiple clients simultaneously requesting same data
All will hit DB
 Solutions
 Keep stale data till cache compute is over
 Sleep shortly, then refetch
Negative: In memory cache for user details, downside of it?
 Scalability issue
 If user details are over cache, user must access cache for authorization
 During scaling, this has to be additionally considered

16. Database - Common Databases (AWS, Azure)

AWS
Object Storage S3
RDBMS Aurora
NoSQL/Key ElasticCache
NoSQL/Document ElasticSearch
NoSQL/Graph Neptune
NoSQL/Timeseries Timestream
Ledger AWS Quantum
OLAP Athena

17. Database - Common NoSQL databases


Cassandra
 Columnar Database
 Master Less i.e., no master slave.
 CQL like SQL
 CAP, highly available, partition tolerant, tunable consistency
 Taken concepts from DynamoDB
 Partitioning is hash based (consistent hashing)
 + Superfast Efficient Writes
 Because it does not require disk read or seek. In memory or Memtables, SSTabales (sorted string tables),
Commit log
 Client write -> memTable -> ssTable (no disk access)
 Write written to log file
 All writes are APPEND only. No insert, only update.
 + Highly Available
 + Scalable
 Consistency/Quorum in Cassandra
 5 nodes = N1, N2, N3, N4, N5
 E.g., Replication Factor=3, Quorum = 2, N5 becomes coordinator.
 Client -> N5 -> replicates to any 3 servers -> if success in 2, commit, won’t care for 3rd
 Quorum = (Replication Factor/2)+1

MongoDB
 Document Database
 Data in binary json (bjson)
 2PC on multiple documents
 + Efficient read/search
 - Write overhead

DynamoDB
 Key Value Database, KV less than 1MB
 No Master, user can write on any node. Consistent hashing
 Highly Available / Scalable
 10 million R/W per second
 Primary key has 2 parts
 Partition Key (determine partition, ensures even data distribution)
 Sort key (Optional, sorting within partition)
 Example
 In a table storing books, ISBN can be Partition Key to uniquely identify books, and Publication Date as Sort Key
to organize editions chronologically.

Elastic Search/Open Search


 Document Database
 Analytics engine + search engine
 Enforce schema like RDMBS using – Mappings
POST /localhost/employee “mappings” {“properties” {id:”type:num”}}
 DB -> tables -> rows
Index -> types -> document properties
 Document: smallest unit of storage e.g., row of db
Type: Collection of similar documents
Index: Logical namespace pointing to 1-n shards
Place to store related data
Shard: Index divided into logical shards
Ix1 = S1(R1) + S2(R2) + S3(R3) .…. R=replica, optional
 Node: server
Cluster: group of servers
 Lucene
 Single instance of shard, high performance search engine, in java
 Lucene uses inverted index
 Shard
 How index is stored in cluster
 Shards are container of data
 Shards allocated to nodes
 Shards can be primary or replica
 Each document belongs to 1 primary shard
 Yellow status of shard
 All primary nodes are active, but NO REPLICA
 Cluster is successfully capable of accepting requests
 Clustering
 Automatically assigning shards to nodes.

18. Database - No SQL Performance – Lucene Index, Inverted index?


Normal/Forward Index Inverted Index
 A sentence is broken into terms e.g., “Hello World” has 2 terms =  Stores words as index
“Hello”, “World” Hello = Doc1
 Stores document as index, words are mapped reference World = Doc1, Doc2
Doc1= hello, world hey= Doc1
Doc2=good, morning
+ Indexing is faster because key remains the same, new words added - Indexing is slower, as words must be checked before preparing index
- Slow search, e.g., index at beginning of the book + Fast search, e.g., glossary at end of book
Lucene Index
 Example of inverted index
 Maps words/terms to the document
 data structure used for full-text search in information retrieval systems
 Apache Lucene, open-source search library written in Java.

19. Datawarehouse - OLAP vs OLTP


OLAP vs OLTP
 Transaction vs Aggregation
 Row based vs Column based
 Read/Write vs Read intensive

20. Datawarehouse - Row based vs Column based


 Row based vs Column based
21. Datawarehouse - OLAP – AWS Redshift
Redshift
 Clustered warehouse
 Each cluster can house multiple databases.
 Each database contains multiple objects like tables, views, stored procedures, etc.
Nodes and Slices
 Redshift is clustered services, hence multiple nodes involved
 Nodes have 2 types => Leader + Compute (contain Slices)
 Leader Node
 Manages data distribution and query execution across compute nodes
 No data is stored. Data in compute node only.
 Compute Nodes
 Manages data

 Slices
- Slice is logical partition for disk storage.
- Multiple slices allow parallel processing across slices on each node.
- The number of slices per node depends on the node instance types.
Redshift offers 3 families of instances: Dense Compute(dc2), Dense Storage (ds2) , Managed Storage(ra3).
- Slices can range from 2 per node to 16 per node depending on the instance family and instance type
- Slice is to distribute the workload of queries evenly across all nodes to leverage the parallel compute
Columnar Storage
 Data stored in columnar format
 Disk I/O is reduced significantly e.g., query selecting 5 columns out of 100 column table only access 5% of the
data block space.
 Each block of data contains values from a single column. This means the data type within each block is always
the same
 Compression, Redshift can apply specific and appropriate compression on each block increasing the amount of
data being processed within the same disk and memory space

You might also like