03 Database
Analytics
- SQL databases are generally better suited to traditional analytics on structured data with complex queries
and JOINs, while NoSQL databases shine in scenarios requiring scalability, flexibility, and real-time analytics.
Often, a hybrid approach using both types of databases, along with specialized solutions, can provide the
best of both worlds.
Difficult to achieve.
Careful coordination between nodes in 2 phases – Prepare + Commit
Negatives
- Blocking: records stay locked until all nodes consent to execute the transaction.
The locked records are unavailable until the transaction finishes.
- Network latency
- Difficult to scale; 2PC is the antithesis of scalability
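The two phases can be sketched in a few lines of Python. This is a minimal in-process illustration, not a real distributed protocol: the `Participant` class, its `can_commit` flag, and `two_phase_commit` are hypothetical names, and network calls, timeouts, and crash recovery are left out.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """A hypothetical node that votes on and then applies a transaction."""
    name: str
    can_commit: bool = True   # simulated vote for the prepare phase
    state: str = "idle"       # idle -> prepared -> committed / aborted

    def prepare(self) -> bool:
        # Phase 1: lock the records and vote. Locks are held until phase 2.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Return True if every participant committed, False if all aborted."""
    # Phase 1 (Prepare): every node must vote yes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (Commit): reached only after unanimous consent.
        for p in participants:
            p.commit()
        return True
    # A single "no" vote aborts the transaction on every node.
    for p in participants:
        p.abort()
    return False
```

The blocking behaviour noted above shows up between `prepare` and `commit`: a prepared participant has to keep its records locked until the coordinator's decision arrives.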
5. Database - Relational – Indexes, Faster reads
Table scan
Scan all rows, row by row
Index Scan
Query optimizer finds a particular page.
Index -> page -> row
Select * from employee where eid<10
S1 = Find matching entries in the index
S2 = Go to the heap and fetch the matching rows
Index Only Scan
More efficient, faster
Get column information from the index only and avoid going to the heap (the actual table data).
All necessary information is in the index itself, so the query must select only indexed columns.
Index -> row values (no heap page lookup)
Select eid from employee where eid<10
S1 = Find matching entries in the index and return the values stored there
No S2
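The difference between the two scans can be observed with SQLite, which ships with Python. This is a sketch under assumptions: the `employee` columns, the index name `idx_employee_eid`, and the helper `query_plan` are made up for illustration, and SQLite calls an index only scan a "covering index". Other engines word their plans differently.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (eid INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(i, f"emp{i}", "eng") for i in range(1, 1001)],
)
conn.execute("CREATE INDEX idx_employee_eid ON employee (eid)")

# Needs name and dept too, so the engine must visit the table after the index.
plan_all_columns = query_plan(conn, "SELECT * FROM employee WHERE eid < 10")

# Needs only eid, which the index already contains: no table visit.
plan_eid_only = query_plan(conn, "SELECT eid FROM employee WHERE eid < 10")

print(plan_all_columns)
print(plan_eid_only)
```

Only the second plan mentions a covering index, because `eid` is the only column requested and it lives in the index.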
3. Graph DB
Stores data as entities (nodes) and relationships (edges)
Use cases
Social networking, fraud management
Example
Neo4j, Amazon Neptune
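As a rough illustration of the entity and relationship model, the sketch below keeps relationships in an adjacency map and answers a typical social networking question, "friends of friends". The `Graph` class, the `FRIEND` relationship label, and the sample names are all hypothetical; real graph databases expose query languages for this rather than hand-written traversals.

```python
from collections import defaultdict


class Graph:
    """Entities are node names; relationships are labelled, directed edges."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> {(relationship, other node)}

    def relate(self, a, relationship, b):
        self.edges[a].add((relationship, b))

    def befriend(self, a, b):
        # Friendship goes both ways, so store one edge per direction.
        self.relate(a, "FRIEND", b)
        self.relate(b, "FRIEND", a)

    def neighbours(self, node, relationship):
        return {b for rel, b in self.edges[node] if rel == relationship}


def friends_of_friends(graph, person):
    """People two FRIEND hops away who are not already direct friends."""
    direct = graph.neighbours(person, "FRIEND")
    two_hops = set()
    for friend in direct:
        two_hops |= graph.neighbours(friend, "FRIEND")
    return two_hops - direct - {person}


g = Graph()
g.befriend("alice", "bob")
g.befriend("bob", "carol")
g.befriend("alice", "dave")
```

Here `friends_of_friends(g, "alice")` follows relationships directly from node to node, which is the access pattern graph databases are built around.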
4. In memory
Stores data on RAM storage, no disk access
Positives
Minimum latency, no disk access
Negatives
Data loss on crash
Use cases
Realtime gaming, analytics
Example
Redis, Memcached, AWS ElastiCache
5. Time Series
Collect, store, and process data in timestamp order
Use cases
Event tracking
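A minimal sketch of the idea, assuming integer timestamps and an in-memory store: events are kept sorted by timestamp so that a time-range query becomes two binary searches. The `TimeSeries` class and its method names are hypothetical and do not reflect any particular product.

```python
import bisect


class TimeSeries:
    """Keeps (timestamp, value) pairs ordered by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # Insert at the sorted position so late-arriving events stay in order.
        idx = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(idx, timestamp)
        self._values.insert(idx, value)

    def between(self, start, end):
        """Return all events with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


events = TimeSeries()
events.append(30, "logout")
events.append(10, "login")
events.append(20, "click")
```

Calling `events.between(10, 20)` returns the login and click events in timestamp order even though they were appended out of order.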
6. Columnar DB
Traditional (row-oriented) DB: all columns of a row are stored together
Columnar DB: all values of a column are stored together
Enables column-wise compression and aggregation; suited to OLAP
Use cases
Reporting, data warehouse, OLAP
Positives
Faster reads: a query reads only the columns it needs instead of scanning row by row
Negatives
Slower writes than row-based storage
Example
Cassandra, HBase (wide-column stores, commonly grouped with columnar databases)
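The row versus column layout can be illustrated with plain Python structures. This is only a sketch with made-up sample data: `rows`, `to_columns`, and the field names are assumptions, and real engines work on disk blocks rather than lists.

```python
# Row-oriented layout: every record keeps all of its columns together.
rows = [
    {"eid": 1, "name": "Ann", "salary": 100},
    {"eid": 2, "name": "Bob", "salary": 200},
    {"eid": 3, "name": "Cy", "salary": 300},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row store: the aggregate walks every record, touching whole rows.
total_row_store = sum(record["salary"] for record in rows)

# Column store: the aggregate reads a single contiguous column.
total_column_store = sum(columns["salary"])
```

Both totals agree, but the columnar version never looks at `eid` or `name`. The cost shows on writes: adding one employee means appending to every column list instead of appending a single record.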
AWS
Object Storage S3
RDBMS Aurora
NoSQL/Key-Value ElastiCache
NoSQL/Document Elasticsearch
NoSQL/Graph Neptune
NoSQL/Timeseries Timestream
Ledger QLDB (Quantum Ledger Database)
OLAP Athena
MongoDB
Document Database
Data in binary JSON (BSON)
2PC for transactions that span multiple documents
+ Efficient read/search
- Write overhead
DynamoDB
Key-value database; each item (key + value) is limited to 400 KB
No Master, user can write on any node. Consistent hashing
Highly Available / Scalable
10 million R/W per second
Primary key has up to 2 parts
Partition Key (determines the partition; a well-chosen key spreads data evenly)
Sort Key (optional; orders items within a partition)
Example
In a table storing books, ISBN can be the Partition Key to uniquely identify books, and Publication Date the
Sort Key to organize editions chronologically.
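The roles of the two key parts in the books example can be simulated locally. This sketch does not use the DynamoDB API and does not reproduce its internal hash function: `NUM_PARTITIONS`, `partition_for`, `BookTable`, the MD5-based placement, and the placeholder ISBN are all assumptions for illustration.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4  # hypothetical; the real service manages partitions itself


def partition_for(partition_key: str) -> int:
    """Hash the partition key to pick a partition (illustrative hash only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


class BookTable:
    def __init__(self):
        # One dict per partition: isbn -> list of (publication_date, item),
        # kept sorted by publication_date (the sort key).
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put(self, isbn, publication_date, item):
        bucket = self.partitions[partition_for(isbn)].setdefault(isbn, [])
        sort_keys = [date for date, _ in bucket]
        position = bisect.bisect_left(sort_keys, publication_date)
        bucket.insert(position, (publication_date, item))

    def query(self, isbn):
        """All items for one partition key, ordered by sort key."""
        return list(self.partitions[partition_for(isbn)].get(isbn, []))


table = BookTable()
table.put("isbn-0001", "2021-05-01", {"edition": 2})
table.put("isbn-0001", "2015-03-10", {"edition": 1})
```

`table.query("isbn-0001")` returns the 2015 edition before the 2021 one, because ISO-formatted dates sort chronologically as strings.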
Slices
- A slice is a logical partition of a Redshift compute node's disk storage.
- Multiple slices allow parallel processing across slices on each node.
- The number of slices per node depends on the node instance type.
Redshift offers 3 families of instances: Dense Compute (dc2), Dense Storage (ds2), Managed Storage (ra3).
- Slices can range from 2 per node to 16 per node depending on the instance family and instance type.
- Slices distribute the query workload evenly across all nodes to leverage parallel compute.
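A toy model of the idea, assuming rows are dealt out round-robin and each slice computes a partial result that is then merged. The functions `distribute` and `parallel_sum` are hypothetical, and threads stand in for slices; this is not how Redshift itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, nodes, slices_per_node):
    """Deal rows out evenly across every slice of every node."""
    total_slices = nodes * slices_per_node
    slices = [[] for _ in range(total_slices)]
    for i, row in enumerate(rows):
        slices[i % total_slices].append(row)
    return slices


def parallel_sum(slices):
    """Each slice sums its own rows; the partial sums are then merged."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(sum, slices))
    return sum(partials)


# 2 nodes with 2 slices each give 4 units of parallel work.
slices = distribute(list(range(100)), nodes=2, slices_per_node=2)
total = parallel_sum(slices)
```

With an even distribution every slice holds the same number of rows, so no single slice becomes the bottleneck for the query.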
Columnar Storage
Data stored in columnar format
Disk I/O is reduced significantly, e.g., a query selecting 5 columns out of a 100-column table only accesses
roughly 5% of the data block space.
Each block of data contains values from a single column. This means the data type within each block is always
the same.
Compression: because each block holds one data type, Redshift can apply an appropriate compression encoding
to each block, increasing the amount of data processed within the same disk and memory space.
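Run-length encoding is one simple example of why single-column blocks compress well. The sketch below is illustrative only: `rle_encode`, `rle_decode`, and the sample `dept` block are assumptions, and it does not show how Redshift chooses or applies its encodings.

```python
from itertools import groupby


def rle_encode(block):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(block)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original block."""
    block = []
    for value, count in pairs:
        block.extend([value] * count)
    return block


# A block from a hypothetical "dept" column: one type, many repeats.
dept_block = ["eng"] * 4 + ["ops"] * 2
encoded = rle_encode(dept_block)
```

Six stored values shrink to two pairs, and `rle_decode(encoded)` restores the block exactly. A row-oriented block mixing ids, names, and departments would rarely contain such runs.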