Revision

The document discusses various aspects of query optimization, including plan enumeration, rule-based and cost-based optimization, and execution models. It also covers transactions, concurrency control mechanisms such as locking and timestamp ordering, and database sharding strategies for improved performance and scalability. Key concepts include ACID properties, locking mechanisms, and the importance of maintaining data consistency during concurrent operations.

Optimizer to Sharding 12 -> 23

Query Optimizer
The optimizer determines the best way to execute a query by evaluating multiple possible
execution plans and choosing the one with the lowest estimated resource consumption.

●​ Plan Enumeration: the process of generating and considering all possible query
execution plans for a given SQL query.
●​ Rule-Based Optimization: applies a set of predefined rules to select a query plan, such as preferring an index scan over a full table scan, or applying WHERE filters early to reduce the number of rows.
●​ Cost-Based Optimization: the optimizer uses statistical information about the data (e.g., table sizes, index selectivity, data distribution) to estimate the cost of each candidate execution plan, then selects the plan with the lowest cost (see the sketch after this list).
●​ Bottom-Up Approach: builds query execution plans starting from the smallest, most basic components (subqueries and simple joins), then moves up to more complex operators.
●​ Top-Down Approach: starting from higher-level operations (e.g., query root) and
moving to lower-level operations (e.g., table scans).
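
To make the cost-based idea concrete, below is a minimal Python sketch of choosing between two access paths. The cost formulas, constants, and statistics are illustrative assumptions, not any specific engine's cost model.

# Toy cost-based optimizer: pick the cheapest access path for one predicate.
# All statistics and cost constants are made up for illustration.
table_stats = {"rows": 1_000_000, "pages": 10_000}
selectivity = 0.001  # assumed fraction of rows matching the WHERE clause

def cost_full_scan(stats):
    # A sequential scan reads every page once.
    return stats["pages"]

def cost_index_scan(stats, sel):
    # Rough model: index traversal plus one page fetch per matching row.
    return 3 + stats["rows"] * sel  # 3 = assumed index traversal cost

plans = {
    "full_scan": cost_full_scan(table_stats),
    "index_scan": cost_index_scan(table_stats, selectivity),
}
print(min(plans, key=plans.get))  # index_scan (cost 1003 vs. 10000)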

Cardinality
Cardinality is a term used in databases and data analysis to describe the number of
unique values in a dataset or a column.
●​ Low cardinality: the column contains a small number of distinct values. For example, a column for "Gender" with values like "Male", "Female", and "Other" has low cardinality.
●​ High cardinality: the column contains a large number of distinct values, such as an "Email Address" or "User ID" column.
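
Cardinality is easy to compute directly; a quick Python illustration with made-up sample data:

# Cardinality = number of distinct values in a column.
genders = ["Male", "Female", "Female", "Other", "Male"]  # low cardinality
user_ids = [101, 102, 103, 104, 105]                     # high cardinality
print(len(set(genders)))   # 3 -> few distinct values relative to row count
print(len(set(user_ids)))  # 5 -> every value distinct

(In SQL the equivalent is SELECT COUNT(DISTINCT column) FROM table.)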

Query Execution
Query execution is the process by which a database translates a high-level query (e.g.,
SQL) into efficient low-level operations to retrieve or manipulate data. It involves:

1.​ Parsing: Checks syntax/semantics and creates a query tree.
2.​ Optimization: Finds the most efficient execution plan.
3.​ Plan Generation: Defines a sequence of operations (e.g., scans, joins).
4.​ Execution Engine: Executes the plan, interacting with storage and memory.

Query execution follows one of four processing models: Materialization, Iterator, Vectorization, and Pull-based vs. Push-based.


1. Materialization Model: In this model, the intermediate results of a query are computed
and stored (or "materialized") explicitly in memory or temporary storage before being passed
to the next operator in the query plan.
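
A minimal sketch of the materialization idea in Python; the operators and sample rows are hypothetical:

# Materialization model: each operator fully computes and stores its output
# before the next operator starts consuming it.
rows = [("alice", 30), ("bob", 17), ("carol", 25)]

def scan(table):
    return list(table)  # materialize the full scan result

def filter_adults(materialized):
    return [r for r in materialized if r[1] >= 18]  # full intermediate list

def project_names(materialized):
    return [r[0] for r in materialized]

print(project_names(filter_adults(scan(rows))))  # ['alice', 'carol']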

2. Iterator Model: a pull-based approach where query operators process one row at a time. Each operator implements three key functions: open(), next(), and close().
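
A minimal sketch of the iterator (Volcano) model, using the same hypothetical rows; each operator pulls one tuple at a time from its child via next():

# Iterator model: open() prepares the operator, next() returns one tuple
# (or None at end of stream), close() releases resources.
class Scan:
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.rows):
            return None  # end of stream
        row = self.rows[self.i]
        self.i += 1
        return row
    def close(self): pass

class Filter:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self): self.child.close()

plan = Filter(Scan([("alice", 30), ("bob", 17)]), lambda r: r[1] >= 18)
plan.open()
while (row := plan.next()) is not None:
    print(row)  # ('alice', 30)
plan.close()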

3. Vectorization Model: Processes data in batches or vectors (e.g., arrays of tuples) instead
of one tuple at a time.
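
A minimal vectorized variant of the same pipeline; the batch size is an arbitrary assumption:

# Vectorization model: operators exchange fixed-size batches of tuples
# instead of single rows, amortizing per-call overhead.
rows = [("alice", 30), ("bob", 17), ("carol", 25), ("dave", 40)]

def scan_batches(table, batch_size=2):
    for i in range(0, len(table), batch_size):
        yield table[i:i + batch_size]  # one vector of tuples at a time

def filter_batches(batches, pred):
    for batch in batches:
        out = [r for r in batch if pred(r)]  # predicate applied per batch
        if out:
            yield out

for batch in filter_batches(scan_batches(rows), lambda r: r[1] >= 18):
    print(batch)  # [('alice', 30)] then [('carol', 25), ('dave', 40)]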

4. Pull-based vs. Push-based Processing

●​ Pull-based Processing: operators request (pull) rows from their inputs when needed (e.g., the iterator model). Control flows from the root of the plan down to the leaves; data flows back up on demand.
●​ Push-based Processing: producers push rows to their parent operators as soon as they are available. Data still flows from the leaf operators toward the root, but execution is driven by the producers rather than the consumers.
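
For contrast with the pull-based iterator above, a minimal push-based sketch, with hypothetical operators:

# Push-based processing: the producer drives execution and hands each row
# to its consumer as soon as it is produced.
def scan_and_push(table, consume):
    for row in table:
        consume(row)  # push the row downstream immediately

def filter_into(pred, consume):
    return lambda row: consume(row) if pred(row) else None

results = []
scan_and_push([("alice", 30), ("bob", 17)],
              filter_into(lambda r: r[1] >= 18, results.append))
print(results)  # [('alice', 30)]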

Transactions
A transaction is a single unit of work in a database involving operations like insert, update,
or delete, ensuring consistency even during failures. Transactions follow the ACID
properties:

1.​ Atomicity: All operations complete fully or none at all (e.g., transfer fails, rollback
ensures no partial changes).
2.​ Consistency: Transitions the database between valid states while maintaining
constraints (e.g., no negative balances).
3.​ Isolation: Transactions run independently without interference.
4.​ Durability: Committed changes persist despite system failures.
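
A minimal sketch of atomicity and durability using Python's built-in sqlite3 module; the accounts table and amounts are hypothetical:

import sqlite3

# A money transfer either fully commits or fully rolls back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
    (balance,) = conn.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
    if balance < 0:
        raise ValueError("insufficient funds")  # consistency: no negative balances
    conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
    conn.commit()    # both updates become durable together
except Exception:
    conn.rollback()  # neither update is applied

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 30), ('bob', 120)]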

Schedule:​
An ordered sequence of transaction operations (read/write) ensuring consistent database
states.

1.​ Conflicting Operations: Occur when different transactions access the same data,
and at least one writes to it.
2.​ Recoverability: A schedule is recoverable if transactions reading uncommitted data
only commit after the source transaction commits.
3.​ Serializability: Ensures interleaved execution is equivalent to a serial order.

By following these principles, databases maintain correctness, consistency, and reliability, even during concurrent transaction processing.
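
Conflict-serializability is commonly checked with a precedence graph: add an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj, and the schedule is serializable iff the graph is acyclic. A minimal sketch with a hypothetical schedule:

# Each entry is (transaction, operation, item); a pair of operations on the
# same item by different transactions conflicts if at least one is a write.
schedule = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "R", "B"), ("T1", "W", "B")]

edges = set()
for i, (ti, op_i, item_i) in enumerate(schedule):
    for tj, op_j, item_j in schedule[i + 1:]:
        if ti != tj and item_i == item_j and "W" in (op_i, op_j):
            edges.add((ti, tj))

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, []))
    return any(visit(n, frozenset()) for n in graph)

print(edges)  # {('T1', 'T2'), ('T2', 'T1')} (order may vary)
print("serializable:", not has_cycle(edges))  # serializable: False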
Concurrency Control in Databases (locks)
Concurrency Control: Ensures multiple transactions execute without violating data
consistency.

Locks: Control access to database resources.

●​ Shared Lock (S-Lock): Multiple transactions can read but not write.
●​ Exclusive Lock (X-Lock): Only one transaction can read/write.

Two-Phase Locking (2PL): Ensures serializability.

1.​ Growing Phase: Locks are acquired but not released.
2.​ Shrinking Phase: Locks are released but no new locks are acquired.

●​ Strict 2PL: Holds all locks until commit/abort to prevent cascading rollbacks (see the sketch below).
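
A minimal sketch of a shared/exclusive lock table with strict-2PL release-at-commit semantics; the class and method names are hypothetical:

# S/X lock table under strict 2PL: locks are only released at commit/abort.
class LockTable:
    def __init__(self):
        self.locks = {}  # item -> (mode, set of holding transactions)

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible
            return True
        if holders == {txn}:
            self.locks[item] = (mode, holders)  # lock upgrade
            return True
        return False  # conflict: caller must wait or abort

    def release_all(self, txn):
        # Strict 2PL: called only at commit/abort, never mid-transaction.
        for item in list(self.locks):
            mode, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lt = LockTable()
print(lt.acquire("T1", "A", "S"))  # True
print(lt.acquire("T2", "A", "S"))  # True  (S is compatible with S)
print(lt.acquire("T2", "A", "X"))  # False (T1 also holds an S lock)
lt.release_all("T1")               # T1 commits
print(lt.acquire("T2", "A", "X"))  # True  (upgrade now succeeds)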

Cascading Rollback: A chain reaction of rollbacks caused by reading uncommitted data; when a transaction aborts, every transaction that depends on its uncommitted output must also roll back.

Deadlocks: Occur when transactions wait indefinitely for each other, forming a cycle of dependencies on each other's resources.

Lock Hierarchy: Structured locking from coarse to fine-grained (e.g., table → row) to reduce contention, acquiring locks on larger objects before smaller ones.
Concurrency Control (no locks)
1. Timestamp Ordering (TO): Timestamp ordering ensures serializability by assigning each
transaction a unique timestamp and scheduling operations based on these timestamps.
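
A minimal sketch of the basic timestamp-ordering rules; the data structures are illustrative:

# Each item tracks the largest read/write timestamps seen so far; an
# operation arriving "too late" forces its transaction to abort.
rts, wts = {}, {}  # item -> largest read / write timestamp

def read(ts, item):
    if ts < wts.get(item, 0):
        return "abort"  # a younger transaction already wrote this item
    rts[item] = max(rts.get(item, 0), ts)
    return "ok"

def write(ts, item):
    if ts < rts.get(item, 0) or ts < wts.get(item, 0):
        return "abort"  # a younger transaction already read/wrote this item
    wts[item] = ts
    return "ok"

print(read(5, "A"))   # ok     (RTS(A) = 5)
print(write(3, "A"))  # abort  (T3 is older than reader T5)
print(write(7, "A"))  # ok     (WTS(A) = 7)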

2. Optimistic Concurrency Control (OCC): Assumes that conflicts are rare and allows transactions to execute without acquiring locks during most of their execution. Validation occurs at commit time to ensure consistency: if a conflict is detected, the transaction is aborted and restarted; if validation succeeds, the transaction's changes are written to the database.

3. Multi-Version Concurrency Control (MVCC)

●​ Keeps multiple versions of data with timestamps.
●​ Readers see a consistent snapshot of the database.
●​ Writers create new versions instead of overwriting data.
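
A minimal sketch of versioned reads and writes; the snapshot rule shown (newest version with write timestamp <= the reader's snapshot) is the core MVCC idea, and the data structures are illustrative:

versions = {}  # item -> list of (write_ts, value), appended in timestamp order

def write(item, ts, value):
    versions.setdefault(item, []).append((ts, value))  # never overwrite

def read(item, snapshot_ts):
    visible = [v for ts, v in versions.get(item, []) if ts <= snapshot_ts]
    return visible[-1] if visible else None  # newest visible version

write("A", 1, "v1")
write("A", 5, "v2")  # new version; v1 is kept for older snapshots
print(read("A", 3))  # v1 (snapshot taken before ts=5)
print(read("A", 9))  # v2 (sees the newest version)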
Database Sharding
Architecture:​
Database sharding is a horizontal partitioning technique that divides large datasets into
smaller, independent pieces called shards, distributed across multiple servers for better
performance and scalability.

Sharding Strategies

1.​ Directory-Based Sharding: A central directory maps data to shards.
○​ Advantages: Flexible, supports complex partitioning.
○​ Disadvantages: Lookup latency, single point of failure.
2.​ Range-Based Sharding: Data divided into range-based shards (e.g., IDs 1–1000).
○​ Advantages: Simple, efficient for range queries.
○​ Disadvantages: Uneven distribution, complex rebalancing.
3.​ Hash-Based Sharding: Shard key hashed to determine shard placement (see the sketch after this list).
○​ Advantages: Even distribution, scalable.
○​ Disadvantages: Poor range-query performance, rebalancing challenges.
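
A minimal sketch of hash-based shard placement; the shard count and key format are arbitrary assumptions:

import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Stable hash (Python's built-in hash() is randomized per process).
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

for user_id in ["user:1", "user:2", "user:3"]:
    print(user_id, "-> shard", shard_for(user_id))

Note that changing NUM_SHARDS remaps almost every key; that rebalancing cost is exactly what consistent hashing (below) mitigates.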

Scaling & Redistribution

●​ Shard Splitting/Merging: Adjust shard sizes as data grows/shrinks.
●​ Rebalancing: Migrate data for uniform distribution.
●​ Consistent Hashing: Minimizes redistribution during scaling with a hash ring and virtual nodes (see the sketch below).
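
A minimal sketch of a consistent-hash ring with virtual nodes; node names and the vnode count are illustrative:

import bisect, hashlib

def h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node gets several points on the ring for smoothing.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["server-a", "server-b", "server-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))
# Adding a node moves only the keys in the ring segments it takes over,
# instead of remapping nearly every key as modulo hashing would.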

Considerations

●​ Indexing: Each shard must maintain its own indexes, but global indexing (e.g.,
across all shards) can be challenging.
●​ Transaction Management: Distributed transactions add latency; sharding is ideal for
localized operations.
●​ Fault Tolerance: High availability can be ensured by replicating each shard across
multiple servers.
