Revision
Query Optimizer
The optimizer determines the best way to execute a query by evaluating multiple possible
execution plans and choosing the one with the lowest resource consumption.
● Plan Enumeration: the process of generating and considering all possible query
execution plans for a given SQL query.
● Rule-Based Optimization: uses a set of predefined rules to select a query plan, such as
preferring indexes over full table scans and applying WHERE filters early to reduce the
number of rows.
● Cost-Based Optimization: the optimizer uses statistical information about the data
(e.g., table sizes, index selectivity, data distribution) to estimate the cost of various query
execution plans. It then selects the plan with the lowest cost.
● Bottom-Up Approach: builds query execution plans starting from the smallest, most
basic components (subqueries and simple joins), then moves on to more complex
components.
● Top-Down Approach: starts from higher-level operations (e.g., the query root) and
moves down to lower-level operations (e.g., table scans).
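As a rough illustration of cost-based selection, the sketch below estimates two candidate plans from table statistics and keeps the cheaper one. The cost formulas, the `lookup_cost` factor, and all names here are assumptions made up for the example; real optimizers use far richer models.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    rows: int            # total rows in the table
    selectivity: float   # fraction of rows matching the predicate (0..1)


def full_scan_cost(stats: TableStats) -> float:
    # Assumed model: a full scan touches every row once.
    return float(stats.rows)


def index_scan_cost(stats: TableStats, lookup_cost: float = 4.0) -> float:
    # Assumed model: each matching row costs one (more expensive) index lookup.
    return stats.rows * stats.selectivity * lookup_cost


def choose_plan(stats: TableStats) -> str:
    # Enumerate the candidate plans, estimate each, keep the cheapest.
    candidates = {
        "full_scan": full_scan_cost(stats),
        "index_scan": index_scan_cost(stats),
    }
    return min(candidates, key=candidates.get)


selective = choose_plan(TableStats(rows=1_000_000, selectivity=0.001))
unselective = choose_plan(TableStats(rows=1_000_000, selectivity=0.9))
```

Under these assumed costs, a highly selective predicate favors the index, while a predicate matching most rows favors the full scan.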
Cardinality
Cardinality is a term used in databases and data analysis to describe the number of
unique values in a dataset or a column.
● Low cardinality: Indicates that a column contains a small range of distinct values.
For example, a column for "Gender" with values like "Male", "Female", and "Other"
has low cardinality.
● High cardinality: Indicates that a column contains a large range of distinct values,
such as an "Email Address" or "User ID" column.
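A small sketch of how cardinality could be measured for an in-memory column. The sample data and the helper name `cardinality` are hypothetical.

```python
def cardinality(values):
    # Cardinality = number of distinct values in the column.
    return len(set(values))


gender = ["Male", "Female", "Other", "Female", "Male", "Male"]
user_id = [101, 102, 103, 104, 105, 106]

gender_card = cardinality(gender)    # few distinct values: low cardinality
user_id_card = cardinality(user_id)  # every value distinct: high cardinality
```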
Query Execution
Query Execution is the process by which a database translates a high-level query (e.g.,
SQL) into efficient low-level operations to retrieve or manipulate data. It involves:
2. Iterator Model: A pull-based approach in which each operator processes one row at a
time and exposes three key functions: open, next, and close.
3. Vectorization Model: Processes data in batches or vectors (e.g., arrays of tuples) instead
of one tuple at a time.
● Pull-based Processing: Operators request data from their inputs when needed (e.g.,
iterator model). Control starts at the root of the query plan, and data flows up to it.
● Push-based Processing: Data is "pushed" from one operator to the next as soon as
it is available. Control starts at the leaves, which drive data toward the root.
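The iterator model can be sketched with two toy operators that each expose open, next, and close. The class names `Scan` and `Filter`, the `run` driver, and the sample rows are hypothetical, and returning `None` to signal end of input is an assumption of this sketch.

```python
class Scan:
    """Leaf operator: returns rows from an in-memory list, one per call."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def open(self):
        self.pos = 0

    def next(self):
        if self.pos >= len(self.rows):
            return None  # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        pass


class Filter:
    """Pulls rows from its child and returns only those matching the predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        # Keep pulling from the child until a row qualifies or input ends.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


def run(plan):
    # The root drives execution by repeatedly calling next() (pull-based).
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out


rows = [{"id": 1, "age": 17}, {"id": 2, "age": 30}, {"id": 3, "age": 45}]
adults = run(Filter(Scan(rows), lambda r: r["age"] >= 18))
```

A vectorized variant would keep the same interface but have `next` return a batch of rows rather than a single row.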
Transactions
A transaction is a single unit of work in a database involving operations like insert, update,
or delete, ensuring consistency even during failures. Transactions follow the ACID
properties:
1. Atomicity: Either all operations complete or none do (e.g., if a transfer fails midway,
a rollback ensures no partial changes remain).
2. Consistency: Transitions the database between valid states while maintaining
constraints (e.g., no negative balances).
3. Isolation: Transactions run independently without interference.
4. Durability: Committed changes persist despite system failures.
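Atomicity and consistency can be illustrated with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the sketch relies on the connection's context manager committing on success and rolling back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no negative balances
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True on commit, False on rollback."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the earlier credit was undone too.
        return False


overdraft_ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit to `bob` is applied first on purpose, so the failed debit shows that the rollback removes a change that had already been made inside the transaction.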
Schedule:
An ordered sequence of operations (read/write) from one or more transactions. A correct
schedule leaves the database in a consistent state.
1. Conflicting Operations: Occur when different transactions access the same data,
and at least one writes to it.
2. Recoverability: A schedule is recoverable if transactions reading uncommitted data
only commit after the source transaction commits.
3. Serializability: Ensures interleaved execution is equivalent to a serial order.
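One common way to test a schedule for conflict serializability is to build a precedence graph from its conflicting operations and check it for cycles. The notes do not name this technique, so treat the sketch below as an illustration; the tuple encoding `(transaction, action, item)` and all function names are assumptions.

```python
def conflicts(a, b):
    # Two operations conflict if they come from different transactions,
    # touch the same item, and at least one of them is a write.
    return a[0] != b[0] and a[2] == b[2] and "W" in (a[1], b[1])


def precedence_edges(schedule):
    # Edge Ti -> Tj whenever an operation of Ti conflicts with a later one of Tj.
    edges = set()
    for i, earlier in enumerate(schedule):
        for later in schedule[i + 1:]:
            if conflicts(earlier, later):
                edges.add((earlier[0], later[0]))
    return edges


def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        if any(visit(n) for n in graph[node]):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in graph)


def is_conflict_serializable(schedule):
    return not has_cycle(precedence_edges(schedule))


serial_like = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
interleaved = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

serial_like_ok = is_conflict_serializable(serial_like)
interleaved_ok = is_conflict_serializable(interleaved)
```

In `interleaved`, T1 reads A before T2 writes it, and T2 writes A before T1 writes it, so the graph has edges in both directions and no equivalent serial order exists.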
● Shared Lock (S-Lock): Multiple transactions can hold it at once to read the item; none
can write while it is held.
● Exclusive Lock (X-Lock): Only one transaction can hold it, and that transaction can both
read and write.
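The two lock modes can be summarized as a compatibility table. This is a minimal sketch: `COMPATIBLE` and `can_grant` are hypothetical names, and the held locks are assumed to belong to other transactions.

```python
# (held mode, requested mode) -> may the request be granted?
COMPATIBLE = {
    ("S", "S"): True,   # readers can share
    ("S", "X"): False,  # a writer must wait for readers
    ("X", "S"): False,  # readers must wait for the writer
    ("X", "X"): False,  # only one writer at a time
}


def can_grant(requested, held_modes):
    # Grant only if the request is compatible with every lock already held.
    return all(COMPATIBLE[(held, requested)] for held in held_modes)


second_reader = can_grant("S", ["S", "S"])
writer_vs_reader = can_grant("X", ["S"])
```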
Deadlocks: Occur when transactions wait indefinitely for each other because their
dependencies on each other's resources form a cycle.
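The circular dependence can be detected by searching a "wait-for" graph for a cycle. That term and the function `find_deadlock` are not from the notes; this is a minimal sketch, assuming the graph is given as a dictionary that maps each transaction to the transactions it is waiting on.

```python
def find_deadlock(waits_for):
    """Return a list of transactions that form a cycle, or None if there is none."""

    def dfs(node, path, on_path):
        for nxt in waits_for.get(node, ()):
            if nxt in on_path:
                # nxt is already on the current path, so the path loops back to it.
                return path[path.index(nxt):]
            found = dfs(nxt, path + [nxt], on_path | {nxt})
            if found:
                return found
        return None

    for start in waits_for:
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None


deadlocked = find_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]})
healthy = find_deadlock({"T1": ["T2"], "T2": []})
```

Once a cycle is found, a system would typically abort one transaction in it to break the deadlock. Which one to pick is a policy decision this sketch leaves out.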
Lock Hierarchy: Structured locking from coarse to fine granularity (e.g., table → row) to
reduce contention; the larger object is locked first, then the smaller one.
Concurrency Control (no locks)
1. Timestamp Ordering (TO): Ensures serializability by assigning each transaction a unique
timestamp and ordering conflicting operations by these timestamps. A transaction whose
operation arrives too late for its timestamp is aborted and restarted.
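The basic timestamp-ordering checks can be sketched as follows. Each item tracks the largest timestamps that have read and written it; the `Item` and `Abort` classes and the function names are hypothetical, and refinements such as the Thomas write rule are left out.

```python
class Abort(Exception):
    """Raised when an operation arrives too late for its timestamp."""


class Item:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0   # largest timestamp that has read this item
        self.write_ts = 0  # largest timestamp that has written this item


def read(item, ts):
    # A transaction must not read a value written by a "future" transaction.
    if ts < item.write_ts:
        raise Abort(f"read by T{ts} rejected: item written at {item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item, ts, value):
    # A transaction must not overwrite data that a later transaction
    # has already read or written.
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort(f"write by T{ts} rejected")
    item.value = value
    item.write_ts = ts


x = Item(10)
first_read = read(x, ts=2)       # allowed: nothing newer has written x
try:
    write(x, ts=1, value=99)     # too late: T2 already read x
    late_write_aborted = False
except Abort:
    late_write_aborted = True
```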
Sharding Strategies
Considerations
● Indexing: Each shard must maintain its own indexes; a global index (i.e., one spanning
all shards) is harder to build and keep up to date.
● Transaction Management: Distributed transactions add latency; sharding works best
when each operation touches a single shard.
● Fault Tolerance: High availability can be ensured by replicating each shard across
multiple servers.
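To make these considerations concrete, here is a minimal sketch of hash-based routing, one possible sharding strategy. The server names, the `SHARDS` topology, and the helper functions are hypothetical; SHA-256 is used only to get a hash that is stable across processes.

```python
import hashlib

# Hypothetical topology: every shard is replicated on two servers.
SHARDS = {
    0: ["db0-primary", "db0-replica"],
    1: ["db1-primary", "db1-replica"],
    2: ["db2-primary", "db2-replica"],
}


def shard_for(key: str, num_shards: int) -> int:
    # Use a stable hash; Python's built-in hash() of a str varies per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def servers_for(key: str) -> list:
    # All replicas that hold the shard owning this key.
    return SHARDS[shard_for(key, len(SHARDS))]


def is_single_shard(keys) -> bool:
    # True when every key lives on one shard, so no distributed transaction is needed.
    return len({shard_for(k, len(SHARDS)) for k in keys}) <= 1


replicas = servers_for("user:42")
```

Plain modulo hashing moves most keys when the number of shards changes; schemes such as consistent hashing reduce that movement but are outside the scope of this sketch.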