Chapter 1
Query Processing and Optimization
Introduction
In this chapter we shall discuss the techniques used by a DBMS to process, optimize, and
execute high-level queries.
We cover the techniques used to split complex queries into multiple simple operations and the methods
of implementing these low-level operations.
Query optimization techniques are used to choose an efficient execution plan that
will minimize the runtime as well as the consumption of other resources, such as the number of
disk I/Os, CPU time, and so on.
What is Query Processing?
Query processing is the procedure of transforming a high-level SQL query into a correct and efficient
execution plan expressed in a low-level language.
When a database system receives a query for the update or retrieval of information, it
goes through a series of compilation steps that produce an execution plan.
It goes through several phases.
1. The first phase is the syntax checking phase: the system parses the query and
checks whether it follows the syntax rules. It then matches the objects in the
query against the tables, views, and columns listed in the system catalog. This
phase is divided into three steps: scanning, parsing, and validating.
A. Scanner: The scanner identifies the language tokens such as SQL
Keywords, attribute names, and relation names in the text of the query.
B. Parser: The parser checks the query syntax to determine whether it is
formulated according to the syntax rules of the query language.
C. Validation: The query must be validated by checking that all attributes
and relation names are valid and semantically meaningful names in the
schema of the particular database being queried.
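As a rough illustration of the scanning step, the sketch below splits a query string into tokens and tags SQL keywords. The token classes and the regular expression are simplifications for illustration, not a real SQL lexer:

```python
import re

# A few SQL keywords for illustration; a real scanner knows the full set.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND"}

def scan(query):
    """Split the query text into tokens and tag SQL keywords."""
    raw = re.findall(r"[A-Za-z_][A-Za-z_0-9.]*|'[^']*'|\d+|[><=*,;()]", query)
    return [("keyword" if t.upper() in KEYWORDS else "token", t) for t in raw]

print(scan("SELECT emp_nm FROM EMPLOYEE WHERE emp_desg > 100"))
```

The parser would then check that this token stream follows the grammar of the query language, and the validator would look each name up in the system catalog.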
The SQL query is decomposed into query blocks (low-level operations), which form
the basic units of processing. Hence nested queries within a query are identified as separate query
blocks.
The query decomposer goes through five stages of processing for decomposition
into low-level operations and translation into algebraic expressions.
Query Analysis
During the query analysis phase, the query is syntactically analyzed using the
programming language compiler (parser). A syntactically legal query is then
validated, using the system catalog, to ensure that all data objects (relations and
attributes) referred to by the query are defined in the database.
The type specification of the query qualifiers and result is also checked at this stage.
Example: SELECT emp_nm FROM EMPLOYEE WHERE emp_desg > 100;
This query will be rejected because the comparison "> 100" is
incompatible with the data type of emp_desg, which is a variable-length character
string.
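A sketch of this type-checking step against a hypothetical in-memory catalog (the attribute names and declared types are invented for illustration, not a real DBMS structure):

```python
# Hypothetical system catalog mapping attribute names to declared types.
CATALOG = {"emp_nm": "varchar", "emp_desg": "varchar", "salary": "int"}

def check_comparison(attr, literal):
    """Validate that a predicate compares an attribute with a
    literal of a compatible type, as the analyzer does above."""
    declared = CATALOG.get(attr)
    if declared is None:
        return "error: unknown attribute " + attr
    literal_type = "int" if isinstance(literal, int) else "varchar"
    if declared != literal_type:
        return "error: {} is {}, incompatible with {!r}".format(attr, declared, literal)
    return "ok"

print(check_comparison("emp_desg", 100))   # rejected: varchar vs. integer
print(check_comparison("salary", 100))     # accepted
```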
At the end of query analysis phase, the high-level query (SQL) is transformed into
some internal representation that is more suitable for processing. This internal
representation is typically a kind of query tree.
A query tree is a tree data structure that corresponds to a relational algebra expression.
A query tree is also called a relational algebra tree.
The leaf nodes of the tree represent the base input relations of the query.
Each internal node represents the result of applying an operation of the algebra.
The root of the tree represents the result of the query.
SELECT P.proj_no, P.dept_no, E.name, E.add, E.dob
FROM PROJECT P, DEPARTMENT D, EMPLOYEE E
WHERE P.dept_no = D.d_no AND D.mgr_id = E.emp_id AND
P.proj_loc = 'Mumbai';
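The query tree for this statement can be sketched as nested nodes, with the base relations at the leaves and the final projection at the root. This is a toy structure, not a DBMS internal format; the selection on proj_loc is shown pushed down to the PROJECT relation, anticipating the heuristic discussed later:

```python
class Node:
    """One node of a query (relational algebra) tree."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def show(self, depth=0):
        """Render the tree as indented lines, root first."""
        lines = ["  " * depth + self.label]
        for child in self.children:
            lines += child.show(depth + 1)
        return lines

# Root: project the requested columns; leaves: the base relations.
tree = Node("π proj_no, dept_no, name, add, dob",
        [Node("⋈ D.mgr_id = E.emp_id",
          [Node("⋈ P.dept_no = D.d_no",
            [Node("σ proj_loc = 'Mumbai'", [Node("PROJECT P")]),
             Node("DEPARTMENT D")]),
           Node("EMPLOYEE E")])])

print("\n".join(tree.show()))
```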
Query Normalization
During query normalization, the query is converted into a normalized form (such as conjunctive normal form) that can be more easily manipulated in later stages.
Semantic Analyzer
The objective of this phase of query processing is to reduce the number of predicates.
The semantic analyzer rejects normalized queries that are incorrectly formulated or contradictory.
A query is incorrectly formulated if its components do not contribute to the generation
of the result; this happens, for example, when a join specification is missing. A query is contradictory if
its predicate cannot be satisfied by any tuple in the relation.
The semantic analyzer examines the relational calculus query (SQL) to make sure it contains only data
objects (that is, tables, columns, views, and indexes) that are defined in the database catalog. It also makes sure
that each object in the query is referenced correctly according to its data type.
Query Simplifier
Integrity constraints define conditions that must hold for all states of the database, so any
query that contradicts an integrity constraint must be avoided and can be rejected without
accessing the database.
Query Restructuring
In the final stage of the query decomposition, the query can be restructured to give a
more efficient implementation. Transformation rules are used to convert one
relational algebra expression into an equivalent form that is more efficient.
The query can now be regarded as a relational algebra program, consisting of a series
of operations on relations.
Query Optimization
The primary goal of query optimization is to choose an efficient execution strategy for
processing a query. The query optimizer attempts to minimize the use of certain
resources (mainly the number of disk I/Os and CPU time) by selecting the best execution
plan (access plan).
Query optimization starts during the validation phase, when the system verifies that the
user has the appropriate privileges; an access plan is then generated to perform the query.
The optimizer works from:
the relational algebra query tree generated by the query simplifier module of the query
decomposer;
estimation formulas used to determine the cardinality of the intermediate result
tables;
a cost model;
statistical data from the database catalog.
The output of the query optimizer is the execution plan, in the form of an optimized
relational algebra query.
A query typically has many possible execution strategies, and the process of choosing
a suitable one for processing a query is known as Query Optimization.
The term query optimization does not mean that the chosen execution plan is always an
optimal (best) strategy; it is just a reasonably efficient strategy for executing the
query.
Each decomposed query block of SQL is translated into an equivalent extended
relational algebra expression and then optimized.
1. The first technique is based on Heuristic Rules for ordering the operations in a
query execution strategy.
2. The second technique involves the systematic estimation of the cost of the
different execution strategies and choosing the execution plan with the lowest
cost.
3. The third technique is semantic query optimization: it is used in
combination with the heuristic query transformation rules. It uses constraints
specified on the database schema, such as unique attributes and other more
complex constraints, in order to transform one query into another query that
is more efficient to execute.
Heuristic Rules
Heuristic rules are used as an optimization technique to modify the internal
representation of a query, usually in the form of a query tree or query graph data
structure, to improve its performance.
One of the main heuristic rules is to apply the SELECT operation before applying the
JOIN or other binary operations. This is because the size of the file resulting
from a binary operation such as JOIN is usually a multiplicative function of the sizes of
the input files.
The main idea is to reduce the size of intermediate results. This includes performing the
SELECT operation early to reduce the number of tuples and the
PROJECT operation early to reduce the number of attributes.
SELECT and PROJECT reduce the size of the file and hence should be
applied before JOIN or other binary operations. A heuristic query optimizer
transforms the initial (canonical) query tree into a final query tree using equivalence
transformation rules. This final query tree is efficient to execute.
Example of query optimization: identify all managers who work in a London branch.
SQL:
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND s.position = 'Manager' AND b.city = 'London';
This results in the following equivalent relational algebra statements:
1. σ(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
2. σ(position='Manager') ∧ (city='London') (Staff ⋈Staff.branchNo=Branch.branchNo Branch)
3. [σ(position='Manager') (Staff)] ⋈Staff.branchNo=Branch.branchNo [σ(city='London') (Branch)]
Assume:
1000 tuples in Staff.
50 Managers
50 tuples in Branch.
5 London branches
No indexes or sort keys
All temporary results are written back to disk (memory is small)
Tuples are accessed one at a time (not in blocks)
Query 1 (Bad)
Requires (1000 + 50) disk accesses to read the Staff and Branch relations.
Creates a temporary relation for the Cartesian product, with (1000 * 50) tuples.
Requires (1000 * 50) disk accesses to read the temporary relation back and test the predicate.
Total work = (1000 + 50) + 2*(1000 * 50) = 101,050 I/O operations.
Query 2 (Better)
Again requires (1000 + 50) disk accesses to read Staff and Branch.
Joins Staff and Branch on branchNo, producing 1000 tuples (each employee belongs to exactly one branch).
Requires (1000) disk accesses to read the joined relation back and check the predicate.
Total work = (1000 + 50) + 2*(1000) = 3,050 I/O operations, about a 3300% improvement over Query 1.
Query 3 (Best)
1. The main heuristic is to apply first the operations that reduce the size of intermediate
results.
2. Perform select operations as early as possible to reduce the number of tuples and
perform project operations as early as possible to reduce the number of attributes.
(This is done by moving select and project operations as far down the tree as
possible.)
3. The select and join operations that are most restrictive should be executed before
other similar operations. (This is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree appropriately.)
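The arithmetic behind the three plans can be checked directly. Note that the text above does not state the cost of Query 3; the figure below is derived under the same assumptions (no indexes, temporary results written to disk and read back, tuple-at-a-time access):

```python
staff, branch = 1000, 50   # relation cardinalities
managers, london = 50, 5   # tuples surviving each selection

# Plan 1: Cartesian product (1000*50 tuples), written out then re-read.
cost1 = (staff + branch) + 2 * (staff * branch)

# Plan 2: join first (1000 result tuples), written out then re-read.
cost2 = (staff + branch) + 2 * staff

# Plan 3: select on each relation first; only the small temporaries
# (50 managers, 5 London branches) are written out and re-read for the join.
cost3 = (staff + branch) + 2 * (managers + london)

print(cost1, cost2, cost3)   # 101050 3050 1160
```

Pushing both selections below the join cuts the I/O count from 101,050 to 1,160, which is exactly the behaviour the heuristics above aim for.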
Chapter 2
Database Security and Authorization
Authorization/Privilege
Authorization refers to the process that determines the mode in which a particular
(previously authenticated) client is allowed to access a specific resource controlled by
a server.
Any database access request will have the following three major components.
1. Requested Operation: what kind of operation is requested by a specific query?
2. Requested Object: on which resource or data of the database is the operation sought to
be applied?
3. Requesting User: who is the user requesting the operation on the specified object?
Forms of user authorization
There are different forms of user authorization on the resource of the database.
These include:
1. Read Authorization: the user with this privilege is allowed only to read the
content of the data object.
2. Insert Authorization: the user with this privilege is allowed only to insert new
records or items to the data object.
3. Update Authorization: users with this privilege are allowed to modify content
of attributes but are not authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a
record and not anything else.
Note: Depending on the power of the user, different users can have one or a combination of the
above forms of authorization on different data objects.
Database Administrator
The database administrator (DBA) is the central authority for managing a database
system.
The DBA's responsibilities include:
account creation;
granting privileges to users who need to use the system;
privilege revocation;
classifying users and data in accordance with the policy of the organization.
Access Protection, User Accounts, and Database Audits
Whenever a person or group of persons need to access a database system, the
individual or group must first apply for a user account.
The DBA will then create a new account id and password for the user if he/she
believes there is a legitimate need to access the database. The user must log in to the
DBMS by entering account id and password whenever database access is needed.
The database system must also keep track of all operations on the database that are
applied by a certain user throughout each login session.
If any tampering with the database is suspected, a database audit is performed.
A database audit consists of reviewing the log to examine all accesses and
operations applied to the database during a certain time period.
A database log that is used mainly for security purposes is sometimes called an audit
trail.
To protect databases against possible threats, two kinds of countermeasures can
be implemented:
1. Access control and
2. Encryption
Statistical databases are used mainly to produce statistics on various populations. The
database may contain confidential data on individuals, which should be protected
from user access. Users are permitted to retrieve statistical information on the
populations, such as averages, sums, counts, maximums, minimums, and standard
deviations.
A population is a set of rows of a relation (table) that satisfy some selection condition.
Statistical queries involve applying statistical functions to a population of rows. For
example, we may want to retrieve the number of individuals in a population or the
average income in the population.
However, statistical users are not allowed to retrieve individual data, such as
the income of a specific person.
Statistical database security techniques must disallow the retrieval of individual data.
This can be achieved by prohibiting queries that retrieve attribute values and by
allowing only queries that involve statistical aggregate functions such as COUNT, SUM,
AVG, MIN, and MAX. Such queries are sometimes called statistical queries.
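One simple way to enforce this restriction is to admit a query only when every item in its select list is an aggregate function. The function below is an illustrative sketch, not a real DBMS mechanism:

```python
import re

# Only these aggregate functions are allowed in a statistical query.
AGGREGATES = re.compile(r"^(COUNT|SUM|AVG|MIN|MAX)\s*\(.+\)$", re.IGNORECASE)

def is_statistical(select_list):
    """Return True only if every selected item is an aggregate,
    so no individual attribute values can be retrieved."""
    return all(AGGREGATES.match(item.strip()) for item in select_list)

print(is_statistical(["AVG(income)", "COUNT(*)"]))   # allowed
print(is_statistical(["income"]))                    # rejected: raw attribute
```

Real statistical databases need further protections (for example, minimum population sizes), since sequences of allowed aggregates can still leak individual values.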
Encryption
Types of Cryptosystems
Cryptosystems can be categorized into two:
1. Symmetric encryption – uses the same key for both encryption and decryption and
relies on safe communication lines for exchanging the key.
2. Asymmetric encryption – uses different keys for encryption and decryption.
Generally, Symmetric algorithms are much faster to execute on a computer than those
that are asymmetric. Asymmetric algorithms are more secure than symmetric algorithms.
Public Key Encryption algorithm: Asymmetric encryption
This algorithm operates with modular arithmetic – mod n, where n is the product of two
large prime numbers.
Two keys, d and e, are used for decryption and encryption.
n is chosen as a large integer that is a product of two large distinct prime
numbers, p and q.
The encryption key e is a randomly chosen number between 1 and n that is
relatively prime to (p-1) x (q-1).
The plaintext m is encrypted as C = m^e mod n.
However, the decryption key d is carefully chosen so that C^d mod n = m.
The decryption key d can be computed from the condition that d*e - 1 is divisible by
(p-1)*(q-1). Thus, the legitimate receiver who knows d simply computes C^d mod n = m
and recovers m.
Simple Example: Asymmetric encryption
1. Select primes p=11, q=3.
2. n = pq = 11*3 = 33
3. find phi which is given by, phi = (p-1)(q-1) = 10*2 = 20
4. Choose e=3 ( 1<e<phi)
5. Check for gcd(e, phi) = gcd(e, (p-1)(q-1)) = gcd(3, 20) = 1
6. Compute d (1 < d < phi) such that d*e - 1 is divisible by phi.
Simple testing (d = 2, 3, ...) gives d = 7.
7. Check: e*d - 1 = 3*7 - 1 = 20, which is divisible by phi (20).
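These steps can be verified directly. The plaintext value m = 4 below is an arbitrary illustrative choice (any m < n works):

```python
p, q = 11, 3
n = p * q                       # step 2: n = 33
phi = (p - 1) * (q - 1)         # step 3: phi = 20
e, d = 3, 7                     # steps 4-6: chosen keys
assert (e * d - 1) % phi == 0   # step 7: e*d - 1 = 20 is divisible by phi

m = 4                           # illustrative plaintext, m < n
c = pow(m, e, n)                # encryption: C = m^e mod n
print(c, pow(c, d, n))          # decryption C^d mod n recovers m
```

Note that toy primes like these are for arithmetic practice only; real RSA keys use primes hundreds of digits long.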
Given the following SQL*Plus session:
Connected.
SQL> revoke select on tbl from Tolasa;
Revoke succeeded.
SQL>connect Tolasa/Tolasa
SQL> select * from Biniam.tbl;
ERROR at line 1: ORA-00942: table or view does not exist
SQL> drop role assistant cascade;
drop role assistant cascade *
ERROR at line 1: ORA-01031: insufficient privileges
SQL> connect system/auwc
Connected.
SQL> drop role assistant cascade;
Role dropped.
SQL> connect Biniam/Biniam
ERROR: ORA-01045: user Biniam lacks CREATE SESSION privilege;
logon denied
Warning: you are no longer connected to ORACLE.
SQL> connect Tolasa/Tolasa
ERROR: ORA-01045: user Tolasa lacks CREATE SESSION privilege;
logon denied
Warning: you are no longer connected to ORACLE.
SQL> connect system/auwc
Connected.
SQL> drop user Biniam cascade;
User dropped.
SQL> drop user Tolasa cascade;
User dropped.
Let a and b be integers, not both zero. Then the greatest common divisor (GCD) of a
and b is the largest positive integer which is a factor of both a and b. We use gcd(a,
b) to denote this largest positive factor. One can extend this definition by setting
gcd(0, 0) = 0. Sage also uses gcd(a, b) to denote the GCD of a and b. The GCD of any two
distinct primes is 1, and the GCD of 18 and 27 is 9:
sage: gcd(3, 59)
1
sage: gcd(18, 27)
9
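Euclid's algorithm computes the same values as the Sage calls above, and is also how gcd(e, (p-1)(q-1)) = 1 is checked when choosing an RSA encryption key. A minimal sketch:

```python
def gcd(a, b):
    """Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b)
    until b reaches 0; the remaining a is the GCD."""
    a, b = abs(a), abs(b)
    while b:
        a, b = b, a % b
    return a

print(gcd(3, 59), gcd(18, 27), gcd(0, 0))   # 1 9 0
```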