Analytical Functions
This article builds a clear, thorough understanding of analytic functions and their
various options through a series of simple, concept-building examples. It is
intended for SQL coders who might not be using analytic functions because they are
unfamiliar with the cryptic syntax or unsure about the logic of operation.
I often see people reinvent the features provided by analytic functions
using native join and sub-query SQL. The article assumes that the reader is
familiar with basic Oracle SQL, sub-queries, joins and group functions, and it
builds the concept of analytic functions on that familiarity through a series of
examples.
It is true that whatever an analytic function does can also be done in native SQL,
with joins and sub-queries. But the same routine written with an analytic function
is usually faster, or at least as fast, when compared to native SQL. And that is
before counting the time spent coding the native SQL, then testing, debugging and
tuning it.
SELECT deptno,
COUNT(*) DEPT_COUNT
FROM emp
WHERE deptno IN (20, 30)
GROUP BY deptno;
DEPTNO DEPT_COUNT
---------------------- ----------------------
20 5
30 6
2 rows selected
Query-1
Consider Query-1 and its result. Query-1 returns the departments and their employee
counts. Most importantly, it groups the records by department in accordance with
the GROUP BY clause. As such, any non-"group by" column is not allowed in the select
clause.
11 rows selected.
Query-2
Now consider the analytic function query (Query-2) and its result. Note the
repeating values of the DEPT_COUNT column.
This brings out the main difference between aggregate and analytic functions.
Though analytic functions give an aggregate result, they do not group the result set.
They return the group value multiple times, once with each record. As such, any other
non-"group by" column or expression can be present in the select clause, for
example, the column EMPNO in Query-2.
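The listing for Query-2 is not shown above, so the following is only a sketch of a query with that shape, not necessarily the original. It assumes the usual SCOTT-style EMP table with EMPNO and DEPTNO columns; the alias dept_count mirrors the column name mentioned in the text.

```sql
-- Sketch only: assumes a SCOTT-style EMP table (empno, deptno, ...).
SELECT empno,
       deptno,
       -- No GROUP BY: the department count is repeated on every row
       -- that belongs to the same department.
       COUNT(*) OVER (PARTITION BY deptno) AS dept_count
FROM   emp
WHERE  deptno IN (20, 30);
```

Unlike Query-1, this returns one row per employee, which is why EMPNO can sit next to the count.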
Analytic functions are computed after all joins, WHERE clause, GROUP BY and HAVING
are computed on the query. The main ORDER BY clause of the query operates after the
analytic functions. So analytic functions can only appear in the select list and in
the main ORDER BY clause of the query.
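One practical consequence of this order of evaluation is that an analytic function cannot be used directly in a WHERE clause. A common workaround, sketched here under the assumption of the same SCOTT-style EMP table, is to compute the analytic value in an inline view and filter in the outer query. The threshold of 3 is an arbitrary illustration.

```sql
-- Sketch only: filter on an analytic result by wrapping it in an inline view.
SELECT empno, deptno, dept_count
FROM  (SELECT empno,
              deptno,
              COUNT(*) OVER (PARTITION BY deptno) AS dept_count
       FROM   emp)
WHERE  dept_count > 3          -- applied after the analytic function is computed
ORDER  BY deptno, empno;
```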
Query-3
SELECT COUNT(*) FROM emp
WHERE deptno IN (10, 20);
COUNT(*)
----------
8
Query-4
How to break the result set in groups or partitions?
It might be obvious from the previous example that the clause PARTITION BY is used
to break the result set into groups. PARTITION BY can take any non-analytic SQL
expression.
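As a small illustration of partitioning by an expression rather than a plain column, the sketch below counts employees hired in the same calendar year and, for contrast, uses an empty OVER () to count over the whole result set. It assumes EMP has a HIREDATE column of type DATE; the aliases are made up for this example.

```sql
-- Sketch only: PARTITION BY an expression, and an empty OVER () for comparison.
SELECT empno,
       hiredate,
       COUNT(*) OVER ()                                       AS total_count,     -- whole result set
       COUNT(*) OVER (PARTITION BY TO_CHAR(hiredate, 'YYYY')) AS hired_same_year  -- one partition per hire year
FROM   emp;
```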
Some functions support the <window_clause> inside the partition to further limit
the records they act on. In the absence of any <window_clause> (and of an ORDER BY
inside the OVER clause), analytic functions are computed on all the records of the
partition.
The functions SUM, COUNT, AVG, MIN and MAX are the common analytic functions whose
result does not depend on the order of the records.
Functions like LEAD, LAG, RANK, DENSE_RANK, ROW_NUMBER, FIRST, FIRST_VALUE, LAST and
LAST_VALUE depend on the order of the records. In the next example we will see how to
specify that.
The general syntax of specifying the ORDER BY clause in analytic function is:
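The syntax line and the query listing are not shown here, so what follows is a hedged reconstruction rather than the original: the first comment gives the ORDER BY form as Oracle documents it, and the query is a sketch that numbers employees within each department by hire date. The alias srlno is an assumption made for this example.

```sql
-- ORDER BY <sql_expr> [ASC | DESC] [NULLS FIRST | NULLS LAST]
--
-- Sketch only: serial number of each employee within the department,
-- ordered by hire date, with missing hire dates placed last.
SELECT empno,
       deptno,
       hiredate,
       ROW_NUMBER() OVER (PARTITION BY deptno
                          ORDER BY hiredate NULLS LAST) AS srlno
FROM   emp
WHERE  deptno IN (10, 20)
ORDER  BY deptno, srlno;
```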
8 rows selected.
Query-5
RANK and DENSE_RANK both assign a rank to the records based on some column value or
expression. In case of a tie between 2 records at position N, RANK gives both
records position N, skips position N+1 and gives position N+2 to the next record.
DENSE_RANK also gives both records position N but does not skip position N+1.
Query-6 shows the usage of both RANK and DENSE_RANK. For DEPTNO 20 there are two
contenders for the first position (EMPNO 7788 and 7902). Both RANK and DENSE_RANK
declare them joint toppers. RANK skips the next value, which is 2, and the next
employee, EMPNO 7566, is given position 3. For DENSE_RANK there are no such gaps.
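A query along the following lines would show the two functions side by side. It is a sketch under the assumption that EMP has a SAL column and that the ranking is by salary, highest first; the aliases sal_rank and sal_dense_rank are invented for this example.

```sql
-- Sketch only: RANK leaves gaps after ties, DENSE_RANK does not.
SELECT empno,
       deptno,
       sal,
       RANK()       OVER (PARTITION BY deptno
                          ORDER BY sal DESC NULLS LAST) AS sal_rank,
       DENSE_RANK() OVER (PARTITION BY deptno
                          ORDER BY sal DESC NULLS LAST) AS sal_dense_rank
FROM   emp
WHERE  deptno IN (10, 20)
ORDER  BY deptno, sal_rank;
```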
8 rows selected.
Query-6
LEAD and LAG
LEAD has the ability to compute an expression on the next rows (rows which are
going to come after the current row) and return the value to the current row. The
general syntax of LEAD is shown below:
The syntax of LAG is similar except that the offset for LAG goes into the previous
rows.
Query-7 and its result show a simple use of the LAG and LEAD functions.
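Since the syntax line and the listing are not shown, here is a hedged sketch of both. The comment gives the documented argument order (expression, offset, default); the query assumes a SAL column and uses made-up aliases to fetch the neighbouring salaries within each department.

```sql
-- LEAD(<sql_expr>, <offset>, <default>) OVER (<analytic_clause>)
-- LAG (<sql_expr>, <offset>, <default>) OVER (<analytic_clause>)
--
-- Sketch only: offset 1 looks one row ahead (LEAD) or behind (LAG);
-- the default 0 is returned when no such row exists in the partition.
SELECT deptno,
       empno,
       sal,
       LEAD(sal, 1, 0) OVER (PARTITION BY deptno
                             ORDER BY sal DESC NULLS LAST) AS next_lower_sal,
       LAG(sal, 1, 0)  OVER (PARTITION BY deptno
                             ORDER BY sal DESC NULLS LAST) AS prev_higher_sal
FROM   emp
WHERE  deptno IN (10, 20)
ORDER  BY deptno, sal DESC;
```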
8 rows selected.
Query-7
FIRST_VALUE and LAST_VALUE functions
The general syntax is:
The FIRST_VALUE analytic function picks the first record from the partition after
doing the ORDER BY. The <sql_expr> is computed on the columns of this first record
and results are returned. The LAST_VALUE function is used in similar context except
that it acts on the last record of the partition.
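The following sketch shows one way to use FIRST_VALUE to measure how many days after the first hire of a department each employee joined. It assumes HIREDATE is an Oracle DATE, so that subtracting two dates yields a number of days; the alias day_gap is an assumption.

```sql
-- FIRST_VALUE(<sql_expr>) OVER (<analytic_clause>)
--
-- Sketch only: days between each hire and the first hire of the department.
SELECT empno,
       deptno,
       hiredate - FIRST_VALUE(hiredate)
                    OVER (PARTITION BY deptno ORDER BY hiredate) AS day_gap
FROM   emp
WHERE  deptno IN (20, 30)
ORDER  BY deptno, day_gap;
```

One caution when doing the same with LAST_VALUE: with an ORDER BY and no explicit window, the default window ends at the current row, so a window such as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is normally needed to reach the true last record of the partition.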
-- How many days after the first hire of each department were the next
-- employees hired?
11 rows selected.
Query-8
FIRST and LAST functions
The FIRST function (or more properly the KEEP FIRST function) is used in a very
special situation. Suppose we rank a group of records and find several records in
the first rank. Now we want to apply an aggregate function to the records of the
first rank. KEEP FIRST enables that.
Please note that FIRST and LAST are the only functions that deviate from the
general syntax of analytic functions. They do not have the ORDER BY inside the OVER
clause, nor do they support any <window> clause. The ranking done in FIRST and
LAST is always DENSE_RANK. The query below shows the usage of the FIRST function.
The LAST function is used in a similar context to perform computations on the
last-ranked records.
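As a hedged illustration of the KEEP syntax, the sketch below averages the salary of the employees hired in the earliest hire year of each department and repeats that value on every row of the department. The SAL column and the alias avg_sal_yr1_hire are assumptions for this example.

```sql
-- <aggregate>(<sql_expr>) KEEP (DENSE_RANK FIRST ORDER BY <expr>) OVER (<partition_clause>)
--
-- Sketch only: the ORDER BY lives inside KEEP, not inside OVER.
SELECT empno,
       deptno,
       TO_CHAR(hiredate, 'YYYY') AS hire_yr,
       sal,
       TRUNC(AVG(sal) KEEP (DENSE_RANK FIRST
                            ORDER BY TO_CHAR(hiredate, 'YYYY'))
             OVER (PARTITION BY deptno)) AS avg_sal_yr1_hire
FROM   emp
WHERE  deptno IN (10, 20)
ORDER  BY deptno, empno;
```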
-- How does each employee's salary compare with the average salary of the
-- first-year hires of their department?
8 rows selected.
Query-9
How to specify the Window clause (ROW type or RANGE type windows)?
Some analytic functions (AVG, COUNT, FIRST_VALUE, LAST_VALUE, MAX, MIN and SUM
among the ones we discussed) can take a window clause to further sub-partition the
result and apply the analytic function. An important feature of the windowing
clause is that it is dynamic in nature: the window is re-evaluated for every row.
A ROWS window and a RANGE window cannot appear together in one OVER clause. The
window clause is defined in terms of the current row, but it may or may not include
the current row. Both the start point and the end point of the window can lie
before or after the current row; the only restriction is that the start point
cannot come after the end point. If an ORDER BY is given but the window clause is
omitted altogether, Oracle defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW.
If the end point is the current row, the syntax can be given in terms of the start
point alone:
[ROWS or RANGE] [<start_expr> PRECEDING or UNBOUNDED PRECEDING]
[ROWS or RANGE] CURRENT ROW is also allowed but is redundant. With ROWS, the
function then behaves as a single-row function and acts only on the current row.
For ROW type windows the windowing clause is in terms of record numbers.
Query-10 has no apparent real-life meaning (except for the column FROM_PU_C),
but it illustrates the various windowing clauses with a COUNT(*) function. The count
simply shows the number of rows inside the window definition. Note the build-up of
the count in each column for the year 1981.
The column FROM_P3_TO_F1 shows an example where the start point of the window is
before the current row and the end point is after the current row. This is a 5-row
window; it shows values less than 5 at the beginning and at the end.
14 rows selected.
Query-10
The column FROM_PU_TO_CURR shows an example where the start point of the window is
before the current row and the end point is the current row. This is the only
column with some real-world significance. It can be thought of as the yearly
employee build-up of the organization as each employee is hired.
The column FROM_P2_TO_P1 shows an example where both the start point and the end
point of the window are before the current row. This is a 3-row window, and the
count remains constant once 3 previous rows are available.
The column FROM_F1_TO_F3 shows an example where both the start point and the end
point of the window are after the current row. This is the reverse of the previous
column. Note how the count declines towards the end.
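The four kinds of ROWS windows described above can be sketched as follows. This is not the original Query-10: the partitioning by hire year is inferred from the text, and the cnt_* aliases are hypothetical names chosen so that each one states the window bounds it actually uses.

```sql
-- Sketch only: COUNT(*) reports how many rows fall inside each window.
SELECT empno,
       hiredate,
       COUNT(*) OVER (PARTITION BY TO_CHAR(hiredate, 'YYYY') ORDER BY hiredate
                      ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING)         AS cnt_p3_to_f1,   -- up to 5 rows
       COUNT(*) OVER (PARTITION BY TO_CHAR(hiredate, 'YYYY') ORDER BY hiredate
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cnt_pu_to_curr, -- running count
       COUNT(*) OVER (PARTITION BY TO_CHAR(hiredate, 'YYYY') ORDER BY hiredate
                      ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)         AS cnt_p3_to_p1,   -- up to 3 earlier rows
       COUNT(*) OVER (PARTITION BY TO_CHAR(hiredate, 'YYYY') ORDER BY hiredate
                      ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING)         AS cnt_f1_to_f3    -- up to 3 later rows
FROM   emp
ORDER  BY hiredate;
```

When a window contains no rows at all, as for the first row of a partition in cnt_p3_to_p1, COUNT(*) returns 0.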
RANGE Windows
For RANGE windows the general syntax is the same as that of ROWS windows:
If <sql_expr> evaluates to a numeric value, then the ORDER BY expr must be a NUMBER
or DATE datatype. If <sql_expr> evaluates to an interval value, then the ORDER BY
expr must be a DATE datatype.
Note the example (Query-11) below, which uses RANGE windowing. The important thing
here is that the size of the window, in terms of the number of records, can vary
from row to row.
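To make the varying window size concrete, here is a hedged sketch of a RANGE window whose bounds are computed from the current row's salary. It assumes a numeric SAL column; the aliases cnt_lt_half and cnt_mt_half are assumptions and the query may differ from the original Query-11.

```sql
-- Sketch only: RANGE offsets are value distances on the ORDER BY column (sal),
-- not row counts, so the number of rows in the window changes per row.
SELECT deptno,
       empno,
       sal,
       -- employees in the department whose salary is at most half of this salary
       COUNT(*) OVER (PARTITION BY deptno ORDER BY sal
                      RANGE BETWEEN UNBOUNDED PRECEDING
                                AND (sal / 2) PRECEDING) AS cnt_lt_half,
       -- employees in the department whose salary is at least one and a half times this salary
       COUNT(*) OVER (PARTITION BY deptno ORDER BY sal
                      RANGE BETWEEN (sal / 2) FOLLOWING
                                AND UNBOUNDED FOLLOWING) AS cnt_mt_half
FROM   emp
WHERE  deptno IN (20, 30)
ORDER  BY deptno, sal;
```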
-- For each employee give the count of employees getting half more than their
-- salary and also the count of employees in the departments 20 and 30 getting half
-- less than their salary.
11 rows selected.
Query-11
Order of computation and performance tips
Defining the PARTITION BY and ORDER BY clauses on indexed columns (ordered in
accordance with the PARTITION BY clause and then the ORDER BY clause of the analytic
function) will provide optimum performance. For Query-5, for example, a composite
index on the (deptno, hiredate) columns will prove effective.
It is advisable to always use the CBO for queries using analytic functions. The
tables and indexes should be analyzed and the optimizer mode should be CHOOSE.
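The two tips above could be applied roughly as follows. This is a sketch, not a tuning recipe: the index name emp_deptno_hiredate_idx is hypothetical, and gathering statistics through the DBMS_STATS package is assumed here as the way to have the table and its indexes analyzed.

```sql
-- Sketch only: composite index matching PARTITION BY deptno ORDER BY hiredate.
CREATE INDEX emp_deptno_hiredate_idx ON emp (deptno, hiredate);

-- Gather optimizer statistics for EMP and, via cascade, for its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER,
                                tabname => 'EMP',
                                cascade => TRUE);
END;
/
```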
Conclusion
The aim of this article is not to push the reader into using analytic functions
in every complex SQL statement. It is meant for the SQL coder who has been avoiding
analytic functions until now, even in complex analytic queries, and painstakingly
reinventing the same features with native SQL and join queries. Its job is done if
such a person finds analytic functions clear, understandable and usable after going
through the article, and starts using them.