22 SQL For Modeling
22 SQL For Modeling
22 SQL For Modeling
The MODEL clause enables you to create a multidimensional array by mapping the
columns of a query into three groups: partitioning, dimension, and measure columns.
These elements perform the following tasks:
Partition columns define the logical blocks of the result set in a way similar to
the partitions of the analytical functions described in Chapter 21, " SQL for
Analysis and Reporting". Rules in the MODELclause are applied to each partition
independent of other partitions. Thus, partitions serve as a boundary point for
parallelizing the MODEL computation.
Dimension columns define the multi-dimensional array and are used to identify
cells within a partition. By default, a full combination of dimensions should
identify just one cell in a partition. In default mode, they can be considered
analogous to the key of a relational table.
Measures are equivalent to the measures of a fact table in a star schema. They
typically contain numeric values such as sales units or cost. Each cell is
accessed by specifying its full combination of dimensions. Note that each
partition may have a cell that matches a given combination of dimensions.
The MODEL clause enables you to specify rules to manipulate the measure values of the
cells in the multi-dimensional array defined by partition and dimension columns.
Rules access and update measure column values by specifying dimension values
symbolically. Such symbolic references used in rules result in a highly readable
model. Rules are concise and flexible, and can use wild cards and looping constructs
for maximum expressiveness. Oracle evaluates the rules in an efficient way,
parallelizes the model computation whenever possible, and provides a seamless
integration of the MODEL clause with other SQL clauses. The MODEL clause, thus, is a
scalable and manageable way of computing business models in the database.
Figure 22-1 offers a conceptual overview of the modeling feature of SQL. The figure
has three parts. The top segment shows the concept of dividing a typical table into
partition, dimension, and measure columns. The middle segment shows two rules that
calculate the value of Prod1 and Prod2 for the year 2002. Finally, the third part shows
the output of a query that applies the rules to such a table with hypothetical data. The
unshaded output is the original data as it is retrieved from the database, while the
shaded output shows the rows calculated by the rules. Note that results in partition A
are calculated independently from results of partition B.
Figure 22-2 shows the flow of processing within a simple MODEL clause. In this case,
we will follow data through a MODEL clause that includes three rules. One of the rules
updates an existing value, while the other two create new values for a forecast. The
figure shows that the rows of data retrieved by a query are fed into the MODEL clause
and rearranged into an array. Once the array is defined, rules are applied one by one to
the data. Finally, the data, including both its updated values and newly created values,
is rearranged into row form and presented as the results of the query.
This query partitions the data in sales_view (which is illustrated in "Base Schema") on
country so that the model computation, as defined by the three rules, is performed on
each country. This model calculates the sales of Bounce in 2002 as the sum of its sales
in 2000 and 2001, and sets the sales for Y Box in 2002 to the same value as they were
in 2001. Also, it introduces a new product category All_Products (sales_view does not
have the product All_Products) for year 2002 to be the sum of sales of Bounce and Y
Box for that year. The output of this query is as follows, where bold text indicates new
values:
COUNTRY PRODUCT YEAR SALES
-------------------- --------------- ---------- ----------
Italy Bounce 1999 2474.78
Italy Bounce 2000 4333.69
Italy Bounce 2001 4846.3
Italy Bounce 2002 9179.99
...
Italy Y Box 1999 15215.16
Italy Y Box 2000 29322.89
Italy Y Box 2001 81207.55
Italy Y Box 2002 81207.55
...
Italy All_Products 2002 90387.54
...
Japan Bounce 1999 2961.3
Japan Bounce 2000 5133.53
Japan Bounce 2001 6303.6
Japan Bounce 2002 11437.13
...
Japan Y Box 1999 22161.91
Japan Y Box 2000 45690.66
Japan Y Box 2001 89634.83
Japan Y Box 2002 89634.83
...
Japan All_Products 2002 101071.96
...
Note that, while the sales values for Bounce and Y Box exist in the input, the values
for All_Products are derived.
Oracle Database provides the following capabilities with the MODEL clause:
You can specify a series of formulas, called rules, to operate on the data. Rules
can invoke functions on individual cells or on a set or range of cells. An
example involving individual cells is the following:
sales[country='Spain',year=2001] = sales['Spain',2000]+
sales['Spain',1999]
This sets the sales in Spain for the year 2001 to the sum of sales in Spain for
1999 and 2000. An example involving a range of cells is the following:
sales[country='Spain',year=2001] =
MAX(sales)['Spain',year BETWEEN 1997 AND 2000]
This sets the sales in Spain for the year 2001 equal to the maximum sales in
Spain between 1997 and 2000.
Using the UPSERT option, which is the default, you can create cell values that do
not exist in the input data. If the cell referenced exists in the data, it is updated.
Otherwise, it is inserted. The UPDATEoption, on the other hand, would not insert
any new cells.
You can specify these options globally, in which case they apply to all rules, or
per each rule. If you specify an option at the rule level, it overrides the global
option. Consider the following rules:
UPDATE sales['Spain', 1999] = 3567.99,
UPSERT sales['Spain', 2001] = sales['Spain', 2000]+ sales['Spain',
1999]
The first rule updates the cell for sales in Spain for 1999. The second rule
updates the cell for sales in Spain for 2001 if it exists, otherwise, it creates a
new cell.
You can use ANY and IS ANY to specify all values in a dimension. As an example,
consider the following statement:
sales[ANY, 2001] = sales['Japan', 2000]
This rule sets the 2001 sales of all countries equal to the sales value of Japan
for the year 2000. All values for the dimension, including nulls, satisfy
the ANY specification. You can specify the same in symbolic form using
an IS ANY predicate as in the following:
sales[country IS ANY, 2001] = sales['Japan', 2000]
You can use the CV function on the right side of a rule to access the value of a
dimension column of the cell referenced on the left side of a rule. It enables you
to combine multiple rules performing similar computation into a single rule,
thus resulting in concise specification. For example, you can combine the
following rules:
sales[country='Spain', year=2002] = 1.2 * sales['Spain', 2001],
sales[country='Italy', year=2002] = 1.2 * sales['Italy', 2001],
sales[country='Japan', year=2002] = 1.2 * sales['Japan', 2001]
Observe that the CV function passes the value for the country dimension from
the left to the right side of the rule.
Ordered computation
For rules updating a set of cells, the result may depend on the ordering of
dimension values. You can force a particular order for the dimension values by
specifying an ORDER BY in the rule. An example is the following rule:
sales[country IS ANY, year BETWEEN 2000 AND 2003] ORDER BY year =
1.05 * sales[CV(country), CV(year)-1]
This ensures that the years are referenced in increasing chronological order.
Reference models
Rules can reference cells from different multi-dimensional arrays. All but one
of the multi-dimensional arrays used in the model are read-only and are called
reference models. Rules can update or insert cells in only one multi-
dimensional array, which is called the main model. The use of reference models
enables you to relate models with different dimensionality. For example,
assume that, in addition to the fact table ft(country, year, sales), you have a
table with currency conversion ratios cr(country, ratio) with country as the
dimension column and ratio as the measure. Each row in this table gives the
conversion ratio of that country's currency to that of US dollar. These two
tables could be used in rules such as the following:
dollar_sales['Spain',2001] = sales['Spain',2000] * ratio['Spain']
Scalable computation
You can partition data and evaluate rules within each partition independent of
other partitions. This enables parallelization of model computation based on
partitions. For example, consider the following model:
MODEL PARTITION BY (c) DIMENSION BY (y) MEASURES (s)
(sales[y=2001] = AVG(s)[y BETWEEN 1990 AND 2000]
The data is partitioned by country and, within each partition, you can compute
the sales in 2001 to be the average of sales in the years between 1990 and 2000.
Partitions can be processed in parallel and this results in a scalable execution of
the model.
Base Schema
MODEL Clause Syntax
Keywords in SQL Modeling
Rules and Restrictions when Using SQL for Modeling
Cell Referencing
Differences Between Update and Upsert
Base Schema
This chapter's examples are based on the following view sales_view, which is derived
from the sh sample schema.
CREATE VIEW sales_view AS
SELECT country_name country, prod_name product, calendar_year year,
SUM(amount_sold) sales, COUNT(amount_sold) cnt,
MAX(calendar_year) KEEP (DENSE_RANK FIRST ORDER BY SUM(amount_sold) DESC)
OVER (PARTITION BY country_name, prod_name) best_year,
MAX(calendar_year) KEEP (DENSE_RANK LAST ORDER BY SUM(amount_sold) DESC)
OVER (PARTITION BY country_name, prod_name) worst_year
FROM sales, times, customers, countries, products
WHERE sales.time_id = times.time_id AND sales.prod_id = products.prod_id AND
sales.cust_id =customers.cust_id AND
customers.country_id=countries.country_id
GROUP BY country_name, prod_name, calendar_year;
This query computes SUM and COUNT aggregates on the sales data grouped by country,
product, and year. It will report for each product sold in a country, the year when the
sales were the highest for that product in that country. This is called the best_year of
the product. Similarly, worst_year gives the year when the sales were the lowest.
The MODEL clause enables you to define multi-dimensional calculations on the data in
the SQL query block. In multi-dimensional applications, a fact table consists of
columns that uniquely identify a row with the rest serving as dependent measures or
attributes. The MODEL clause lets you specify the PARTITION, DIMENSION,
and MEASURE columns that define the multi-dimensional array, the rules that operate on
this multi-dimensional array, and the processing options.
The MODEL clause contains a list of updates representing array computation within a
partition and is a part of a SQL query block. Its structure is as follows:
MODEL
[<global reference options>]
[<reference models>]
[MAIN <main-name>]
[PARTITION BY (<cols>)]
DIMENSION BY (<cols>)
MEASURES (<cols>)
[<reference options>]
[RULES] <rule options>
(<rule>, <rule>,.., <rule>)
Each rule represents an assignment. Its left side references a cell or a set of cells and
the right side can contain expressions involving constants, host variables, individual
cells or aggregates over ranges of cells. For example, consider Example 22-1.
Specifying RETURN UPDATED ROWS makes the preceding query return only those rows that
are updated or inserted by the model computation. By default or if you
use RETURN ALL ROWS, you would get all rows not just the ones updated or inserted by
the MODEL clause. The query produces the following output:
COUNTRY PRODUCT YEAR SALES
-------------------- --------------- ---------- ----------
Italy Bounce 2001 1000
Italy Bounce 2002 5333.69
Italy Y Box 2002 81207.55
Japan Bounce 2001 1000
Japan Bounce 2002 6133.53
Japan Y Box 2002 89634.83
Note that the MODEL clause does not update or insert rows into database tables. The
following query illustrates this be showing that sales_view has not been altered:
SELECT SUBSTR(country,1,20) country, SUBSTR(product,1,15) product, year,
sales
FROM sales_view
WHERE country IN ('Italy', 'Japan');
Observe that the update of the sales value for Bounce in the 2001 done by
this MODEL clause is not reflected in the database. If you want to update or insert rows
in the database tables, you should use theINSERT, UPDATE, or MERGE statements.
In the preceding example, columns are specified in the PARTITION BY, DIMENSION BY,
and MEASURES list. You can also specify constants, host variables, single-row functions,
aggregate functions, analytical functions, or expressions involving them as partition
and dimension keys and measures. However, you need to alias them
in PARTITION BY, DIMENSION BY, and MEASURES lists. You need to use aliases to refer these
expressions in the rules, SELECT list, and the query ORDER BY. The following example
shows how to use expressions and aliases:
SELECT country, p product, year, sales, profits
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL
RETURN UPDATED ROWS
PARTITION BY (SUBSTR(country,1,20) AS country)
DIMENSION BY (product AS p, year)
MEASURES (sales, 0 AS profits)
RULES
(profits['Bounce', 2001] = sales['Bounce', 2001] * 0.25,
sales['Bounce', 2002] = sales['Bounce', 2001] + sales['Bounce', 2000],
profits['Bounce', 2002] = sales['Bounce', 2002] * 0.35)
ORDER BY country, year;
See Oracle Database SQL Reference for more information regarding MODEL clause
syntax.
UPDATE
This updates existing cell values. If the cell values do not exist, no updates are
done.
UPSERT
This updates existing cell values. If the cell values do not exist, they are
inserted.
IGNORE NAV
For numeric cells, this treats nulls and absent values as 0. This means that a cell
not supplied to MODEL by the query result set will be treated as a zero for the
calculation. This can be used at a global level for all measures in a model.
KEEP NAV
This keeps null and absent cell values unchanged. It is useful for making
exceptions when IGNORE NAV is specified at the global level. This is the default,
and can be omitted.
Calculation Definition
MEASURES
RULES
AUTOMATIC ORDER
SEQUENTIAL ORDER
This causes rules to be evaluated in the order they are written. This is the
default.
UNIQUE DIMENSION
This is the default, and it means that the PARTITION BY and DIMENSION BY columns
in the MODEL clause must uniquely identify each and every cell in the model.
This uniqueness is explicitly verified at run time when necessary, in which case
it causes some overhead.
Cell Referencing
This partitions the data by country and defines within each partition, a two-
dimensional array on product and year. The cells of this array contain two
measures: sales and best_year.
Accessing the measure value of a cell by specifying the DIMENSION BY keys constitutes
a cell reference. An example of a cell reference is sales[product= 'Bounce',
year=2000].
Here, we are accessing the sales value of a cell referenced by product Bounce and the
year 2000. In a cell reference, you can specify DIMENSION BY keys either symbolically
as in the preceding cell reference or positionally as in sales['Bounce', 2000].
Based on how they are specified, cell references are either single cell or multi-cell
reference.
A cell reference in which a single value of interest is specified for each dimension
either symbolically or positionally is called a single cell reference. This references a
single cell in the default mode in which dimension keys are unique within a partition.
For example, considering the following cell reference:
sales['Bounce', year=2001]
This is a single cell reference in which a single value is specified for the first
dimension positionally and a single value for second dimension (year) is specified
symbolically.
Multi-Cell References
Cell references that are not single cell references are called multi-cell references.
Examples of multi-cell reference are:
sales[year>=2001],
sales[product='Bounce', year < 2001], AND
sales[product LIKE '%Finding Fido%', year IN (1999, 2000, 2001)]
Rules
Model computation is expressed in rules that manipulate the cells of the multi-
dimensional array defined by PARTITION BY, DIMENSION BY, and MEASURES clauses. A rule
is an assignment statement whose left side represents a cell or a range of cells and
whose right side is an expression involving constants, bind variables, individual cells
or an aggregate function on a range of cells. Rules can use wild cards and looping
constructs for maximum expressiveness. An example of a rule is the following:
sales['Bounce', 2003] = 1.2 * sales['Bounce', 2002]
This rule says that, for the product Bounce, the sales for 2003 are 20% more than that
of 2002.
Note that this rule has single cell references on both left and right side and is relatively
simple. Complex rules can be written with multi-cell references, aggregates, and
nested cell references. These are discussed in the following sections.
This type of rule involves single cell reference on the left side with constants and
single cell references on the right side. Some examples are the following:
sales[product='Finding Fido', year=2003] = 100000
sales['Bounce', 2003] = 1.2 * sales['Bounce', 2002]
sales[product='Finding Fido', year=2004] = 0.8 * sales['Standard Mouse Pad',
year=2003] + sales['Finding Fido', 2003]
Multi-cell references can be used on the right side of rules, in which case an aggregate
function needs to be applied on them to convert them to a single value. All existing
aggregate functions including OLAP aggregates (inverse percentile functions,
hypothetical rank and distribution functions and so on) and statistical aggregates
(correlation, regression slope and so on), and user-defined aggregate functions can be
used. For example, the rule to compute the sales of Bounce for 2003 to be 100 more
than the maximum sales in the period 1998 to 2002 would be:
sales['Bounce', 2003] = 100 + MAX(sales)['Bounce', year BETWEEN 1998 AND
2002]
This shows the usage of regression slope REGR_SLOPE function in rules. This function
computes the slope of the change of a measure with respect to a dimension of the
measure. In the preceding example, it gives the slope of the changes in the sales value
with respect to year. This model projects Finding Fido sales for 2003 to be the sales in
2002 scaled by the growth (or slope) in sales for years less than 2002.
Aggregate functions can appear only on the right side of rules. Arguments to the
aggregate function can be constants, bind variables, measures of the MODEL clause, or
expressions involving them. For example, the rule computes the sales of Bounce for
2003 to be the weighted average of its sales for years from 1998 to 2002 would be:
sales['Bounce', 2003] =
AVG(sales * weight)['Bounce', year BETWEEN 1998 AND 2002]
Rules can have multi-cell references on the left side as in the following:
sales['Standard Mouse Pad', year > 2000] =
0.2 * sales['Finding Fido', year=2000]
This rule accesses a range of cells on the left side (cells for product Standard Mouse
Pad and year greater than 2000) and assigns sales measure of each such cell to the
value computed by the right side expression. Computation by the preceding rule is
described as "sales of Standard Mouse Pad for years after 2000 is 20% of the sales of
Finding Fido for year 2000". This computation is simple in that the right side cell
references and hence the right side expression are the same for all cells referenced on
the left.
sales[product='Standard Mouse Pad', year>2000] = sales[CV(product),
CV(year)] + 0.2 * sales['Finding Fido', 2000]
The CV function provides the value of a DIMENSION BY key of the cell currently
referenced on the left side. When the left side references the cell Standard Mouse Pad
and 2001, the right side expression would be:
sales['Standard Mouse Pad', 2001] + 0.2 * sales['Finding Fido', 2000]
Similarly, when the left side references the cell Standard Mouse Pad and 2002, the
right side expression we would evaluate is:
sales['Standard Mouse Pad', 2002] + 0.2 * sales['Finding Fido', 2000]
The use of the CV function provides the capability of relative indexing where
dimension values of the cell referenced on the left side are used on the right side cell
references. The CV function takes a dimension key as its argument. It is also possible
to use CV without any argument as in CV() and in which case, positional referencing is
implied. CV() may be used outside a cell reference, but when used in this way its
argument must contain the name of the dimension desired. You can also write the
preceding rule as:
sales[product='Standard Mouse Pad', year>2000] =
sales[CV(), CV()] + 0.2 * sales['Finding Fido', 2000]
The first CV() reference corresponds to CV(product) and the latter corresponds
to CV(year). The CV function can be used only in right side cell references. Another
example of the usage of CV function is the following:
sales[product IN ('Finding Fido','Standard Mouse Pad','Bounce'), year
BETWEEN 2002 AND 2004] = 2 * sales[CV(product), CV(year)-10]
This rule says that, for products Finding Fido, Standard Mouse Pad, and Bounce, the
sales for years between 2002 and 2004 will be twice of what their sales were 10 years
ago.
You can use the wild card ANY in cell references to qualify all dimension values
including nulls. ANY maybe be used on both the left and right side of rules. For
example, a rule for the computation "sales of all products for 2003 are 10% more than
their sales for 2002" would be the following:
sales[product IS ANY, 2003] = 1.1 * sales[CV(product), 2002]
Cell references can be nested. In other words, cell references providing dimension
values can be used within a cell reference. An example, assuming best_year is a
measure, for nested cell reference is given as follows:
sales[product='Bounce', year = best_year['Bounce', 2003]]
Here, the nested cell reference best_year['Bounce', 2003] provides value for the
dimension key year and is used in the symbolic reference for year.
Measures best_year and worst_year give, for each year (y) and product (p)
combination, the year for which sales of product p were highest or lowest. The
following rule computes the sales of Standard Mouse Pad for 2003 to be the average
of Standard Mouse Pad sales for the years in which Finding Fido sales were highest
and lowest:
sales['Standard Mouse Pad', 2003] = (sales[CV(), best_year['Finding Fido',
CV(year)]] + sales[CV(), worst_year['Finding Fido', CV(year)]]) / 2
Oracle allows only one level of nesting, and only single cell references can be used as
nested cell references. Aggregates on multi-cell references cannot be used in nested
cell references.
By default, rules are evaluated in the order they appear in the MODEL clause. You can
specify an optional keyword SEQUENTIAL ORDER in the MODEL clause to make such an
evaluation order explicit. SQL models with sequential rule order of evaluation are
called sequential order models. For example, the following RULES specification makes
Oracle evaluate rules in the specified sequence:
RULES SEQUENTIAL ORDER
(sales['Bounce', 2001] =
sales['Bounce', 2000] + sales['Bounce', 1999], --Rule R1
sales['Bounce', 2000] = 50000, --Rule R2
sales['Bounce', 1999] = 40000) --Rule R3
Alternatively, the option AUTOMATIC ORDER enables Oracle to determine the order of
evaluation of rules automatically. Oracle examines the cell references within rules and
constructs a dependency graph based on dependencies among rules. If cells referenced
on the left side of rule R1 are referenced on the right side of another rule R2, then R2 is
considered to depend on R1. In other words, rule R1 should be evaluated before rule R2.
If you specify AUTOMATIC ORDER in the preceding example as in:
RULES AUTOMATIC ORDER
(sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999],
sales['Bounce', 2000] = 50000,
sales['Bounce', 1999] = 40000)
Rules 2 and 3 are evaluated, in some arbitrary order, before rule 1. This is because
rule 1 depends on rules 2 and 3 and hence need to be evaluated after rules 2 and 3.
The order of evaluation among second and third rules can be arbitrary as they do not
depend on one another. The order of evaluation among rules independent of one
another can be arbitrary. A dependency graph is analyzed to come up with the rule
evaluation order. SQL models with automatic order of evaluation, as in the preceding
fragment, are called automatic order models.
In an automatic order model, multiple assignments to the same cell are not allowed. In
other words, measure of a cell can be assigned only once. Oracle will return an error
in such cases as results would be non-deterministic. For example, the following rule
specification will generate an error as sales['Bounce', 2001] is assigned more than
once:
RULES AUTOMATIC ORDER
(sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999],
sales['Bounce', 2001] = 50000,
sales['Bounce', 2001] = 40000)
The rules assigning the sales of product Bounce for 2001 do not depend on one
another and hence, no particular evaluation order can be fixed among them. This leads
to non-deterministic results as the evaluation order is arbitrary - sales['Bounce',
2001] can be 40000 or 50000 or sum of Bounce sales for years 1999 and 2000. Oracle
prevents this by disallowing multiple assignments when AUTOMATIC ORDERis specified.
However, multiple assignments are fine in sequential order models.
If SEQUENTIAL ORDER was specified instead of AUTOMATIC ORDER in the preceding
example, the result of sales['Bounce', 2001]would be 40000.
Differences Between Update and Upsert
Rules in the MODEL clause have UPSERT semantics by default. If the cell referenced on
the left side of a rule exists, then its measure is updated with the value of the right side
expression. Otherwise, if a cell reference is positional, a new cell is created (that is,
inserted in to the multi-dimensional array) with the measure value equal to the value
of the right side expression. If a cell reference is not positional, it will not insert cells.
Note that if any column in a cell is symbolic, inserts are not possible. For example,
consider the following rule:
sales['Bounce', 2003] = sales['Bounce', 2001] + sales ['Bounce', 2002]
The cell for product Bounce and year 2003, if it exists, gets updated with the sum of
Bounce sales for years 2001 and 2002, otherwise, it gets created. An
optional UPSERT keyword can be specified in the MODELclause to make this upsert
semantic explicit.
Alternatively, the UPDATE option forces strict update mode. In this mode, the rule is
ignored if the cell it references on the left side does not exist.
You can specify an UPDATE or UPSERT option at the global level in the RULES clause in
which case all rules operate in the respective mode. These options can be specified at
a local level with each rule and in which case, they override the global behavior. For
example, in the following specification:
RULES UPDATE
(UPDATE s['Bounce',2001] = sales['Bounce',2000] + sales['Bounce',1999],
UPSERT s['Y Box', 2001] = sales['Y Box', 2000] + sales['Y Box', 1999],
sales['Mouse Pad', 2001] = sales['Mouse Pad', 2000] +
sales['Mouse Pad',1999])
The UPDATE option is specified at global level so, first and third rules operate in update
mode. The second rule operates in upsert mode as an UPSERT keyword is specified with
that rule. Note that no option was specified for the third rule and hence it inherits the
update behavior from the global option.
Using UPSERT would create a new cell corresponding to the one referenced on the left
side of the rule when the cell is missing, and the cell reference contains only
positional references qualified by constants.
Assuming we do not have cells for years greater than 2003, consider the following
rule:
UPSERT sales['Bounce', year = 2004] = 1.1 * sales['Bounce', 2002]
This would not create any new cell because of the symbolic reference year = 2004.
However, consider the following:
UPSERT sales['Bounce', 2004] = 1.1 * sales['Bounce', 2002]
This would create a new cell for product Bounce for year 2004. On a related note, new
cells will not be created if any of the positional reference is ANY. This is because ANY is
predicate that qualifies all dimensional values including NULL. If there is positional
reference ANY for a dimension d, then it can be considered as a predicate (d IS NOT
NULL OR d IS NULL).
Applications using models would not only have to deal with non-deterministic values
for a cell measure in the form of NULL, but also with non-determinism in the form of
missing cells. A cell, referenced by a single cell reference, that is missing in the data
is called a missing cell. The MODEL clause provides a default treatment for nulls and
missing cells that is consistent with the ANSI SQL standard and also provides options
to treat them in other useful ways according to business logic, for example, to treat
nulls as zero for arithmetic operations.
By default, NULL cell measure values are treated the same way as nulls are treated
elsewhere in SQL. For example, in the following rule:
sales['Bounce', 2001] = sales['Bounce', 1999] + sales['Bounce', 2000]
The right side expression would evaluate to NULL if Bounce sales for one of the years
1999 and 2000 is NULL. Similarly, aggregate functions in rules would treat NULL values
in the same way as their regular behavior where NULL values are ignored during
aggregation.
Missing cells are treated as cells with NULL measure values. For example, in the
preceding rule, if the cell for Bounce and 2000 is missing, then it is treated as
a NULL value and the right side expression would evaluate to NULL.
The functions PRESENTV and PRESENTNNV enable you to identify missing cells and
distinguish them from NULL values. These functions take a single cell reference and
two expressions as arguments as inPRESENTV(cell, expr1, expr2). PRESENTV returns
the first expression expr1 if the cell cell is existent in the data input to
the MODEL clause. Otherwise, it returns the second expression expr2. For example,
consider the following:
PRESENTV(sales['Bounce', 2000], 1.1*sales['Bounce', 2000], 100)
If the cell for product Bounce and year 2000 exists, it returns the corresponding sales
multiplied by 1.1, otherwise, it returns 100. Note that if sales for the product Bounce
for year 2000 is NULL, the preceding specification would return NULL.
The PRESENTNNV function not only checks for the presence of a cell but also whether it
is NULL or not. It returns the first expression expr1 if the cell exists and is not NULL,
otherwise, it returns the second expression expr2. For example, consider the
following:
PRESENTNNV(sales['Bounce', 2000], 1.1*sales['Bounce', 2000], 100)
Applications can use the new IS PRESENT predicate in their model to check the
presence of a cell in an explicit fashion.This predicate returns TRUE if cell exists
and FALSE otherwise. The preceding example using PRESENTNNV can be written
using IS PRESENT as:
CASE WHEN sales['Bounce', 2000] IS PRESENT AND sales['Bounce', 2000] IS NOT
NULL
THEN
1.1 * sales['Bounce', 2000]
ELSE
100
END
The IS PRESENT predicate, like the PRESENTV and PRESENTNNV functions, checks for cell
existence in the input data, that is, the data as existed before the execution of
the MODEL clause. This enables you to initialize multiple measures of a cell newly
inserted by an UPSERT rule. For example, if you want to initialize sales and profit
values of a cell, if it does not exist in the data, for product Bounce and year 2003 to
1000 and 500 respectively, you can do so by the following:
RULES
(UPSERT sales['Bounce', 2003] =
PRESENTV(sales['Bounce', 2003], sales['Bounce', 2003], 1000),
UPSERT profit['Bounce', 2003] =
PRESENTV(profit['Bounce', 2003], profit['Bounce', 2003], 500))
The PRESENTV functions used in this formulation would both return TRUE or FALSE based
on the existence of the cell in the input data. If the cell for Bounce and 2003 gets
inserted by one of the rules, based on their evaluation order, PRESENTV function in the
other rule would still evaluate to FALSE. You can consider this behavior as a
preprocessing step to rule evaluation that evaluates and replaces
all PRESENTV andPRESENTNNV functions and IS PRESENT predicate by their respective
values.
The MODEL clause, by default, treats missing cells as cells with NULL measure values. An
optional KEEP NAV keyword can be specified in the MODEL clause to get this behavior.If
your application wants to default missing cells and nulls to some values, you can do
so by using IS PRESENT, IS NULL predicates and PRESENTV, PRESENTNNV functions. But it
may become cumbersome if you have lot of single cell references and rules. You can
use IGNORE NAV option instead of the default KEEP NAV option to default nulls and
missing cells to:
In this, the input to the MODEL clause does not have a cell for product Bounce and year
2002. Because of IGNORE NAV option, sales['Bounce', 2002] value would default to 0
(as sales is of numeric type) instead of NULL. Thus, sales['Bounce', 2003] value
would be same as that of sales['Bounce', 2001].
Note that symbolic reference sales[product = NULL] would not qualify nulls in
product dimension. This behavior is in conformance with the handling of the
predicate "product= NULL" by SQL.
Reference Models
In addition to the multi-dimensional array on which rules operate, which is called the
main model, one or more read-only multi-dimensional arrays, called reference
models, can be created and referenced in the MODEL clause to act as look-up tables.
Like the main model, a reference model is defined over a query block and
has DIMENSION BY and MEASURES clauses to indicate its dimensions and measures
respectively. A reference model is created by the following subclause:
REFERENCE model_name ON (query) DIMENSION BY (cols) MEASURES (cols)
[reference options]
Like the main model, a multi-dimensional array for the reference model is built before
evaluating the rules. But, unlike the main model, reference models are read-only in
that their cells cannot be updated and no new cells can be inserted after they are built.
Thus, the rules in the main model can access cells of a reference model, but they
cannot update or insert new cells into the reference model. References to the cells of a
reference model can only appear on the right side of rules. You can view reference
models as look-up tables on which the rules of the main model perform look-ups to
obtain cell values. The following is an example using a currency conversion table as a
reference model:
CREATE TABLE dollar_conv_tbl(country VARCHAR2(30), exchange_rate NUMBER);
INSERT INTO dollar_conv_tbl VALUES('Poland', 0.25);
INSERT INTO dollar_conv_tbl VALUES('France', 0.14);
...
Now, to convert the projected sales of Poland and France for 2003 to the US dollar,
you can use the dollar conversion table as a reference model as in the following:
SELECT country, year, sales, dollar_sales
FROM sales_view
GROUP BY country, year
MODEL
REFERENCE conv_ref ON (SELECT country, exchange_rate FROM dollar_conv_tbl)
DIMENSION BY (country) MEASURES (exchange_rate) IGNORE NAV
MAIN conversion
DIMENSION BY (country, year)
MEASURES (SUM(sales) sales, SUM(sales) dollar_sales) IGNORE NAV
RULES
(dollar_sales['France', 2003] = sales[CV(country), 2002] * 1.02 *
conv_ref.exchange_rate['France'],
dollar_sales['Poland', 2003] =
sales['Poland', 2002] * 1.05 * exchange_rate['Poland']);
Growth rates, in this example, are hard coded in the rules. The growth rate for France
is 2% and that of Poland is 5%. But they could come from a separate table and you
can have a reference model defined on top of that. Assume that you have
a growth_rate(country, year, rate) table defined as the following:
CREATE TABLE growth_rate_tbl(country VARCHAR2(30),
year NUMBER, growth_rate NUMBER);
INSERT INTO growth_rate_tbl VALUES('Poland', 2002, 2.5);
INSERT INTO growth_rate_tbl VALUES('Poland', 2003, 5);
...
INSERT INTO growth_rate_tbl VALUES('France', 2002, 3);
INSERT INTO growth_rate_tbl VALUES('France', 2003, 2.5);
Then the following query computes the projected sales in dollars for 2003 for all
countries:
SELECT country, year, sales, dollar_sales
FROM sales_view
GROUP BY country, year
MODEL
REFERENCE conv_ref ON
(SELECT country, exchange_rate FROM dollar_conv_tbl)
DIMENSION BY (country c) MEASURES (exchange_rate) IGNORE NAV
REFERENCE growth_ref ON
(SELECT country, year, growth_rate FROM growth_rate_tbl)
DIMENSION BY (country c, year y) MEASURES (growth_rate) IGNORE NAV
MAIN projection
DIMENSION BY (country, year) MEASURES (SUM(sales) sales, 0 dollar_sales)
IGNORE NAV
RULES
(dollar_sales[ANY, 2003] = sales[CV(country), 2002] *
growth_rate[CV(country), CV(year)] *
exchange_rate[CV(country)]);
This query shows the capability of the MODEL clause in dealing with and relating
objects of different dimensionality. Reference model conv_ref has one dimension
while the reference model growth_ref and the main model have two dimensions.
Dimensions in the single cell references on reference models are specified using
the CV function thus relating the cells in main model with the reference model. This
specification, in effect, is performing a relational join between main and reference
models.
Reference models also help you convert keys to sequence numbers, perform
computations using sequence numbers (for example, where a prior period would be
used in a subtraction operation), and then convert sequence numbers back to keys. For
example, consider a view that assigns sequence numbers to years:
CREATE or REPLACE VIEW year_2_seq (i, year) AS
SELECT ROW_NUMBER() OVER (ORDER BY calendar_year), calendar_year
FROM (SELECT DISTINCT calendar_year FROM TIMES);
This view can define two lookup tables: integer-to-year i2y, which maps sequence
numbers to integers, and year-to-integer y2i, which performs the reverse mapping.
The references y2i.i[year] andy2i.i[year] - 1 return sequence numbers of the
current and previous years respectively and the reference i2y.y[y2i.i[year]-
1] returns the year key value of the previous year. The following query demonstrates
such a usage of reference models:
SELECT country, product, year, sales, prior_period
FROM sales_view
MODEL
REFERENCE y2i ON (SELECT year, i FROM year_2_seq) DIMENSION BY (year y)
MEASURES (i)
REFERENCE i2y ON (SELECT year, i FROM year_2_seq) DIMENSION BY (i)
MEASURES (year y)
MAIN projection2 PARTITION BY (country)
DIMENSION BY (product, year)
MEASURES (sales, CAST(NULL AS NUMBER) prior_period)
(prior_period[ANY, ANY] = sales[CV(product), i2y.y[y2i.i[CV(year)]-1]])
ORDER BY country, product, year;
Nesting of reference model cell references is evident in the preceding example. Cell
reference on the reference model y2i is nested inside the cell reference on i2y which,
in turn, is nested in the cell reference on the main SQL model. There is no limitation
on the levels of nesting you can have on reference model cell references. However,
you can only have two levels of nesting on the main SQL model cell references.
Finally, the following are restrictions on the specification and usage of reference
models:
FOR Loops
Iterative Models
Ordered Rules
Unique Dimensions Versus Unique Single References
FOR Loops
The MODEL clause provides a FOR construct that can be used inside rules to express
computations more compactly. It can be used on both the left and right side of a rule.
For example, consider the following computation, which estimates the sales of several
products for 2004 to be 10% higher than their sales for 2003:
RULES UPSERT
(sales['Bounce', 2004] = 1.1 * sales['Bounce', 2003],
sales['Standard Mouse Pad', 2004] = 1.1 * sales['Standard Mouse Pad', 2003],
...
sales['Y Box', 2004] = 1.1 * sales['Y Box', 2003])
The UPSERT option is used in this computation so that cells for these products and 2004
will be inserted if they are not previously present in the multi-dimensional array. This
is rather bulky as you have to have as many rules as there are products. Using
the FOR construct, this computation can be represented compactly and with exactly the
same semantics as in:
RULES UPSERT
(sales[FOR product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), 2004] =
1.1 * sales[CV(product), 2003])
If you write a specification similar to this, but without the FOR keyword as in the
following:
RULES UPSERT
(sales[product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), 2004] =
1.1 * sales[CV(product), 2003])
You would get UPDATE semantics even though you have specified UPSERT. In other
words, existing cells will be updated but no new cells will be created by this
specification. This is because of the symbolic multi-cell reference on product that is
treated as a predicate. You can view a FOR construct as a macro that generates multiple
rules with positional references from a single rule, thus preserving
the UPSERTsemantics. Conceptually, the following rule:
sales[FOR product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'),
FOR year IN (2004, 2005)] = 1.1 * sales[CV(product),
CV(year)-1]
The FOR construct in the preceding examples is of type FOR dimension IN (list of
values). Values in the list should either be constants or expressions involving
constants. In this example, there are separate FORconstructs on product and year. It is
also possible to specify all dimensions using one FOR construct. Consider for example,
we want only to estimate sales for Bounce in 2004, Standard Mouse Pad in 2005 and
Y Box in 2004 and 2005. This can be formulated as the following:
sales[FOR (product, year) IN (('Bounce', 2004), ('Standard Mouse Pad', 2005),
('Y Box', 2004), ('Y Box', 2005))] =
1.1 * sales[CV(product), CV(year)-1]
This FOR construct should be of form FOR (d1, ..., dn) IN ((d1_val1, ...,
dn_val1), ..., (d1_valm, ..., dn_valm)] when there are n dimensions d1, ...,
dn and m values in the list.
In some cases, the list of values for a dimension in FOR can be stored in a table or they
can be the result of a subquery. Oracle Database provides a flavor of FOR construct as
in FOR dimension in (subquery) to handle these cases. For example, assume that the
products of interest are stored in a table interesting_products, then the following rule
estimates their sales in 2004 and 2005:
sales[FOR product IN (SELECT product_name FROM interesting_products)
FOR year IN (2004, 2005)] = 1.1 * sales[CV(product), CV(year)-1]
As another example, consider the scenario where you want to introduce a new
country, called new_country, with sales that mimic those of Poland. This is
accomplished by issuing the following statement:
SELECT country, product, year, s
FROM sales_view
MODEL
DIMENSION BY (country, product, year)
MEASURES (sales s) IGNORE NAV
RULES UPSERT
(s[FOR (country, product, year) IN (SELECT DISTINCT 'new_country', product,
year
FROM sales_view
WHERE country = 'Poland')] = s['Poland',CV(),CV()])
ORDER BY country, year, product;
If the FOR construct is used to densify sparse data on multiple dimensions, it is possible
to encounter the 10,000 rule limit. In those cases, densification can be done outside
the MODEL clause using a partitioned outer join. See Chapter 21, " SQL for Analysis
and Reporting" for further information.
If you know that the values of interest come from a discrete domain, you can
use FOR construct FOR dimension FROM value1 TO value2 [INCREMENT | DECREMENT]
value3. This specification results in values between value1 and value2 by starting
from value1 and incrementing (or decrementing) by value3. The
values value1, value2, and value3 should be constants or expressions involving
constants. For example, the following rule:
sales['Bounce', FOR year FROM 2001 TO 2005 INCREMENT 1] =
sales['Bounce', year=CV(year)-1] * 1.2
This kind of FOR construct can be used for dimensions of numeric, date and datetime
datatypes. The increment/decrement expression value3 should be numeric for numeric
dimensions and can be numeric or interval for dimensions of date or datetime types.
Also, value3 should be positive. Oracle will return an error if you use FOR year FROM
2005 TO 2001 INCREMENT -1. You should use either FOR year FROM 2005 TO 2001
DECREMENT 1 or FOR year FROM 2001 TO 2005 INCREMENT 1. Oracle will also report an
error if the domain (or the range) is empty, as in FOR year FROM 2005 TO 2001
INCREMENT 1.
To generate string values, you can use the FOR construct FOR dimension LIKE string
FROM value1 TO value2 [INCREMENT | DECREMENT] value3. The string string should
contain only one % character. This specification results in string by replacing % with
values between value1 and value2 with appropriate increment/decrement value value3.
For example, the following rule:
sales[FOR product LIKE 'product-%' FROM 1 TO 3 INCREMENT 1, 2003] =
sales[CV(product), 2002] * 1.2
In SEQUENTIAL ORDER models, rules represented by a FOR construct are evaluated in the
order they are generated. On the contrary, rule evaluation order would be dependency
based if AUTOMATIC ORDER is specified. For example, the evaluation order for the rules
represented by the rule:
sales['Bounce', FOR year FROM 2004 TO 2001 DECREMENT 1] =
1.1 * sales['Bounce', CV(year)-1]
Iterative Models
Using the ITERATE option of the MODEL clause, you can evaluate rules iteratively for a
certain number of times, which you can specify as an argument to
the ITERATE clause. ITERATE can be specified only forSEQUENTIAL ORDER models and
such models are referred to as iterative models. For example, consider the following:
SELECT x, s FROM DUAL
MODEL
DIMENSION BY (1 AS x) MEASURES (1024 AS s)
RULES UPDATE ITERATE (4)
(s[1] = s[1]/2);
In Oracle, the table DUAL has only one row. Hence this model defines a 1-dimensional
array, dimensioned by x with a measure s, with a single element s[1] = 1024. The
rule s[1] = s[1]/2 evaluation will be repeated four times. The result of this query will
be a single row with values 1 and 64 for columns x and s respectively. The number of
iterations arguments for the ITERATE clause should be a positive integer constant.
Optionally, you can specify an early termination condition to stop rule evaluation
before reaching the maximum iteration. This condition is specified in
the UNTIL subclause of ITERATE and is checked at the end of an iteration. So, you will
have at least one iteration when ITERATE is specified. The syntax of the ITERATE clause
is:
ITERATE (number_of_iterations) [ UNTIL (condition) ]
Iterative evaluation will stop either after finishing the specified number of iterations or
when the termination condition evaluates to TRUE, whichever comes first.
In some cases, you may want the termination condition to be based on the change,
across iterations, in value of a cell. Oracle provides a mechanism to specify such
conditions in that it enables you to access cell values as they existed before and after
the current iteration in the UNTIL condition. Oracle's PREVIOUS function takes a single
cell reference as an argument and returns the measure value of the cell as it existed
after the previous iteration. You can also access the current iteration number by using
the system variable ITERATION_NUMBER, which starts at value 0 and is incremented after
each iteration. By usingPREVIOUS and ITERATION_NUMBER, you can construct complex
termination conditions.
Consider the following iterative model that specifies iteration over rules till the
change in the value of s[1] across successive iterations falls below 1, up to a
maximum of 1000 times:
SELECT x, s, iterations FROM DUAL
MODEL
DIMENSION BY (1 AS x) MEASURES (1024 AS s, 0 AS iterations)
RULES ITERATE (1000) UNTIL ABS(PREVIOUS(s[1]) - s[1]) < 1
(s[1] = s[1]/2, iterations[1] = ITERATION_NUMBER);
The absolute value function (ABS) can be helpful for termination conditions because
you may not know if the most recent value is positive or negative. Rules in this model
will be iterated over 11 times as after 11th iteration the value of s[1] would be 0.5.
This query results in a single row with values 1, 0.5, 10 for x, s and iterations
respectively.
You can use the PREVIOUS function only in the UNTIL condition.
However, ITERATION_NUMBER can be anywhere in the main model. In the following
example, ITERATION_NUMBER is used in cell references:
SELECT country, product, year, sales
FROM sales_view
MODEL
PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales sales)
IGNORE NAV
RULES ITERATE(3)
(sales['Bounce', 2002 + ITERATION_NUMBER] = sales['Bounce', 1999
+ ITERATION_NUMBER]);
This statement achieves an array copy of sales of Bounce from cells in the array 1999-
2001 to 2002-2005.
However, there is no self-cycle in the following rule as different measures are being
accessed on the left and right side:
projected_sales['Bounce', 2002] = 25000 / sales['Bounce', 2002]
When the analysis of an AUTOMATIC ORDER model finds that the rules have no circular
dependencies (that is, the dependency graph is acyclic), Oracle Database will evaluate
the rules in their dependency order. For example, in the
following AUTOMATIC ORDER model:
MODEL DIMENSION BY (prod, year) MEASURES (sale sales) IGNORE NAV
RULES AUTOMATIC ORDER
(sales['SUV', 2001] = 10000,
sales['Standard Mouse Pad', 2001] = sales['Finding Fido', 2001]
* 0.10 + sales['Boat', 2001] * 0.50,
sales['Boat', 2001] = sales['Finding Fido', 2001]
* 0.25 + sales['SUV', 2001]* 0.75,
sales['Finding Fido', 2001] = 20000)
Rule 2 depends on rules 3 and 4, while rule 3 depends on rules 1 and 4, and rules 1
and 4 do not depend on any rule. Oracle, in this case, will find that the rule
dependencies are acyclic and will evaluate rules in one of the possible evaluation
orders (1, 4, 3, 2) or (4, 1, 3, 2). This type of rule evaluation is called
an ACYCLIC algorithm.
In some cases, Oracle Database may not be able to ascertain that your model is acyclic
even though there is no cyclical dependency among the rules. This can happen if you
have complex expressions in your cell references. Oracle Database assumes that the
rules are cyclic and employs a CYCLIC algorithm that evaluates the model iteratively
based on the rules and data. Iteration will stop as soon as convergence is reached and
the results will be returned. Convergence is defined as the state in which further
executions of the model will not change values of any of the cell in the model.
Convergence is certain to be reached when there are no cyclical dependencies.
If your AUTOMATIC ORDER model has rules with cyclical dependencies, Oracle will
employ the earlier mentioned CYCLIC algorithm. Results are produced if convergence
can be reached within the number of iterations Oracle is going to try the algorithm.
Otherwise, Oracle will report a cycle detection error. You can circumvent this
problem by manually ordering the rules and specifying SEQUENTIAL ORDER.
Ordered Rules
An ordered rule is one that has ORDER BY specified on the left side. It accesses cells in
the order prescribed by ORDER BY and applies the right side computation. When you
have a positional ANY or symbolic references on the left side of a rule but without
the ORDER BY clause, Oracle might return an error saying that the rule's results depend
on the order in which cells are accessed and hence are non-deterministic. Consider the
following SEQUENTIAL ORDER model:
SELECT t, s
FROM sales, times
WHERE sales.time_id = times.time_id
GROUP BY calendar_year
MODEL
DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s)
RULES SEQUENTIAL ORDER
(s[ANY] = s[CV(t)-1]);
This query attempts to set, for all years t, sales s value for a year to the sales value of
the prior year. Unfortunately, the result of this rule depends on the order in which the
cells are accessed. If cells are accessed in the ascending order of year, the result would
be that of column 3 in Table 22-1. If they are accessed in descending order, the result
would be that of column 4.
t s If ascending If descending
1998 1210000982 null null
1999 1473757581 null 1210000982
2000 2376222384 null 1473757581
2001 1267107764 null 2376222384
If you want the cells to be considered in descending order and get the result given in
column 4, you should specify:
SELECT t, s
FROM sales, times
WHERE sales.time_id = times.time_id
GROUP BY calendar_year
MODEL
DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s)
RULES SEQUENTIAL ORDER
(s[ANY] ORDER BY t DESC = s[CV(t)-1]);
In general, you can use any ORDER BY specification as long as it produces a unique
order among cells that qualify the left side cell reference. Expressions in
the ORDER BY of a rule can involve constants, measures and dimension keys and you
can specify the ordering options [ASC | DESC] [NULLS FIRST | NULLS LAST] to get the
order you want.
You can also specify ORDER BY for rules in an AUTOMATIC ORDER model to make Oracle
consider cells in a particular order during rule evaluation. Rules are never considered
self-cyclic if they have ORDER BY. For example, to make the
following AUTOMATIC ORDER model with a self-cyclic formula:
MODEL
DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s)
RULES AUTOMATIC ORDER
(s[ANY] = s[CV(t)-1])
acyclic, you need to provide the order in which cells need to be accessed for
evaluation using ORDER BY. For example, you can say:
s[ANY] ORDER BY t = s[CV(t) - 1]
Then Oracle will pick an ACYCLIC algorithm (which is certain to produce the result) for
formula evaluation.
This would return a uniqueness violation error as the rowset input to model is not
unique on country and product:
ERROR at line 2:ORA-32638: Non unique addressing in MODEL dimensions
Another difference between UNIQUE DIMENSION and UNIQUE SINGLE REFERENCE semantics
is the number of cells that can be updated by a rule with a single cell reference on left
side. In the case of UNIQUEDIMENSION, such a rule can update at most one row as only
one cell would qualify the single cell reference on the left side. This is because the
input rowset would be unique on PARTITION BY and DIMENSION BYkeys.
With UNIQUE SINGLE REFERENCE, all cells that qualify the left side single cell reference
would be updated by the rule.
The following rules and restrictions apply when using the MODEL clause:
The only columns that can be updated are the columns specified in
the MEASURES subclause of the main SQL model. Measures of reference models
cannot be updated.
The MODEL clause is evaluated after all clauses in the query block
except SELECT DISTINCT, and ORDER BY clause are evaluated. These clauses and
expressions in the SELECT list are evaluated after the MODELclause.
If your query has a MODEL clause, then the query's SELECT and ORDER BY lists
cannot contain aggregates or analytic functions. If needed, these can be
specified in PARTITION BY, DIMENSION BY, and MEASURESlists and need to be
aliased. Aliases can then be used in the SELECT or ORDER BY clauses. In the
following example, the analytic function RANK is specified and aliased in
the MEASURES list of the MODELclause, and its alias is used in the SELECT list so that
the outer query can order resulting rows based on their ranks.
SELECT country, product, year, s, RNK
FROM (SELECT country, product, year, s, rnk
FROM sales_view
MODEL
PARTITION BY (country) DIMENSION BY (product, year)
MEASURES (sales s, year y, RANK() OVER (ORDER BY sales) rnk)
RULES UPSERT
(s['Bounce Increase 90-99', 2001] =
REGR_SLOPE(s, y) ['Bounce', year BETWEEN 1990 AND 2000],
s['Bounce', 2001] = s['Bounce', 2000] *
(1+s['Bounce increase 90-99', 2001])))
WHERE product <> 'Bounce Increase 90-99'
ORDER BY country, year, rnk, product;
When there is a multi-cell reference on the right hand side of a rule, you need
to apply a function to aggregate the measure values of multiple cells referenced
into a single value. You can use any kind of aggregate function for this
purpose: regular, OLAP aggregate (inverse percentile, hypothetical rank and
distribution), or user-defined aggregate.
You cannot use analytic functions (functions that use the OVER clause) in rules.
Only rules with positional single cell references on the left side
have UPSERT semantics. All other rules have UPDATE semantics, even when you
specify the UPSERT option for them.
Negative increments are not allowed in FOR loops. Also, no empty FOR loops are
allowed. FOR d FROM 2005 TO 2001 INCREMENT -1 is illegal. You should use FOR
d FROM 2005 TO 2001 DECREMENT 1instead. FOR d FROM 2005 TO 2001 INCREMENT
1 is illegal as it designates an empty loop.
You cannot use nested query expressions (subqueries) in rules except in
the FOR construct. For example, it would be illegal to issue the following:
SELECT *
FROM sales_view WHERE country = 'Poland'
MODEL DIMENSION BY (product, year)
MEASURES (sales sales)
RULES UPSERT
(sales['Bounce', 2003] = sales['Bounce', 2002] +
(SELECT SUM(sales) FROM sales_view));
This is because the rule has a subquery on its right side. Instead, you can
rewrite the preceding query in the following legal way:
SELECT *
FROM sales_view WHERE country = 'Poland'
MODEL DIMENSION BY (product, year)
MEASURES (sales sales, (SELECT SUM(sales) FROM sales_view) AS
grand_total)
RULES UPSERT
(sales['Bounce', 2003] =sales['Bounce', 2002] +
grand_total['Bounce', 2002]);
You can also use subqueries in the FOR construct specified on the left side of a
rule. However, they:
Cannot be correlated
Must return fewer than 10000 rows
Cannot be a query defined in the WITH clause
Will make the cursor unsharable
Nested cell references must be single cell references. Aggregates on nested cell
references are not supported. So, it would be illegal to say s['Bounce',
MAX(best_year)['Bounce', ANY]].
Only one level of nesting is supported for nested cell references on the main
model. So, for example, s['Bounce', best_year['Bounce', 2001]] is legal,
but s['Bounce', best_year['Bounce', best_year['Bounce', 2001]]] is not.
Nested cell references appearing on the left side of rules in
an AUTOMATIC ORDER model should not be updated in any rule of the model. This
restriction ensures that the rule dependency relationships do not arbitrarily
change (and hence cause non-deterministic results) due to updates to reference
measures.
Parallel Execution
Aggregate Computation
Using EXPLAIN PLAN to Understand Model Queries
Parallel Execution
MODEL clause computation is scalable in terms of the number of processors you have.
Scalability is achieved by performing the MODEL computation in parallel across the
partitions defined by the PARTITION BYclause. Data is distributed among processing
elements (also called parallel query slaves) based on the PARTITION BY key values such
that all rows with the same values for the PARTITION BY keys will go to the same slave.
Note that the internal processing of partitions will not create a one-to-one match of
logical and internally processed partitions. This way, each slave can finish MODEL
clause computation independent of other slaves. The data partitioning can be hash
based or range based. Consider the following MODEL clause:
MODEL
PARTITION BY (country) DIMENSION BY (product, time) MEASURES (sales)
RULES UPDATE
(sales['Bounce', 2002] = 1.2 * sales['Bounce', 2001],
sales['Car', 2002] = 0.8 * sales['Car', 2001])
Parallelism of the model computation is governed or limited by the way you specify
the MODEL clause. If your MODEL clause has no PARTITION BY keys, then the computation
cannot be parallelized (with exceptions mentioned in the following).
If PARTITION BY keys have very low cardinality, then the degree of parallelism will be
limited. In such cases, Oracle identifies the DIMENSION BY keys that can used for
partitioning. For example, consider a MODEL clause equivalent to the preceding one, but
without PARTITION BY keys as in the following:
MODEL
DIMENSION BY (country, product, time) MEASURES (sales)
RULES UPDATE
(sales[ANY, 'Bounce', 2002] = 1.2 * sales[CV(country), 'Bounce', 2001],
sales[ANY, 'Car', 2002] = 0.8 * sales[CV(country), 'Car', 2001])
Aggregate Computation
The MODEL clause processes aggregates in two different ways: first, the regular
fashion in which data in the partition is scanned and aggregated and second, an
efficient window style aggregation. The first type as illustrated in the following
introduces a new dimension member ALL_2002_products and computes its value to
be the sum of year 2002 sales for all products:
MODEL PARTITION BY (country) DIMENSION BY (product, time) MEASURES (sale
sales)
RULES UPSERT
(sales['ALL_2002_products', 2002] = SUM(sales)[ANY, 2002])
To evaluate the aggregate sum in this case, each partition will be scanned to find the
cells for 2002 for all products and they will be aggregated. If the left side of the rule
were to reference multiple cells, then Oracle will have to compute the right side
aggregate by scanning the partition for each cell referenced on the left. For example,
consider the following example:
MODEL PARTITION BY (country) DIMENSION BY (product, time)
MEASURES (sale sales, 0 avg_exclusive)
RULES UPDATE
(avg_exclusive[ANY, 2002] = AVG(sales)[product <> CV(product), CV(time)])
This rule calculates a measure called avg_exclusive for every product in 2002. The
measure avg_exclusive is defined as the average sales of all products excluding the
current product. In this case, Oracle scans the data in a partition for every product in
2002 to calculate the aggregate, and this may be expensive
Oracle Database will optimize the evaluation of such aggregates in some scenarios
with window-style computation as used in analytic functions. These scenarios involve
rules with multi-cell references on their left side and computing window computations
such as moving averages, cumulative sums and so on. Consider the following
example:
MODEL PARTITION BY (country) DIMENSION BY (product, time)
MEASURES (sale sales, 0 mavg)
RULES UPDATE
(mavg[product IN ('Bounce', 'Y Box', 'Mouse Pad'), ANY] =
AVG(sales)[CV(product), time BETWEEN CV(time)
AND CV(time) - 2])
It computes the moving average of sales for products Bounce, Y Box, and Mouse Pad
over a three year period. It would be very inefficient to evaluate the aggregate by
scanning the partition for every cell referenced on the left side. Oracle identifies the
computation as being in window-style and evaluates it efficiently. It sorts the input on
product, time and then scans the data once to compute the moving average. You can
view this rule as an analytic function being applied on the sales data for products
Bounce, Y Box, and Mouse Pad:
AVG(sales) OVER (PARTITION BY product ORDER BY time
RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
This computation style is called WINDOW (IN MODEL) SORT. This style of aggregation is
applicable when the rule has a multi-cell reference on its left side with no ORDER BY,
has a simple aggregate (SUM, COUNT,MIN, MAX, STDEV, and VAR) on its right side, only one
dimension on the right side has a boolean predicate (<, <=, >, >=, BETWEEN), and all other
dimensions on the right are qualified with CV.
You will see a line in the main explain plan output showing the model and the
algorithm used. Reference models are tagged with the keyword REFERENCE in the plan
output. Also, Oracle annotates the plan with WINDOW (IN MODEL) SORT if any of the
rules qualify for window-style aggregate computation.
By examining an explain plan, you can find out the algorithm chosen to evaluate your
model. If your model has SEQUENTIAL ORDER semantics, then ORDERED is displayed.
For AUTOMATIC ORDER models, Oracle displays ACYCLIC or CYCLIC based on whether it
chooses ACYCLIC or CYCLIC algorithm for evaluation. In addition, the plan output will
have an annotation FAST in case of ORDERED and ACYCLIC algorithms if all left side cell
references are single cell references and aggregates, if any, on the right side of rules
are simple arithmetic non-distinct aggregates like SUM, COUNT, AVG, and so on. Rule
evaluation in this case would be highly efficient and hence the annotation FAST. Thus,
the output you will see in the explain plan would be MODEL {ORDERED [FAST] |
ACYCLIC [FAST] | CYCLIC}.
This model has only single cell references on the left side of rules and the
aggregate AVG on the right side of first rule is a simple non-distinct aggregate:
EXPLAIN PLAN FOR
SELECT country, prod, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL UNIQUE DIMENSION
PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
RULES UPSERT
(sales['Bounce', 2003] = AVG(sales)[ANY, 2002] * 1.24,
sales['Y Box', 2003] = sales['Bounce', 2003] * 0.25);
Using ORDERED: Example
Because the left side of the second rule is a multi-cell reference, the FAST method will
not be chosen in the following:
EXPLAIN PLAN FOR
SELECT country, prod, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL UNIQUE DIMENSION
PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
RULES UPSERT
(sales['Bounce', 2003] = AVG(sales)[ANY, 2002] * 1.24
sales[prod <> 'Bounce', 2003] = sales['Bounce', 2003] * 0.25);
Rules in this model are not cyclic and the explain plan will show ACYCLIC.
The FAST method is chosen in this case as well.
EXPLAIN PLAN FOR
SELECT country, prod, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL UNIQUE DIMENSION
PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
RULES UPSERT AUTOMATIC ORDER
(sales['Y Box', 2003] = sales['Bounce', 2003] * 0.25,
sales['Bounce', 2003] = sales['Bounce', 2002] / SUM(sales)[ANY, 2002] * 2
* sales['All Products', 2003],
sales['All Products', 2003] = 200000);
Rules in this model are not cyclic. The PERCENTILE_DISC aggregate that gives the
median sales for year 2002, in the second rule is not a simple aggregate function.
Therefore, Oracle will not choose the FASTmethod, and the explain plan will just
show ACYCLIC.
SELECT country, prod, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL UNIQUE DIMENSION
PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
RULES UPSERT AUTOMATIC ORDER
(sales['Y Box', 2003] = sales['Bounce', 2003] * 0.25,
sales['Bounce',2003] = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY
sales)[ANY,2002] / SUM(sales)[ANY, 2002] * 2 * sales['All Products',
2003],
sales['All Products', 2003] = 200000);
Using CYCLIC: Example
Oracle chooses CYCLIC algorithm for this model as there is a cycle among second and
third rules.
EXPLAIN PLAN FOR
SELECT country, prod, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL UNIQUE DIMENSION
PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
IGNORE NAV RULES UPSERT AUTOMATIC ORDER
(sales['All Products', 2003] = 200000,
sales['Y Box', 2003] = sales['Bounce', 2003] * 0.25,
sales['Bounce', 2003] = sales['Y Box', 2003] +
(sales['Bounce', 2002] / SUM(sales)[ANY, 2002] * 2 *
sales['All Products', 2003]));
Show the sales for Italy and Spain and the difference between the two for each
product. The difference should be placed in a new row with country = 'Diff Italy-
Spain'.
If sales for each product in each country grew (or declined) at the same monthly rate
from November 2000 to December 2000 as they did from October 2000 to November
2000, what would the fourth quarter's sales be for the whole company and for each
country?
SELECT country, SUM(sales)
FROM (SELECT product, country, month, sales
FROM sales_view2 WHERE year=2000 AND month IN
('October','November')
MODEL
PARTITION BY (product, country) DIMENSION BY (month) MEASURES (sale sales)
RULES
(sales['December']=(sales['November'] /sales['October'])
*sales['November']))
GROUP BY GROUPING SETS ((),(country));
You want to calculate the net present value (NPV) of a series of periodic cash flows.
Your scenario involves two projects, each of which starts with an initial investment at
time 0, represented as a negative cash flow. The initial investment is followed by three
years of positive cash flow. First, create a table (cash_flow) and populate it with some
data, as in the following statements:
CREATE TABLE cash_flow (year DATE, i INTEGER, prod VARCHAR2(3), amount
NUMBER);
INSERT INTO cash_flow VALUES (TO_DATE('1999', 'YYYY'), 0, 'vcr', -100.00);
INSERT INTO cash_flow VALUES (TO_DATE('2000', 'YYYY'), 1, 'vcr', 12.00);
INSERT INTO cash_flow VALUES (TO_DATE('2001', 'YYYY'), 2, 'vcr', 10.00);
INSERT INTO cash_flow VALUES (TO_DATE('2002', 'YYYY'), 3, 'vcr', 20.00);
INSERT INTO cash_flow VALUES (TO_DATE('1999', 'YYYY'), 0, 'dvd', -200.00);
INSERT INTO cash_flow VALUES (TO_DATE('2000', 'YYYY'), 1, 'dvd', 22.00);
INSERT INTO cash_flow VALUES (TO_DATE('2001', 'YYYY'), 2, 'dvd', 12.00);
INSERT INTO cash_flow VALUES (TO_DATE('2002', 'YYYY'), 3, 'dvd', 14.00);
To calculate the NPV using a discount rate of 0.14, issue the following statement:
SELECT year, i, prod, amount, npv
FROM cash_flow
MODEL PARTITION BY (prod)
DIMENSION BY (i)
MEASURES (amount, 0 npv, year)
RULES
(npv[0] = amount[0],
npv[i !=0] ORDER BY i =
amount[CV()]/ POWER(1.14,CV(i)) + npv[CV(i)-1]);
You want your interest expenses to equal 30% of your net income (net=pay minus tax
minus interest). Interest is tax deductible from gross, and taxes are 38% of salary and
28% capital gains. You have salary of $100,000 and capital gains of $15,000. Net
income, taxes, and interest expenses are unknown. Observe that this is a simultaneous
equation (net depends on interest, which depends on net), thus the ITERATEclause is
included.
The sales of Bounce in 2001 will increase in comparison to 2000 as they did in the
last three years (between 1998 and 2000). Sales of Shaving Cream in 2001 will also
increase in comparison to 2000 as they did between 1998 and 2000. To calculate the
increase, use the regression function REGR_SLOPE as follows. Because we are
calculating the next period's value, it is sufficient to add the slope to the 2000 value.
SELECT * FROM
(SELECT country, product, year, projected_sale, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan') AND product IN ('Shaving Cream',
'Bounce')
MODEL
PARTITION BY (country) DIMENSION BY (product, year)
MEASURES (sales sales, year y, CAST(NULL AS NUMBER) projected_sale) IGNORE
NAV
RULES UPSERT
(projected_sale[FOR product IN ('Bounce', 'Shaving Cream'), 2001] =
sales[CV(), 2000] +
REGR_SLOPE(sales, y)[CV(), year BETWEEN 1998 AND 2000]))
ORDER BY country, product, year;
This example creates mortgage amortization tables for any number of customers,
using information about mortgage loans selected from a table of mortgage facts. First,
create two tables and insert needed data:
mortgage_facts
Holds information about individual customer loans, including the name of the
customer, the fact about the loan that is stored in that row, and the value of that
fact. The facts stored for this example are loan (Loan), annual interest rate
(Annual_Interest), and number of payments (Payments) for the loan. Also, the
values for two customers, Smith and Jones, are inserted.
CREATE TABLE mortgage_facts (customer VARCHAR2(20), fact VARCHAR2(20),
amount NUMBER(10,2));
INSERT INTO mortgage_facts VALUES ('Smith', 'Loan', 100000);
INSERT INTO mortgage_facts VALUES ('Smith', 'Annual_Interest', 12);
INSERT INTO mortgage_facts VALUES ('Smith', 'Payments', 360);
INSERT INTO mortgage_facts VALUES ('Smith', 'Payment', 0);
INSERT INTO mortgage_facts VALUES ('Jones', 'Loan', 200000);
INSERT INTO mortgage_facts VALUES ('Jones', 'Annual_Interest', 12);
INSERT INTO mortgage_facts VALUES ('Jones', 'Payments', 180);
INSERT INTO mortgage_facts VALUES ('Jones', 'Payment', 0);
mortgage
Holds output information for the calculations. The columns are customer,
payment number (pmt_num), principal applied in that payment (principalp),
interest applied in that payment (interestp), and remaining loan balance
(mort_balance). In order to upsert new cells into a partition, you need to have at
least one row pre-existing per partition. Therefore, we seed the mortgage table
with the values for the two customers before they have made any payments.
This seed information could be easily generated using a SQL Insert statement
based on the mortgage_fact table.
CREATE TABLE mortgage_facts (customer VARCHAR2(20), fact VARCHAR2(20),
amount NUMBER(10,2));
2 through 4: These lines mark the start and end of the reference model labeled R. This
model defines defines a SELECT statement that calculates the monthly payment amount
for each customer's loan. TheSELECT statement uses its own MODEL clause starting at the
line labeled 3 with a single rule that defines the amt value based on information from
the mortgage_facts table. The measure returned by reference model R is amt,
dimensioned by customer name cust and fact value fact as defined in the line labeled
4.
The reference model is computed once and the values are then used in the main model
for computing other calculations. Reference model R will return a row for each
existing row of mortgage_fact, and it will return the newly calculated rows for each
customer where the fact type is Payment and the amt is the monthly payment amount. If
we wish to use a specific amount from the R output, we address it with the
expression r.amt[<customer_name>,<fact_name>].
5: This is the continuation of the main model definition. We will partition the output
by customer, aliased as c.
7: Four measures are defined: principalp (pp) is the principal amount applied to the
loan in the month, interestp (ip) is the interest paid that month, mort_balance (m) is
the remaining mortgage value after the payment of the loan, and customer (mc) is used
to support the partitioning.
8: This begins the rules block. It will perform the rule calculations up to 1000 times.
Because the calculations are performed once for each month for each customer, the
maximum number of months that can be specified for a loan is 1000. Iteration is
stopped when the ITERATION_NUMBER+1 equals the amount of payments derived from
reference R. Note that the value from reference R is the amt (amount) measure defined
in the reference clause. This reference value is addressed
as r.amt[<customer_name>,<fact>]. The expression used in the iterate
line, "r.amt[mc[0], 'Payments']" is resolved to be the amount from reference R,
where the customer name is the value resolved by mc[0]. Since each partition contains
only one customer, mc[0] can have only one value. Thus "r.amt[mc[0],
'Payments']" yields the reference clause's value for the number of payments for the
current customer. This means that the rules will be performed as many times as there
are payments for that customer.
9 through 11: The first two rules in this block use the same type of r.amt reference
that was explained in 8. The difference is that the ip rule defines the fact value
as Annual_Interest. Note that each rule refers to the value of one of the other
measures. The expression used on the left side of each
rule, "[ITERATION_NUMBER+1]" will create a new dimension value, so the measure will
be upserted into the result set. Thus the result will include a monthly amortization row
for all payments for each customer.
The final line of the example sorts the results by customer and loan payment number.