
Tuning Informix DSS Queries

by: Jack Parker


Arten Technology Group
May 2002

This article originally appeared in Informix Tech Notes Volume 10, Issue 1, 2000. It was written for IBM
Informix Dynamic Server (IDS). The contents hold true for IBM Informix Foundation as well. I have
made some additional comments for IBM Informix Extended Parallel Server (XPS).

Introduction
Over the past several months I have found myself repeatedly explaining the concept of tuning a decision
support system (DSS) query for an IBM Informix database.

Data warehouses and very large databases tend to be used differently from traditional online transaction
processing (OLTP) database systems. In an OLTP environment, the goal is to execute many discrete transactions in a
short timeframe. OLTP administrators support hundreds or thousands of users and tend to focus on very
specific portions of the data (for example, we must be able to update item four of the customer’s order as
well as the balance in their order master record). Such an OLTP transaction might be performed with one
or more UPDATE statements WHERE cust_id/ord_no = ?.

In a DSS environment, you are more likely to have only a handful of users, each of whom is making
queries against a large volume of data. In an OLTP model, users like this would be severely chastised.
However, in the world of DSS, this sort of activity is encouraged. Typically, a DSS query will scan every
row of one or more tables. Also typically, where the size of an OLTP system might be measured in terms
of hundreds of megabytes, a DSS system will be measured in terms of hundreds of gigabytes.

DSS systems tend to be queried by front-end tools such as Business Objects or SAS, where the SQL to
perform the query is generated by the tool; you frequently do not have much control over the query itself.

If you try to support a DSS query using the same data model, indexed reads, and engine tuning used in an
OLTP system, it is likely that your response time to such a query could be measured in days. In a recent
case, the response time for such a query was 40 days. In this particular case, using the tuning methods
detailed in this article, I was able to bring the query result time down to about 15 minutes. This was
possible without changing the query or changing the schema. This article describes specific tuning
procedures for shortening the run time of DSS queries.

Monitoring tools
Before examining tuning procedures to optimize this DSS query, you need to understand how to use
specific monitoring tools available to help you understand how the query will behave. The following
monitoring tools are very helpful for examining queries:
• Explain plan
• Xtree
• onstat –g ses and onstat –g sql commands
• onstat –g mgm command
• onstat –g lsc command
• onstat –g xqp and xqs (XPS-specific)
• xmp_mon
Explain plan
The first monitoring tool is the explain plan. This can be obtained by including a SET EXPLAIN ON;
command at the start of your query. The optimizer will then create, or append to, a file called sqexplain.out
in which it explains how it plans to resolve the query, as shown in Listing 1.
Listing 1. Explain plan output
QUERY:
------
select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key
group by 1,2

Estimated Cost: 121157


Estimated # of Rows Returned: 1

1) informix.b: SEQUENTIAL SCAN (Serial, fragments: ALL)

2) informix.a: SEQUENTIAL SCAN (Serial, fragments: ALL)

Temporary table required for group operation

DYNAMIC HASH JOIN (Build Outer)


Dynamic Hash Filters: informix.b.key = informix.a.key

Notice that the estimated cost is very low; somebody forgot to run UPDATE STATISTICS.

The initial section of this EXPLAIN plan is the query itself followed by an estimated cost, which is derived
from the number of instructions the optimizer thinks will be required to resolve the query. The value given
here is way too low for this query (a value in the billions would be more normal).

The estimated number of rows returned is an indication that the optimizer thinks that at least one of the
tables is empty. A minimum of one row is always returned from a count(*) operation.

Each of the tables will be read using a sequential scan, probably because it costs nothing to read an empty
table. The “Serial” is an indication that each fragment of disk that the table resides on will be read
sequentially as opposed to the desired “Parallel.” No fragments have been eliminated from the query
(hence, the “fragments: ALL”).

The temporary table is, of course, required to perform the actual group by operation. A hash join will be
used to join the two tables, again because the optimizer thinks one of the tables is empty and because there
are no indices on these tables.

What we would really like to see is a higher estimated cost. We like the sequential scan, although we really
want a parallel, not serial, scan. If we had been able to eliminate some of the fragments through a WHERE
condition (a filter), we might also have seen only a list of the fragments to be scanned. Finally, we would
also like to see a line like:

Maximum Threads: 21

We would actually like to see a higher value indicating more parallelism, but for the sake of this example,
we will say 21. This would indicate that we are using PDQPRIORITY. The exact number of threads will
vary with the number of dbspaces each table is spread across and the number of threads assigned to each
additional task. As can be seen in the output from the onstat -g ses command shown in Listing 2, we have
nine scan threads for one table, 11 scan threads for the second table and a single sqlexec thread to pull it all
together. With a higher PDQPRIORITY setting we would see more specialized threads for grouping and
sorting.
These key pieces of information can tell you that something is wrong with the way the optimizer has
chosen to solve the query.
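
For reference, a minimal sketch of how explain output like Listing 1 might be generated and then switched off again, assuming the query is run from dbaccess:

SET EXPLAIN ON;
SELECT a.state_cd, b.last_name, COUNT(*)
FROM tab1 a, tab2 b
WHERE a.key = b.key
GROUP BY 1, 2;
SET EXPLAIN OFF;

Because the optimizer appends to sqexplain.out on each run, it is worth removing or truncating the file between runs so old plans are not confused with new ones.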

Xtree
While the explain plan can show you how the query will be answered, Xtree allows you to monitor the
query as it is being executed. You can see how many threads have been initiated for each segment of the
process as well as determine a run rate for the query.

Figure 1 below shows an Xtree screen shot for a typical hash join. At the bottom of this figure are two sets
of scan threads: the one on the right has nine threads, the one on the left is running with 11 threads. The
table on the right was read into memory in the hash table, and the table on the left is being scanned and
matched against this hash table while being read.

The numbers to the right of each box indicate how many threads are in use to perform this portion of the
query. Within the box, below the meter, is a number that indicates how many rows have been processed by
this portion of the query in the last second. At the top of the box is the number of rows processed so far.
About 114,000 rows were read from the table on the left in the last second, with about 89,000 matches
performed in that same second. When a hash table swap occurs, all of the process-per-second counters will
drop to zero; by watching these you can figure out how often and how long the process is spending in a
hash table swap.

The numbers next to the scan boxes should match the number of dbspaces the table is fragmented across.
On this machine, I would expect to see on the order of 50,000 to 70,000 rows scanned per fragment per
second. If this number drops off at the end, then one or more of the fragments probably completed before
the others. If this occurs in the last 5-10 seconds, that is fine, but if this drop-off occurs halfway through the
query, you might want to check the fragmentation scheme for data skew.

The filter box should reduce the number of rows scanned if you are using a filter (e.g., where state_cd
= ‘MA’). The total number filtered should keep pace easily with the scan rate.

Figure 1: Xtree hash join


Figure 2 shows a typical nested loop join. The table on the left is being sequentially scanned; a probe into
the second table is then performed for each row from this first table. As you can see, during the second that
this snapshot was taken, the query joined all of 1373 rows, which is not very good. Here, the join itself
limits the scan rate. What is also interesting is that you can see that more rows have been joined than read,
an indication of duplicate rows, which is acceptable because multiple people live at the same address.

Figure 2: Xtree Nested Loop Join


onstat –g ses and onstat –g sql commands
The commands onstat –g ses and onstat –g sql (Listing 2) reveal some key numbers—specifically, how
many and what type of threads will be used to execute the query, as well as how much memory your query
has been allocated.

Listing 2. Output from the onstat –g ses and onstat –g sql commands
session #RSAM total used
id user tty pid hostname threads memory memory
320 informix 6 25979 foo 21 352256 278408

tid name rstcb flags curstk status


48671 sqlexec e041518 ---P--- 1984 e041518 ready

Controlling sqlexec thread

48672 scan_1.0 e03fb98 ------- 984 e03fb98 sleeping(secs: 3)


48673 scan_1.1 e040c98 ------- 984 e040c98 running
48674 scan_1.2 e045918 ------- 984 e045918 running
48675 scan_1.3 e046e58 ------- 984 e046e58 sleeping(secs: 3)
48676 scan_1.4 e0443d8 ------- 984 e0443d8 ready
48677 scan_1.5 e044818 ----R-- 984 e044818 running
48678 scan_1.6 e044c58 ------- 984 e044c58 ready
48679 scan_1.7 e045098 Y------ 936 e045098 cond wait(await_MC1)
48680 scan_1.8 e048398 Y------ 936 e048398 cond wait(await_MC1)

9 Scan threads (0-8) for Table 1

48681 scan_2.0 e042618 Y------ 936 e042618 cond wait(await_MC2)


48682 scan_2.1 e043718 Y------ 936 e043718 cond wait(await_MC2)
48683 scan_2.2 e0410d8 Y------ 936 e0410d8 cond wait(await_MC2)
48684 scan_2.3 e0421d8 Y------ 936 e0421d8 cond wait(await_MC2)
48685 scan_2.4 e046a18 Y------ 936 e046a18 cond wait(await_MC2)
48686 scan_2.5 e041d98 Y------ 936 e041d98 cond wait(await_MC2)
48687 scan_2.6 e047298 Y------ 936 e047298 cond wait(await_MC2)
48688 scan_2.7 e045d58 Y------ 936 e045d58 cond wait(await_MC2)
48689 scan_2.8 e0454d8 Y------ 936 e0454d8 cond wait(await_MC2)
48690 scan_2.9 e03eed8 Y------ 936 e03eed8 cond wait(await_MC2)
48691 scan_2.1 e0487d8 Y------ 936 e0487d8 cond wait(await_MC2)

11 Scan threads (0-10) for Table 2

Memory pools count 1


name class addr totalsize freesize #allocfrag #freefrag
320 V e2de018 352256 73848 1420 14

name free used name free used


overhead 0 120 scb 0 96
opentable 0 31208 filetable 0 8384
log 0 44856 temprec 0 17680
ralloc 0 38560 gentcb 0 18728
ostcb 0 2024 sqscb 0 7664
rdahead 0 1600 xchg_desc 0 7392
xchg_port 0 2312 xchg_packet 0 5280
xchg_group 0 144 xchg_priv 0 1352
hashfiletab 0 5880 osenv 0 1672
buft_buffer 0 19224 sqtcb 0 24192
fragman 0 36720 shmblklist 0 3320

Sess SQL Current Iso Lock SQL ISAM F.E.


Id Stmt type Database Lvl Mode ERR ERR Vers
320 SELECT hercules NL Not Wait 0 0 7.24

Current statement name : slctcur


select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key

Current SQL Statement

Last parsed SQL statement :


select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key

The output contains a lot of information, probably more than you need, but there are several key things to
look for. The current statement name gives you the SQL of the query any session is actually
running. You can also see how many threads were kicked off and of what types (scan_2.3 and so on): we got
nine scan threads against one table, 11 against the other, and a single sqlexec thread that will perform the
hash join with a PDQPRIORITY of 1 (I know this because I started the query; use onstat –g mgm to see
the PDQPRIORITY of a running query).
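
A minimal sketch of gathering this information, assuming the session id is 320 as in Listing 2 (run the bare form first to find the id):

onstat -g ses        # list all sessions with thread counts and memory totals
onstat -g ses 320    # thread detail and memory pools for session 320
onstat -g sql 320    # current and last parsed SQL for session 320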

onstat –g mgm command


The onstat –g mgm (rgm under XPS) command also shows you memory use, but it is especially
important for multiple queries, because it shows which query is waiting for resources. The Memory (Resource)
Grant Manager shows which queries are actually running and with what actual PDQ setting, as well as
which queries have been “gated,” that is, are waiting for resources (memory or PDQ, for example) before they
can run. See Listing 3.

Listing 3. Output from the onstat –g mgm command

Memory Grant Manager (MGM)

MAX_PDQPRIORITY: 100
DS_MAX_QUERIES: 20
DS_MAX_SCANS: 40
DS_TOTAL_MEMORY: 20000 KB

Queries:        Active   Ready   Maximum
                1        0       20

Memory:         Total    Free    Quantum
   (KB)         20000    0       1000

Scans:          Total    Free    Quantum
                40       39      1

Load Control:    (Memory)   (Scans)   (Priority)   (Max Queries)
                 Gate 1     Gate 2    Gate 3       Gate 4
(Queue Length)   0          0         0            0

Active Queries:
Session Query Priority Thread Memory Scans Gate
33 105e4060 100 106335d70 2500/2500 2/1 -

Ready Queries: None


This is actually output from a 9.20 engine; it contains the same basic elements as the 7.x and 8.x
memory/resource grant manager output. In this case the instance is running on a Linux machine with limited
memory. As you can see, there is 20 MB of DS memory, which is assigned in quantums of 1 MB. There are
40 scans allowed, of which one is running. The query in question (session 33 at the bottom) is using all of
the memory and one scan of the two it needs. Of particular interest in a multi-user environment are the Gates.
If a second query had been submitted, it would be listed under “Ready Queries,” and the queue lengths under
Memory and Priority would stand at 1 because the query could not run without memory or
PDQPRIORITY.

onstat –g lsc command


The onstat –g lsc (scn for XPS) command will show if there are any light scans. If there is no output from
this command, there are no light scans. Light scans are frequently crucial to DSS queries, as we shall see
below.

onstat –g xqp and xqs commands (XPS- specific)


Under XPS, there is no Xtree tool. There is, however, much more detailed information available than from
Xtree. From the onstat –g rgm output you will see a plan number. Plug this into onstat –g xqp plan_no and
out will come the plan that will be used to resolve the query, as shown in Listing 4:

Listing 4. Output from onstat –g xqp command


XMP Query Plan

oper segid brid width misc info


-----------------------------------------
scan 4 0 16 tab1
group 4 0 16
group 3 0 16
group 3 0 16
group 2 0 16
insert 1 0 32

The output in Listing 4 is referring to the following query:

insert into tab_2 select col1, count(unique col2) from tab1 group by 1

This is running on an XPS 8.30.UC2 with 16 processors and 8 GB of DS memory.

Operations are performed in descending order based on segid. So the first operation is the scan/group. The
group itself is reported twice, once for an input and once for an output. In this case, the ‘unique’ clause
forced a second group. Finally the data is inserted into a final table.

Listings 5-10 show partial output for the command onstat –g xqs for this plan_id.
In the interests of brevity, I have deleted all but the first and last three lines of each segment.
Listing 5. Partial output of command onstat –g xqs (part 1)
XMP Query Statistics

Cosvr_ID: 1
Plan_ID: 11961

type segid brid information


---- ----- ---- -----------
scan 4 0 inst cosvr time rows_prod rows_scan
---- ----- ---- --------- ---------
0 1 1916 39433808 39433808
1 1 1919 39471825 39471825
2 1 1903 39422536 39422536
……
13 4 1930 39410387 39410387
14 4 1936 39364283 39364283
15 4 1936 39375704 39375704
--------------------------------------
16 630463773 630463773

In this segment, there were 16 threads initiated. Each thread read 39.4 million rows in about 32 minutes.
(The time column is displayed in units of seconds.) Not great. There was no filter, so the number of rows
produced (rows_prod) is equal to the number of rows scanned (rows_scan).
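
As a rough sanity check on that "not great": about 630 million rows in roughly 1,930 seconds works out to around 325,000 rows per second in aggregate, or only about 20,000 rows per second per scan thread, well below the per-fragment scan rates discussed elsewhere in this article.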

Listing 6. Partial output of command onstat –g xqs (part 2)


group 4
inst cosvr time rows_prod rows_cons mem ovfl
---- ----- ---- --------- --------- --- ----
0 1 1916 39433808 39433808 104 0
1 1 1919 39471825 39471825 104 0
2 1 1903 39422536 39422536 104 0
…..
13 4 1930 39410387 39410387 104 0
14 4 1936 39364283 39364283 104 0
15 4 1936 39375704 39375704 104 0
--------------------------------------------------------
16 630463770 630463773 (2048)

Each ‘group’ operation is broken into two phases: a collection of data and the actual group. Hence the
segment shown in Listing 6 indicates that we pushed rows into the first group (unique) operation. Again,
16 threads took the same amount of time, which was concurrent with the scan time. The number of rows
produced is identical to the number of rows pushed in (rows_cons). The amount of memory for this
operation was 2K (across 16 threads).

Listing 7. Partial output of command onstat –g xqs (part 3)


group 3
inst cosvr time rows_prod rows_cons mem ovfl
---- ----- ---- --------- --------- --- ----
0 1 12639 39059408 39432594 49345600 199
1 2 12676 39071879 39442561 49328000 199
2 3 12203 39054649 39429975 49336000 199
…..
13 2 12672 38998788 39393276 49350400 199
14 3 12103 39003511 39390257 49345600 199
15 4 11942 39011955 39381624 49329600 199
--------------------------------------------------------
16 624401945 630463770 (3958528)

Listing 7 is the second half of the group segment shown in Listing 6, and is where the unique rows come
out. Notice that this phase used almost 4 GB of memory. The ovfl column indicates that memory
overflowed. The ovfl value is not a count of how many pages overflowed, but of how many segments. It is hard
to tell from this alone whether the overflow was serious. However, given that this segment took
three hours, we can surmise it was serious. This is probably what gated the scan.

Listing 8. Partial output of command onstat –g xqs (part 4)


group 3
inst cosvr time rows_prod rows_cons mem ovfl
---- ----- ---- --------- --------- --- ----
0 1 12639 39059408 39059408 104 0
1 2 12676 39071879 39071879 104 0
2 3 12203 39054649 39054649 104 0
…..
13 2 12672 38998788 38998788 104 0
14 3 12103 39003511 39003511 104 0
15 4 11942 39011955 39011955 104 0
--------------------------------------------------------
16 624401944 624401945 (2048)

Listing 8 contains the second group necessitated by the count(*)/group operation. Again, this is just
pushing rows into the group operation. All of this time is concurrent with previous processes.

Listing 9. Partial output of command onstat –g xqs (part 5)


group 2
inst cosvr time rows_prod rows_cons mem ovfl
---- ----- ---- --------- --------- --- ----
0 1 13203 21639904 39058139 43422720 199
1 2 13136 20410129 39069022 40452896 199
2 3 13201 22846215 39073721 46132152 199
…..
13 2 13225 21678939 39022613 43424128 199
14 3 13180 21694541 39009538 43659528 199
15 4 13184 22908683 39014754 46380352 199
--------------------------------------------------------
16 352937467 624401944 (3958528)

Listing 9 shows the second phase of the second group operation. Because of the way this query was
crafted, the other 4GB of memory was used for this operation. One strategy to solve this performance
problem would be to separate the two groups – perform a select unique into one table, and then perform the
count off of that second table.
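
A minimal sketch of that two-step rewrite; the temp table name uniq_tab1 is hypothetical, and col1/col2 are the columns from the original insert statement:

select unique col1, col2
from tab1
into temp uniq_tab1 with no log;

insert into tab_2
select col1, count(*)
from uniq_tab1
group by 1;

drop table uniq_tab1;

Each statement then needs memory for only one group operation at a time, at the cost of landing the intermediate result on disk.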

Listing 10. Partial output of command onstat –g xqs (part 6)

insert 1 0 inst cosvr time it_count


---- ----- ---- --------
0 1 2380 10979878
1 2 2380 10979878
2 3 2380 10979878
…..
29 2 2380 10979878
30 3 2380 10979878
31 4 2380 10979878
--------------------------
32 351356096

Listing 10 shows the final insert phase of the query, writing to the temp table. This phase of the operation
was not started until the second group started issuing output, hence the time is much lower.

The xmp_mon script


Output from onstat –g xqs is also written to the explain plan. With XPS 8.31, xqs output is only written as
each thread completes. For realtime output, you must use the onstat –g xmp command, which can be
overwhelming. I have my own script which gathers this sort of information into a usable format. (This
script is available from www.artentech.com/downloads/xmp_mon)
Sample output from xmp_mon appears in Listing 11.

Listing 11. Output from xmp_mon


===Sat Apr 27 10:48:57 EDT 2002===================================
User: informix Isolation: DR PDQ 50.00-50.00 Plan : 15 Actual Memory:
8000000 (50%)
Scheduling level : 50

Current SQL Statement :


insert into tab_2 select col1, count(unique col2) from tab1 group by 1

scan 4 tab1
group 4
Operation Phase Threads Rows
scan next 16 304067921

group 3
group 3
Operation Phase Threads Rows
xchg next 16 302983421
group create 16 0

group 2
insert 1

The output shows that the query is in the process of scanning the tab1 table. 300 million rows have been
scanned and passed into the first group operation, where they are waiting. At the top are a few key
elements: the user, isolation level, PDQ setting, memory, and the current SQL statement.

An example query
Imagine you have a query where you join two tables to return a count of last name by state, where name is
in Table 1 and address (state) is in Table 2:

select a.state_cd, b.last_name, count(*)


from tab1 a, tab2 b
where a.key = b.key
group by 1,2

Table 1 has 29 million rows and Table 2 has 27 million rows. Table 1 is 4.5 GB and Table 2 is 6 GB. Each
table has been placed into a single dbspace. There is no index on either key in either table. The machine in
use is an 8-processor Sun 6500 with 4 GB of memory. If this query were to run in an OLTP environment,
it could take upwards of 70 or 80 hours to execute.

Frequently, tuning a DSS query is merely a matter of shifting from “OLTP-think” to “DSS-think,” which is
exactly what we will do in this case.

The optimizer should choose to solve the example query with a hash join. If it does not, then there may be
another issue, like the optimizer thinks one of the tables is empty (so you need to run UPDATE
STATISTICS) as is the case in our query here. Let’s assume that PDQPRIORITY is not set to anything,
which is typical of an OLTP environment. This means that it is effectively turned off. What will happen
when the query is executed?

With these settings, the engine first reads the smaller table (Table 2), applies any filters, and builds a hash
table entry for each record. It builds this hash table in memory. It then reads the large table and matches it
up against the hash table. As it finds matches, it pushes the matched rows out the other side of the join.
It is possible that a filter on the ‘larger’ of two tables will make the result set from that table smaller than
the result set from the ‘smaller’ table. If the optimizer knows this, it will properly choose to read the larger
table first, applying the filter, so that the hash table doesn’t take up so much memory. The optimizer needs
distribution information to make this determination accurately, and this can only come from running
UPDATE STATISTICS HIGH or MEDIUM.

Thirty-seven minutes into this query, Xtree shows both tables have been scanned and the query has moved into the hash
join. Because insufficient memory was allocated, the hash join is stalled while it swaps hash table pages
from the temp disk. As the hash table grows, more and more memory is needed. Because the query is
executing in an OLTP environment, the LRU buffers are filled with all the
data from the table, and with the associated buffer management overhead, spoiling everybody’s day (or
rather week or month). Because the hash table memory requirements are so large, the query overflowed to
the temp spaces, and if these weren’t set up well, it may have taken away the temp space for everyone else
as well. After running for several days, you would probably just kill the query anyway. This is where a
typical 40-day query starts from.

Basic tuning procedures


There are three basic time-consuming elements to the example query: a table scan, a second table scan,
and the join. When a nested loop join is used (where the larger table is probed through its index for every record in
the smaller table), the join is part of the second read. In a hash join, the join occurs while the second
table is read. In most DSS cases we want to use a hash join.

When tuning this query, it helps to approach it in two parts: the table scans and the join. To begin, let’s
obtain a benchmark of how long it takes just to scan each table by using the following queries. This will
not only give us the ability to estimate query time but also highlight when a query slows down, as well as
ensure that we can read the tables themselves fairly quickly.

SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;
SELECT SUM(some_column) from tab1;
SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;
SELECT SUM(some_other_column) from tab2;
SELECT CURRENT HOUR TO SECOND FROM SYSTABLES WHERE TABID = 1;

The results show: 17 minutes for Table 1 and 20 minutes for Table 2, which is about 30,000 rows per
second. These results are interpolated. The actual read time was 37 minutes for both tables.

Spread the data out


The first tuning objective is to improve the read performance. With all of the data in one dbspace,
the engine is limited to a single thread to read the entire table. If the data is spread over multiple dbspaces,
the database can initiate one thread per dbspace, or fragment. Using ipload, you can generate and run a job
that will unload each table; you can then drop each table, recreate it fragmented over
several dbspaces, and reload it. In this particular instance, the fragmentation expression is not that
important, so you can just use round robin. If it takes 20 minutes to read a table in one dbspace, spreading
the table across four dbspaces should bring the scan time down to about five minutes. Spreading the table
across eight dbspaces should lower the time to about 2.5 minutes.
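
A minimal sketch of the recreate step, assuming four hypothetical dbspaces named dbs1 through dbs4 (column list abbreviated):

create table tab1 (
    key       integer,
    state_cd  char(2)
    -- remaining columns as in the original table
) fragment by round robin in dbs1, dbs2, dbs3, dbs4;

On engines that support it, an existing table can also be redistributed in place with ALTER FRAGMENT, for example:

alter fragment on table tab1 init fragment by round robin in dbs1, dbs2, dbs3, dbs4;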

Enable parallelism with PDQPRIORITY


Now that we have fixed the tables, the most obvious next step is PDQPRIORITY. With PDQPRIORITY
off, Informix Dynamic Server is doing nothing in parallel. One thread is doing all of the reading. If you
assume that it takes five minutes to read the data from one disk, and that there are 16 dbspaces per table,
then the query is automatically starting off with 80 minutes of overhead per table. Just by setting
PDQPRIORITY to 1 the engine can kick off one read thread per disk (don’t worry, it won’t try to read both
tables at once).
(Note: With XPS parallel scans are automatic. With 8.31 and dynamic memory allocation,
PDQPRIORITY loses some of its importance, becoming more of a guideline than a hard and fast rule).

Running the benchmark queries again with PDQPRIORITY set to 1, either from the operating system
(EXPORT PDQPRIORITY=1) or from dbaccess (SET PDQPRIORITY 1), the length of time required to
scan each table greatly improves.

Results: Three minutes and 21 seconds for Table 1, and three minutes and 52 seconds for Table 2, which is
roughly 150,000 rows per second.

Table scan times improve five-fold just by fragmenting the data and enabling parallel scans with PDQPRIORITY. However, they can still be improved.

Forcing light scans to eliminate buffer overhead


In an OLTP environment, you want to maximize the cache read and write rates. To do this, you want the
engine to find the row it needs in the buffer cache and not have to read it from disk. OLTP buffer cache
management has some minimal overhead. However, in a DSS environment where you want to scan the
entire table, the odds of the data being in the cache are dramatically reduced. The overhead of pushing
every scanned page through the buffer cache becomes onerous, thus necessitating the light scan, a critical element in
DSS queries.

The light scan is designed for a DSS query. Rather than use the normal resident memory buffer cache, the
light scan uses its own light scan buffers, for which there is much less overhead. Each query gets its own
set of buffer pools. This can have a dramatic effect on our read rate.

To force a light scan, the trick is to ensure that the table being read is larger than the resident memory buffer
pool and to have the isolation mode set to Dirty Read, or to use Repeatable Read or Committed Read
isolation with a shared table lock. You can also set an environment variable (export
LIGHT_SCANS=FORCE). (With IDS, light scans cannot be enabled against a table with varchars in it.)
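
A minimal sketch of the isolation-level approach, assuming the example query is run from dbaccess in the same session:

set isolation to dirty read;
select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key
group by 1, 2;

Alternatively, at Committed Read you can take a shared table lock first (begin work; lock table tab1 in share mode;) before running the query.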

The number of light scan buffers assigned is a factor of your RA (read ahead) settings. From the Informix
Masters Series training manual:

Light_scan_buffers = roundup((RA_PAGES + RA_THRESHOLD) / (MAXAIOSIZE/PAGESIZE))

MAXAIOSIZE is a constant for each port. It is generally 16, except for HP, Solaris, and Sequent, which are
60, 60, and 64 respectively.

Bumping the RA_PAGES and RA_THRESHOLD values to their maximum of 128 and 120, respectively,
gives you more light scan buffer pools, which is a good thing.
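
As an illustration only (the ratio depends on your port and page size), if MAXAIOSIZE/PAGESIZE works out to 8, the maximum settings give Light_scan_buffers = roundup((128 + 120) / 8) = 31.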

Ensure that you also have a large Virtual Segment (SHMVIRTSIZE) to accommodate the light scan
buffers.

In the previous benchmark runs, the engine did not use light scans (this was intentional, to show the
performance impact of light scans). Once UPDATE STATISTICS has been run, the optimizer will
recognize that the tables being read in our example query are larger than the resident memory buffer pool
and will use light scans when the benchmark is run again.

UPDATE STATISTICS is of course key to getting the light scan to work, but to what degree? UPDATE
STATISTICS MEDIUM or HIGH will tend to run for hours, and on a large data warehouse potentially days. The
results are very useful if you plan to do indexed joins or filtering, but for this hash join you are
looking to scan the entire table anyway, so just run the quick UPDATE STATISTICS LOW and leave the
rest for the hash join.
Setting the environment variable DBUPSPACE can dramatically reduce the time required to run UPDATE
STATISTICS. For other hints on tuning UPDATE STATISTICS, see
http://www7b.boulder.ibm.com/dmdd/zones/informix/library/techarticle/miller/0203miller.html
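
A minimal sketch of the low-overhead approach for the two example tables:

update statistics low for table tab1;
update statistics low for table tab2;

DBUPSPACE itself is an environment variable, so it is set in the shell before the session starts rather than from SQL.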

When we run the benchmark queries again, the results show: 43 seconds for Table 1, and 50 seconds for
Table 2, which is roughly 700,000 rows per second.

I have seen scan rates of up to two million rows per second on a table with 32 fragments. We could
fragment the table over more dbspaces to improve this read rate even more, but is cutting another 20 to 40
seconds off the query time worth the added administrative headache? You decide. The number of
fragments you can allocate may be limited by some other factors as well, like the business requirement for
a particular fragmentation scheme. For the purposes of this example, these results are sufficient.

I am told that there is a limit of three scan threads per CPUVP. Hence the upper limit on number of
dbspaces should be 3 x CPUVP.

If we were to rerun the query now retaining the PDQPRIORITY=1 setting, there is not much difference
that we would see. The tables would scan faster, but the query would still run out of memory and appear to
die.

The next tuning efforts focus on the join itself. There are several options available. Let’s first try running
the query with a nested loop join using the indices.

Consider adding indices


By adding an index, you are trying to do one of two things: either set up a nested loop join, or limit the size
of the data read to just the index pages. In other words, you are trying to put into the index whatever the
query needs to be successful.

The addition of an index gives the optimizer a different path it can travel to get at the data. If it feels that
one of the tables is small enough, it will choose to read that table and, for every row, probe the large table
(using the index) to perform the join. At issue is the speed with which the engine can perform probes,
which can be measured in thousands per second. 30 million rows at 3000 rows/second, which is optimistic
for the machine we are using here, comes out to 10,000 seconds or almost three hours. This is too slow.
Even fragmenting the indices and detaching them from the tables will not change the probe rate
dramatically enough to make a nested loop faster. So in this case adding indices is not going to help us and
may in fact hurt us if we confuse the optimizer.

Let’s try a dynamic hash join.

Forcing a hash join


In a hash join the smaller of the two tables is scanned and stored in a hash table in memory. If memory fills
up, which may happen unless the tables in question are small, this hash table is swapped out to the temp
disk. After the first table has been read, the second table is read and matched up against the hash table.
Periodically, the query will have to swap in the hash table entries that are out on temp disk. This temp-to-
memory swap is expensive and is generally the slowest portion of a hash join. Hence allocating enough
memory to the hash join is very important.

PDQPRIORITY controls memory allocation. The memory a query is actually granted works out to
(PDQPRIORITY/100) * (MAX_PDQPRIORITY/100) * DS_TOTAL_MEMORY. If you have a
DS_TOTAL_MEMORY of 1.5 GB and take 1% of that (PDQPRIORITY=1), you get 15 MB of memory in
which to build the hash table. 15 MB / 20 bytes (our hash table entry) = 786,000 or so entries. That’s a
rough guess, but after that many entries the database will have to push the hash table entries out to temp
disk. With 16 scan threads, the database will probably be reading around 160,000 rows per second, so in
five seconds it will fill the hash table and have to push to temp disk with the above memory allocation.

The actual formula for the size of a hash table entry is 32 + keysize + rowsize. The entry would
therefore be 32 + 2 + 155, or 189 bytes per tab1 record; multiplied by 29 million rows, that is about 5.4 GB. A key
improvement here would be to lower the size of the tab1 record to reduce the size of the hash table. I have
been informed that the ‘rowsize’ in the calculation is the rowsize of the retrieved columns, but have yet to try to prove that.

The first thing to note is that building the hash table actually slows down the scan rate to about 300,000
rows per second (results vary).

At 40 PDQ, this query took 22 minutes 12 seconds, and the match rate was about 40,000 rows per second.
Not only can more memory lower the amount of swapping done to and from temp disk, but it also increases
the size of the hash table in memory and allows more matches per second.

Utilizing a hash join method, it turns out that the example query join runs at about 90,000 rows per second
(at 80 PDQ) for a projected total of about 5.5 minutes. This would be great except that the entire hash table
does not fit into memory, causing the query to have to swap portions of the table from disk, which is why
the actual results of this query are 8 minutes 40 seconds (PDQ = 80). During the match phase of the query
the match rate was 90,000 rows/second. After 17 million matches, the database started swapping. It
performed 10 total swaps, each swap about 10 seconds long (100 seconds or one minute and 40 seconds).
Because the entire table did not fit into memory, the query also could not keep up the 90,000 rows/second
rate, which is why (sans the swap time) the query runs in seven minutes not 5.5 minutes (as first projected).
If the hash join has enough memory it will not need to swap.

At times, the optimizer can, frustratingly, keep pushing your query down the nested loop path, despite your
best intentions and even the use of optimizer hints. In these cases, you may be able to trick the optimizer
into a hash join by giving it something it won’t find in the index:

SELECT …
FROM tab1, tab2
WHERE tab1.key + 0 = tab2.key + 0

The “+ 0” is inexpensive and forces the query to use a hash join, because tab1.key + 0 is an expression the
optimizer can no longer match to an index.

Temp disks
Not surprisingly, temp disk is written to and read from in the same manner as normal tables: one thread per
fragment. It is therefore important to have multiple temp dbspaces set up to maximize the number of read and
write threads. I recommend two temp dbspaces per CPU, as was used in the query above, although as long
as you keep the number at some multiple of the CPU count you can go up to three temp dbspaces per CPU.

Balance
It is important to balance all that you do against how many processors you have so that there is minimal
task swapping between processors. In the example here, the tables were laid out over nine and 11
dbspaces, which meant that eight processors were handling nine scan threads in one case and 11 in the
other. Each processor probably wasted some time swapping from one scan thread to another. Business
requirements do not always match what makes sense technically.

Model changes
There are a few changes you can make to your onconfig file to better support decision support systems.
The following configurations are from the Informix Masters Series training manual:
Parameter          Guideline   Suggested value
BUFFERS            Low         2000
SHMVIRTSIZE        High        75% of available memory
SHMADD             Whatever    32000
SHMTOTAL           Maximize    Set to available memory
RA_PAGES           Maximize    128
RA_THRESHOLD       Maximize    120
DS_TOTAL_MEMORY    Maximize    90% of available memory

I actually prefer BUFFERS to be 20000 to provide support for OLTP activity as required; the tables are
large enough to dwarf the buffers in all cases, and there is still OLTP style work to be done to the data.

Summary
Any DSS query similar to the example described in this article will have three principal time-consuming
components: a read of Table 1, a read of Table 2, and a join step. If the query limits the selection
with some sort of filter (for example, a key between 100 and 200), then the read components
become easy to handle with an index: read Table 1 for these rows and probe Table 2 for the
matching rows, where the actual join occurs during the probe.

If the query is not limited with a filter, then a full join of both tables is required. This is very inefficient
when performed with an indexed read. It is far better to scan both tables and perform the join in memory;
a hash join is the best way to do this. Tuning, therefore, becomes a matter of shortening the three
stages of the query (scan 1, scan 2, and the join itself). With a light scan, the database can read the first
table while building the hash table. The second table can be read only marginally less
quickly while the hash join is performed. And with a high enough PDQPRIORITY, the database has enough memory
to perform the hash join in memory without expending time swapping the hash table out to temp disk. Even if the
hash tables are pushed to temp disk, it is not that expensive if the temp disks are laid out well.

If you can spread the read load over multiple disks, reserve enough memory, and set aside enough temp
disks, you can bring any query time down to within a five-minute window to scan each table and,
potentially, another minute or two to wrap up the join. If you are getting a 40-hour, or 40-day, query then
one or more of these things isn’t happening.

About the author

Jack Parker is a Systems Architect who has been building and managing Informix-based solutions for the
past sixteen years. For the past seven of these he has been involved in the data warehousing industry. He
is an occasional writer, speaker, and contributor to comp.databases.informix. He is a partner with Arten
Technology Group, a consulting company in Southern New Hampshire. You can reach him at
[email protected].
