Tuning DSS Queries
This article originally appeared in Informix Tech Notes Volume 10, Issue 1, 2000. It was written for IBM
Informix Dynamic Server (IDS). The contents hold true for IBM Informix Foundation as well. I have
made some additional comments for IBM Informix Extended Parallel Server (XPS).
Introduction
Over the past several months I have found myself repeatedly explaining the concept of tuning a decision
support system (DSS) query for an IBM Informix database.
Data warehouses and very large databases tend to be used differently from traditional online transaction processing
(OLTP) database systems. In an OLTP environment, the goal is to execute many discrete transactions in a
short timeframe. OLTP administrators support hundreds or thousands of users and tend to focus on very
specific portions of the data (for example, we must be able to update item four of the customer’s order as
well as the balance in their order master record). Such an OLTP transaction might be performed with one
or more UPDATE statements WHERE cust_id/ord_no = ?.
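Expressed as SQL, such a transaction might look something like this sketch (the table and column names are hypothetical):

update order_item
set quantity = 2
where cust_id = 1234 and ord_no = 5678 and item_no = 4;

update order_master
set balance = balance - 19.95
where cust_id = 1234 and ord_no = 5678;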
In a DSS environment, you are more likely to have only a handful of users, each of whom is making
queries against a large volume of data. In an OLTP model, users like this would be severely chastised.
However, in the world of DSS, this sort of activity is encouraged. Typically, a DSS query will scan every
row of one or more tables. Also typically, where the size of an OLTP system might be measured in terms
of hundreds of megabytes, a DSS system will be measured in terms of hundreds of gigabytes.
DSS systems tend to be queried by front-end tools such as Business Objects or SAS, where the SQL to
perform the query is generated by the tool; you frequently do not have much control over the query itself.
If you try to support a DSS query using the same data model, indexed reads, and engine tuning used in an
OLTP system, it is likely that your response time to such a query could be measured in days. In a recent
case, the response time for such a query was 40 days. In this particular case, using the tuning methods
detailed in this article, I was able to bring the query result time down to about 15 minutes. This was
possible without changing the query or changing the schema. This article describes specific tuning
procedures for shortening the run time of DSS queries.
Monitoring tools
Before examining tuning procedures to optimize this DSS query, you need to understand how to use
specific monitoring tools available to help you understand how the query will behave. The following
monitoring tools are very helpful for examining queries:
• Explain plan
• Xtree
• onstat -g mgm command
• onstat -g lsc command
• onstat -g xqp and onstat -g xqs commands (XPS-specific)
• xmp_mon
Explain plan
The first monitoring tool is the explain plan. This can be obtained by including a SET EXPLAIN ON;
command at the start of your query. The optimizer will then create, or append to, a file called sqexplain.out
in which it explains how it plans to resolve the query, as shown in Listing 1.
Listing 1. Explain plan output
QUERY:
------
select a.state_cd, b.last_name, count(*)
from tab1 a, tab2 b
where a.key = b.key
group by 1,2
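In sqexplain.out, the query is followed by the optimizer's plan. The rest of the output for this example would look something like the sketch below; the figures are illustrative, but the shape is typical of a run made before UPDATE STATISTICS:

Estimated Cost: 4
Estimated # of Rows Returned: 1
Temporary Files Required For: Group By

1) informix.a: SEQUENTIAL SCAN (Serial, fragments: ALL)

2) informix.b: SEQUENTIAL SCAN (Serial, fragments: ALL)

DYNAMIC HASH JOIN
    Dynamic Hash Filters: informix.a.key = informix.b.key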
Notice that the estimated cost is very low; somebody forgot to run UPDATE STATISTICS.
The initial section of this EXPLAIN plan is the query itself followed by an estimated cost, which is derived
from the number of instructions the optimizer thinks will be required to resolve the query. The value given
here is way too low for this query (a value in the billions would be more normal).
The estimated number of rows returned is an indication that the optimizer thinks that at least one of the
tables is empty. A minimum of one row is always returned from a count(*) operation.
Each of the tables will be read using a sequential scan, probably because it costs nothing to read an empty
table. The “Serial” is an indication that each fragment of disk that the table resides on will be read
sequentially as opposed to the desired “Parallel.” No fragments have been eliminated from the query
(hence, the “fragments: ALL”).
The temporary table is, of course, required to perform the actual group by operation. A hash join will be
used to join the two tables, again because the optimizer thinks one of the tables is empty and because there
are no indices on these tables.
What we would really like to see is a higher estimated cost. We like the sequential scan, although we really
want a parallel, not serial, scan. If we had been able to eliminate some of the fragments through a WHERE
condition (a filter), we might also have seen only a list of the fragments to be scanned. Finally, we would
also like to see a line like:
Maximum Threads: 21
We would actually like to see a higher value indicating more parallelism, but for the sake of this example,
we will say 21. This would indicate that we are using PDQPRIORITY. The exact number of threads will
vary with the number of dbspaces each table is spread across and the number of threads assigned to each
additional task. As can be seen in the output from the onstat -g sql command shown in Listing 2, we have
nine scan threads for one table, 11 scan threads for the second table and a single sqlexec thread to pull it all
together. With a higher PDQPRIORITY setting we would see more specialized threads for grouping and
sorting.
These key pieces of information can tell you that something is wrong with the way the optimizer has
chosen to solve the query.
Xtree
While the explain plan can show you how the query will be answered, Xtree allows you to monitor the
query as it is being executed. You can see how many threads have been initiated for each segment of the
process as well as determine a run rate for the query.
Figure 1 below shows an Xtree screen shot for a typical hash join. At the bottom of this figure are two sets
of scan threads: the one on the right has nine threads, and the one on the left is running with 11 threads. The
table on the right was read into the in-memory hash table, and the table on the left is being scanned and
matched against this hash table as it is read.
The numbers to the right of each box indicate how many threads are in use to perform this portion of the
query. Within the box, below the meter, is a number that indicates how many rows have been processed by
this portion of the query in the last second. At the top of the box is the number of rows processed so far.
About 114,000 rows were read from the table on the left in the last second, with about 89,000 matches
performed in that same second. When a hash table swap occurs, all of the rows-per-second counters will
drop to zero; by watching them, you can determine how often swaps occur and how much time the process
spends in them.
The numbers next to the scan boxes should match the number of dbspaces the table is fragmented across.
On this machine, I would expect to see on the order of 50,000 to 70,000 rows scanned per fragment per
second. If this number drops off at the end, then one or more of the fragments probably completed before
the others. If this occurs in the last 5-10 seconds, that is fine, but if the drop-off occurs halfway through the
query, you might want to check the fragmentation scheme for data skew.
The filter box should reduce the number of rows scanned if you are using a filter (e.g., where state_cd
= ‘MA’). The total number filtered should keep pace easily with the scan rate.
Listing 2. Output from the onstat -g ses and onstat -g sql commands
session                                   #RSAM    total    used
id     user      tty  pid    hostname     threads  memory   memory
320    informix  6    25979  foo          21       352256   278408
The output contains a lot of information (probably more than you need), but there are several key things to
look for. The current statement shows the SQL that a given session is actually running. You can also see
how many threads were kicked off and of what types (e.g., scan_2.3 and so on): we got nine scan threads
against one table, 11 against the other, and a single sqlexec thread that will perform the hash join with
PDQPRIORITY set to 1 (since I started the query, I know this; to see the PDQ resources in use, turn to
onstat -g mgm).
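Getting this far from the command line is a simple two-step sequence (the session id here comes from Listing 2):

onstat -g ses          # one line per session; find the session id of interest
onstat -g sql 320      # the current SQL statement and threads for session 320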
onstat -g mgm
The onstat -g mgm command shows how the Memory Grant Manager has divided PDQ resources among
running queries:

MAX_PDQPRIORITY: 100
DS_MAX_QUERIES:  20
DS_MAX_SCANS:    40
DS_TOTAL_MEMORY: 20000 KB

Active Queries:
Session  Query     Priority  Thread     Memory     Scans  Gate
33       105e4060  100       106335d70  2500/2500  2/1    -
onstat -g xqp and onstat -g xqs
On XPS, the onstat -g xqp command displays the plan for a given query. Consider this query:

insert into tab_2 select col1, count(unique col2) from tab1 group by 1

Operations are performed in descending order based on segid, so the first operation is the scan/group. The
group itself is reported twice, once for an input and once for an output; in this case, the ‘unique’ clause
forced a second group. Finally, the data is inserted into the final table.
Listings 5-10 show partial output from the command onstat -g xqs for this plan_id.
In the interests of brevity, I have deleted all but the first and last three lines of each segment.
Listing 5. Partial output of the command onstat -g xqs (part 1)
XMP Query Statistics
Cosvr_ID: 1
Plan_ID: 11961
In this segment, there were 16 threads initiated. Each thread read 39.4 million rows in about 32 minutes.
(The time column is displayed in units of seconds.) Not great. There was no filter, so the number of rows
produced (rows_prod) is equal to the number of rows scanned (rows_scan).
Each ‘group’ operation is broken into two phases, a collection of data and the actual group. Hence the
segment shown in Listing 6 indicates that we pushed rows into the first group (unique) operation. Again,
16 threads took the same amount of time, which was concurrent with the scan time. The number of rows
produced is identical to the number of rows pushed in (rows_cons). The amount of memory for this
operation was 2K (across 16 threads).
Listing 7 is the second half of the group segment shown in Listing 6, and is where the unique rows come
out. Notice that this phase used almost 4GB of memory. The ovfl column indicates that memory
overflowed; it is not a count of how many pages overflowed, but of how many segments. It is hard to tell
from this alone whether the overflow was serious. However, given that this segment took three hours, we
can surmise that it was serious. This is probably what gated the scan.
Listing 8 contains the second group necessitated by the count(*)/group operation. Again, this is just
pushing rows into the group operation. All of this time is concurrent with previous processes.
Listing 9 shows the second phase of the second group operation. Because of the way this query was
crafted, the other 4GB of memory was used for this operation. One strategy to solve this performance
problem would be to separate the two groups: perform a select unique into one table, and then perform the
count off of that second table.
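A sketch of that rewrite, using the table and column names from the query above (step1 is a hypothetical scratch table):

select unique col1, col2
from tab1
into temp step1 with no log;

insert into tab_2
select col1, count(*)
from step1
group by 1;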
Listing 10 shows the final insert phase of the query, writing to the temp table. This phase of the operation
was not started until the second group started issuing output, hence the time is much lower.
xmp_mon
The xmp_mon utility takes a snapshot of a running query, showing the plan segments along with
per-operation statistics:

scan   4  tab1
group  4
   Operation  Phase   Threads  Rows
   scan       next    16       304067921
group  3
group  3
   Operation  Phase   Threads  Rows
   xchg       next    16       302983421
   group      create  16       0
group  2
insert 1
The output shows that the query is in the process of scanning the tab1 table: 300 million rows have been
scanned and passed into the first group operation, where they are waiting. At the top of the display are a
few key elements: the query, the isolation level, PDQ, and memory.
An example query
Imagine you have a query where you join two tables to return a count of last name by state, with the name
in one table and the address (state) in the other; this is the query shown in Listing 1.
Table 1 has 29 million rows and Table 2 has 27 million rows. Table 1 is 4.5 GB and Table 2 is 6 GB. Each
table has been placed into a single dbspace. There is no index on either key in either table. The machine in
use is an 8-processor Sun 6500 with 4 GB of memory. If this query were to run in an OLTP environment,
it could take upwards of 70 or 80 hours to execute.
Frequently, tuning a DSS query is merely a matter of shifting from “OLTP-think” to “DSS-think,” which is
exactly what we will do in this case.
The optimizer should choose to solve the example query with a hash join. If it does not, there may be
another issue; for example, the optimizer may think that one of the tables is empty (so you need to run
UPDATE STATISTICS), as is the case in our query here. Let's assume that PDQPRIORITY is not set to
anything, which is typical of an OLTP environment. This means that it is effectively turned off. What will
happen when the query is executed?
With these settings, the engine first reads the smaller table (Table 2), applies any filters, and builds a hash
table entry for each record. It builds this hash table in memory. It then reads the large table and matches it
up against the hash table. As it finds matches, it pushes the matched rows out the other side of the join.
It is possible that a filter on the ‘larger’ of two tables will make the result set from that table smaller than
the result set from the ‘smaller’ table. If the optimizer knows this, it will properly choose to read the larger
table first, applying the filter, so that the hash table does not take up so much memory. The optimizer needs
distribution information to make this determination accurately, and this can only come from running
UPDATE STATISTICS high or medium.
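For example, medium-resolution distributions on just the join columns are far cheaper to gather than a full high-level run:

update statistics medium for table tab1 (key);
update statistics medium for table tab2 (key);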
Thirty-seven minutes into this query, Xtree shows both tables have been scanned and have moved into the
hash join. Because insufficient memory was allocated, the hash join is stalled while it swaps hash table
pages from the temp disk. As the hash table grows, more and more memory is needed. Because the query
is executing in an OLTP environment, the LRU buffers are filled with all the data from the table, along
with the associated buffer-management overhead, spoiling everybody's day (or rather week or month).
Because the hash table memory requirements are so large, the query overflowed to the temp spaces, and if
these were not set up well, it may have taken away the temp space for everyone else as well. After running
for several days, you would probably just kill the query anyway. This is where a typical 40-day query
comes from.
When tuning this query, it helps to approach it in two parts: the table scans and the join. To begin, let’s
obtain a benchmark of how long it takes just to scan each table by using the following queries. This will
not only give us the ability to estimate query time but also highlight when a query slows down, as well as
ensure that we can read the tables themselves fairly quickly.
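The benchmark query can be anything that reads every row and returns almost nothing; with no indexes on these tables, a simple count forces a full scan of each:

select count(*) from tab1;
select count(*) from tab2;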
The results show: 17 minutes for Table 1 and 20 minutes for Table 2, which is about 30,000 rows per
second. These results are interpolated. The actual read time was 37 minutes for both tables.
Running the benchmark queries again, this time with each table fragmented across multiple dbspaces and
with PDQPRIORITY set to 1, either from the operating system (export PDQPRIORITY=1) or from
dbaccess (SET PDQPRIORITY 1), the length of time required to scan each table improves greatly.
Results: Three minutes and 21 seconds for Table 1, and three minutes and 52 seconds for Table 2, which is
roughly 150,000 rows per second.
Table scan times improve five-fold just by fragmenting the data and turning on PDQ. However, they can still be improved.
The light scan is designed for a DSS query. Rather than use the normal resident memory buffer cache, the
light scan uses its own light scan buffers, for which there is much less overhead. Each query gets its own
set of buffer pools. This can have a dramatic effect on our read rate.
To force a light scan, the trick is to ensure that the table being read is larger than the resident memory
buffer pool and that the isolation level is set to Dirty Read (or to Repeatable Read or Committed Read with
a shared table lock). You can also set an environment variable (export
LIGHT_SCANS=FORCE). (With IDS, light scans cannot be enabled against a table with varchars in it.)
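In dbaccess, a session prepared for light scans might look like this (the count query stands in for any full-table scan):

set isolation to dirty read;
set pdqpriority 1;
select count(*) from tab1;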
The number of light scan buffers assigned is a function of your RA (read-ahead) settings. From the Informix
Masters Series training manual: MAXAIOSIZE is a constant for each port. It is generally 16, except for
HP, Solaris, and Sequent, where it is 60, 60, and 64, respectively.
Bumping the RA_PAGES and RA_THRESHOLD values to their maximums of 128 and 120, respectively,
gives you more light scan buffer pools, which is a good thing.
Ensure that you also have a large Virtual Segment (SHMVIRTSIZE) to accommodate the light scan
buffers.
In the previous benchmark runs, the engine did not use light scans (this was intentional, to show the
performance impact of light scans). Once UPDATE STATISTICS has been run, the optimizer will
recognize that the tables being read in our example query are larger than the resident memory buffer pool,
and light scans will be used when the benchmark is run again.
UPDATE STATISTICS is, of course, key to getting the light scan to work, but to what degree should it be
run? UPDATE STATISTICS medium or high will tend to run for hours; on a large data warehouse,
potentially days. The results are very useful if you plan to do indexed joins or filtering, but for this hash
join you are looking to scan the entire table anyway, so just run the quick UPDATE STATISTICS low and
leave the rest for the hash join.
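A low run completes quickly because it refreshes only table, row, and page counts:

update statistics low for table tab1;
update statistics low for table tab2;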
Setting the environment variable DBUPSPACE can dramatically reduce the time required to run UPDATE
STATISTICS. For other hints on tuning UPDATE STATISTICS, see
http://www7b.boulder.ibm.com/dmdd/zones/informix/library/techarticle/miller/0203miller.html
When we run the benchmark queries again, the results show: 43 seconds for Table 1, and 50 seconds for
Table 2, which is roughly 700,000 rows per second.
I have seen scan rates of up to two million rows per second on a table with 32 fragments. We could
fragment the table over more dbspaces to improve this read rate even more, but is cutting another 20 to 40
seconds off the query time worth the added administrative headache? You decide. The number of
fragments you can allocate may be limited by some other factors as well, like the business requirement for
a particular fragmentation scheme. For the purposes of this example, these results are sufficient.
I am told that there is a limit of three scan threads per CPU VP; hence, the upper limit on the number of
dbspaces should be three times the number of CPU VPs.
If we were to rerun the query now, retaining the PDQPRIORITY=1 setting, we would not see much
difference. The tables would scan faster, but the query would still run out of memory and appear to die.
The next tuning efforts focus on the join itself. There are several options available. Let’s first try running
the query with a nested loop join using the indices.
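Trying the nested-loop path means first giving the optimizer indices to work with (the index names are hypothetical):

create index tab1_key_ix on tab1(key);
create index tab2_key_ix on tab2(key);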
The addition of an index gives the optimizer a different path it can travel to get at the data. If it feels that
one of the tables is small enough, it will choose to read that table and, for every row, probe the large table
(using the index) to perform the join. At issue is the speed with which the engine can perform these probes,
which can be measured in thousands per second: 30 million rows at 3,000 probes per second, which is
optimistic for the machine we are using here, comes out to 10,000 seconds, or almost three hours. This is too slow.
Even fragmenting the indices and detaching them from the tables will not change the probe rate
dramatically enough to make a nested loop faster. So in this case adding indices is not going to help us and
may in fact hurt us if we confuse the optimizer.
The actual formula for the size of a hash table entry is 32 + keysize + rowsize. The entry size would
therefore be 32 + 2 + 155, or 189 bytes per tab1 record, which multiplied by 29 million rows comes to
roughly 5.5 GB. A key improvement here would be to lower the size of the tab1 record to reduce the size
of the hash table. I have been informed that the ‘rowsize’ in the calculation is the rowsize of the retrieved
columns, but I have yet to try to prove that.
The first thing to note is that building the hash table actually slows down the scan rate to about 300,000
rows per second (results vary).
At 40 PDQ, this query took 22 minutes 12 seconds, and the match rate was about 40,000 rows per second.
Not only can more memory lower the amount of swapping done to and from temp disk, but it also increases
the size of the hash table in memory and allows more matches per second.
Utilizing a hash join, it turns out that the example query runs at about 90,000 rows per second
(at 80 PDQ) for a projected total of about 5.5 minutes. This would be great, except that the entire hash table
does not fit into memory, forcing the query to swap portions of the table to and from disk, which is why
the actual result for this query is 8 minutes 40 seconds (PDQ = 80). During the match phase of the query,
the match rate was 90,000 rows/second. After 17 million matches, the database started swapping. It
performed 10 swaps in total, each about 10 seconds long (100 seconds, or one minute and 40 seconds).
Because the entire table did not fit into memory, the query also could not sustain the 90,000 rows/second
rate, which is why (swap time aside) the query runs in seven minutes, not the 5.5 minutes first projected.
If the hash join has enough memory, it will not need to swap.
At times the optimizer can, frustratingly, keep pushing your query down the nested-loop path, despite your
best intentions and even the use of optimizer hints. In these cases, you may be able to trick the optimizer
into a hash join by giving it something it will not find in the index:
SELECT …
FROM tab1, tab2
WHERE tab1.key + 0 = tab2.key + 0
The “+ 0” is inexpensive and forces the query to be resolved with a hash join: tab1.key + 0 no longer
matches an indexed column, so the optimizer cannot use the index.
Temp disks
Not surprisingly, temp disk is written to and read from in the same manner as normal tables: one thread per
fragment. It is therefore important to have multiple temp dbspaces set up to maximize the number of read
and write threads. I recommend two temp dbspaces per CPU, as was used in the query above, although as
long as you keep the number at some multiple of the CPU count, you can go up to three temp dbspaces
per CPU.
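The engine finds these dbspaces through the DBSPACETEMP configuration parameter (or environment variable). A hypothetical entry for the 8-CPU machine used here, with two temp dbspaces per CPU, might read:

DBSPACETEMP tmp01dbs,tmp02dbs,tmp03dbs,tmp04dbs,tmp05dbs,tmp06dbs,tmp07dbs,tmp08dbs,tmp09dbs,tmp10dbs,tmp11dbs,tmp12dbs,tmp13dbs,tmp14dbs,tmp15dbs,tmp16dbs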
Balance
It is important to balance all that you do against how many processors you have so that there is minimal
task swapping between processors. In the example here, the tables were laid out over nine and 11
dbspaces, which meant that eight processors were handling nine scan threads in one case and 11 in the
other. Each processor probably wasted some time swapping from one scan thread to another. Business
requirements do not always match what makes sense technically.
Model changes
There are a few changes you can make to your onconfig file to better support decision support systems.
The following configurations are from the Informix Masters Series training manual:
Parameter          Guidance   Value
BUFFERS            Low        2000
SHMVIRTSIZE        High       75% of available memory
SHMADD             Whatever   32000
SHMTOTAL           Maximize   Set to available memory
RA_PAGES           Maximize   128
RA_THRESHOLD       Maximize   120
DS_TOTAL_MEMORY    Maximize   90% of available memory
I actually prefer BUFFERS to be 20000 to support OLTP activity as required; the tables are large enough
to dwarf the buffers in all cases, and there is still OLTP-style work to be done on the data.
Summary
Any DSS query similar to the example described in this article will have three principal time-consuming
components: a read of Table 1, a read of Table 2, and a join step. If the query limits the selection with
some sort of filter (for example, a key between 100 and 200), then the read components become easy to
handle with an index: read Table 1 for the qualifying rows and probe Table 2 for the matching rows, with
the actual join occurring during the probe.
If the query is not limited with a filter, then a full join of both tables is required. This is very inefficient
when performed with an indexed read. It is far better to scan both tables and perform the join in memory,
and a hash join is the best way to do this. Tuning, therefore, becomes a matter of shortening the three
stages of the query (scan 1, scan 2, and the join itself). With a light scan, the database can read the first
table while building the hash table; it can read the second table only marginally less quickly while
performing the hash join. And with a high enough PDQPRIORITY, the database has enough memory to
perform the hash join in memory without expending time swapping the hash table out to temp disk. Even
if the hash tables are pushed to temp disk, it is not that expensive if the temp disks are laid out well.
If you can spread the read load over multiple disks, reserve enough memory, and set aside enough temp
disks, you can bring any query time down to within a five-minute window to scan each table and,
potentially, another minute or two to wrap up the join. If you are getting a 40-hour, or 40-day, query, then
one or more of these things isn't happening.
Jack Parker is a Systems Architect who has been building and managing Informix-based solutions for the
past sixteen years. For the past seven of these he has been involved in the data warehousing industry. He
is an occasional writer, speaker, and contributor to comp.databases.informix. He is a partner with Arten
Technology Group, a consulting company in Southern New Hampshire. You can reach him at
[email protected].