Performance Tuning SQL Server
One of the best ways to boost JOIN performance is to limit how many rows need to be
JOINed. This is especially beneficial for the outer table in a JOIN. Return only those rows that
actually need to be JOINed, and no more.
************************************************************************
If you perform regular joins between two or more tables in your queries, performance will
be optimized if each of the joined columns has its own index. This includes adding indexes
to the columns in each table used to join the tables. Generally speaking, a clustered key is
better than a non-clustered key for optimum JOIN performance.
************************************************************************
If you have two or more tables that are frequently joined together, then the columns
used for the joins on all tables should have an appropriate index. If the columns used for the
joins are not naturally compact, then consider adding compact surrogate keys to the tables
in order to reduce the size of the keys, thus decreasing read I/O during the join
process and increasing overall performance.
************************************************************************
JOIN performance has a lot to do with how many rows you can stuff in a data page.
For example, let's say you want to JOIN two tables. Most likely, one of these two tables will be
smaller than the other, and SQL Server will most likely select the smaller of the two tables to be
the inner table of the JOIN. When this happens, SQL Server tries to put the relevant contents of
this table into the buffer cache for faster performance. If there is not enough room to put all the
relevant data into cache, then SQL Server will have to use additional resources in order to get
data into and out of the cache as the JOIN is performed.
If all of the data can be cached, the performance of the JOIN will be faster than if it is not. This
comes back to the original statement, that the number of rows in a table can affect JOIN
performance. In other words, if a table has no wasted space, it is much more likely to get all of
the relevant inner table data into cache, boosting speed. The moral of this story is to try to fit
as much data into a data page as possible. This can be done by using a high
fillfactor, rebuilding indexes often to remove empty space, and choosing optimal data types and
widths when creating columns in tables.
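For example, here is a sketch of rebuilding an index with a high fillfactor using the SQL Server 7.0/2000-era DBCC DBREINDEX command (the table and index names are hypothetical):
-- Rebuild a specific index with a 100% fillfactor so each page is packed full
DBCC DBREINDEX ('Orders', 'IX_Orders_OrderDate', 100)
-- Or rebuild all indexes on the table, keeping each index's original fillfactor
DBCC DBREINDEX ('Orders', '', 0)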
************************************************************************
Keep in mind that when you create foreign keys, an index is not automatically
created at the same time. If you ever plan to join a table to the table with the foreign key,
using the foreign key as the linking column, then you should consider adding an index to the
foreign key column. An index on a foreign key column can substantially boost the performance
of many joins.
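For example, a minimal sketch (the table, column, and index names are hypothetical):
-- Creating the foreign key does not index OrderID, so add the index explicitly
CREATE INDEX IX_OrderDetails_OrderID ON OrderDetails (OrderID)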
************************************************************************
Avoid joining tables based on columns with few unique values. If columns used for
joining aren’t mostly unique, then the SQL Server optimizer may not be able to use an existing
index in order to speed up the join. Ideally, for best performance, joins should be done on
columns that have unique indexes.
************************************************************************
For best join performance, the indexes on the columns being joined should ideally be
numeric data types, not CHAR or VARCHAR, or other non-numeric data types. The overhead is
lower and join performance is faster.
************************************************************************
For maximum performance when joining two or more tables, the indexes on the columns to
be joined should have the same data type, and ideally, the same width.
This also means that you shouldn't mix non-Unicode and Unicode datatypes (e.g. VARCHAR and
NVARCHAR) when using SQL Server 7.0 or later. If SQL Server has to implicitly convert the
data types to perform the join, this not only slows the joining process, but it also could mean
that SQL Server may not use available indexes, performing a table scan instead.
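For example, here is a sketch of the kind of mismatch to avoid (the tables and columns are hypothetical):
-- If Customers.CustID is VARCHAR(10) but Orders.CustID is NVARCHAR(10),
-- SQL Server must implicitly convert one side of this join, which can
-- prevent it from using an index on the converted column
SELECT o.OrderID
FROM Customers c
INNER JOIN Orders o
ON c.CustID = o.CustID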
************************************************************************
When you create joins using Transact-SQL, you can choose between two different
types of syntax: either ANSI or Microsoft. ANSI refers to the ANSI standard for writing joins,
and Microsoft refers to the old Microsoft style of writing joins. For example:
ANSI JOIN Syntax
SELECT fname, lname, department
FROM names INNER JOIN departments ON names.employeeid = departments.employeeid
Former Microsoft JOIN Syntax
SELECT fname, lname, department
FROM names, departments
WHERE names.employeeid = departments.employeeid
If written correctly, either format will produce identical results. But that is a big if. The older
Microsoft join syntax lends itself to mistakes because the syntax is a little less obvious. On the
other hand, the ANSI syntax is very explicit and there is little chance you can make a mistake.
For example, I ran across a slow-performing query from an ERP program. After reviewing the
code, which used the Microsoft JOIN syntax, I noticed that instead of creating a LEFT JOIN, the
developer had accidentally created a CROSS JOIN instead. In this particular example, less than
10,000 rows should have resulted from the LEFT JOIN, but because a CROSS JOIN was used, over
11 million rows were returned instead. Then the developer used a SELECT DISTINCT to get rid of
all the unnecessary rows created by the CROSS JOIN. As you can guess, this made for a very
lengthy query. Unfortunately, all I could do was notify the vendor's support department about it,
and they fixed their code.
The moral of this story is that you probably should be using the ANSI syntax, not the old
Microsoft syntax. Besides reducing the odds of making silly mistakes, this code is more portable
between databases, and I imagine Microsoft will eventually stop supporting the old
format, making the ANSI syntax the only option.
************************************************************************
If you have to regularly join four or more tables to get the recordset you need, consider
denormalizing the tables so that the number of joined tables is reduced. Often, by adding one or
two columns from one table to another, the number of joins can be reduced, boosting
performance.
************************************************************************
If your join is slow, and currently includes hints, remove the hints to see if the optimizer
can do a better job on the join optimization than you can. This is especially important if your
application has been upgraded from version 6.5 to 7.0, or from 7.0 to 2000.
************************************************************************
One of the best ways to boost JOIN performance is to ensure that the JOINed tables
include an appropriate WHERE clause to minimize the number of rows that need to be JOINed.
For example, I have seen many developers perform a simple JOIN on two tables, which is not all
that unusual. The problem is that each table may contain over a million rows. Instead of
just JOINing the tables, appropriate restrictive conditions needed to be added to the WHERE
clause for each table in order to reduce the total number of rows to be JOINed. This simple step
can really boost the performance of a JOIN of two large tables.
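For example, a sketch of the idea using the Northwind tables used elsewhere in this article (the date filter itself is hypothetical):
SELECT Orders.OrderID, [Order Details].UnitPrice
FROM Orders
INNER JOIN [Order Details]
ON Orders.OrderID = [Order Details].OrderID
-- Restrict the rows coming from the Orders table before they are JOINed
WHERE Orders.OrderDate >= '19970101'
AND Orders.OrderDate < '19980101'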
************************************************************************
In the SELECT statement that creates your JOIN, don't use an * (asterisk) to return all of
the columns in both tables. This is bad form for a couple of reasons. First, you should only
return those columns you need, as the less data you return, the faster your query will run. It
would be rare that you would need all of the columns in all of the tables you have joined.
Second, you will be returning two copies of each column used in your JOIN condition, which ends
up returning more data than you need, hurting performance.
Take a look at these two queries:
USE Northwind
SELECT *
FROM Orders
INNER JOIN [Order Details]
ON Orders.OrderID = [Order Details].OrderID
and
USE Northwind
SELECT Orders.OrderID, Orders.OrderDate,
[Order Details].UnitPrice, [Order Details].Quantity,
[Order Details].Discount
FROM Orders
INNER JOIN [Order Details]
ON Orders.OrderID = [Order Details].OrderID
Both of these queries perform essentially the same function. The problem with the first one is
that it returns not only too many columns (they aren't all needed by the application), but the
OrderID column is returned twice, which doesn't provide any useful benefit. Both of these
problems contribute to unnecessary server overhead, hurting performance. The moral of this
story is never to use the * in your joins.
************************************************************************
While high index selectivity is generally an important factor that the Query Optimizer uses to
determine whether or not to use an index, there is one special case where indexes with low
selectivity can be useful in speeding up SQL Server. This is in the case of indexes on foreign keys.
Whether an index on a foreign key has either high or low selectivity, an index on a
foreign key can be used by the Query Optimizer to perform a merge join on the
tables in question. A merge join occurs when a row from each table is taken and then they
are compared to see if they match the specified join criteria. If the tables being joined have the
appropriate indexes (no matter the selectivity), a merge join can be performed, which is often
much faster than a join to a table with a foreign key that does not have an index.
************************************************************************
For very large joins, consider placing the tables to be joined in separate physical files in
the same filegroup. This allows SQL Server to spawn a separate thread for each file being
accessed, boosting performance.
************************************************************************
Don't use CROSS JOINS, unless this is the only way to accomplish your goal. What
some inexperienced developers do is to join two tables using a CROSS JOIN, and then they use
either the DISTINCT or the GROUP BY clauses to "clean up" the mess they have created. This, as
you might imagine, can be a huge waste of SQL Server resources.
************************************************************************
If you have the choice of using a JOIN or a subquery to perform the same task, generally
the JOIN (often an OUTER JOIN) is faster. But this is not always the case. For example, if the
returned data is going to be small, or if there are no indexes on the joined columns, then a
subquery may indeed be faster.
The only way to really know for sure is to try both methods and then look at their query plans. If
this operation is run often, you should seriously consider writing the code both ways, and then
select the most efficient code.
************************************************************************
We had a query that contained two subselects, each with an aggregate function (SUM, COUNT,
etc.) in the SELECT part. The query was performing sluggishly. We were able to isolate the
problem down to the aggregate functions in the subselects.
To rectify the problem, we reorganized the query so that there was still an aggregate function in
the SELECT part, but replaced the subselects with a series of JOINs. The query executed much
faster.
So, as a general rule, developers should consider using JOINs in lieu of subselects when the
subselect contains an aggregate function.
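A minimal sketch of such a rewrite (the Customers and Orders tables and their columns are hypothetical):
-- Before: an aggregate buried in a correlated subselect
SELECT c.CustomerID,
(SELECT SUM(o.Amount) FROM Orders o WHERE o.CustomerID = c.CustomerID) AS Total
FROM Customers c

-- After: the same result using a LEFT JOIN and GROUP BY
SELECT c.CustomerID, SUM(o.Amount) AS Total
FROM Customers c
LEFT JOIN Orders o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID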
************************************************************************
If you have a query with many joins, one alternative to de-normalizing a table to boost
performance is to use an Indexed View to pre-join the tables. An Indexed View, which is
only available from SQL Server 2000 Enterprise Edition, allows you to create a view that is
actually a physical object that has its own clustered index. Whenever a base table of the
Indexed View is updated, the Indexed View is also updated. As you can imagine, this can
potentially reduce INSERT, UPDATE, and DELETE performance on the base tables. You will have
to perform tests, comparing the pros and cons of performance, in order to determine whether or
not using an Indexed View to avoid joins in a query is worth the extra performance cost caused
by using it.
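A sketch of how such a pre-joined Indexed View might be created (the table, column, and index names are hypothetical; indexed views also require SCHEMABINDING, two-part object names, and specific SET options to be in effect):
CREATE VIEW dbo.OrderSummary
WITH SCHEMABINDING
AS
SELECT o.OrderID, o.OrderDate, od.ProductID, od.Quantity
FROM dbo.Orders o
INNER JOIN dbo.OrderDetails od ON o.OrderID = od.OrderID
GO
-- Materializing the view: the unique clustered index makes it a physical object
CREATE UNIQUE CLUSTERED INDEX IX_OrderSummary
ON dbo.OrderSummary (OrderID, ProductID)
GO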
************************************************************************
If you have a query that uses a LEFT OUTER JOIN, check it carefully to be sure that this is the
type of join you really want to use. As you may know, a LEFT OUTER JOIN is used to create a result
set that includes all of the rows from the left table specified in the clause, not just the ones in
which the joined columns match. In addition, when a row in the left table has no matching rows
in the right table, the result set row contains NULL values for all the selected columns coming
from the right table. If this is what you want, then use this type of join.
The problem is that in the real world, a LEFT OUTER JOIN is rarely needed, and many developers
use them by mistake. While you may end up with the data you want, you may also end up with
more than the data you want, which contributes to unnecessary overhead and poor
performance. Because of this, always closely examine why you are using a LEFT OUTER JOIN in
your queries, and only use them if they are exactly what you need. Otherwise, use a JOIN that is
more appropriate to your needs.
************************************************************************
If you are having difficulty tuning the performance of a poorly performing query that
has one or more JOINs, check to see if the query plan created by the query optimizer is
using a hash join. When the query optimizer is asked to join two tables that don't have
appropriate indexes, it will often perform a hash join.
A hash join is resource intensive (especially CPU and I/O) and can slow the performance of your
join. If the query in question is run often, you should consider adding appropriate indexes. For
example, if you are joining column1 in table1 to column5 in table2, then column1 in table1 and
column5 in table2 need to have indexes.
Once indexes are added to the appropriate columns used in the joins in your query, the query
optimizer will most likely be able to use these indexes, performing a nested-loop join instead of
a hash join, and performance will improve.
How do you know if the Query Optimizer has automatically created column statistics on a
column in a table? Actually, this is quite easy to find out. Run the following query from Query
Analyzer while connected to the user database in question.
SELECT name
FROM sysindexes
WHERE (name LIKE '%_WA_Sys%')
This query will return all of the columns in your database that have had column statistics added
automatically by the Query Optimizer. The value in the "name" column of the sysindexes table is
the name assigned to the statistics that SQL Server keeps for the named column. This
information provides you a starting point from which to explore whether or not adding indexes
to these columns will be useful.
************************************************************************
Indexes cannot be created in a vacuum. In other words, before you can identify and create
optimal indexes for your tables, you must thoroughly understand the kinds of queries
that will be run against them. This is not an easy task, especially if you are attempting to
add indexes to a new database.
Whether you are optimizing the indexes for the first time for a new database, or for a current
production database, you need to identify what queries are run, and how often they are run.
Obviously, you will want to spend more time creating and tuning indexes for queries that are
run very often than for queries that are seldom run. In addition, you will want to identify those
queries that are the most resource intensive, even if they aren't run the most often.
Once you know which queries run the most often, and which are the most resource intensive,
you can begin to better allocate your time in order to get the biggest bang for your time
invested.
But there is still one little question. How do you identify which queries are run the most often,
and which are the most resource intensive? The easiest solution is to capture a Profiler trace,
which can be configured to identify which queries run the most often, and to identify which
queries use the most resources. How you configure Profiler won't be discussed now, as it would
take a large article to explain all the options. The point here is to make you aware that the
Profiler is the tool of choice to identify the queries that are being run against your database.
If you are adding indexes to a production database, capturing the data you need is simple. But if
your database is new, what you will need to do is to simulate actual activity as best as possible,
perhaps during beta testing of the application, and capture this activity. While it may not be
perfect data, it will at least give you a head start. And once production begins, you can continue
your index tuning efforts on an on-going basis until you are relatively satisfied that you have
identified and tuned the indexes to the best of your ability.
Once you have identified the key queries, your next job is to identify the best indexes for them.
This is a process too big to describe in this single tip, although there are many tips on this
website that relate directly to this issue. Essentially, what you need to do is to run each query
you need to analyze in Query Analyzer, examining how it works, and examining its execution
plan. Then based on your knowledge of the query, and your knowledge of indexing and how it
works in SQL Server, you begin the art of adding and optimizing indexes for your application.
************************************************************************
As a rule of thumb, every table should have at least a clustered index. Generally, but
not always, the clustered index should be on a column that monotonically increases — such as
an identity column, or some other column where the value is increasing — and is unique. In
many cases, the primary key is the ideal column for a clustered index.
************************************************************************
Indexes should be considered on all columns that are frequently accessed by the
WHERE, ORDER BY, GROUP BY, TOP, and DISTINCT clauses. Without an index, each of
these operations will require a table scan of your table, potentially hurting performance.
Keep in mind the word "considered." An index created to support the speed of a particular query
may not be the best index for another query on the same table. Sometimes you have to balance
indexes to attain acceptable performance on all the various queries that are run against a table.
************************************************************************
An index on a column can often be created in different ways, some of which are more
optimal than others. What this means is that just because you create a useful index on a column
doesn't mean it is automatically the optimum version of that index. It is quite possible that
a different variation of the same index is faster.
The most obvious example of this is that an index can be clustered or non-clustered. Another
example of how an index's creation can affect its performance is the FILLFACTOR and
PAD_INDEX settings used to create it. Also, whether the index is a composite index or not
(and what columns it contains) can affect an index's performance.
Unfortunately, there is no easy answer as to which variation of the same index is the fastest in
your situation, as the data and queries run against the data are different.
While I can't offer you specific rules that fit in all cases, the index tips you find on this website
should help you decide which variation of an index is best in your particular circumstance. You
may also need to test variations of the same index to see which variation works best for you.
************************************************************************
Don't automatically add indexes on a table because it seems like the right thing to do. Only
add indexes if you know that they will be used by the queries run against the table.
************************************************************************
Static tables (those tables that change very little, or not at all) can be more heavily indexed
than dynamic tables (those that are subject to many INSERTs, UPDATEs, or DELETEs) without
negative effect. This doesn't mean you should index every column. Only those columns that
need an index should have them. But at least you don't have to worry about the overhead of
indexes when they are added to static tables, as you must keep in mind when adding indexes to
dynamic tables.
In addition, for these tables, create the indexes with a FILLFACTOR of 100 and the PAD_INDEX
option to ensure there is no wasted space. This reduces disk I/O, helping to boost overall performance.
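For example, a sketch of creating such an index on a static lookup table (the names are hypothetical):
-- On a static table, pack the index pages completely full to minimize disk I/O
CREATE INDEX IX_StateCodes_StateName
ON StateCodes (StateName)
WITH PAD_INDEX, FILLFACTOR = 100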
************************************************************************
Point queries (queries that return a single row) are just as fast using a clustered
index as a non-clustered index. If you will be creating an index to speed the retrieval of a
single record, you may want to consider making it a non-clustered index, and saving the
clustered index (you can only have one per table) for a more complex query.
************************************************************************
To help identify which tables in your database may need additional or improved
indexes, use the SQL Server Profiler Create Trace Wizard to run the "Identify Scans of Large
Tables" trace. This trace will tell which tables are being scanned by queries instead of using an
index to seek the data. This should provide you data you can use to help you identify which
tables may need additional or better indexes.
************************************************************************
Don't over index your OLTP tables, as every index you add increases the time it takes to
perform INSERTS, UPDATES, and DELETES. There must be a fine line drawn between having the
ideal number of indexes (for SELECTs) and the ideal number for data modifications.
************************************************************************
Don't accidentally add the same index twice on a table. This happens more easily than you
might think. For example, you add a unique or primary key to a column, which of course creates an
index to enforce what you want to happen. But without thinking about it when evaluating the
need for indexes on a table, you decide to add a new index, and this new index happens to be
on the same column as the unique or primary key. As long as you give indexes different names,
SQL Server will allow you to create the same index over and over.
************************************************************************
Drop indexes that are never used by the Query Optimizer. Unused indexes slow data
modifications, cause unnecessary I/O reads when reading pages, and waste space in your
database, increasing the amount of time it takes to back up and restore databases. Use the
Index Tuning Wizard (7.0 and 2000) to help identify indexes that are not being used.
************************************************************************
Generally, you probably won't want to add an index to a table under these conditions:
If the index is not used by the query optimizer. Use Query Analyzer's "Show Execution
Plan" option to see if your queries against a particular table use an index or not. If the
table is small, most likely indexes will not be used.
If the column values exhibit low selectivity, often less than 90%-95% for non-clustered
indexes.
If the column(s) to be indexed are very wide.
If the column(s) are defined as TEXT, NTEXT, or IMAGE data types.
If the table is rarely queried.
Creating an index under any of these conditions will most likely result in an index that is rarely
used, or not used at all.
************************************************************************
On data warehousing databases, which are essentially read-only, having as many indexes as
needed to cover virtually any query is not normally a problem.
************************************************************************
To provide the up-to-date statistics the query optimizer needs to make smart query
optimization decisions, you will generally want to leave the "Auto Update Statistics"
database option on. This helps to ensure that the optimizer statistics are valid, helping to
ensure that queries are properly optimized when they are run.
But this option is not a panacea. When a SQL Server database is under very heavy load,
sometimes the auto update statistics feature can update the statistics at inappropriate times,
such as the busiest time of the day.
If you find that the auto update statistics feature is running at inappropriate times, you may
want to turn it off, and then manually update the statistics (using UPDATE STATISTICS or
sp_updatestats) when the database is under a less heavy load.
But again, consider what will happen if you do turn off the auto update statistics feature. While
turning this feature off may reduce some stress on your server by not running at inappropriate
times of the day, it could also cause some of your queries not to be properly optimized, which
could also put extra stress on your server during busy times.
Like many optimization issues, you will probably need to experiment to see if turning this option
on or off is more effective for your environment. But as a rule of thumb, if your server is not
maxed out, then leaving this option on is probably the best decision.
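A sketch of the manual approach (the database and table names are placeholders):
-- Turn off automatic statistics updates for the database (SQL Server 2000 syntax)
ALTER DATABASE MyDatabase SET AUTO_UPDATE_STATISTICS OFF

-- Then, during a quiet period, refresh statistics manually for a given table
UPDATE STATISTICS MyTable
-- or refresh statistics for every table in the current database at once
EXEC sp_updatestats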
************************************************************************
Keep the "width" of your indexes as narrow as possible. This reduces the size of the index
and reduces the number of disk I/O reads required to read the index, boosting performance.
************************************************************************
If possible, try to create indexes on columns that have integer values instead of characters.
Integer values have less overhead than character values.
************************************************************************
When creating indexes, try to make them unique indexes if at all possible. SQL Server
can often search through a unique index faster than a non-unique index because in a unique
index, each row is unique, and once the needed record is found, SQL Server doesn't have to look
any further.
************************************************************************
If a particular query against a table is run infrequently, and the addition of an index greatly
speeds the performance of the query, but the performance of INSERTS, UPDATES, and DELETES
is negatively affected by the addition of the index, consider creating the index for the table for
the duration of when the query is run, then dropping the index. An example of this is when
monthly reports are run at the end of the month on an OLTP application.
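For example, a sketch of the month-end pattern (the table and index names are hypothetical):
-- Create the index just before the monthly reports run
CREATE INDEX IX_Sales_ReportDate ON Sales (ReportDate)

-- ... run the month-end reports ...

-- Then drop it so everyday INSERTs, UPDATEs, and DELETEs aren't slowed
DROP INDEX Sales.IX_Sales_ReportDate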
************************************************************************
If you like to get under the cover of SQL Server to learn more about indexing, take a look at the
sysindexes system table that is found in every database. Here, you can find a wealth of
information on the indexes and tables in your database. To view the data in this table, run this
query from the database you are interested in:
SELECT *
FROM sysindexes
Here are some of the more interesting fields found in this table:
dpages: If the indid value is 0 or 1, then dpages is the count of the data pages used
for the index. If the indid is 255, then dpages equals zero. In all other cases, dpages is
the count of the non-clustered index pages used in the index.
id: Refers to the id of the table this index belongs to.
indid: This column indicates the type of index. For example, 1 is for a clustered index,
a value greater than 1 is for a non-clustered index, and 255 indicates that the table
has text or image data.
OrigFillFactor: This is the original fillfactor used when the index was first created, but
it is not maintained over time.
statversion: Tracks the number of times that statistics have been updated.
status: 2 = unique index, 16 = clustered index, 64 = index allows duplicate rows,
2048 = the index is used to enforce the Primary Key constraint, 4096 = the index is
used to enforce the Unique constraint. These values are additive, and the value you
see in this column may be a sum of two or more of these options.
used: If the indid value is 0 or 1, then used is the number of total pages used for all
index and table data. If indid is 255, used is the number of pages for text or image
data. In all other cases, used is the number of pages in the index.
************************************************************************
Avoid using FLOAT or REAL data types as primary keys, as they add unnecessary
overhead that can hurt performance.
************************************************************************
If you want to boost the performance of a query that includes an AND operator in the
WHERE clause, consider the following:
Of the search criteria in the WHERE clause, at least one of them should be based on
a highly selective column that has an index.
If at least one of the search criteria in the WHERE clause is not highly selective,
consider adding indexes to all of the columns referenced in the WHERE clause.
If none of the columns in the WHERE clause are selective enough to use an index on
their own, consider creating a covering index for this query.
************************************************************************
The Query Optimizer will always perform a table scan or a clustered index scan on a
table if the WHERE clause in the query contains an OR operator and if any of the referenced
columns in the OR clause are not indexed (or do not have a useful index). Because of this, if
you use many queries with OR clauses, you will want to ensure that each referenced column in
the WHERE clause has an index.
************************************************************************
A query with one or more OR clauses can sometimes be rewritten as a series of
queries that are combined with a UNION statement in order to boost the performance of
the query. For example, let's take a look at the following query:
SELECT employeeID, firstname, lastname
FROM names
WHERE dept = 'prod' or city = 'Orlando' or division = 'food'
This query has three separate conditions in the WHERE clause. In order for this query to use an
index, there must be an index on all three columns found in the WHERE clause.
This same query can be written using UNION instead of OR, like this example:
SELECT employeeID, firstname, lastname FROM names WHERE dept = 'prod'
UNION ALL
SELECT employeeID, firstname, lastname FROM names WHERE city = 'Orlando'
UNION ALL
SELECT employeeID, firstname, lastname FROM names WHERE division = 'food'
Each of these queries will produce the same results. If there is only an index on dept, but not on
the other columns in the WHERE clause, then the first version will not use any index, and a table
scan must be performed. But the second version of the query will use the index for part of the
query, though not for all of it.
Admittedly, this is a very simple example, but even so, it demonstrates how rewriting a
query can affect whether or not an index is used. If this query were much more complex,
then the approach of using UNION might be much more efficient, as it allows you to tune each
part of the query individually, something that cannot be done if you use only ORs in your query.
If you have a query that uses ORs and it is not making the best use of indexes, consider rewriting
it as a UNION, and then testing performance. Only through testing can you be sure that one
version of your query will be faster than another.
************************************************************************
The Query Optimizer converts the Transact-SQL IN clause to the OR operator when
parsing your code. Because of this, keep in mind that if the referenced column in your query
doesn't include an index, then the Query Optimizer will perform a table scan or clustered index
scan on the table.
************************************************************************
If you use the SOUNDEX function against a table column in a WHERE clause, the
Query Optimizer will ignore any available indexes and perform a table scan. If your
table is large, this can present a major performance problem. If you need to perform SOUNDEX
type searches, one way around this problem is to pre-calculate the SOUNDEX code for the
column you are searching and then place this value in a column of its own, and then place an
index on this column in order to speed searches.
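One way to implement this is with an indexed computed column, since SOUNDEX is deterministic (the table and column names are hypothetical, and the SET option requirements for indexed computed columns discussed later in this article apply):
-- Store the SOUNDEX code alongside the name, then index it
ALTER TABLE Customers ADD lastname_soundex AS SOUNDEX(lastname)
CREATE INDEX IX_Customers_LastNameSoundex ON Customers (lastname_soundex)

-- Searches can now seek on the indexed code instead of scanning the table
SELECT lastname FROM Customers
WHERE lastname_soundex = SOUNDEX('Smith')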
************************************************************************
If you need to create indexes on large tables in SQL Server 2000, you may be able to
speed up their creation by using the SORT_IN_TEMPDB option available with the CREATE INDEX
command. This option tells SQL Server to use the tempdb database, instead of the current
database, to sort data while creating indexes.
Assuming your tempdb database is isolated on its own separate disk or disk array, then the
process of creating the index can be sped up.
The only slight downside to using this option is that it takes up slightly more disk space than if
you didn't use it, but this shouldn't be much of an issue in most cases. If your tempdb database
is not on its own disk or disk array, then don't use this option, as it can actually slow
performance.
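For example (the table, column, and index names are hypothetical):
-- Sort the index build in tempdb, which ideally sits on its own disk array
CREATE INDEX IX_BigTable_CustomerID
ON BigTable (CustomerID)
WITH SORT_IN_TEMPDB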
************************************************************************
SQL Server 2000 Enterprise Edition (not the standard edition) offers the ability to create
indexes in parallel, greatly speeding index creation. Assuming your server has multiple CPUs,
SQL Server 2000 uses near-linear scaling to boost index creation speed. For example, using two
CPUs instead of one almost halves the time it takes to create indexes.
************************************************************************
SQL Server 2000 offers a function called CHECKSUM. The main purpose of this function is
to create what are called hash indexes. A hash index is an index built on a column that stores
the checksum of the data found in another column in the table. The CHECKSUM function takes
data from another column and creates a checksum value. In other words, the CHECKSUM
function is used to create a mostly unique value that represents other data in your table. In
most cases, the CHECKSUM value will be much smaller than the actual value. For the most part,
checksum values are unique, but this is not guaranteed. It is possible that two slightly different
values may produce the identical CHECKSUM value.
Here's how this works using our music database example. Say we have a song with the title "My
Best Friend is a Mule from Missouri". As you can see, this is a rather long value, and adding an
index to the song title column would make for a very wide index. But in this same table, we can
add a CHECKSUM column that takes the title of the song and creates a checksum based on it. In
this case, the checksum would be 1866876339. The CHECKSUM function always works the
same, so if you perform the CHECKSUM function on the same value many different times, you
would always get the same result.
So how does the CHECKSUM help us? The advantage of the CHECKSUM function is that instead
of creating a wide index by using the song title column, we create an index on the CHECKSUM
column instead. "That's fine and dandy, but I thought you wanted to search by the song's title?
How can anybody ever hope to remember a checksum value in order to perform a search?"
Here's how. Take a moment to review this code:
SELECT title, artist, composer
FROM songs
WHERE title = 'My Best Friend is a Mule from Missouri'
AND checksum_title = CHECKSUM('My Best Friend is a Mule from Missouri')
In this example, it appears that we are asking the same question twice, and in a sense, we are.
The reason we have to do this is because there may be checksum values that are identical,
even though the names of the songs are different. Remember, unique checksum values are not
guaranteed.
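For reference, here is a sketch of how the supporting column and index might be created for this example (the songs table layout is hypothetical):
-- Add a computed column holding the checksum of the title, then index it
ALTER TABLE songs ADD checksum_title AS CHECKSUM(title)
CREATE INDEX IX_songs_checksum_title ON songs (checksum_title)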
Here's how the query works. When the Query Optimizer examines the WHERE clause, it
determines that there is an index on the checksum_title column. And because the
checksum_title column is highly selective (minimal duplicate values) the Query Optimizer
decides to use the index. In addition, the Query Optimizer is able to perform the CHECKSUM
function, converting the song's title into a checksum value and using it to locate the matching
records in the index. Because an index is used, SQL Server can quickly locate the rows that
match the second part of the WHERE clause. Once the rows have been narrowed down by the
index, then all that has to be done is to compare these matching rows to the first part of the
WHERE clause, which will take very little time.
This may seem like a lot of work to shorten the width of an index, but in many cases, this extra
work will pay off in better performance in the long run.
Because of the nature of this tip, I suggest you experiment using this method, and the more
conventional method of creating an index on the title column itself. Since there are so many
variables to consider, it is tough to know which method is better in your particular situation
unless you give them both a try.
************************************************************************
Some queries can be very complex, involving many tables, joins, and other conditions. I have
seen some queries run over 1000 lines of code (I didn't write them). This can make them
difficult to analyze in order to identify what indexes might be used to help the query perform
better.
For example, perhaps you want to create a covering index for the query and you need to
identify the columns to include in the covering index. Or, perhaps you want to identify those
columns that are used in joins in order to check to see that you have indexes on those columns
used in the joins in order to maximize performance.
To make complex queries easier to analyze, consider breaking them down into their
smaller constituent parts. One way to do this is to simply create lists of the key components
of the query, such as:
List all of the columns that are to be returned
List all of the columns that are used in the WHERE clause
List all of the columns used in the JOINs (if applicable)
List all the tables used in JOINs (if applicable)
Once you have the above information organized in this easy-to-comprehend form, it is much
easier to identify those columns that could potentially make use of indexes when the query is executed.
************************************************************************
Queries that include either the DISTINCT or the GROUP BY clauses can be optimized
by including appropriate indexes. Any of the following indexing strategies can be used:
Include a covering, non-clustered index on the columns used in the DISTINCT or
GROUP BY clauses.
Include a clustered index on the columns in the GROUP BY clause.
Include a clustered index on the columns found in the SELECT clause.
Adding appropriate indexes to queries that include DISTINCT or GROUP BY is most important for
those queries that run often. If a query is rarely run, then adding an index for its benefit may
cause more performance problems than it prevents.
************************************************************************
Computed columns in SQL Server 2000 can be indexed if they meet all of the following
criteria:
The computed column's expression is deterministic. This means that the computed
value must always be the same given the same inputs.
The ANSI_NULLS connection-level option was on when the table was created.
TEXT, NTEXT, or IMAGE data types are not used in the computed column.
The physical connection used to create the index, and all connections used to INSERT,
UPDATE, or DELETE rows in the table, must have these seven SET options properly
configured: ANSI_NULLS = ON, ANSI_PADDING = ON, ANSI_WARNINGS = ON,
ARITHABORT = ON, CONCAT_NULL_YIELDS_NULL = ON, QUOTED_IDENTIFIER = ON,
NUMERIC_ROUNDABORT = OFF.
If you create a clustered index on a computed column, the computed values are stored in the
table, just like with any clustered index. If you create a non-clustered index, the computed value
is stored in the index, not in the actual table.
While adding an index to a computed column is possible, it is rarely advisable. The biggest
problem with doing so is that if the computed column changes, then the index (clustered or non-
clustered) has to also be updated, which contributes to overhead. If there are many computed
values changing, this overhead can significantly hurt performance.
The most common reason you might consider adding an index to a computed column is if you
are using the CHECKSUM() function on a large character column in order to reduce the size of an
index. By using the CHECKSUM() of a large character column, and indexing it instead of the
large character column itself, the size of the index can be reduced, helping to save space and
boost overall performance.
************************************************************************
Many databases experience both OLTP and OLAP queries. As you probably already know,
it is nearly impossible to optimize the indexing of a database that has both types of queries. This
is because for OLTP queries to be fast, there should not be so many indexes that they
hinder INSERT, UPDATE, or DELETE operations. And for OLAP queries to be fast, there should be
as many indexes as needed to speed SELECT queries.
While there are many options for dealing with this dilemma, one option that may work for some
people is a strategy where OLAP queries are mostly (if not entirely) run during off hours
(assuming the database has any off hours), taking advantage of indexes that are added each
night before the OLAP queries begin and dropped once they are complete.
This way, those indexes needed for fast performing OLAP queries will minimally interfere with
OLTP transactions (especially during busy times).
As you can imagine, this strategy can take a lot of planning and work, but in some cases, it can
offer the best performance for databases that experience both OLTP and OLAP queries. Because
it is hard to guess if this strategy will work for you, you will want to test it before putting it into
production.
************************************************************************
Be aware that the MIN() or MAX() functions can take advantage of appropriate
indexes. If you find that you are using these functions often, and your current query is not
taking advantage of current indexes to speed up these functions, consider adding appropriate
indexes.
************************************************************************
If you know that a particular column will be subject to many sorts, consider adding a
unique index to that column. This is because unique columns generally sort faster in SQL Server
than if there are duplicate column data present.
************************************************************************
Whenever you upgrade software that affects SQL Server, be sure to rerun the
Index Tuning Wizard to catch any obvious missing indexes. Application software that is upgraded
often changes the Transact-SQL code (and stored procedures, if used) that accesses SQL Server.
In many cases, the vendor supplying the upgraded code may not have taken into account how
index use might have changed after the upgrade was made. Because of this, it is very important
to get a good Profiler trace of the new code interacting with your database, and then use the
Index Tuning Wizard to help identify any missing indexes. You may be surprised what you find.
************************************************************************
DELETE operations can sometimes be time- and space-consuming. In some environments you
might be able to increase the performance of this operation when using TRUNCATE instead of
DELETE. TRUNCATE will almost instantly be executed. However, TRUNCATE will not work when
there are Foreign Key references present for that table. A workaround is to DROP the constraints
before firing the TRUNCATE. Here's a generic script that will drop all existing Foreign Key
constraints on a specific table:
-- Work table to hold the generated ALTER TABLE ... DROP CONSTRAINT commands
CREATE TABLE dropping_constraints
(
cmd VARCHAR(8000)
)
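The original script is excerpted here; a sketch of how the rest of it might look, using the SQL Server 2000 sysforeignkeys system table (the table name is a placeholder, and you should test this carefully before running it against real data):
-- Generate a DROP CONSTRAINT command for every foreign key referencing the table
INSERT INTO dropping_constraints
SELECT DISTINCT 'ALTER TABLE [' + OBJECT_NAME(fkeyid) + '] DROP CONSTRAINT [' + OBJECT_NAME(constid) + ']'
FROM sysforeignkeys
WHERE rkeyid = OBJECT_ID('your_table_name')

-- Execute each generated command, then clean up
DECLARE @stmt VARCHAR(8000)
DECLARE cmd_cursor CURSOR FOR SELECT cmd FROM dropping_constraints
OPEN cmd_cursor
FETCH NEXT FROM cmd_cursor INTO @stmt
WHILE @@FETCH_STATUS = 0
BEGIN
EXEC (@stmt)
FETCH NEXT FROM cmd_cursor INTO @stmt
END
CLOSE cmd_cursor
DEALLOCATE cmd_cursor
DROP TABLE dropping_constraints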
************************************************************************
If you have any experience with performance tuning SQL Server 6.5, you may have heard that it is
not a good idea to add a clustered index to a column that monotonically increases because it
can cause a "hotspot" on the disk that can cause performance problems. That advice is true in
SQL Server 6.5.
In SQL Server 7.0 and 2000, "hotspots" aren't generally a problem. You would have to have over
1,000 transactions a second before a "hotspot" would negatively affect performance. In fact, a
"hotspot" can be beneficial under these circumstances because it eliminates page splits.
Here's why. If you are inserting new rows into a table that has a clustered index as its primary
key, and the key monotonically increases, this means that each INSERT will physically occur one
after another on the disk. Because of this, page splits won't occur during INSERTs, which in itself
saves overhead. This is because SQL Server has the ability to determine if data being inserted
into a table has a monotonically increasing sequence, and won't perform page splits when this
happens.
If you are inserting a lot of rows into a heap (a table without a clustered index), data is not
inserted in any particular order onto data pages, whether or not the data is monotonically
increasing. This results in SQL Server having to work harder (more reads) to
access the data when requested from disk. On the other hand, if a clustered index is added to a
table, data is inserted sequentially on data pages, and generally less disk I/O is required to
retrieve the data when requested from disk.
If data is inserted into a clustered index in more or less random order, data is often inserted
randomly into physical data pages, which is similar to the problem of inserting any data into a
heap.
So again, often, the best overall recommendation is to add a clustered index to a column that
monotonically increases (assuming there is a column that does so). This is especially true if the
table is subject to many INSERTs, UPDATEs, and DELETEs. But if a table is subject to few data
modifications but many SELECT statements, then this advice is less useful, and other options
for the clustered index should be considered. Read the other tips on this page to learn of
situations where you should place the clustered index on other columns.
************************************************************************
Here are some more good reasons to add a clustered index to every table.
Keep in mind that a clustered index physically orders the data in a table based on a single or
composite column. By contrast, the data in a heap (a table without a clustered index) is not
stored in any particular physical order.
Whenever you need to query the column or columns used for the clustered index, SQL Server
has the ability to sequentially read the data in a clustered index an extent (8 data pages, or
64K) at a time. This makes it very easy for the disk subsystem to read the data quickly from
disk, especially if there is a lot of data to be retrieved.
But if a heap is used, even if you add a non-clustered index on an appropriate column or
columns, because the data is not physically ordered (unless you are using a covering index),
SQL Server has to read the data from disk randomly using 8K pages. This creates a lot of extra
work for the disk subsystem to retrieve the same data, hurting performance.
Another disadvantage of a heap is that when you rebuild indexes to reduce fragmentation,
heaps are not defragmented, because they are not indexes. This means that over time, a heap
will become more and more fragmented, further hurting performance. Adding a clustered index
will ensure that the table can be defragmented when indexes are rebuilt.
These are just several of the many reasons why a clustered index should be added to virtually
all tables.
************************************************************************
Since you can only create one clustered index per table, take extra time to carefully
consider how it will be used. Consider the type of queries that will be used against the table,
and make an educated guess as to which query (the most common one run against the table,
perhaps) is the most critical, and if this query will benefit from having a clustered index.
************************************************************************
Clustered indexes are useful for queries that meet these specifications:
For queries that SELECT by a large range of values or where you need sorted results.
This is because the data is already presorted in the index for you. Examples of this
include when you are using BETWEEN, <, >, GROUP BY, ORDER BY, and aggregates
such as MAX, MIN, and COUNT in your queries.
For queries that look up a record with a unique value (such as an employee number)
and you need to retrieve most or all of the data in the record. This is because the
query is covered by the index. In other words, the data you need is in the index itself,
and SQL Server does not have to read any additional pages to retrieve the data you
want.
For queries that access columns with a limited number of distinct values, such as
columns that hold country or state data. But if the column data has little
distinctiveness, such as columns with yes or no, or male or female values, then you
won't want to "waste" your clustered index on them.
For queries that use the JOIN or GROUP BY clauses.
For queries where you want to return a lot of rows, not just a few. Again, this is
because the data is in the index and does not have to be looked up elsewhere.
************************************************************************
If you run into a circumstance where you need to have a single wide index (a composite index
of three or more columns) in a table, and the rest of the indexes in this table (assuming there
are two or more) will only be one column wide, then consider making the wide index a clustered
index and the other indexes non-clustered indexes.
Why? If the wide index is a clustered index, this means that the entire table is the index itself,
and a large amount of additional disk space will not be required to create the index. But if the
wide index is a non-clustered index, this means SQL Server will have to create a "relatively
large" index, which will indeed consume a large amount of additional disk space. Whenever the
index needs to be used by the query processor, it will be more efficient to access the clustered
index than the non-clustered index, and performance will be better.
************************************************************************
Avoid clustered indexes on columns that are already "covered" by non-clustered
indexes. A clustered index on columns that are already "covered" is redundant. Use the
clustered index for columns that can better make use of it.
************************************************************************
When selecting a column to base your clustered index on, try to avoid columns that are
frequently updated. Every time that a column used for a clustered index is modified, all of the
non-clustered indexes must also be updated, creating additional overhead.
************************************************************************
When selecting a column or columns to include in your clustered index, select the
column (or the first column in a composite index) that contains the data that will most often be
searched in your queries.
************************************************************************
If a table has a clustered index and non-clustered indexes, then performance will be best
optimized if the clustered index is based on a single column that is as narrow as possible. This is
because non-clustered indexes use the clustered index to locate data rows and because non-
clustered indexes must hold the clustered keys within their B-tree structures. This helps to
reduce not only the size of the clustered index, but all non-clustered indexes on the table as
well.
************************************************************************
The primary key you select for your table should not always be a clustered index. If
you create the primary key and don't specify otherwise, this is the default. Only make the
primary key a clustered index if you will be regularly performing range queries on the primary
key or need your results sorted by the primary key.
************************************************************************
If you drop or change a clustered index, keep in mind that all of the non-clustered indexes
also have to be rebuilt. Also keep in mind that to recreate a new clustered index, you will need
free disk space equivalent to 1.2 times the size of the table you are working with. This space is
necessary to recreate the entire table while maintaining the old table until the new table is
recreated. Then the old table is deleted.
************************************************************************
When deciding on whether to create a clustered or non-clustered index, it is often helpful to
know what the estimated size of the clustered index will be. Sometimes, the size of a
clustered index on a particular column or columns may be very large, leading to database size
bloat.
************************************************************************
Clustered index values often repeat many times in a table's non-clustered storage structures,
and a large clustered index value can unnecessarily increase the physical size of a non-
clustered index. This increases disk I/O and reduces performance when non-clustered indexes
are accessed.
Because of this, ideally a clustered index should be based on a single column (not
multiple columns) that is as narrow as possible. This not only reduces the clustered index's
physical size, it also reduces the physical size of non-clustered indexes and boosts SQL Server's
overall performance.
************************************************************************
When you create a clustered index, try to create it as a UNIQUE clustered index, not a non-
unique clustered index. The reason for this is that while SQL Server will allow you to create a
non-unique clustered index, under the surface SQL Server will make it unique for you by adding
a 4-byte "uniqueifier" to the index key to guarantee uniqueness. This only serves to increase the
size of the key, which increases disk I/O and reduces performance. If you specify that your
clustered index is UNIQUE when it is created, you will prevent this unnecessary overhead.
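For example (the table and column names are hypothetical):
-- Declaring the clustered index UNIQUE avoids the hidden 4-byte uniqueifier
CREATE UNIQUE CLUSTERED INDEX IX_Customers_CustomerID
ON Customers (CustomerID)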
************************************************************************
If possible, avoid adding a clustered index to a GUID column (uniqueidentifier data type).
GUIDs take up 16 bytes of storage, more than an IDENTITY column, which in turn makes the
index larger, which increases I/O reads, which can hurt performance. While the performance hit
will be minimal if you do decide to add a clustered index to a GUID column, every little bit adds up.
************************************************************************
If a column in a table is not at least 95% unique, then most likely the query optimizer will
not use a non-clustered index based on that column. Because of this, don't add non-clustered
indexes to columns that aren't at least 95% unique. For example, a column with "yes" or "no" as
the data won't be at least 95% unique.
************************************************************************
If your table needs a clustered index, be sure it is added to the table before you add
any non-clustered indexes. If you don't, when you add a clustered index to your table, all of
the pre-existing non-clustered indexes will have to be rebuilt (which is done automatically when
the clustered index is built), putting an unnecessary strain on your server.
************************************************************************
To determine the selectivity on an index on a given table, run this command: DBCC
SHOW_STATISTICS (table_name, index_name). The higher the selectivity of an index, the greater
the likelihood it will be used by the query optimizer.
************************************************************************
When deciding whether or not to add a non-clustered index to a column of a table, it
is useful to first find out how selective it is. By this, what we want to know is the ratio of
unique rows to total rows (based on a specific column) found in the table. Generally, if a column
is not more than 95% unique, then the Query Optimizer might not even use the index. If this is
the case, then adding the non-clustered index may be a waste of disk space. In fact, adding a
non-clustered index that is never used will hurt a table's performance.
Another useful reason to determine the selectivity of a column is to decide the best order in
which to position columns in a composite index. This is because you will get the best
performance out of a composite index if the columns are arranged so that the most selective is
the first one, the next most selective the second one, and so on.
So how do you determine the selectivity of a column? One way is to run the following script on
any column you are considering for a non-clustered index. This example script is designed to be
used with the Northwind database, so you will need to modify it appropriately for your use.
-- Calculate the selectivity of [Order Details].OrderID (unique values / total rows)
DECLARE @total_unique FLOAT, @total_rows FLOAT, @selectivity_ratio FLOAT
SELECT @total_unique = COUNT(DISTINCT OrderID) FROM [Order Details]
SELECT @total_rows = COUNT(*) FROM [Order Details]
SELECT @selectivity_ratio = (@total_unique / @total_rows) * 100
SELECT @selectivity_ratio AS 'Selectivity Ratio (%)'
The result in this case is about 38%, which means that adding a non-clustered index to the OrderID
column of the Order Details table in the Northwind database is probably not a very good idea.
************************************************************************
In some cases, even though a column (or columns if a composite index) has a non-clustered
index, the Query Optimizer may not use it (even though it should), instead performing a table
scan (if the table is a heap) or a clustered index scan (if there is a clustered index). This, of
course, can produce unwanted performance problems.
This particular problem can occur when there is a data correlation between the order of the
rows in the table, and the order of the non-clustered index entries. This can occur when there is
correlation between the clustered index and the non-clustered index. For example, the clustered
index may be created on a date column, and the non-clustered index might be created on an
invoice number column. If this is the case, then there is a correlation (or direct relationship)
between the increasing dates and the increasing invoice numbers found in each row.
The reason this problem occurs is that the Query Optimizer assumes there is no correlation,
and it makes its optimization decisions based on this assumption.
If you run into this problem, one potential workaround is to use an index hint to force the Query
Optimizer to use the non-clustered index, testing the query plan with and without the hint to
verify that it actually helps.
************************************************************************
When you think of page splits, you normally only think of clustered indexes. This is because
clustered indexes enforce the physical order of the index, and page splitting can be a problem if
the clustered index is based on a non-incrementing column. But what has this to do with non-
clustered indexes? While non-clustered indexes use the clustered index key (assuming the table
is not a heap) as their row locator, most people don't realize that non-clustered indexes can
suffer from page splitting, and because of this, need to have an appropriate fillfactor and
pad_index set for them.
Here's an example of how non-clustered indexes can experience page splits. Let's say you have
a table that has a clustered index on it, such as customer number. Let's also say that you have
a non-clustered index on the zip code column. As you can quite well imagine, the data in the zip
code column will have no relation to the customer number and will be more or less random, and
data will have to be inserted into the zip code index randomly. Like clustered index pages, non-
clustered index pages can experience page splitting. So just as with clustered indexes, non-
clustered indexes need to have an appropriate fillfactor and pad_index, and also be rebuilt on a
periodic basis.
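For example, a sketch of building such a non-clustered index with room left for random inserts (the 80% fillfactor is only an illustration; tune it to your actual insert volume):
-- Leave 20% free space on leaf pages (and, via PAD_INDEX, on intermediate
-- pages) so random zip code inserts don't immediately cause page splits
CREATE INDEX IX_Customers_ZipCode
ON Customers (ZipCode)
WITH PAD_INDEX, FILLFACTOR = 80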