17 Ways To Speed Your SQL Queries
It’s easy to create database code that slows down query results or
ties up the database unnecessarily -- unless you follow these tips.
BY SEAN MCCOWN
These techniques should give you a little more insight into the minds of your DBAs, as well as the ability to start thinking of processes in a production-oriented way.

1 Don’t use UPDATE instead of CASE
Say you want anyone with more than $100,000 in orders to be labeled as “Preferred.” Thus, you insert the data into the table and run an UPDATE statement to set the CustomerRank column to “Preferred” for anyone who has more than $100,000 in orders. The problem is that the UPDATE statement is logged, which means it has to write twice for every single write to the table. The way around this, of course, is to use an inline CASE statement in the SQL query itself. This tests every row for the order amount condition and sets the “Preferred” label before the row is written to the table. The performance increase can be staggering.
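A minimal sketch of the idea, with made-up table and column names (a #Customers staging table and a TotalOrders column); the label is set inline as the rows are inserted:

INSERT INTO #Customers (CustomerID, TotalOrders, CustomerRank)
SELECT c.CustomerID,
       c.TotalOrders,
       CASE WHEN c.TotalOrders > 100000 THEN 'Preferred' ELSE 'Standard' END
FROM dbo.Customers AS c;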
2 Don’t blindly reuse code
This issue is also very common. It’s very easy to reuse someone else’s code because you know it pulls the data you need. The problem is that it often pulls far more data than you need, and developers rarely bother to trim it down, so you end up with a huge superset of data. Trim reused code to your exact needs and you can see big performance gains.

3 Do pull only the number of columns you need
This issue is similar to issue No. 2, but it’s specific to columns. It’s all too easy to code all your queries with SELECT * instead of listing the columns individually. The problem again is that it pulls more data than you need. I’ve seen this error dozens and dozens of times. A developer does a SELECT * query against a table with 120 columns and millions of rows, but winds up using only three to five of them. At that point, you’re processing so much more data than you need that it’s a wonder the query returns at all. You’re not only processing more data than you need, but you’re also taking resources away from other processes.
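A quick illustrative sketch, with hypothetical column names, of listing only what the caller actually uses:

SELECT c.CustomerID, c.FirstName, c.LastName
FROM dbo.Customers AS c;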
4 Don’t double-dip
Here’s another one I’ve seen more times than I should have: A stored procedure is written to pull data from a table with hundreds of millions of rows. The developer needs customers who live in California and have incomes of more than $40,000. So he queries for customers that live in California and puts the results into a temp table; then he queries for customers with incomes above $40,000 and puts those results into another temp table. Finally, he joins both tables to get the final product.
Are you kidding me? This should be done in a single query; instead, you’re double-dipping a superlarge table. Don’t be a moron: Query large tables only once whenever possible -- you’ll find how much better your procedures perform.
A slightly different scenario is when a subset of a large table is needed by several steps in a process, which causes the large table to be queried each time. Avoid this by querying for the subset and persisting it elsewhere, then pointing the subsequent steps to your smaller data set.
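Here is a minimal single-query sketch of that example, assuming hypothetical State and Income columns:

SELECT c.CustomerID, c.State, c.Income
FROM dbo.Customers AS c
WHERE c.State = 'CA'
  AND c.Income > 40000;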
5 Do know when to use temp tables
This issue is a bit harder to get a handle on, but it can yield impressive gains. You can use temp tables in a number of situations, such as keeping you from double-dipping into large tables. You can also use them to greatly decrease the processing power required to join large tables. If you must join a table to a large table and there’s a condition on that large table, you can improve performance by pulling out the subset of data you need from the large table into a temp table and joining with that instead. This is also helpful (again) if you have several queries in the procedure that have to make similar joins to the same table.
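For instance, a sketch with illustrative table names: pull the filtered subset of the large table into a temp table once, then join against it:

SELECT o.OrderID, o.CustomerID, o.OrderTotal
INTO #RecentOrders
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20250101';   -- the condition on the large table

SELECT c.CustomerID, r.OrderID, r.OrderTotal
FROM dbo.Customers AS c
JOIN #RecentOrders AS r
    ON r.CustomerID = c.CustomerID;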
6 Do pre-stage data
This is one of my favorite topics because it’s an old technique that’s often overlooked. If you have a report or a procedure (or better yet, a set of them) that will do similar joins to large tables, it can be a benefit for you to pre-stage the data by joining the tables ahead of time and persisting them into a table. Now the reports can run against that pre-staged table and avoid the large join.
You’re not always able to use this technique, but when you can, you’ll find it is an excellent way to save server resources.
Note that many developers get around this join problem by concentrating on the query itself and creating a view around the join so that they don’t have to type the join conditions again and again. But the problem with this approach is that the query still runs for every report that needs it. By pre-staging the data, you run the join just once (say, 10 minutes before the reports) and everyone else avoids the big join. I can’t tell you how much I love this technique; in most environments, there are popular tables that get joined all the time, so there’s no reason why they can’t be pre-staged.
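A minimal sketch of the idea, using hypothetical table names; run it on a schedule shortly before the reports:

-- Refresh the pre-staged table shortly before the reports run.
TRUNCATE TABLE dbo.PreStagedCustomerOrders;

INSERT INTO dbo.PreStagedCustomerOrders (CustomerID, CustomerName, OrderID, OrderTotal)
SELECT c.CustomerID, c.CustomerName, o.OrderID, o.OrderTotal
FROM dbo.Customers AS c
JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID;

-- Reports then select from dbo.PreStagedCustomerOrders and skip the big join.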
7 Do delete and update in batches
Here’s another easy technique that gets overlooked a lot. Deleting or updating large amounts of data from huge tables can be a nightmare if you don’t do it right. The problem is that both of these statements run as a single transaction, and if you need to kill them or if something happens to the system while they’re working, the system has to roll back the entire transaction. This can take a very long time. These operations can also block other transactions for their duration, essentially bottlenecking the system.
The solution is to do deletes or updates in smaller batches. This solves your problem in a couple ways. First, if the transaction gets killed for whatever reason, it only has a small number of rows to roll back, so the database returns online much quicker. Second, while the smaller batches are committing to disk, others can sneak in and do some work, so concurrency is greatly enhanced.
Along these lines, many developers have it stuck in their heads that these delete and update operations must be completed the same day. That’s not always true, especially if you’re archiving. You can stretch that operation out as long as you need to, and the smaller batches help accomplish that. If you can take longer to do these intensive operations, spend the extra time and don’t bring your system down.
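A minimal batching sketch, with hypothetical table and column names; each small DELETE commits on its own, so a failure only rolls back one batch:

DECLARE @BatchSize INT = 10000;

WHILE 1 = 1
BEGIN
    -- Each pass deletes and commits one small batch.
    DELETE TOP (@BatchSize)
    FROM dbo.BigTable
    WHERE ArchiveDate < '20240101';

    IF @@ROWCOUNT < @BatchSize BREAK;   -- nothing (or only a partial batch) left
END;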
8 Don’t nest views
Views that call views that call other views can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal. And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.

9 Don’t use cursors
Cursors process rows one at a time, and they can block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time. Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
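A sketch of that temp-table variation, with illustrative names; the live table sees only one short set-based UPDATE:

-- Stage the comparison data once...
SELECT s.KeyID, s.NewStatus
INTO #StatusChanges
FROM dbo.StagingTable AS s;

-- ...then apply it with one short UPDATE instead of touching the live table row by row.
UPDATE t
SET    t.Status = c.NewStatus
FROM   dbo.LiveTable AS t
JOIN   #StatusChanges AS c
       ON c.KeyID = t.KeyID;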
10 Do use table-valued functions
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You’ll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
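A minimal sketch of the conversion, with hypothetical names; an inline table-valued function replaces the per-row scalar function:

-- An inline table-valued function...
CREATE FUNCTION dbo.fn_CustomerOrderTotal (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(o.OrderTotal) AS OrderTotal
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID
);
GO

-- ...invoked with CROSS APPLY, rather than a scalar function called once per row.
SELECT c.CustomerID, t.OrderTotal
FROM dbo.Customers AS c
CROSS APPLY dbo.fn_CustomerOrderTotal(c.CustomerID) AS t;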
11 Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you’re running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
This is a case where understanding that all tables are partitioned slashed hours from a data load.
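A sketch of the technique, assuming hypothetical staging and archive tables with identical structure and indexes on the same filegroup, and an empty target:

-- The move is metadata-only: pages are reassigned, not copied.
ALTER TABLE dbo.DailyLoad
SWITCH TO dbo.DailyLoadArchive;

-- If the load fails, switch the data straight back:
-- ALTER TABLE dbo.DailyLoadArchive SWITCH TO dbo.DailyLoad;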
12 If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they’re doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That’s not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it’s much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) FROM T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than that you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it called from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
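A minimal sketch of the kind of procedure an ORM can be pointed at instead of generating its own query; the names are hypothetical:

CREATE PROCEDURE dbo.GetCustomerOrders
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT o.OrderID, o.OrderDate, o.OrderTotal
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID;
END;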
13 Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it’s not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table’s deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables, reduced the blocking, and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
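A sketch of the fix, with hypothetical table names; each table gets its own short transaction:

DECLARE @Cutoff DATE = '20240101';

BEGIN TRAN;
    DELETE FROM dbo.SalesDetail WHERE CreatedDate < @Cutoff;
COMMIT;

BEGIN TRAN;
    DELETE FROM dbo.SalesHeader WHERE CreatedDate < @Cutoff;
COMMIT;

-- ...repeat for the remaining tables, so each transaction locks only one table.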
14 Don’t use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
15 Don’t cluster on GUID
After all these years, I can’t believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
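A minimal sketch of the alternative, with illustrative names: cluster on a steadily increasing IDENTITY and, if you still need the GUID, keep it in a nonclustered unique column:

CREATE TABLE dbo.Orders
(
    OrderID   INT IDENTITY(1,1) NOT NULL,
    OrderGuid UNIQUEIDENTIFIER NOT NULL CONSTRAINT DF_Orders_Guid DEFAULT NEWID(),
    OrderDate DATETIME2 NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID),
    CONSTRAINT UQ_Orders_Guid UNIQUE NONCLUSTERED (OrderGuid)
);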
16 Don’t count all rows if you only need to see if data exists
It’s a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can’t tell you how often I’ve seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:

SET @CT = (SELECT COUNT(*) FROM dbo.T1);
If @CT > 0
BEGIN
<Do something>
END

It’s completely unnecessary. If you want to check for existence, then do this:

If EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
<Do something>
END

Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you genuinely do need a row count on a large table, you can pull it from the system metadata instead of counting the rows yourself.
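A minimal sketch of reading that row count from metadata (sys.partitions); note the number can lag slightly on a busy system:

SELECT SUM(p.rows) AS ApproximateRowCount
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.T1')
  AND p.index_id IN (0, 1);   -- the heap or the clustered index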
17 Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index.
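One common rewrite, sketched here against the same hypothetical table; it assumes an index on RegionID and works because every value other than 3 falls on one side or the other (a narrower column list than * also helps the index actually get used):

SELECT * FROM Customers WHERE RegionID < 3
UNION ALL
SELECT * FROM Customers WHERE RegionID > 3;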