Managing Large Data Sets in SQL Server 2005 and 2008
Overview
One of SQL Server's strengths is its ability to operate against multiple rows using just one
statement, but taken to excess that strength can become a problem. For example, inserting
20 million rows in a single statement will place numerous page locks, and possibly a table lock,
on that table, which can prevent others from effectively using any application that accesses it.
Alternatively, one could use a cursor to modify one row at a time, but cursors perform poorly and
are not a good solution for processing millions of rows.
This article describes techniques that can be used to make operations against large data sets
more manageable. You will be able to stop and start operations at will and throttle them up or
down depending on how heavily you want to use system resources at any given time.
Breaking it Down
The best way to manage operations on large data sets like this is to break them down into smaller
pieces and perform the operation piece-by-piece. To demonstrate how this can be done, let's
imagine that a daily ETL feed of millions of rows is placed into a staging table named
"new_purchase" and that our operation needs to copy all of these rows into the "purchase" table.
Then let's assume that inserting these rows will take a long time because the "purchase" table
already has millions of rows in it, has numerous indexes on it, and is actively used by the
production application.
So that we have some data to test our solution with, I will first create a "new_purchase" table
and populate it with a list of fake purchases:
CREATE TABLE new_purchase (
purchase_date DATETIME,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2)
)
-- populate a fake table with 100,000 new purchases
SET NOCOUNT ON
DECLARE @i AS INT
SET @i=0
WHILE @i < 100000
BEGIN
    INSERT new_purchase
    SELECT Cast('2009-01-01' AS DATETIME) + Rand() AS purchase_date,
           'SKU' + Cast(Cast(Rand() * 10000000 AS BIGINT) AS VARCHAR(7)) AS item,
           Cast(Rand() * 10 + 1 AS INT) AS quantity,
           Cast(Rand() * 100 + .01 AS DECIMAL(10,2)) AS price
    SET @i = @i + 1
END
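If you want to confirm the test data was generated as expected, a quick count along these lines
(the column alias is just for readability) should return 100,000 rows:

--Sanity check: the staging table should now hold 100,000 rows
SELECT Count(*) AS staged_row_count FROM new_purchase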
Second, I will create a "purchase" table, which is the table that holds the cumulative list of all
purchases in the production system:
--Very large table that contains all purchases and is used by the production system
CREATE TABLE purchase (
id UNIQUEIDENTIFIER,
created_date DATETIME,
purchase_date datetime,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2),
total DECIMAL(14,2)
)
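The production "purchase" table is described above as having numerous indexes on it. If you want
the test table to behave a bit more like production, you could add a few indexes such as the
following; the specific column choices here are my own assumption for testing, not part of any
real production schema:

--Optional indexes to make the test table behave more like a busy production table
--(the column choices below are assumptions for testing purposes)
CREATE CLUSTERED INDEX IX_purchase_created_date ON purchase (created_date)
CREATE NONCLUSTERED INDEX IX_purchase_purchase_date ON purchase (purchase_date)
CREATE NONCLUSTERED INDEX IX_purchase_item ON purchase (item)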
And finally, I will create a stored procedure to copy 1,000 to 20,000 rows at a time from
the "new_purchase" table into the "purchase" table. I will also create a log table to track and
measure each of the copy operations.
--Create a table for tracking
CREATE TABLE load_new_purchase_log (
start_time DATETIME,
end_time DATETIME,
row_count INT
)
GO
--Create a procedure to load new purchases in groups of 1,000-20,000 rows
CREATE PROCEDURE load_new_purchase
AS
BEGIN
    DECLARE @starting_row AS INT

    --If this process had already run, it may have already added some rows.
    --Determine where to start.
    SET @starting_row = IsNull((SELECT Sum(row_count) + 1
                                FROM load_new_purchase_log), 1)
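    --NOTE: the rest of this procedure is a sketch of one way to finish the
    --batch loop described above (copy rows in groups of 1,000-20,000 and log
    --each group); the random batch sizing, the ROW_NUMBER() ordering, and the
    --way id, created_date, and total are derived are assumptions rather than
    --a known, exact implementation.
    DECLARE @batch_size AS INT
    DECLARE @batch_start AS DATETIME
    DECLARE @rows_copied AS INT

    WHILE @starting_row <= (SELECT Count(*) FROM new_purchase)
    BEGIN
        --Pick a batch size between 1,000 and 20,000 rows
        SET @batch_size = Cast(Rand() * 19000 AS INT) + 1000
        SET @batch_start = GetDate()

        --Copy the next batch, numbering the staging rows so the load can
        --resume from where it left off
        ;WITH numbered AS (
            SELECT purchase_date, item, quantity, price,
                   Row_Number() OVER (ORDER BY purchase_date, item) AS row_num
            FROM new_purchase
        )
        INSERT purchase (id, created_date, purchase_date, item, quantity,
                         price, total)
        SELECT NewId(), GetDate(), purchase_date, item, quantity, price,
               quantity * price
        FROM numbered
        WHERE row_num BETWEEN @starting_row
                          AND @starting_row + @batch_size - 1
        SET @rows_copied = @@RowCount

        --Record the batch so a stopped or restarted run knows where to resume
        INSERT load_new_purchase_log (start_time, end_time, row_count)
        VALUES (@batch_start, GetDate(), @rows_copied)

        SET @starting_row = @starting_row + @rows_copied

        --Optional pause between batches to throttle resource usage
        WAITFOR DELAY '00:00:01'
    END
END
GO
With the procedure sketched out, the load can be kicked off with a single call; if it is stopped
partway through, simply running it again picks up where the log says it left off:

--Start (or resume) the batched load
EXEC load_new_purchase

--Review how long each batch took and how many rows it moved
SELECT * FROM load_new_purchase_log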
Wrap-Up
In situations where system performance is important and one needs to operate on a large number
of rows, breaking the operation down into smaller pieces has many benefits, including the ability
to stop and restart the work at will, the ability to throttle it up or down based on how heavily
you want to use system resources, and shorter transactions that avoid long-held locks blocking
the production application.