Managing Large Data Sets in SQL Server 2005 and 2008
Overview
One of SQL Server's strengths is its ability to operate against multiple rows using just one
statement, but taken to excess that strength can become a problem. For example, inserting
20 million rows in a single statement will place numerous page locks, and possibly a table lock,
on that table, which can prevent others from effectively using any application that accesses it.
Alternatively, one could use a cursor to modify one row at a time, but cursors perform poorly and
are not a good solution for processing millions of rows.
This article describes techniques that can be used to make operations against large data sets
more manageable. You will be able to stop and start operations at will and throttle them up or
down depending on how heavily you want to use system resources at any given time.
Breaking it Down
The best way to manage operations on large data sets like this is to break them down into smaller
pieces and perform the operation piece-by-piece. To demonstrate how this can be done, let's
imagine that a daily ETL feed of millions of rows is placed into a staging table named
"new_purchase" and that our operation needs to copy all of these rows into the "purchase" table.
Then let's assume that inserting these rows will take a long time because the "purchase" table
already has millions of rows in it, has numerous indexes on it, and is actively used by the
production application.
So that we have some data to test our solution with, I will first create a "new_purchase" table
and populate it with a list of fake purchases:
CREATE TABLE new_purchase (
purchase_date DATETIME,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2)
)
-- populate a fake table with 100,000 new purchases
SET NOCOUNT ON
DECLARE @i AS INT
SET @i=0
WHILE @i < 100000
BEGIN
    INSERT new_purchase
    SELECT Cast('2009-01-01' AS DATETIME) + Rand() AS purchase_date,
           'SKU' + Cast(Cast(Rand() * 10000000 AS BIGINT) AS VARCHAR(7)) AS item,
           Cast(Rand() * 10 + 1 AS INT) AS quantity,
           Cast(Rand() * 100 + .01 AS DECIMAL(10,2)) AS price
    SET @i = @i + 1
END
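If you want to confirm the test data was generated as expected, a quick count along these lines
(the column alias is just for readability) should return 100,000 rows:

--Sanity check: the staging table should now hold 100,000 rows
SELECT Count(*) AS staged_row_count FROM new_purchase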
Second, I will create a "purchase" table, which is the table that holds the cumulative list of all
purchases in the production system:
--Very large table that contains all purchases and is used by the production system
CREATE TABLE purchase (
id UNIQUEIDENTIFIER,
created_date DATETIME,
purchase_date datetime,
item VARCHAR(10),
quantity INT,
price DECIMAL(10,2),
total DECIMAL(14,2)
)
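The production "purchase" table is described above as having numerous indexes on it. If you want
the test table to behave a bit more like production, you could add a few indexes such as the
following; the specific column choices here are my own assumption for testing, not part of any
real production schema:

--Optional indexes to make the test table behave more like a busy production table
--(the column choices below are assumptions for testing purposes)
CREATE CLUSTERED INDEX IX_purchase_created_date ON purchase (created_date)
CREATE NONCLUSTERED INDEX IX_purchase_purchase_date ON purchase (purchase_date)
CREATE NONCLUSTERED INDEX IX_purchase_item ON purchase (item)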
And finally, I will create a stored procedure to copy 1,000 to 20,000 rows at a time from
the "new_purchase" table into the "purchase" table. I will also create a log table to track and
measure each of the copy operations.
--Create a table for tracking
CREATE TABLE load_new_purchase_log (
start_time DATETIME,
end_time DATETIME,
row_count INT
)
GO
--Create a procedure to load new purchases in groups of 1,000-20,000 rows
CREATE PROCEDURE load_new_purchase
AS
BEGIN
    DECLARE @starting_row AS INT

    --If this process had already run, it may have already added some rows.
    --Determine where to start.
    SET @starting_row = IsNull((SELECT Sum(row_count) + 1
                                FROM load_new_purchase_log), 1)
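    --NOTE: the rest of this procedure is a sketch of one way to finish the
    --batch loop described above (copy rows in groups of 1,000-20,000 and log
    --each group); the random batch sizing, the ROW_NUMBER() ordering, and the
    --way id, created_date, and total are derived are assumptions rather than
    --a known, exact implementation.
    DECLARE @batch_size AS INT
    DECLARE @batch_start AS DATETIME
    DECLARE @rows_copied AS INT

    WHILE @starting_row <= (SELECT Count(*) FROM new_purchase)
    BEGIN
        --Pick a batch size between 1,000 and 20,000 rows
        SET @batch_size = Cast(Rand() * 19000 AS INT) + 1000
        SET @batch_start = GetDate()

        --Copy the next batch, numbering the staging rows so the load can
        --resume from where it left off
        ;WITH numbered AS (
            SELECT purchase_date, item, quantity, price,
                   Row_Number() OVER (ORDER BY purchase_date, item) AS row_num
            FROM new_purchase
        )
        INSERT purchase (id, created_date, purchase_date, item, quantity,
                         price, total)
        SELECT NewId(), GetDate(), purchase_date, item, quantity, price,
               quantity * price
        FROM numbered
        WHERE row_num BETWEEN @starting_row
                          AND @starting_row + @batch_size - 1
        SET @rows_copied = @@RowCount

        --Record the batch so a stopped or restarted run knows where to resume
        INSERT load_new_purchase_log (start_time, end_time, row_count)
        VALUES (@batch_start, GetDate(), @rows_copied)

        SET @starting_row = @starting_row + @rows_copied

        --Optional pause between batches to throttle resource usage
        WAITFOR DELAY '00:00:01'
    END
END
GO
With the procedure sketched out, the load can be kicked off with a single call; if it is stopped
partway through, simply running it again picks up where the log says it left off:

--Start (or resume) the batched load
EXEC load_new_purchase

--Review how long each batch took and how many rows it moved
SELECT * FROM load_new_purchase_log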
Wrap-Up
In situations where system performance is important and one needs to operate on a large number
of rows, breaking the operation down into smaller pieces has many benefits, including the ability
to stop and restart the work at will, the ability to throttle it up or down based on how heavily
you want to use system resources, and shorter transactions that avoid long-held locks blocking
the production application.