Expert T-SQL
Window Functions
in SQL Server 2019
The Hidden Secret to Fast Analytic and
Reporting Queries
—
Second Edition
—
Kathi Kellenberger
Clayton Groom
Ed Pollack
Expert T-SQL Window Functions in SQL Server 2019

Kathi Kellenberger, Edwardsville, IL, USA
Clayton Groom, Smithton, IL, USA
Ed Pollack, Albany, NY, USA
Table of Contents

Acknowledgments .................................................. xv
Introduction ..................................................... xvii
Chapter 9: Hitting a Home Run with Gaps, Islands, and Streaks .... 141
  The Classic Gaps/Islands Problem ............................... 142
    Finding Islands .............................................. 143
    Finding Gaps ................................................. 145
    Limitations and Notes ........................................ 146
  Data Clusters .................................................. 147
  Tracking Streaks ............................................... 153
    Winning and Losing Streaks ................................... 154
    Streaks Across Partitioned Data Sets ......................... 158
  Data Quality ................................................... 168
    NULL ......................................................... 169
    Unexpected or Invalid Values ................................. 170
    Duplicate Data ............................................... 170
  Performance .................................................... 171
  Summary ........................................................ 172
Index ............................................................ 209
About the Authors
Kathi Kellenberger is a data platform MVP and the editor
of Simple Talk at Redgate Software. She has worked with
SQL Server for over 20 years. She is also co-leader of the PASS
Women in Technology Virtual Group and an instructor at
LaunchCode. In her spare time, Kathi enjoys spending time
with family and friends, singing, and cycling.
About the Technical Reviewer
Rodney Landrum went to school to be a poet and a writer.
And then he graduated, so that dream was crushed. He
followed another path, which was to become a professional
in the fun-filled world of information technology. He has
worked as a systems engineer, UNIX and network admin,
data analyst, client services director, and finally as a
database administrator. The old hankering to put words on
paper, while paper still existed, got the best of him, and in
2000 he began writing technical articles, some creative and
humorous, some quite the opposite. In 2010, he wrote SQL Server Tacklebox, a title
his editor disdained, but a book closest to the true creative potential he sought; he still
yearned to do a full book without a single screenshot, which he accomplished in 2019
with his first novel, Chronicles of Shameus. He currently works from his castle office in
Pensacola, FL, as a senior DBA consultant for Ntirety, a division of Hostway/HOSTING.
Foreword
SQL was developed in the 1970s and became standardized through ANSI-approved
committees as a formal standard starting in 1986.[1] Today in 2019, SQL has become the
most widely used declarative language. Along the way, window functions have come
to be an important part of that standard. ANSI does not make standards but plays an
important role in documenting and preserving them. The individual software vendors
voluntarily decide to comply, and it’s the work of the authors of books like this one to
explain SQL use in practical terms.
In my own career in data science and advanced analytics, window functions have
been an important part of several key projects in the past few years. Several years ago, I
made a YouTube video for a user group based on the earlier edition of this book. Since
then, as a career architect at Microsoft, I have advised on applying these functions to
data science. In one project, an input of about 20 features was not yielding adequate
results; using window functions, a team under my leadership (and yes, with direct
coding) quickly grew that number to over 1,000. Beyond the numeric growth, the
accuracy rates improved, and on the business side, the organization is saving millions
of dollars annually on the question it set out to answer. In the past month, I have
encountered an unrelated new project, and a similar story is there: a time-series data
set and an opportunity to grow from under 20 features to a number much larger.
One wonders whether automated machine learning technologies would make
such combinations on their own, and I’m skeptical. Making a robust set of features
from window functions requires not just time-series considerations but also clustering
knowledge based on knowing the data domain. Even if automated technologies make
great progress in this area, I anticipate that any data scientist will need a working
knowledge of these functions for the more typical data science investigation, which has
only a few features and a low number of observations.
SQL is central to on-premises and cloud database technologies – and in the data
science world, many use Apache Spark (part of SQL Server 2019 and so many other data
technologies). This reach into advanced analytics is yet another reason why this topic is
[1] See https://blog.ansi.org/2018/10/sql-standard-iso-iec-9075-2016-ansi-x3-135/
an expert-level subject in the SQL language. The mainstream applications extend from
any business analytics SQL query and even into supporting advanced analytics and
machine learning algorithms.
Over the years, it’s been an honor to individually know Kathi Kellenberger and
Clayton Groom as respected peers and professionals and to see how they have each
become important leaders to the technical user community through many presentations
(for which they typically volunteer their own time) and through the creation and now
revision of this book. In this revision, Ed Pollack has applied material on baseball
statistics, illustrating that not every time series is about money. It’s not enough to have a
standard written, but one needs to have expert coaches to explain how these functions
describe an approach for business analytics. This book has rich examples and altogether
provides a clear path into one of the most mathematically complex and yet practically
useful aspects of the SQL language.
Acknowledgments
The first edition of this book would not have been written except for the suggestion
of one of my friends in the data platform community, Larry Toothman. Sadly, Larry
passed away shortly after the book was published and before I could get a copy to him.
Larry was just getting started with presenting at events and being more involved in the
community during the last couple of years of his life. Who knows what he might have
accomplished if things had turned out differently. Thanks to Larry’s inspiration, people
around the world will learn about windowing functions.
They say it takes a village to raise a child, and the same might be true for a book. I
would like to thank Jonathan Gennick and Jill Balzano for their help and guidance. There
are probably many people at Apress who had a hand in this project that I’ll never meet,
and I would like to thank them as well.
Clayton Groom and Ed Pollack each wrote about their real-world experience using
windowing functions. In each case, the idea for their chapter came from my running
into each of them at user group meetings and just talking about my project. Their
contributions definitely make this a better and more enjoyable book for you.
Thanks to Rodney Landrum for doing a great job on the technical review and to Mark
Tabladillo for the wonderful foreword.
Thank you to my family, especially my husband, Dennis, who takes care of just
about everything around the house. He makes my life so much easier when I take on big
projects like this.
Finally, thank you, dear reader, for learning about windowing functions from this
book. I hope that you enjoy it and can apply the things you learn right away. I would love
to hear from you at events, so don’t be shy!
Introduction
Several years ago, I would create a user group presentation for each new version of SQL
Server about the new T-SQL features. There was so much to say in 2012 that I decided to
build a presentation on just the windowing functions introduced that year. Eventually,
I had so much material that it turned into two sessions. Over the years, I have probably
presented this information at least 50 times at events around the United States and the
United Kingdom. Despite that, most people still are not using windowing functions
because they haven’t heard about them or do not realize the benefits.
Intended Audience
This book is meant for people who already have good T-SQL skills. They know how to
join tables, use subqueries and CTEs, and write aggregate queries. Despite these skills,
they occasionally run into problems that are not easy to solve in a set-based manner.
Without windowing functions, some of these problems can only be solved by using
cursors or expensive triangular joins. By using the concepts taught in this book, your
T-SQL skills will improve to the next level. Once you start using windowing functions,
you’ll find even more reasons to learn them.
CHAPTER 1
Looking Through the Window
SQL Server is a powerful database platform with a versatile query language called
T-SQL. The most exciting T-SQL enhancement over the years, in my opinion, is the
window functions. Window functions enable you to solve query problems in new, easier
ways and, most of the time, with better performance than traditional techniques. They
are a great tool for analytics. You may hear these called “windowing” or “windowed”
functions as well. The three terms are synonymous when talking about this feature.
After the release of SQL Server 2000, SQL Server enthusiasts waited 5 long years for
the next version of SQL Server to arrive. Microsoft delivered an entirely new product
with SQL Server 2005. This version brought SQL Server Management Studio, SQL
Server Integration Services, snapshot isolation, and database mirroring. Microsoft also
enhanced T-SQL with many great features, such as common table expressions (CTEs).
The most exciting T-SQL enhancement of all with 2005 was the introduction of window
functions.
That was just the beginning. Window functions are part of the ANSI SQL standard
beginning with SQL:2003. More of the standard's functionality was released with
version 2012 of SQL Server. In 2019, Microsoft gave some of the window functions
a performance boost with batch mode processing, a feature once reserved for
columnstore indexes. You'll see how this performance feature works in
Chapter 8. Even now, the functionality falls short of the entire specification, so there is
more to look forward to in the future.
This chapter provides a first look at two T-SQL window functions, LAG and
ROW_NUMBER. You will learn just what the window is and how to define it with the OVER
clause. You will also learn how to divide the windows into smaller sections called
partitions.
© Kathi Kellenberger, Clayton Groom, and Ed Pollack 2019
K. Kellenberger et al., Expert T-SQL Window Functions in SQL Server 2019,
https://doi.org/10.1007/978-1-4842-5197-3_1
Note If you would like to follow along with this example, a sample script to
create the StockAnalysisDemo database and generated stock market data can be
found along with the code for this chapter on the Apress site.
For a quick look at how to solve this problem first by using one of the traditional
methods and then by using LAG, review and run Listing 1-1.
USE StockAnalysisDemo;
GO
--1-1.1 Using a subquery
SELECT TickerSymbol, TradeDate, ClosePrice,
(SELECT TOP(1) ClosePrice
FROM StockHistory AS SQ
WHERE SQ.TickerSymbol = OQ.TickerSymbol
AND SQ.TradeDate < OQ.TradeDate
ORDER BY TradeDate DESC) AS PrevClosePrice
FROM StockHistory AS OQ
ORDER BY TickerSymbol, TradeDate;
The partial results are shown in Figure 1-1. Since the data is randomly generated, the
values of ClosePrice and PrevClosePrice in the image will not match your values. Query
1 uses a correlated subquery, the old method, to select one ClosePrice for every outer
row. By joining the TickerSymbol from the inner query to the outer query, you ensure
that you are not comparing two different stocks. The inner and outer queries are also
joined by the TradeDate, but the TradeDate for the inner query must be less than that of
the outer query to make sure you get the prior day. The inner query must also be sorted to
get the row that has the latest data but still less than the current date. This query took
over a minute to run on my laptop, which has 16GB of RAM and is using SSD storage.
Almost 700,000 rows were returned.
Query 2 uses the window function LAG to solve the same problem and produces the
same results. Don’t worry about the syntax at this point; you will be an expert by the end
of this book. The query using LAG took just 13 seconds to run on my laptop.
By just looking at the code in Listing 1-1, you can see that Query 2 using LAG is much
simpler to write, even though you may not understand the syntax just yet. It also runs
much faster because it is just reading the table once instead of once per row like Query 1.
As you continue reading this book and running the examples, you will learn how window
functions like LAG will make your life easier and your customers happier!
Queries with window functions are much different from traditional aggregate
queries. There are no restrictions on the columns that appear in the SELECT list, and no
GROUP BY clause is required. You can also add window functions to aggregate queries,
and that will be discussed in Chapter 3. Instead of summary rows being returned, all
the details are returned and the result of the expression with the window function is
included as just another column. In fact, by using a window function to get the overall
count of the rows, you could still include all of the columns in the table.
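For example, a sketch of that idea against the AdventureWorks table used later in this chapter (this exact query is not one of the book's listings):

```sql
-- COUNT(*) OVER() returns the overall row count on every detail row,
-- with no GROUP BY and no restriction on the other columns selected.
SELECT CustomerID, SalesOrderID, OrderDate,
    COUNT(*) OVER() AS TotalRows
FROM Sales.SalesOrderHeader;
```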
Imagine looking through a window to see a specific set of rows while your query is
running. You have one last chance to perform an operation, such as grabbing one of the
columns from another row. The result of the operation is added as an additional column.
You will learn how window functions really work as you read this book, but the idea of
looking through the window has helped me understand and explain window functions
to audiences at many SQL Server events. Figure 1-2 illustrates this concept.
Figure 1-2. Looking through the window to perform an operation on a set of rows
The window is not limited to the columns found in the SELECT list of the query. For
example, if you take a look at the StockHistory table, you will see that there is also an
OpenPrice column. The OpenPrice from one day is not the same as the ClosePrice from
the previous day. If you wanted to, you could use LAG to include the previous OpenPrice
in the results even though it is not included in the SELECT list originally.
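A sketch of that idea (not one of the book's listings; it assumes the StockHistory columns described above):

```sql
-- OpenPrice is available to LAG even though it is not otherwise selected.
SELECT TickerSymbol, TradeDate, ClosePrice,
    LAG(OpenPrice) OVER(PARTITION BY TickerSymbol
                        ORDER BY TradeDate) AS PrevOpenPrice
FROM StockHistory;
```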
In the stock history example using LAG, each row has its own window where it finds
the previous close price. When the calculation is performed on the third row of the data,
the window consists of the second and third rows. When the calculation is performed on
the fourth row, the window consists of the third and fourth rows.
What would happen if the rows for 2017-12-02 were removed from the query by a
WHERE clause? Does the window contain filtered-out rows? The answer is “No,” which
brings up two very important concepts to understand when using window functions:
where window functions may be used in the query and the logical order of operations.
Window functions may only be used in the SELECT list and ORDER BY clause. You
cannot filter or group on window functions. In situations where you must filter or group
on the results of a window function, the solution is to separate the logic. You could use a
temp table, a derived table (subquery), or a CTE and then filter or group in the outer query.
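For example, to keep only the first order per customer, you could calculate the row number in a CTE and filter in the outer query. This is a sketch rather than one of the book's listings, and it uses PARTITION BY, which is explained later in this chapter:

```sql
-- A window function result cannot appear in WHERE directly,
-- so the logic is separated into a CTE and filtered outside it.
WITH NumberedOrders AS (
    SELECT CustomerID, SalesOrderID,
        ROW_NUMBER() OVER(PARTITION BY CustomerID
                          ORDER BY SalesOrderID) AS RowNumber
    FROM Sales.SalesOrderHeader
)
SELECT CustomerID, SalesOrderID
FROM NumberedOrders
WHERE RowNumber = 1;
```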
Window functions operate after the FROM, WHERE, GROUP BY, and HAVING clauses. They
operate before the TOP and DISTINCT clauses are evaluated. You will learn more about
how DISTINCT and TOP affect queries with window functions in the “Uncovering Special
Case Windows” section later in this chapter.
The window is defined by the OVER clause. Notice in Query 2 of Listing 1-1 that the
LAG function is followed by an OVER clause. Each type of window function has specific
requirements for the OVER clause. The LAG function must have an ORDER BY expression
and may have a PARTITION BY expression.
Note There is one situation in which you will see the OVER keyword in a
query not following a window function, and that is with the sequence object.
The sequence object, introduced with SQL Server 2012, is a bucket containing
incrementing numbers often used in place of an identity column.
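A sketch of that usage (the sequence name here is illustrative, not from the book):

```sql
-- OVER following NEXT VALUE FOR controls the order in which the
-- sequence numbers are assigned to the rows; no window function involved.
CREATE SEQUENCE dbo.DemoSequence START WITH 1 INCREMENT BY 1;

SELECT NEXT VALUE FOR dbo.DemoSequence OVER(ORDER BY CustomerID) AS SeqNumber,
    CustomerID, SalesOrderID
FROM Sales.SalesOrderHeader;
```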
For any type of expression in the SELECT list of a query, a calculation is performed for
each row in the results. For example, if you had a query with the expression Col1 + Col2,
those two columns would be added together once for every row returned. A calculation
is performed for row 1, row 2, row 3, and so on. Expressions with window functions must
also be calculated once per row. In this case, however, the expressions operate over a set
of rows that can be different for each row where the calculation is performed.
The OVER clause determines which rows make up the window. The OVER clause has
three possible components: PARTITION BY, ORDER BY, and the frame. The PARTITION BY
expression divides up the rows, and it’s optional depending on what you are trying to
accomplish. The ORDER BY expression is required for some types of window functions.
Where it is used, it determines the order in which the window function is applied.
Finally, the frame is used for some specific types of window functions to provide even
more granularity. You’ll learn about framing in Chapter 5.
Many T-SQL developers and database professionals have used the ROW_NUMBER
function. They may not have even realized that this is one of the window functions.
There are many situations where adding a row number to the query is a step along the
way to solving a complex query problem.
ROW_NUMBER supplies an incrementing number, starting with one, for each row. The
order in which the numbers are applied is determined by the columns specified in the
ORDER BY expression, which is independent of an ORDER BY clause found in the query
itself. Run the queries in Listing 1-2 to see how this works.
USE AdventureWorks;
GO
--1-2.1 Row numbers applied by CustomerID
SELECT CustomerID, SalesOrderID,
ROW_NUMBER() OVER(ORDER BY CustomerID) AS RowNumber
FROM Sales.SalesOrderHeader;
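Query 2 of Listing 1-2 does not survive in this copy. Since the two queries are said to differ only in the ORDER BY expression inside OVER, it was presumably similar to this sketch:

```sql
--1-2.2 Row numbers applied by SalesOrderID (a reconstruction)
SELECT CustomerID, SalesOrderID,
    ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS RowNumber
FROM Sales.SalesOrderHeader;
```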
The OVER clause follows the ROW_NUMBER function. Inside the OVER clause, you will see
ORDER BY followed by one or more columns. The difference between Queries 1 and 2 is
just the ORDER BY expression within the OVER clause. Notice in the partial results shown
in Figure 1-3 that the row numbers end up applied in the order of the column found
in the ORDER BY expression of the OVER clause, which is also the order that the data is
returned. Since the data must be sorted to apply the row numbers, it is easy for the data
to stay in that order, but it is not guaranteed. The only way to ever actually guarantee the
order of the results is to add an ORDER BY to the query.
Figure 1-3. Partial results of using ROW_NUMBER with different OVER clauses
If the query itself has an ORDER BY clause, it can be different than the ORDER BY within
OVER. Listing 1-3 demonstrates this.
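The code of Listing 1-3 does not survive in this copy. Based on the description that follows, it was presumably similar to this sketch:

```sql
--1-3.1 Row numbers applied by CustomerID, results returned by SalesOrderID
SELECT CustomerID, SalesOrderID,
    ROW_NUMBER() OVER(ORDER BY CustomerID) AS RowNumber
FROM Sales.SalesOrderHeader
ORDER BY SalesOrderID;
```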
In this case, the row numbers are applied in order of the CustomerID, but the results
are returned in order of SalesOrderID. The partial results are shown in Figure 1-4. In
order to show that the row numbers are applied correctly, the figure shows the grid
scrolled down to the first customer, CustomerID 11000.
Figure 1-4. Partial results of a query with a different ORDER BY than the
OVER clause
Just like the ORDER BY clause of a query, you can specify a descending order with the
DESC keyword within the OVER clause, as shown in Listing 1-4.
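The code of Listing 1-4 does not survive in this copy; given the description, it was presumably similar to this sketch:

```sql
--1-4.1 Row numbers applied in descending CustomerID order (a reconstruction)
SELECT CustomerID, SalesOrderID,
    ROW_NUMBER() OVER(ORDER BY CustomerID DESC) AS RowNumber
FROM Sales.SalesOrderHeader;
```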
Figure 1-5 shows partial results. Since it was easy for the database engine to return
the results in descending order by CustomerID, you can easily see that row number 1 was
applied to the largest CustomerID.
In the SalesOrderHeader table, the CustomerID is not unique. Notice in the last
example that 30118 is the largest CustomerID. The row number with SalesOrderID 71803
is 1 and with 65221 is 2. There is no guarantee that the row numbers will be assigned
exactly this way as long as the lowest RowNumbers are lined up with CustomerID
30118. To ensure that the row numbers line up as expected, use a unique column or
combination of columns in the ORDER BY expression of the OVER clause. If you use more
than one column, separate the columns with commas. You could even apply the row
numbers in a random order. Listing 1-5 demonstrates this.
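The code of Listing 1-5 does not survive in this copy; a sketch of the technique it describes:

```sql
--1-5.1 Row numbers applied in random order (a reconstruction)
-- NEWID() generates a different GUID per row, so the ordering,
-- and therefore the numbering, changes on every execution.
SELECT CustomerID, SalesOrderID,
    ROW_NUMBER() OVER(ORDER BY NEWID()) AS RowNumber
FROM Sales.SalesOrderHeader;
```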
By using the NEWID function, the row numbers are applied in a random fashion.
Figure 1-6 shows this. If you run the code, you will see different CustomerID values
aligned with the row numbers. Each time, the data is returned in order of row number,
just because it is easy for the database engine to do so.
As you may guess, applying the row numbers in a specific order involves sorting,
which is an expensive operation. If you wish to generate row numbers but do not care
about the order, you can use a subquery selecting a literal value in place of a column
name. Listing 1-6 demonstrates how to do this.
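The code of Listing 1-6 does not survive in this copy. Based on the discussion of Figure 1-7, its three queries were presumably similar to this sketch (the outer ORDER BY column in Query 2 is an assumption):

```sql
--1-6.1 A subquery returning a constant in place of the ORDER BY column
SELECT CustomerID, SalesOrderID,
    ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RowNumber
FROM Sales.SalesOrderHeader;

--1-6.2 The same OVER clause, but the query itself has an ORDER BY clause
SELECT CustomerID, SalesOrderID,
    ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RowNumber
FROM Sales.SalesOrderHeader
ORDER BY RowNumber;

--1-6.3 No ROW_NUMBER and no ORDER BY
SELECT CustomerID, SalesOrderID
FROM Sales.SalesOrderHeader;
```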
Figure 1-7 shows the partial results. In Queries 1 and 2, a subquery selecting a
constant replaces the ORDER BY column. The OVER clauses are identical, but the row
numbers are applied differently, in the easiest way possible. The difference between the
two queries is that Query 2 has an ORDER BY clause. Since there is no specific order for
the row numbers to be assigned, the easiest way is the order that the results would be
returned even if the ROW_NUMBER function was not there. Query 3 shows how the rows
are returned with no ROW_NUMBER and no ORDER BY. You may be wondering why the
optimizer chose to return the results in Queries 1 and 3 in CustomerID order. There
just happens to be a nonclustered index on CustomerID covering those queries. The
optimizer chose the index that is ordered on CustomerID to solve the queries.
Figure 1-7. Partial results of letting the engine decide how row numbers are applied
Figure 1-8 shows the partial results. Notice that SalesOrderID is assigned 1 in Query
1 and 3 in Query 2. The only difference between the two queries is the ORDER BY clause.
Since CustomerID 11000 has three orders, numbers 1, 2, and 3 must be assigned to the
three rows, but there is no guarantee how they will be assigned.
There are a couple of things that you cannot do with window functions that may be
related to determinism. You cannot use a window function expression in a computed
column (a column in a table composed of an expression), and you cannot use a window
function expression as a key for the clustered index of a view.
The ORDER BY expression in the OVER clause is quite versatile. You can use an
expression instead of a column as was shown with NEWID in the earlier example. You
can also list multiple columns or expressions. Listing 1-8 demonstrates using a CASE
expression in the ORDER BY.
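The code of Listing 1-8 does not survive in this copy; based on the description of Figure 1-9, it was presumably similar to this sketch:

```sql
--1-8.1 A CASE expression in the ORDER BY of the OVER clause (a reconstruction)
-- Orders from 2014 are numbered first; ties are broken by SalesOrderID.
SELECT CustomerID, SalesOrderID, OrderDate,
    ROW_NUMBER() OVER(ORDER BY CASE WHEN YEAR(OrderDate) = 2014
                                    THEN 0 ELSE 1 END,
                      SalesOrderID) AS RowNumber
FROM Sales.SalesOrderHeader;
```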
Figure 1-9 shows the partial results. In this case, the row numbers are applied first
to the orders from 2014 and then by SalesOrderID. The grid is scrolled down to the last
three orders of 2014 so you can see that the next numbers applied are from the beginning
of the data, 2011.
There are two additional components of the OVER clause: partitioning and framing.
You will learn about framing, introduced in 2012, in Chapter 5. Partitioning divides the
window into multiple, smaller windows, and you’ll learn about that next.