119 SQL Code Smells PDF
Contents

Introduction
Problems with Database Design
Problems with Table Design
Problems with Data Types
Problems with Expressions
Difficulties with Query Syntax
Problems with Naming
Problems with Routines
Security Loopholes
Acknowledgements
Introduction
Once you've done a number of SQL code reviews,
you'll be able to identify signs in the code that
indicate all might not be well. These code smells
are coding styles that, while not bugs, suggest
design problems with the code.
Kent Beck and Massimo Arnoldi seem to have coined the term
CodeSmell in the Once And Only Once page of www.C2.com, where
Kent also said that code wants to be simple. Kent Beck and Martin
Fowler expand on the issue of code challenges in their essay Bad Smells
in Code, published as Chapter 3 of the book Refactoring: Improving the
Design of Existing Code (ISBN 978-0201485677).
Although there are generic code smells, SQL has its own particular
habits that will alert the programmer to the need to refactor code. (For
grounding in code smells in C#, see Exploring Smelly Code and Code
Deodorants for Code Smells by Nick Harrison.) Plamen Ratchev's
wonderful article Ten Common SQL Programming Mistakes lists some
of these code smells along with out-and-out mistakes, but there are
more. The use of nested transactions, for example, isn't entirely incorrect,
even though the database engine ignores all but the outermost, but
their use does flag the possibility that the programmer thinks nested
transactions are supported.
If you are moving towards continuous delivery of database applications,
you should automate as much as possible the preliminary SQL code-
review. It's a lot easier to trawl through your code automatically to pick
out problems than to do so manually. Imagine having something like
the classic lint tools used for C or, better still, a tool similar to Jonathan
Peli de Halleux's Code Metrics plug-in for .NET Reflector, which finds
code smells in .NET code.
One can be a bit defensive about SQL code smells. I will cheerfully write
very long stored procedures, even though they are frowned upon. I'll
even use dynamic SQL on occasion. You should use code smells only
as an aid. It is fine to sign them off as being inappropriate in certain
circumstances. In fact, whole classes of code smells may be irrelevant
for a particular database. The use of proprietary SQL, for example, is
only a code smell if there is a chance that the database will be ported
to another RDBMS. The use of dynamic SQL is a risk only with certain
security models. Ultimately, you should rely on your own judgment. As
the saying goes, a code smell is a hint of possible bad practice to a
pragmatist, but a sure sign of bad practice to a purist.
A related code smell is:

Using inappropriate data types

Although a business may choose to represent a date
as a single string of numbers, or require codes that
mix text with numbers, it is unsatisfactory to store
such data in columns that don't match the actual data
type. This confuses the presentation of data with its
storage. Dates, money, codes and other business data
can be represented in a human-readable form (the
presentation mode), in their storage form, or in their
data-interchange form.
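A minimal sketch of the difference, with invented table and column names: store the value in its natural type, and produce the business format only at presentation time.

```sql
-- Smelly: a date kept as a string accepts impossible values
-- ('20240230' would be stored happily) and defeats date arithmetic.
CREATE TABLE dbo.Orders_Smelly
  (
  OrderID INT NOT NULL PRIMARY KEY,
  OrderDate VARCHAR(8) NOT NULL
  );

-- Better: store a DATE, and format only for display.
CREATE TABLE dbo.Orders
  (
  OrderID INT NOT NULL PRIMARY KEY,
  OrderDate DATE NOT NULL
  );

SELECT OrderID,
       CONVERT(CHAR(8), OrderDate, 112) AS OrderDateCode -- yyyymmdd presentation form
  FROM dbo.Orders;
```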
2. Storing the hierarchy structure in the same table as the entities that make up the hierarchy
Self-referencing tables seem like an elegant way to
represent hierarchies. However, such an approach
mixes relationships and values. Real-life hierarchies
need more than a parent-child relationship. The
Closure Table pattern, where the relationships are held
in a table separate from the data, is much more suitable
for real-life hierarchies. Also, in real life, relationships
tend to have a beginning and an end, and this often needs
to be recorded. The HIERARCHYID data type and the
common language runtime (CLR) SqlHierarchyId class
are provided to make tree structures represented by
self-referencing tables more efficient, but they are likely
to be appropriate for only a minority of applications.
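A minimal Closure Table sketch (all names invented): the entities live in one table, and every ancestor-descendant pair, with its period of validity, in another.

```sql
CREATE TABLE dbo.Employee
  (
  EmployeeID INT NOT NULL PRIMARY KEY,
  EmployeeName NVARCHAR(100) NOT NULL
  );

CREATE TABLE dbo.EmployeeHierarchy
  (
  AncestorID INT NOT NULL REFERENCES dbo.Employee,
  DescendantID INT NOT NULL REFERENCES dbo.Employee,
  Depth INT NOT NULL,      -- 0 = self, 1 = direct report, etc.
  ValidFrom DATE NOT NULL, -- relationships have a beginning...
  ValidTo DATE NULL,       -- ...and, often, an end
  PRIMARY KEY (AncestorID, DescendantID, ValidFrom)
  );

-- All current reports (direct and indirect) of employee 1:
SELECT e.EmployeeName
  FROM dbo.EmployeeHierarchy AS h
  JOIN dbo.Employee AS e ON e.EmployeeID = h.DescendantID
 WHERE h.AncestorID = 1 AND h.Depth > 0 AND h.ValidTo IS NULL;
```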
4. Using a polymorphic association

Sometimes, one sees table designs which have keys that
can reference more than one table, whose identity is
usually denoted by a separate column. This is where an
entity can relate to one of a number of different entities
according to the value in another column that provides
the identity of the entity. This sort of relationship
cannot be subject to foreign key constraints, and any
joins are difficult for the query optimizer to provide
good plans for. Also, the logic for the joins is likely to get
complicated. Instead, use an intersection table, or if you
are attempting an object-oriented mapping, look at the
method by which SQL Server represents the database
metadata by creating an object supertype class that all
of the individual object types extend. Both these devices
give you the flexibility of design that polymorphic
associations attempt.
5. Creating tables as God Objects

God Tables are usually the result of an attempt to
encapsulate a large part of the data for the
business domain in a single wide table. This is
usually a normalization error, or rather, a rash and
over-ambitious attempt to denormalize the database
structure. If you have a table with many columns, it is
likely that you have come to grief on the third normal
form. It could also be the result of believing, wrongly,
that all joins come at great and constant cost. Normally
they can be replaced by views or table-valued functions.
Indexed views can have maintenance overhead but are
greatly superior to denormalization.
6. Contrived interfaces

Quite often, the database designer will need to create
an interface to provide an abstraction layer between
schemas within a database, between database and
ETL processes, or between a database and application.
You face a choice between uniformity and simplicity.
Overly complicated interfaces, for whatever reason,
should never be used where a simpler design would
suffice. It is always best to choose simplicity over
conformity. Interfaces have to be clearly documented
and maintained, as well as understood.
7. Using command-line and OLE automation to access server-based resources
In designing a database application, there is sometimes
functionality that cannot be done purely in SQL, usually
when other server-based, or network-based resources
must be accessed. Now that SQL Server's integration
with PowerShell is so much more mature, it is better to
use that, rather than xp_cmdshell or sp_OACreate (or
similar), to access the file system or other server-based
resources. This needs some thought and planning.
You should also use SQL Agent jobs when possible to
schedule your server-related tasks. This requires
up-front design to prevent them becoming
unmanageable monsters, prey to ad hoc growth.
Problems with Table Design

10. Leaving referential integrity constraints disabled
Some scripting engines disable
referential integrity during updates.
You must ensure that WITH CHECK is
enabled or else the constraint is marked
as untrusted and therefore won't be
used by the optimizer.
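As a sketch with an invented constraint name, this re-enables a disabled foreign key so that it is trusted again; the doubled keyword is deliberate (WITH CHECK validates the existing rows, CHECK CONSTRAINT re-enables the constraint):

```sql
ALTER TABLE dbo.OrderLine
  WITH CHECK CHECK CONSTRAINT FK_OrderLine_Order;

-- Confirm the optimizer can now rely on it: is_not_trusted = 0.
SELECT name, is_disabled, is_not_trusted
  FROM sys.foreign_keys
 WHERE name = N'FK_OrderLine_Order';
```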
12. Misusing NULL values

The three-valued logic required to handle NULL values
can cause problems in reporting, computed values
and joins. A NULL value means unknown, so any
sort of mathematics or concatenation will result in
an unknown (NULL) value. Table columns should be
nullable only when they really need to be. Although
it can be useful to signify that the value of a column
is unknown or irrelevant for a particular row, NULLs
should be permitted only when they're legitimate for
the data and application, and fenced around to avoid
subsequent problems.
14. Creating a table without specifying a schema
If you're creating tables from a script, they must, like
views and routines, always be defined with two-part
names. It is possible for different schemas to contain
the same table name, and there are some perfectly
legitimate reasons for doing this.
17. Defining a table column without explicitly specifying whether it is nullable
The default nullability for a database's columns can
be altered as a setting. Therefore, one cannot assume
whether a column will default to NULL or NOT NULL.
It is safest to specify it in the column definition, and it is
essential if you need any portability of your table design.
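A sketch (names invented) of spelling nullability out rather than relying on the database's default-nullability setting:

```sql
CREATE TABLE dbo.Person
  (
  PersonID INT NOT NULL PRIMARY KEY,
  Surname NVARCHAR(50) NOT NULL, -- required by design
  MiddleName NVARCHAR(50) NULL   -- explicitly optional
  );
```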
18. Creating dated copies of the same table to manage table sizes

Now that SQL Server supports table partitioning, it is far
better to use partitions than to create dated tables, such as
Invoices2012, Invoices2013, etc. If old data is no longer used,
archive the data, store only aggregations, or both.
Problems with Data Types

19. Using VARCHAR(1), VARCHAR(2), etc.
Columns of short or fixed length should have a fixed
size since variable-length types have a disproportionate
storage overhead. For a large table, this could be
significant.
21. Using the MONEY data type

The MONEY data type confuses the storage of data
values with their display, though it clearly suggests, by
its name, the sort of data held. Using the DECIMAL
data type is almost always better.
24. Using DATETIME or DATETIME2 when you're concerned only with the date
Even with data storage being so cheap, a saving in a
data type adds up and makes comparison and
calculation easier. When appropriate, use the DATE
or SMALLDATETIME type.
27. Using the TIME data type to store a duration rather than a point in time
Durations are best stored as a start date/time value and
end date/time value. This is especially true given that
you usually need the start and end points to calculate
a duration. It is possible to use a TIME data type if the
duration is less than 24 hours, but this is not what the
type is intended for and can cause confusion for the
next person who has to maintain your code.
28. Using VARCHAR(MAX) or NVARCHAR(MAX) when it isn't necessary
VARCHAR types that specify a number rather than
MAX have a finite maximum length and can be stored
in-page, whereas MAX types are treated as BLOBS
and stored off-page, preventing online re-indexing.
Use MAX only when you need more than 8000 bytes
(4000 characters for NVARCHAR, 8000 characters for
VARCHAR).
Problems with Expressions

32. Injudicious use of the LTRIM and RTRIM functions
These don't work as they do in any other computer
language. They trim only the ASCII space rather than any
whitespace character. Use a scalar user-defined function
instead.
35. Relying on data being implicitly converted between types
Implicit conversions can have unexpected results, such
as truncating data or reducing performance. It is not
always clear in expressions how differences in data types
are going to be resolved. If data is implicitly converted
in a join operation, the database engine is more likely
to build a poor execution plan. More often than not,
you should explicitly define your conversions to avoid
unintentional consequences.
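A small illustration of how an unstated conversion changes a result; the explicit CAST documents the intent:

```sql
-- Integer division is a common surprise:
SELECT 1 / 2 AS implicit_result;                        -- 0
-- An explicit conversion states what you meant:
SELECT CAST(1 AS DECIMAL(9, 2)) / 2 AS explicit_result; -- 0.5
```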
36. Using the @@IDENTITY system function
The generation of an IDENTITY value is not
transactional, so in some circumstances, @@IDENTITY
returns the wrong value and not the value from the row
you just inserted. This is especially true when using
triggers that insert data, depending on when the triggers
fire. The SCOPE_IDENTITY function is safer because
it always relates to the current batch (within the same
scope). Also consider using the IDENT_CURRENT
function, which returns the last IDENTITY value
generated for a specific table, regardless of session or
scope. The OUTPUT clause is a
better and safer way of capturing identity values.
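A sketch of the OUTPUT approach (table and column names invented); it captures every generated value, even for multi-row inserts:

```sql
DECLARE @NewIDs TABLE (CustomerID INT);

INSERT INTO dbo.Customer (CustomerName)
OUTPUT inserted.CustomerID INTO @NewIDs -- capture as rows are inserted
VALUES (N'Alpha Ltd'), (N'Beta plc');

SELECT CustomerID FROM @NewIDs;
```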
37. Using BETWEEN for DATETIME ranges
You never get complete accuracy if you specify dates
when using the BETWEEN logical operator with
DATETIME values, due to the inclusion of both the
date and time values in the range. It is better to first
use a date function such as DATEPART to convert the
DATETIME value into the necessary granularity (such as
day, month, year, day of year) and store this in a column
(or columns) that can then be indexed and used as a filtering or
grouping value. This can be done by using a persisted
computed column to store the required date part as an
integer, or via a trigger.
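A sketch of the persisted-computed-column approach (names invented):

```sql
CREATE TABLE dbo.EventLog
  (
  EventID INT IDENTITY PRIMARY KEY,
  OccurredAt DATETIME2 NOT NULL,
  -- the date part, stored and indexable:
  OccurredOn AS CAST(OccurredAt AS DATE) PERSISTED
  );

CREATE INDEX IX_EventLog_OccurredOn ON dbo.EventLog (OccurredOn);

-- An exact day filter, with no BETWEEN edge cases:
SELECT EventID FROM dbo.EventLog WHERE OccurredOn = '2024-03-15';
```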
Difficulties with Query Syntax

39. Creating UberQueries (God-like Queries)
Always avoid overweight queries (e.g., a single query
with four inner joins, eight left joins, four derived tables,
ten subqueries, eight clustered GUIDs, two UDFs and
six case statements).
41. Joins between large views

Views are like tables in their behaviour, but they can't be
indexed to support joins. When large views participate
in joins, you never get good performance. Instead, either
create a view that joins the appropriately indexed base
tables, or create indexed temporary tables to contain the
filtered rows from the views you wish to join.
43. Using correlated subqueries instead of a JOIN
Correlated subqueries, queries that run against each
row returned by the main query, sometimes seem
an intuitive approach, but they are merely disguised
cursors needed only in exceptional circumstances.
Window functions will usually perform the same
operations much faster. Most usages of correlated
subqueries are accidental and can be replaced with a
much simpler and faster JOIN query.
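A sketch of a typical accidental correlated subquery and its JOIN equivalent (names invented):

```sql
-- A correlated subquery, evaluated per row of the outer query...
SELECT o.OrderID,
       (SELECT c.CustomerName
          FROM dbo.Customer AS c
         WHERE c.CustomerID = o.CustomerID) AS CustomerName
  FROM dbo.Orders AS o;

-- ...is usually just a disguised join:
SELECT o.OrderID, c.CustomerName
  FROM dbo.Orders AS o
  JOIN dbo.Customer AS c ON c.CustomerID = o.CustomerID;
```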
45. Using scalar user-defined functions (UDFs) for data lookups as a poor man's join
It is true that SQL Server provides a number of system
functions to simplify joins when accessing metadata,
but these are heavily optimized. Using user-defined
functions in the same way will lead to very slow queries
since they perform much like correlated subqueries.
48. Using full outer joins unnecessarily
It is rare to require both matched and unmatched rows
from the two joined tables, especially if you filter out
the unmatched rows in the WHERE clause. If what you
really need is an inner join, left outer join or right outer
join, then use one of those. If you want every combination
of rows from the two tables, use a cross join.
50. Mixing data types in joins or WHERE clauses
If you compare or join columns that have different data
types, you rely on implicit conversions, which result in
poor execution plans that use table scans. This approach
can also lead to errors because no constraints are in
place to ensure the data is the correct type.
51. Assuming that SELECT statements all have roughly the same execution time
Few programmers admit to this superstition, but it
is apparent by the strong preference for hugely long
SELECT statements (sometimes called UberQueries).
A simple SELECT statement runs in just a few
milliseconds. A process runs faster if the individual SQL
queries are clear enough to be easily processed by the
query optimizer. Otherwise, you will get a poor query
plan that performs slowly and won't scale.
53. Referencing an unindexed column within the IN predicate of a WHERE clause
A WHERE clause that references an unindexed column
in the IN predicate causes a table scan and is therefore
likely to run far more slowly than necessary.
55. Using a predicate or join column as a parameter for a user-defined function
The query optimizer will not be able to generate a
reasonable query plan if the columns in a predicate or
join are included as function parameters. The optimizer
needs to be able to make a reasonable estimate of the
number of rows in an operation in order to effectively
run a SQL statement and cannot do so when functions
are used on predicate or join columns.
58. Not using NOCOUNT ON in stored procedures and triggers
Unless you need to return messages that give you
the row count of each statement, you should specify
the NOCOUNT ON option to explicitly turn off this
feature. This option is not likely to be a significant
performance factor one way or the other.
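The usual pattern, sketched with an invented procedure:

```sql
CREATE PROCEDURE dbo.GetCustomer
  @CustomerID INT
AS
BEGIN
  SET NOCOUNT ON; -- suppress the 'n rows affected' messages
  SELECT CustomerID, CustomerName
    FROM dbo.Customer
   WHERE CustomerID = @CustomerID;
END;
```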
60. Defining foreign keys without a supporting index
Unlike some relational database management systems
(RDBMSs), SQL Server does not automatically index a
foreign key column, even though an index will likely be
needed. It is left to the implementers of the RDBMS as
to whether an index is automatically created to support
a foreign key constraint. SQL Server chooses not to do
so, probably because, if the referenced table is a lookup
table with just a few values, an index isn't useful. SQL
Server also does not mandate a NOT NULL constraint
on the foreign key, perhaps to allow rows that aren't
related to the referenced table.
Even if you're not joining the two tables via the primary
and foreign keys, with a table of any size an index
is usually necessary to check changes to PRIMARY
KEY constraints against referencing FOREIGN KEY
constraints in other tables, to verify that changes to the
primary key are reflected in the foreign key.
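A sketch of adding the index yourself (all names invented):

```sql
CREATE TABLE dbo.OrderLine
  (
  OrderLineID INT NOT NULL PRIMARY KEY,
  OrderID INT NOT NULL
    CONSTRAINT FK_OrderLine_Orders REFERENCES dbo.Orders (OrderID)
  );

-- SQL Server creates no index for the foreign key; add one to
-- support joins and the referential checks when rows in Orders
-- are updated or deleted:
CREATE INDEX IX_OrderLine_OrderID ON dbo.OrderLine (OrderID);
```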
61. Using a non-SARGable (Search ARGument-able) expression in a WHERE clause
In the WHERE clause of a query it is good to avoid
having a column reference or variable embedded within
an expression, or used as a parameter of a function.
A column reference or variable is best used as a single
element on one side of the comparison operator,
otherwise it will most probably trigger a table scan,
which is expensive in a table of any size.
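A sketch (names invented) of the same filter written both ways; only the second leaves the column bare on one side of the comparison, so only it can use an index seek on InvoiceDate:

```sql
-- Non-SARGable: the function wrapped around the column
-- forces a scan of every row.
SELECT InvoiceID FROM dbo.Invoice
 WHERE YEAR(InvoiceDate) = 2024;

-- SARGable: a half-open range the optimizer can seek.
SELECT InvoiceID FROM dbo.Invoice
 WHERE InvoiceDate >= '2024-01-01'
   AND InvoiceDate <  '2025-01-01';
```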
63. Using SELECT DISTINCT to mask a join problem
It is tempting to use SELECT DISTINCT to eliminate
duplicate rows in a join. However, it's much better to
determine why rows are being duplicated and fix the
problem.
Problems with Naming

67. Tibbling SQL Server objects with reverse-Hungarian prefixes such as tbl_, vw_, pk_, fn_, and usp_
SQL names don't need prefixes because there isn't any
ambiguity about what they refer to. Tibbling is a habit
that came from databases imported from Microsoft
Access.
69. Including special characters in object names
SQL Server supports special characters in object names
for backward compatibility with older databases such
as Microsoft Access, but using these characters in newly
created databases causes more problems than they're
worth. Special characters require brackets (or double
quotation marks) around the object name, make code difficult
to read, and make the object more difficult to reference.
In particular, avoid using any whitespace characters,
square brackets, or double or single quotation
marks as part of an object name.
71. Using square brackets unnecessarily for object names
If object names are valid and not reserved words,
there is no need to use square brackets. Using invalid
characters in object names is a code smell anyway, so
there is little point in using them. If you can't avoid
brackets, use them only for invalid names.
72. Using system-generated object names, particularly for constraints
This tends to happen with primary keys and foreign
keys if, in the data definition language (DDL), you don't
supply the constraint name. Auto-generated names
are difficult to type and easily confused, and they tend
to confuse SQL comparison tools. When installing
SharePoint via the GUI, the database names get GUID
suffixes, making them very difficult to deal with.
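A sketch of naming constraints in the DDL (all names invented):

```sql
CREATE TABLE dbo.OrderLine
  (
  OrderLineID INT NOT NULL
    CONSTRAINT PK_OrderLine PRIMARY KEY,
  OrderID INT NOT NULL
    CONSTRAINT FK_OrderLine_Orders REFERENCES dbo.Orders (OrderID),
  Quantity INT NOT NULL
    CONSTRAINT CK_OrderLine_Quantity CHECK (Quantity > 0)
  );
```

With explicit names, the same DDL produces the same objects in every environment, so scripts and comparison tools stay stable.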
Problems with Routines

75. Creating routines (especially stored procedures) as God Routines or UberProcs
Occasionally, long routines provide the most efficient
way to execute a process, but occasionally they just grow
like algae as functionality is added. They are difficult
to maintain and likely to be slow. Beware particularly
of those with several exit points and different types of
result set.
77. Too many parameters in stored procedures or functions
The general consensus is that a lot of parameters can
make a routine unwieldy and prone to errors. You can
use table-valued parameters (TVPs) or XML parameters
when it is essential to pass data structures or lists into a
routine.
78. Duplicated code

This is a generic code smell. If you discover an error
in code that has been duplicated, the error needs to be
fixed in several places.
79. High cyclomatic complexity

Sometimes it is important to have long procedures,
maybe with many code routes. However, if a high
proportion of your procedures or functions are
excessively complex, you'll likely have trouble
identifying the atomic processes within your
application. A high average cyclomatic complexity in
routines is a good sign of technical debt.
81. Unnecessarily using stored procedures or table-valued functions where a view is sufficient
Stored procedures are not designed for delivering
result sets. You can use stored procedures as such with
INSERT...EXEC, but you can't nest INSERT...EXEC, so
you'll soon run into problems. If you do not need to
provide input parameters, then use views; otherwise, use
table-valued functions.
82. Using Cursors

SQL Server originally supported cursors to more easily
port dBase II applications to SQL Server, but even then,
you can sometimes use a WHILE loop as an effective
substitute. However, modern versions of SQL Server
provide window functions and the CROSS/OUTER
APPLY syntax to cope with most of the traditional valid
uses of the cursor.
83. Overusing CLR routines

There are many valid uses of CLR routines, but they
are often suggested as a way to pass data between
stored procedures or to get rid of performance
problems. Because of the maintenance overhead, added
complexity, and deployment issues associated with CLR
routines, it is best to use them only after all SQL-based
solutions to a problem have been found wanting or
when you cannot use SQL to complete a task.
85. Relying on the INSERT...EXEC statement
In a stored procedure, you must use an INSERT...EXEC
statement to retrieve data via another stored
procedure and insert it into the table targeted by the
first procedure. However, you cannot nest this type
of statement. In addition, if the referenced stored
procedure changes, it can cause the first procedure to
generate an error.
87. Specifying parameters by order rather than assignment, where there are more than four parameters
When calling a stored procedure, it is generally better
to pass in parameters by assignment rather than just
relying on the order in which the parameters are defined
within the procedure. This makes the code easier to
understand and maintain. As with all rules, there are
exceptions. It doesn't really become a problem when
there are fewer than a handful of parameters. Also,
natively compiled procedures work fastest when passing in
parameters by order.
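A sketch of a call by assignment (procedure and parameter names invented); it survives parameter reordering and reads as documentation:

```sql
EXEC dbo.CreateInvoice
     @CustomerID = 42,
     @InvoiceDate = '2024-03-15',
     @Currency = 'GBP',
     @DueDays = 30,
     @SendEmail = 1;
```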
89. Creating a routine with ANSI_NULLS or QUOTED_IDENTIFIER options set to OFF
At the time the routine is created (parse time), both
options should normally be set to ON. They are ignored
on execution. The reason for keeping Quoted Identifiers
ON is that it is necessary when you are creating or
changing indexes on computed columns or indexed
views. If set to OFF, then CREATE, UPDATE, INSERT,
and DELETE statements on tables with indexes on
computed columns or indexed views will fail. SET
QUOTED_IDENTIFIER must be ON when you are
creating a filtered index or when you invoke XML data
type methods. ANSI_NULLS will eventually always be set to
ON, and this ISO-compliant treatment of NULLs will
not be switchable to OFF.
91. Overusing hints to force a particular behaviour in joins
Hints do not take into account the changing number
of rows in the tables or the changing distribution of
the data between the tables. The query optimizer is
generally smarter than you, and a lot more patient.
94. Using the NOLOCK hint

Avoid using the NOLOCK hint. It is much better and
safer to specify the correct isolation level instead. To
use NOLOCK, you would need to be very confident
that your code is safe from the possible problems that
the other isolation levels protect against. The NOLOCK
hint forces the query to use a read uncommitted
isolation level, which can result in dirty reads,
non-repeatable reads and phantom reads. In certain
circumstances, you can also sacrifice read consistency
and end up with missing rows or duplicate reads of
the same row.
96. Using SET ROWCOUNT to specify how many rows should be returned
We had to use this option until the TOP clause (with
ORDER BY) was implemented. The TOP option is much
easier for the query optimizer.
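The modern form, sketched against an invented table:

```sql
SELECT TOP (10) CustomerID, CustomerName
  FROM dbo.Customer
 ORDER BY CustomerName; -- TOP is deterministic only with ORDER BY
```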
98. Duplicating names of objects of different types
Although it is sometimes necessary to use the same
name for the same type of object in different schemas,
it is never necessary to do it for different object types
and it can be very confusing. You would never want a
SalesStaff table, a SalesStaff view, and a SalesStaff stored
procedure.
99. Using WHILE (not done) loops without an error exit
WHILE loops must always have an error exit. The
condition that you set in the WHILE statement may
remain true even if the loop is spinning on an error.
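One way to sketch the error exit (the iteration cap and all names are invented):

```sql
DECLARE @Done BIT = 0, @Tries INT = 0;

WHILE @Done = 0
BEGIN
  SET @Tries += 1;
  IF @Tries > 100
    BEGIN
      RAISERROR (N'Loop did not complete after 100 tries.', 16, 1);
      BREAK; -- error exit: stop spinning
    END;
  -- ... the unit of work goes here; set @Done = 1 on success ...
  SET @Done = 1;
END;
```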
101. Using TOP without ORDER BY
Using TOP without an ORDER BY clause in a SELECT
statement is meaningless and cannot be guaranteed to
give consistent results.
104. Using the GROUP BY ALL <column>, GROUP BY <number>, COMPUTE, or COMPUTE BY clause
The GROUP BY ALL <column> clause and the GROUP
BY <number> clause are deprecated. There are other
ways to perform these operations using the standard
GROUP BY syntax. The COMPUTE and COMPUTE BY
operations were devised for printed summary results.
The ROLLUP clause is a better alternative.
106. Using unnecessary three-part and four-part column references in a select list
Column references should be two-part names when
there is any chance of ambiguity due to column names
being duplicated in other tables. Three-part column
names might be necessary in a join if you have duplicate
table names, with duplicate column names, in different
schemas, in which case, you ought to be using aliases.
The same goes for cross-database joins.
108. Doing complex error-handling in a transaction before the ROLLBACK command
The database engine releases locks only when the
transaction is rolled back or committed. It is unwise to
delay this because other processes may be forced to wait.
Do any complex error handling after the ROLLBACK
command wherever possible.
110. Not defining a default value for a SET assignment that is the result of a query
If a variable's SET assignment is based on a query result
and the query returns no rows, the variable is set to
NULL. In this case, you should assign a default value to
the variable unless you want it to be NULL.
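A sketch (names invented) showing why a default assigned up front is not enough on its own:

```sql
DECLARE @Name NVARCHAR(100) = N'<unknown>'; -- default up front

SET @Name = (SELECT CustomerName
               FROM dbo.Customer
              WHERE CustomerID = -1);
-- If no row matches, @Name is now NULL: the query overwrote it.

SET @Name = COALESCE((SELECT CustomerName
                        FROM dbo.Customer
                       WHERE CustomerID = -1), N'<unknown>');
-- The fallback is explicit at the point of assignment.
```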
112. Using the NULLIF expression
The NULLIF expression compares two expressions
and returns the first one if the two are not equal. If the
expressions are equal then NULLIF returns a NULL
value of the data type of the first expression. NULLIF is
syntactic sugar: use the equivalent CASE expression instead, so
that ordinary folks can understand what you're trying to do.
The two are treated identically.
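The equivalence, with a common divide-by-zero guard as the example:

```sql
DECLARE @Numerator INT = 10, @Denominator INT = 0;

SELECT @Numerator / NULLIF(@Denominator, 0) AS with_nullif,
       @Numerator / CASE WHEN @Denominator = 0 THEN NULL
                         ELSE @Denominator END AS with_case;
-- Both columns are NULL, rather than an error, when @Denominator = 0.
```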
Security Loopholes

116. Using the xp_cmdshell system stored procedure
Use xp_cmdshell in a routine only as a last resort, due
to the elevated security permissions it requires and the
consequent security risk. The xp_cmdshell procedure
is best reserved for scheduled jobs where security can be
better managed.
119. Using dynamic SQL with the possibility of SQL injection
SQL injection can be used not only from an application
but also by a user who lacks, but wants, the permissions
necessary to perform a particular role, or who simply
wants to access sensitive data. If dynamic SQL is
executed within a stored procedure, under the
temporary EXECUTE AS permission of a user with
sufficient privileges to create users, it can be accessed
by a malicious user. Suitable precautions must be taken
to make this impossible. These precautions start with
giving EXECUTE AS permissions only to WITHOUT
LOGIN users with least-necessary permissions, and
using sp_executesql with parameters rather than EXECUTE.
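A sketch of parameterized dynamic SQL (table and column names invented); the user-supplied value travels as a typed parameter rather than being concatenated into the statement text:

```sql
DECLARE @City NVARCHAR(50) = N'Cambridge'; -- imagine user input

EXEC sys.sp_executesql
     N'SELECT CustomerID, CustomerName
         FROM dbo.Customer
        WHERE City = @City;',
     N'@City NVARCHAR(50)',
     @City = @City;
```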
Acknowledgements
For a booklet like this, it is best to go with the
established opinion of what constitutes a SQL
Code Smell. There is little room for creativity. In
order to identify only those SQL coding habits that
could, in some circumstances, lead to problems,
I must rely on the help of experts, and I am very
grateful for the help, support and writings of the
following people in particular.