CH 04
Data Selection Queries
Objectives
• Understand the origins of the SQL language.
• Learn how to formulate SELECT statements.
• Use WHERE clauses to filter result sets.
• Use ORDER BY clauses to sort result sets.
• Use GROUP BY clauses to aggregate result sets.
• Understand the principles of joining.
• Learn how to join tables using inner and outer joins.
• Learn how a self-join joins a table to itself.
The files associated with this chapter are located in the following folders:
{Install Folder}\Selecting
{Install Folder}\SelectingDataLab
Understanding Transact-SQL
Transact-SQL is based on the Structured Query Language (SQL) core, which
is used to select and manipulate data as well as define data structures and
implement security.
The original standard, SQL86, was simply IBM’s implementation of the query
language. In 1989, the SQL89 standard was the first enhancement of SQL that
reflected the influence of a variety of vendors. The SQL language was further
expanded in 1992 to the SQL92 standard and then to SQL99. Transact-SQL is
the SQL Server-specific implementation. SQL Server 2005 conforms to
SQL92 and implements several SQL99 enhancements, including common
table expressions and ranking functions.
In addition, Transact-SQL lets you work with variables, temporary objects, and
cursors. It also contains its own built-in system functions, which you can use
to aggregate data, to work with numbers, strings, and dates, and to retrieve
information from the system. Transact-SQL contains many features common
to all programming languages, such as branching, looping, and error handling.
-- USE Northwind;
SELECT CompanyName FROM dbo.Customers;
-- "SELECT CompanyName FROM Customers;" will also work.
-- USE AdventureWorks;
SELECT Name FROM Sales.Store;
-- "SELECT Name FROM Store;" will fail.
The easiest way to deal with schemas is to keep all database objects assigned
to dbo and to avoid creating or assigning any other schemas. However,
schemas can be a useful way of creating multiple namespaces in a database,
just as namespaces make it easier for .NET programmers to keep track of
classes. The AdventureWorks database provides a good example of using
schemas as namespaces.
Using this basic syntax, a SQL statement that retrieves the first name and last
name from the Employees table looks like this:
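A minimal sketch of such a statement, assuming the Northwind sample database, where the FirstName and LastName columns belong to dbo.Employees:

```sql
-- Column list after SELECT, source table after FROM.
SELECT FirstName, LastName
FROM dbo.Employees;
```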
TIP: The query processor ignores tabs, carriage returns, and extra spaces. In
addition, Transact-SQL statements do not have to be typed on a single line.
The statement:
SELECT
LastName
FROM
dbo.Employees;
is equivalent to:
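The single-line form of the same statement would presumably read:

```sql
SELECT LastName FROM dbo.Employees;
```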
Semicolon vs. GO
GO is not a Transact-SQL statement. It signals the end of a batch so that all of the
preceding statements execute together. A batch is not the same as a script, which can
contain multiple batches. GO is supported by Microsoft SQL Server Management
Studio (SSMS), the sqlcmd utility, and the osql utility. In SSMS, you can even set an
option to define your own batch terminator, rather than using GO.
The semicolon character is a statement terminator and is a part of the ANSI SQL92
standard, but was never widely used within Transact-SQL. Although it is not required,
it is considered good programming practice to use it at the end of each statement.
The main advantage of using a semicolon instead of GO is that a semicolon does not
end the batch, so local variables remain in scope. When GO terminates a batch, all
local variables declared in that batch are destroyed.
However, some situations require GO, such as when you use DDL to create objects
(views, procedures, and functions, for example) whose CREATE statement must be the
first statement in the batch. Statements that work with the new object must be placed
in a later batch, after a GO that follows the CREATE statement.
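The difference in variable scope can be sketched as follows. This is an illustration only; the variable name @City is hypothetical, and the script assumes a tool that recognizes GO, such as SSMS or sqlcmd, connected to Northwind:

```sql
-- Batch 1: semicolons end statements, so @City stays in scope.
DECLARE @City nvarchar(15);
SET @City = N'London';
SELECT LastName FROM dbo.Employees WHERE City = @City;
GO
-- Batch 2: @City was destroyed when GO ended the first batch.
-- Uncommenting the next line fails with "Must declare the scalar variable".
-- SELECT LastName FROM dbo.Employees WHERE City = @City;
```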
This SQL statement retrieves all fields for the table, as shown in Figure 2 (note
that not all the columns are visible in this figure).
TIP: It’s best to avoid using the asterisk in your SQL statements. You’ll almost
never need every column in a table, and every column you retrieve adds time
to your query, adds overhead to the server, and eats up more bandwidth on
the network. It’s simply good SQL writing practice to be explicit about which
columns you need.
Sometimes, however, you actually don’t know which columns may be added
later and you want to write a query that is sure to retrieve all the columns. In
such cases, the asterisk is appropriate.
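As a rough comparison of the two styles, assuming the Northwind dbo.Employees table:

```sql
-- Explicit column list: only the named columns are retrieved.
SELECT EmployeeID, LastName, FirstName
FROM dbo.Employees;

-- Asterisk: every column, including any added to the table later.
SELECT *
FROM dbo.Employees;
```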
Concatenating Columns
Not every field specified in a SELECT statement has to be a column in a
database. You can create your own columns using expressions. For example,
you can concatenate values from multiple columns to create a new column.
In the Employees table, the first name and last name of each employee are in two
separate fields. You can combine these fields by using the plus (+) operator,
which concatenates strings, and adding a comma and a space in the middle:
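A sketch of such a query, assuming the LastName and FirstName columns of Northwind's dbo.Employees table:

```sql
-- The + operator concatenates the three strings into one unnamed column.
SELECT LastName + ', ' + FirstName
FROM dbo.Employees;
```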
When SQL Server receives this query, it evaluates the expression for each row, and
the result set contains a single constructed column that contains both the last and
first names, separated by a comma and a space, as shown in
Figure 3.
Naming Columns
Notice in Figure 3 that the column has no name. Because the column is an
expression, rather than a column in the database, it has no name unless you
assign one explicitly. Here is how you can do that by using the AS keyword:
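A sketch of the aliased query, using the FullName alias that the text refers to and assuming the dbo.Employees table:

```sql
SELECT LastName + ', ' + FirstName AS FullName
FROM dbo.Employees;
```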
The query itself is essentially the same as the previous example; all that’s
changed is that the column now has a name, FullName, as shown in Figure 4.
This name is also called an alias. Aliases are used routinely in complex SQL
and are necessary for certain types of joins, like self-joins. To include a space
within an alias, surround the alias with square brackets:
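For illustration, with the hypothetical alias Full Name:

```sql
-- Square brackets delimit an alias that contains a space.
SELECT LastName + ', ' + FirstName AS [Full Name]
FROM dbo.Employees;
```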
However, the AS clause is optional—you can name a column just by using the
name separated from the column list by a space:
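A sketch of the same query without the AS keyword:

```sql
-- FullName is still an alias; only the keyword AS is omitted.
SELECT LastName + ', ' + FirstName FullName
FROM dbo.Employees;
```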
This query functions in the same way as the query with the AS clause, but
makes your SQL statement harder to read. Is FullName a column in the
database, or an alias? It is better to be explicit with the AS clause. The AS
clause is the most explicit way to alias a column and is supported by the ANSI
standard.
Another supported option is to use an equal sign for defining a column alias:
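A sketch of the equal-sign form, in which the alias is written as an identifier (not as a string) on the left side:

```sql
SELECT FullName = LastName + ', ' + FirstName
FROM dbo.Employees;
```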
Deprecated Syntax
One method of defining column aliases using the equal sign (=) has been deprecated,
which means it will not be supported in future versions of SQL Server. This method
uses a string expression to define the alias:
-- Deprecated Syntax
SELECT 'FullName' = LastName + ', ' + FirstName
FROM dbo.Employees
You can, however, still use a string expression to define the column alias if you use the
AS syntax (even if you leave out the word AS).
However, this would return one row for every employee. You only want a list
of each distinct title; that is, you want each title listed once in the result set.
The DISTINCT keyword limits a query to unique rows only:
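A minimal sketch, assuming the Title column of Northwind's dbo.Employees table:

```sql
-- One row per distinct title instead of one row per employee.
SELECT DISTINCT Title
FROM dbo.Employees;
```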
Now instead of having a row for every employee, you have one row for each
unique title, as shown in Figure 5.
You use the WHERE clause to specify the search conditions that SQL Server
should use to identify rows that should or shouldn’t be included in the result
set. The simplest WHERE clauses check for equality. For example, if you want
a list of all customers in the city of Paris, write the following query:
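A sketch of such a query, assuming the CompanyName and City columns of Northwind's dbo.Customers table:

```sql
SELECT CompanyName, City
FROM dbo.Customers
WHERE City = 'Paris';
```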
Figure 6. Using the WHERE clause to retrieve only customers from Paris
Operator                     Description
=                            Equal to
<>                           Not equal to
>                            Greater than
<                            Less than
>=                           Greater than or equal to
<=                           Less than or equal to
!=                           Not equal to
!<                           Not less than
!>                           Not greater than
[NOT] LIKE                   Pattern matching a string with wildcards
BETWEEN expr1 AND expr2      An inclusive range of values
IS [NOT] NULL                Check for null value
[NOT] IN (val1, val2…)       List matching
  -or- [NOT] IN (subquery)
ANY (SOME)                   Tests whether one or more rows in the result set of
                             a subquery meet the specified condition
ALL                          Tests whether all rows in the result set of a
                             subquery meet the specified condition
[NOT] EXISTS                 Tests whether a subquery returns any results
Table 1. Transact-SQL comparison operators.
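Wildcard     Meaning
%            Any string of zero or more characters
_            Any single character
[ ]          Any single character within the specified list or range
[^]          Any single character not within the specified list or range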
Write the following query to show a list of customers whose names start with
“S”:
SELECT CompanyName
FROM dbo.Customers
WHERE CompanyName LIKE 'S%';
The following query displays customers whose names end with “S”:
SELECT CompanyName
FROM dbo.Customers
WHERE CompanyName LIKE '%S';
And this one will display customers who have an “S” anywhere in their names:
SELECT CompanyName
FROM dbo.Customers
WHERE CompanyName LIKE '%S%';
The most common wildcard characters for the LIKE operator are the percent symbol
(%) and the underscore (_). Their behavior corresponds to the asterisk (*) and
question mark (?) in DOS search strings. So the expression ‘S%’ means an ‘S’
character followed by zero or more other characters, and the expression ‘S_’ means
an ‘S’ character followed by exactly one other character.
Use the underscore to write a query that shows all of the CustomerID values
that start with “B”, end with “P”, and have any combination of characters in
the middle three slots (indicated by three underscore characters):
SELECT CustomerID
FROM dbo.Customers
WHERE CustomerID LIKE 'B___P';
Use square brackets to delimit a list of values to match. The following query
will return any CustomerID values that begin with “FRAN” and end with
either “R” or “K”:
SELECT CustomerID
FROM dbo.Customers
WHERE CustomerID LIKE 'FRAN[RK]';
You can also use square brackets to specify a consecutive range of characters
to match:
SELECT CustomerID
FROM dbo.Customers
WHERE CustomerID LIKE 'FRAN[A-S]';
Use the caret character inside square brackets to indicate negation. For
example, the following query will return CustomerID values that begin with
“FRAN” and end with any character except “R”:
SELECT CustomerID
FROM dbo.Customers
WHERE CustomerID LIKE 'FRAN[^R]';
The result set shown in Figure 9 includes all customers who have a postal code
between 98103 and 98999, inclusively.
TIP: The BETWEEN operator can be used to express date ranges as well as string
or numeric ranges.
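Two sketches of BETWEEN, assuming Northwind's dbo.Customers and dbo.Orders tables. The postal code range is the one described above; the date range is purely illustrative:

```sql
-- PostalCode is a character column, so the range is compared as strings.
SELECT CompanyName, PostalCode
FROM dbo.Customers
WHERE PostalCode BETWEEN '98103' AND '98999';

-- A date range; both end points are included.
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderDate BETWEEN '19960901' AND '19960910';
```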
A database null value is not the same as a null in a programming language. Null in
SQL Server means an unknown value, and is not equivalent to zero. The possible
values of a SQL expression can be TRUE, FALSE, and NULL (UNKNOWN),
whereas in a programming language they are simply true or false. In SQL, any
comparison or operation involving a null value results in null. For example, x + null =
null, or to rephrase it, x + unknown = unknown. The reason is that an unknown value
cannot be compared logically against any other value. This occurs if an expression is
compared to the literal NULL, or if two expressions are compared and one of them
evaluates to NULL.
When testing for nulls, always use IS NULL or IS NOT NULL. An attempt to test for
nulls using the equality operator (=) may or may not work, depending on the
ANSI_NULLS setting. When SET ANSI_NULLS is ON, a comparison in which
one or more of the expressions is NULL does not yield either TRUE or FALSE; it
yields UNKNOWN (NULL), so the row is not returned. Note that the session-level
setting overrides the default database setting, and the default for ODBC and OLE DB
connections is SET ANSI_NULLS ON.
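The difference can be sketched with the nullable Region column of Northwind's dbo.Employees table (an assumption about the sample data):

```sql
-- Reliable: returns the employees whose Region is unknown.
SELECT LastName, Region
FROM dbo.Employees
WHERE Region IS NULL;

-- With SET ANSI_NULLS ON, this comparison is UNKNOWN for every row,
-- so the query returns no rows at all.
SELECT LastName, Region
FROM dbo.Employees
WHERE Region = NULL;
```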
The AND operator requires both expressions to be true for the condition to be
true:
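A sketch of an AND condition, assuming the City and PostalCode columns of dbo.Employees and the Seattle example described here:

```sql
-- Both conditions must be true for a row to be returned.
SELECT LastName, City, PostalCode
FROM dbo.Employees
WHERE City = 'Seattle' AND PostalCode LIKE '9%';
```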
The results display all employees who live in Seattle and whose ZIP code
starts with “9”, as shown in Figure 11.
The OR operator requires only one expression to be true, although both can be
true. The preceding query can be modified to use the OR operator:
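The OR variant of the same sketch, again assuming dbo.Employees:

```sql
-- A row is returned when either condition (or both) is true.
SELECT LastName, City, PostalCode
FROM dbo.Employees
WHERE City = 'Seattle' OR PostalCode LIKE '9%';
```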
This query shows all of the employees in Seattle in addition to all employees
who have a ZIP code starting with 9. Figure 12 shows the result set.
The NOT operator negates a single expression, which can make a condition more
difficult to read. In essence, it reverses the result of the expression that
follows it. The following query returns the cities and postal codes of all
employees except those who live in Seattle:
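A sketch of such a query, assuming the dbo.Employees table:

```sql
-- NOT reverses the equality test, so Seattle rows are excluded.
SELECT City, PostalCode
FROM dbo.Employees
WHERE NOT City = 'Seattle';
```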
Figure 13 shows the result set of all employees who don’t live in Seattle.
Operator Precedence
When you combine operators in expressions, the precedence of the operator
determines how the conditions are evaluated. The order of precedence is:
1. NOT
2. AND
3. OR
For example, the following query will eliminate any employees from Seattle,
regardless of their names, and then filter the LastName on the remaining rows:
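A sketch of how precedence plays out, assuming dbo.Employees; the LastName pattern 'D%' is a hypothetical filter chosen for illustration:

```sql
-- NOT is applied first, then AND:
-- (NOT City = 'Seattle') AND (LastName LIKE 'D%')
SELECT LastName, City
FROM dbo.Employees
WHERE NOT City = 'Seattle' AND LastName LIKE 'D%';

-- Parentheses override precedence: NOT now negates the combined condition.
SELECT LastName, City
FROM dbo.Employees
WHERE NOT (City = 'Seattle' AND LastName LIKE 'D%');
```

When a condition mixes operators, adding parentheses even where they are not required tends to make the intent easier to see.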
However, a more concise and readable way is to use the IN operator. The IN
operator compares a field against a list of values and matches when the field
equals any of them. The following query returns the same list of all customers
in France and Spain:
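A sketch of the IN form, assuming the Country column of dbo.Customers:

```sql
-- Equivalent to: WHERE Country = 'France' OR Country = 'Spain'
SELECT CompanyName, Country
FROM dbo.Customers
WHERE Country IN ('France', 'Spain');
```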
Figure 15. Using the IN operator to include customers from multiple countries.
TIP: You can also use the IN operator with subqueries. For example, to find all
customers who have not placed orders, you could use this:
SELECT CustomerID
FROM dbo.Customers
WHERE CustomerID NOT IN(SELECT CustomerID
FROM dbo.Orders);
Later in this chapter you’ll see how to create this query by using an outer join.
The default sort order for an ORDER BY clause is in ascending order: 0-9,
A-Z. The results of sorting on the City name are shown in Figure 16.
To sort in descending order, you add the DESC keyword after the column:
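A sketch that sorts on two columns, assuming dbo.Employees; DESC applies only to the column it follows:

```sql
-- Cities in descending order, last names ascending within each city.
SELECT City, LastName
FROM dbo.Employees
ORDER BY City DESC, LastName;
```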
The result set in Figure 17 displays the cities sorted in descending order first,
and then by last name in ascending order where the cities are identical.
SELECT LastName
FROM dbo.Employees
ORDER BY LEN(LastName);
Figure 18 shows that by using the LEN function in an expression, you can sort
employees by the length of their last names.
Figure 18. Using an expression to sort by the length of the last name.
Aggregate Functions
Table 3 lists some of the most commonly used aggregate functions. Of the five
functions, COUNT is used most often, providing the aggregate count of the
number of rows in the query. The other aggregate functions are intended for
use in summarizing numeric values.
Function Description
TIP: You can view a complete list of aggregate functions by searching in SQL
Server Books Online for Aggregate Functions.
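As an illustration, the following sketch assumes that the five functions in question are COUNT, SUM, AVG, MIN, and MAX, and applies them to Northwind's dbo.Products table:

```sql
SELECT COUNT(*)          AS NumProducts,
       SUM(UnitsInStock) AS TotalUnits,
       AVG(UnitPrice)    AS AvgPrice,
       MIN(UnitPrice)    AS LowestPrice,
       MAX(UnitPrice)    AS HighestPrice
FROM dbo.Products;
```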
Counting Rows
Use the COUNT function to return the number of rows in a table, as shown in
the following query, which uses the asterisk (*) to count all of the rows:
SELECT COUNT(*)
FROM dbo.Employees;
TIP: Your aggregate queries will execute faster if you use the asterisk in the
COUNT function instead of a column name. The asterisk instructs SQL
Server to count only the number of rows. Using a column name will force
SQL Server to retrieve every value in the column and check for nulls, which
aren’t included in the count. If all you’re doing is counting the rows, the
asterisk is more efficient.
Counting Columns
When you specify a column name in the COUNT function, any null values in
the column are excluded from the count. The following query counts the total
number of rows in the Employees table, and then counts the non-null values in
the Region column:
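A sketch of such a query; the aliases NumEmployees and NumRegions are illustrative names:

```sql
-- COUNT(*) counts rows; COUNT(Region) skips rows where Region is NULL.
SELECT COUNT(*)      AS NumEmployees,
       COUNT(Region) AS NumRegions
FROM dbo.Employees;
```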
Because there are null Region values, the result set in Figure 19 shows a
discrepancy between the number of employees and the number of regions.
Figure 19. Null values are not included in COUNT aggregate functions.
While this query returned the number of employees in Seattle, by using the
GROUP BY clause you can return a list of every city and the number of
employees in each city.
Using GROUP BY
Use a GROUP BY clause to return a list of every city where employees live
and the number of employees in each city. The GROUP BY clause combines
aggregate columns and regular columns in the same query. The queries in the
previous examples included only aggregate columns, so only aggregates were
returned. You might think that simply listing the City and the COUNT of rows
might work, as in the following query, but you would be wrong:
Unfortunately, this query generates the following error, which tells you exactly
what’s going on:
Add a GROUP BY clause to fix the query. The GROUP BY clause must
include all of the non-aggregate columns in the select list, in this case the City:
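The failing and the corrected forms can be sketched as follows, assuming dbo.Employees; the failing query is commented out so the script runs:

```sql
-- Fails: City is neither aggregated nor listed in a GROUP BY clause.
-- SELECT City, COUNT(*) AS NumEmployees
-- FROM dbo.Employees;

-- Works: every non-aggregate column in the select list is grouped.
SELECT City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City;
```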
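The result set in Figure 21 lists all the cities in the Employees table, plus the
count of the employees from those cities. Note that the rows appear in
alphabetical order by City, which is the column used for the grouping. That order
is a side effect of how the groups are built and is not guaranteed; add an
ORDER BY clause if you depend on it.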
Figure 21. A basic GROUP BY showing cities and the number of employees in
each city.
Notice that aggregate functions can be used in the ORDER BY clause, and
using the DESC keyword shows the cities with the most employees first (see
Figure 22). Adding City to the ORDER BY clause puts all cities with the same
number of employees in alphabetical order.
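A sketch of a query that sorts this way, assuming dbo.Employees:

```sql
-- Largest groups first; cities with equal counts sort alphabetically.
SELECT City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City
ORDER BY COUNT(*) DESC, City;
```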
Figure 23. Filtering the result set using the HAVING clause.
WARNING! Note that the last query used the alias NumEmployees in the
ORDER BY clause, but not in the HAVING clause. Even when you
define an alias for an aggregate expression, you must use the full
aggregate expression in the HAVING clause. Using the alias there
will cause an error.
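The rule in this warning can be sketched as follows; the threshold of more than one employee is a hypothetical filter:

```sql
SELECT City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City
HAVING COUNT(*) > 1          -- the full aggregate expression is required here
ORDER BY NumEmployees DESC;  -- the alias is allowed here
```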
SELECT TOP 3
City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City
ORDER BY COUNT(*) DESC;
Figure 24 shows the result set from the query. A key element is the ORDER
BY clause, which sorts the result set in descending order before the TOP
clause is applied. A sort in ascending order would instead return the three
cities with the fewest employees.
Figure 24. Using TOP to show the top three cities with the most employees.
However, one problem with a TOP query is the matter of ties. Aren’t there
many cities with only one employee? If you want to see the ties for last place,
you need to use the WITH TIES clause:
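A sketch of the TOP 3 query with ties included. WITH TIES requires an ORDER BY clause, because the ties are defined by the sort columns:

```sql
-- Returns the top three rows plus any rows that tie with the third.
SELECT TOP 3 WITH TIES
    City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City
ORDER BY COUNT(*) DESC;
```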
Figure 25 shows the result set. All five cities that tied for last place are
returned. For the data in this table, only a TOP 2 query will return unique
values.
Figure 25. The top three cities WITH TIES shows all of the ties for last place.
TOP also enables you to specify a percent value rather than an absolute
number. Here’s an example of using TOP with PERCENT:
TIP: SQL Server 2005 added several enhancements to TOP. You can now use any
numeric expression, even a variable, to specify the number; you can also now
use TOP in INSERT, UPDATE, and DELETE statements.
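Both forms of TOP mentioned here can be sketched as follows. The 25 percent figure and the variable name @HowMany are illustrative, and the second query needs SQL Server 2005 or later:

```sql
-- PERCENT: the number of rows returned is a share of the result set.
SELECT TOP 25 PERCENT
    City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City
ORDER BY COUNT(*) DESC;

-- Variable: the expression must be enclosed in parentheses.
DECLARE @HowMany int;
SET @HowMany = 2;
SELECT TOP (@HowMany)
    City, COUNT(*) AS NumEmployees
FROM dbo.Employees
GROUP BY City
ORDER BY COUNT(*) DESC;
```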
Joining Tables
One of the fundamental concepts of relational databases is tables, or sets, of
data. Different data elements are grouped into separate tables. Data about
employees is in the dbo.Employees table, data about orders is in the
dbo.Orders table, and so forth. The process of organizing the various elements
of data into tables is called normalization. The key measure of successful
normalization is the ability to join tables effectively.
The result of this query is a Cartesian product (named after the French
mathematician René Descartes who developed the concept). All possible
combinations of ProductName and CategoryName are returned. Consider what
would happen if each table had thousands of rows. A multimillion-row result
set is not something you want your SQL Server to process, much less to send
down your network connection and into your workstation.
Not every Cartesian product is bad. It’s only when they are created
unintentionally that Cartesian products cause problems with a database.
Cartesian products are often used to generate large amounts of data rapidly,
which is useful for generating sample data, for instance. Certain scientific and
mathematical tasks require the creation of sets that combine every element of
one set with another set, which is exactly what a Cartesian product is.
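When a Cartesian product is intended, the CROSS JOIN keyword makes that explicit. A sketch, assuming Northwind's dbo.Products and dbo.Categories tables:

```sql
-- One row for every possible product/category pair.
SELECT ProductName, CategoryName
FROM dbo.Products CROSS JOIN dbo.Categories;
```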
The Products table also has a primary key, the ProductID. But the connection
to the Categories table is the CategoryID, which is a foreign key in the Products
table.
By adding a WHERE clause, you make an intelligent join between the two
tables, as shown here where the CategoryID in the Products table is joined to
the CategoryID in the Categories table:
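A sketch of the WHERE-clause join, assuming the dbo.Products and dbo.Categories tables:

```sql
-- The WHERE clause keeps only the pairs whose CategoryID values match.
SELECT dbo.Products.ProductName,
       dbo.Categories.CategoryName
FROM dbo.Products, dbo.Categories
WHERE dbo.Products.CategoryID = dbo.Categories.CategoryID;
```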
Figure 26 shows the results of this query. Each product is listed once, with its
appropriate category.
Figure 26. The result set from a join based on the CategoryID of both tables.
TIP: SQL Server performs best when indexed columns are used for joins. You can
write join queries using any column in a table that matches another column in
another table. You can even build matching expressions for joins. But the
more complex the joining criteria, the slower the join will execute. The long
integers used in identity field primary keys are the perfect join column.
Join Notation
As the SQL standard has evolved, so has the join notation. The earliest join
notation uses the WHERE clause to enforce the joining criteria, as shown in
the previous example.
The problem with this notation is that it overloads the WHERE clause; in
essence, the WHERE clause is pulling double duty by restricting the rows of
the result set based on search criteria and enforcing joins between different
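tables. This causes a variety of problems.
The newer SQL92 notation moves the joining criteria out of the WHERE clause and
into the FROM clause by using the JOIN and ON keywords: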
SELECT dbo.Products.ProductName,
dbo.Categories.CategoryName
FROM dbo.Products JOIN dbo.Categories
ON dbo.Products.CategoryID = dbo.Categories.CategoryID;
The results of this query are identical to the previous query; only the notation
changed. The join notation puts joining criteria into the FROM clause,
independent of the WHERE clause. This makes the query easier to read, and
SQL Server can decipher and execute it more efficiently (SQL Server
translates WHERE-clause joins into FROM-clause joins).
Whenever possible, use the JOIN condition notation instead of the WHERE
notation, but at the same time, be aware that the older style is still used.
TIP: Another notation convention for joining is the use of table names with their
field names, separated by a dot. Table names are required only when a field is
referenced that exists in both tables, such as the CategoryID field. Using table
names for all field names not only looks better, but it makes the query self-
documenting and easier to read. Anyone who reads the query (including you a
day later) can quickly determine which table each field comes from without
having to look it up elsewhere. To reduce the amount of typing, short aliases
are often used for the table names:
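A sketch with table aliases; the alias names p and c are illustrative:

```sql
-- Every column is qualified, but with a short alias instead of the full name.
SELECT p.ProductName, c.CategoryName
FROM dbo.Products AS p JOIN dbo.Categories AS c
    ON p.CategoryID = c.CategoryID;
```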
Inner Joins
Inner joins are the original joins developed in the SQL language, and are
certainly the most common. The joins you’ve already seen in this chapter are
all inner joins. The principle behind the inner join is that the result set includes
only rows that have matching joining criteria.
The word INNER is always optional, because the inner join is the default. Here
is an example of adding the optional INNER keyword:
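A sketch of the Products and Categories join with the keyword spelled out:

```sql
-- INNER JOIN and JOIN mean the same thing.
SELECT dbo.Products.ProductName,
       dbo.Categories.CategoryName
FROM dbo.Products INNER JOIN dbo.Categories
    ON dbo.Products.CategoryID = dbo.Categories.CategoryID;
```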
In this query all the data in both tables is included in the query, because all
products have categories assigned to them. However, it is important to
remember that data can be excluded from an inner join. CategoryID is allowed
to be NULL in the Northwind database, and the inner join wouldn’t return any
products with a NULL CategoryID.
SELECT dbo.Products.ProductName,
dbo.Categories.CategoryName,
dbo.Products.UnitPrice
FROM dbo.Products INNER JOIN dbo.Categories
ON dbo.Products.CategoryID = dbo.Categories.CategoryID
WHERE UnitPrice > 50
ORDER BY ProductName;
The results of this query, shown in Figure 27, show a list of all products, sorted
alphabetically, including the category and price, where the price is greater than
fifty dollars. The only role of the join in this query is to bring the category
name into the result set. The WHERE clause limited the products to those
above fifty dollars, and the ORDER BY clause sorted the products
alphabetically.
Figure 27. A simple join brings the product data together with the category name.
The first few rows of the result set are shown in Figure 28, displaying the order
number and date from the Orders table, as well as the company name from the
Customers table and employee name from the Employees table.
Figure 28. Joining multiple tables yields a list of orders that includes customer and
employee names.
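A query that produces a result like the one described might look like the following sketch. It assumes the Northwind Orders, Customers, and Employees tables; the aliases o, c, and e are illustrative:

```sql
-- Orders links to Customers by CustomerID and to Employees by EmployeeID.
SELECT o.OrderID, o.OrderDate, c.CompanyName, e.LastName
FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
        ON o.CustomerID = c.CustomerID
    INNER JOIN dbo.Employees AS e
        ON o.EmployeeID = e.EmployeeID
ORDER BY o.OrderID;
```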
A typical group join, which also demonstrates a sequential join, determines the
total sales per customer. Three tables are involved: Customers, Orders, and
Order Details.
This query selects the CompanyName, joins the Customers table to Orders,
and then to Order Details in order to derive a total for each customer:
SELECT CompanyName,
SUM([Order Details].UnitPrice * [Order Details].Quantity)
AS TotalSold
FROM dbo.Customers INNER JOIN dbo.Orders
ON Customers.CustomerID = Orders.CustomerID
INNER JOIN [Order Details]
ON Orders.OrderID = [Order Details].OrderID
WHERE Orders.OrderDate BETWEEN '9/1/1996' AND '9/10/1996'
GROUP BY CompanyName
ORDER BY TotalSold DESC;
Figure 29. A grouped join showing the total sales for each customer.
TIP: You may have noticed the square brackets around the name of the Order
Details table. Those brackets are required because the table name includes a
space character. Brackets are required if an object name includes a character
that isn’t otherwise allowed. Another example would be names that include
hyphens. In general, it is best to avoid using names that require square
brackets.
Outer Joins
There are three types of outer joins: left, right, and full. A left join includes all
rows of the first table, a right join includes all rows of the second table, and a
full join includes all rows of both tables. The challenge of using outer joins is
knowing when they are appropriate.
Write the following query using an inner join to generate a list of all customers
and the dates of their first order:
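A sketch of the inner join version; the alias FirstOrder is an illustrative name:

```sql
SELECT CompanyName, MIN(OrderDate) AS FirstOrder
FROM dbo.Customers INNER JOIN dbo.Orders
    ON Customers.CustomerID = Orders.CustomerID
GROUP BY CompanyName;
```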
The Customers table is joined to the Orders table and grouped by company
name. The MIN aggregate function returns the lowest order date for each
customer. The first few rows of the result set shown in Figure 30 display each
customer and the date of their first order.
Figure 30. Using an inner join to return the first order of each customer.
But check the record counts—the list may not include every customer. The
behavior of the inner join will exclude any customers who have not placed any
orders. If you need to see a list of all customers regardless of whether they’ve
placed an order, as well as the date of the first order for customers who have,
you need to use an outer join.
To change the previous inner join query, replace the inner join with a left join:
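A sketch of the left join version; customers without orders are returned with a NULL first order date:

```sql
SELECT CompanyName, MIN(OrderDate) AS FirstOrder
FROM dbo.Customers LEFT JOIN dbo.Orders
    ON Customers.CustomerID = Orders.CustomerID
GROUP BY CompanyName;
```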
Figure 31 shows one of the two Northwind customers that were excluded
before but show up in these query results. This customer and Paris spécialités
have no orders in the Northwind database.
Figure 31. The result set of the left join, showing null values.
This is a key purpose of outer joins: Making certain that a given result set
includes all rows.
This example also points out how you can use outer joins to find unmatched
values. To find all customers who don’t have any orders, you can use the
following outer join, which relies on the fact that the OrderID column will
contain a NULL value for those customers. Figure 32 shows the result set.
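A sketch of such a query, assuming the Customers and Orders tables:

```sql
-- The left join keeps every customer; the WHERE clause then keeps
-- only those for which no matching order was found.
SELECT Customers.CompanyName
FROM dbo.Customers LEFT JOIN dbo.Orders
    ON Customers.CustomerID = Orders.CustomerID
WHERE Orders.OrderID IS NULL;
```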
It is important to note that the only difference between a right outer join and a
left outer join is the order in which the tables are listed in the FROM clause.
The following query uses a right outer join to return customers without orders:
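A sketch of the right join form, with the tables listed in the opposite order:

```sql
-- Customers is now the second table, so RIGHT JOIN preserves its rows.
SELECT Customers.CompanyName
FROM dbo.Orders RIGHT JOIN dbo.Customers
    ON Orders.CustomerID = Customers.CustomerID
WHERE Orders.OrderID IS NULL;
```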
A Cartesian product returns all the rows of the first table for each of the rows
of the second table. A full outer join is different: it returns all rows that meet
the join criteria (the inner join), as well as rows in the first table that did not
meet the criteria (the left join) and rows in the second table that did not meet
the criteria (the right join). For a join between a primary key and its foreign
key, the number of rows returned will be no more than the sum of the number of
rows in each table, not the product of those numbers. If all the rows have matches
in the two tables, the number of rows will be equal to the number in the larger of
the tables.
The following query will show you all products, including any without
categories, and all categories, including any without products:
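A sketch of such a query, assuming the Products and Categories tables:

```sql
-- Unmatched rows from either side appear with NULL in the other side's column.
SELECT Products.ProductName, Categories.CategoryName
FROM dbo.Products FULL JOIN dbo.Categories
    ON Products.CategoryID = Categories.CategoryID;
```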
If you want to see all of the customers, whether they have orders or not, and
the total sales for customers who do have orders, you could try using a left join
on the Customers-Orders join, and an inner join on the Orders-OrderDetails
join. On the surface, this seems to make sense because you want all of the
customers regardless of whether they have orders, and you know that every
Order Detail will have an OrderID that matches one in the Orders table:
SELECT CompanyName,
SUM([Order Details].UnitPrice * [Order Details].Quantity)
AS TotalSold
FROM dbo.Customers LEFT JOIN dbo.Orders
ON Customers.CustomerID = Orders.CustomerID
INNER JOIN dbo.[Order Details]
ON Orders.OrderID = [Order Details].OrderID
GROUP BY CompanyName
ORDER BY CompanyName;
However, the results of the query don’t show the customers that you know
haven’t placed an order yet.
The problem is that the left join precedes the inner join in the query. The inner
join, which is evaluated second, excludes all the null rows generated by the left
join; you might as well have used two inner joins. You need to revise the
query so that the inner join comes first. Join the Orders to Order Details first
and then use a right join on Customers for the correct result:
SELECT CompanyName,
SUM([Order Details].UnitPrice *
[Order Details].Quantity) AS TotalSold
FROM dbo.Orders INNER JOIN dbo.[Order Details]
ON Orders.OrderID = [Order Details].OrderID
RIGHT JOIN dbo.Customers
ON Orders.CustomerID = Customers.CustomerID
GROUP BY CompanyName
ORDER BY CompanyName;
Scrolling through the results, you can see that the customers without orders are
now showing up as expected.
Even though you know that every Order has at least one matching Order
Detail, and that you shouldn’t ordinarily need an outer join between Orders
and Order Details, using the outer join here ensures that results from the first
join that have a NULL OrderID won’t be excluded.
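An alternative that keeps the tables in their original order is to use an outer join for both joins. This is a sketch of that approach rather than a listing from the chapter:

```sql
-- The second LEFT JOIN preserves the customer rows whose OrderID is NULL.
SELECT CompanyName,
    SUM([Order Details].UnitPrice * [Order Details].Quantity) AS TotalSold
FROM dbo.Customers LEFT JOIN dbo.Orders
    ON Customers.CustomerID = Orders.CustomerID
LEFT JOIN dbo.[Order Details]
    ON Orders.OrderID = [Order Details].OrderID
GROUP BY CompanyName
ORDER BY CompanyName;
```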
TIP: When it comes to performance, inner joins win hands down. Use outer joins
sparingly. An inner join can exploit indexes to improve performance; an outer
join almost always requires a table scan, where SQL Server reads every row
in the table. However, if you use a WHERE clause that restricts the rows
returned in the outer join, SQL Server may elect to use an index on the outer
join table, speeding up data retrieval.
Self Joins
A self join is counter-intuitive: why join a table to itself? But self joins are the
best solution to certain kinds of problems.
An example of an appropriate use for a self join is to track the manager each
employee reports to. In the Employees table in the Northwind database, the
ReportsTo column contains the EmployeeID of the person to whom each
employee reports. Employees who don't report to anyone (presumably the top
people) have a null value in that field.
In a self join, both sides of the join are the same table, but each side is
referenced separately, using an alias to distinguish them. An alias allows you
to assign a custom name to one or both of the instances of the table, which
allows the query processor to treat them as separate tables that happen to
contain the same data. In this query, the first reference to the Employees table
uses the default table name. The second reference to the Employees table uses
the alias of Managers. The Employees table is joined to the virtual Managers
table on ReportsTo in Employees and EmployeeID in Managers:
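A sketch of the self join, using the Managers alias named above; the column aliases Employee and Manager are illustrative:

```sql
-- Each employee row is matched to the row of the person it reports to.
SELECT Employees.LastName AS Employee,
       Managers.LastName AS Manager
FROM dbo.Employees INNER JOIN dbo.Employees AS Managers
    ON Employees.ReportsTo = Managers.EmployeeID;
```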
The result set shown in Figure 33 displays only employees with managers.
Figure 33. A self join query using an inner join returns only rows with data.
To return all employees, use a left join between Employees and the aliased table
Managers instead of an inner join:
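A sketch of the left join form, assuming the same Managers alias for the second instance of the table:

```sql
-- Employees with a NULL ReportsTo value are kept, with a NULL Manager.
SELECT Employees.LastName AS Employee,
       Managers.LastName AS Manager
FROM dbo.Employees LEFT JOIN dbo.Employees AS Managers
    ON Employees.ReportsTo = Managers.EmployeeID;
```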
The result, shown in Figure 34, lists all employees, including the big boss.
Figure 34. A self join query using a left join returns all rows.
TIP: Self joins are useful when the data you want to join exists in two different
columns in a single table. You can also use self joins to join each row in a
table to the row that comes before or after it in a particular order. For
example, if your rows all have consecutive dates, you could join the date
column to an expression that adds one day to that date. This would allow you
to use expressions that compare or perform calculations on the values in
adjoining rows.
Summary
• SQL is a comprehensive querying language for creating database
objects and for retrieving and modifying data.
• The basic SELECT statement includes a list of columns and a table to
get the columns from.
• You can restrict the rows that the query returns by using the WHERE
clause.
• You can order the rows that the query returns using the ORDER BY
clause.
• You can summarize data in tables by using aggregate functions.
• The GROUP BY clause groups rows so that aggregate functions return
one result for each group.
• Joining is a necessity with relational databases, because normalization
separates related data into multiple tables.
• A join with no joining criteria results in a Cartesian product, in which
each row of the first table is combined with every row of the second
table.
• Primary key/foreign key joins are the most efficient way to join tables.
• There are several joining notations. The SQL92 standard, which
specifies joins in the FROM clause, is the preferred joining notation.
• Inner joins combine rows based on the joining criteria to the exclusion
of all other rows.
• Left outer joins combine rows based on the joining criteria and also
include all rows of the first table in the join.
• Right outer joins combine rows based on the joining criteria and also
include all rows of the second table in the join.
• Full outer joins combine rows based on the joining criteria and also
include all rows from both tables in the join.
• A self join is a special form of inner or outer join that involves only
one table, joined to another aliased instance of itself.
Questions
1. How can you retrieve the last name and first name of all employees in the
Employees table?
2. Which wildcard character do you use with the LIKE keyword to indicate
“all characters”?
3. How can you count the number of employees from each region in the
Employees table?
5. What sort of join would you use to return the rows from the Order Details
table for a specific order?
6. What kind of join would you use to return the names of managers where
the Manager ID is stored in the Employees table for each employee?
Answers
1. How can you retrieve the last name and first name of all employees in the
Employees table?
SELECT LastName, FirstName FROM dbo.Employees;
2. Which wildcard character do you use with the LIKE keyword to indicate
“all characters”?
The percent sign, or '%'
3. How can you count the number of employees from each region in the
Employees table?
SELECT Region, COUNT(*)
FROM dbo.Employees
GROUP BY Region;
5. What sort of join would you use to return the rows from the Order Details
table for a specific order?
An inner join between the Orders table and the Order Details table
6. What kind of join would you use to return the names of managers where
the manager ID is stored in the Employees table for each employee?
A self-join on the Employees table between the Manager ID and
the employee ID
Lab 4:
Data Selection
Queries
Lab 4 Overview
In this lab you’ll learn the fundamentals of building SELECT queries.
Objective
In this exercise, you’ll build a simple select query to retrieve a list of products,
the quantity per unit for each product, and the unit price.
Things to Consider
• Which table in the database contains the data you want to retrieve?
• Which fields are you going to retrieve in the query?
• Are there any expressions that you could use to improve the query?
• In which order should the results appear?
Step-by-Step Instructions
1. Open a query window and type the following statements to use the
Northwind database and to see the column names in the Products table.
Note the names of the columns.
USE Northwind;
SELECT * FROM dbo.Products WHERE 1=0;
2. From the column name information gathered from the SELECT query,
you’re ready to create the query. Concatenate the product name and
quantity per unit, and select the unit price. Sort by the product name. The
following query will produce the results shown in Figure 35, which
displays the first few rows of the result set.
3. Execute the query by pressing the F5 key, clicking the Execute button, or
choosing Query|Execute from the menu.
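The query for step 2 is not reproduced above. The following is one possible sketch, assuming the Northwind columns ProductName, QuantityPerUnit, and UnitPrice. The output column name Product and the parentheses around the quantity are illustrative choices, so the formatting in Figure 35 may differ.

```sql
USE Northwind;

-- Sketch only: concatenates the product name and quantity per unit into
-- one column, then returns the unit price, sorted by product name.
-- If QuantityPerUnit is NULL for a row, the concatenated value is NULL
-- under the default connection settings.
SELECT ProductName + ' (' + QuantityPerUnit + ')' AS Product,
       UnitPrice
FROM dbo.Products
ORDER BY ProductName;
```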
Aggregate Query
Objective
In this exercise, you’ll build an aggregate query to find out how many units are
in stock and how many are on order for products where the UnitPrice is greater
than 30 and at least 1 unit is on order. The query will then calculate and
display the total units by adding the units in stock to the units on order. You'll
then revise the query so that it displays only rows where there are more than
40 total units.
Things to Consider
• Which fields should you use to create the query?
• What kind of aggregate function are you going to need?
• Which fields should you include in the GROUP BY clause?
• When do you use a HAVING clause?
Step-by-Step Instructions
1. To determine the total units, you’ll need to add the units in stock to the
units on order. Any column that is not part of an aggregate function, such as
the product name, has to be included in the GROUP BY clause. For clarity,
order the results by the total units in descending order.
Figure 37. Total units for products with a unit price greater than 30 and at least 1
unit on order.
3. To display only rows where TotalUnits is greater than 40, add a HAVING
clause to the query. Execute the query. The results should look like those
shown in Figure 38.
Figure 38. Displaying only rows where the total units are more than 40.
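The queries for this exercise are not reproduced above. The sketch below is one way to write the final version, assuming the Northwind columns ProductName, UnitsInStock, UnitsOnOrder, and UnitPrice. The output column names InStock, OnOrder, and TotalUnits are illustrative. Removing the HAVING line gives the query for the first step.

```sql
USE Northwind;

-- Sketch only. WHERE filters individual rows before grouping;
-- HAVING filters the groups after the aggregates are calculated.
SELECT ProductName,
       SUM(UnitsInStock) AS InStock,
       SUM(UnitsOnOrder) AS OnOrder,
       SUM(UnitsInStock + UnitsOnOrder) AS TotalUnits
FROM dbo.Products
WHERE UnitPrice > 30
  AND UnitsOnOrder >= 1
GROUP BY ProductName
HAVING SUM(UnitsInStock + UnitsOnOrder) > 40
ORDER BY TotalUnits DESC;
```

The HAVING clause repeats the aggregate expression because a column alias defined in the SELECT list cannot be referenced in HAVING, although it can be used in ORDER BY.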
Objective
In this exercise, you’ll create a query that lists all the products and the category
that each product is in. The data for this query is stored in two tables: Products
and Categories. You’ll insert a new category named All Diet into the
Categories table. You’ll first create a query using an inner join to display all of
the products and categories where each category has at least one product.
You’ll then create a query that displays all categories, including any that have
no products.
Things to Consider
• What are the names of the tables and fields for the query?
• Which field in each table will you use for the join?
• What type of join should you use for the query?
Step-by-Step Instructions
1. Use a SELECT statement to determine the column names you need to use
to insert a row in the Categories table.
2. Execute these statements to insert a new row for the All Diet category into
the Categories table.
3. The CategoryID field is common to both the Products and Categories
tables. It is the primary key in Categories and a foreign key in Products,
making it the ideal joining field.
4. Make sure you specify both the table names and field names every time a
field is referenced in the query:
5. Execute the query and view the results as shown in Figure 39, which
displays the first few rows of the result set.
6. Next, display all of the categories regardless of whether they have any
products associated with them. Use the LEFT JOIN syntax shown here.
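The statements for these steps are not reproduced above. The following sketch walks through them in order, assuming the Northwind columns CategoryID, CategoryName, and Description in Categories, and that CategoryID is generated automatically. The description text in the INSERT is a made-up placeholder.

```sql
USE Northwind;

-- Step 1: return only the column names of the Categories table.
SELECT * FROM dbo.Categories WHERE 1=0;

-- Step 2: insert the new category (the description is a placeholder).
INSERT INTO dbo.Categories (CategoryName, Description)
VALUES (N'All Diet', N'Placeholder description');

-- Steps 3-5: inner join returns only categories that have products.
SELECT Categories.CategoryName, Products.ProductName
FROM dbo.Categories
INNER JOIN dbo.Products
    ON Categories.CategoryID = Products.CategoryID
ORDER BY Categories.CategoryName, Products.ProductName;

-- Step 6: left join returns all categories, including All Diet,
-- which shows NULL for ProductName.
SELECT Categories.CategoryName, Products.ProductName
FROM dbo.Categories
LEFT JOIN dbo.Products
    ON Categories.CategoryID = Products.CategoryID
ORDER BY Categories.CategoryName, Products.ProductName;
```

Running the INSERT more than once adds a duplicate All Diet row each time, so execute it only once.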
Objective
In this exercise, you’ll modify the query from the previous exercise to turn it
into an aggregate query with multiple inner joins. Instead of listing only the
product name and category, you’ll also include the total sales for that product.
Things to Consider
• Which tables are needed to retrieve the data for the query? Which
fields?
• Which fields should you use to join the tables together?
• What type of join should you use?
• Which fields should be aggregated? What aggregate function should
you use?
• Which fields should you include in the GROUP BY clause?
Step-by-Step Instructions
1. The Order Details table is required in order to aggregate the data.
2. The ProductID field is the common field between the Products table and
the Order Details table. CategoryID remains the common field between
the Products table and the Categories table.
3. To get total sales, the Quantity must be multiplied by the UnitPrice in the
Order Details table. To total the results for each product, you must use the
SUM aggregate function.
5. Execute the query and view the results as shown in Figure 41.
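The query for this exercise is not reproduced above. A sketch that follows the steps, assuming the Northwind column names and an illustrative output column name of TotalSales, could be written as follows. It ignores the Discount column in Order Details, since the steps mention only Quantity and UnitPrice.

```sql
USE Northwind;

-- Sketch only: two inner joins, with sales totaled per product.
-- The table name Order Details contains a space, so it needs brackets.
SELECT Categories.CategoryName,
       Products.ProductName,
       SUM([Order Details].Quantity * [Order Details].UnitPrice) AS TotalSales
FROM dbo.Categories
INNER JOIN dbo.Products
    ON Categories.CategoryID = Products.CategoryID
INNER JOIN dbo.[Order Details]
    ON Products.ProductID = [Order Details].ProductID
GROUP BY Categories.CategoryName, Products.ProductName
ORDER BY Categories.CategoryName, Products.ProductName;
```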
Objective
In this exercise, you’ll modify the query from the previous exercise to use an
outer join. The goal is to make certain that all products are included in the
query.
Things to Consider
• Which join should you change to an outer join?
• What kind of outer join should you use?
• In what order should the joins occur?
Step-by-Step Instructions
1. There are only two joins in the query, the inner join between Categories
and Products, and the inner join between Products and Order Details.
2. The left join in the earlier exercise returned all categories. Here, the inner
join to the Products table may exclude categories that have no matching
products, and the inner join to the Order Details table may exclude products
that have no orders.
3. The simplest solution is to change all of the inner joins to left joins. The
modified query is as follows:
4. Execute the query and view the results, shown in Figure 42.
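The modified query is not reproduced above. Under the same assumptions as the previous exercise (Northwind column names, an illustrative TotalSales alias, Discount ignored), changing both joins to left joins might look like this:

```sql
USE Northwind;

-- Sketch only: both joins are now LEFT JOINs, so categories without
-- products and products without orders stay in the result.
SELECT Categories.CategoryName,
       Products.ProductName,
       SUM([Order Details].Quantity * [Order Details].UnitPrice) AS TotalSales
FROM dbo.Categories
LEFT JOIN dbo.Products
    ON Categories.CategoryID = Products.CategoryID
LEFT JOIN dbo.[Order Details]
    ON Products.ProductID = [Order Details].ProductID
GROUP BY Categories.CategoryName, Products.ProductName
ORDER BY Categories.CategoryName, Products.ProductName;
```

Rows with no matching order details show NULL for TotalSales, because SUM over no non-null values returns NULL rather than zero.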