SQL Notes

This document discusses string functions and ordering in SQL. It provides examples of common string functions like LENGTH, UPPER, LOWER, TRIM, SUBSTRING, and CONCAT. It explains that ordering string columns depends on the collation, which defines character precedence and sorting options like case sensitivity.

Uploaded by

Carl
Copyright
© All Rights Reserved

Order

SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT

- In Impala, you can't use a column alias (defined with AS in the SELECT clause) in the WHERE clause. Use the
original column name or expression in the WHERE clause instead.

Common String Functions


There are many string functions available for use in SQL statements. These are useful for working
with text or character string data types. The following list is not exhaustive, but it does present some
of the more common ones you might want to use.

Unless otherwise noted, the function takes a string argument and returns a string.

Note: The bottom square bracket character (⎵) is sometimes used below to represent whitespace, which could be, for
example, a space or a tab character. For some of these functions, it would be difficult to see the effect if
regular spaces were used.

length(str)

This returns an integer value equal to the number of characters in the string argument str.

Notes:

● The name of this function is different, depending on the SQL engine you're using. For
example, some engines use len(str) or char_length(str). (The other functions described
below have the same name across all the major SQL engines.)
● For Apache Hive and Apache Impala, use the length function; it works as described
here.
● Some SQL engines have functions that are similar to length as described above, but
that return the number of bytes or other units of information that are required to store a
character string. If you're using some other SQL engine besides Hive or Impala, check
the documentation to be sure you understand what the length function returns and to
see what other similar functions are available.

Examples:
length('Common String Functions') = 23

length(' Common String Functions ') = 25

length('') = 0

reverse(str)

This returns the characters within the string argument str, but in the reverse order. Try it with your
favorite palindrome!

Examples:

reverse('Common String Functions') = 'snoitcnuF gnirtS nommoC'

reverse('never odd or even') = 'neve ro ddo reven'

upper(str), lower(str)

These return the string str but with all characters converted either to uppercase or lowercase. These
can be useful for doing case-insensitive string comparisons (by converting the string to be compared
to one case, for example, WHERE upper(fname) = 'BOB' or WHERE lower(fname) = 'bob').

Examples:

upper('Common String Functions') = 'COMMON STRING FUNCTIONS'

lower('Common String Functions') = 'common string functions'

trim(str), ltrim(str), rtrim(str)

These remove whitespace at the ends of the argument str. You can choose to remove only leading
whitespace (ltrim for left trim), trailing whitespace (rtrim for right trim), or both (trim). If there is no
whitespace on the specified end, the string is unchanged.

Examples:
trim('⎵Common String Functions⎵ ⎵ ⎵') = 'Common String Functions'

ltrim('⎵Common String Functions⎵ ⎵ ⎵') = 'Common String Functions⎵ ⎵ ⎵'

rtrim('⎵Common String Functions⎵ ⎵ ⎵') = '⎵Common String Functions'

ltrim('Common String Functions⎵ ⎵ ⎵') = 'Common String Functions⎵ ⎵ ⎵'

rtrim('⎵Common String Functions') = '⎵Common String Functions'

lpad(str, n, padstr), rpad(str, n, padstr)

These functions take a string str and an integer n and return a string of length n. If the original string
str is shorter than n characters, the returned string will be str with characters from padstr added at
the left (lpad) or the right (rpad) to make it length n. (This is called padding the string, and is the
opposite of trimming.) These functions are often used to add zeros to the left or right of numbers that
are represented in strings (this is called zero-padding). If necessary, the pad string will be repeated.
If the length of str is longer, however, the function will return a truncated version of the string.
Truncated characters will be taken from the right, regardless of which function you specify.

Examples:

lpad('.50', 4, '0') = '0.50'

rpad('0.5', 4, '0') = '0.50'

rpad('Common', 13, ' String') = 'Common String'

rpad('Common', 17, ' String') = 'Common String Str'

lpad('Common', 17, ' String') = ' String StrCommon'

rpad('Common String', 6, ' Function') = 'Common'

lpad('Common String', 6, ' Function') = 'Common'
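
The padding and truncation rules above can be sketched in a few lines of Python. This is only an illustration of the behavior described in these notes, not any engine's actual implementation; the function names mirror the SQL ones, and the handling of an empty pad string is an assumption, since engines differ on that case.

```python
def rpad(s: str, n: int, padstr: str) -> str:
    """Return s padded on the right with padstr (repeated as needed) to length n."""
    if len(s) >= n:
        return s[:n]              # too long: keep only the leftmost n characters
    if not padstr:
        return s                  # assumption: with nothing to pad with, return s unchanged
    fill = (padstr * n)[: n - len(s)]
    return s + fill


def lpad(s: str, n: int, padstr: str) -> str:
    """Return s padded on the left with padstr (repeated as needed) to length n."""
    if len(s) >= n:
        return s[:n]              # truncation still drops characters from the right
    if not padstr:
        return s
    fill = (padstr * n)[: n - len(s)]
    return fill + s


print(lpad(".50", 4, "0"))                   # 0.50
print(rpad("Common", 17, " String"))         # Common String Str
print(lpad("Common", 17, " String"))         #  String StrCommon
print(lpad("Common String", 6, " Function")) # Common
```

Both functions assume n is a non-negative integer.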

substring(str, index, max_length)


This function takes a string and two integers, and returns a portion of the original string. The
argument index indicates where to start the substring (indexing the original string str starting at 1)
and max_length is how many characters to include (though it might be fewer, if the end of the
original string is reached). With many SQL engines, you can also use substr which is an alias for
substring.

Examples:

substring('Common String',1,6) = 'Common'

substring('Common String',8,3) = 'Str'

substring('Common String',8,6) = 'String'

substring('Common String',8,10) = 'String'
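
To try the examples above without a Hive or Impala cluster, here is a minimal sketch that runs them in SQLite through Python's built-in sqlite3 module. SQLite is an assumption made for convenience; it uses the substr name (the alias mentioned above), with the same 1-based indexing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

queries = [
    "SELECT substr('Common String', 1, 6)",
    "SELECT substr('Common String', 8, 3)",
    "SELECT substr('Common String', 8, 10)",  # asks for 10 characters, only 6 remain
]
results = [conn.execute(q).fetchone()[0] for q in queries]
print(results)  # ['Common', 'Str', 'String']
```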

concat(str1, str2[, str3, …]), concat_ws(sep, str1, str2[, str3, …])

These functions concatenate strings; that is, they put them together into a single string. The ws in
concat_ws stands for “with separator”; in that case, the first argument is placed between each pair of
strings. In both cases, the arguments are concatenated in the order given.

Notes:

● Both concat and concat_ws must include at least two strings to concatenate. They can
take more than two, as well.
● Some SQL engines have an operator for string concatenation, usually + or ||. However,
Hive and Impala do not have concatenation operators; one of these functions must be
used.

Examples:

concat('Common','String') = 'CommonString'

concat_ws(' ','Common','String') = 'Common String'

concat('Common','String','Functions') = 'CommonStringFunctions'

concat_ws(' ','Common','String','Functions') = 'Common String Functions'


concat_ws(', ','Common','String','Functions') = 'Common, String, Functions'

Non-ASCII characters

Note that the string functions in different SQL engines can differ in their handling of non-ASCII
characters. For example, in most SQL engines, upper('é') returns É, but in others it might return é or
throw an error. You should test or consult the documentation to see how this works.

Other String Functions

Many more string functions are available in most SQL engines. For example, there are functions for
splitting strings, extracting parts of strings, and finding and replacing specific characters or
substrings within strings. If you are interested in them, check the documentation of the SQL engine
you are using (probably under “String Functions”).

List of Aggregate Functions

COUNT, MIN, MAX, SUM, AVG

Helpful Links
Hive.apache.org (go to Language Manual, then click Select)
Impala.apache.org
W3schools.com

Ordering by String Columns


You can control the sort order of SQL query results using the ORDER BY clause. When sorting on a
numeric column, the resulting order typically makes intuitive sense, but when sorting on a string
column, you might be surprised by the resulting order. This is especially true when the strings
include numbers, or a mix of numbers and letters or other characters within a value.

Unfortunately, there isn't a simple explanation to tell you how SQL will sort your results, because it
depends on what collation you are using.

A DBMS uses a collating sequence, or collation, to determine the order in which characters are sorted.
The collation defines the order of precedence for every character in your character set. Your character
set depends on the language that you’re using—European languages (a Latin character set), Hebrew
(the Hebrew alphabet), or Chinese (ideographs), for example. The collation also determines case
sensitivity (is ‘A’ < ‘a’?), accent sensitivity (is ‘A’ < ‘À’ ?), width sensitivity (for multibyte or Unicode
characters), and other factors such as linguistic practices. The SQL standard doesn’t define particular
collations and character sets, so each DBMS uses its own sorting strategy and default collation…
Search your DBMS documentation for collation or sort order. (1)

Collations have different options associated with them, and many can be customized depending on
the system you are using. For English, case sensitivity is a major one to consider—should "A" and
"a" be considered the same character for the purposes of ordering? Others include accent sensitivity
(for example, should "a" and "á" be considered the same), Kana sensitivity (which distinguishes
between the two types of Japanese characters), and script order (for example, which should be
ordered first: Hebrew, Greek, or Cyrillic). See "Customization" (2) and "Collation" (3) for more
examples of these and other options.
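
As a small illustration of how the collation changes the result, the sketch below sorts the same words under two of SQLite's built-in collations. SQLite (through Python's sqlite3 module) and the sample words are assumptions chosen for convenience; other engines name and configure their collations differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("banana",), ("Apple",), ("apple",), ("Banana",), ("cherry",)],
)

# BINARY compares raw character codes, so every uppercase letter sorts before lowercase.
binary_order = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w COLLATE BINARY")]

# NOCASE treats 'A' and 'a' as equal; the second sort key only breaks ties.
nocase_order = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w COLLATE NOCASE, w")]

print(binary_order)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
print(nocase_order)  # ['Apple', 'apple', 'Banana', 'banana', 'cherry']
```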

When using Unicode—an industry standard that assigns a number to each character or symbol—
SQL will most likely follow the Unicode ordering to distinguish the order of two characters, while
taking customizations into account. Non-Unicode data may have a different order:

When you use a SQL collation you might see different results for comparisons of the same characters,
depending on the underlying data type. For example, if you are using the SQL collation
"SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the
hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these
strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be
greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen. (4)

When it comes to numbers represented within strings, you must remember that string sorting is
done on a character-by-character basis. For example:

'42' < '71'
This compares only the first characters: '4'<'7'. The order is now established and any other
remaining characters can be ignored.

'42' < '45'
The first characters are the same, '4' = '4', so the sort then compares the next characters,
'2'<'5'. So '42' < '45'.

'42' < '7'
Although numerically 42 > 7, the sort compares the first characters, '4' and '7'. Since '4' < '7',
the order is established and any other remaining characters are ignored. For this string sort,
'42' < '7'.

You can sometimes find ways to customize the sort, when necessary. For example, "Use SQL
Server to Sort Alphanumeric Values" (5) provides a method, usable with Microsoft SQL Server, to
sort values with a mixture of letters and numerals that would consider '7' < '42'.
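
For strings that contain only digits, one simple workaround is to cast them to a number before sorting. The sketch below shows the difference in SQLite through Python's sqlite3 module; it is not the method from the cited article, and the table name nums and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nums (val TEXT)")
conn.executemany("INSERT INTO nums VALUES (?)", [("42",), ("7",), ("45",), ("71",)])

# String sort: compared character by character, so '42' < '7'.
as_text = [r[0] for r in conn.execute("SELECT val FROM nums ORDER BY val")]

# Numeric sort: cast each string to an integer first, so 7 < 42.
as_number = [r[0] for r in conn.execute(
    "SELECT val FROM nums ORDER BY CAST(val AS INTEGER)")]

print(as_text)    # ['42', '45', '7', '71']
print(as_number)  # ['7', '42', '45', '71']
```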

Spaces, especially leading spaces, often cause confusion as well. The space character is typically
considered to come before any number or letter, and some punctuation as well. Again, sort order is
done character by character. For example:

'no one' < 'nobody'
The first characters are equivalent, 'n' = 'n', so the sort moves to the second
characters. These are also equivalent, 'o' = 'o', so the sort moves to the third characters.
These are ' ' and 'b', and ' ' < 'b', so 'no one' < 'nobody'.

' start' < 'begin'
Notice that the first character in the string on the left is a space. While 'begin' < 'start'
because 'b' < 's', these strings sort as ' start' < 'begin' because ' ' < 'b'.
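
The effect of spaces can be checked with a short sketch, again using SQLite through Python's sqlite3 module as a convenient stand-in (an assumption; the table name phrases is made up). Sorting on trim(s) is one possible way to ignore leading spaces when that is what you want.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (s TEXT)")
conn.executemany(
    "INSERT INTO phrases VALUES (?)",
    [("nobody",), ("begin",), (" start",), ("no one",)],
)

# Plain string sort: the space character sorts before any letter.
raw_order = [r[0] for r in conn.execute("SELECT s FROM phrases ORDER BY s")]

# Sorting on the trimmed value ignores leading and trailing spaces.
trimmed_order = [r[0] for r in conn.execute("SELECT s FROM phrases ORDER BY trim(s)")]

print(raw_order)      # [' start', 'begin', 'no one', 'nobody']
print(trimmed_order)  # ['begin', 'no one', 'nobody', ' start']
```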

For more detail on these points, see the referenced articles.

(1) Fehily, Chris. SQL Visual QuickStart Guide, 3rd Edition. Retrieved from
http://www.peachpit.com/articles/article.aspx?p=1276352&seqNum=4 on May 25, 2018.

(2) Unicode® Technical Standard #10: Unicode Collation Algorithm. Retrieved from
http://unicode.org/reports/tr10/#Customization on May 25, 2018.

(3) Collation and Unicode Support. Retrieved from
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017#Collation_Defn
on May 25, 2018.

(4) Comparing SQL collations to Windows collations. Retrieved from
https://support.microsoft.com/en-us/help/322112/comparing-sql-collations-to-windows-collations on
May 25, 2018.

(5) Use SQL Server to Sort Alphanumeric Values. Retrieved from
https://www.essentialsql.com/use-sql-server-to-sort-alphanumeric-values/ on May 25, 2018.

Alternative Join Syntax


This reading describes alternative ways of expressing joins in SQL. We do not recommend using the
techniques described in this reading, but you should familiarize yourself with them so you can read
and understand SQL queries that use them.

SQL-92-Style Joins and SQL-89-Style Joins


In the video lectures describing joins in SQL, the following join syntax is used:

SELECT ...
FROM toys JOIN makers
ON toys.maker_id = makers.id;

Notice the JOIN keyword between the table names, and the ON keyword followed by the join
condition. This is called a SQL-92-style join, or explicit join syntax, and it is usually considered to be
the best syntax to use for joins in SQL.
However, many SQL engines also support another join syntax, called the SQL-89-style join, or
implicit join syntax. In this syntax, you use a comma-separated list of table names in the FROM
clause, and you specify the join condition in the WHERE clause:

SELECT ...
FROM toys, makers
WHERE toys.maker_id = makers.id;

With most SQL engines, this join query returns exactly the same result as the previous one.
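
A quick way to confirm that the two styles are equivalent is to run both against the same data. The sketch below does this in SQLite through Python's sqlite3 module (an assumption made for convenience); the rows are the toys and makers values that appear in the cross join result later in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE makers (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE toys (id INTEGER, name TEXT, price REAL, maker_id INTEGER);
    INSERT INTO makers VALUES
        (105, 'Hasbro', 'Pawtucket, RI'),
        (106, 'Ohio Art Company', 'Bryan, OH'),
        (107, 'Mattel', 'Segundo, CA');
    INSERT INTO toys VALUES
        (21, 'Lite-Brite', 14.47, 105),
        (22, 'Mr. Potato Head', 11.50, 105),
        (23, 'Etch A Sketch', 29.99, 106);
""")

# SQL-92 style: JOIN keyword plus ON condition.
explicit = conn.execute("""
    SELECT toys.name, makers.name
    FROM toys JOIN makers
    ON toys.maker_id = makers.id
    ORDER BY toys.id
""").fetchall()

# SQL-89 style: comma-separated tables plus WHERE condition.
implicit = conn.execute("""
    SELECT toys.name, makers.name
    FROM toys, makers
    WHERE toys.maker_id = makers.id
    ORDER BY toys.id
""").fetchall()

print(explicit == implicit)  # True
print(explicit)
```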

With both join styles, you can use table aliases (t and m in this example):

SELECT ...
FROM toys AS t JOIN makers AS m
ON t.maker_id = m.id;

SELECT ...
FROM toys AS t, makers AS m
WHERE t.maker_id = m.id;

With both styles, the AS keyword before each table alias is optional.

When you use a SQL-89-style join, the SQL engine always performs an inner join. With this syntax,
there is no way to specify any other type of join. If you want to use one of the other types of joins (left
outer, right outer, full outer), then you must use a SQL-92-style join. Because of this limitation, and
because the SQL-89-style join syntax makes it harder to understand the intent of the query, we
recommend using SQL-92-style joins.

Unqualified Column References in Join Condition


In the join condition that comes after the ON keyword in a join query, the references to the
corresponding columns are typically qualified with table names or table aliases. For example, when
joining the toys table (alias t) and makers table (alias m), the join condition is specified as:

ON t.maker_id = m.id

However, in the case where a bare column name unambiguously identifies a column, most SQL
engines allow you to use a bare column name. For example, since there is no column named
maker_id in the makers table, the table alias t is not required in this join condition. So you could
specify the join condition as:

ON maker_id = m.id

But because there are columns named id in both tables, the table alias m is required in this join
condition. If you omit the table alias m, then the SQL engine will throw an error indicating that the
column reference id is ambiguous.

In join conditions, we recommend always qualifying column names with table names or table aliases,
whether or not they are strictly required. Doing this makes your queries safer and clearer.
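
The ambiguity error is easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module (an assumption; the exact error text varies by engine) with cut-down versions of the toys and makers tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE makers (id INTEGER, name TEXT);
    CREATE TABLE toys (id INTEGER, name TEXT, maker_id INTEGER);
    INSERT INTO makers VALUES (105, 'Hasbro');
    INSERT INTO toys VALUES (21, 'Lite-Brite', 105);
""")

# maker_id exists only in toys, so it can be left unqualified.
rows = conn.execute(
    "SELECT t.name, m.name FROM toys AS t JOIN makers AS m ON maker_id = m.id"
).fetchall()
print(rows)  # [('Lite-Brite', 'Hasbro')]

# id exists in both tables, so leaving it unqualified is an error.
error_message = None
try:
    conn.execute("SELECT t.name FROM toys AS t JOIN makers AS m ON maker_id = id")
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)
```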

The USING Keyword


In some join queries, the names of the two corresponding columns in the join condition are identical.
For example, in this query, the corresponding columns in the employees and offices table are both
named office_id:

SELECT …
FROM employees e JOIN offices o
ON e.office_id = o.office_id;

When the corresponding columns in the join condition have identical names, some SQL engines
allow you to use a shorthand notation to specify the join condition. Instead of using the ON keyword
and specifying the condition as an equality expression, you use the USING keyword and specify the
common join key column name in parentheses after USING:

SELECT …
FROM employees e JOIN offices o
USING (office_id);
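
The sketch below compares the ON and USING forms on the same data, using SQLite through Python's sqlite3 module (one of the engines that supports USING). The column first_name, the column city, and all of the rows are hypothetical; only the table names and office_id come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office_id INTEGER, city TEXT);
    CREATE TABLE employees (first_name TEXT, office_id INTEGER);
    INSERT INTO offices VALUES (1, 'Boston'), (2, 'Chicago');
    INSERT INTO employees VALUES ('Ana', 1), ('Ben', 2), ('Cy', 1);
""")

on_rows = conn.execute("""
    SELECT e.first_name, o.city
    FROM employees e JOIN offices o
    ON e.office_id = o.office_id
    ORDER BY e.first_name
""").fetchall()

using_rows = conn.execute("""
    SELECT e.first_name, o.city
    FROM employees e JOIN offices o
    USING (office_id)
    ORDER BY e.first_name
""").fetchall()

print(on_rows == using_rows)  # True
print(using_rows)             # [('Ana', 'Boston'), ('Ben', 'Chicago'), ('Cy', 'Boston')]
```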

Natural Joins
When the corresponding columns in the join condition have identical names, some SQL engines will
allow you to omit the join condition, and will automatically join the tables on all the pairs of columns
that have identical names in the left and right tables. To make a SQL engine do this, you need to
specify the keyword NATURAL before the other join keywords. For example:

SELECT …
FROM employees e NATURAL JOIN offices o;

MySQL and PostgreSQL support natural joins, but Hive and Impala do not. In the SQL engines that
support it, you can use the keyword NATURAL with any type of join; for example: NATURAL LEFT
OUTER JOIN or NATURAL INNER JOIN.
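
Because a natural join matches on every pair of identically named columns, it can silently join on more columns than you intended. The sketch below shows both outcomes in SQLite through Python's sqlite3 module (SQLite also supports NATURAL). The employees and offices columns and rows are hypothetical; the toys and makers rows are a subset of the sample data in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: office_id is the only column name they share.
    CREATE TABLE offices (office_id INTEGER, city TEXT);
    CREATE TABLE employees (first_name TEXT, office_id INTEGER);
    INSERT INTO offices VALUES (1, 'Boston'), (2, 'Chicago');
    INSERT INTO employees VALUES ('Ana', 1), ('Ben', 2), ('Cy', 1);

    -- toys and makers share two column names: id and name.
    CREATE TABLE makers (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE toys (id INTEGER, name TEXT, price REAL, maker_id INTEGER);
    INSERT INTO makers VALUES (105, 'Hasbro', 'Pawtucket, RI'), (106, 'Ohio Art Company', 'Bryan, OH');
    INSERT INTO toys VALUES (21, 'Lite-Brite', 14.47, 105), (23, 'Etch A Sketch', 29.99, 106);
""")

# Joins on office_id, the single shared column name.
natural_rows = conn.execute("""
    SELECT e.first_name, o.city
    FROM employees e NATURAL JOIN offices o
    ORDER BY e.first_name
""").fetchall()

# Joins on id AND name, which never match between toys and makers.
toy_rows = conn.execute("SELECT * FROM toys NATURAL JOIN makers").fetchall()

print(natural_rows)  # [('Ana', 'Boston'), ('Ben', 'Chicago'), ('Cy', 'Boston')]
print(toy_rows)      # []
```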

Omitting Join Conditions


What happens if you attempt to perform a join without specifying the join condition, and you do not
specify NATURAL before the join keywords?

For example, you might run a query like this:

SELECT *
FROM toys JOIN makers;

Notice that no join condition is specified. With some SQL engines (including PostgreSQL), this
throws an error. But with other SQL engines (including Impala, Hive, and MySQL) this performs
what’s called a cross join. In a cross join, the SQL engine iterates through each row in the table on
the left side and combines it with every row in the table on the right side. So the result set includes
every possible combination of the rows in the left table and the rows in the right table. The number of
rows in the result set is the product (multiplication) of the number of rows in the left table and the
number of rows in the right table (in this example, 3 x 3 = 9):

id  name             price  maker_id  id   name              city
21  Lite-Brite       14.47  105       105  Hasbro            Pawtucket, RI
21  Lite-Brite       14.47  105       106  Ohio Art Company  Bryan, OH
21  Lite-Brite       14.47  105       107  Mattel            Segundo, CA
22  Mr. Potato Head  11.50  105       105  Hasbro            Pawtucket, RI
22  Mr. Potato Head  11.50  105       106  Ohio Art Company  Bryan, OH
22  Mr. Potato Head  11.50  105       107  Mattel            Segundo, CA
23  Etch A Sketch    29.99  106       105  Hasbro            Pawtucket, RI
23  Etch A Sketch    29.99  106       106  Ohio Art Company  Bryan, OH
23  Etch A Sketch    29.99  106       107  Mattel            Segundo, CA

In most cases, the result of a cross join is meaningless. The rows of the result contain values with no
correspondence. If you don’t realize that you have performed a cross join, you might be misled by
the results. In addition, when performed on large tables, a cross join can return a dangerously large
number of rows.
There are some specific cases when cross joins are useful, and in most SQL dialects, you can
explicitly specify CROSS JOIN in your SQL statement to make it clear that you are performing a
cross join. This is discussed in a video in the upcoming honors lesson.

So unless you intend to perform a cross join, and you understand the risks of this and how to
interpret the output, we recommend specifying the join condition in every join query.
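
The row counts are easy to verify. The sketch below runs the three variants in SQLite through Python's sqlite3 module; SQLite is an assumption made for convenience, and like Impala, Hive, and MySQL it treats a JOIN with no condition as a cross join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE makers (id INTEGER, name TEXT, city TEXT);
    CREATE TABLE toys (id INTEGER, name TEXT, price REAL, maker_id INTEGER);
    INSERT INTO makers VALUES
        (105, 'Hasbro', 'Pawtucket, RI'),
        (106, 'Ohio Art Company', 'Bryan, OH'),
        (107, 'Mattel', 'Segundo, CA');
    INSERT INTO toys VALUES
        (21, 'Lite-Brite', 14.47, 105),
        (22, 'Mr. Potato Head', 11.50, 105),
        (23, 'Etch A Sketch', 29.99, 106);
""")

def count_rows(sql: str) -> int:
    """Run a query and return how many rows it produces."""
    return len(conn.execute(sql).fetchall())

no_condition = count_rows("SELECT * FROM toys JOIN makers")
explicit_cross = count_rows("SELECT * FROM toys CROSS JOIN makers")
with_condition = count_rows(
    "SELECT * FROM toys JOIN makers ON toys.maker_id = makers.id")

print(no_condition, explicit_cross, with_condition)  # 9 9 3
```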

THEN IN CASE STATEMENT


SELECT
    c.name AS country,
    COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS matches_2012_2013,
    COUNT(CASE WHEN m.season = '2013/2014' THEN m.id END) AS matches_2013_2014,
    COUNT(CASE WHEN m.season = '2014/2015' THEN m.id END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
    ON c.id = m.country_id
GROUP BY country;

Taken from a random data sample: you can put any value in the THEN clause of the CASE statement.
Here it is used with an aggregate to count the rows that meet the criteria set by the CASE
statement.

The value after THEN (m.id in this example; highlighted in red in the original notes) refers to a
column, but it can also be a number such as 0, 1, 2, or 3 (in which case you would use SUM instead
of COUNT); the numbers returned when the condition is TRUE are the values added up by the aggregate
function. It can also be any string/text, since COUNT (in this case) counts the rows for which the
CASE statement returns a value.
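
Here is a small runnable sketch of the COUNT and SUM variants side by side, using SQLite through Python's sqlite3 module. All of the rows are made up for illustration, and the query groups by c.name rather than by the alias to stay portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "match" is quoted because MATCH is also an SQLite keyword.
    CREATE TABLE country (id INTEGER, name TEXT);
    CREATE TABLE "match" (id INTEGER, country_id INTEGER, season TEXT);
    INSERT INTO country VALUES (1, 'Belgium'), (2, 'England'), (3, 'Spain');
    INSERT INTO "match" VALUES
        (1, 1, '2012/2013'),
        (2, 1, '2012/2013'),
        (3, 1, '2013/2014'),
        (4, 2, '2013/2014');
""")

rows = conn.execute("""
    SELECT
        c.name AS country,
        -- COUNT ignores the NULL that CASE returns when the condition is false
        COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS count_version,
        -- SUM adds 1 for each matching row and 0 otherwise
        SUM(CASE WHEN m.season = '2012/2013' THEN 1 ELSE 0 END) AS sum_version
    FROM country AS c
    LEFT JOIN "match" AS m
        ON c.id = m.country_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Belgium', 2, 2), ('England', 0, 0), ('Spain', 0, 0)]
```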

Subquery
A query found inside another query; there will be multiple SELECT statements in the same
query.

Helpful for intermediary transformations of data.

Can be placed in any part of a query (SELECT, FROM, WHERE, GROUP BY).

Can return a single number (scalar), a list, or a table.


Subqueries in WHERE and SELECT can only return a single column (That is only one column
can follow the subquery SELECT). Can then use main FROM statement for more complex
queries.

Can use multiple subqueries in the FROM statement (make sure you use an alias)

If a subquery is long and complex, you can use a common table expression (CTE), formatted like:

WITH subquery_nickname AS (
    entire subquery goes here
)

The main SELECT statement then follows. It works much like a declaration in math (let x =
….)

Use only one WITH keyword, even with multiple CTEs; separate the CTEs with commas.
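
As a sketch of what two CTEs under a single WITH look like in practice, the example below runs in SQLite through Python's sqlite3 module. The CTE names season_goals and season_matches and all of the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "match" is quoted because MATCH is also an SQLite keyword.
    CREATE TABLE "match" (id INTEGER, season TEXT, home_goal INTEGER, away_goal INTEGER);
    INSERT INTO "match" VALUES
        (1, '2012/2013', 2, 1),
        (2, '2012/2013', 0, 0),
        (3, '2013/2014', 3, 2);
""")

rows = conn.execute("""
    WITH season_goals AS (
        SELECT season, SUM(home_goal + away_goal) AS goals
        FROM "match"
        GROUP BY season
    ),
    season_matches AS (   -- second CTE: a comma, not a second WITH
        SELECT season, COUNT(*) AS n_matches
        FROM "match"
        GROUP BY season
    )
    SELECT g.season, g.goals, s.n_matches
    FROM season_goals AS g
    JOIN season_matches AS s
        ON g.season = s.season
    ORDER BY g.season
""").fetchall()

print(rows)  # [('2012/2013', 3, 2), ('2013/2014', 5, 1)]
```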

Window Function
Allows us to use aggregate functions without the GROUP BY statement for non aggregate
columns.

To use one, you must add the OVER() clause after the aggregate function.

The catch is that if you want to order the results in a certain way, you must use an awkward rank
syntax:

RANK() OVER (ORDER BY AVG(m.home_goal + m.away_goal)) AS league_rank
… -- eventually add ORDER BY with the alias you created above
ORDER BY league_rank;
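
The sketch below shows a ranking query end to end in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). It is a variation on the snippet above: the average is computed in a subquery first and then ranked, which sidesteps engine differences in mixing aggregates and window functions. The league column and all of the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "match" is quoted because MATCH is also an SQLite keyword.
    CREATE TABLE "match" (league TEXT, home_goal INTEGER, away_goal INTEGER);
    INSERT INTO "match" VALUES
        ('A', 1, 1), ('A', 3, 1),
        ('B', 0, 1), ('B', 1, 0),
        ('C', 2, 2);
""")

rows = conn.execute("""
    SELECT
        league,
        avg_goals,
        RANK() OVER (ORDER BY avg_goals) AS league_rank
    FROM (
        SELECT league, AVG(home_goal + away_goal) AS avg_goals
        FROM "match"
        GROUP BY league
    ) AS league_avgs
    ORDER BY league_rank
""").fetchall()

print(rows)  # [('B', 1.0, 1), ('A', 3.0, 2), ('C', 4.0, 3)]
```

Adding DESC inside the OVER clause would rank the highest average first instead.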

PARTITION BY lets us compute an aggregate function separately for each value of the particular
column(s) that we want. All that is needed is to specify the column(s) after the PARTITION
BY clause

AVG(home_goal) OVER(PARTITION BY season) AS season_homeavg,
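
A minimal runnable sketch of this line, using SQLite through Python's sqlite3 module (SQLite 3.25 or later) and made-up rows. Note that every row is kept; each one simply gets the average for its own season.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "match" is quoted because MATCH is also an SQLite keyword.
    CREATE TABLE "match" (id INTEGER, season TEXT, home_goal INTEGER);
    INSERT INTO "match" VALUES
        (1, '2012/2013', 2),
        (2, '2012/2013', 0),
        (3, '2013/2014', 3);
""")

rows = conn.execute("""
    SELECT
        id,
        season,
        home_goal,
        AVG(home_goal) OVER(PARTITION BY season) AS season_homeavg
    FROM "match"
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
# (1, '2012/2013', 2, 1.0)
# (2, '2012/2013', 0, 1.0)
# (3, '2013/2014', 3, 3.0)
```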

Sliding windows allow us to make running calculations on the aggregated data from the OVER
clause. For example (the frame must be written from its start to its end, so UNBOUNDED
PRECEDING comes first):

SUM(home_goal) OVER(ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,

In this example the home_goals are added from all the previous rows to the current one. There
are five keywords that can be used for the sliding windows

PRECEDING: Included with a number, so for example ROWS BETWEEN 4 PRECEDING AND
CURRENT ROW will include the previous 4 rows up to the current row

FOLLOWING: Included with a number, the opposite of PRECEDING (the given number of rows after the current row)

UNBOUNDED PRECEDING: All rows prior to the current row are to be included

UNBOUNDED FOLLOWING: All rows after the current row are to be included

CURRENT ROW: the current row in the table
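
The sketch below tries two of these frames in SQLite through Python's sqlite3 module (SQLite 3.25 or later): a running total, and a sum over the current row plus the one before it. The dates and goal counts are made up, and the alias last_two is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "match" is quoted because MATCH is also an SQLite keyword.
    CREATE TABLE "match" (id INTEGER, date TEXT, home_goal INTEGER);
    INSERT INTO "match" VALUES
        (1, '2014-01-01', 2),
        (2, '2014-01-02', 0),
        (3, '2014-01-03', 3),
        (4, '2014-01-04', 1);
""")

rows = conn.execute("""
    SELECT
        date,
        home_goal,
        SUM(home_goal) OVER(ORDER BY date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
        SUM(home_goal) OVER(ORDER BY date
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS last_two
    FROM "match"
    ORDER BY date
""").fetchall()

for row in rows:
    print(row)
# ('2014-01-01', 2, 2, 2)
# ('2014-01-02', 0, 2, 2)
# ('2014-01-03', 3, 5, 3)
# ('2014-01-04', 1, 6, 4)
```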
