SQL Notes
- In Impala, you can't use a column alias (defined with AS in the SELECT clause) in the WHERE clause. Use the
original column name or expression in the WHERE clause instead.
Unless otherwise noted, the function takes a string argument and returns a string.
Note: the bottom square bracket character (⎵) is sometimes used here to represent whitespace, which could be,
for example, a space or a tab character. For some of these functions, the effect would be difficult to see if
regular spaces were used.
length(str)
This returns an integer value equal to the number of characters in the string argument str.
Notes:
● The name of this function is different, depending on the SQL engine you're using. For
example, some engines use len(str) or char_len(str). (The other functions described
below have the same name across all the major SQL engines.)
● For Apache Hive and Apache Impala, use the length function; it works as described
here.
● Some SQL engines have functions that are similar to length as described above, but
that return the number of bytes or other units of information that are required to store a
character string. If you're using some other SQL engine besides Hive or Impala, check
the documentation to be sure you understand what the length function returns and to
see what other similar functions are available.
Examples:
length('Common String Functions') = 23
length('') = 0
reverse(str)
This returns the characters within the string argument str, but in the reverse order. Try it with your
favorite palindrome!
Examples:
reverse('Common String Functions') = 'snoitcnuF gnirtS nommoC'
reverse('racecar') = 'racecar'
upper(str), lower(str)
These return the string str but with all characters converted either to uppercase or lowercase. These
can be useful for doing case-insensitive string comparisons (by converting the string to be compared
to one case, for example, WHERE upper(fname) = 'BOB' or WHERE lower(fname) = 'bob').
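Examples:
upper('Common String Functions') = 'COMMON STRING FUNCTIONS'
lower('Common String Functions') = 'common string functions'
trim(str), ltrim(str), rtrim(str)
These remove whitespace at the ends of the argument str. You can choose to remove only leading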
whitespace (ltrim for left trim), trailing whitespace (rtrim for right trim), or both (trim). If there is no
whitespace on the specified end, the string is unchanged.
Examples:
trim('⎵Common String Functions⎵ ⎵ ⎵') = 'Common String Functions'
lpad(str, n, padstr), rpad(str, n, padstr)
These functions take a string str and an integer n and return a string of length n. If the original string
str is shorter than n characters, the returned string will be str with characters from padstr added at
the left (lpad) or the right (rpad) to make it length n. (This is called padding the string, and is the
opposite of trimming.) These functions are often used to add zeros to the left or right of numbers that
are represented in strings (this is called zero-padding). If necessary, the pad string will be repeated.
If the length of str is longer, however, the function will return a truncated version of the string.
Truncated characters will be taken from the right, regardless of which function you specify.
Examples:
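The padding, repeating, and truncating rules described above can be modeled in a few lines of Python. This is only a sketch of the behavior as these notes describe it, not any engine's actual implementation; it assumes n is not negative and padstr is not empty, and real engines may treat those edge cases differently.

```python
def lpad(s: str, n: int, padstr: str) -> str:
    """Model of lpad as described in the notes (assumes n >= 0, padstr non-empty)."""
    if len(s) >= n:
        return s[:n]  # too long: keep the leftmost n characters
    fill = (padstr * n)[: n - len(s)]  # repeat the pad string, then cut it to fit
    return fill + s


def rpad(s: str, n: int, padstr: str) -> str:
    """Model of rpad: same rules, but the padding goes on the right."""
    if len(s) >= n:
        return s[:n]  # truncation still removes characters from the right
    fill = (padstr * n)[: n - len(s)]
    return s + fill


print(lpad("42", 5, "0"))      # 00042  (zero-padding on the left)
print(rpad("42", 5, "0"))      # 42000
print(lpad("7", 4, "ab"))      # aba7   (pad string repeated as needed)
print(lpad("Common", 3, "0"))  # Com    (truncated from the right)
print(rpad("Common", 3, "0"))  # Com
```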
concat(str1, str2[, ...]), concat_ws(sep, str1, str2[, ...])
These functions concatenate strings; that is, they put them together into a single string. The ws in
concat_ws stands for “with separator”: the first argument in that case is placed between each pair of
strings. In both cases, the arguments are concatenated in the order given.
Notes:
● Both concat and concat_ws must include at least two strings to concatenate. They can
take more than two, as well.
● Some SQL engines have an operator for string concatenation, usually + or ||. However,
Hive and Impala do not have concatenation operators; one of these functions must be
used.
Examples:
concat('Common','String') = 'CommonString'
concat('Common','String','Functions') = 'CommonStringFunctions'
Non-ASCII characters
Note that the string functions in different SQL engines can differ in their handling of non-ASCII
characters. For example: In most SQL engines, upper('é') returns É, but in others it might return é or
throw an error. You should test or consult the documentation to see how this works.
Many more string functions are available in most SQL engines. For example, there are functions for
splitting strings, extracting parts of strings, and finding and replacing specific characters or
substrings within strings. If you are interested in them, check the documentation of the SQL engine
you are using (probably under “String Functions”).
Helpful Links
Hive.apache.org go to Language Manual then click Select
Impala.apache.org
W3schools.com
Unfortunately, there isn't a simple explanation to tell you how SQL will sort your results, because it
depends on what collation you are using.
A DBMS uses a collating sequence, or collation, to determine the order in which characters are sorted.
The collation defines the order of precedence for every character in your character set. Your character
set depends on the language that you’re using—European languages (a Latin character set), Hebrew
(the Hebrew alphabet), or Chinese (ideographs), for example. The collation also determines case
sensitivity (is ‘A’ < ‘a’?), accent sensitivity (is ‘A’ < ‘À’ ?), width sensitivity (for multibyte or Unicode
characters), and other factors such as linguistic practices. The SQL standard doesn’t define particular
collations and character sets, so each DBMS uses its own sorting strategy and default collation…
Search your DBMS documentation for collation or sort order. (1)
Collations have different options associated with them, and many can be customized depending on
the system you are using. For English, case sensitivity is a major one to consider—should "A" and
"a" be considered the same character for the purposes of ordering? Others include accent sensitivity
(for example, should "a" and "á" be considered the same), Kana sensitivity (which distinguishes
between the two types of Japanese characters), and script order (for example, which should be
ordered first: Hebrew, Greek, or Cyrillic). See "Customization" (2) and "Collation" (3) for more
examples of these and other options.
When using Unicode—an industry standard that assigns a number to each character or symbol—
SQL will most likely follow the Unicode ordering to distinguish the order of two characters, while
taking customizations into account. Non-Unicode data may have a different order:
When you use a SQL collation you might see different results for comparisons of the same characters,
depending on the underlying data type. For example, if you are using the SQL collation
"SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the
hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these
strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be
greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen. (4)
When it comes to numbers represented within strings, you must remember that string sorting is
done on a character-by-character basis. For example:
● '42' < '71': This compares only the first characters: '4' < '7'. The order is now established and any
remaining characters can be ignored.
● '42' < '45': The first characters are the same, '4' = '4', so the sort then compares the next characters,
'2' < '5'. So '42' < '45'.
● '42' < '7': Although numerically 42 > 7, the sort compares the first characters, '4' and '7'. Since '4' < '7',
the order is established and any remaining characters are ignored. For this string sort, '42' < '7'.
You can sometimes find ways to customize the sort, when necessary. For example, "Use SQL
Server to Sort Alphanumeric Values" (5) provides a method, usable with Microsoft SQL Server, to
sort values with a mixture of letters and numerals that would consider '7' < '42'.
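One simple alternative, when every value is purely numeric, is to cast the strings to integers in the ORDER BY clause. The sketch below is not the method from reference (5); it uses Python's built-in sqlite3 module as a stand-in engine, with a hypothetical table named codes, and SQLite's default collation, which compares text byte by byte.

```python
import sqlite3

# In-memory SQLite database standing in for a real engine (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (val TEXT)")  # hypothetical table
conn.executemany("INSERT INTO codes VALUES (?)", [("42",), ("7",), ("45",), ("71",)])

# Sorting the strings as text compares them character by character.
as_text = [row[0] for row in conn.execute("SELECT val FROM codes ORDER BY val")]

# Casting to an integer sorts by numeric value instead.
as_number = [
    row[0]
    for row in conn.execute("SELECT val FROM codes ORDER BY CAST(val AS INTEGER)")
]

print(as_text)    # ['42', '45', '7', '71']
print(as_number)  # ['7', '42', '45', '71']
```

A cast like this fails or gives surprising results when the values mix letters and numerals, which is the case the referenced article addresses.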
Spaces, especially leading spaces, often cause confusion as well. The space character is typically
considered to come before any number or letter, and some punctuation as well. Again, sort order is
done character by character. For example:
● 'no one' < 'nobody': The first characters are equivalent, 'n' = 'n', so the sort moves to the second
characters. These are also equivalent, 'o' = 'o', so the sort moves to the third characters.
These are ' ' and 'b', and ' ' < 'b', so 'no one' < 'nobody'.
● ' start' < 'begin': Notice that the first character in the string on the left is a space. While
'begin' < 'start' because 'b' < 's', these strings sort as ' start' < 'begin' because ' ' < 'b'.
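The effect of spaces on sort order can be checked quickly with Python's built-in sqlite3 module. This is a sketch under the assumption of SQLite's default byte-by-byte text collation and a hypothetical table named phrases; an engine with a different collation may order these values differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (val TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO phrases VALUES (?)",
    [("nobody",), ("begin",), ("no one",), (" start",)],
)

ordered = [row[0] for row in conn.execute("SELECT val FROM phrases ORDER BY val")]

# The leading space puts ' start' first, and 'no one' sorts before 'nobody'.
print(ordered)  # [' start', 'begin', 'no one', 'nobody']

# Trimming the leading whitespace in the sort key restores the expected order.
trimmed = [row[0] for row in conn.execute("SELECT val FROM phrases ORDER BY ltrim(val)")]
print(trimmed)  # ['begin', 'no one', 'nobody', ' start']
```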
(1) Fehily, Chris. SQL Visual QuickStart Guide, 3rd Edition. Retrieved from
http://www.peachpit.com/articles/article.aspx?p=1276352&seqNum=4 on May 25, 2018.
(2) Unicode® Technical Standard #10: Unicode Collation Algorithm. Retrieved from
http://unicode.org/reports/tr10/#Customization on May 25, 2018.
(3) Collation and Unicode Support. Retrieved from
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017#Collation_Defn
on May 25, 2018.
(5) Use SQL Server to Sort Alphanumeric Values. Retrieved from
https://www.essentialsql.com/use-sql-server-to-sort-alphanumeric-values/ on May 25, 2018.
SELECT ...
FROM toys JOIN makers
ON toys.maker_id = makers.id;
Notice the JOIN keyword between the table names, and the ON keyword followed by the join
condition. This is called a SQL-92-style join, or explicit join syntax, and it is usually considered to be
the best syntax to use for joins in SQL.
However, many SQL engines also support another join syntax, called the SQL-89-style join, or
implicit join syntax. In this syntax, you use a comma-separated list of table names in the FROM
clause, and you specify the join condition in the WHERE clause:
SELECT ...
FROM toys, makers
WHERE toys.maker_id = makers.id;
With most SQL engines, this join query returns exactly the same result as the previous one.
With both join styles, you can use table aliases (t and m in this example):
SELECT ...
FROM toys AS t JOIN makers AS m
ON t.maker_id = m.id;
SELECT ...
FROM toys AS t, makers AS m
WHERE t.maker_id = m.id;
With both styles, the AS keyword before each table alias is optional.
When you use a SQL-89-style join, the SQL engine always performs an inner join. With this syntax,
there is no way to specify any other type of join. If you want to use one of the other types of joins (left
outer, right outer, full outer), then you must use a SQL-92-style join. Because of this limitation, and
because the SQL-89-style join syntax makes it harder to understand the intent of the query, we
recommend using SQL-92-style joins.
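To see the two join styles side by side, here is a small sketch using Python's built-in sqlite3 module as a stand-in engine. The toys and makers tables follow the names used above, but their columns and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE makers (id INTEGER, name TEXT);
    CREATE TABLE toys (id INTEGER, name TEXT, maker_id INTEGER);
    INSERT INTO makers VALUES (1, 'Acme'), (2, 'Zoom');
    INSERT INTO toys VALUES (10, 'Robot', 1), (11, 'Kite', 2), (12, 'Yo-yo', NULL);
""")

# SQL-92-style (explicit) inner join.
explicit = conn.execute("""
    SELECT t.name, m.name
    FROM toys AS t JOIN makers AS m
      ON t.maker_id = m.id
    ORDER BY t.id
""").fetchall()

# SQL-89-style (implicit) join: always an inner join.
implicit = conn.execute("""
    SELECT t.name, m.name
    FROM toys AS t, makers AS m
    WHERE t.maker_id = m.id
    ORDER BY t.id
""").fetchall()

# A left outer join needs the explicit syntax; the unmatched toy is kept.
left_outer = conn.execute("""
    SELECT t.name, m.name
    FROM toys AS t LEFT OUTER JOIN makers AS m
      ON t.maker_id = m.id
    ORDER BY t.id
""").fetchall()

print(explicit == implicit)  # True
print(left_outer)  # [('Robot', 'Acme'), ('Kite', 'Zoom'), ('Yo-yo', None)]
```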
ON t.maker_id = m.id
However, in the case where a bare column name unambiguously identifies a column, most SQL
engines allow you to use a bare column name. For example, since there is no column named
maker_id in the makers table, the table alias t is not required in this join condition. So you could
specify the join condition as:
ON maker_id = m.id
But because there are columns named id in both tables, the table alias m is required in this join
condition. If you omit the table alias m, then the SQL engine will throw an error indicating that the
column reference id is ambiguous.
In join conditions, we recommend always qualifying column names with table names or table aliases,
whether or not they are strictly required. Doing this makes your queries safer and clearer.
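The ambiguity error can be reproduced with Python's built-in sqlite3 module, used here as a stand-in engine. The table contents are invented for illustration, and the exact error type and message differ from engine to engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE makers (id INTEGER, name TEXT);
    CREATE TABLE toys (id INTEGER, name TEXT, maker_id INTEGER);
    INSERT INTO makers VALUES (1, 'Acme');
    INSERT INTO toys VALUES (10, 'Robot', 1);
""")

# maker_id exists only in toys, so the bare name is accepted here.
ok_rows = conn.execute("""
    SELECT t.name, m.name
    FROM toys AS t JOIN makers AS m
      ON maker_id = m.id
""").fetchall()

# id exists in both tables, so the bare name cannot be resolved.
error_message = None
try:
    conn.execute("""
        SELECT t.name, m.name
        FROM toys AS t JOIN makers AS m
          ON maker_id = id
    """)
except sqlite3.OperationalError as err:
    error_message = str(err)

print(ok_rows)
print(error_message)
```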
SELECT …
ON e.office_id = o.office_id;
When the corresponding columns in the join condition have identical names, some SQL engines
allow you to use a shorthand notation to specify the join condition. Instead of using the ON keyword
and specifying the condition as an equality expression, you use the USING keyword and specify the
common join key column name in parentheses after USING:
SELECT …
USING (office_id);
Natural Joins
When the corresponding columns in the join condition have identical names, some SQL engines will
allow you to omit the join condition, and will automatically join the tables on all the pairs of columns
that have identical names in the left and right tables. To make a SQL engine do this, you need to
specify the keyword NATURAL before the other join keywords. For example:
SELECT …
MySQL and PostgreSQL support natural joins, but Hive and Impala do not. In the SQL engines that
support it, you can use the keyword NATURAL with any type of join; for example: NATURAL LEFT
OUTER JOIN or NATURAL INNER JOIN.
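The following sketch compares ON, USING, and NATURAL for the same join. It runs on Python's built-in sqlite3 module, which happens to support all three forms; the employees and offices tables and their contents are hypothetical, chosen only to match the aliases e and o used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office_id INTEGER, city TEXT);
    CREATE TABLE employees (empl_id INTEGER, first_name TEXT, office_id INTEGER);
    INSERT INTO offices VALUES (1, 'Chicago'), (2, 'Istanbul');
    INSERT INTO employees VALUES (100, 'Ambrosio', 1), (101, 'Luzia', 2), (102, 'Val', 1);
""")

select_list = "SELECT e.first_name, o.city FROM employees AS e "
order_by = " ORDER BY e.empl_id"

with_on = conn.execute(
    select_list + "JOIN offices AS o ON e.office_id = o.office_id" + order_by
).fetchall()
with_using = conn.execute(
    select_list + "JOIN offices AS o USING (office_id)" + order_by
).fetchall()
# office_id is the only column name the two tables share, so NATURAL joins on it.
with_natural = conn.execute(
    select_list + "NATURAL JOIN offices AS o" + order_by
).fetchall()

print(with_on)
print(with_on == with_using == with_natural)  # True
```

A natural join silently changes meaning if the tables later gain another pair of identically named columns, which is one reason to prefer ON or USING.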
SELECT *
Notice that no join condition is specified. With some SQL engines (including PostgreSQL), this
throws an error. But with other SQL engines (including Impala, Hive, and MySQL) this performs
what’s called a cross join. In a cross join, the SQL engine iterates through each row in the table on
the left side and combines it with every row in the table on the right side. So the result set includes
every possible combination of the rows in the left table and the rows in the right table. The number of
rows in the result set is the product (multiplication) of the number of rows in the left table and the
number of rows in the right table (in this example, 3 x 3 = 9):
In most cases, the result of a cross join is meaningless. The rows of the result contain values with no
correspondence. If you don’t realize that you have performed a cross join, you might be misled by
the results. In addition, when performed on large tables, a cross join can return a dangerously large
number of rows.
There are some specific cases when cross joins are useful, and in most SQL dialects, you can
explicitly specify CROSS JOIN in your SQL statement to make it clear that you are performing a
cross join. This is discussed in a video in the upcoming honors lesson.
So unless you intend to perform a cross join, and you understand the risks of this and how to
interpret the output, we recommend specifying the join condition in every join query.
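The row-count arithmetic for a cross join is easy to confirm with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine (SQLite, like the engines named above, treats a join without a condition as a cross join); the colors and sizes tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colors (color TEXT);
    CREATE TABLE sizes (size TEXT);
    INSERT INTO colors VALUES ('red'), ('green'), ('blue');
    INSERT INTO sizes VALUES ('S'), ('M'), ('L');
""")

# Join with no condition: every row on the left is paired with every row on the right.
accidental = conn.execute("SELECT * FROM colors JOIN sizes").fetchall()

# Spelling out CROSS JOIN makes the intent explicit.
intentional = conn.execute("SELECT * FROM colors CROSS JOIN sizes").fetchall()

print(len(accidental))   # 9
print(len(intentional))  # 9
```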
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
GROUP BY country;
In this query (taken from a sample dataset), a CASE expression is placed inside an aggregate
function, so the aggregate counts only the rows that meet the condition in the WHEN clause.
The value after THEN is a column here, but it can also be a number such as 0, 1, 2, or 3 (with
SUM instead of COUNT), in which case that number is what gets added to the total whenever the
condition is TRUE. With COUNT, the THEN value can also be any string, because COUNT only counts
how many non-NULL values the CASE expression returns.
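Here is a minimal sketch of a CASE expression inside COUNT and inside SUM, run on Python's built-in sqlite3 module as a stand-in engine. The country table and join condition follow the fragment above; the table name matches, its columns (season, home_goals), and all of the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER, name TEXT);
    CREATE TABLE matches (id INTEGER, country_id INTEGER, season TEXT, home_goals INTEGER);
    INSERT INTO country VALUES (1, 'Spain'), (2, 'Italy');
    INSERT INTO matches VALUES
        (100, 1, '2012/2013', 2),
        (101, 1, '2012/2013', 0),
        (102, 1, '2013/2014', 1),
        (103, 2, '2013/2014', 3);
""")

rows = conn.execute("""
    SELECT c.name AS country,
           -- COUNT ignores the NULL produced when the WHEN condition is not met
           COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS matches_2012_2013,
           -- SUM adds 1 for each matching row and 0 otherwise
           SUM(CASE WHEN m.season = '2012/2013' THEN 1 ELSE 0 END) AS same_count_via_sum
    FROM country AS c
    LEFT JOIN matches AS m
      ON c.id = m.country_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)  # [('Italy', 0, 0), ('Spain', 2, 2)]
```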
Subquery
A query nested inside another query, so the statement contains more than one SELECT.
Can be placed in any part of the outer query (SELECT, FROM, WHERE, GROUP BY)
Can use multiple subqueries in the FROM clause (make sure each one has an alias)
If a subquery is long and complex, you can move it into a common table expression (CTE), declared
with WITH name AS (subquery) before the main query.
The main SELECT statement then follows and can refer to the CTE by name. Pretty much like
declarations in math (let x = ….)
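As a sketch of the idea, the example below writes the same query once with a subquery in the FROM clause and once with a CTE. It runs on Python's built-in sqlite3 module as a stand-in engine; the matches table, the CTE name season_totals, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (id INTEGER, season TEXT, home_goals INTEGER);
    INSERT INTO matches VALUES
        (100, '2012/2013', 2),
        (101, '2012/2013', 0),
        (102, '2013/2014', 1),
        (103, '2013/2014', 3);
""")

# Subquery in the FROM clause; it must be given an alias (here: s).
with_subquery = conn.execute("""
    SELECT s.season, s.total_home_goals
    FROM (SELECT season, SUM(home_goals) AS total_home_goals
          FROM matches
          GROUP BY season) AS s
    WHERE s.total_home_goals > 2
    ORDER BY s.season
""").fetchall()

# The same logic with the subquery declared up front as a CTE.
with_cte = conn.execute("""
    WITH season_totals AS (
        SELECT season, SUM(home_goals) AS total_home_goals
        FROM matches
        GROUP BY season
    )
    SELECT season, total_home_goals
    FROM season_totals
    WHERE total_home_goals > 2
    ORDER BY season
""").fetchall()

print(with_cte)                   # [('2013/2014', 4)]
print(with_subquery == with_cte)  # True
```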
Window Function
Lets you use an aggregate function alongside non-aggregated columns without a GROUP BY clause.
To use one, add the OVER() clause after the aggregate function.
The catch is that if you want to order the results in a certain way, you must put the ordering
inside the OVER clause, as in RANK() OVER (ORDER BY column_name).
Sliding Windows allow us to make running calculations on the aggregated data from the OVER
clause. Using the syntax as an example
In this example the home_goals are added from all the previous rows to the current one. There
are five keywords that can be used for the sliding windows
PRECEDING: Included with a number, so for example ROWS BETWEEN 4 PRECEDING AND
CURRENT ROW will include the previous 4 rows up to the current row
UNBOUNDED PRECEDING: All rows prior to the current row are to be included
UNBOUNDED FOLLOWING: All rows after the current row are to be included
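The sketch below puts the pieces of this section together: a plain OVER(), a RANK() with an ORDER BY inside OVER, and a sliding window that keeps a running total. It runs on Python's built-in sqlite3 module as a stand-in engine and assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The matches table, the match_date column, and the rows are made up; home_goals follows the column named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (id INTEGER, match_date TEXT, home_goals INTEGER);
    INSERT INTO matches VALUES
        (1, '2013-08-01', 2),
        (2, '2013-08-02', 0),
        (3, '2013-08-03', 3),
        (4, '2013-08-04', 1);
""")

rows = conn.execute("""
    SELECT id,
           home_goals,
           -- aggregate over all rows, with no GROUP BY
           AVG(home_goals) OVER () AS overall_avg,
           -- ranking needs the ordering inside the OVER clause
           RANK() OVER (ORDER BY home_goals DESC) AS goals_rank,
           -- sliding window: everything from the first row up to the current row
           SUM(home_goals) OVER (
               ORDER BY match_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM matches
    ORDER BY match_date
""").fetchall()

for row in rows:
    print(row)
# (1, 2, 1.5, 2, 2)
# (2, 0, 1.5, 4, 2)
# (3, 3, 1.5, 1, 5)
# (4, 1, 1.5, 3, 6)
```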