
SQL Notes

The document provides a comprehensive overview of SQL programming, covering data manipulation, definition, and control languages, as well as database design principles like primary and foreign keys. It explains various SQL commands, including SELECT, JOINs, and aggregate functions, while emphasizing best coding practices and the importance of data integrity. Additionally, it highlights the advantages of using relational databases for efficient data access and analysis.

Uploaded by

Joe Smith

IDEA - Automatic text generation from video.

Many jobs require storing large amounts of data, so before carrying out any analysis we first need to retrieve it - that is, extract information from a database.

SQL is the programming language used to work with databases: to create, manipulate and share data.

Record = row
Field = column

SQL -> create and manipulate relational databases.

Types of programming languages


Procedural (imperative)
Object-oriented
Declarative (SQL) - non-procedural - the focus is on WHAT and not HOW
Functional

Difference between procedural and non-procedural, e.g.

In C or Java (procedural): step 1 - open the door; step 2 - walk outside; step 3 - take the bucket; step 4 - fill it; step 5 - come back inside. You spell out HOW something should be done.
In SQL (declarative): "Fetch the bucket of water, please." You only state WHAT should be done.

Main sublanguages of SQL


● Data Definition Language (DDL)
● Data Manipulation Language (DML)
● Data Control Language (DCL)
● Transaction Control Language (TCL)

1. DDL - think of it as a syntax: a set of statements that allows the user to define or modify data structures and objects, such as tables.
● E.g. the CREATE statement:
○ CREATE object_type object_name;
○ CREATE object_type (column_name ...);
○ CREATE USER 'alay'@'localhost' IDENTIFIED BY 'pass';
○ CREATE DATABASE [IF NOT EXISTS] db_name; - IMP

● ALTER statement - alters existing objects: add, remove or rename.
○ ALTER TABLE sales ADD COLUMN column_name DATE;
● DROP statement - deletes an object.
● RENAME - renames objects such as tables.
● TRUNCATE - deletes all records, but not the table itself.
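The DDL statements above can be strung together; a minimal sketch (the shop database and the specific columns are illustrative, not from the notes):

```sql
-- Create a database and a table inside it
CREATE DATABASE IF NOT EXISTS shop;

CREATE TABLE sales (
    purchase_number INT AUTO_INCREMENT,
    date_of_purchase DATE,
    customer_id INT,
    PRIMARY KEY (purchase_number)
);

-- Modify, clear, then remove the table
ALTER TABLE sales ADD COLUMN item_code VARCHAR(10);
TRUNCATE TABLE sales;  -- deletes all records, keeps the table
DROP TABLE sales;      -- deletes the table itself
```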
2. DML
● SELECT statement - retrieves data.
● INSERT INTO ... VALUES - adds new records.
● UPDATE - modifies existing data in the table.
● DELETE - removes records; we can precisely specify what to remove.
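A quick sketch of the four DML statements together (the customers table and its columns are hypothetical):

```sql
-- INSERT adds rows; UPDATE modifies them; DELETE removes them
INSERT INTO customers (customer_id, first_name) VALUES (1, 'Alay');

SELECT first_name FROM customers;

UPDATE customers
SET first_name = 'Joe'
WHERE customer_id = 1;   -- WHERE precisely specifies which rows to change

DELETE FROM customers
WHERE customer_id = 1;   -- without WHERE, all rows would be deleted
```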

3. DCL - GRANT and REVOKE


● Allows us to manage the rights users have in a database.
● GRANT - gives permission; REVOKE - takes it away.
● E.g. GRANT type_of_permission ON db_name.table_name TO 'username'@'localhost';
4. TCL
● Not all changes are saved automatically.
● COMMIT - saves changes made by INSERT, UPDATE or DELETE.
● ROLLBACK - undoes changes since the last commit.
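A sketch of how COMMIT and ROLLBACK interact (assumes autocommit is off; the customers table is hypothetical):

```sql
-- Save all work so far; this becomes the restore point
COMMIT;

DELETE FROM customers;  -- oops, removed every row

-- Undo everything back to the last COMMIT; the rows are restored
ROLLBACK;
```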

Databases - relational and non-relational


Relational
Main goal - organize (compact and well-structured) huge amounts of data so that it can be quickly retrieved.

Spreadsheets vs databases
Similarities
● Both contain tabular data
● Both can use existing data to make calculations
Differences
● In spreadsheets, each cell can contain a function or formula
● In databases, all calculations and operations are done after data retrieval
● Excel cannot handle more than about 1 million rows
● A DB can handle any number of rows

Database design is an important step.


Primary key (PK) - values must exist and be unique (may be a set of columns).

Underline the PK column when representing a table.


Foreign key (FK) - identifies the relationship between tables.
Unique key - for storing e.g. a phone number, which is not a PK but whose values need to be unique. Can contain NULL values.
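The key types above can be sketched in one table definition (the customers table and its columns are illustrative):

```sql
CREATE TABLE customers (
    customer_id INT,
    phone_number VARCHAR(20),
    PRIMARY KEY (customer_id),    -- must exist and be unique
    UNIQUE KEY (phone_number)     -- must be unique, but NULL is allowed
);
```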

Relationships
They tell you how many of the values from a foreign key field can be seen in the PK column of the table the data is related to, and vice versa.

One to many, e.g. customer_id in the customers table vs in the sales table.

Size of data does matter in SQL.


Data types
String - for text purposes.

CHAR - fixed size: the storage size stays the same regardless of the data's length.
VARCHAR - variable size.

For numeric types, precision is the total number of digits in a number;
scale is the number of digits after the decimal point.

DECIMAL(5,3) = (precision, scale), e.g. 10.253


DOUBLE - floating point
DECIMAL - exact

DATE - YYYY-MM-DD
DATETIME - YYYY-MM-DD HH:MM:SS
TIMESTAMP - useful for finding the duration between two events.

BLOB - Binary Large Object - files of binary data.

Constraints = specific rules or limits that we define on our table.


An FK links to another table.
ON DELETE CASCADE - deletes the child rows if the parent row is removed.
Use ALTER TABLE to execute such changes:

ALTER TABLE sales
ADD FOREIGN KEY (customer_id) REFERENCES customer(customer_id) ON DELETE CASCADE;

Best coding practices


● Write the version that is easiest to read and understand.
● Use short, meaningful names.
● Ctrl + B to beautify code for readability.
● /* ... */ - multi-line comments
● # - single-line comment
● In SQL, AND is logically applied before OR.
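Because AND binds before OR, the two queries below mean different things; a sketch (the table and conditions are illustrative):

```sql
-- Without parentheses this means: first_name = 'Joe',
-- OR (last_name = 'Smith' AND hire_date > '2000-01-01')
SELECT * FROM employees
WHERE first_name = 'Joe' OR last_name = 'Smith' AND hire_date > '2000-01-01';

-- Parentheses make the intended logic explicit
SELECT * FROM employees
WHERE (first_name = 'Joe' OR last_name = 'Smith') AND hire_date > '2000-01-01';
```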

Use IN operator when many options are to be considered


For eg

select first_name,last_name from employees where first_name IN ('Cathie', 'Elvis');


This is better than:
select first_name, last_name from employees where first_name = 'Cathie' or first_name = 'Elvis';

Access time is also lower with the IN operator.

Sometimes we will need to look for a pattern inside our database -


some names starting with 'Mar', for example.

Then we would use the LIKE operator:


select * from employees where first_name like ('Mar%');
% - matches any sequence of characters (including none).

select * from employees where first_name like ('Mar_');


_ - matches exactly one character.

BETWEEN ... AND
Helps us designate the interval to which a given value belongs (both endpoints are included).
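A sketch of BETWEEN on the employees table (the dates are illustrative):

```sql
-- Returns rows with hire_date in the inclusive interval
SELECT first_name, hire_date
FROM employees
WHERE hire_date BETWEEN '1999-01-01' AND '2000-01-01';
```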

Aggregate functions apply to multiple rows of a single column of a table and return an
output of a single value.

Count()
Counts the number of non-null records in a field.

Sum()
Sums all the non-null values in a column

select count(first_name) from employees;


select count(Distinct first_name) from employees;
ORDER BY clause for ordering.
GROUP BY - very important!
Results can be grouped together according to specific fields.
The GROUP BY clause has to be placed immediately after the WHERE clause and before the
ORDER BY clause.

Always use GROUP BY when aggregating alongside non-aggregated columns - very important!

Logical Flow of writing query.

HAVING clause
Refines the output by removing groups that do not satisfy a certain condition.
Applied after the GROUP BY block.
If a condition is applied to an aggregate function, use the HAVING clause.

select first_name, count(first_name) from employees where hire_date > '1999-01-01'


group by first_name
having count(first_name) < 200;
Put the LIMIT statement at the end of the query.

SQL JOINs
Joins allow us to query data across the relationships between tables.

Inner join syntax:

SELECT t1.column_name, t2.column_name
FROM table_name_1 t1
INNER JOIN table_name_2 t2
ON t1.id = t2.id

NULL values, and values appearing in only one of the tables, are not shown in the result.

Don't assume - there might be duplicate rows in your data.

GROUP BY the field that differs the most.

CROSS JOIN - Cartesian product; effectively an inner join without an ON clause.

Take care when using aggregate functions with joins in MySQL - check the output for a proper
explanation.

select
e.first_name, e.last_name
from
employees e
where
e.emp_no in ( select dm.emp_no from dept_manager dm);

Subquery - a query nested inside another query.
Use ORDER BY in the outer query.
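Extending the subquery example above, the ORDER BY belongs to the outer query:

```sql
SELECT e.first_name, e.last_name
FROM employees e
WHERE e.emp_no IN (SELECT dm.emp_no FROM dept_manager dm)
ORDER BY e.first_name;   -- sorting happens in the outer query, not the subquery
```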

Self join - joining a table to itself, by giving it two different aliases.
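A self join treats one table as two via aliases; a sketch, assuming a hypothetical manager_no column on employees (not part of the schema in these notes):

```sql
-- Pair each employee with their manager from the same table
SELECT e.first_name AS employee, m.first_name AS manager
FROM employees e
JOIN employees m
ON e.manager_no = m.emp_no;  -- same table, two aliases
```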
Entity Relationship Diagrams
An entity relationship diagram (ERD) is a common way to view data in a database.
Below is the ERD for the database we will use from Parch & Posey. These diagrams help
you visualize the data you are analyzing including:
1. The names of the tables.
2. The columns in each table.
3. The way the tables work together.
You can think of each of the boxes below as a spreadsheet.

In the Parch & Posey database there are five tables (essentially 5 spreadsheets):
1. web_events
2. accounts
3. orders
4. sales_reps
5. region
You can think of each of these tables as an individual spreadsheet. Then the columns in
each spreadsheet are listed below the table name. For example, the region table has
two columns: id and name. Alternatively the web_events table has four columns.
There are some major advantages to using traditional relational databases, which we
interact with using SQL. The five most apparent are:
● SQL is easy to understand.
● Traditional databases allow us to access data directly.
● Traditional databases allow us to audit and replicate our data.
● SQL is a great tool for analyzing multiple tables at once.
● SQL allows you to analyze more complex questions than dashboard tools like
Google Analytics.
You will experience these advantages first hand, as we learn to write SQL to interact
with data.

The LIKE operator is extremely useful for working with text. You will use LIKE within a
WHERE clause. The LIKE operator is frequently used with %. The % tells us that we might
want any number of characters leading up to a particular set of characters or following
a certain set of characters. Remember you will need to use single quotes for the text you
pass to the LIKE operator; also note that lowercase and uppercase letters are not the same
within the string, so searching for 'T' is not the same as searching for 't'. In other SQL
environments (outside the classroom), you can use either single or double quotes.

Hopefully you are starting to get more comfortable with SQL, as we are starting to move
toward operations that have more applications, but this also means we can't show you
every use case. Hopefully, you can start to think about how you might use these types of
applications to identify phone numbers from a certain region, or an individual where you
can't quite remember the full name.

I realize you might be getting a little nervous or anxious to start writing code. This might
even be the first time you have written in any sort of programming language. I assure
you, we will work through examples to help you feel supported the whole time as you
take on this new challenge!
Why Businesses Like Databases
1. Data integrity is ensured - only the data you want entered is entered, and only
certain users are able to enter data into the database.

2. Data can be accessed quickly - SQL allows you to obtain results very quickly from
the data stored in a database. Code can be optimized to quickly pull results.

3. Data is easily shared - multiple individuals can access data stored in a database,
and the data is the same for all users allowing for consistent results for anyone
with access to your database.
A few key points about data stored in SQL databases:
1. Data in databases is stored in tables that can be thought of just like Excel
spreadsheets.
For the most part, you can think of a database as a bunch of Excel spreadsheets.
Each spreadsheet has rows and columns. Where each row holds data on a
transaction, a person, a company, etc., while each column holds data pertaining
to a particular aspect of one of the rows you care about like a name, location, a
unique id, etc.

2. All the data in the same column must match in terms of data type.
An entire column is considered quantitative, discrete, or as some sort of string.
This means if you have one row with a string in a particular column, the entire
column might change to a text data type. This can be very bad if you want to do
math with this column!

3. Consistent column types are one of the main reasons working with databases is
fast.
Often databases hold a LOT of data. So, knowing that the columns are all of the
same type of data means that obtaining data from a database can still be fast.
The ORDER BY statement is always after the SELECT and FROM statements, but it is
before the LIMIT statement. As you learn additional commands, the order of these
statements will matter more. If we are using the LIMIT statement, it will always appear
last.

Statement | How to Use It | Other Details
--------- | ------------- | -------------
SELECT | SELECT Col1, Col2, ... | Provide the columns you want
FROM | FROM Table | Provide the table where the columns exist
LIMIT | LIMIT 10 | Limits the number of rows returned
ORDER BY | ORDER BY Col | Orders the table based on the column. Used with DESC.
WHERE | WHERE Col > 5 | A conditional statement to filter your results
LIKE | WHERE Col LIKE '%me%' | Only pulls rows where the column has 'me' within the text
IN | WHERE Col IN ('Y', 'N') | A filter for only rows with column of 'Y' or 'N'
NOT | WHERE Col NOT IN ('Y', 'N') | NOT is frequently used with LIKE and IN
AND | WHERE Col1 > 5 AND Col2 < 3 | Filter rows where two or more conditions must be true
OR | WHERE Col1 > 5 OR Col2 < 3 | Filter rows where at least one condition must be true
BETWEEN | WHERE Col BETWEEN 3 AND 5 | Often easier syntax than using an AND

Again - JOINs are useful for allowing us to pull data from multiple tables. This is both
simple and powerful all at the same time.
With the addition of the JOIN statement to our toolkit, we will also be adding the ON
statement.

NULL is a marker that specifies where no data exists in SQL. NULLs are often ignored
in our aggregation functions, which you will get a first look at in the next concept using
COUNT.
When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don't
use =, because NULL isn't considered a value in SQL. Rather, it is a property of the data.
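A sketch of the IS NULL syntax on the Parch & Posey accounts table:

```sql
-- IS NULL, never "= NULL": finds accounts with no primary point of contact
SELECT *
FROM accounts
WHERE primary_poc IS NULL;
```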

Notice that COUNT does not consider rows that have NULL values. Therefore, this
can be useful for quickly identifying which rows have missing data. You will learn
GROUP BY in an upcoming concept, and then each of these aggregators will
become much more useful.

Aggregation Reminder
An important thing to remember: aggregators only aggregate vertically - the

values of a column. If you want to perform a calculation across rows, you would

do this with simple arithmetic.

We saw this in the first lesson if you need a refresher, but the quiz in the next

concept should help ensure you still remember how to aggregate across rows.
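The vertical/horizontal distinction above can be sketched on the orders table:

```sql
-- Vertical: an aggregator collapses the values of one column into a single value
SELECT SUM(standard_qty) AS total_standard
FROM orders;

-- Horizontal: simple arithmetic across columns, computed row by row
SELECT standard_qty + gloss_qty + poster_qty AS total_qty
FROM orders;
```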

MEDIAN - Expert Tip


One quick note that a median might be a more appropriate measure of center for

this data, but finding the median happens to be a pretty difficult thing to get using

SQL alone — so difficult that finding a median is occasionally asked as an interview

question.
The key takeaways here:

● GROUP BY can be used to aggregate data within subsets of the data. For

example, grouping for different accounts, different regions, or different sales

representatives.

● Any column in the SELECT statement that is not within an aggregator must

be in the GROUP BY clause.

● The GROUP BY always goes between WHERE and ORDER BY.

● ORDER BY works like SORT in spreadsheet software.

Table accounts - id, name, website, lat, long, primary_poc, sales_rep_id


● orders - id, account_id, occurred_at, standard_qty, poster_qty, gloss_qty,
total, standard_amt_usd, poster_amt_usd, gloss_amt_usd, total_amt_usd
● region - id, name
● sales_reps - id, name, region_id
● web_events - id, account_id, occurred_at, channel

Which account (by name) placed the earliest order? Your solution should have the
account name and the date of the order.
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY occurred_at
LIMIT 1;
1.

Find the total sales in usd for each account. You should include two columns - the
total sales for each company's orders in usd and the company name.
SELECT a.name, SUM(total_amt_usd) total_sales
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.name;
2.

Via what channel did the most recent (latest) web_event occur, which account
was associated with this web_event? Your query should return only three values -
the date, channel, and account name.
SELECT w.occurred_at, w.channel, a.name
FROM web_events w
JOIN accounts a
ON w.account_id = a.id
ORDER BY w.occurred_at DESC
LIMIT 1;
3.

Find the total number of times each type of channel from the web_events was
used. Your final table should have two columns - the channel and the number of
times the channel was used.
SELECT w.channel, COUNT(*)
FROM web_events w
GROUP BY w.channel
4.

Who was the primary contact associated with the earliest web_event?
SELECT a.primary_poc
FROM web_events w
JOIN accounts a
ON a.id = w.account_id
ORDER BY w.occurred_at
LIMIT 1;
5.

What was the smallest order placed by each account in terms of total usd. Provide
only two columns - the account name and the total usd. Order from smallest
dollar amounts to largest.
SELECT a.name, MIN(total_amt_usd) smallest_order
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY smallest_order;
6.

Sort of strange we have a bunch of orders with no dollars. We might want to

look into those.

Find the number of sales reps in each region. Your final table should have two
columns - the region and the number of sales_reps. Order from fewest reps to
most reps.
SELECT r.name, COUNT(*) num_reps
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
GROUP BY r.name
ORDER BY num_reps;

8. You can GROUP BY multiple columns at once, as we showed here. This is

often useful to aggregate across a number of different segments.

9. The order of columns listed in the ORDER BY clause does make a difference.

You are ordering the columns from left to right.

GROUP BY - Expert Tips


● The order of column names in your GROUP BY clause doesn’t matter—the

results will be the same regardless. If we run the same query and reverse the

order in the GROUP BY clause, you can see we get the same results.

● As with ORDER BY, you can substitute numbers for column names in the

GROUP BY clause. It’s generally recommended to do this only when you’re

grouping many columns, or if something else is causing the text in the


GROUP BY clause to be excessively long.

● A reminder here that any column that is not within an aggregation must

show up in your GROUP BY statement. If you forget, you will likely get an

error. However, in the off chance that your query does work, you might not

like the results!

For each account, determine the average amount of each type of paper they

purchased across their orders. Your result should have four columns - one for the

account name and one for the average spent on each of the paper types.
SELECT a.name, AVG(o.standard_qty) avg_stand, AVG(o.gloss_qty) avg_gloss,

AVG(o.poster_qty) avg_post

FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.name;

For each account, determine the average amount spent per order on each paper

type. Your result should have four columns - one for the account name and one for

the average amount spent on each paper type.


SELECT a.name, AVG(o.standard_amt_usd) avg_stand, AVG(o.gloss_amt_usd)

avg_gloss, AVG(o.poster_amt_usd) avg_post

FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.name;

Determine the number of times a particular channel was used in the web_events

table for each sales rep. Your final table should have three columns - the name of

the sales rep, the channel, and the number of occurrences. Order your table with

the highest number of occurrences first.


SELECT s.name, w.channel, COUNT(*) num_events

FROM accounts a

JOIN web_events w

ON a.id = w.account_id

JOIN sales_reps s

ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel

ORDER BY num_events DESC;

Determine the number of times a particular channel was used in the web_events

table for each region. Your final table should have three columns - the region

name, the channel, and the number of occurrences. Order your table with the

highest number of occurrences first.


SELECT r.name, w.channel, COUNT(*) num_events

FROM accounts a

JOIN web_events w

ON a.id = w.account_id

JOIN sales_reps s

ON s.id = a.sales_rep_id

JOIN region r

ON r.id = s.region_id

GROUP BY r.name, w.channel


ORDER BY num_events DESC;

DISTINCT is always used in SELECT statements, and it provides the unique rows for

all columns written in the SELECT statement. Therefore, you only use DISTINCT

once in any particular SELECT statement.

You could write:

SELECT DISTINCT column1, column2, column3

FROM table1;

which would return the unique (or DISTINCT) rows across all three columns.

You would not write:

SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3

FROM table1;

You can think of DISTINCT the same way you might think of the statement

"unique".

DISTINCT - Expert Tip


It’s worth noting that using DISTINCT, particularly in aggregations, can slow your

queries down quite a bit.
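A sketch of DISTINCT inside an aggregation, which deduplicates before counting (correct, but often slower):

```sql
-- Counts each account once, no matter how many orders it placed
SELECT COUNT(DISTINCT account_id) AS accounts_with_orders
FROM orders;
```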


Use DISTINCT to test if there are any accounts associated with more than one

region.

The below two queries have the same number of resulting rows (351), so we know

that every account is associated with only one region. If each account was

associated with more than one region, the first query should have returned more

rows than the second query.


SELECT a.id as "account id", r.id as "region id",

a.name as "account name", r.name as "region name"

FROM accounts a

JOIN sales_reps s

ON s.id = a.sales_rep_id

JOIN region r

ON r.id = s.region_id;

and
SELECT DISTINCT id, name

FROM accounts;
1.

Have any sales reps worked on more than one account?

Actually all of the sales reps have worked on more than one account. The fewest

number of accounts any sales rep works on is 3. There are 50 sales reps, and they

all have more than one account. Using DISTINCT in the second query assures that

all of the sales reps are accounted for in the first query.
SELECT s.id, s.name, COUNT(*) num_accounts

FROM accounts a

JOIN sales_reps s

ON s.id = a.sales_rep_id

GROUP BY s.id, s.name

ORDER BY num_accounts;

and
SELECT DISTINCT id, name

FROM sales_reps;

HAVING - Expert Tip


HAVING is the “clean” way to filter a query that has been aggregated, but this is

also commonly done using a subquery. Essentially, any time you want to perform a

WHERE on an element of your query that was created by an aggregate, you need to

use HAVING instead.

GROUPing BY a date column is not usually very useful in SQL, as these columns

tend to have transaction data down to the second. Keeping date information at such

a granular level is both a blessing and a curse, as it gives really precise information

(a blessing), but it makes grouping information together directly difficult (a curse).
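Truncating the timestamp before grouping sidesteps the granularity problem; a sketch on the orders table:

```sql
-- Grouping raw occurred_at values would yield one group per second;
-- truncating to the day makes the groups meaningful
SELECT DATE_TRUNC('day', occurred_at) AS day,
       SUM(total_amt_usd) AS daily_sales
FROM orders
GROUP BY 1
ORDER BY 1;
```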

How many of the sales reps have more than 5 accounts that they manage?
SELECT s.id, s.name, COUNT(*) num_accounts

FROM accounts a

JOIN sales_reps s

ON s.id = a.sales_rep_id

GROUP BY s.id, s.name

HAVING COUNT(*) > 5

ORDER BY num_accounts;

and technically, we can get this using a SUBQUERY as shown below. This same logic

can be used for the other queries, but this will not be shown.
SELECT COUNT(*) num_reps_above5

FROM(SELECT s.id, s.name, COUNT(*) num_accounts


FROM accounts a

JOIN sales_reps s

ON s.id = a.sales_rep_id

GROUP BY s.id, s.name

HAVING COUNT(*) > 5

ORDER BY num_accounts) AS Table1;

1.

How many accounts have more than 20 orders?

SELECT a.id, a.name, COUNT(*) num_orders

FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.id, a.name

HAVING COUNT(*) > 20

ORDER BY num_orders;

2.

Which account has the most orders?


SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.id, a.name

ORDER BY num_orders DESC

LIMIT 1;

3.

How many accounts spent more than 30,000 usd total across all orders?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent

FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.id, a.name

HAVING SUM(o.total_amt_usd) > 30000

ORDER BY total_spent;

4.

How many accounts spent less than 1,000 usd total across all orders?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent

FROM accounts a

JOIN orders o
ON a.id = o.account_id

GROUP BY a.id, a.name

HAVING SUM(o.total_amt_usd) < 1000

ORDER BY total_spent;

5.

Which account has spent the most with us?


SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent

FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.id, a.name

ORDER BY total_spent DESC

LIMIT 1;

6.

Which account has spent the least with us?


SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent

FROM accounts a

JOIN orders o

ON a.id = o.account_id

GROUP BY a.id, a.name


ORDER BY total_spent

LIMIT 1;

7.

Which accounts used facebook as a channel to contact customers more than 6

times?
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel

FROM accounts a

JOIN web_events w

ON a.id = w.account_id

GROUP BY a.id, a.name, w.channel

HAVING COUNT(*) > 6 AND w.channel = 'facebook'

ORDER BY use_of_channel;

8.

Which account used facebook most as a channel?


SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel

FROM accounts a

JOIN web_events w

ON a.id = w.account_id

WHERE w.channel = 'facebook'

GROUP BY a.id, a.name, w.channel


ORDER BY use_of_channel DESC

LIMIT 1;

9.

Note: This query above only works if there are no ties for the account that

used facebook the most. It is a best practice to use a larger limit number first

such as 3 or 5 to see if there are ties before using LIMIT 1.

Which channel was most frequently used by most accounts?


SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel

FROM accounts a

JOIN web_events w

ON a.id = w.account_id

GROUP BY a.id, a.name, w.channel

ORDER BY use_of_channel DESC

LIMIT 10;

Lucky for us, there are a number of built in SQL functions that are aimed at helping

us improve our experience in working with dates.

Here we saw that dates are stored in year, month, day, hour, minute, second,

which helps us in truncating. In the next concept, you will see a number of

functions we can use in SQL to take advantage of this functionality.


The first function you are introduced to in working with dates is DATE_TRUNC.

DATE_TRUNC allows you to truncate your date to a particular part of your

date-time column. Common truncations are day, month, and year. Here is a great

blog post by Mode Analytics on the power of this function.

DATE_PART can be useful for pulling a specific portion of a date, but notice pulling

month or day of the week (dow) means that you are no longer keeping the years in

order. Rather you are grouping for certain components regardless of which year

they belonged in.
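A sketch of DATE_PART pooling a date component across years (in PostgreSQL, 'dow' runs from 0 for Sunday to 6 for Saturday):

```sql
-- Groups all orders by day of week, regardless of year
SELECT DATE_PART('dow', occurred_at) AS day_of_week,
       COUNT(*) AS num_orders
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
```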

For additional functions you can use with dates, check out the documentation,

but the DATE_TRUNC and DATE_PART functions definitely give you a great start!
You can reference the columns in your select statement in GROUP BY and ORDER

BY clauses with numbers that follow the order they appear in the select statement.

For example

SELECT standard_qty, COUNT(*)

FROM orders

GROUP BY 1 (this 1 refers to standard_qty since it is the first of the columns included in

the select statement)

ORDER BY 1 (this 1 refers to standard_qty since it is the first of the columns included in

the select statement)


Find the sales in terms of total dollars for all orders in each year, ordered from

greatest to least. Do you notice any trends in the yearly sales totals?

SELECT DATE_PART('year', occurred_at) ord_year, SUM(total_amt_usd)

total_spent

FROM orders

GROUP BY 1

ORDER BY 2 DESC;

1.

When we look at the yearly totals, you might notice that 2013 and 2017 have

much smaller totals than all other years. If we look further at the monthly

data, we see that for 2013 and 2017 there is only one month of sales for each

of these years (December for 2013 and January for 2017). Therefore, neither of these is

evenly represented. Sales have been increasing year over year, with 2016

being the largest sales to date. At this rate, we might expect 2017 to have the

largest sales.

Which month did Parch & Posey have the greatest sales in terms of total dollars?

Are all months evenly represented by the dataset?


In order for this to be 'fair', we should remove the sales from 2013 and 2017. For

the same reasons as discussed above.


SELECT DATE_PART('month', occurred_at) ord_month, SUM(total_amt_usd)

total_spent

FROM orders

WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'

GROUP BY 1

ORDER BY 2 DESC;

2.

The greatest sales amounts occur in December (12).

Which year did Parch & Posey have the greatest sales in terms of total number of

orders? Are all years evenly represented by the dataset?


SELECT DATE_PART('year', occurred_at) ord_year, COUNT(*) total_sales

FROM orders

GROUP BY 1

ORDER BY 2 DESC;
3.

Again, 2016 by far has the most orders, but again 2013 and 2017

are not evenly represented compared to the other years in the dataset.

Which month did Parch & Posey have the greatest sales in terms of total number of

orders? Are all months evenly represented by the dataset?


SELECT DATE_PART('month', occurred_at) ord_month, COUNT(*) total_sales

FROM orders

WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'

GROUP BY 1

ORDER BY 2 DESC;

4.

December still has the most sales, but interestingly, November has the

second most sales (but not the most dollar sales). To make a fair comparison

from one month to another, the 2017 and 2013 data were removed.

In which month of which year did Walmart spend the most on gloss paper in terms

of dollars?
SELECT DATE_TRUNC('month', o.occurred_at) ord_date, SUM(o.gloss_amt_usd)

tot_spent
FROM orders o

JOIN accounts a

ON a.id = o.account_id

WHERE a.name = 'Walmart'

GROUP BY 1

ORDER BY 2 DESC

LIMIT 1;

5.

May 2016 was when Walmart spent the most on gloss paper.

CASE - Expert Tip


● The CASE statement always goes in the SELECT clause.

● CASE must include the following components: WHEN, THEN, and END. ELSE is

an optional component to catch cases that didn’t meet any of the other

previous CASE conditions.

● You can make any conditional statement using any conditional operator (like

WHERE) between WHEN and THEN. This includes stringing together multiple

conditional statements using AND and OR.


● You can include multiple WHEN statements, as well as an ELSE statement

again, to deal with any unaddressed conditions.

Example
In a quiz question in the previous Basic SQL lesson, you saw this question:

1. Create a column that divides the standard_amt_usd by the standard_qty to

find the unit price for standard paper for each order. Limit the results to the

first 10 orders, and include the id and account_id fields. NOTE - you will be

thrown an error with the correct solution to this question. This is for a

division by zero. You will learn how to get a solution without an error to

this query when you learn about CASE statements in a later section.

Let's see how we can use the CASE statement to get around this error.

SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price

FROM orders

LIMIT 10;

Now, let's use a CASE statement. This way any time the standard_qty is zero, we

will return 0, and otherwise we will return the unit_price.

SELECT account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0


ELSE standard_amt_usd/standard_qty END AS unit_price

FROM orders

LIMIT 10;

Now the first part of the statement will catch any of those division by zero values

that were causing the error, and the other components will compute the division as

necessary. You will notice, we essentially charge all of our accounts 4.99 for

standard paper. It makes sense this doesn't fluctuate, and it is more accurate than

adding 1 in the denominator like our quick fix might have been in the earlier lesson.

You can try it yourself using the environment below.

This one is pretty tricky. Try running the query yourself to make sure you

understand what is happening. The next concept will give you some practice writing

CASE statements on your own. In this video, we showed that getting the same

information using a WHERE clause means only being able to get one set of data

from the CASE at a time.

There are some advantages to separating data into separate columns like this

depending on what you want to do, but often this level of separation might be

easier to do in another programming language - rather than with SQL.

Case statement very helpful for multiple conditions


Write a query to display for each order, the account ID, total amount of the order,

and the level of the order - ‘Large’ or ’Small’ - depending on if the order is $3000 or

more, or less than $3000.


SELECT account_id, total_amt_usd,

CASE WHEN total_amt_usd >= 3000 THEN 'Large'

ELSE 'Small' END AS order_level

FROM orders;

1.

Write a query to display the number of orders in each of three categories, based on

the total number of items in each order. The three categories are: 'At Least 2000',

'Between 1000 and 2000' and 'Less than 1000'.


SELECT CASE WHEN total >= 2000 THEN 'At Least 2000'

WHEN total >= 1000 AND total < 2000 THEN 'Between 1000 and 2000'

ELSE 'Less than 1000' END AS order_category,

COUNT(*) AS order_count

FROM orders

GROUP BY 1;
2.

We would like to understand 3 different branches of customers based on the

amount associated with their purchases. The top branch includes anyone with a

Lifetime Value (total sales of all orders) greater than 200,000 usd. The second

branch is between 200,000 and 100,000 usd. The lowest branch is anyone under

100,000 usd. Provide a table that includes the level associated with each account.

You should provide the account name, the total sales of all orders for the

customer, and the level. Order with the top spending customers listed first.
SELECT a.name, SUM(total_amt_usd) total_spent,

CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'

WHEN SUM(total_amt_usd) > 100000 THEN 'middle'

ELSE 'low' END AS customer_level

FROM orders o

JOIN accounts a

ON o.account_id = a.id

GROUP BY a.name

ORDER BY 2 DESC;
3.

We would now like to perform a similar calculation to the first, but we want to

obtain the total amount spent by customers only in 2016 and 2017. Keep the same

levels as in the previous question. Order with the top spending customers listed

first.
SELECT a.name, SUM(total_amt_usd) total_spent,

CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'

WHEN SUM(total_amt_usd) > 100000 THEN 'middle'

ELSE 'low' END AS customer_level

FROM orders o

JOIN accounts a

ON o.account_id = a.id

WHERE occurred_at > '2015-12-31'

GROUP BY 1

ORDER BY 2 DESC;

4.
We would like to identify top performing sales reps, which are sales reps

associated with more than 200 orders. Create a table with the sales rep name, the

total number of orders, and a column with top or not depending on if they have

more than 200 orders. Place the top sales people first in your final table.
SELECT s.name, COUNT(*) num_ords,

CASE WHEN COUNT(*) > 200 THEN 'top'

ELSE 'not' END AS sales_rep_level

FROM orders o

JOIN accounts a

ON o.account_id = a.id

JOIN sales_reps s

ON s.id = a.sales_rep_id

GROUP BY s.name

ORDER BY 2 DESC;

5.

It is worth mentioning that this assumes each name is unique - an assumption

we have made a few times. Otherwise we would want to group by both the name and
the id of the table.

The previous didn't account for the middle, nor the dollar amount associated with

the sales. Management decides they want to see these characteristics represented

as well. We would like to identify top performing sales reps, which are sales reps

associated with more than 200 orders or more than 750000 in total sales. The

middle group has any rep with more than 150 orders or 500000 in sales. Create a

table with the sales rep name, the total number of orders, total sales across all

orders, and a column with top, middle, or low depending on this criteria. Place the

top sales people based on dollar amount of sales first in your final table.
SELECT s.name, COUNT(*), SUM(o.total_amt_usd) total_spent,

CASE WHEN COUNT(*) > 200 OR SUM(o.total_amt_usd) > 750000 THEN 'top'

WHEN COUNT(*) > 150 OR SUM(o.total_amt_usd) > 500000 THEN 'middle'

ELSE 'low' END AS sales_rep_level

FROM orders o

JOIN accounts a

ON o.account_id = a.id

JOIN sales_reps s
ON s.id = a.sales_rep_id

GROUP BY s.name

ORDER BY 3 DESC;

6.

1. Subqueries

2. Table Expressions

3. Persistent Derived Tables

Both subqueries and table expressions are methods for being able to write a

query that creates a table, and then write a query that interacts with this newly

created table. Sometimes the question you are trying to answer doesn't have an

answer when working directly with the existing tables in the database.

However, if we were able to create new tables from the existing tables, we know we

could query these new tables to answer our question. This is where the queries of

this lesson come to the rescue.

If you can't yet think of a question that might require such a query, don't worry

because you are about to see a whole bunch of them!

In the first subquery you wrote, you created a table that you could then query again in

the FROM statement. However, if you are only returning a single value, you might use
that value in a logical statement like WHERE, HAVING, or even SELECT - the value could

be nested within a CASE statement.

On the next concept, we will work through this example, and then you will get some

practice on answering some questions on your own.

Expert Tip

Note that you should not include an alias when you write a subquery in a conditional

statement. This is because the subquery is treated as an individual value (or set of

values in the IN case) rather than as a table.

Also, notice the query here compared a single value. If we returned an entire column

IN would need to be used to perform a logical argument. If we are returning an entire

table, then we must use an ALIAS for the table, and perform additional logic on the

entire table.
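As a quick sketch of these rules (using the orders table from the course schema), a subquery that returns a single value can sit directly in a conditional with no alias:

```sql
-- Orders placed in the same month/year as the first order ever placed.
-- The inner query returns one value, so no alias is needed.
SELECT *
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
      (SELECT DATE_TRUNC('month', MIN(occurred_at))
       FROM orders);
```

If the subquery instead returned a whole column, the comparison would need IN rather than =.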

NEXT

The WITH statement is often called a Common Table Expression or CTE. Though

these expressions serve the exact same purpose as subqueries, they are more

common in practice, as they tend to be cleaner for a future reader to follow the

logic.

In the next concept, we will walk through this example a bit more slowly to make

sure you have all the similarities between subqueries and these expressions down
for you to use in practice! If you are already feeling comfortable, skip ahead to

practice the quiz section.

SUBQUERIES VERY IMPPP

Solution: Subquery Mania


Provide the name of the sales_rep in each region with the largest amount of
total_amt_usd sales.

First, I wanted to find the total_amt_usd totals associated with each sales rep, and
I also wanted the region in which they were located. The query below provided this
information.
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC;

Next, I pulled the max for each region, and then we can use this to pull those rows
in our final result.
SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1;

Essentially, this is a JOIN of these two tables, where the region and amount match.
SELECT t3.rep_name, t3.region_name, t3.total_amt
FROM(SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1) t2
JOIN (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC) t3
ON t3.region_name = t2.region_name AND t3.total_amt = t2.total_amt;
1.

For the region with the largest sales total_amt_usd, how many total orders were
placed?

The first query I wrote was to pull the total_amt_usd for each region.
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name;

Then we just want the region with the max amount from this table. There are two
ways I considered getting this amount. One was to pull the max using a subquery.
Another way is to order descending and just pull the top value.
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub;

Finally, we want to pull the total orders for the region with this amount:
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub);
2.

This provides the Northeast with 2357 orders.


How many accounts had more total purchases than the account name which has
bought the most standard_qty paper throughout their lifetime as a customer?

First, we want to find the account that had the most standard_qty paper. The
query here pulls that account, as well as the total amount:
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;

Now, I want to use this to pull all the accounts with more total sales:
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std,
SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) sub);

This is now a list of all the accounts with more total orders. We can get the count
with just another simple subquery.
SELECT COUNT(*)
FROM (SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std,
SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) inner_tab)
) counter_tab;
3.

For the customer that spent the most (in total over their lifetime as a customer)
total_amt_usd, how many web_events did they have for each channel?

Here, we first want to pull the customer with the most spent in lifetime value.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1;

Now, we want to look at the number of events on each channel this company had,
which we can match with just the id.
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id = (SELECT id
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1) inner_table)
GROUP BY 1, 2
ORDER BY 3 DESC;
4.

I added an ORDER BY for no real reason, and the account name to assure I

was only pulling from one account.

What is the lifetime average amount spent in terms of total_amt_usd for the top 10
total spending accounts?

First, we just want to find the top 10 accounts in terms of highest total_amt_usd.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10;

Now, we just want the average of these 10 amounts.


SELECT AVG(tot_spent)
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10) temp;
5.

What is the lifetime average amount spent in terms of total_amt_usd, including


only the companies that spent more per order, on average, than the average of all
orders.

First, we want to pull the average of all accounts in terms of total_amt_usd:


SELECT AVG(o.total_amt_usd) avg_all
FROM orders o

Then, we want to only pull the accounts with more than this average amount.
SELECT o.account_id, AVG(o.total_amt_usd)
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o);

Finally, we just want the average of these values.


SELECT AVG(avg_amt)
FROM (SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o)) temp_table;

6.



WITH table1 AS (
SELECT *
FROM web_events),

table2 AS (
SELECT *
FROM accounts)

SELECT *
FROM table1
JOIN table2

ON table1.account_id = table2.id;

Provide the name of the sales_rep in each region with the largest amount of
total_amt_usd sales.

WITH t1 AS (

SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt

FROM sales_reps s

JOIN accounts a

ON a.sales_rep_id = s.id

JOIN orders o
ON o.account_id = a.id

JOIN region r

ON r.id = s.region_id

GROUP BY 1,2

ORDER BY 3 DESC),

t2 AS (

SELECT region_name, MAX(total_amt) total_amt

FROM t1

GROUP BY 1)

SELECT t1.rep_name, t1.region_name, t1.total_amt

FROM t1

JOIN t2

ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;

1.

For the region with the largest sales total_amt_usd, how many total orders were
placed?

WITH t1 AS (

SELECT r.name region_name, SUM(o.total_amt_usd) total_amt

FROM sales_reps s
JOIN accounts a

ON a.sales_rep_id = s.id

JOIN orders o

ON o.account_id = a.id

JOIN region r

ON r.id = s.region_id

GROUP BY r.name),

t2 AS (

SELECT MAX(total_amt)

FROM t1)

SELECT r.name, COUNT(o.total) total_orders

FROM sales_reps s

JOIN accounts a

ON a.sales_rep_id = s.id

JOIN orders o

ON o.account_id = a.id

JOIN region r

ON r.id = s.region_id

GROUP BY r.name

HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2);


2.

For the account that purchased the most (in total over their lifetime as a customer)
standard_qty paper, how many accounts still had more in total purchases?

WITH t1 AS (

SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total

FROM accounts a

JOIN orders o

ON o.account_id = a.id

GROUP BY 1

ORDER BY 2 DESC

LIMIT 1),

t2 AS (

SELECT a.name

FROM orders o

JOIN accounts a

ON a.id = o.account_id

GROUP BY 1

HAVING SUM(o.total) > (SELECT total FROM t1))

SELECT COUNT(*)
FROM t2;

3.

For the customer that spent the most (in total over their lifetime as a customer)
total_amt_usd, how many web_events did they have for each channel?

WITH t1 AS (

SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent

FROM orders o

JOIN accounts a

ON a.id = o.account_id

GROUP BY a.id, a.name

ORDER BY 3 DESC

LIMIT 1)

SELECT a.name, w.channel, COUNT(*)

FROM accounts a

JOIN web_events w

ON a.id = w.account_id AND a.id = (SELECT id FROM t1)

GROUP BY 1, 2

ORDER BY 3 DESC;
4.

What is the lifetime average amount spent in terms of total_amt_usd for the top 10
total spending accounts?

WITH t1 AS (

SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent

FROM orders o

JOIN accounts a

ON a.id = o.account_id

GROUP BY a.id, a.name

ORDER BY 3 DESC

LIMIT 10)

SELECT AVG(tot_spent)

FROM t1;

5.

What is the lifetime average amount spent in terms of total_amt_usd, including


only the companies that spent more per order, on average, than the average of all
orders.
WITH t1 AS (

SELECT AVG(o.total_amt_usd) avg_all

FROM orders o

JOIN accounts a

ON a.id = o.account_id),

t2 AS (

SELECT o.account_id, AVG(o.total_amt_usd) avg_amt

FROM orders o

GROUP BY 1

HAVING AVG(o.total_amt_usd) > (SELECT * FROM t1))

SELECT AVG(avg_amt)

FROM t2;

6.

SQL Cleaning
Here we looked at three new functions:

1. LEFT

2. RIGHT

3. LENGTH
LEFT pulls a specified number of characters for each row in a specified column

starting at the beginning (or from the left). As you saw here, you can pull the first

three digits of a phone number using LEFT(phone_number, 3).

RIGHT pulls a specified number of characters for each row in a specified column

starting at the end (or from the right). As you saw here, you can pull the last eight

digits of a phone number using RIGHT(phone_number, 8).

LENGTH provides the number of characters for each row of a specified column.

Here, you saw that we could use this to get the length of each phone number as

LENGTH(phone_number).

SELECT RIGHT(website, 3) AS domain, COUNT(*) num_companies

FROM accounts

GROUP BY 1

ORDER BY 2 DESC;

1.

SELECT LEFT(UPPER(name), 1) AS first_letter, COUNT(*) num_companies

FROM accounts

GROUP BY 1

ORDER BY 2 DESC;

2.
There are 350 company names that start with a letter and 1 that starts with a

number. This gives a ratio of 350/351 that are company names that start with a

letter or 99.7%.
SELECT SUM(num) nums, SUM(letter) letters

FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN

('0','1','2','3','4','5','6','7','8','9')

THEN 1 ELSE 0 END AS num,

CASE WHEN LEFT(UPPER(name), 1) IN

('0','1','2','3','4','5','6','7','8','9')

THEN 0 ELSE 1 END AS letter

FROM accounts) t1;

3.

There are 80 company names that start with a vowel and 271 that start with other

characters. Therefore 80/351 are vowels or 22.8%. Therefore, 77.2% of company

names do not start with vowels.


SELECT SUM(vowels) vowels, SUM(other) other

FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U')

THEN 1 ELSE 0 END AS vowels,

CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U')

THEN 0 ELSE 1 END AS other

FROM accounts) t1;

4.

In this lesson, you learned about:

1. POSITION

2. STRPOS

3. LOWER
4. UPPER

POSITION takes a character and a column, and provides the index where that

character is for each row. The index of the first position is 1 in SQL. If you come

from another programming language, many begin indexing at 0. Here, you saw that

you can pull the index of a comma as POSITION(',' IN city_state).

STRPOS provides the same result as POSITION, but the syntax for achieving those

results is a bit different as shown here: STRPOS(city_state, ',').

Note, both POSITION and STRPOS are case sensitive, so looking for A is different

than looking for a.

Therefore, if you want to pull an index regardless of the case of a letter, you might

want to use LOWER or UPPER to make all of the characters lower or uppercase.
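A small sketch combining these functions (the city_state column is borrowed from the examples above; the table name some_table is hypothetical, and one space after the comma is assumed):

```sql
-- Split 'Chicago, IL' style values into city and state.
-- POSITION and STRPOS return the 1-based index of the comma.
SELECT city_state,
       LEFT(city_state, POSITION(',' IN city_state) - 1) AS city,
       RIGHT(city_state, LENGTH(city_state) - STRPOS(city_state, ',') - 1) AS state
FROM some_table;
```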
In this video, you saw additional functionality for working with dates including:

1. TO_DATE

2. CAST

3. Casting with ::

DATE_PART('month', TO_DATE(month, 'month')) here changed a month name

into the number associated with that particular month.

Then you can change a string to a date using CAST. CAST is actually useful for

changing lots of column types. Commonly you might do as you saw here,

changing a string to a date using CAST(date_column AS DATE).

However, you might want to make other changes to your columns in terms of their

data types. You can see other examples here.

In this example, you also saw that instead of CAST(date_column AS DATE), you can

use date_column::DATE.
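The two casting forms side by side (a sketch; the literal date is arbitrary):

```sql
-- Both expressions convert text to the DATE type.
SELECT CAST('2017-01-31' AS DATE) AS cast_form,
       '2017-01-31'::DATE AS shorthand_form;
```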

Expert Tip
Most of the functions presented in this lesson are specific to strings. They won’t

work with dates, integers or floating-point numbers. However, using any of these

functions will automatically change the data to the appropriate type.

LEFT, RIGHT, and TRIM are all used to select only certain elements of strings, but

using them to select elements of a number or date will treat them as strings for the

purpose of the function. Though we didn't cover TRIM in this lesson explicitly, it can

be used to remove characters from the beginning and end of a string. This can
remove unwanted spaces at the beginning or end of a row that often happen with

data being moved from Excel or other storage systems.
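A quick TRIM sketch (Postgres syntax; BOTH is the default direction):

```sql
-- Remove leading and trailing spaces from a string.
SELECT TRIM(BOTH ' ' FROM '   padded value   ') AS cleaned;
-- cleaned = 'padded value'
```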

There are a number of variations of these functions, as well as several other string

functions not covered here. Different databases use subtle variations on these

functions, so be sure to look up the appropriate database’s syntax if you’re

connected to a private database. The Postgres literature contains a lot of the related

functions.

IFNULL function - only 2 parameters

SELECT dept_no, IFNULL(dept_name, 'dept not provided') FROM table;

This returns the name if the value is not null,

or the string if the value is null.

COALESCE( ) - IFNULL with more than 2 parameters

These functions do not make changes in the dataset.

COALESCE(x, y, 'N/A')

First it checks x; if x is null then it checks the corresponding y value.

If the value of y is not null then it shows that; otherwise, if y is also null, it puts 'N/A'.

COALESCE can help you visualize the prototype of the final database.

Use only 1 argument in the function to display that fake_col
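A sketch of the cascade just described (the table and the columns dept_no, dept_name, backup_name are hypothetical):

```sql
-- Returns dept_name if present, else backup_name, else the literal 'N/A'.
SELECT dept_no,
       COALESCE(dept_name, backup_name, 'N/A') AS dept_display
FROM departments;
```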

UNION ALL is used to combine multiple SELECT statements into a single output


Here we have to assign manager_id = 20 to all employees with id 1 to 20.

Thus we use a subquery as a column value IMPP

SELECT emp_no FROM dept WHERE emp_no = x -

Next, assign manager_id = 39 to emp id 21 to 40.

Here UNION comes in handy:

we do two SELECT statements - subset A UNION subset B
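The two-subset idea can be sketched like this (the employees table and the id ranges are assumptions based on the note above):

```sql
-- Subset A: employees 1-20 get manager_id 20.
SELECT emp_no, 20 AS manager_id
FROM employees
WHERE emp_no BETWEEN 1 AND 20
UNION ALL
-- Subset B: employees 21-40 get manager_id 39.
SELECT emp_no, 39 AS manager_id
FROM employees
WHERE emp_no BETWEEN 21 AND 40;
```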

Self join - Combine rows of a table with other rows of the same table
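A minimal self-join sketch using the web_events table from the course schema (the pairing condition is an illustrative choice):

```sql
-- Pair each web event with later events from the same account
-- by joining the table to itself under two aliases.
SELECT w1.id AS event_id,
       w2.id AS later_event_id,
       w1.account_id
FROM web_events w1
JOIN web_events w2
  ON w1.account_id = w2.account_id
 AND w2.occurred_at > w1.occurred_at;
```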

Create or replace view as v_


SQL Window function

a window function performs a calculation across a set of table rows that are

somehow related to the current row. This is comparable to the type of calculation

that can be done with an aggregate function. But unlike regular aggregate

functions, use of a window function does not cause rows to become grouped into a

single output row — the rows retain their separate identities. Behind the scenes,

the window function is able to access more than just the current row of the query

result.

Through introducing window functions, we have also introduced two statements

that you may not be familiar with: OVER and PARTITION BY. These are key to

window functions. Not every window function uses PARTITION BY; we can also use

ORDER BY or no statement at all depending on the query we want to run. You will

practice using these clauses in the upcoming quizzes. If you want more details right

now, this resource from Pinal Dave is helpful.

Note: You can’t use window functions and standard aggregations in the same query.

More specifically, you can’t include window functions in a GROUP BY clause.

Creating a Partitioned Running Total Using Window Functions

Now, modify your query from the previous quiz to include partitions. Still create a

running total of standard_amt_usd (in the orders table) over order time, but this

time, date truncate occurred_at by year and partition by that same year-truncated

occurred_at variable. Your final table should have three columns: One with the
amount being added for each row, one for the truncated date, and a final column with

the running total within each year.

SELECT standard_amt_usd,

DATE_TRUNC('year', occurred_at) as year,

SUM(standard_amt_usd) OVER (PARTITION BY DATE_TRUNC('year',

occurred_at) ORDER BY occurred_at) AS running_total

FROM orders

ROW_NUMBER() - as the name suggests, it gives a row number to each row based on

the ORDER BY clause

RANK() - same as above, but the rank is the same for rows with the same value

Select the id, account_id, and total variable from the orders table, then create a

column called total_rank that ranks this total amount of paper ordered (from

highest to lowest) for each account using a partition. Your final table should have

these four columns.

SELECT id,

account_id,

total,

RANK() OVER (PARTITION BY account_id ORDER BY total DESC) AS total_rank

FROM orders
Aggregates in Window Functions

SELECT id,
       account_id,
       standard_qty,
       DATE_TRUNC('month', occurred_at) AS month,
       DENSE_RANK() OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month', occurred_at)) AS dense_rank,
       SUM(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month', occurred_at)) AS sum_std_qty,
       COUNT(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month', occurred_at)) AS count_std_qty,
       AVG(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month', occurred_at)) AS avg_std_qty,
       MIN(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month', occurred_at)) AS min_std_qty,
       MAX(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month', occurred_at)) AS max_std_qty
FROM orders

Because of the ORDER BY, rows that share the same ORDER BY value are treated as peers and grouped together in the running aggregation.

Aggregates in Window Functions with and without ORDER BY


The ORDER BY clause is one of two clauses integral to window functions. The ORDER and PARTITION define what is referred to as the "window" — the ordered subset of data over which calculations are made. Removing ORDER BY just leaves an unordered partition; in our query's case, each column's value is simply an aggregation (e.g., sum, count, average, minimum, or maximum) of all the standard_qty values in its respective account_id.

As Stack Overflow user mathguy explains:

The easiest way to think about this - leaving the ORDER BY out is equivalent to "ordering" in a way that all rows in the partition are "equal" to each other. Indeed, you can get the same effect by explicitly adding the ORDER BY clause like this: ORDER BY 0 (or "order by" any constant expression), or even, more emphatically, ORDER BY NULL.
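The point can be verified with a quick SQLite sketch (invented data): the same SUM is computed once with ORDER BY, giving a running total, and once without, giving the whole partition's total on every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (account_id INTEGER, standard_qty INTEGER, occurred_at TEXT);
INSERT INTO orders VALUES
  (1, 10, '2017-01-01'), (1, 20, '2017-02-01'), (1, 30, '2017-03-01');
""")
rows = conn.execute("""
SELECT standard_qty,
       SUM(standard_qty) OVER (PARTITION BY account_id
                               ORDER BY occurred_at) AS running_sum,
       SUM(standard_qty) OVER (PARTITION BY account_id) AS partition_sum
FROM orders
ORDER BY occurred_at
""").fetchall()
print(rows)
# [(10, 10, 60), (20, 30, 60), (30, 60, 60)]
# With ORDER BY the sum accumulates row by row; without it, every row
# simply gets the aggregate over the whole partition.
```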

Aliases for Multiple Window Functions

SELECT id,
       account_id,
       DATE_TRUNC('year', occurred_at) AS year,
       DENSE_RANK() OVER account_year_window AS dense_rank,
       total_amt_usd,
       SUM(total_amt_usd) OVER account_year_window AS sum_total_amt_usd,
       COUNT(total_amt_usd) OVER account_year_window AS count_total_amt_usd,
       AVG(total_amt_usd) OVER account_year_window AS avg_total_amt_usd,
       MIN(total_amt_usd) OVER account_year_window AS min_total_amt_usd,
       MAX(total_amt_usd) OVER account_year_window AS max_total_amt_usd
FROM orders
WINDOW account_year_window AS (PARTITION BY account_id ORDER BY DATE_TRUNC('year', occurred_at))
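SQLite also supports the WINDOW clause (since 3.25), so a minimal sketch with invented data can confirm that one named window is shared by several functions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (account_id INTEGER, total_amt_usd INTEGER, occurred_at TEXT);
INSERT INTO orders VALUES (1, 100, '2016-05-01'), (1, 300, '2017-05-01');
""")
rows = conn.execute("""
SELECT account_id,
       total_amt_usd,
       SUM(total_amt_usd) OVER account_year_window AS sum_total,
       MAX(total_amt_usd) OVER account_year_window AS max_total
FROM orders
WINDOW account_year_window AS (PARTITION BY account_id
                               ORDER BY strftime('%Y', occurred_at))
ORDER BY occurred_at
""").fetchall()
print(rows)
# Both aggregates run over the same named window:
# [(1, 100, 100, 100), (1, 300, 400, 300)]
```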

LAG function

Purpose

It returns the value from the previous row relative to the current row in the table.

Step 1:

Let’s first look at the inner query and see what this creates.

SELECT account_id, SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1
What you see after running this SQL code:

1. The query sums the standard_qty amounts for each account_id to give the standard paper each account has purchased over all time. E.g., account_id 2951 has purchased 8181 units of standard paper.
2. Notice that the results are not ordered by account_id or standard_qty.

Step 2:

We start building the outer query, and name the inner query sub.

SELECT account_id, standard_sum
FROM (
    SELECT account_id, SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

This still returns the same table you see above, which is also shown below.

Step 3 (Part A):


We add the Window Function OVER (ORDER BY standard_sum) in the outer query that will create a result set in ascending order based on the standard_sum column.

SELECT account_id,
       standard_sum,
       LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag
FROM (
    SELECT account_id, SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

This ordered column will set us up for the other part of the Window Function (see below).

Step 3 (Part B):


The LAG function creates a new column called lag as part of the outer query: LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag. This new column named lag uses the values from the ordered standard_sum (Part A within Step 3).

SELECT account_id,
       standard_sum,
       LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag
FROM (
    SELECT account_id,
           SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

Each row’s value in lag is pulled from the previous row. E.g., for account_id 1901, the value in lag will come from the previous row. However, since there is no previous row to pull from, the value in lag for account_id 1901 will be NULL. For account_id 3371, the value in lag will be pulled from the previous row (i.e., account_id 1901), which will be 0. This goes on for each row in the table.

What you see after running this SQL code:

Step 4:

To compare the values between the rows, we need to use both columns (standard_sum and lag). We add a new column named lag_difference, which subtracts the lag value from the value in standard_sum for each row in the table:

standard_sum - LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag_difference

SELECT account_id,
       standard_sum,
       LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag,
       standard_sum - LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag_difference
FROM (
    SELECT account_id,
           SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

Each value in lag_difference is comparing the row values between the 2 columns (standard_sum and lag). E.g., since the value for lag in the case of account_id 1901 is NULL, the value in lag_difference for account_id 1901 will be NULL. However, for account_id 3371, the value in lag_difference will compare the value 79 (standard_sum for account_id 3371) with 0 (lag for account_id 3371), resulting in 79. This goes on for each row in the table.


What you see after running this SQL code:
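Steps 1 through 4 can be reproduced end to end on SQLite. The account_ids and the sums 0, 79, and 102 match the walkthrough above; the single-row toy orders feeding them are a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (account_id INTEGER, standard_qty INTEGER);
INSERT INTO orders VALUES (1901, 0), (3371, 79), (1961, 102);
""")
rows = conn.execute("""
SELECT account_id,
       standard_sum,
       LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag,
       standard_sum - LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag_difference
FROM (
    SELECT account_id, SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub
ORDER BY standard_sum  -- added so the output order is deterministic
""").fetchall()
for row in rows:
    print(row)
# (1901, 0, None, None)  -- no previous row, so lag is NULL (None in Python)
# (3371, 79, 0, 79)
# (1961, 102, 79, 23)
```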

Now let’s look at the LEAD function.

LEAD function

Purpose:

Return the value from the row following the current row in the table.

Step 1:
Let’s first look at the inner query and see what this creates.

SELECT account_id,
       SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1

What you see after running this SQL code:

1. The query sums the standard_qty amounts for each account_id to give the standard paper each account has purchased over all time. E.g., account_id 2951 has purchased 8181 units of standard paper.
2. Notice that the results are not ordered by account_id or standard_qty.


Step 2:

We start building the outer query, and name the inner query sub.

SELECT account_id,
       standard_sum
FROM (
    SELECT account_id,
           SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub
This will produce the same table as above, but sets us up for the next part.

Step 3 (Part A):

We add the Window Function OVER (ORDER BY standard_sum) in the outer query that will create a result set ordered in ascending order of the standard_sum column.

SELECT account_id,
       standard_sum,
       LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead
FROM (
    SELECT account_id,
           SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

This ordered column will set us up for the other part of the Window Function (see below).

Step 3 (Part B):

The LEAD function in the Window Function statement creates a new column called lead as part of the outer query: LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead

This new column named lead uses the values from standard_sum (in the ordered table from Step 3 (Part A)). Each row’s value in lead is pulled from the row after it. E.g., for account_id 1901, the value in lead will come from the row following it (i.e., for account_id 3371). Since that value is 79, the value in lead for account_id 1901 will be 79. For account_id 3371, the value in lead will be pulled from the following row (i.e., account_id 1961), which will be 102. This goes on for each row in the table.

SELECT account_id,
       standard_sum,
       LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead
FROM (
    SELECT account_id,
           SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

What you see after running this SQL code:


Step 4: To compare the values between the rows, we need to use both columns (standard_sum and lead). We add a column named lead_difference, which subtracts the value in standard_sum from lead for each row in the table: LEAD(standard_sum) OVER (ORDER BY standard_sum) - standard_sum AS lead_difference

SELECT account_id,
       standard_sum,
       LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead,
       LEAD(standard_sum) OVER (ORDER BY standard_sum) - standard_sum AS lead_difference
FROM (
    SELECT account_id,
           SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub

Each value in lead_difference is comparing the row values between the 2 columns (standard_sum and lead). E.g., for account_id 1901, the value in lead_difference will compare the value 0 (standard_sum for account_id 1901) with 79 (lead for account_id 1901), resulting in 79. This goes on for each row in the table.

What you see after running this SQL code:
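The LEAD walkthrough can be mirrored on SQLite with the same toy data as the LAG sketch (sums 0, 79, 102 from the walkthrough; the single-row orders behind them are a stand-in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (account_id INTEGER, standard_qty INTEGER);
INSERT INTO orders VALUES (1901, 0), (3371, 79), (1961, 102);
""")
rows = conn.execute("""
SELECT account_id,
       standard_sum,
       LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead,
       LEAD(standard_sum) OVER (ORDER BY standard_sum) - standard_sum AS lead_difference
FROM (
    SELECT account_id, SUM(standard_qty) AS standard_sum
    FROM orders
    GROUP BY 1
) sub
ORDER BY standard_sum  -- added so the output order is deterministic
""").fetchall()
for row in rows:
    print(row)
# (1901, 0, 79, 79)
# (3371, 79, 102, 23)
# (1961, 102, None, None)  -- last row has no following row, so lead is NULL
```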


Scenarios for using LAG and LEAD functions

You can use LAG and LEAD functions whenever you are trying to compare the values in adjacent rows or rows that are offset by a certain number.

Example 1: You have a sales dataset with the following data and need to compare how the market segments fare against each other on profits earned.

Market Segment | Profits earned by each market segment
A | $550
B | $500
C | $670
D | $730
E | $982

Example 2: You have an inventory dataset with the following data and need to compare the number of days elapsed between each subsequent order placed for Item A.

Inventory | Order_id | Dates when orders were placed
Item A | 001 | 11/2/2017
Item A | 002 | 11/5/2017
Item A | 003 | 11/8/2017
Item A | 004 | 11/15/2017
Item A | 005 | 11/28/2017
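Example 2 can be sketched in code: LAG combined with SQLite's julianday() gives the days elapsed between consecutive orders. The dates come from the table above; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (item TEXT, order_id TEXT, placed_at TEXT);
INSERT INTO inventory VALUES
  ('Item A', '001', '2017-11-02'), ('Item A', '002', '2017-11-05'),
  ('Item A', '003', '2017-11-08'), ('Item A', '004', '2017-11-15'),
  ('Item A', '005', '2017-11-28');
""")
rows = conn.execute("""
SELECT order_id,
       placed_at,
       -- julianday() turns each date into a day count, so subtracting the
       -- LAG'd date gives the gap since the previous order
       CAST(julianday(placed_at)
            - julianday(LAG(placed_at) OVER (ORDER BY placed_at)) AS INTEGER) AS days_since_prev
FROM inventory
WHERE item = 'Item A'
ORDER BY placed_at
""").fetchall()
print(rows)
# Gaps of 3, 3, 7, and 13 days; the first order has no predecessor (None).
```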

As you can see, these are useful data analysis tools that you can use for more complex analysis!

Percentiles with Partitions


You can use partitions with percentiles to determine the percentile of a specific subset of all rows. Imagine you're an analyst at Parch & Posey and you want to determine the largest orders (in terms of quantity) a specific customer has made to encourage them to order more similarly sized large orders. You only want to consider the NTILE for that customer's account_id.

In the SQL Explorer below, write three queries (separately) that reflect each of the following:

1. Use the NTILE functionality to divide the accounts into 4 levels in terms of the amount of standard_qty for their orders. Your resulting table should have the account_id, the occurred_at time for each order, the total amount of standard_qty paper purchased, and one of four levels in a standard_quartile column.
2. Use the NTILE functionality to divide the accounts into two levels in terms of the amount of gloss_qty for their orders. Your resulting table should have the account_id, the occurred_at time for each order, the total amount of gloss_qty paper purchased, and one of two levels in a gloss_half column.
3. Use the NTILE functionality to divide the orders for each account into 100 levels in terms of the amount of total_amt_usd for their orders. Your resulting table should have the account_id, the occurred_at time for each order, the total amount of total_amt_usd paper purchased, and one of 100 levels in a total_percentile column.

Note: To make it easier to interpret the results, order by the account_id in each of the queries.
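A sketch of the first quiz's shape on SQLite, with a tiny invented dataset (only four orders for one account, so each "quartile" holds exactly one row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (account_id INTEGER, occurred_at TEXT, standard_qty INTEGER);
INSERT INTO orders VALUES
  (1, '2017-01-01', 10), (1, '2017-02-01', 40),
  (1, '2017-03-01', 20), (1, '2017-04-01', 30);
""")
rows = conn.execute("""
SELECT account_id,
       occurred_at,
       standard_qty,
       NTILE(4) OVER (PARTITION BY account_id
                      ORDER BY standard_qty) AS standard_quartile
FROM orders
ORDER BY account_id, standard_qty
""").fetchall()
print(rows)
# 4 rows split into 4 buckets: qty 10 -> quartile 1, 20 -> 2, 30 -> 3, 40 -> 4.
```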

FULL OUTER JOIN

In earlier lessons, we covered inner joins, which produce results for which the join condition is matched in both tables.

Venn diagrams, which are helpful for visualizing table joins, are provided below along with sample queries. Consider the circle on the left Table A and the circle on the right Table B.
INNER JOIN Venn Diagram

SELECT column_name(s)
FROM Table_A
INNER JOIN Table_B ON Table_A.column_name = Table_B.column_name;

Left joins also include unmatched rows from the left table, which is indicated in the “FROM” clause.

LEFT JOIN Venn Diagram

SELECT column_name(s)
FROM Table_A
LEFT JOIN Table_B ON Table_A.column_name = Table_B.column_name;

Right joins are similar to left joins, but include unmatched data from the right table -- the one that’s indicated in the JOIN clause.

RIGHT JOIN Venn Diagram

SELECT column_name(s)
FROM Table_A
RIGHT JOIN Table_B ON Table_A.column_name = Table_B.column_name;

In some cases, you might want to include unmatched rows from both tables being joined. You can do this with a full outer join.

FULL OUTER JOIN Venn Diagram

SELECT column_name(s)
FROM Table_A
FULL OUTER JOIN Table_B ON Table_A.column_name = Table_B.column_name;

A common application of this is when joining two tables on a timestamp. Let’s say you’ve got one table containing the number of item 1 sold each day, and another containing the number of item 2 sold. If a certain date, like January 1, 2018, exists in the left table but not the right, while another date, like January 2, 2018, exists in the right table but not the left:

● a left join would drop the row with January 2, 2018 from the result set
● a right join would drop January 1, 2018 from the result set

The only way to make sure both January 1, 2018 and January 2, 2018 make it into the results is to do a full outer join. A full outer join returns unmatched records in each table with null values for the columns that came from the opposite table.

If you wanted to return unmatched rows only, which is useful for some cases of data assessment, you can isolate them by adding the following line to the end of the query:

WHERE Table_A.column_name IS NULL OR Table_B.column_name IS NULL


FULL OUTER JOIN with WHERE A.Key IS NULL OR B.Key IS NULL Venn Diagram
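The item 1 / item 2 example can be sketched on SQLite. Older SQLite versions lack FULL OUTER JOIN, so this emulates it with a LEFT JOIN plus the right table's unmatched rows (table names and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item1_sales (sale_date TEXT, qty INTEGER);
CREATE TABLE item2_sales (sale_date TEXT, qty INTEGER);
INSERT INTO item1_sales VALUES ('2018-01-01', 5);
INSERT INTO item2_sales VALUES ('2018-01-02', 7);
""")
rows = conn.execute("""
SELECT a.sale_date, a.qty, b.sale_date, b.qty
FROM item1_sales a
LEFT JOIN item2_sales b ON a.sale_date = b.sale_date
UNION ALL
-- add the right table's rows that found no match on the left
SELECT a.sale_date, a.qty, b.sale_date, b.qty
FROM item2_sales b
LEFT JOIN item1_sales a ON a.sale_date = b.sale_date
WHERE a.sale_date IS NULL
ORDER BY 1  -- NULLs sort first in SQLite
""").fetchall()
print(rows)
# Both dates survive, with NULL (None) for the missing side:
# [(None, None, '2018-01-02', 7), ('2018-01-01', 5, None, None)]
```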
