SQL Notes
Many jobs require storing large amounts of data, so before carrying out any
analysis we first need to retrieve it. SQL is the language for extracting
information from a database.
Record = row
Field = column
In C or Java you spell out how something should be done, step by step:
Step 1: open the door. Step 2: walk outside. Take the empty bucket, fill it,
come inside.
In SQL you simply state what should be done: "Fetch the bucket of water, please."
Relationships
They tell you how many of the values in a foreign key field can appear in the PK column of the
table the data is related to, and vice versa.
Varchar - variable-length text
Date - YYYY-MM-DD
Date-Time - YYYY-MM-DD HH:MM:SS
Timestamp - useful for finding the duration between two events
Between...And
Helps us designate the interval to which a given value belongs.
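As a sketch of BETWEEN...AND against the orders table used later in these notes (note that both endpoints are inclusive):

```sql
-- Pull orders whose standard paper quantity falls in the interval [24, 29].
-- Equivalent to: standard_qty >= 24 AND standard_qty <= 29.
SELECT id, account_id, standard_qty
FROM orders
WHERE standard_qty BETWEEN 24 AND 29;
```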
Aggregate functions apply to multiple rows of a single column of a table and return an
output of a single value.
Count()
Counts the number of non-null records in a field.
Sum()
Sums all the non-null values in a column
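For example, run against the orders table introduced later in these notes (a sketch, not tied to a specific result):

```sql
-- COUNT tallies non-null values; SUM adds non-null values.
-- Both collapse the whole table into a single output row.
SELECT COUNT(id) AS num_orders,
       SUM(total_amt_usd) AS total_revenue
FROM orders;
```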
Having Clause
Refines the output by removing records that do not satisfy a certain condition.
Applied after the GROUP BY block.
If a condition is applied on an aggregate function, use the HAVING clause.
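A minimal sketch of HAVING filtering on an aggregate (WHERE cannot do this, because WHERE runs before aggregation):

```sql
-- Keep only accounts whose summed order value exceeds 30000 USD.
SELECT account_id, SUM(total_amt_usd) AS total_spent
FROM orders
GROUP BY account_id
HAVING SUM(total_amt_usd) > 30000;
```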
SQL JOINS
Joins allow us to construct relationships between tables.
With a plain (inner) join, rows with NULL join values, or values that appear in
only one of the tables, are not shown in the result.
Don't assume: there might be duplicate rows in your data.
Take care when using aggregate functions with MySQL; check the output to make
sure it is what you expect.
SELECT e.first_name, e.last_name
FROM employees e
WHERE e.emp_no IN (SELECT dm.emp_no FROM dept_manager dm);
Subquery - place ORDER BY in the outer query, not inside the subquery.
Self join - a table joined with itself; each copy of the table needs its own alias.
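A sketch of a self join on the web_events table from later in these notes (the 28-day window and PostgreSQL-style INTERVAL syntax are illustrative assumptions):

```sql
-- w1 and w2 are two aliases for the same web_events table.
-- Pair up events on the same account where the second event happened
-- within 28 days after the first.
SELECT w1.id AS first_event, w2.id AS later_event, w1.account_id
FROM web_events w1
JOIN web_events w2
  ON w1.account_id = w2.account_id
 AND w2.occurred_at > w1.occurred_at
 AND w2.occurred_at <= w1.occurred_at + INTERVAL '28 days';
```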
Entity Relationship Diagrams
An entity relationship diagram (ERD) is a common way to view data in a database.
Below is the ERD for the database we will use from Parch & Posey. These diagrams help
you visualize the data you are analyzing including:
1. The names of the tables.
2. The columns in each table.
3. The way the tables work together.
You can think of each of the boxes below as a spreadsheet.
In the Parch & Posey database there are five tables (essentially 5 spreadsheets):
1. web_events
2. accounts
3. orders
4. sales_reps
5. region
You can think of each of these tables as an individual spreadsheet. Then the columns in
each spreadsheet are listed below the table name. For example, the region table has
two columns: id and name. Alternatively the web_events table has four columns.
There are some major advantages to using traditional relational databases, which we
interact with using SQL. The five most apparent are:
● SQL is easy to understand.
● Traditional databases allow us to access data directly.
● Traditional databases allow us to audit and replicate our data.
● Data can be accessed quickly - SQL allows you to obtain results very quickly from
the data stored in a database. Code can be optimized to quickly pull results.
● Data is easily shared - multiple individuals can access data stored in a database,
and the data is the same for all users, allowing for consistent results for anyone
with access to your database.
The LIKE operator is extremely useful for working with text. You will use LIKE within a
WHERE clause. The LIKE operator is frequently used with %. The % tells us that we might
want any number of characters leading up to a particular set of characters or following
a certain set of characters. Remember you will need to use single quotes for the text you
pass to the LIKE operator. Note that lowercase and uppercase letters are not the same
within the string: searching for 'T' is not the same as searching for 't'. In other SQL
environments (outside the classroom), you can use either single or double quotes.
Hopefully you are starting to get more comfortable with SQL, as we are starting to move
toward operations that have more applications, but this also means we can't show you
every use case. Hopefully, you can start to think about how you might use these types of
applications to identify phone numbers from a certain region, or an individual where you
can't quite remember the full name.
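As a sketch of the LIKE operator described above, using the accounts table:

```sql
-- '%' matches any run of characters, so this pulls accounts whose
-- name contains 'one' anywhere; the match is case-sensitive here.
SELECT name
FROM accounts
WHERE name LIKE '%one%';
```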
few key points about data stored in SQL databases:
1. Data in databases is stored in tables that can be thought of just like Excel
spreadsheets.
For the most part, you can think of a database as a bunch of Excel spreadsheets.
Each spreadsheet has rows and columns. Where each row holds data on a
transaction, a person, a company, etc., while each column holds data pertaining
to a particular aspect of one of the rows you care about like a name, location, a
unique id, etc.
2. All the data in the same column must match in terms of data type.
An entire column is considered quantitative, discrete, or as some sort of string.
This means if you have one row with a string in a particular column, the entire
column might change to a text data type. This can be very bad if you want to do
math with this column!
3. Consistent column types are one of the main reasons working with databases is
fast.
Often databases hold a LOT of data. So, knowing that the columns are all of the
same type of data means that obtaining data from a database can still be fast.
The ORDER BY statement is always after the SELECT and FROM statements, but it is
before the LIMIT statement. As you learn additional commands, the order of these
statements will matter more. If we are using the LIMIT statement, it will always appear
last.
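A minimal skeleton showing that clause order, using the orders table:

```sql
-- Clause order: SELECT, FROM, ORDER BY, LIMIT -- LIMIT always comes last.
SELECT occurred_at, total_amt_usd
FROM orders
ORDER BY occurred_at DESC
LIMIT 10;
```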
Statement | How to use it | Other details
FROM | FROM Table | Provide the table where the columns exist
LIKE | WHERE Col LIKE '%me%' | Only pulls rows where column has 'me' within the text
IN | WHERE Col IN ('Y', 'N') | A filter for only rows with column of 'Y' or 'N'
NOT | WHERE Col NOT IN ('Y', 'N') | NOT is frequently used with LIKE and IN
AND | WHERE Col1 > 5 AND Col2 < 3 | Filter rows where two or more conditions must be true
OR | WHERE Col1 > 5 OR Col2 < 3 | Filter rows where at least one condition must be true
BETWEEN | WHERE Col BETWEEN 3 AND 5 | Often easier syntax than using an AND
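Several of the operators above can be combined in one WHERE clause. A sketch against the web_events and accounts tables (the specific channel values are examples, not guaranteed to exist in the data):

```sql
-- IN restricts the channel, AND chains conditions,
-- and NOT LIKE excludes a name pattern.
SELECT a.name, w.channel, w.occurred_at
FROM web_events w
JOIN accounts a
  ON a.id = w.account_id
WHERE w.channel IN ('organic', 'adwords')
  AND a.name NOT LIKE 'C%';
```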
Again - JOINs are useful for allowing us to pull data from multiple tables. This is both
simple and powerful all at the same time.
With the addition of the JOIN statement to our toolkit, we will also be adding the ON
statement.
NULL marks where no data exists in SQL; it is not itself a value. NULLs are often ignored
in our aggregation functions, which you will get a first look at in the next concept using
COUNT.
When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don't
use =, because NULL isn't considered a value in SQL. Rather, it is a property of the data.
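A sketch of the IS NULL syntax, using the primary_poc column of the accounts table mentioned later in these notes:

```sql
-- IS NULL (never "= NULL") finds rows with a missing primary contact.
SELECT *
FROM accounts
WHERE primary_poc IS NULL;
```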
Notice that COUNT does not consider rows that have NULL values. Therefore, this
can be useful for quickly identifying which rows have missing data. You will learn
GROUP BY in an upcoming concept, and then each of these aggregators will
become much more useful.
Aggregation Reminder
An important thing to remember: aggregators only aggregate vertically - the
values of a column. If you want to perform a calculation across rows, you would
use simple arithmetic in the SELECT clause. We saw this in the first lesson if you
need a refresher, but the quiz in the next concept should assure you still
remember how to aggregate across rows.
One quick note: a median might be a more appropriate measure of center for skewed
data, but finding the median happens to be a pretty difficult thing to get using
SQL alone.
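Aggregating across rows is done with ordinary arithmetic in the SELECT clause rather than with an aggregator, e.g. on the orders table:

```sql
-- Derived column computed within each row; no GROUP BY needed.
SELECT id,
       standard_qty + gloss_qty + poster_qty AS total_qty
FROM orders;
```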
The key takeaways here:
● GROUP BY can be used to aggregate data within subsets of the data. For
example, grouping for different accounts, different regions, or different sales
representatives.
● Any column in the SELECT statement that is not within an aggregator must
be in the GROUP BY clause.
Which account (by name) placed the earliest order? Your solution should have the
account name and the date of the order.
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY occurred_at
LIMIT 1;
1.
Find the total sales in usd for each account. You should include two columns - the
total sales for each company's orders in usd and the company name.
SELECT a.name, SUM(total_amt_usd) total_sales
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.name;
2.
Via what channel did the most recent (latest) web_event occur, which account
was associated with this web_event? Your query should return only three values -
the date, channel, and account name.
SELECT w.occurred_at, w.channel, a.name
FROM web_events w
JOIN accounts a
ON w.account_id = a.id
ORDER BY w.occurred_at DESC
LIMIT 1;
3.
Find the total number of times each type of channel from the web_events was
used. Your final table should have two columns - the channel and the number of
times the channel was used.
SELECT w.channel, COUNT(*)
FROM web_events w
GROUP BY w.channel;
4.
Who was the primary contact associated with the earliest web_event?
SELECT a.primary_poc
FROM web_events w
JOIN accounts a
ON a.id = w.account_id
ORDER BY w.occurred_at
LIMIT 1;
5.
What was the smallest order placed by each account in terms of total usd. Provide
only two columns - the account name and the total usd. Order from smallest
dollar amounts to largest.
SELECT a.name, MIN(total_amt_usd) smallest_order
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY smallest_order;
6.
Find the number of sales reps in each region. Your final table should have two
columns - the region and the number of sales_reps. Order from fewest reps to
most reps.
SELECT r.name, COUNT(*) num_reps
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
GROUP BY r.name
ORDER BY num_reps;
7.
9. The order of column names in your GROUP BY clause doesn't matter - the
results will be the same regardless. If we run the same query and reverse the
order in the GROUP BY clause, you can see we get the same results.
● As with ORDER BY, you can substitute numbers for column names in the
GROUP BY clause.
● A reminder here that any column that is not within an aggregation must
show up in your GROUP BY statement. If you forget, you will likely get an
error. However, in the off chance that your query does work, you might not
get the results you were expecting.
For each account, determine the average amount of each type of paper they
purchased across their orders. Your result should have four columns - one for the
account name and one for the average spent on each of the paper types.
SELECT a.name, AVG(o.standard_qty) avg_stand, AVG(o.gloss_qty) avg_gloss,
AVG(o.poster_qty) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
For each account, determine the average amount spent per order on each paper
type. Your result should have four columns - one for the account name and one for
the average amount spent on each paper type.
SELECT a.name, AVG(o.standard_amt_usd) avg_stand, AVG(o.gloss_amt_usd) avg_gloss,
AVG(o.poster_amt_usd) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
Determine the number of times a particular channel was used in the web_events
table for each sales rep. Your final table should have three columns - the name of
the sales rep, the channel, and the number of occurrences. Order your table with
the highest number of occurrences first.
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
Determine the number of times a particular channel was used in the web_events
table for each region. Your final table should have three columns - the region
name, the channel, and the number of occurrences. Order your table with the
highest number of occurrences first.
SELECT r.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name, w.channel
ORDER BY num_events DESC;
DISTINCT is always used in SELECT statements, and it provides the unique rows for
all columns written in the SELECT statement. Therefore, you only use DISTINCT
once in any particular SELECT statement. You could write:
SELECT DISTINCT column1, column2, column3
FROM table1;
which would return the unique (or DISTINCT) rows across all three columns. You
would not write:
SELECT DISTINCT column1, DISTINCT column2, DISTINCT column3
FROM table1;
You can think of DISTINCT the same way you might think of the statement
"unique".
Use DISTINCT to test if there are any accounts associated with more than one
region.
The below two queries have the same number of resulting rows (351), so we know
that every account is associated with only one region. If each account was
associated with more than one region, the first query should have returned more
rows than the second query.
SELECT a.id AS account_id, r.id AS region_id, a.name AS account_name, r.name AS region_name
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id;
and
SELECT DISTINCT id, name
FROM accounts;
1.
Actually all of the sales reps have worked on more than one account. The fewest
number of accounts any sales rep works on is 3. There are 50 sales reps, and they
all have more than one account. Using DISTINCT in the second query assures that
all of the sales reps are accounted for in the first query.
SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
ORDER BY num_accounts;
and
SELECT DISTINCT id, name
FROM sales_reps;
2.
HAVING is the clean way to filter a query that has been aggregated, but this is
also commonly done using a subquery. Essentially, any time you want to perform a
WHERE on an element of your query that was created by an aggregate, you need to
use HAVING instead.
GROUPing BY a date column is not usually very useful in SQL, as these columns
tend to have transaction data down to a second. Keeping date information at such
a granular level is both a blessing and a curse, as it gives really precise information
about when an event occurred, but it is difficult to group by, since two events
rarely occur at the exact same second.
How many of the sales reps have more than 5 accounts that they manage?
SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
HAVING COUNT(*) > 5
ORDER BY num_accounts;
and technically, we can get this using a SUBQUERY as shown below. This same logic
can be used for the other queries, but this will not be shown.
SELECT COUNT(*) num_reps_above5
FROM (SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
HAVING COUNT(*) > 5) AS Table1;
1.
How many accounts have more than 20 orders?
SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING COUNT(*) > 20
ORDER BY num_orders;
2.
Which account has the most orders?
SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY num_orders DESC
LIMIT 1;
3.
How many accounts spent more than 30,000 usd total across all orders?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) > 30000
ORDER BY total_spent;
4.
How many accounts spent less than 1,000 usd total across all orders?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) < 1000
ORDER BY total_spent;
5.
Which account has spent the most with us?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent DESC
LIMIT 1;
6.
Which account has spent the least with us?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent
LIMIT 1;
7.
Which accounts used facebook as a channel to contact customers more than 6
times?
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
HAVING COUNT(*) > 6 AND w.channel = 'facebook'
ORDER BY use_of_channel;
8.
Which account used facebook most as a channel?
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 1;
9.
Note: This query above only works if there are no ties for the account that
used facebook the most. It is a best practice to use a larger limit number first,
such as 3 or 5, to check for ties before using LIMIT 1.
Which channel was most frequently used by most accounts?
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 10;
Lucky for us, there are a number of built-in SQL functions that are aimed at helping
us take care of date manipulation.
Here we saw that dates are stored in year, month, day, hour, minute, second order,
which helps us in truncating. In the next concept, you will see a number of
functions that allow you to truncate your date-time column. Common truncations
are day, month, and year.
DATE_PART can be useful for pulling a specific portion of a date, but notice pulling
month or day of the week (dow) means that you are no longer keeping the years in
order. Rather you are grouping for certain components regardless of which year
they belong to.
For additional functions you can use with dates, check out the documentation, but
the DATE_TRUNC and DATE_PART functions definitely give you a great start!
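A sketch combining both functions on the orders table (PostgreSQL semantics assumed; 'dow' returns 0 for Sunday through 6 for Saturday):

```sql
-- DATE_TRUNC rolls timestamps down to the start of a period;
-- DATE_PART pulls one component out of the date.
SELECT DATE_TRUNC('month', occurred_at) AS order_month,
       DATE_PART('dow', occurred_at)    AS order_dow,
       SUM(total_amt_usd)               AS total_spent
FROM orders
GROUP BY 1, 2
ORDER BY 1;
```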
You can reference the columns in your SELECT statement in GROUP BY and ORDER
BY clauses with numbers that follow the order they appear in the SELECT statement.
For example:
SELECT standard_qty, COUNT(*)
FROM orders
GROUP BY 1 (this 1 refers to standard_qty since it is the first of the columns included in
the SELECT statement)
ORDER BY 1 (this 1 refers to standard_qty as well)
Which year did Parch & Posey have the greatest sales in terms of total dollars?
Are all years evenly represented by the dataset? Order your results from
greatest to least. Do you notice any trends in the yearly sales totals?
SELECT DATE_PART('year', occurred_at) ord_year, SUM(total_amt_usd)
total_spent
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
1.
When we look at the yearly totals, you might notice that 2013 and 2017 have
much smaller totals than all other years. If we look further at the monthly
data, we see that for 2013 and 2017 there is only one month of sales for each
of these years (12 for 2013 and 1 for 2017). Therefore, neither of these are
evenly represented. Sales have been increasing year over year, with 2016
being the largest sales to date. At this rate, we might expect 2017 to have the
largest sales.
Which month did Parch & Posey have the greatest sales in terms of total dollars?
Are all months evenly represented by the dataset?
SELECT DATE_PART('month', occurred_at) ord_month, SUM(total_amt_usd)
total_spent
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
2.
Which year did Parch & Posey have the greatest sales in terms of total number of
orders? Are all years evenly represented by the dataset?
SELECT DATE_PART('year', occurred_at) ord_year, COUNT(*) total_sales
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
3.
Again, 2016 by far has the most orders, but again 2013 and 2017 are not
evenly represented compared to the other years in the dataset.
Which month did Parch & Posey have the greatest sales in terms of total number of
orders? Are all months evenly represented by the dataset?
SELECT DATE_PART('month', occurred_at) ord_month, COUNT(*) total_sales
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;
4.
December still has the most sales, but interestingly, November has the
second most sales (but not the most dollar sales). To make a fair comparison
from one month to another, 2017 and 2013 data were removed.
In which month of which year did Walmart spend the most on gloss paper in terms
of dollars?
SELECT DATE_TRUNC('month', o.occurred_at) ord_date, SUM(o.gloss_amt_usd)
tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
5.
May 2016 was when Walmart spent the most on gloss paper.
● CASE must include the following components: WHEN, THEN, and END. ELSE is
an optional component to catch cases that didn’t meet any of the other CASE
conditions.
● You can make any conditional statement using any conditional operator (like
WHERE) between WHEN and THEN. This includes stringing together multiple
conditional statements using AND and OR.
Example
In a quiz question in the previous Basic SQL lesson, you saw this question:
Create a column that divides the standard_amt_usd by the standard_qty to
find the unit price for standard paper for each order. Limit the results to the
first 10 orders, and include the id and account_id fields. NOTE - you will be
thrown an error with the correct solution to this question. This is for a
division by zero. You will learn how to get a solution without an error to
this query when you learn about CASE statements in a later section.
Let's see how we can use the CASE statement to get around this error.
SELECT id, account_id, standard_amt_usd/standard_qty AS unit_price
FROM orders
LIMIT 10;
Now, let's use a CASE statement. This way any time the standard_qty is zero, we
will return 0, and otherwise we will return the unit price.
SELECT account_id, CASE WHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
ELSE standard_amt_usd/standard_qty END AS unit_price
FROM orders
LIMIT 10;
Now the first part of the statement will catch any of those division by zero values
that were causing the error, and the other components will compute the division as
necessary. You will notice, we essentially charge all of our accounts 4.99 for
standard paper. It makes sense this doesn't fluctuate, and it is more accurate than
adding 1 in the denominator like our quick fix might have been in the earlier lesson.
This one is pretty tricky. Try running the query yourself to make sure you
understand what is happening. The next concept will give you some practice writing
CASE statements on your own. In this video, we showed that getting the same
information using a WHERE clause means only being able to get one set of data
from the CASE at a time.
There are some advantages to separating data into separate columns like this
depending on what you want to do, but often this level of separation might be
easier to do using a CASE statement.
Write a query to display for each order, the account ID, the total amount of the
order, and the level of the order - ‘Large’ or ’Small’ - depending on if the order is $3000 or
more, or less than $3000.
SELECT account_id, total_amt_usd,
CASE WHEN total_amt_usd >= 3000 THEN 'Large' ELSE 'Small' END AS order_level
FROM orders;
1.
Write a query to display the number of orders in each of three categories, based on
the total number of items in each order. The three categories are: 'At Least 2000',
'Between 1000 and 2000', and 'Less than 1000'.
SELECT CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total >= 1000 AND total < 2000 THEN 'Between 1000 and 2000'
ELSE 'Less than 1000' END AS order_category,
COUNT(*) AS order_count
FROM orders
GROUP BY 1;
2.
We would like to understand 3 different levels of customers based on the total
amount associated with their purchases. The top branch includes anyone with a
Lifetime Value (total sales of all orders) greater than 200,000 usd. The second
branch is between 200,000 and 100,000 usd. The lowest branch is anyone under
100,000 usd. Provide a table that includes the level associated with each account.
You should provide the account name, the total sales of all orders for the
customer, and the level. Order with the top spending customers listed first.
SELECT a.name, SUM(total_amt_usd) total_spent,
CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
WHEN SUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
GROUP BY a.name
ORDER BY 2 DESC;
3.
We would now like to perform a similar calculation to the first, but we want to
obtain the total amount spent by customers only in 2016 and 2017. Keep the same
levels as in the previous question. Order with the top spending customers listed
first.
SELECT a.name, SUM(total_amt_usd) total_spent,
CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
WHEN SUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
WHERE occurred_at > '2015-12-31'
GROUP BY 1
ORDER BY 2 DESC;
4.
We would like to identify top performing sales reps, which are sales reps
associated with more than 200 orders. Create a table with the sales rep name, the
total number of orders, and a column with top or not depending on if they have
more than 200 orders. Place the top sales people first in your final table.
SELECT s.name, COUNT(*) num_ords,
CASE WHEN COUNT(*) > 200 THEN 'top' ELSE 'not' END AS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 2 DESC;
5.
It is worth mentioning that this assumes each name is unique - which has
been done a few times. We otherwise would want to group by both the name and
the id of the table.
The previous didn't account for the middle, nor the dollar amount associated with
the sales. Management decides they want to see these characteristics represented
as well. We would like to identify top performing sales reps, which are sales reps
associated with more than 200 orders or more than 750000 in total sales. The
middle group has any rep with more than 150 orders or 500000 in sales. Create a
table with the sales rep name, the total number of orders, total sales across all
orders, and a column with top, middle, or low depending on this criteria. Place the
top sales people based on dollar amount of sales first in your final table.
SELECT s.name, COUNT(*), SUM(o.total_amt_usd) total_spent,
CASE WHEN COUNT(*) > 200 OR SUM(o.total_amt_usd) > 750000 THEN 'top'
WHEN COUNT(*) > 150 OR SUM(o.total_amt_usd) > 500000 THEN 'middle'
ELSE 'low' END AS rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 3 DESC;
6.
1. Subqueries
2. Table Expressions
Both subqueries and table expressions are methods for being able to write a
query that creates a table, and then write a query that interacts with this newly
created table. Sometimes the question you are trying to answer doesn't have an
answer when working directly with the existing tables in your database.
However, if we were able to create new tables from the existing tables, we know we
could query these new tables to answer our question. This is where the queries of
this lesson come to the rescue.
If you can't yet think of a question that might require such a query, don't worry,
because you are about to see a whole bunch of them!
In the first subquery you wrote, you created a table that you could then query again in
the FROM statement. However, if you are only returning a single value, you might use
that value in a logical statement like WHERE, HAVING, or even SELECT - the value could
be nested within a CASE statement.
On the next concept, we will work through this example, and then you will get some
practice on answering questions using a subquery.
Expert Tip
Note that you should not include an alias when you write a subquery in a conditional
statement. This is because the subquery is treated as an individual value (or set of
values in the IN case) rather than as a table.
Also, notice the query here compared a single value. If we returned an entire column,
IN would need to be used to perform a logical argument. If we are returning an entire
table, then we must use an ALIAS for the table, and perform additional logic on the
entire table.
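A sketch of a subquery used as a single value in a conditional, with no alias (here the inner query returns the first month of orders):

```sql
-- The inner query returns exactly one value, so it can sit directly
-- in the WHERE clause without an alias.
SELECT AVG(standard_qty) AS avg_std_first_month
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
      (SELECT DATE_TRUNC('month', MIN(occurred_at)) FROM orders);
```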
The WITH statement is often called a Common Table Expression or CTE. Though
these expressions serve the exact same purpose as subqueries, they are more
common in practice, as they tend to be cleaner for a future reader to follow the
logic.
In the next concept, we will walk through this example a bit more slowly to make
sure you have all the similarities between subqueries and these expressions down
for you to use in practice! If you are already feeling comfortable, skip ahead to the
quiz.
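As a sketch, the rep-by-region totals from the walkthrough that follows can be phrased as a CTE: define the derived table once up front with WITH, then query it by name.

```sql
-- rep_totals is a named derived table; the outer query reads from it
-- exactly as if it were a real table.
WITH rep_totals AS (
    SELECT s.name AS rep_name, r.name AS region_name,
           SUM(o.total_amt_usd) AS total_amt
    FROM sales_reps s
    JOIN accounts a ON a.sales_rep_id = s.id
    JOIN orders o   ON o.account_id = a.id
    JOIN region r   ON r.id = s.region_id
    GROUP BY 1, 2
)
SELECT region_name, MAX(total_amt) AS total_amt
FROM rep_totals
GROUP BY 1;
```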
Provide the name of the sales_rep in each region with the largest amount of
total_amt_usd sales.
First, I wanted to find the total_amt_usd totals associated with each sales rep, and
I also wanted the region in which they were located. The query below provided this
information.
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC;
Next, I pulled the max for each region, and then we can use this to pull those rows
in our final result.
SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1;
Essentially, this is a JOIN of these two tables, where the region and amount match.
SELECT t3.rep_name, t3.region_name, t3.total_amt
FROM(SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1) t2
JOIN (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC) t3
ON t3.region_name = t2.region_name AND t3.total_amt = t2.total_amt;
1.
For the region with the largest sales total_amt_usd, how many total orders were
placed?
The first query I wrote was to pull the total_amt_usd for each region.
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name;
Then we just want the region with the max amount from this table. There are two
ways I considered getting this amount. One was to pull the max using a subquery.
Another way is to order descending and just pull the top value.
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub;
Finally, we want to pull the total orders for the region with this amount:
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub);
2.
How many accounts had more total purchases than the account name which has
bought the most standard_qty paper throughout their lifetime as a customer?
First, we want to find the account that had the most standard_qty paper. The
query here pulls that account, as well as the total amount:
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
Now, I want to use this to pull all the accounts with more total sales:
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std,
SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) sub);
This is now a list of all the accounts with more total orders. We can get the count
with just another simple subquery.
SELECT COUNT(*)
FROM (SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std,
SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) inner_tab)
) counter_tab;
3.
For the customer that spent the most (in total over their lifetime as a customer)
total_amt_usd, how many web_events did they have for each channel?
Here, we first want to pull the customer with the most spent in lifetime value.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1;
Now, we want to look at the number of events on each channel this company had,
which we can match with just the id.
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id = (SELECT id
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1) inner_table)
GROUP BY 1, 2
ORDER BY 3 DESC;
4.
I added an ORDER BY for no real reason, and the account name to assure I
was only pulling from one account.
What is the lifetime average amount spent in terms of total_amt_usd for the top 10
total spending accounts?
First, we just want to find the top 10 accounts in terms of highest total_amt_usd.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10;
Then, we just want the average of these 10 amounts, which we can get by using the
query above as a subquery in the FROM clause. For a related question - including
only the companies that spend more per order, on average, than all orders overall -
we would first pull the accounts with more than this average amount:
SELECT o.account_id, AVG(o.total_amt_usd)
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o);
First, I wanted to find the total_amt_usd totals associated with each sales rep, and
I also wanted the region in which they were located. The query below provided this
information.
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC;
Next, I pulled the max for each region, and then we can use this to pull those rows
in our final result.
SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1;
Essentially, this is a JOIN of these two tables, where the region and amount match.
SELECT t3.rep_name, t3.region_name, t3.total_amt
FROM(SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1) t2
JOIN (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd)
total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC) t3
ON t3.region_name = t2.region_name AND t3.total_amt = t2.total_amt;
1.
For the region with the largest sales total_amt_usd, how many total orders were
placed?
The first query I wrote was to pull the total_amt_usd for each region.
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name;
Then we just want the region with the max amount from this table. There are two
ways I considered getting this amount. One was to pull the max using a subquery.
Another way is to order descending and just pull the top value.
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub;
Finally, we want to pull the total orders for the region with this amount:
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub);
2.
How many accounts had more total purchases than the account name which has
bought the most standard_qty paper throughout their lifetime as a customer?
First, we want to find the account that had the most standard_qty paper. The
query here pulls that account, as well as the total amount:
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
Now, I want to use this to pull all the accounts with more total sales:
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std,
SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) sub);
This is now a list of all the accounts with more total purchases. We can get the count
with just one more simple subquery.
SELECT COUNT(*)
FROM (SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std,
SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) inner_tab)
) counter_tab;
3.
For the customer that spent the most (in total over their lifetime as a customer)
total_amt_usd, how many web_events did they have for each channel?
Here, we first want to pull the customer with the most spent in lifetime value.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1;
Now, we want to look at the number of events on each channel this company had,
which we can match with just the id.
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id = (SELECT id
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1) inner_table)
GROUP BY 1, 2
ORDER BY 3 DESC;
I added an ORDER BY for no real reason, and the account name to assure I was pulling the correct data.
4.
What is the lifetime average amount spent in terms of total_amt_usd for the top 10
total spending accounts?
First, we just want to find the top 10 accounts in terms of highest total_amt_usd.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10;
Then, we just want the average of these 10 amounts:
SELECT AVG(tot_spent)
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10) temp;
5.
What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the typical account?
First, we want to pull the average of all accounts in terms of total_amt_usd. Then, we want to only pull the accounts with more than this average amount:
SELECT o.account_id, AVG(o.total_amt_usd)
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o);
The WITH statement is often called a Common Table Expression, or CTE. Though
these expressions serve the exact same purpose as subqueries, they are more
common in practice, as they tend to be cleaner for a future reader to follow the
logic. For example, two subqueries can be pulled out into named CTEs:
WITH table1 AS (
SELECT *
FROM web_events),
table2 AS (
SELECT *
FROM accounts)
SELECT *
FROM table1
JOIN table2
ON table1.account_id = table2.id;
Provide the name of the sales_rep in each region with the largest amount of
total_amt_usd sales.
WITH t1 AS (
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC),
t2 AS (
SELECT region_name, MAX(total_amt) total_amt
FROM t1
GROUP BY 1)
SELECT t1.rep_name, t1.region_name, t1.total_amt
FROM t1
JOIN t2
ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
1.
For the region with the largest sales total_amt_usd, how many total orders were
placed?
WITH t1 AS (
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name),
t2 AS (
SELECT MAX(total_amt)
FROM t1)
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2);
2.
For the account that purchased the most (in total over their lifetime as a customer)
standard_qty paper, how many accounts still had more in total purchases?
WITH t1 AS (
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1),
t2 AS (
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total FROM t1))
SELECT COUNT(*)
FROM t2;
3.
For the customer that spent the most (in total over their lifetime as a customer)
total_amt_usd, how many web_events did they have for each channel?
WITH t1 AS (
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1)
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id = (SELECT id FROM t1)
GROUP BY 1, 2
ORDER BY 3 DESC;
4.
What is the lifetime average amount spent in terms of total_amt_usd for the top 10
total spending accounts?
WITH t1 AS (
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10)
SELECT AVG(tot_spent)
FROM t1;
5.
What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the typical account?
WITH t1 AS (
SELECT AVG(o.total_amt_usd) avg_all
FROM orders o
JOIN accounts a
ON a.id = o.account_id),
t2 AS (
SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT * FROM t1))
SELECT AVG(avg_amt)
FROM t2;
SQL Cleaning
Here we looked at three new functions:
1. LEFT
2. RIGHT
3. LENGTH
LEFT pulls a specified number of characters for each row in a specified column
starting at the beginning (or from the left). As you saw here, you can pull the first
three digits of a phone number using LEFT(phone_number, 3).
RIGHT pulls a specified number of characters for each row in a specified column
starting at the end (or from the right). As you saw here, you can pull the last eight
digits of a phone number using RIGHT(phone_number, 8).
LENGTH provides the number of characters for each row of a specified column.
Here, you saw that we could use this to get the length of each phone number as
LENGTH(phone_number).
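LEFT and RIGHT are not spelled the same in every engine (SQLite, for example, expresses both with substr). A minimal runnable sketch of the same idea, using a made-up phone number:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
phone = "(617) 555-0134"  # hypothetical phone number for illustration
# substr(s, 1, 3) ~ LEFT(s, 3); substr(s, -8) ~ RIGHT(s, 8); length ~ LENGTH.
left3, right8, n = conn.execute(
    "SELECT substr(?, 1, 3), substr(?, -8), length(?)",
    (phone, phone, phone),
).fetchone()
print(left3, right8, n)  # (61 555-0134 14
```

The same three calls map directly onto LEFT(phone_number, 3), RIGHT(phone_number, 8), and LENGTH(phone_number) in PostgreSQL.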
1.
SELECT RIGHT(website, 3) AS domain, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
2.
SELECT LEFT(UPPER(name), 1) AS first_letter, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
There are 350 company names that start with a letter and 1 that starts with a
number. This gives a ratio of 350/351 company names that start with a letter, or
99.7%.
SELECT SUM(num) nums, SUM(letter) letters
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN
('0','1','2','3','4','5','6','7','8','9')
THEN 1 ELSE 0 END AS num,
CASE WHEN LEFT(UPPER(name), 1) IN
('0','1','2','3','4','5','6','7','8','9')
THEN 0 ELSE 1 END AS letter
FROM accounts) t1;
3.
There are 80 company names that start with a vowel and 271 that start with other
characters. This gives a ratio of 80/351, or 22.8%, that start with a vowel.
Here we looked at four new functions:
1. POSITION
2. STRPOS
3. LOWER
4. UPPER
POSITION takes a character and a column, and provides the index where that
character is for each row. The index of the first position is 1 in SQL. If you come
from another programming language, many begin indexing at 0. Here, you saw that
you can pull the index of a comma as POSITION(',' IN city_state).
STRPOS provides the same result as POSITION, but the syntax for achieving those
results is a bit different: STRPOS(city_state, ',').
Note, both POSITION and STRPOS are case sensitive, so looking for A is different
than looking for a. Therefore, if you want to pull an index regardless of the case of a letter, you might
want to use LOWER or UPPER to make all of the characters lower or uppercase.
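POSITION and STRPOS are Postgres names; SQLite (used below only so the example is runnable) calls the same thing instr. The case-sensitivity point, with LOWER as the fix, looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# instr is SQLite's STRPOS: 1-based index, 0 when the character is not found,
# and case sensitive, just like POSITION/STRPOS.
case_sensitive, lowered = conn.execute(
    "SELECT instr('Los Angeles, CA', 'a'), instr(LOWER('Los Angeles, CA'), 'a')"
).fetchone()
print(case_sensitive, lowered)  # 0 5 -- no lowercase 'a' until we LOWER the string
```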
In this video, you saw additional functionality for working with dates including:
1. TO_DATE
2. CAST
3. Casting with ::
Then you can change a string to a date using CAST. CAST is actually useful to
change lots of column types. Commonly you might be doing, as you saw here, a
conversion of a string to a date. However, you might want to make other changes to
your columns in terms of their data types as well.
In this example, you also saw that instead of CAST(date_column AS DATE), you can
use date_column::DATE.
Expert Tip
Most of the functions presented in this lesson are specific to strings. They won’t
work with dates, integers or floating-point numbers. However, using any of these
functions on other types will treat that data as a string for the purpose of the function.
LEFT, RIGHT, and TRIM are all used to select only certain elements of strings, but
using them to select elements of a number or date will treat them as strings for the
purpose of the function. Though we didn't cover TRIM in this lesson explicitly, it can
be used to remove characters from the beginning and end of a string. This can
remove unwanted spaces at the beginning or end of a row that often happen with
data imported from other systems.
There are a number of variations of these functions, as well as several other string
functions not covered here. Different databases use subtle variations on these
functions.
COALESCE(x, y, 'N/A')
Returns the first non-NULL argument: if x is NULL it falls back to y, and if y is also NULL it returns 'N/A'.
COALESCE can help you fill NULLs with sensible defaults, so you can visualize what the final, cleaned table should look like.
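A minimal runnable sketch (Python's sqlite3; the table and values are made up) showing COALESCE returning the first non-NULL argument for each row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (x TEXT, y TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)",
                 [("a", "b"), (None, "b"), (None, None)])
# COALESCE walks its arguments left to right and returns the first non-NULL one.
result = [r[0] for r in conn.execute("SELECT COALESCE(x, y, 'N/A') FROM demo")]
print(result)  # ['a', 'b', 'N/A']
```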
Self join - Combine rows of a table with other rows of the same table
a window function performs a calculation across a set of table rows that are
somehow related to the current row. This is comparable to the type of calculation
that can be done with an aggregate function. But unlike regular aggregate
functions, use of a window function does not cause rows to become grouped into a
single output row — the rows retain their separate identities. Behind the scenes,
the window function is able to access more than just the current row of the query
result.
Window functions introduce two statements that you may not be familiar with: OVER and PARTITION BY. These are key to
window functions. Not every window function uses PARTITION BY; we can also use
ORDER BY or no statement at all, depending on the query we want to run. You will
practice using these clauses in the upcoming quizzes.
Note: You can’t use window functions and standard aggregations in the same query. More specifically, you can’t include window functions in a GROUP BY clause.
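A tiny runnable illustration of the "rows retain their separate identities" point, using Python's sqlite3 (window functions need SQLite 3.25+) and made-up order amounts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
# GROUP BY would collapse these three rows into one; the window SUM keeps all
# three rows and computes a running total across them.
rows = conn.execute(
    "SELECT id, SUM(amt) OVER (ORDER BY id) AS running_total "
    "FROM orders ORDER BY id"
).fetchall()
print(rows)  # [(1, 10), (2, 30), (3, 60)]
```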
Now, modify your query from the previous quiz to include partitions. Still create a
running total of standard_amt_usd (in the orders table) over order time, but this
time, date truncate occurred_at by year and partition by that same year-truncated
occurred_at variable. Your final table should have three columns: One with the
amount being added for each row, one for the truncated date, and a final column with
the running total within each year:
SELECT standard_amt_usd,
DATE_TRUNC('year', occurred_at) AS year,
SUM(standard_amt_usd) OVER (PARTITION BY DATE_TRUNC('year', occurred_at) ORDER BY occurred_at) AS running_total
FROM orders;
ROW_NUMBER() - as the name suggests, it assigns a row number to each row according to the ORDER BY part of the OVER
clause.
RANK() - same as above, but rows with the same ORDER BY value receive the same rank (the next rank after a tie is skipped); DENSE_RANK() is similar but does not skip ranks after ties.
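The difference between the three numbering functions is easiest to see on a column with a tie. A runnable sketch with sqlite3 and made-up values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (20,), (30,)])
rows = sorted(conn.execute("""
    SELECT v,
           ROW_NUMBER() OVER (ORDER BY v) AS row_num,
           RANK()       OVER (ORDER BY v) AS rnk,
           DENSE_RANK() OVER (ORDER BY v) AS dense_rnk
    FROM t
""").fetchall())
# ROW_NUMBER always increments; RANK repeats on the tie and then skips to 4;
# DENSE_RANK repeats on the tie but does not skip.
print(rows)  # [(10, 1, 1, 1), (20, 2, 2, 2), (20, 3, 2, 2), (30, 4, 4, 3)]
```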
Select the id, account_id, and total variable from the orders table, then create a
column called total_rank that ranks this total amount of paper ordered (from
highest to lowest) for each account using a partition. Your final table should have
these four columns:
SELECT id,
account_id,
total,
RANK() OVER (PARTITION BY account_id ORDER BY total DESC) AS total_rank
FROM orders;
Aggregates in window function
SELECT id,
account_id,
standard_qty,
DATE_TRUNC('month',occurred_at) AS month,
DENSE_RANK() OVER (PARTITION BY account_id ORDER BY
DATE_TRUNC('month',occurred_at)) AS dense_rank,
SUM(standard_qty) OVER (PARTITION BY account_id ORDER BY
DATE_TRUNC('month',occurred_at)) AS sum_std_qty,
COUNT(standard_qty) OVER (PARTITION BY account_id ORDER BY
DATE_TRUNC('month',occurred_at)) AS count_std_qty,
AVG(standard_qty) OVER (PARTITION BY account_id ORDER BY
DATE_TRUNC('month',occurred_at)) AS avg_std_qty,
MIN(standard_qty) OVER (PARTITION BY account_id ORDER BY
DATE_TRUNC('month',occurred_at)) AS min_std_qty,
MAX(standard_qty) OVER (PARTITION BY account_id ORDER BY
DATE_TRUNC('month',occurred_at)) AS max_std_qty
FROM orders;
Because of ORDER BY, rows sharing the same ORDER BY value are treated together within the window. The ORDER and PARTITION together define the "window" - the ordered subset of
data over which calculations are made. Removing ORDER BY just leaves an
unordered partition; in our query's case, each column's value is simply an
aggregation (sum, count, average, minimum, or maximum) of all the standard_qty values in its respective account_id.
The easiest way to think about this - leaving the ORDER BY out is equivalent to
"ordering" in a way that all rows in the partition are "equal" to each other. Indeed,
you can get the same effect by explicitly adding the ORDER BY clause like this: ORDER
BY 0 (or "order by" any constant expression), or even, more emphatically, ORDER BY
NULL.
SELECT id,
account_id,
DATE_TRUNC('year',occurred_at) AS year,
total_amt_usd,
SUM(total_amt_usd) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('year',occurred_at)) AS running_total
FROM orders;
LAG function
Purpose
It returns the value from a previous row to the current row in the table.
Step 1:
Let’s first look at the inner query and see what this creates.
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1
What you see after running this SQL code:
1. The query sums the standard_qty amounts for each account_id to give the total
standard paper each account has purchased over all time.
Step 2:
We start building the outer query, and name the inner query as sub.
SELECT account_id, standard_sum
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1
) sub
This still returns the same table you see above, which is also shown below.
Step 3 (Part A):
We add the window function OVER (ORDER BY standard_sum) in the outer query,
which will create a result set in ascending order based on the standard_sum column.
SELECT account_id,
standard_sum,
LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1
) sub
This ordered column will set us up for the other part of the Window Function (see
below).
Step 3 (Part B):
The LAG function creates a new column named lag, which uses the values from the ordered standard_sum (Part A within Step 3):
SELECT account_id,
standard_sum,
LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM demo.orders
GROUP BY 1
) sub
Each row’s value in lag is pulled from the previous row. E.g., for account_id 1901, the
value in lag will come from the previous row. However, since there is no previous row
to pull from, the value in lag for account_id 1901 will be NULL. For account_id 3371, the
value in lag will be pulled from the previous row (i.e., account_id 1901), which will be 0.
Step 4:
To compare the values between the rows, we need to use both columns (standard_sum
and lag). We add a new column named lag_difference, which subtracts the lag value
from the value in standard_sum for each row in the table:
SELECT account_id,
standard_sum,
LAG(standard_sum) OVER (ORDER BY standard_sum) AS lag,
standard_sum - LAG(standard_sum) OVER (ORDER BY standard_sum)
AS lag_difference
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1
) sub
Each value in lag_difference is comparing the row values between the 2 columns
(standard_sum and lag). E.g., since the value for lag in the case of account_id 1901 is
NULL, the value in lag_difference for account_id 1901 will be NULL. However, for
account_id 3371, the value in lag_difference will compare the value 79 (standard_sum for
account_id 3371) with 0 (lag for account_id 3371), resulting in 79. This goes on for each
row in the table.
LEAD function
Purpose:
Return the value from the row following the current row in the table.
Step 1:
Let’s first look at the inner query and see what this creates.
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM demo.orders
GROUP BY 1
1. The query sums the standard_qty amounts for each account_id to give the total
standard paper each account has purchased over all time.
Step 2:
We start building the outer query, and name the inner query as sub.
SELECT account_id,
standard_sum
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM demo.orders
GROUP BY 1
) sub
This will produce the same table as above, but sets us up for the next part.
Step 3 (Part A):
We add the window function OVER (ORDER BY standard_sum) in the outer query, which will create a result set in ascending order based on the standard_sum column.
SELECT account_id,
standard_sum,
LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM demo.orders
GROUP BY 1
) sub
This ordered column will set us up for the other part of the Window Function (see
below).
Step 3 (Part B):
The LEAD function in the window function statement creates a new column named
lead: LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead.
This new column named lead uses the values from standard_sum (in the ordered table
from Step 3 (Part A)). Each row’s value in lead is pulled from the row after it. E.g., for
account_id 1901, the value in lead will come from the row following it (i.e., for
account_id 3371). Since the value is 79, the value in lead for account_id 1901 will be 79.
For account_id 3371, the value in lead will be pulled from the following row (i.e.,
account_id 1961), which will be 102. This goes on for each row in the table.
SELECT account_id,
standard_sum,
LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM demo.orders
GROUP BY 1
) sub
Step 4:
To compare the values between the rows, we need to use both columns
(standard_sum and lead). We add a column named lead_difference, which subtracts the
value in standard_sum from lead for each row in the table: LEAD(standard_sum) OVER
(ORDER BY standard_sum) - standard_sum AS lead_difference.
SELECT account_id,
standard_sum,
LEAD(standard_sum) OVER (ORDER BY standard_sum) AS lead,
LEAD(standard_sum) OVER (ORDER BY standard_sum) -
standard_sum AS lead_difference
FROM (
SELECT account_id,
SUM(standard_qty) AS standard_sum
FROM orders
GROUP BY 1
) sub
Each value in lead_difference is comparing the row values between the 2 columns
(standard_sum and lead). E.g., for account_id 1901, the value in lead_difference will
compare the value 0 (standard_sum for account_id 1901) with 79 (lead for account_id
1901) resulting in 79. This goes on for each row in the table.
You can use LAG and LEAD functions whenever you are trying to compare the values in
adjacent rows, or rows that are offset by a certain number.
Example 1: You have a sales dataset with the following data and need to compare how
each entity's revenue differs from the one before it:
Entity Revenue
A $550
B $500
C $670
D $730
E $982
Example 2: You have an inventory dataset with the following data and need to compare
the number of days elapsed between each subsequent order placed for Item A.
As you can see, these are useful data analysis tools that you can use for more complex
analysis!
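Example 1 can be run end to end. A sketch with Python's sqlite3, loading the five rows above (labels A-E) and using LAG to get each row's change from the previous one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (account TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("A", 550), ("B", 500), ("C", 670), ("D", 730), ("E", 982)])
# LAG(revenue) pulls the previous row's revenue (NULL for the first row),
# so revenue - LAG(revenue) is the row-over-row change.
rows = conn.execute("""
    SELECT account,
           revenue,
           revenue - LAG(revenue) OVER (ORDER BY account) AS change
    FROM sales
    ORDER BY account
""").fetchall()
print(rows)
# [('A', 550, None), ('B', 500, -50), ('C', 670, 170), ('D', 730, 60), ('E', 982, 252)]
```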
NTILE
You can use window functions to identify what percentile (or quartile, or any other
subdivision) a given row falls into. The syntax is NTILE(# of buckets), where ORDER BY
determines which column to use to split the rows into that many buckets. Imagine you're an analyst at Parch & Posey and you want to
determine the largest orders (in terms of quantity) a specific customer has made to
encourage them to order more similarly sized large orders. You only want to
compare each order against that same customer's other orders, so you partition by account_id.
In the SQL Explorer below, write three queries (separately) that reflect each of the
following:
1. Use the NTILE functionality to divide the accounts into 4 levels in terms of
the amount of standard_qty for their orders. Your resulting table should
have the account_id, the occurred_at time for each order, the total amount
of standard_qty paper purchased, and one of four levels in a
standard_quartile column.
2. Use the NTILE functionality to divide the accounts into two levels in terms of
the amount of gloss_qty for their orders. Your resulting table should have
the account_id, the occurred_at time for each order, the total amount of
gloss_qty paper purchased, and one of two levels in a gloss_half column.
3. Use the NTILE functionality to divide the orders for each account into 100
levels in terms of the amount of total_amt_usd for their orders. Your
resulting table should have the account_id, the occurred_at time for each
order, the total amount of total_amt_usd paper purchased, and one of 100
levels in a total_percentile column.
Note: To make it easier to interpret the results, order by the account_id in each of
the queries.
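A runnable NTILE sketch (sqlite3, made-up quantities for a single account) showing how NTILE(4) buckets eight orders into quartiles:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, standard_qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (1, ?)",
                 [(q,) for q in (10, 20, 30, 40, 50, 60, 70, 80)])
# NTILE(4) deals the partition's rows, in standard_qty order, into 4 even buckets.
rows = conn.execute("""
    SELECT standard_qty,
           NTILE(4) OVER (PARTITION BY account_id ORDER BY standard_qty) AS quartile
    FROM orders
    ORDER BY standard_qty
""").fetchall()
print(rows)  # [(10, 1), (20, 1), (30, 2), (40, 2), (50, 3), (60, 3), (70, 4), (80, 4)]
```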
In earlier lessons, we covered inner joins, which produce results only where the join
condition is matched in both tables.
Venn diagrams, which are helpful for visualizing table joins, are provided below along with
sample queries. Consider the circle on the left Table A and the circle on the right Table B.
INNER JOIN Venn Diagram
SELECT column_name(s)
FROM Table_A
INNER JOIN Table_B
ON Table_A.column_name = Table_B.column_name;
Left joins also include unmatched rows from the left table, which is indicated in the
“FROM” clause.
LEFT JOIN Venn Diagram
SELECT column_name(s)
FROM Table_A
LEFT JOIN Table_B
ON Table_A.column_name = Table_B.column_name;
Right joins are similar to left joins, but include unmatched data from the right table --
the one that is in the JOIN clause.
RIGHT JOIN Venn Diagram
SELECT column_name(s)
FROM Table_A
RIGHT JOIN Table_B
ON Table_A.column_name = Table_B.column_name;
In some cases, you might want to include unmatched rows from both tables being
joined. You can do this with a full outer join.
FULL OUTER JOIN Venn Diagram
SELECT column_name(s)
FROM Table_A
FULL OUTER JOIN Table_B
ON Table_A.column_name = Table_B.column_name;
A common application of this is when joining two tables on a timestamp. Let’s say
you’ve got one table containing the number of item 1 sold each day, and another
containing the number of item 2 sold. If a certain date, like January 1, 2018, exists in the
left table but not the right, while another date, like January 2, 2018, exists in the right
table but not the left:
● a left join would drop the row with January 2, 2018 from the result set
● a right join would drop January 1, 2018 from the result set
The only way to make sure both January 1, 2018 and January 2, 2018 make it into the
results is to do a full outer join. A full outer join returns unmatched records in each
table with null values for the columns that came from the opposite table.
If you wanted to return unmatched rows only, which is useful for some cases of data
assessment, you can isolate them by adding the following line to the end of the query:
WHERE Table_A.column_name IS NULL OR Table_B.column_name IS NULL
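FULL OUTER JOIN is not supported everywhere (MySQL lacks it, and SQLite only added it in 3.39). A common workaround, sketched here with Python's sqlite3 and made-up item1_sales/item2_sales tables for the January example above, is to UNION a left join with the mirrored left join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item1_sales (day TEXT, qty INTEGER);
    CREATE TABLE item2_sales (day TEXT, qty INTEGER);
    INSERT INTO item1_sales VALUES ('2018-01-01', 5);
    INSERT INTO item2_sales VALUES ('2018-01-02', 7);
""")
# The first left join keeps unmatched left-table rows; the UNION with the
# mirrored left join adds unmatched right-table rows, so both dates survive,
# each with NULLs for the columns from the table that had no match.
rows = conn.execute("""
    SELECT a.day, a.qty AS item1_qty, b.qty AS item2_qty
    FROM item1_sales a LEFT JOIN item2_sales b ON a.day = b.day
    UNION
    SELECT b.day, a.qty, b.qty
    FROM item2_sales b LEFT JOIN item1_sales a ON a.day = b.day
    ORDER BY 1
""").fetchall()
print(rows)  # [('2018-01-01', 5, None), ('2018-01-02', None, 7)]
```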