SQL For Data Analysis Notes
1. Basic SQL
1.1 SQL Introduction
This section covers the importance of SQL in data analysis. While SQL has various applications in
software development, this course focuses primarily on its use in data analysis. The
course is designed to be accessible to people from different backgrounds,
including marketing, operations, and finance, and aims to equip learners with the skills
to use data effectively in decision-making. Derek highlights that a significant portion of
analysts' and data scientists' work is done in SQL.
1.3
This section introduces the concept of using spreadsheets to store data and explains how multiple
spreadsheets can be visualized using an Entity Relationship Diagram (ERD). It highlights that
SQL is a language used to interact with databases, allowing users to query data from one or
multiple tables. Understanding the relationships between tables is crucial for gaining insights
from data efficiently.
In the context of databases and SQL (Structured Query Language), a query is a request for data
or information from a database. Queries are used to retrieve, update, insert, or delete data in a
database.
For example, a simple SQL query to retrieve every column from the orders table (the table used
in the examples throughout these notes) might look like this:
SELECT * FROM orders;
Knowing that all the values in a column have the same data type means that obtaining data from a
database can still be fast.
Types of Statements
The key to SQL is understanding statements. A few statements include:
3. SELECT allows you to read data and display it. This is called a query.
The SELECT statement is the most common statement used by analysts, and you will be learning all
about it throughout this course!
1. SELECT indicates which column(s) you want data for. Column names are separated by
commas.
2. FROM specifies from which table(s) you want to select the columns. Notice the
columns need to exist in this table.
If you want the data from all columns in the table, you use "*", like so:
SELECT * FROM orders;
Note that using SELECT does not create a new table with these columns in the database. It just
provides the data to you as the results, or output, of this command.
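To try SELECT and FROM without the course environment, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. The in-memory database and the three sample rows are assumptions made up for illustration; only the table and column names follow the notes.

```python
import sqlite3

# Hypothetical miniature of the orders table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1001, 973.43), (2, 1001, 1718.03), (3, 1011, 776.18)],
)

# SELECT names the columns, FROM names the table.
two_columns = conn.execute("SELECT id, total_amt_usd FROM orders;").fetchall()

# * asks for every column in the table.
all_columns = conn.execute("SELECT * FROM orders;").fetchall()

# The queries only returned results: the database still holds one table.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table';").fetchall()

print(two_columns)
print(all_columns)
print(tables)
```

The last query is there to check the point made above: running SELECT leaves the database with exactly the tables it had before.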
You will use this SQL SELECT statement in every query in this course, but you will be learning a
few additional statements and operators that can be used along with it to ask more
advanced questions of your data.
A semicolon is used to indicate the end of a statement you're executing. While the
semicolon is not always required at the end of a single SQL statement, it is recommended to
use it for clarity and to avoid errors when executing multiple statements.
LIMIT
We have already seen the SELECT (to choose columns) and FROM (to choose tables)
statements. The LIMIT statement is useful when you want to see just the first few rows
of a table. This can be much faster for loading than if we load the entire dataset.
The LIMIT command is always the very last part of a query. I'd say it is roughly the SQL
equivalent of head() in R.
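A small sketch of what LIMIT does, again using Python's sqlite3 module with a made-up orders table of 50 rows (both are assumptions for illustration, not the course data):

```python
import sqlite3

# Hypothetical orders table with 50 made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, 100.0 * i) for i in range(1, 51)],
)

# Without LIMIT, every row comes back.
all_rows = conn.execute("SELECT id, total_amt_usd FROM orders;").fetchall()

# With LIMIT 5, only five rows come back.
first_rows = conn.execute("SELECT id, total_amt_usd FROM orders LIMIT 5;").fetchall()

print(len(all_rows), len(first_rows))
```

Without an ORDER BY, which five rows you get is up to the database, so LIMIT on its own is best treated as a quick peek at the table.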
ORDER BY
The ORDER BY statement allows us to sort our results using the data in any column. If
you are familiar with Excel or Google Sheets, using ORDER BY is similar to sorting a
sheet using a column. A key difference, however, is that using ORDER BY in a SQL
query only has temporary effects, for the results of that query, unlike sorting a
sheet by column in Excel or Sheets.
In other words, when you use ORDER BY in a SQL query, your output will be sorted that
way, but then the next query you run will encounter the unsorted data again. It's
important to keep in mind that this is different than using common spreadsheet
software, where sorting the spreadsheet by column actually alters the data in that sheet
until you undo or change that sorting. This highlights the meaning and function of a SQL
"query."
The ORDER BY statement always comes in a query after
the SELECT and FROM statements, but before the LIMIT statement. If you are using
the LIMIT statement, it will always appear last. As you learn additional commands, the
order of these statements will matter more.
Pro Tip
Remember DESC can be added after the column in your ORDER BY statement to sort in
descending order, as the default is to sort in ascending order.
Examples:
SELECT id, occurred_at, total_amt_usd
FROM orders
ORDER BY occurred_at
LIMIT 10;
- The 10 earliest orders.
SELECT id, occurred_at, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC
LIMIT 5;
- The five most expensive orders.
SELECT id, occurred_at, total_amt_usd
FROM orders
ORDER BY total_amt_usd
LIMIT 20;
- The 20 cheapest orders.
We can ORDER BY more than one column at a time. When you
provide a list of columns in an ORDER BY command, the sorting occurs using the
leftmost column in your list first, then the next column from the left, and so on. We still
have the ability to flip the way we order using DESC, separately for each column. If several
rows share the same value in the first column, those rows are then ordered among themselves
by the second column, and so on.
- Note: ordering by columns does not remove meaning or interpretation along the
rows. Each row moves as a whole, so the values within a row always stay
together.
Example:
SELECT id, account_id, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC, account_id;
- Order by amount spent in descending order first, then by account ID in
ascending order.
A better way of saying this:
Suppose query #1 sorts by account ID first and then by total dollar amount in descending order
(ORDER BY account_id, total_amt_usd DESC). All of the orders for each account ID are grouped
together, and then within each of those groupings, the orders appear from the greatest order
amount to the least. Query #2 is the example above (ORDER BY total_amt_usd DESC, account_id).
Since it sorts by the total dollar amount first, the orders appear from greatest to least
regardless of which account ID they were from. Then they are sorted by account ID. (The
secondary sorting by account ID is difficult to see, since it only matters when two orders have
equal total dollar amounts.)
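The difference between the two sort orders is easier to see on a tiny table. The sketch below uses Python's sqlite3 module with four made-up orders (an assumption for illustration), two of which share the same total so that the secondary sort becomes visible.

```python
import sqlite3

# Hypothetical orders: ids 2 and 3 have the same total on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1001, 500.0), (2, 1011, 900.0), (3, 1001, 900.0), (4, 1011, 200.0)],
)

# Account first, then amount (largest first) within each account.
by_account = conn.execute(
    "SELECT id, account_id, total_amt_usd FROM orders "
    "ORDER BY account_id, total_amt_usd DESC;"
).fetchall()

# Amount first (largest first); account ID only breaks ties.
by_amount = conn.execute(
    "SELECT id, account_id, total_amt_usd FROM orders "
    "ORDER BY total_amt_usd DESC, account_id;"
).fetchall()

print([row[0] for row in by_account])  # ids in account-first order
print([row[0] for row in by_amount])   # ids in amount-first order
```

In both results every row keeps its own id, account_id and total_amt_usd together. Only the order of the rows changes.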
Question: If you command SQL to order a table according to a non-numerical column, what
happens?
When you use the ORDER BY clause in SQL to sort a table based on a non-numerical column
(such as a text or string column), SQL will sort the results in alphabetical order by default.
1. Alphabetical Order: If the column contains text values, SQL will arrange the rows based
on the alphabetical order of those text values. For example, if you have a column with
names, the results will be sorted from A to Z.
2. Case Sensitivity: The sorting may be case-sensitive or case-insensitive depending on
the database system and its configuration. For instance, in some systems, uppercase
letters may be sorted before lowercase letters.
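As one concrete case of the case-sensitivity point, here is a sketch using SQLite through Python's sqlite3 module. SQLite is an assumption here, and other databases may sort differently. Its default text comparison puts uppercase letters before lowercase ones, and COLLATE NOCASE (SQLite-specific syntax) asks for a case-insensitive sort. The fruit names are made up.

```python
import sqlite3

# Hypothetical table with mixed-case text values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [("cherry",), ("apple",), ("Banana",)])

# SQLite's default comparison: uppercase 'B' sorts before lowercase 'a'.
default_sort = [r[0] for r in conn.execute("SELECT name FROM items ORDER BY name;")]

# Case-insensitive sort (SQLite-specific collation name).
nocase_sort = [
    r[0] for r in conn.execute("SELECT name FROM items ORDER BY name COLLATE NOCASE;")
]

print(default_sort)
print(nocase_sort)
```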
WHERE
Using the WHERE statement, we can display subsets of tables based on conditions that
must be met. You can also think of the WHERE command as filtering the data.
In the upcoming concepts, you will learn some common operators that are useful with the
WHERE statement.
Common symbols used in WHERE statements include: > (greater than),
< (less than), >= (greater than or equal to), <= (less than or equal to), = (equal to), != (not
equal to)
Note, each row is a data point. Also, the WHERE statement comes after FROM and before ORDER BY.
Examples:
SELECT *
FROM orders
WHERE gloss_amt_usd > 1000
LIMIT 5;
- The first five rows where the gloss amount is greater than 1000.
SELECT *
FROM orders
WHERE total_amt_usd < 500
LIMIT 10;
- The first 10 rows where the total amount is less than 500.
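A quick sketch of numeric filtering with WHERE, run through Python's sqlite3 module on four made-up orders (the data is an assumption for illustration). One row has a gloss amount of exactly 1000 to show the difference between > and >=.

```python
import sqlite3

# Hypothetical orders: id 3 has a gloss amount of exactly 1000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, gloss_amt_usd REAL, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1500.0, 2000.0), (2, 200.0, 450.0), (3, 1000.0, 1200.0), (4, 50.0, 100.0)],
)

def ids_where(condition):
    """Return the ids of the rows that satisfy the given WHERE condition."""
    query = f"SELECT id FROM orders WHERE {condition} ORDER BY id;"
    return [row[0] for row in conn.execute(query)]

strictly_greater = ids_where("gloss_amt_usd > 1000")   # excludes the 1000 row
greater_or_equal = ids_where("gloss_amt_usd >= 1000")  # includes the 1000 row
small_totals = ids_where("total_amt_usd < 500")

print(strictly_greater, greater_or_equal, small_totals)
```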
The WHERE statement can also be used with non-numeric data. We can use
the = and != operators here. You need to be sure to use single quotes (just be careful if
you have quotes in the original text) with the text data, not double quotes.
Commonly when we are using WHERE with non-numeric data fields, we use
the LIKE, NOT, or IN operators. We will see those before the end of this lesson!
Example:
SELECT name, website, primary_poc
FROM accounts
WHERE name = 'Exxon Mobil';
- Filter the accounts table to include the company name, website, and the
primary point of contact (primary_poc) just for the Exxon Mobil company.
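The warning about quotes inside the text can be made concrete. In standard SQL, an apostrophe inside a single-quoted string is written as two single quotes. The sketch below checks this with Python's sqlite3 module. The accounts rows, websites and contact names are placeholders made up for illustration.

```python
import sqlite3

# Hypothetical accounts rows (placeholder websites and contacts).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, website TEXT, primary_poc TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        ("Exxon Mobil", "www.example.com/exxon", "Contact A"),
        ("McDonald's", "www.example.com/mcd", "Contact B"),
        ("Walmart", "www.example.com/walmart", "Contact C"),
    ],
)

# Text values go in single quotes.
exxon = conn.execute(
    "SELECT name, website, primary_poc FROM accounts WHERE name = 'Exxon Mobil';"
).fetchall()

# An apostrophe inside the text is written as two single quotes.
mcd = conn.execute("SELECT name FROM accounts WHERE name = 'McDonald''s';").fetchall()

# != keeps every row whose name is different.
not_exxon = conn.execute(
    "SELECT name FROM accounts WHERE name != 'Exxon Mobil' ORDER BY name;"
).fetchall()

print(exxon)
print(mcd)
print(not_exxon)
```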
DERIVED COLUMNS
Creating a new column that is a combination of existing columns is known as
a derived column (or "calculated" or "computed" column). Usually you want to give a
name, or "alias," to your new column using the AS keyword.
This derived column, and its alias, are generally only temporary, existing just for the
duration of your query. The next time you run a query and access this table, the new
column will not be there.
If you are deriving the new column from existing columns using a mathematical
expression, then these familiar mathematical operators will be useful:
* (Multiplication), + (Addition), - (Subtraction), / (Division)
Consider this example:
SELECT id, (standard_amt_usd/total_amt_usd)*100 AS std_percent, total_amt_usd
FROM orders
LIMIT 10;
Here we divide the standard paper dollar amount by the total order amount to find the
standard paper percent for the order, and use the AS keyword to name this new column
"std_percent."
LOGICAL OPERATORS
In the next concepts, you will be learning about Logical Operators. Logical
Operators include:
1. LIKE This allows you to perform operations similar to using WHERE and =, but for
cases when you might not know exactly what you are looking for.
2. IN This allows you to perform operations similar to using WHERE and =, but for
more than one condition.
3. NOT This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a
certain condition.
4. AND & BETWEEN These allow you to combine operations where all combined
conditions must be true.
5. OR This allows you to combine operations where at least one of the combined
conditions must be true.
Using LIKE
The LIKE operator is extremely useful for working with text. You will use LIKE within
a WHERE clause. The LIKE operator is frequently used with %. The % tells us that we
might want any number of characters leading up to a particular set of characters or
following a certain set of characters.
Remember you will need to use single quotes for the text you pass to the LIKE operator.
Lowercase and uppercase letters are not the same within the string: in the course
environment, searching for 'T' is not the same as searching for 't' (some other databases
ignore case in LIKE by default). Some SQL environments outside the classroom also accept
double quotes around text.
Examples:
SELECT *
FROM accounts
WHERE name LIKE 'C%';
- All the companies whose names start with 'C'.
Placing the Wildcards
• % at the beginning ('%co'): Matches any string that ends with the specified characters.
• % at the end ('C%'): Matches any string that starts with the specified characters.
• % on both sides ('%al%'): Matches any string that contains the specified characters.
• % in the middle ('C%o'): Matches any string that starts with the characters before the %
and ends with the characters after it.
- Why it works this way: % stands for "any sequence of zero or more characters", so
wherever you place it is the part of the string that is allowed to vary.
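The wildcard placements can be checked on a handful of names. This sketch uses Python's sqlite3 module, and the company list is made up for illustration. SQLite's LIKE ignores case for plain English letters by default, unlike the course environment, so the names and patterns here were chosen so that case makes no difference to the results.

```python
import sqlite3

# Hypothetical company names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("Coca-Cola",), ("Cisco",), ("Costco",), ("Walmart",), ("Apple",)],
)

def names_like(pattern):
    """Return the names matching a LIKE pattern, in sorted order."""
    query = "SELECT name FROM accounts WHERE name LIKE ? ORDER BY name;"
    return [row[0] for row in conn.execute(query, (pattern,))]

starts_with_c = names_like("C%")    # % at the end
ends_with_co = names_like("%co")    # % at the beginning
contains_al = names_like("%al%")    # % on both sides
c_then_o = names_like("C%o")        # % in the middle

print(starts_with_c, ends_with_co, contains_al, c_then_o)
```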
IN
The IN operator is useful for working with both numeric and text columns. This operator
allows you to use an =, but for more than one item of that particular column. We can
check one, two or many column values for which we want to pull data, but all within the
same query. In the upcoming concepts, you will see the OR operator that would also
allow us to perform these tasks, but the IN operator is a cleaner way to write these
queries.
Text values in the list need quotes. Single quotes always work, and some SQL environments
also accept double quotes.
Examples:
SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name IN ('Walmart', 'Target', 'Nordstrom');
- The rows for the named companies.
SELECT *
FROM web_events
WHERE channel = 'organic' OR channel = 'adwords';
- The channel must be either of the two. This is the OR form of
WHERE channel IN ('organic', 'adwords').
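A short sketch confirming that IN and a chain of OR conditions select the same rows. It uses Python's sqlite3 module with five made-up web_events rows (an assumption for illustration).

```python
import sqlite3

# Hypothetical web_events rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, channel TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [(1, "organic"), (2, "adwords"), (3, "direct"), (4, "organic"), (5, "facebook")],
)

# One IN with a list of values ...
with_in = conn.execute(
    "SELECT id FROM web_events WHERE channel IN ('organic', 'adwords') ORDER BY id;"
).fetchall()

# ... selects the same rows as one = comparison per value, joined by OR.
with_or = conn.execute(
    "SELECT id FROM web_events "
    "WHERE channel = 'organic' OR channel = 'adwords' ORDER BY id;"
).fetchall()

print(with_in)
print(with_in == with_or)
```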
NOT
The NOT operator is an extremely useful operator for working with the previous two
operators we introduced: IN and LIKE. By specifying NOT LIKE or NOT IN, we can grab
all of the rows that do not meet a particular criterion.
Examples:
SELECT name, primary_poc, sales_rep_id
FROM accounts
WHERE name NOT IN ('Walmart', 'Target', 'Nordstrom');
- We excluded the companies Walmart, Target, and Nordstrom.
SELECT *
FROM accounts
WHERE name NOT LIKE 'C%';
- All companies whose names do not start with C.
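A sketch of NOT IN and NOT LIKE on made-up company names, using Python's sqlite3 module (the data is an assumption for illustration). It also shows a practical pitfall: NOT IN compares exact values, so a misspelled name in the list excludes nothing and the row it was meant to remove stays in the output.

```python
import sqlite3

# Hypothetical company names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("Walmart",), ("Target",), ("Nordstrom",), ("Costco",), ("Apple",)],
)

def names_where(condition):
    """Return the names that satisfy the given WHERE condition, sorted."""
    query = f"SELECT name FROM accounts WHERE {condition} ORDER BY name;"
    return [row[0] for row in conn.execute(query)]

excluded_three = names_where("name NOT IN ('Walmart', 'Target', 'Nordstrom')")
not_starting_c = names_where("name NOT LIKE 'C%'")

# A deliberately misspelled value matches no row, so Nordstrom is not excluded.
with_misspelling = names_where("name NOT IN ('Walmart', 'Target', 'Nordstorm')")

print(excluded_three)
print(not_starting_c)
print(with_misspelling)
```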
OR
Similar to the AND operator, the OR operator can combine multiple statements. Each
time you link a new statement with an OR, you will need to specify the column you are
interested in looking at. You may link as many statements as you would like to consider
at the same time. This operator works with all of the operations we have seen so far:
arithmetic operators (+, *, -, /), LIKE, IN, NOT, AND, and BETWEEN logic can
all be linked together using the OR operator.
When combining several of these operations, we frequently need to use
parentheses to ensure that the logic we want is executed correctly, because AND is
evaluated before OR. The first example below shows one of these situations.
Examples:
SELECT *
FROM orders
WHERE standard_qty = 0 AND (gloss_qty > 1000 OR poster_qty > 1000);
- Orders with a standard quantity of zero and either gloss or poster quantities
greater than 1000.
SELECT *
FROM accounts
WHERE name IN ('C%', 'W%')
AND primary_poc IN ('%ana%', '%Ana%')
AND primary_poc NOT LIKE '%eana%';
- This didn't work, because IN compares against exact values and treats 'C%' as literal
text rather than as a pattern. The following does work:
SELECT *
FROM accounts
WHERE (name LIKE 'C%' OR name LIKE 'W%')
AND ((primary_poc LIKE '%ana%' OR primary_poc LIKE '%Ana%') AND primary_poc NOT LIKE '%eana%');
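Both points from these examples can be checked on small made-up tables. The sketch below uses Python's sqlite3 module, and all rows are assumptions for illustration. It compares the first query with and without parentheses, and then IN against LIKE when the values contain %.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical orders: id 5 has a non-zero standard_qty but a large poster_qty.
conn.execute(
    "CREATE TABLE orders (id INTEGER, standard_qty INTEGER, gloss_qty INTEGER, poster_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 0, 1500, 10), (2, 0, 10, 2000), (3, 100, 1500, 10), (4, 0, 10, 10), (5, 100, 10, 2000)],
)

# With parentheses: standard_qty must be 0 in every returned row.
with_parens = [r[0] for r in conn.execute(
    "SELECT id FROM orders "
    "WHERE standard_qty = 0 AND (gloss_qty > 1000 OR poster_qty > 1000) ORDER BY id;"
)]

# Without parentheses AND is evaluated first, so poster_qty > 1000 alone is enough.
without_parens = [r[0] for r in conn.execute(
    "SELECT id FROM orders "
    "WHERE standard_qty = 0 AND gloss_qty > 1000 OR poster_qty > 1000 ORDER BY id;"
)]

# Hypothetical company names for the IN versus LIKE comparison.
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?)", [("Costco",), ("Walmart",), ("Apple",)])

# IN looks for names that are literally 'C%' or 'W%': there are none.
in_with_wildcards = [r[0] for r in conn.execute(
    "SELECT name FROM accounts WHERE name IN ('C%', 'W%') ORDER BY name;"
)]

# LIKE treats % as a wildcard.
like_with_or = [r[0] for r in conn.execute(
    "SELECT name FROM accounts WHERE name LIKE 'C%' OR name LIKE 'W%' ORDER BY name;"
)]

print(with_parens, without_parens)
print(in_with_wildcards, like_with_or)
```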
SUMMARY
You have already learned a lot about writing code in SQL! Let's take a moment to recap
all that we have covered before moving on:
Statement   How to Use It                  Other Details
SELECT      SELECT Col1, Col2, ...         Provide the columns you want
FROM        FROM Table                     Provide the table where the columns exist
IN          WHERE Col IN ('Y', 'N')        A filter for only rows with column of 'Y' or 'N'
NOT         WHERE Col NOT IN ('Y', 'N')    NOT is frequently used with LIKE and IN
AND         WHERE Col1 > 5 AND Col2 < 3    Filter rows where two or more conditions must be true
Other Tips
Though SQL keywords are not case sensitive (SQL doesn't care if you write your statements as all
uppercase or lowercase), we discussed some best practices. Text inside quotes can still be
case sensitive, as noted under LIKE. The order of the key
words does matter! Using what you know so far, you will want to write your statements
as:
SELECT col1, col2
FROM table1
WHERE col3 > 5 AND col4 LIKE '%os%'
ORDER BY col5
LIMIT 10;
Notice, you can retrieve different columns than those being used in the ORDER
BY and WHERE statements. Assuming all of these column names existed in this way
(col1, col2, col3, col4, col5) within a table called table1, this query would run just fine.
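To check that claim, the sketch below builds a table1 with those five columns and runs the query as written, using Python's sqlite3 module. The rows are made up for illustration. It also tries the same clauses with LIMIT placed before ORDER BY, which SQLite rejects as a syntax error.

```python
import sqlite3

# Hypothetical table1 with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT, col2 INTEGER, col3 INTEGER, col4 TEXT, col5 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [
        ("a", 1, 6, "Boston", 3),
        ("b", 2, 7, "Cosmos", 1),
        ("c", 3, 2, "Boston", 2),
        ("d", 4, 9, "Denver", 0),
    ],
)

# Clauses in the required order: SELECT, FROM, WHERE, ORDER BY, LIMIT.
rows = conn.execute(
    "SELECT col1, col2 "
    "FROM table1 "
    "WHERE col3 > 5 AND col4 LIKE '%os%' "
    "ORDER BY col5 "
    "LIMIT 10;"
).fetchall()

# The same clauses with LIMIT before ORDER BY do not parse.
try:
    conn.execute("SELECT col1, col2 FROM table1 LIMIT 10 ORDER BY col5;")
    wrong_order_failed = False
except sqlite3.OperationalError:
    wrong_order_failed = True

print(rows)
print(wrong_order_failed)
```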
2. SQL Joins