
SQL For Data Analysis Notes

The document provides an introduction to SQL, focusing on its application in data analysis and the use of a fictional dataset from a company called Parch and Posey. It covers fundamental SQL concepts such as querying data, creating tables, and using various statements like SELECT, WHERE, and ORDER BY to manipulate and retrieve data. Additionally, it explains best practices for writing SQL queries and introduces logical operators for filtering data effectively.

Uploaded by

srtmusara40

1. Basic SQL
1.1 SQL Introduction
This section covers the importance of SQL in data analysis. While SQL has various applications in
software development, this course focuses primarily on its use in data analysis. The
course is designed to be accessible to people from different backgrounds,
including marketing, operations, and finance, and aims to equip learners with the skills
to use data effectively in decision-making. Derek highlights that a significant portion of
analysts' and data scientists' work is done in SQL.

1.2 The Parch and Posey Dataset


The video "The Parch & Posey Database" introduces the fictional company Parch and Posey, which sells
paper and has 50 sales representatives across four regions in the United States. It offers three
types of paper: regular, poster, and glossy, primarily targeting large Fortune 100 companies
through advertising on platforms like Google, Facebook, and Twitter. The video explains that the
data from Parch and Posey will be used throughout the course to simulate real-world problems,
allowing learners to use SQL to answer questions such as which product line is
underperforming and which marketing channels warrant greater investment.

1.3
This lesson introduces the idea of storing data in spreadsheets and explains how several related
spreadsheets (tables) can be visualized using an Entity Relationship Diagram (ERD). It highlights that
SQL is a language used to interact with databases, allowing users to query data from one or
multiple tables. Understanding the relationships between tables is crucial for gaining insights
from data efficiently.
In the context of databases and SQL (Structured Query Language), a query is a request for data
or information from a database. Queries are used to retrieve, update, insert, or delete data in a
database.

For example, a simple SQL query to retrieve data from a table might look like this:

SELECT * FROM employees;

This query requests all columns from the "employees" table.


How Databases Store Data


The video titled "How Databases Store Data" explains the fundamental structure of databases
and how they organize data. It compares database tables to Excel spreadsheets, noting that
both consist of rows and columns. However, in databases, each column must have a unique
name, and all data within a column must be of the same type (e.g., all numbers or all text). This
consistency in data types allows for efficient data analysis and retrieval. The video emphasizes
the importance of descriptive column names to clarify the meaning of the data stored in each
column.

An entire column is treated as one kind of data: quantitative, discrete, or some sort of string.

Knowing that every value in a column is of the same type is part of what keeps obtaining data
from a database fast.

Types of Statements
The key to SQL is understanding statements. A few statements include:

1. CREATE TABLE is a statement that creates a new table in a database.

2. DROP TABLE is a statement that removes a table in a database.

3. SELECT allows you to read data and display it. This is called a query.

The SELECT statement is the most common statement used by analysts, and you will be learning all
about it throughout this course!
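
To see the three statements side by side, here is a minimal sketch using Python's built-in sqlite3 module with a throwaway in-memory database. The table name demo_accounts and its single row are made up for illustration, and SQLite is an assumption here (the course's own environment may differ, and SQLite is looser about column types than most databases).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# CREATE TABLE: each column gets a unique name and a declared type.
conn.execute("CREATE TABLE demo_accounts (id INTEGER, name TEXT)")
conn.execute("INSERT INTO demo_accounts VALUES (1, 'Walmart')")

# SELECT: read the data back and display it. This is a query.
rows = conn.execute("SELECT id, name FROM demo_accounts").fetchall()
print(rows)

# DROP TABLE: remove the table from the database.
conn.execute("DROP TABLE demo_accounts")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining)  # no tables left
```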

SELECT & FROM


Here you were introduced to the SQL command that will be used in every query you write:
SELECT ... FROM ....

1. SELECT indicates which column(s) you want data for. Separate multiple columns with
commas.

2. FROM specifies the table(s) you want to select the columns from. The
columns need to exist in this table.

If you want to be provided with the data from all columns in the table, you use "*", like so:

• SELECT * FROM orders

Note that using SELECT does not create a new table with these columns in the database; it just
provides the data to you as the results, or output, of this command.

You will use the SELECT statement in every query in this course, but you will be learning a
few additional statements and operators that can be used along with it to ask more
advanced questions of your data.

A semicolon indicates the end of the statement you're executing. While the
semicolon is not always required at the end of a single SQL statement, it is recommended to
use it for clarity and to avoid errors when executing multiple statements.
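
A small sketch of SELECT ... FROM, again assuming Python's sqlite3 and a tiny made-up orders table (the three rows are invented, not the real Parch and Posey data). It picks named columns, then all columns with *, and then checks that the query did not add any table to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, account_id INTEGER, total_amt_usd REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1001, 973.43), (2, 1001, 1718.03), (3, 1011, 776.18)],  # invented rows
)

# Named columns, separated by commas.
some_columns = conn.execute("SELECT id, account_id FROM orders;").fetchall()

# * means every column in the table.
all_columns = conn.execute("SELECT * FROM orders;").fetchall()

# SELECT only returns results; the database still holds just the one table.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table';"
).fetchall()

print(some_columns)
print(all_columns)
print(tables)
```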

Formatting Best Practices


• SQL is case-insensitive. It is, however, common and best practice to capitalize all SQL
commands, like SELECT and FROM, and keep everything else in your query lowercase.
Capitalizing command words makes queries easier to read, which will matter more as
you write more complex queries.
• It is common to use underscores and avoid spaces in column names. Spaces in names are
a bit annoying to work with in SQL.
• SQL queries ignore extra whitespace, so you can add as many spaces and blank lines between
code as you want, and the queries are the same. Take for example:
SELECT account_id FROM orders
• Depending on your SQL environment, your query may need a semicolon at the end to
execute. Other environments are more flexible in terms of this being a "requirement." It
is considered best practice to put a semicolon at the end of each statement, which also
allows you to run multiple queries at once if your environment allows this. Not at the
end of each line though!
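
The whitespace and capitalization points can be checked with a quick sketch. It assumes Python's sqlite3 and an invented two-row orders table, and runs the same query written three ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1001), (2, 1011)])

compact = "SELECT account_id FROM orders;"

# Extra spaces and blank lines do not change the query.
spread_out = """
SELECT   account_id

FROM     orders;
"""

# Lowercase keywords work too; capitals are only a readability convention.
lower_case = "select account_id from orders;"

results = [conn.execute(q).fetchall() for q in (compact, spread_out, lower_case)]
print(results[0] == results[1] == results[2])
```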

LIMIT
We have already seen the SELECT (to choose columns) and FROM (to choose tables)
statements. The LIMIT statement is useful when you want to see just the first few rows
of a table. This can be much faster than loading the entire dataset.
The LIMIT command is always the very last part of a query. It's roughly the SQL equivalent of
head() in R, I'd say.
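
A minimal LIMIT sketch, assuming Python's sqlite3 and an invented five-row orders table. Only the row count is checked, since without ORDER BY the database decides which rows come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amt_usd REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, 100.0 * i) for i in range(1, 6)],  # five invented orders
)

# LIMIT goes last and caps how many rows come back.
first_rows = conn.execute("SELECT id, total_amt_usd FROM orders LIMIT 3;").fetchall()
print(len(first_rows))
```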

ORDER BY
The ORDER BY statement allows us to sort our results using the data in any column. If
you are familiar with Excel or Google Sheets, using ORDER BY is similar to sorting a
sheet using a column. A key difference, however, is that using ORDER BY in a SQL
query only has temporary effects, for the results of that query, unlike sorting a
sheet by column in Excel or Sheets.
In other words, when you use ORDER BY in a SQL query, your output will be sorted that
way, but then the next query you run will encounter the unsorted data again. It's
important to keep in mind that this is different than using common spreadsheet
software, where sorting the spreadsheet by column actually alters the data in that sheet
until you undo or change that sorting. This highlights the meaning and function of a SQL
"query."
The ORDER BY statement always comes in a query after
the SELECT and FROM statements, but before the LIMIT statement. If you are using
the LIMIT statement, it will always appear last. As you learn additional commands, the
order of these statements will matter more.
Pro Tip
Remember DESC can be added after the column in your ORDER BY statement to sort in
descending order, as the default is to sort in ascending order.
Examples:
SELECT id, occurred_at, total_amt_usd FROM orders ORDER BY occurred_at LIMIT 10; -- 10 earliest orders
SELECT id, occurred_at, total_amt_usd FROM orders ORDER BY total_amt_usd DESC LIMIT 5; -- five most expensive orders
SELECT id, occurred_at, total_amt_usd FROM orders ORDER BY total_amt_usd LIMIT 20; -- 20 cheapest orders
Here, we saw that we can ORDER BY more than one column at a time. When you
provide a list of columns in an ORDER BY command, the sorting occurs using the
leftmost column in your list first, then the next column from the left, and so on. We still
have the ability to flip the way we order using DESC. If there are multiple identical
entries for the first column, the program then goes on to order the second column
according to the command for those identical first column entries and so on.
- Note: ordering by a column moves whole rows, so the values in each row stay
together. Sorting does NOT lose meaning or interpretation along the rows.
Example:
SELECT id, account_id, total_amt_usd
FROM orders
ORDER BY total_amt_usd DESC, account_id;
-- Sort by amount spent in descending order first, then by account ID in ascending order.
A better way of saying the note above:
In query #1 (ORDER BY account_id, total_amt_usd DESC), all of the orders for each account ID are
grouped together, and then within each of those groupings, the orders appear from the greatest
order amount to the least. In query #2 (ORDER BY total_amt_usd DESC, account_id, as in the example
above), since you sorted by the total dollar amount first, the orders appear from greatest to least
regardless of which account ID they were from. Then they are sorted by account ID next. (The
secondary sorting by account ID is difficult to see here, since only if there were two orders with
equal total dollar amounts would there need to be any sorting by account ID.)
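
To make the difference between the two orderings visible, here is a sketch assuming Python's sqlite3 and four invented orders, two of which tie on total_amt_usd so the secondary sort actually shows up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, account_id INTEGER, total_amt_usd REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1001, 500.0), (2, 1011, 900.0), (3, 1001, 900.0), (4, 1011, 100.0)],
)

# Account first: orders are grouped per account, largest amount first in each group.
by_account = conn.execute(
    "SELECT id FROM orders ORDER BY account_id, total_amt_usd DESC;"
).fetchall()

# Amount first: largest orders overall; account_id only breaks the tie at 900.
by_amount = conn.execute(
    "SELECT id FROM orders ORDER BY total_amt_usd DESC, account_id;"
).fetchall()

print([r[0] for r in by_account])  # [3, 1, 2, 4]
print([r[0] for r in by_amount])   # [3, 2, 1, 4]
```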

Question: If you command SQL to order a table according to a non-numerical column, what
happens?

When you use the ORDER BY clause in SQL to sort a table based on a non-numerical column
(such as a text or string column), SQL will sort the results in alphabetical order by default.

Here's how it works:

1. Alphabetical Order: If the column contains text values, SQL will arrange the rows based
on the alphabetical order of those text values. For example, if you have a column with
names, the results will be sorted from A to Z.
2. Case Sensitivity: The sorting may be case-sensitive or case-insensitive depending on
the database system and its configuration. For instance, in some systems, uppercase
letters may be sorted before lowercase letters.
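
The case-sensitivity point can be seen in a short sketch. It assumes Python's sqlite3, where the default text ordering happens to put uppercase letters before lowercase ones, and COLLATE NOCASE is SQLite's way to ask for a case-insensitive sort. Other databases may behave differently by default. The fruit names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?)", [("cherry",), ("apple",), ("Banana",)]
)

# Default ordering in SQLite: 'B' sorts before any lowercase letter.
default_sort = conn.execute("SELECT name FROM items ORDER BY name;").fetchall()

# Case-insensitive ordering, requested explicitly.
nocase_sort = conn.execute(
    "SELECT name FROM items ORDER BY name COLLATE NOCASE;"
).fetchall()

print([r[0] for r in default_sort])  # ['Banana', 'apple', 'cherry']
print([r[0] for r in nocase_sort])   # ['apple', 'Banana', 'cherry']
```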

WHERE
Using the WHERE statement, we can display subsets of tables based on conditions that
must be met. You can also think of the WHERE command as filtering the data.
The video shows how this can be used, and in the upcoming concepts, you will
learn some common operators that are useful with the WHERE statement.
Common symbols used in WHERE statements include: > (greater than),
< (less than), >= (greater than or equal to), <= (less than or equal to), = (equal to), != (not
equal to).
Note, each row is a data point. Also, the WHERE statement comes before ORDER BY.
Examples:
SELECT * FROM orders WHERE gloss_amt_usd > 1000 LIMIT 5; -- first five rows where gloss amount is greater than 1000
SELECT * FROM orders WHERE total_amt_usd < 500 LIMIT 10; -- first 10 rows where total amount is less than 500

The WHERE statement can also be used with non-numeric data. We can use
the = and != operators here. Be sure to put text values in single quotes, not double quotes
(and be careful if the text itself contains quotes).
Commonly when we are using WHERE with non-numeric data fields, we use
the LIKE, NOT, or IN operators. We will see those before the end of this lesson!
Example:
SELECT name, website, primary_poc FROM accounts WHERE name = 'Exxon Mobil'; -- the company name, website, and primary point of contact (primary_poc) for just the Exxon Mobil company in
the accounts table.
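
A sketch of WHERE with both numeric and text conditions, assuming Python's sqlite3. The rows, the website values, and the contact names are all placeholders invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, gloss_amt_usd REAL, total_amt_usd REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1200.0, 1500.0), (2, 0.0, 300.0), (3, 50.0, 450.0)],
)
conn.execute("CREATE TABLE accounts (name TEXT, website TEXT, primary_poc TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        ("Exxon Mobil", "www.example.com", "Contact A"),  # placeholder values
        ("Walmart", "www.example.org", "Contact B"),
    ],
)

# Numeric filters use the comparison symbols directly.
big_gloss = conn.execute(
    "SELECT id FROM orders WHERE gloss_amt_usd > 1000 ORDER BY id;"
).fetchall()
small_total = conn.execute(
    "SELECT id FROM orders WHERE total_amt_usd < 500 ORDER BY id;"
).fetchall()

# Text filters need the value in single quotes.
exxon = conn.execute(
    "SELECT name, website, primary_poc FROM accounts WHERE name = 'Exxon Mobil';"
).fetchall()

print(big_gloss, small_total, exxon)
```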

DERIVED COLUMNS
Creating a new column that is a combination of existing columns is known as
a derived column (or "calculated" or "computed" column). Usually you want to give a
name, or "alias," to your new column using the AS keyword.
This derived column, and its alias, are generally only temporary, existing just for the
duration of your query. The next time you run a query and access this table, the new
column will not be there.
If you are deriving the new column from existing columns using a mathematical
expression, then these familiar mathematical operators will be useful:
* (Multiplication), + (Addition), - (Subtraction), / (Division)
Consider this example:
SELECT id, (standard_amt_usd/total_amt_usd)*100 AS std_percent, total_amt_usd
FROM orders LIMIT 10;
Here we divide the standard paper dollar amount by the total order amount to find the
standard paper percent for the order, and use the AS keyword to name this new column
"std_percent."

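A sketch of the std_percent example, assuming Python's sqlite3 and two invented orders. It also shows one thing worth watching for: when both sides of / are integers, SQLite (like several other databases) does integer division, so the amounts here are stored as REAL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, standard_amt_usd REAL, total_amt_usd REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 250.0, 1000.0), (2, 300.0, 400.0)],  # invented amounts
)

cur = conn.execute(
    "SELECT id, (standard_amt_usd / total_amt_usd) * 100 AS std_percent, "
    "total_amt_usd FROM orders ORDER BY id LIMIT 10;"
)
column_names = [d[0] for d in cur.description]  # the alias appears in the output
derived = cur.fetchall()

# Integer / integer truncates; a decimal operand gives the expected fraction.
int_div, real_div = conn.execute("SELECT 1 / 2, 1.0 / 2;").fetchone()

# The derived column existed only in that result; the table is unchanged.
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders);")]

print(column_names, derived, int_div, real_div, table_columns)
```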
LOGIC OPERATORS
In the next concepts, you will be learning about Logical Operators. Logical
Operators include:
1. LIKE: This allows you to perform operations similar to using WHERE and =, but for
cases when you might not know exactly what you are looking for.
2. IN: This allows you to perform operations similar to using WHERE and =, but for
more than one condition.
3. NOT: This is used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a
certain condition.
4. AND & BETWEEN: These allow you to combine operations where all combined
conditions must be true.
5. OR: This allows you to combine operations where at least one of the combined
conditions must be true.

Using LIKE
The LIKE operator is extremely useful for working with text. You will use LIKE within
a WHERE clause. The LIKE operator is frequently used with %. The % stands for any number
of characters (including none), so we can match any characters leading up to a particular
set of characters or following a certain set of characters, as we saw with the google
syntax in the video.
Remember you will need to use single quotes for the text you pass to the LIKE operator.
In the classroom environment, lowercase and uppercase letters are not the same within the
string: searching for 'T' is not the same as searching for 't'. Some SQL environments outside
the classroom also accept double quotes for text, and some treat LIKE as case-insensitive.
Examples:
SELECT * FROM accounts WHERE name LIKE 'C%'; -- all the companies whose names start with 'C'
Placing the wildcards
• % at the beginning ('%abc'): Matches any string that ends with the specified characters.
• % at the end ('abc%'): Matches any string that starts with the specified characters.
• % on both sides ('%abc%'): Matches any string that contains the specified characters.
- I tried asking the AI why these placements work this way. What I'm
getting is that it is simply designed like this: % stands for any run of characters, and the
remaining characters must appear exactly where they are written in the pattern.
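
The wildcard placements can be tried out in a sketch. It assumes Python's sqlite3 and five company names picked for illustration. One caveat: SQLite's LIKE ignores case for ASCII letters by default, unlike the classroom environment described in these notes, so the names were chosen to give the same answer either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("Coca-Cola",), ("Costco",), ("Walmart",), ("Cisco Systems",), ("Target",)],
)

def names_like(pattern):
    """Return the account names matching a LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM accounts WHERE name LIKE ? ORDER BY name;", (pattern,)
    ).fetchall()
    return [r[0] for r in rows]

starts_with_c = names_like("C%")   # % at the end: starts with C
ends_with_s = names_like("%s")     # % at the beginning: ends with s
contains_art = names_like("%art%") # % on both sides: contains art
c_then_s = names_like("C%s")       # % in the middle: starts with C, ends with s

print(starts_with_c, ends_with_s, contains_art, c_then_s)
```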

IN
The IN operator is useful for working with both numeric and text columns. This operator
allows you to use an =, but for more than one item of that particular column. We can
check one, two, or many column values for which we want to pull data, but all within the
same query. In the upcoming concepts, you will see the OR operator that would also
allow us to perform these tasks, but the IN operator is a cleaner way to write these
queries.
Text values go in single quotes. Some SQL environments also accept double quotes.
Examples:
SELECT name, primary_poc, sales_rep_id FROM accounts WHERE name IN ('Walmart', 'Target', 'Nordstrom'); -- rows for the named companies
SELECT * FROM web_events WHERE channel = 'organic' OR channel = 'adwords'; -- the OR version: channel must be either of the two, same as channel IN ('organic', 'adwords')
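
A quick check that IN and the chain of OR conditions return the same rows, assuming Python's sqlite3 and four invented web events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, channel TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [(1, "organic"), (2, "adwords"), (3, "direct"), (4, "organic")],
)

with_in = conn.execute(
    "SELECT id FROM web_events WHERE channel IN ('organic', 'adwords') ORDER BY id;"
).fetchall()

with_or = conn.execute(
    "SELECT id FROM web_events "
    "WHERE channel = 'organic' OR channel = 'adwords' ORDER BY id;"
).fetchall()

print(with_in == with_or, with_in)
```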

NOT
The NOT operator is an extremely useful operator for working with the previous two
operators we introduced: IN and LIKE. By specifying NOT LIKE or NOT IN, we can grab
all of the rows that do not meet a particular criterion.
Examples:
SELECT name, primary_poc, sales_rep_id FROM accounts WHERE name NOT IN ('Walmart', 'Target', 'Nordstrom'); -- excludes the companies Walmart, Target, and Nordstrom
SELECT * FROM accounts WHERE name NOT LIKE 'C%'; -- all companies whose name
does not start with C.
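
A sketch of NOT IN and NOT LIKE, assuming Python's sqlite3 and five company names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("Walmart",), ("Target",), ("Nordstrom",), ("Costco",), ("Apple",)],
)

# Everything except the three listed companies.
not_in = conn.execute(
    "SELECT name FROM accounts "
    "WHERE name NOT IN ('Walmart', 'Target', 'Nordstrom') ORDER BY name;"
).fetchall()

# Everything whose name does not start with C.
not_like = conn.execute(
    "SELECT name FROM accounts WHERE name NOT LIKE 'C%' ORDER BY name;"
).fetchall()

print([r[0] for r in not_in])
print([r[0] for r in not_like])
```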

AND and BETWEEN


The AND operator is used within a WHERE statement to consider more than one logical
clause at a time. Each time you link a new statement with an AND, you will need to
specify the column you are interested in looking at. You may link as many statements as
you would like to consider at the same time. This operator works with all of the
operations we have seen so far, including arithmetic operators (+, *, -, /). LIKE, IN,
and NOT logic can also be linked together using the AND operator. When working with
the same column, you can write cleaner code by using the BETWEEN statement
instead. When you use BETWEEN, the endpoints are included!
Examples:
SELECT * FROM orders WHERE standard_qty >= 1000 AND poster_qty = 0 AND gloss_qty = 0; -- Note: conditions must be joined with AND (or OR). Separating them with commas is a syntax error, not an OR.
SELECT * FROM accounts WHERE name NOT LIKE 'C%' AND name LIKE '%s'; -- company names that do not start with C but end with s
SELECT * FROM web_events WHERE channel IN ('organic', 'adwords') AND occurred_at BETWEEN '2016-01-01' AND '2017-01-01' ORDER BY occurred_at DESC;
- You will notice that using BETWEEN is tricky for dates! While BETWEEN is
generally inclusive of endpoints, a date given without a time is treated as 00:00:00 (i.e.
midnight) at the start of that day. An endpoint of '2016-12-31' would therefore miss
everything that happened later on December 31. This is the reason why we set the
right-side endpoint
of the period at '2017-01-01'.
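
The endpoint behaviour can be checked in a sketch, assuming Python's sqlite3 and invented rows. SQLite stores these timestamps as text and compares them as strings, which is an assumption of this example. For the rows used here that gives the same result as the midnight rule described in the note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, standard_qty INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 3), (2, 4), (3, 5), (4, 6)]
)
conn.execute("CREATE TABLE web_events (id INTEGER, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [
        (1, "2016-06-15 12:30:00"),
        (2, "2016-12-31 18:45:00"),  # late on the last day of 2016
        (3, "2017-03-02 09:00:00"),
    ],
)

# BETWEEN includes both endpoints, so it matches the >= AND <= version.
between_qty = conn.execute(
    "SELECT id FROM orders WHERE standard_qty BETWEEN 3 AND 5 ORDER BY id;"
).fetchall()
and_qty = conn.execute(
    "SELECT id FROM orders WHERE standard_qty >= 3 AND standard_qty <= 5 ORDER BY id;"
).fetchall()

# Ending at '2016-12-31' misses the event at 18:45 on that day.
ends_dec_31 = conn.execute(
    "SELECT id FROM web_events "
    "WHERE occurred_at BETWEEN '2016-01-01' AND '2016-12-31' ORDER BY id;"
).fetchall()

# Ending at '2017-01-01' captures all of 2016.
ends_jan_1 = conn.execute(
    "SELECT id FROM web_events "
    "WHERE occurred_at BETWEEN '2016-01-01' AND '2017-01-01' ORDER BY id;"
).fetchall()

print(between_qty, ends_dec_31, ends_jan_1)
```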

OR
Similar to the AND operator, the OR operator can combine multiple statements. Each
time you link a new statement with an OR, you will need to specify the column you are
interested in looking at. You may link as many statements as you would like to consider
at the same time. This operator works with all of the operations we have seen so far,
including arithmetic operators (+, *, -, /). LIKE, IN, NOT, AND, and BETWEEN logic can
all be linked together using the OR operator.
When combining several of these operations, we frequently need to use
parentheses to ensure that the logic we want is executed correctly. The
video shows an example of one of these situations.
Examples:
SELECT * FROM orders WHERE standard_qty = 0 AND (gloss_qty > 1000 OR poster_qty > 1000); -- orders with standard quantity of zero and either gloss or poster quantities greater than 1000
SELECT * FROM accounts WHERE name IN ('C%', 'W%') AND primary_poc IN ('%ana%', '%Ana%') AND primary_poc NOT LIKE '%eana%'; -- didn't work, because IN compares against exact values and does not treat % as a wildcard. The following does work:
SELECT * FROM accounts WHERE (name LIKE 'C%' OR name LIKE 'W%') AND
((primary_poc LIKE '%ana%' OR primary_poc LIKE '%Ana%') AND primary_poc NOT LIKE
'%eana%');
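
Two things from these examples can be checked in a sketch, assuming Python's sqlite3 and invented rows: what the parentheses change when AND and OR are mixed, and why IN with a % pattern returns nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(id INTEGER, standard_qty INTEGER, gloss_qty INTEGER, poster_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 0, 1500, 0), (2, 0, 10, 20), (3, 500, 0, 2000)],
)
conn.execute("CREATE TABLE accounts (name TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?)", [("Costco",), ("Walmart",)])

# With parentheses: standard_qty must be 0 AND one of the two must exceed 1000.
with_parens = conn.execute(
    "SELECT id FROM orders WHERE standard_qty = 0 "
    "AND (gloss_qty > 1000 OR poster_qty > 1000) ORDER BY id;"
).fetchall()

# Without them, AND binds tighter than OR, so order 3 slips in through the OR.
without_parens = conn.execute(
    "SELECT id FROM orders WHERE standard_qty = 0 "
    "AND gloss_qty > 1000 OR poster_qty > 1000 ORDER BY id;"
).fetchall()

# IN looks for names exactly equal to 'C%' or 'W%', so nothing matches.
in_with_wildcards = conn.execute(
    "SELECT name FROM accounts WHERE name IN ('C%', 'W%');"
).fetchall()

# LIKE treats % as a wildcard, so both companies match.
like_with_wildcards = conn.execute(
    "SELECT name FROM accounts WHERE name LIKE 'C%' OR name LIKE 'W%' ORDER BY name;"
).fetchall()

print(with_parens, without_parens, in_with_wildcards, like_with_wildcards)
```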

CHECK THIS TOMORROW PROPERLY:


The query above that didn't work is meant to retrieve records from the accounts table
based on specific conditions for the name and primary_poc (primary point of contact)
columns. Here's a breakdown of what each part of the query is doing:
1. SELECT * FROM accounts: This part of the query selects all columns from
the accounts table.
2. WHERE name IN ('C%', 'W%'): This condition checks whether the name column
is exactly equal to 'C%' or 'W%'. IN does not treat % as a wildcard, so this does
not work as intended. Pattern matching needs
the LIKE operator. The correct usage would be:
WHERE (name LIKE 'C%' OR name LIKE 'W%')
3. AND primary_poc IN ('%ana%', '%Ana%'): Similar to the previous point, this
condition checks whether the primary_poc column is exactly equal to '%ana%' or
'%Ana%'. Again, IN is not the correct operator here. Instead, it should be:
AND (primary_poc LIKE '%ana%' OR primary_poc LIKE '%Ana%')
4. AND primary_poc NOT LIKE '%eana%': This condition ensures that
the primary_poc does not contain the substring 'eana' anywhere in the value.
Summary:
The intended purpose of the query is to select all records from the accounts table
where:
• The name starts with 'C' or 'W'.
• The primary_poc contains 'ana' or 'Ana'.
• The primary_poc does not contain 'eana'.
However, the use of IN with wildcards is incorrect, and the query may not return the
expected results. It would be better to use the LIKE operator for the pattern matching as
described above.

SUMMARY
You have already learned a lot about writing code in SQL! Let's take a moment to recap
all that we have covered before moving on:
Statement   How to Use It                  Other Details
SELECT      SELECT Col1, Col2, ...         Provide the columns you want
FROM        FROM Table                     Provide the table where the columns exist
LIMIT       LIMIT 10                       Limits the number of rows returned
ORDER BY    ORDER BY Col                   Orders table based on the column. Used with DESC.
WHERE       WHERE Col > 5                  A conditional statement to filter your results
LIKE        WHERE Col LIKE '%me%'          Only pulls rows where column has 'me' within the text
IN          WHERE Col IN ('Y', 'N')        A filter for only rows with column of 'Y' or 'N'
NOT         WHERE Col NOT IN ('Y', 'N')    NOT is frequently used with LIKE and IN
AND         WHERE Col1 > 5 AND Col2 < 3    Filter rows where two or more conditions must be true
OR          WHERE Col1 > 5 OR Col2 < 3     Filter rows where at least one condition must be true
BETWEEN     WHERE Col BETWEEN 3 AND 5      Often easier syntax than using an AND

Other Tips
Though SQL is not case sensitive (it doesn't care if you write your statements as all
uppercase or lowercase), we discussed some best practices. The order of the
keywords does matter! Using what you know so far, you will want to write your statements
as:
SELECT col1, col2
FROM table1
WHERE col3 > 5 AND col4 LIKE '%os%'
ORDER BY col5
LIMIT 10;
Notice, you can retrieve different columns than those being used in the ORDER
BY and WHERE statements. Assuming all of these column names existed in this way
(col1, col2, col3, col4, col5) within a table called table1, this query would run just fine.
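
To confirm that the query above runs with the clauses in that order, here is a sketch assuming Python's sqlite3. The table1 rows are invented so that two of them pass both WHERE conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 "
    "(col1 TEXT, col2 TEXT, col3 INTEGER, col4 TEXT, col5 INTEGER)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [
        ("a", "x", 10, "Boston", 3),      # passes both conditions
        ("b", "y", 2, "Boston", 1),       # fails col3 > 5
        ("c", "z", 8, "Chicago", 2),      # fails col4 LIKE '%os%'
        ("d", "w", 7, "Los Angeles", 1),  # passes both conditions
    ],
)

# Clause order: SELECT, FROM, WHERE, ORDER BY, LIMIT.
result = conn.execute(
    """
    SELECT col1, col2
    FROM table1
    WHERE col3 > 5 AND col4 LIKE '%os%'
    ORDER BY col5
    LIMIT 10;
    """
).fetchall()

print(result)  # [('d', 'w'), ('a', 'x')]
```

The selected columns (col1, col2) differ from the ones used for filtering and sorting, which matches the point made above.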
2. SQL Joins
