Exploratory Data Analysis in SQL - Edited
One note before we start. This course uses PostgreSQL. Many of the functions we'll use are
also available in other SQL database systems, but their names or syntax may be different. If you're
using another database system, you should refer to the system's documentation to learn the correct
syntax. With that, let's get started.
You've finally been granted access to your company's database. Yay! But
where do you begin? What are the tables? How are they related? What columns exist in the tables? A
database client is a program used to connect to, and work with, a database. There are many different
database clients. Each one has a different way to retrieve information on the table names, the columns in
each table, and the formal relationships between the tables. Refer to your client program's
documentation to find the commands to extract this information.
1.3. Entity relationship diagram
You may also be given information about the structure of the database from the database
owner or creator. One type of documentation is an entity-relationship diagram that shows the tables,
their columns, and the relationships between the tables. Here is the entity-relationship diagram for the
database for this course. There are six tables.
1.3.1. ER diagram: Evanston311
The evanston311 table contains help requests sent to the city of Evanston, Illinois.
1.3.2. ER diagram: fortune500
fortune500 contains information on the 500 largest US companies by revenue from 2017.
1.3.3. ER diagram: stackoverflow
stackoverflow contains data from the popular programming question and answer site. It
includes daily counts of the number of questions that were tagged as being related to select technology
companies.
1.3.4. ER diagram: supporting
Once you know the names of the tables in the database, one way to get a sense of what's in a
table is to simply select a few rows from it. Here we use the star to select all columns from the company
table and use limit to return only five rows. Remember that the rows returned from a table are in no
particular order by default.
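For example, a quick first look might use a query like this (the company table comes from this course's database):
-- Select all columns from the company table, limiting the result to five rows
SELECT *
FROM company
LIMIT 5;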
1.5. A few reminders
Code | Note
NULL | Missing value
IS NULL, IS NOT NULL | Don't use = NULL to check for NULL
count(*) | Number of rows
count(column_name) | Number of non-NULL values
count(DISTINCT column_name) | Number of different non-NULL values
SELECT DISTINCT column_name … | Distinct values, including NULL
As you start to explore the contents of a table, keep a few additional things in mind. NULL
indicates missing data in a database. To check which values are NULL, use "is NULL" or "is not
NULL", not an equals sign. The count function with a star counts the number of rows. If you instead
supply a column name to the count function, it counts the number of non-NULL observations in the
column. This is equal to the total number of rows, minus the number of NULL values. If you count the
distinct values of a column, you'll get the number of different non-NULL values in the column. But if
you select those distinct values directly, NULL will be included as a value if it exists in the column,
even though it isn't counted by the count function.
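As a quick sketch of these reminders, using the fortune500 table from this course:
-- Total rows, non-NULL industry values, and distinct non-NULL industry values
SELECT count(*) AS total_rows,
       count(industry) AS non_null_industry,
       count(DISTINCT industry) AS distinct_industry
FROM fortune500;

-- Distinct industry values; NULL appears in this result if the column contains it
SELECT DISTINCT industry
FROM fortune500;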
Exercise
Count missing values
Which column of fortune500 has the most missing values? To find out, you'll need to check each
column individually, although here we'll check just two: ticker and industry.
Course Note: While you're unlikely to encounter this issue during this exercise, note that if you
run a query that takes more than a few seconds to execute, your session may expire or you may
be disconnected from the server. You will not have this issue with any of the exercise solutions,
so if your session expires or disconnects, there's an error with your query.
Instructions 1/2:
• Subtract the count of the non-null ticker values from the total number of rows in
fortune500; alias the difference as missing.
Instructions 2/2:
• Repeat for the industry column: subtract the count of the non-null industry values from
the total number of rows in fortune500; alias the difference as missing.
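A possible solution for these two steps (the solution code isn't included in these notes) is sketched below:
-- Missing (NULL) values of ticker
SELECT count(*) - count(ticker) AS missing
FROM fortune500;

-- Missing (NULL) values of industry
SELECT count(*) - count(industry) AS missing
FROM fortune500;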
Exercise
Join tables
Part of exploring a database is figuring out how tables relate to each other.
The company and fortune500 tables don't have a formal relationship between them in the
database, but this doesn't prevent you from joining them.
To join the tables, you need to find a column that they have in common where the values are
consistent across the tables. Remember: just because two tables have a column with the same
name, it doesn't mean those columns necessarily contain compatible data. If you find more than
one pair of columns with similar data, you may need to try joining with each in turn to see if you
get the same number of results.
Instructions
• Closely inspect the contents of the company and fortune500 tables to find a column
present in both tables that can also be considered to uniquely identify each company.
• Join the company and fortune500 tables with an INNER JOIN.
• SELECT company.name
• -- Table(s) to select from
• FROM company
• INNER JOIN fortune500
• ON company.ticker=fortune500.ticker;
2. The keys to the database
Foreign keys are the formal way that database tables are linked together. In this example, the actor_id
column in the film_actor table is a foreign key that references the id column of the actor table.
• Reference another row
o In a different table or the same table
o Via a unique ID
➢ Primary key column containing unique, non-NULL values
• Values restricted to values in referenced column OR NULL
A foreign key is a column that references a single, specific row in the database. The referenced row is
usually in a different table, but foreign keys can reference rows in the same table as well. Foreign keys
reference other rows using a unique identifier for the row. The unique ID often comes from a primary
key column in the referenced table. Primary keys are specially designated columns where each row has
a unique, non-null value. Foreign key columns are restricted to contain either a value that is in the
referenced column, or null. If the value is null, it indicates that there's no relationship for that row.
2.2. ER diagram
Let's look at the entity relationship diagram for our database. In the diagram, foreign keys are indicated
on the arrows between tables.
The value before the colon is the name of the column in the table from which the arrow
originates. The value after the colon is the name of the referenced column in the table
the arrow is pointing to. So the company_id column in the tag_company table refers to
the id column in the company table.
When an arrow points from and to the same table, this is a self reference. parent_id in the company table
references the id column in the same table.
Note that there's no foreign key linking the company table to the fortune500 table. But this doesn't
prevent us from joining these tables. Both tables have ticker columns with comparable values that can
be used to join the tables. The lack of a foreign key relationship just means that the values in the ticker
columns aren't restricted to the set of values in the other table.
2.3. Primary Keys
The diagram also shows which columns are primary keys. Primary keys have a
border around them at the top of each list of columns. Primary keys uniquely identify
the rows in the table.
2.4. Coalesce function
Before you return to the exercises, let's add the coalesce function to your toolkit. coalesce takes two or
more values or column names as arguments. The three dots in square brackets here indicate that
additional values can be supplied as inputs. The coalesce function operates row-wise on the input. It
returns the first non-NULL value in each row, checking the columns in the order they're supplied to
the function.
Here's an example. We have a table called prices with two columns. Remember that blanks are null
values. We can use coalesce to combine these two columns. If column_1 is not null, coalesce
returns that value. If column_1 is null, coalesce returns the value of column_2. In this example, the first
value returned by coalesce is 10. This is because, in the first row of prices, the value of column_1 is
NULL. So coalesce returns the value of column_2. Coalesce returned four values because there
were four rows in the input. Coalesce is useful for specifying default or backup values when selecting
a column that might contain NULL values.
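A sketch of the example above, assuming the slide's table is named prices with columns column_1 and column_2:
-- Return column_1 when it isn't NULL; otherwise fall back to column_2
SELECT coalesce(column_1, column_2) AS combined
FROM prices;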
Exercise
Read an entity relationship diagram
The information you need is sometimes split across multiple tables in the database.
What is the most common stackoverflow tag_type? What companies have a tag of that
type? To generate a list of such companies, you'll need to join three tables together.
Reference the entity relationship diagram as needed when determining which columns to use
when joining tables.
Instructions 1/2:
• First, using the tag_type table, count the number of tags with each type.
• Order the results to find the most common tag type.
• -- Count the number of tags with each type
• SELECT type, COUNT(*) AS count
• FROM tag_type
• -- To get the count for each type, what do you need to do?
• GROUP BY type
-- Order the results with the most common tag types listed first
ORDER BY count DESC;
Instructions 2/2:
• Join the tag_company, company, and tag_type tables, keeping only mutually occurring
records.
• Select company.name, tag_type.tag, and tag_type.type for tags with the most common
type from the previous step.
• -- Select the 3 columns desired
• SELECT company.name, tag_type.tag, tag_type.type
• FROM company
• -- Join to the tag_company table
• INNER JOIN tag_company
• ON company.id = tag_company.company_id
• -- Join to the tag_type table
• INNER JOIN tag_type
• ON tag_company.tag = tag_type.tag
• -- Filter to most common type
• WHERE type='cloud';
Exercise
Coalesce
The coalesce() function can be useful for specifying a default or backup value when a column
contains NULL values.
coalesce() checks arguments in order and returns the first non-NULL value, if one exists.
• coalesce(NULL, 1, 2) = 1
• coalesce(NULL, NULL) = NULL
• coalesce(2, 3, NULL) = 2
In the fortune500 data, industry contains some missing values. Use coalesce() to use the value of
sector as the industry when industry is NULL. Then find the most common industry.
Instructions:
• Use coalesce() to select the first non-NULL value from industry, sector, or 'Unknown' as
a fallback value.
• Alias the result of the call to coalesce() as industry2.
• Count the number of rows with each industry2 value.
• Find the most common value of industry2.
• -- Use coalesce
• SELECT COALESCE(industry, sector, 'Unknown') AS industry2,
• -- Don't forget to count!
• COUNT(*)
• FROM fortune500
• -- Group by what? (What are you counting by?)
• GROUP BY industry2
• -- Order results to see most common first
• ORDER BY COUNT(*) DESC
• -- Limit results to get just the one value you want
• LIMIT 1;
3. Column types and constraints
Now it's time to turn to the contents of individual columns: the data types and the constraints on what
values can exist in each column.
Foreign keys and primary keys are two types of constraints that limit the values in a column, but
columns can also be constrained in other ways. Unique means that each value except NULL must be
different from the values in all other rows. Not NULL means what it says - the column cannot contain
null values. Check constraints are a way of implementing additional conditions on the values of a
column, such as requiring the column only contain positive values, or ensuring that the value of one
column is greater than the value of another column.
Common
• Numeric
• Character
• Date/Time
• Boolean
Special
• Arrays
• Monetary
• Binary
• Geometric
• Network Address
• XML
• JSON
• And more!
Constraints can limit the values in a column, but the main thing that determines what values
are allowed is the column's type. Each column in the database can only store one type of data. In this
course, we're talking about three of the most common types of data: numeric, character, and date/time.
These three, along with boolean - which holds true or false values - are the most common types you'll
encounter, but they're not the only ones. There are also special data types to hold monetary values,
geometric data like points or lines, and structured data types like XML and JSON. These special types
differ more across database implementations than the four common ones.
Within the broad categories of numeric, character, or date/time data, there are multiple column
types with different details. For example, different numeric types require different amounts of memory
per row and can store different ranges of values. In the upcoming chapters, we'll talk more about these
specific types, so no need to worry about the details at this point.
3.2.2. Types in entity relationship diagrams
You can find the type of each column in the entity relationship diagram. Here is the
fortune500 table. There are three different numeric data types used in the table: integer,
real, and numeric. Even if you don't have an entity relationship diagram, the column
type is a core piece of information you can expect to find in other kinds of
documentation.
Values can be converted temporarily from one type to another through a process called
casting. When you cast a column as a different type, the data is converted to the new type only for the
current query. To change a value's type, use the cast function. First, specify the value you want to cast.
This can be a single value or the name of a column. Then use the keyword AS. Finally, specify the name
of the type you want to convert the data to. Here's an example of casting the single numeric value 3-
point-7 as an integer. Casting from numeric to integer rounds the value to the nearest integer, which is
4. To convert the type of an entire column, enter the name of the column as the value. Here, a column
called total is converted to type integer. We need a from clause to specify which table the column
comes from.
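Sketches of both forms of the cast described above; the table name used for the column example is a placeholder, since the notes don't name it:
-- Cast a single value: numeric 3.7 is rounded to integer 4
SELECT CAST(3.7 AS integer);

-- Cast an entire column (prices is a placeholder table name)
SELECT CAST(total AS integer) AS total_int
FROM prices;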
3.4. Casting with ::
There's an alternate notation for casting values: a double colon. It does the same thing as the
cast function, but it's more compact. Put the value to convert before the double colon and the type to
cast it as after the double colon. The examples here are the same as those on the previous slide, except
with the double colon notation instead of the cast function.
Exercise
Effects of casting
When you cast data from one type to another, information can be lost or changed. See how the
casting changes values and practice casting data using the CAST() function and the :: syntax.
SELECT value::new_type;
Instructions 1/3:
• Select profits_change and profits_change cast as integer from fortune500.
• Look at how the values were converted.
• -- Select the original value
• SELECT profits_change,
• -- Cast profits_change
• CAST(profits_change AS integer) AS profits_change_int
• FROM fortune500;
Instructions 2/3:
• Compare the result of dividing the integer value 10 by 3 to the result of dividing the
numeric value 10 by 3.
-- Divide 10 by 3
SELECT 10/3,
-- Cast 10 as numeric and divide by 3
10::numeric/3;
Instructions 3/3:
• Now cast numbers that appear as text as numeric.
• Note: 1e3 is scientific notation.
SELECT '3.2'::numeric,
'-123'::numeric,
'1e3'::numeric,
'1e-3'::numeric,
'02314'::numeric,
'0002'::numeric;
Exercise
Summarize the distribution of numeric values
Was 2017 a good or bad year for revenue of Fortune 500 companies? Examine how
revenue changed from 2016 to 2017 by first looking at the distribution of revenues_change and
then counting companies whose revenue increased.
Instructions 1/3:
• Use GROUP BY and count() to examine the values of revenues_change.
• Order the results by revenues_change to see the distribution.
-- Select the count of each value of revenues_change
SELECT revenues_change, COUNT(*)
FROM fortune500
GROUP BY revenues_change
-- order by the values of revenues_change
ORDER BY revenues_change;
Instructions 2/3:
• Repeat step 1, but this time, cast revenues_change as an integer to reduce the number of
different values.
• -- Select the count of each revenues_change integer value
• SELECT revenues_change::integer, count(*)
• FROM fortune500
• GROUP BY revenues_change::integer
• -- order by the values of revenues_change
• ORDER BY revenues_change;
Instructions 3/3:
• How many of the Fortune 500 companies had revenues increase in 2017 compared to
2016? To find out, count the rows of fortune500 where revenues_change indicates an
increase.
-- Count rows
SELECT COUNT(*)
FROM fortune500
-- Where...
WHERE revenues_change > 0;
In this chapter, we'll focus on numeric data. This includes both columns, or variables,
that only take on integer whole number values and variables with decimal values.
4.2. Division
The most notable example is division. When you divide integers, the result is truncated to also be an
integer. So integer 10 divided by integer 4 returns integer value 2. But integer 10 divided by numeric 4-
point-0 returns 2-point-5. Now that we've covered the different data types, how do we start exploring
numeric data?
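As a quick illustration of integer versus numeric division:
SELECT 10/4,      -- integer division truncates: 2
       10/4.0;    -- numeric division keeps the decimal part: 2.5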
4.3. Range: min and max
It's always good to check the range and summary statistics of the values in a column. Get the
range with the min and max functions, which return the minimum and maximum values of their
input respectively. Here, we take the min and max of the question_pct column in the
stackoverflow table. The column tells us the proportion of total questions for a day with the
specified tag.
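A minimal sketch of the range check described above:
-- Minimum and maximum of question_pct
SELECT min(question_pct),
       max(question_pct)
FROM stackoverflow;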
4.4. Average or mean
The avg function computes the mean, or average, of the non-NULL values in its input.
4.5. Variance
Population variance: sum of squared differences from the mean, divided by the number of values n
Sample variance: sum of squared differences from the mean, divided by n - 1
Variance is a statistical measure of the amount of dispersion in a set of values. It tells you
how far spread values are from their mean. Larger values indicate greater dispersion. Variance
can be computed for a sample of data or for the population. The formula is the same except that
population variance divides by the number of values, while the sample variance divides by the
number of values minus one. The var_pop function computes population variance. The var_samp
function computes sample variance. The sample variance will always be slightly larger than the
population variance. The variance function is an alias for var_samp.
4.6. Standard deviation
Standard deviation is another measure of variance. It is the square root of the variance. Like variance,
there are also functions for both sample and population versions of standard deviation.
4.7. Round
Functions can return results with many decimal places. To make results easier to read, use the
round function to round a value of numeric type to a specified number of decimal places. The
round function takes a numeric value or column as the first argument, and the number of decimal
places to keep as the second argument.
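A sketch combining these summary functions on the fortune500 profits column; the casts are needed because round with a decimal-places argument expects a numeric input:
SELECT round(avg(profits)::numeric, 2) AS mean_profit,
       round(var_samp(profits)::numeric, 2) AS var_profit,
       round(stddev(profits)::numeric, 2) AS sd_profit
FROM fortune500;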
4.8. Summarize by group
In addition to computing summary measures for entire columns, it's also good practice to
summarize variables by groups in the data. For example, in addition to summarizing the
question_pct column in the stackoverflow table overall, we also want to compute summary
measures for each tag. The output here is truncated. The numbers with an e in them are in
scientific notation.
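A sketch of the per-tag summary described above:
-- Summary measures of question_pct for each tag
SELECT tag,
       min(question_pct),
       avg(question_pct),
       max(question_pct),
       stddev(question_pct)
FROM stackoverflow
GROUP BY tag;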
Exercise
Division
Compute the average revenue per employee for Fortune 500 companies by sector.
Instructions:
• Compute revenue per employee by dividing revenues by employees; casting is used here
to produce a numeric result.
• Take the average of revenue per employee with avg(); alias this as avg_rev_employee.
• Group by sector.
• Order by the average revenue per employee.
• -- Select average revenue per employee by sector
• SELECT sector,
• AVG(revenues/employees::numeric) AS avg_rev_employee
• FROM fortune500
• GROUP BY sector
• -- Use the column alias to order the results
• ORDER BY avg_rev_employee;
Exercise
Explore with division
In exploring a new database, it can be unclear what the data means and how columns are related
to each other.
What information does the unanswered_pct column in the stackoverflow table contain? Is it the
percent of questions with the tag that are unanswered (unanswered ?s with tag/all ?s with tag)?
Or is it something else, such as the percent of all unanswered questions on the site with the tag
(unanswered ?s with tag/all unanswered ?s)?
Divide unanswered_count (unanswered ?s with tag) by question_count (all ?s with tag) to see if
the value matches that of unanswered_pct to determine the answer.
Instructions:
• Exclude rows where question_count is 0 to avoid a divide by zero error.
• Limit the result to 10 rows.
• -- Divide unanswered_count by question_count
• SELECT unanswered_count/question_count::numeric AS computed_pct,
• -- What are you comparing the above quantity to?
• unanswered_pct
• FROM stackoverflow
• -- Select rows where question_count is not 0
• WHERE question_count > 0
• LIMIT 10;
Exercise
Summarize numeric columns
Summarize the profit column in the fortune500 table using the functions you've learned.
You can access the course slides for reference using the PDF icon in the upper right corner of the
screen.
Instructions 1/2
• Compute the min(), avg(), max(), and stddev() of profits; don't use any aliases here.
-- Select min, avg, max, and stddev of fortune500 profits
SELECT min(profits),
avg(profits),
max(profits),
stddev(profits)
FROM fortune500;
Instructions 2/2
• Repeat Step 1, but this time, creating a grouped summary of profits by sector, ordering
the results by the average profits for each sector; don't use any aliases here.
• -- Select sector and summary measures of fortune500 profits
• SELECT sector, min(profits),
• avg(profits),
• max(profits),
• stddev(profits)
•
• FROM fortune500
• -- What to group by?
• GROUP BY sector
• -- Order by the average profits
• ORDER BY avg;
Exercise
Summarize group statistics
Sometimes you want to understand how a value varies across groups. For example, how does the
maximum value per group vary across groups?
To find out, first summarize by group, and then compute summary statistics of the group results.
One way to do this is to compute group values in a subquery, and then summarize the results of
the subquery.
For this exercise, what is the standard deviation across tags in the maximum number of Stack
Overflow questions per day? What about the mean, min, and max of the maximums as well?
Instructions
• Start by writing a subquery to compute the max() of question_count per tag; alias the
subquery result as maxval.
• Then compute the standard deviation of maxval with stddev().
• Compute the min(), max(), and avg() of maxval too.
• -- Compute standard deviation of maximum values
• SELECT stddev(maxval),
• -- min
• min(maxval),
• -- max
• max(maxval),
• -- avg
• avg(maxval)
• -- Subquery to compute max of question_count by tag
• FROM (SELECT max(question_count) AS maxval
• FROM stackoverflow
-- Compute max by...
GROUP BY tag) AS max_results; -- alias for subquery
5. Exploring distributions
Understanding the distribution of a variable is crucial for finding errors, outliers, and
other anomalies in the data.
For columns with a small number of discrete values, we can view the distribution by
counting the number of observations with each distinct value. We group by, and order the results
by, the column of interest. There are 20 distinct values in the unanswered_count column in the
stackoverflow data with the tag amazon-ebs. Only partial results are shown here. Twenty values
are manageable to examine, but when the variable you're interested in takes on many different
values, binning or grouping the values can make the output more useful.
5.2. Truncate
One way to do this is with the trunc function. Trunc is short for truncate. The trunc
function reduces the precision of a number. This means replacing the smallest numeric places -
the right-most digits - with zeros. Truncating is not the same as rounding: you'll never get a
result with a larger absolute value than the original number. Trunc takes two arguments: the
value to truncate and the number of places to truncate it to. Positive values for the second
argument indicate the number of digits after the decimal to keep. For example, truncating 42-
point-1256 to 2 places keeps only the first two digits after the decimal. Negative values for the
second argument indicate places before the decimal to replace with zero. For example, truncating
12,345 to -3 replaces the three digits to the
left of the decimal with zero.
We can use the trunc function to group values in the unanswered_count column into three
groups based on the digit in the tens place of the number. Note that the second argument to the trunc
function here is a -1. There are 74 values between 30 and 39.
5.3. Generate series
What if you want to group values by a quantity other than the place value of a number, such as by
units of 5 or 20? The generate_series function can help. It generates a series of numbers from a
starting value to an ending value, inclusive, by steps of a third value.
For example, we can generate a series from 1 to 10 by steps of 2, or a series from 0 to 1 by steps of
1/10th.
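For example:
SELECT generate_series(1, 10, 2);     -- 1, 3, 5, 7, 9
SELECT generate_series(0, 1, 0.1);    -- 0, 0.1, 0.2, ... 1.0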
5.4. Create bins: output
generate_series can be used to group values into bins. Here's an example of what we want to
create: a series of lower and upper values, and the count of the number of observations falling in each
bin.
5.4.1. Create bins: query
Let's build the query to create that output. A WITH clause allows us to alias the results of
a subquery to use later in the query. Here, we generate two series: one for the lower bounds of
the bins and another for the upper. We name this "bins." Because we're only summarizing data
for tag amazon-ebs, we also create that subset of the stackoverflow table and call it ebs. Then
write the main select query to join the results of the subqueries we created and count the values.
We join ebs to bins where the column unanswered_count is greater than or equal to the lower
bound and strictly less than the upper bound. A left join keeps all bins in the result, even those
with no values in them. Finally, group by the lower and upper bin values to count the values in
each bin.
Each row in the output has the count of days where the number of unanswered questions
was greater than or equal to the lower bound and strictly less than the upper bound. Note that the
result contains bins with 0 values. This is because we counted non-null values of
unanswered_count instead of just the number of rows.
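A sketch of the binning query described above; the bin bounds (0 to 150 by 30s) are made up here, since the slide's exact bounds aren't in these notes:
WITH bins AS (
      SELECT generate_series(0, 120, 30) AS lower,
             generate_series(30, 150, 30) AS upper),
     ebs AS (
      SELECT unanswered_count
      FROM stackoverflow
      WHERE tag = 'amazon-ebs')
-- Count the values falling in each bin; the LEFT JOIN keeps empty bins
SELECT lower, upper, count(unanswered_count)
FROM bins
LEFT JOIN ebs
       ON unanswered_count >= lower
      AND unanswered_count < upper
GROUP BY lower, upper
ORDER BY lower;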
Exercise
Truncate
Use trunc() to examine the distributions of attributes of the Fortune 500 companies.
Remember that trunc() truncates numbers by replacing lower place value digits with zeros:
trunc(value_to_truncate, places_to_truncate)
Negative values for places_to_truncate indicate digits to the left of the decimal to replace, while
positive values indicate digits to the right of the decimal to keep.
Instructions 1/2:
• Use trunc() to truncate employees to the 100,000s (5 zeros).
• Count the number of observations with each truncated value.
• -- Truncate employees
• SELECT trunc(employees, -5) AS employee_bin,
-- Count number of companies with each truncated value
COUNT(*)
FROM fortune500
-- Use alias to group
GROUP BY employee_bin
-- Use alias to order
ORDER BY employee_bin;
Instructions 2/2:
• Repeat step 1 for companies with < 100,000 employees (most common).
• This time, truncate employees to the 10,000s place.
• -- Truncate employees
• SELECT TRUNC(employees, -4) AS employee_bin,
• -- Count number of companies with each truncated value
• COUNT(*)
• FROM fortune500
• -- Limit to which companies?
• WHERE employees < 100000
• -- Use alias to group
• GROUP BY employee_bin
• -- Use alias to order
• ORDER BY employee_bin;
Exercise
Generate series
Summarize the distribution of the number of questions with the tag "dropbox" on Stack
Overflow per day by binning the data.
Recall:
You can reference the slides using the PDF icon in the upper right corner of the screen.
Instructions 1/3:
• Start by selecting the minimum and maximum of the question_count column for the tag
'dropbox' so you know the range of values to cover with the bins.
-- Select the min and max of question_count
SELECT min(question_count),
max(question_count)
-- From what table?
FROM stackoverflow
-- For tag dropbox
WHERE tag = 'dropbox';
Instructions 2/3:
• Next, use generate_series() to create bins of size 50 from 2200 to 3100.
o To do this, you need an upper and lower bound to define a bin.
o This will require you to modify the stopping value of the lower bound and the
starting value of the upper bound by the bin width.
• -- Create lower and upper bounds of bins
• SELECT generate_series(2200, 3050, 50) AS lower,
• generate_series(2250, 3100, 50) AS upper;
Instructions 3/3:
• Select lower and upper from bins, along with the count of values within each bin
bounds.
• To do this, you'll need to join 'dropbox', which contains the question_count for tag
"dropbox", to the bins created by generate_series().
• The join should occur where the count is greater than or equal to the lower bound,
and strictly less than the upper bound.
• -- Bins created in Step 2
• WITH bins AS (
• SELECT generate_series(2200, 3050, 50) AS lower,
• generate_series(2250, 3100, 50) AS upper),
• -- Subset stackoverflow to just tag dropbox (Step 1)
• dropbox AS (
• SELECT question_count
• FROM stackoverflow
• WHERE tag='dropbox')
• -- Select columns for result
• -- What column are you counting to summarize?
• SELECT lower, upper, count(question_count)
• FROM bins -- Created above
• -- Join to dropbox (created above), keeping all rows from the bins table in the join
• LEFT JOIN dropbox
• -- Compare question_count to lower and upper
• ON question_count >= lower
• AND question_count < upper
• -- Group by lower and upper to count values in each bin
• GROUP BY lower, upper
• -- Order by lower to put bins in order
• ORDER BY lower;
6. More summary functions
You've learned several functions to help you explore numeric data. Now it's time to add a few more.
6.1. Correlation
So far, we've summarized individual columns. But sometimes we want to understand the
relationship between two columns. Correlation is one measure of the relationship between two
variables. A correlation coefficient can range from 1 to -1, with larger values indicating a
stronger positive relationship, and more negative values indicating a stronger negative
relationship.
6.1.1. Correlation function
The corr function takes the names of two columns as arguments and returns the
correlation between them. Rows with a null value in either column are excluded.
6.2. Median
Another common summary measure is the median. The median is the 50th percentile,
or midpoint, in a sorted list of values.
To get the median, use a percentile function. The syntax for the percentile functions is
different than for other functions you've seen because the data must be ordered to do the
computation. It's called ordered-set aggregate syntax. The only argument to the function is a
number between 0 and 1 corresponding to the percentile you want. You then type "within
group", and then, inside parentheses, order by and the name of the column you want to compute
the percentile for. percentile d-i-s-c, or discrete, always returns a value that exists in the column.
percentile c-o-n-t, or continuous, interpolates between values around the specified percentile. It
can return a value that is not in the original data.
6.2.2. Percentile examples
Here's an example. We have four numbers: 1, 3, 4, and 5. The two percentile functions return
different values for the median. The discrete percentile function returns 3, while the
continuous percentile function interpolates between 3 and 4, to return 3-point-5. The
formula used to compute percentiles is fairly complex, and sometimes the results may
not be intuitive. In particular, you may be used to computing the median of an even
number of values as the average of the two middle
values. Be aware that these functions may not always return that value as the 50th percentile.
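A sketch of the slide's example with the values 1, 3, 4, and 5, using a VALUES list in place of a real table:
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY val) AS median_disc,   -- 3
       percentile_cont(0.5) WITHIN GROUP (ORDER BY val) AS median_cont    -- 3.5
FROM (VALUES (1), (3), (4), (5)) AS t(val);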
Exercise
Correlation
What's the relationship between a company's revenue and its other financial attributes? Compute
the correlation between revenues and other financial variables with the corr() function.
Instructions:
• Compute the correlation between revenues and profits.
• Compute the correlation between revenues and assets.
• Compute the correlation between revenues and equity.
-- Correlation between revenues and profit
SELECT corr(revenues,profits) AS rev_profits,
-- Correlation between revenues and assets
corr(revenues,assets) AS rev_assets,
-- Correlation between revenues and equity
corr(revenues,equity) AS rev_equity
FROM fortune500;
Exercise
Mean and Median
Compute the mean (avg()) and median assets of Fortune 500 companies by sector.
percentile_disc(0.5)
WITHIN GROUP (ORDER BY column_name)
Instructions:
• Select the mean and median of assets.
• Group by sector.
• Order the results by the mean.
-- What groups are you computing statistics by?
SELECT sector,
-- Select the mean of assets with the avg function
avg(assets) AS mean,
-- Select the median
percentile_disc(0.5) WITHIN GROUP (ORDER BY assets)
AS median
FROM fortune500
-- Computing statistics for each what?
GROUP BY sector
-- Order results by a value of interest
ORDER BY mean;
7. Creating temporary tables
Up to this point, you've run queries and viewed the results. But what if you want to keep
the results of a query around for reference? You need special permissions in a database to create
or update tables, but most users can create temporary tables that only they can see and that only
last for the duration of a database session.
7.1. Syntax
One way to create a temporary table is with a select query. The results of the query are saved
as a table that you can use later. To do this, we preface any select query with the words create
temp table, then a name for the table we're creating, and finally the keyword as. This copies the
result of the select query into a new table that has no connection to the original table. There are
other ways to create temporary tables as well. You may have seen the "select into" syntax before.
You add a special clause into the middle of a select query to direct the results into a new temp
table. In this example, the added clause is the middle line of code. Both of these queries do the
same thing, just with different syntax. We're going to use the create table syntax in this course.
It's the method recommended by Postgres, and it allows you to use options not available with the
"select into" syntax.
7.2. Create a table
As an example let's make a temporary table called top_companies with just the rank and title of
the top 10 companies in fortune500. We preface our select query with the create temp table
syntax. After we've created the table, we can then select from it. Note that the column names are
taken from the column names of the query result.
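A sketch of the top_companies example described above, using the rank and title columns of fortune500:
CREATE TEMP TABLE top_companies AS
SELECT rank, title
FROM fortune500
WHERE rank <= 10;

-- The new temp table can now be queried like any other table
SELECT *
FROM top_companies;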
7.3. Insert into table
We can also insert new rows into a table after we've created it. We use an "insert into"
statement with the name of the table, followed by a select query that will generate the rows we
want to add to the table. The columns generated by the select query must match those already in
the table. Here we add companies with ranks 11 to 20 to the table. In many database clients, after
you run the command,you'll get a confirmation message that 10 rows were inserted into the table.
In the DataCamp editor, you won't see any message when rows are inserted. Now if we select
from the temp table top_companies again, you can see the new rows have been added.
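A sketch of the insert described above:
-- Add companies ranked 11 through 20 to the existing temp table
INSERT INTO top_companies
SELECT rank, title
FROM fortune500
WHERE rank BETWEEN 11 AND 20;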
7.4. Delete(drop) table
To delete a table, use the drop table command. The table will be deleted immediately
without warning. Dropping a table can be useful if you made a mistake when creating it or when
inserting values into it. Temporary tables will also be deleted automatically when you disconnect
from the database. A variation on the drop table command adds the clause if exists before the
table name. This means to only try to delete the table after confirming that such a table exists.
This variation is often used in scripts because it won't cause an error if the table doesn't exist.
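For example:
-- Delete the table immediately
DROP TABLE top_companies;

-- Script-friendly variation: no error if the table doesn't exist
DROP TABLE IF EXISTS top_companies;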
Exercise
Create a temp table
Find the Fortune 500 companies that have profits in the top 20% for their sector (compared to
other Fortune 500 companies).
To do this, first, find the 80th percentile of profit for each sector with
percentile_disc(fraction)
WITHIN GROUP (ORDER BY sort_expression)
Then join fortune500 to the temporary table to select companies with profits greater than the
80th percentile cut-off.
Instructions 1/2:
• Create a temporary table called profit80 containing the sector and 80th percentile of
profits for each sector.
• Alias the percentile column as pct80.
-- To clear table if it already exists; fill in name of temp table
DROP TABLE IF EXISTS profit80;
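The rest of this exercise's solution isn't in these notes; a sketch following the instructions might look like this (title is the company-name column used elsewhere in the course):
-- Step 1 (continued): 80th percentile of profits for each sector
CREATE TEMP TABLE profit80 AS
SELECT sector,
       percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
FROM fortune500
GROUP BY sector;

-- Step 2 (sketch): companies with profits above their sector's 80th percentile
SELECT title, fortune500.sector, profits, pct80
FROM fortune500
INNER JOIN profit80
   ON fortune500.sector = profit80.sector
WHERE profits > pct80;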
Exercise
Create a temp table to simplify a query
Find out how many questions had each tag on the first date for which data for the tag is available,
as well as how many questions had the tag on the last day. Also, compute the difference between
these two values.
Then use the minimum dates to select the question_count on both the first and last day. To do
this, join the temp table startdates to two different copies of the stackoverflow table: one for each
column - first day and last day - aliased with different names.
Instructions 1/2:
• First, create a temporary table called startdates with each tag and the min() date for the
tag in stackoverflow.
• -- To clear table if it already exists
• DROP TABLE IF EXISTS startdates;
•
• -- Create temp table syntax
• CREATE TEMP TABLE startdates AS
• -- Compute the minimum date for each what?
• SELECT tag,
• min(date) AS mindate
• FROM stackoverflow
• -- What do you need to compute the min date for each tag?
• GROUP BY tag;
•
• -- Look at the table you created
• SELECT *
• FROM startdates;
Instructions 2/2:
• Join startdates to stackoverflow twice using different table aliases.
• For each tag, select mindate, question_count on the mindate, and question_count on
2018-09-25 (the max date).
• Compute the change in question_count over time.
• -- To clear table if it already exists
• DROP TABLE IF EXISTS startdates;
•
• CREATE TEMP TABLE startdates AS
• SELECT tag, min(date) AS mindate
• FROM stackoverflow
• GROUP BY tag;
•
• -- Select tag (Remember the table name!) and mindate
• SELECT so_min.tag,
• mindate,
• -- Select question count on the min and max days
• so_min.question_count AS min_date_question_count,
• so_max.question_count AS max_date_question_count,
• -- Compute the change in question_count (max- min)
• so_max.question_count - so_min.question_count AS change
• FROM startdates
• -- Join startdates to stackoverflow with alias so_min
• INNER JOIN stackoverflow AS so_min
• -- What needs to match between tables?
• ON startdates.tag = so_min.tag
• AND startdates.mindate = so_min.date
• -- Join to stackoverflow again with alias so_max
• INNER JOIN stackoverflow AS so_max
• -- Again, what needs to match between tables?
• ON so_min.tag = so_max.tag
• AND so_max.date = '2018-09-25';
Exercise
Insert into a temp table
While you can join the results of multiple similar queries together with UNION, sometimes it's
easier to break a query down into steps. You can do this by creating a temporary table and
inserting rows into it.
Compute the correlations between each pair of profits, profits_change, and revenues_change
from the Fortune 500 data.
                 profits   profits_change   revenues_change
profits          1.00      #                #
profits_change   #         1.00             #
revenues_change  #         #                1.00
round(column_name::numeric, decimal_places)
Note that Steps 1 and 2 do not produce output. It is normal for the query result pane to say "Your
query did not generate any results."
Instructions 1/3:
Create a temp table correlations.
• Compute the correlation between profits and each of the three variables (i.e. correlate
profits with profits, profits with profits_change, etc).
• Alias columns by the name of the variable for which the correlation with profits is being
computed.
Instructions 2/3:
• Insert rows into the correlations table for profits_change and revenues_change.
• DROP TABLE IF EXISTS correlations;
•
• CREATE TEMP TABLE correlations AS
• SELECT 'profits'::varchar AS measure,
• corr(profits, profits) AS profits,
• corr(profits, profits_change) AS profits_change,
• corr(profits, revenues_change) AS revenues_change
• FROM fortune500;
•
• -- Add a row for profits_change
• -- Insert into what table?
• INSERT INTO correlations
• -- Follow the pattern of the select statement above, using profits_change instead of profits
• SELECT 'profits_change'::varchar AS measure,
• corr(profits_change,profits) AS profits,
• corr(profits_change,profits_change) AS profits_change,
• corr(profits_change,revenues_change) AS revenues_change
• FROM fortune500;
•
• -- Repeat the above, but for revenues_change
• INSERT INTO correlations
• SELECT 'revenues_change'::varchar AS measure,
• corr(revenues_change,profits) AS profits,
• corr(revenues_change,profits_change) AS profits_change,
• corr(revenues_change,revenues_change) AS revenues_change
• from fortune500;
Instructions 3/3:
• Select all rows and columns from the correlations table to view the correlation matrix.
• First, you will need to round each correlation to 2 decimal places.
• The output of corr() is of type double precision, so you will need to also cast columns to
numeric.
• DROP TABLE IF EXISTS correlations;
•
• CREATE TEMP TABLE correlations AS
• SELECT 'profits'::varchar AS measure,
• corr(profits, profits) AS profits,
• corr(profits, profits_change) AS profits_change,
• corr(profits, revenues_change) AS revenues_change
• FROM fortune500;
•
• INSERT INTO correlations
• SELECT 'profits_change'::varchar AS measure,
• corr(profits_change, profits) AS profits,
• corr(profits_change, profits_change) AS profits_change,
• corr(profits_change, revenues_change) AS revenues_change
• FROM fortune500;
•
• INSERT INTO correlations
• SELECT 'revenues_change'::varchar AS measure,
• corr(revenues_change, profits) AS profits,
• corr(revenues_change, profits_change) AS profits_change,
• corr(revenues_change, revenues_change) AS revenues_change
• FROM fortune500;
•
• -- Select each column, rounding the correlations
• SELECT measure,
• ROUND(profits::numeric,2) AS profits,
• ROUND(profits_change::numeric,2) AS profits_change,
• ROUND(revenues_change::numeric,2) AS revenues_change
• FROM correlations;
8. Character data types and common issues
The next type of data we’ll be exploring is character or text data.
Character(n) or char(n)
• Fixed length n
• Trailing spaces ignored in comparisons
Varchar(n)
• Up to a maximum length of n characters
Text or varchar
• Unlimited length
There are three types of character columns to store strings of text: character (which can be
shortened to char), character varying (which can be shortened to varchar), and text. They differ
in the length of the string of text they store. The length of a string is defined as the number of
characters in it. Character columns store a fixed length string; spaces are added to the end of
shorter strings to make up any difference in length. Spaces at the end of char fields are ignored
when comparing values. Varchar columns can optionally specify a maximum string length; they
allow strings of any size up to the specified maximum. Text, or varchar columns without a
maximum length specified, can store strings of unlimited length.
8.2. Types of text data
Categorical: short, repeated values such as days of the week or product categories
Unstructured text: longer, unique values, for example:
• I really like this product. I use it every day. It's my favorite color.
• We've redesigned your favorite t-shirt to make it even better. You'll love…
• Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal…
Regardless of the formal column type, for analysis, we want to distinguish between two types
of text data: categorical variables and unstructured text. Categorical variables are short strings of
text with values that are repeated across multiple rows. They take on a finite and manageable set
of distinct values. Days of the week, product categories, and multiple-choice survey question
responses are all examples of categorical variables. Unstructured text consists of longer strings of
unique values, such as answers to open-ended survey questions or product reviews. To analyze
unstructured text, we can create new variables that extract features from the text or indicate
whether the text has particular characteristics. For example, we could create binary indicator
variables that denote whether the text contains keywords of particular interest.
For now, we'll focus on categorical variables. The first things to check with categorical
variables are the set of distinct categories and the number of observations, or rows, for each
category. We do this with GROUP BY and count. Without ordering the results, it's hard to tell
which categories are commonly used and whether any categories should be grouped together.
8.4. Order: most frequent values
Ordering by the count of each value helps us see the most, and least, frequent categories.
It's good to check whether categories with only a few observations have errors - such as spelling,
capitalization, or spacing mistakes.
8.4.1. Order: category value
It's also a good idea to try ordering the results by the category. Doing so can help us
identify possible duplicates and other errors in the data. Does the order of the categories in the
results match what you were expecting?
8.4.2. Alphabetical order
Character types are sorted in alphabetical order. Spaces come before letters, and
uppercase letters come before lowercase letters. Looking at the first character of each category
shows that the results are in alphabetical order.
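A sketch of these grouping and ordering checks; the table and column names are placeholders, since the slides' example table isn't named here:
-- Count rows for each category, most frequent values first
SELECT category, count(*)
FROM some_table
GROUP BY category
ORDER BY count(*) DESC;

-- Order by the category itself to spot near-duplicate values
SELECT category, count(*)
FROM some_table
GROUP BY category
ORDER BY category;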
8.5. Common issues
Case matters
Spaces count
Punctuation differences
So what are you looking for when grouping and counting values? Common inconsistencies
and issues with character data include: Differences in case: for example, when there are both
lower and upper case versions of the same value. White space differences, such as when values
only differ in the number or placement of spaces. One exception here is that when comparing
values of type char, trailing spaces are ignored. An empty string, which is a string of length zero,
is not the same as a string of all spaces. An empty string is also not the same as null. These are
distinct values. And finally punctuation differences. Punctuation differences can sometimes be
subtle. For example, there are multiple types of hyphens and dashes that look similar but are
different characters.
Exercise
Count the categories
In this chapter, we'll be working mostly with the Evanston 311 data in table evanston311. This is
data on help requests submitted to the city of Evanston, IL.
This data has several character columns. Start by examining the most frequent values in some of
these columns to get familiar with the common categories.
Instructions 1/4:
• How many rows does each priority level have?
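A possible solution for this step (not shown in these notes):
-- Count rows for each priority level
SELECT priority, count(*)
FROM evanston311
GROUP BY priority;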
Instructions 3/4:
• How many distinct values of source appear in at least 100 rows?
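A possible solution for this step (not shown in these notes), using HAVING to filter on the group counts:
-- Values of source appearing in at least 100 rows
SELECT source, count(*)
FROM evanston311
GROUP BY source
HAVING count(*) >= 100;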
Instructions 4/4:
• Find the five most common values of street and the count of each.
• -- Find the 5 most common values of street and the count of each
• SELECT street, COUNT(*)
• FROM evanston311
• GROUP BY street
• ORDER BY COUNT(*) DESC
• LIMIT 5;
9. Cases and spaces
Two of the most common inconsistencies in text data are differences in the case of
characters and in the spaces in a string. We can deal with these issues by using functions to
change character case or remove spaces and by querying data with the LIKE operator.
9.1. Converting case
First, one of the easiest ways to handle inconsistencies in case is to convert character data
to either be all upper or all lower case. The upper and lower functions do just that. The functions
have no effect on punctuation or numbers.
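For example:
SELECT upper('Hello World! 2'),   -- HELLO WORLD! 2
       lower('Hello World! 2');   -- hello world! 2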
9.2. Case insensitive comparisons
You can use the lower or upper function to make comparisons case insensitive. For
example, the fruit data here has 8 entries corresponding to apple, but there are 6 different ways
the data is entered. To select rows from the fruit table with the value apple - regardless of case -
we can convert all fav_fruit values to lower case with the lower function. Then select rows where
the result of the function is equal to 'apple', all lower case. Note that while we got both upper and
lower case versions of apple in our 5 results, we are still missing 3 values with spaces at the
beginning or end of the word apple, or with the plural apples instead of apple.
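A sketch of the query described above, assuming the slide's table and column names (fruit and fav_fruit):
-- Case-insensitive match on apple
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';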
9.3. Case insensitive searches
The LIKE operator can help us match values of apple that might have extra spaces or s-es
at the end. By using a LIKE pattern with a percentage sign before and after apple, we match
fav_fruit entries where apple is anywhere in the string. Remember that with LIKE, percentage
matches any number of characters, including 0, while an underscore matches exactly one
character. Now we have values of apple with spaces and s-es, but only lower case. To make this
query case insensitive, we can use ILIKE instead of LIKE. The I stands for insensitive. ILIKE
queries take longer to run than LIKE queries, so only use them when you need to. Using ILIKE
we also select variations of apple with upper case characters. All 8 variations of apple are now in
the result.
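Sketches of the LIKE and ILIKE patterns described above, with the same assumed names:
-- Case-sensitive: lower case values containing apple anywhere in the string
SELECT *
FROM fruit
WHERE fav_fruit LIKE '%apple%';

-- Case-insensitive version of the same pattern
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';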
9.3.1. Watch out!
Remember though that LIKE searches can match more than you may intend. Our query
to select apple values would also select pineapple!
9.4. Trimming spaces
While the trim functions remove spaces by default, you can specify other characters that
should be removed instead. You can remove a single character, such as an exclamation point, or
a set of characters, all together in a single string. The trim functions are case sensitive, so in the
second example, we include both an upper and lower case W.
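A sketch of trimming other characters; the input string 'Wow!' is an assumption, since the slide's exact value isn't in these notes:
SELECT trim('  apple  '),     -- removes spaces from both ends by default: 'apple'
       trim('Wow!', '!'),     -- removes a single character: 'Wow'
       trim('Wow!', '!wW');   -- removes a set of characters (case sensitive): 'o'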
9.6. Combining functions
Instead of specifying both lower and upper case versions of the same letter, we can combine
functions. Remember that we can nest the call to one function inside another function. The inner
function is executed first, then the result is sent to the outer function. Here, we first convert all of
the characters to lower case with the lower function, then we use the trim function to remove
exclamation points and lower case w's.
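A sketch of the nested call described above:
-- lower runs first, so only the lower case w needs to be listed for trim
SELECT trim(lower('Wow!'), '!w');   -- 'o'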
Exercise
Trimming
Some of the street values in evanston311 include house numbers with # or / in them. In addition,
some street values end in a ..
Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the
street values as a first attempt at cleaning up the values.
Instructions:
• Trim digits 0-9, #, /, ., and spaces from the beginning and end of street.
• Select distinct original street value and the corrected street value.
• Order the results by the original street value.
• SELECT distinct street,
• -- Trim off unwanted characters from street
• trim(street, '0123456789 #/.') AS cleaned_street
• FROM evanston311
• ORDER BY street;
Exercise
Exploring unstructured text
The description column of evanston311 has the details of the inquiry, while the category column
groups inquiries into different types. How well does the category capture what's in the
description?
LIKE and ILIKE queries will help you find relevant descriptions and categories. Remember that
with LIKE queries, you can include a % on each side of a word to find values that contain the
word. For example: description ILIKE '%word%'.
Instructions 1/4:
• Count rows in evanston311 where the description contains 'trash' or 'garbage', regardless of case.
-- Count rows
SELECT COUNT(*)
FROM evanston311
-- Where description includes trash or garbage
WHERE description ILIKE '%trash%'
OR description ILIKE '%garbage%';
Instructions 2/4:
• category values are in title case. Use LIKE to find category values with 'Trash' or
'Garbage' in them.
-- Select categories containing Trash or Garbage
SELECT category
FROM evanston311
-- Use LIKE
WHERE category LIKE '%Trash%'
OR category LIKE '%Garbage%';
Instructions 3/4:
• Count rows where the description includes 'trash' or 'garbage' but the category does not.
-- Count rows
SELECT COUNT(*)
FROM evanston311
-- description contains trash or garbage (any case)
WHERE (description ILIKE '%trash%'
OR description ILIKE '%Garbage%')
-- category does not contain Trash or Garbage
AND category NOT LIKE '%Trash%'
AND category NOT LIKE '%Garbage%';
Instructions 4/4:
• Find the most common categories for rows with a description about trash that don't have
a trash-related category
• -- Count rows with each category
• SELECT category, COUNT(*)
• FROM evanston311
• WHERE (description ILIKE '%trash%'
• OR description ILIKE '%garbage%')
• AND category NOT LIKE '%Trash%'
• AND category NOT LIKE '%Garbage%'
• -- What are you counting?
• GROUP BY category
• ORDER BY count DESC
• LIMIT 10;
10. Splitting and concatenating text
When working with text values, you often need to break strings apart into multiple
pieces, extract part of a string to a new variable, or join, or concatenate, strings together. There
are functions to help us with these operations.
10.1. Substring
First, how do we extract just part of a string? The left and right functions take as arguments a
string, or the name of a column of strings, and the number of characters to keep. Left keeps characters
starting at the left, while right keeps characters counting from the end. Here, the first two characters in
the string abcde are a and b, while the last two characters are d and e. If the string contains fewer than the
requested number of characters, only the available characters are returned.
To extract characters from the middle of a string, use the substring function. The function takes
a string or column to operate on, and then the keyword FROM. Next comes the index of the character to
start with, counting from 1. Then the keyword FOR followed by the number of characters to include in
the substring. For example, if we take the substring of abcdef starting from position 2 and going for 3
characters, we get bcd. B was the second character in the string, and the function extracted 3
characters. You may also see an abbreviated version of substring with a shortened function name and
comma-separated arguments. It works the same way. The left, right, and substring functions can be
useful in situations such as extracting a snippet from a long unstructured text field, displaying just the
first or last few digits of an account number, or limiting a zip code to only the first 5 digits.
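For example:
SELECT left('abcde', 2),                    -- 'ab'
       right('abcde', 2),                   -- 'de'
       substring('abcdef' FROM 2 FOR 3),    -- 'bcd'
       substr('abcdef', 2, 3);              -- abbreviated form, also 'bcd'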
10.2. Delimiters
Fields/chunks:
1. Some text
2. More text
3. Still more text
The second string operation to know is how to split a string into parts based on a
delimiter. A delimiter is a character, such as a comma, or a string that separates fields or
chunks of text.
The function split_part takes a string, the delimiter to split the string on, and the number
position of the part of the split string to return, counting from one. For example, if we split the
string a-comma- bc-comma-d with a comma as the delimiter, the string would be split into 3
parts: a, bc, and d. If we ask for the second part, we get bc. Note that the delimiter is not included
in the returned value.
The delimiter can be a single character or a string of multiple characters. For example, if
we split the string "cats and dogs and fish" on "and" surrounded by spaces, the first group is cats.
Note that the string was split on the delimiter exactly as it appears, not on the set of characters
included in the delimiter. It is common to split strings on a delimiter value when multiple pieces
of information have been stored together in a single column.
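For example:
SELECT split_part('a,bc,d', ',', 2),                        -- 'bc'
       split_part('cats and dogs and fish', ' and ', 1);    -- 'cats'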
10.3. Concatenating text
The third string operation is concatenation. The concat function takes any number of arguments. It joins
the text representation of all of the values together in a single string. You can concatenate both
character types and non-character types. Values can also be concatenated with a double pipe, which
looks like two vertical bars. This operator is the SQL standard for string concatenation. It works the
same as the concat function except when null values are included. The concat function omits null
values, while the double pipe will return null if any component is null. One example of when you
might concatenate strings is to join a first name and last name stored in separate columns to get a
person's full name.
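For example, note how the two forms treat NULL differently:
SELECT concat('Ada', ' ', 'Lovelace'),    -- 'Ada Lovelace'
       'Ada' || ' ' || 'Lovelace',        -- 'Ada Lovelace'
       concat('Ada', NULL, 'Lovelace'),   -- 'AdaLovelace' (NULL is omitted)
       'Ada' || NULL || 'Lovelace';       -- NULL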
Exercise
Concatenate strings
House number (house_num) and street are in two separate columns in evanston311. Concatenate
them together with concat() with a space in between the values.
Instructions:
• Concatenate house_num, a space ' ', and street into a single value using the concat().
• Use a trim function to remove any spaces from the start of the concatenated value.
-- Concatenate house_num, a space, and street and trim spaces from the start of the result
SELECT ltrim(CONCAT(house_num,' ', street)) AS address
FROM evanston311;
Exercise
Split strings on a delimiter
The street suffix is the part of the street name that gives the type of street, such as Avenue, Road,
or Street. In the Evanston 311 data, sometimes the street suffix is the full word, while other times
it is the abbreviation.
Extract just the first word of each street value to find the most common streets regardless of the
suffix.
To do this, use the split_part() function with a space as the delimiter.
Exercise
Shorten long strings
For displaying or quickly reviewing the data, you might want to only display the first few
characters. You can use the left() function to get a specified number of characters at the start of
each value.
To indicate that more data is available, concatenate '...' to the end of any shortened description.
To do this, you can use a CASE WHEN statement to add '...' only when the string length is
greater than 50.
Select the first 50 characters of description when description starts with the word "I".
Instructions:
• Select the first 50 characters of description with '...' concatenated on the end where the
length() of the description is greater than 50 characters. Otherwise just select the
description as is.
• Select only descriptions that begin with the word 'I' and not the letter 'I'.
o For example, you would want to select "I like using SQL!", but would not want to
select "In this course we use SQL!".
-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description) > 50
            THEN left(description, 50) || '...'
       -- otherwise just select description
       ELSE description
       END
  FROM evanston311
 -- limit to descriptions that start with the word I
 WHERE description LIKE 'I %'
 ORDER BY description;
11. Strategies for multiple transformations
You've learned several ways to transform character data. But what do you do when you need to
use different transformations on different observations?
11.1. Multiple transformations
Here's an example of data where different delimiter characters were used to separate the
major industry categories of Agriculture and Education, from the industry subcategories.
Sometimes there's a colon, other times there's a pipe character or a dash. We can use the
split_part function to separate the category into its two parts, but how can we apply different
delimiters to different rows?
11.2. CASE WHEN
One option when you need to apply multiple transformations to subsets of the data that don't
overlap is to use a CASE WHEN statement. We want to extract just the initial major category
from the category column - the part before a delimiter. To do that, we have a case for each
different delimiter: a colon followed by space, a dash surrounded by spaces, and a pipe
surrounded by spaces. The last case here goes in the else clause. We use LIKE statements to
select the rows with each type of delimiter, then apply the split_part function with the delimiter
for those rows. We alias the result of the CASE WHEN statement as major_category. We can
then use the major_category we extracted to group and aggregate the data. This allows us to get
the number of businesses in each of the two major categories.
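A minimal sketch of this pattern (the table name naics and the businesses column are assumptions for illustration; the example data itself isn't shown here):
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
            WHEN category LIKE '% - %' THEN split_part(category, ' - ', 1)
            -- remaining rows use a pipe surrounded by spaces
            ELSE split_part(category, ' | ', 1)
       END AS major_category,
       sum(businesses)
  FROM naics
 GROUP BY major_category;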
11.3. Recoding table
When there are many messy values to clean up, another option is to build a recoding table that
maps each original value to a standardized one. As an example, imagine a fav_fruit survey
column whose values differ in case, leading and trailing spaces, spelling, and trailing s's.
The first step is to create a temporary table with two columns: original, containing the
distinct values of fav_fruit and standardized, which will eventually contain the recoded values.
We initially populate the standardized column with the original values.
11.4.1. Initial table
Here, we need three update statements. In the first, we set the standardized value to be the
lower case version of the original value, with spaces trimmed from both ends. In the second, we
set the standardized value to banana only for rows that contained a double n. The third statement
updates the standardized value by removing s's from the end with the trim function.
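A sketch of these steps, assuming a hypothetical fruit table with a fav_fruit column:
-- Create the recode table from the distinct original values
CREATE TEMP TABLE recode AS
SELECT DISTINCT fav_fruit AS original,
       fav_fruit AS standardized
  FROM fruit;

-- 1) lower case the values and trim spaces from both ends
UPDATE recode SET standardized = trim(lower(original));
-- 2) rows containing a double n become 'banana'
UPDATE recode SET standardized = 'banana' WHERE standardized LIKE '%nn%';
-- 3) remove s from the end of values
UPDATE recode SET standardized = trim(trailing 's' from standardized);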
11.5.1. Resulting recode table
Exercise
Group and recode values
There are almost 150 distinct values of evanston311.category. But some of these categories are
similar, with the form "Main Category - Details". We can get a better sense of what requests are
common if we aggregate by the main category.
To do this, create a temporary table recode mapping distinct category values to new,
standardized values. Make the standardized values the part of the category before a dash ('-').
Extract this value with the split_part() function:
You'll also need to do some additional cleanup of a few cases that don't fit this pattern.
Then the evanston311 table can be joined to recode to group requests by the new standardized
category values.
Instructions 1/4:
• Create recode with a standardized column; use split_part() and then rtrim() to remove any
remaining whitespace on the result of split_part().
-- Fill in the command below with the name of the temp table
DROP TABLE IF EXISTS recode;
Exercise
Create a table with indicator variables
• Emails contain an @.
• Phone numbers have the pattern of three characters, dash, three characters, dash, four
characters. For example: 555-555-1212.
Use LIKE to match these patterns. Remember % matches any number of characters (even 0), and
_ matches a single character. Enclosing a pattern in % (i.e. before and after your pattern) allows
you to locate it within other text.
For example, '%___.com%' would allow you to search for a reference to a website with the top-
level domain '.com' and at least three characters preceding it.
Create and store indicator variables for email and phone in a temporary table. LIKE produces
True or False as a result, but casting a boolean (True or False) as an integer converts True to 1
and False to 0. This makes the values easier to summarize later.
Instructions 1/2:
• Create a temp table indicators from evanston311 with three columns: id, email, and
phone.
• Use LIKE comparisons to detect the email and phone patterns that are in the
description, and cast the result as an integer with CAST().
o Your phone indicator should use a combination of underscores _ and dashes - to
represent a standard 10-digit phone number format.
o Remember to start and end your patterns with % so that you can locate the pattern
within other text!
-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;
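A possible completion of this solution, following the patterns described in the instructions:
-- Create the indicators temp table with id, email, and phone columns
CREATE TEMP TABLE indicators AS
SELECT id,
       CAST(description LIKE '%@%' AS integer) AS email,
       CAST(description LIKE '%___-___-____%' AS integer) AS phone
  FROM evanston311;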
12. Date/time types and formats
The last type of data we're exploring is date/time data. As the name suggests, date/time refers to
columns that store dates and/or times.
Date
• YYYY-MM-DD
• Example: 2018-12-30
Timestamp
• YYYY-MM-DD HH:MM:SS
• Example: 2018-12-30 13:10:04.3
There are two main types: date and timestamp. Dates only include year, month, and day.
Timestamps include a date plus a time. Times are specified in terms of hours from 0 to 24, minutes, and
seconds. Seconds can be fractional down to microseconds.
12.2. Intervals
Interval examples:
There is also a third date/time type you should know: an interval. Intervals represent time
durations. For example, 6 days, 1 hour, 48 minutes, and 8 seconds, or 51 minutes and 3 seconds.
Columns can be of type interval, but it's more common to encounter intervals as a result of
subtracting one date or timestamp from another. Intervals will default to display the number of
days, if any, and the time.
12.3. Date/time format examples
• 01/10/18 1:00
• 10/01/18 01:00:00
• 01/10/2018 1pm
• January 10th, 2018 1pm
• 10 Jan 2018 1:00
• 01/10/18 01:00:00
• 01/10/18 13:00:00
Date/time data can be difficult to work with because people record dates in many different
formats. Consider some of the different ways people might write 1pm on January 10th, 2018.
They might write the date with either the month or day first. They can use two digits for the year
or four. They could spell out the month name, abbreviate it, or use numbers. They might specify
the time using a 12 hour clock or 24 hour clock.
12.4. ISO 8601
YYYY-MM-DD HH:MM:SS
Example: 2018-01-05 09:35:15
12.5. Timezones
YYYY-MM-DD HH:MM:SS+HH
Example: 2018-01-05 09:35:15+02
Timezones are another way datetime information can get complicated. Postgres stores
timestamps according to UTC, or Coordinated Universal Time. Timezones are defined in terms
of their offset from UTC. Timestamps in Postgres can include timezone information or not.
When timezones are included, they appear at the end with a plus or minus, followed by the
number of hours the timezone is offset from UTC. The example timestamp here is 2 hours ahead
of UTC.
12.6. Date and time comparisons
So how do we work with dates and timestamps? Date/time entries can be compared to each
other as numbers can: with greater than, less than, and equals signs. You can get the current timestamp
with the now function. This can be useful when comparing values to the current date and time. Note
how dates in these examples are specified in ISO 8601 format. They are surrounded by single quotes
like character data.
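For example:
-- Date/time values compare like numbers; now() returns the current timestamp
SELECT '2018-01-01'::date > '2017-12-31'::date;  -- true
SELECT now() > '2018-01-01';                     -- true: the literal is cast to a timestamp
SELECT now();                                    -- the current date and time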
12.7. Date subtraction
In addition to comparing dates, you can also subtract them from each other. The result is of type
interval.
12.8. Date addition
You can also add time to or subtract time from existing dates. Adding an integer value to a
date will add days. Adding an integer to a timestamp, however, will cause an error. Other
amounts of time, from years to seconds, can be added with intervals. You specify an interval
with a combination of numbers and words inside single quotes, then cast this as an interval. For
example, you can add an interval of one year. Or, you can specify the interval in terms of
multiple units, such as 1 year, 2 days, and 3 minutes.
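For example:
-- Subtracting timestamps produces an interval
SELECT now() - '2018-01-01';
SELECT '2018-01-31'::timestamp - '2018-01-01'::timestamp;         -- 30 days

-- Adding an integer to a date adds days; other amounts of time use intervals
SELECT '2018-12-10'::date + 1;                                    -- 2018-12-11
SELECT '2018-12-10'::date + '1 year'::interval;                   -- 2019-12-10 00:00:00
SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;  -- 2019-12-12 00:03:00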
Exercise
Date comparisons
When working with timestamps, sometimes you want to find all observations on a given
day. However, if you specify only a date in a comparison, you may get unexpected results. This
query:
SELECT count(*)
FROM evanston311
WHERE date_created = '2018-01-02';
returns 0 even though requests were created on that day. This is because dates are automatically converted to timestamps when compared to a timestamp.
The time fields are all set to zero:
SELECT '2018-01-02'::timestamp;
2018-01-02 00:00:00
When working with both timestamps and dates, you'll need to keep this in mind.
Instructions 1/3:
• Count the number of Evanston 311 requests created on January 31, 2017 by casting
date_created to a date.
-- Count requests created on January 31, 2017
SELECT count(*)
FROM evanston311
WHERE date_created::date='2017-01-31';
Instructions 2/3:
• Count the number of Evanston 311 requests created on February 29, 2016 by using
>= and < operators.
-- Count requests created on February 29, 2016
SELECT count(*)
FROM evanston311
WHERE date_created >= '2016-02-29'
AND date_created < '2016-03-01' ;
Instructions 3/3:
Count the number of requests created on March 13, 2017.
Specify the upper bound by adding 1 to the lower bound.
-- Count requests created on March 13, 2017
SELECT count(*)
  FROM evanston311
 WHERE date_created >= '2017-03-13'
   AND date_created < '2017-03-13'::date + 1;
Exercise
Date arithmetic
You can subtract dates or timestamps from each other.
You can add time to dates or timestamps using intervals. An interval is specified with a number
of units and the name of a datetime field. For example:
• '3 days'::interval
• '6 months'::interval
• '1 month 2 years'::interval
• '1 hour 30 minutes'::interval
Practice date arithmetic with the Evanston 311 data and now().
Instructions 1/4:
• Subtract the minimum date_created from the maximum date_created.
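One way to write this (a sketch of the intended query):
-- Subtract the min date_created from the max date_created
SELECT max(date_created) - min(date_created)
  FROM evanston311;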
Instructions 3/4:
• Add 100 days to the current timestamp.
-- Add 100 days to the current timestamp
SELECT now() + '100 days'::interval;
Instructions 4/4
• Select the current timestamp and the current timestamp plus 5 minutes.
Exercise
Completion time by category
Instructions:
• Compute the average difference between the completion timestamp and the creation
timestamp by category.
• Order the results with the largest average time to complete the request first.
-- Select the category and the average completion time by category
SELECT category,
       avg(date_completed - date_created) AS completion_time
  FROM evanston311
 GROUP BY category
 -- Order the results
 ORDER BY completion_time DESC;
13. Date/time components and aggregation
As with numerical and character data, sometimes we need to extract components of a
date/time, or truncate the value, to aggregate the data in a meaningful way.
13.1. Common date/time fields
Functions exist to extract individual components of date/time data. These components are
called fields. The fields are defined in the Postgres documentation. Many are based on the ISO
8601 standard. Let's look at some common fields starting with the largest unit of time. First, we
can get the century or decade that a timestamp belongs in. January 1st, 2019 is in century 21 and
decade 201. Date/time field definitions can be complicated and sometimes counterintuitive. It's
always a good idea to read the documentation before using unfamiliar fields. Next, we can get
the year, month, and day fields that make up a date. We can also get the hour, minute, and
second fields that make up a time. Week is the week number in the year, based on the ISO 8601
definition. D-O-W is day of week. The week starts with Sunday, which has a value of 0, and
ends on Saturday with a value of 6.
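For example:
-- Extract individual fields with date_part() or the equivalent EXTRACT syntax
SELECT date_part('century', now()) AS century,
       date_part('year', now())    AS year,
       date_part('month', now())   AS month,
       date_part('dow', now())     AS day_of_week;  -- 0 = Sunday ... 6 = Saturday
SELECT EXTRACT(MONTH FROM now());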
Individual sales
By month
Extracting fields from dates is useful when looking at how data varies by one unit of time across
a larger unit of time. For example, how do sales vary by month across years? Using sales from
2010- 2016, are sales in January usually higher than those in March?
13.4. Truncating dates
Instead of extracting single fields, you can also truncate dates and timestamps to a specified
level of precision. Remember that dates and timestamps are ordered from left to right, largest units to
smallest. You can use the date_trunc function, which is short for date truncate, to specify how much of
a timestamp to keep, as you might with a numeric value. Valid field types include all of those we
discussed except day of week. Date_trunc replaces fields smaller than, or less significant than, the one
specified with zero, or one, as appropriate. Month and day are set to 1, while time fields are set to 0.
Here, the year and month remain, and the rest of the fields are set to 0 or 1. The timezone remains
unchanged.
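For example:
-- Truncate a timestamp to month precision
SELECT date_trunc('month', '2018-12-10 13:10:04'::timestamp);
-- Result: 2018-12-01 00:00:00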
13.5. Truncate to keep large units
Individual sales
By month with year
Truncating dates is useful when you want to count, average, or sum data associated with
timestamps or dates by larger units of time. For example, starting from individual timestamped
sales transactions, what is the monthly trend in sales from June 2017 to January 2019?
Exercise
Date parts
The date_part() function is useful when you want to aggregate data by a unit of time across
multiple larger units of time. For example, aggregating data by month across different years, or
aggregating by hour across different days.
In this exercise, you'll use date_part() to gain insights about when Evanston 311 requests are
submitted and completed.
Instructions 1/3:
• How many requests are created in each of the 24 months during 2016-2017?
-- Extract the month from date_created and count requests
SELECT date_part('month', date_created) AS month,
       count(*)
  FROM evanston311
 -- Limit the date range
 WHERE date_created >= '2016-01-01'
   AND date_created < '2018-01-01'
 -- Group by month to get monthly counts
 GROUP BY month;
Instructions 2/3:
• What is the most common hour of the day for requests to be created?
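A sketch of one way to answer this:
-- Count requests by hour created; show the most common hour first
SELECT date_part('hour', date_created) AS hour,
       count(*) AS count
  FROM evanston311
 GROUP BY hour
 ORDER BY count DESC
 LIMIT 1;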
Exercise
Variation by day of week
Does the time required to complete a request vary by the day of the week on which the request
was created?
We can get the name of the day of the week by converting a timestamp to character data:
to_char(date_created, 'day')
But character names for the days of the week sort in alphabetical, not chronological, order. To
get the chronological order of days of the week with an integer value for each day, we can use:
EXTRACT(DOW FROM date_created)
Exercise
Date truncation
Recall the syntax for truncating a timestamp:
date_trunc('field', timestamp)
Using date_trunc(), find the average number of Evanston 311 requests created per day for each
month of the data. Ignore days with no requests when taking the average.
Instructions:
• Write a subquery to count the number of requests created per day.
• Select the month and average count per month from the daily_count subquery.
-- Aggregate daily counts by month
SELECT date_trunc('month', day) AS month,
       avg(count)
  -- Subquery to compute daily counts
  FROM (SELECT date_trunc('day', date_created) AS day,
               count(*) AS count
          FROM evanston311
         GROUP BY day) AS daily_count
 GROUP BY month
 ORDER BY month;
14. Aggregating with date/time series
When counting observations by month or day, the result only includes rows for values
that appear in your data. How do you find periods of time with no observations?
14.1. Generate series
Recall the generate_series function, which you used with numeric data. The same
function can be used with date/time data. generate_series expects timestamps for the from and to
arguments. Dates will automatically be cast to a timestamp. The last argument is an interval. For
example, here we have an interval of two days. The result is a series of timestamps between the
start and end values separated by the interval.
Here's an example with an interval of hours. The last value in the series will be less than
or equal to the ending timestamp specified. For example, here the series ends at 8pm on January
1st, because the next value in the series would be greater than the 0th hour of January 2nd.
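For example:
-- Timestamps from Jan 1 to Jan 15, 2018, two days apart
SELECT generate_series('2018-01-01', '2018-01-15', '2 days'::interval);

-- With an interval of hours, the series stops at or before the ending value
SELECT generate_series('2018-04-23 09:00:00',
                       '2018-04-23 14:00:00',
                       '1 hour'::interval);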
14.2. Generate series from the beginning
To get consistent values, generate series using the beginning of a month or year, not the end. For
example, attempting to generate a series for the last day in each month produces unexpected results.
When you add one month to January 31st, you get the last day in February, the 28th, because there is
no 31st. But then 1 month after February 28th is March 28th, not March 31st.
To correctly generate a series for the last day of each month, generate a series using the
beginning of each month, then subtract 1 day from the result.
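For example:
-- Last day of each month: generate the first of the following months, then subtract 1 day
SELECT generate_series('2018-02-01',
                       '2019-01-01',
                       '1 month'::interval) - '1 day'::interval AS last_day_of_month;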
Normal aggregation
Series can also be used to find units of time with no observations. For example, you
might want to count sales by the hour of the day they occurred. Here's some sample sales data in
its original form. Then with the number of sales counted by hour. Looking at the counts, it's hard
to tell at a glance that there were no sales in the 11 o'clock hour.
14.3. Aggregation with series
To include hours with no sales, generate a series of hours, and then join this to the original
data to introduce rows for the missing hours. First, use a WITH clause to create the series of
hours from 9am to 2pm and call this hour_series. Then, join this to the sales data, matching the
hour from the series to the sales date truncated to the hour. Count the date column, instead of
counting the rows, because we don't want to count null values. Group and order by hours to get
the count of sales per hour.
The result now includes all hours between 9am and 2pm, with zeros for hours with no sales.
We're less likely to overlook that some hours have no sales.
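A sketch of the approach, assuming a hypothetical sales table with a timestamp column named date:
WITH hour_series AS (
     SELECT generate_series('2018-04-23 09:00:00',
                            '2018-04-23 14:00:00',
                            '1 hour'::interval) AS hours)
-- Count date, not *, so hours with no sales get a count of 0
SELECT hours, count(date)
  FROM hour_series
       LEFT JOIN sales
       ON hours = date_trunc('hour', date)
 GROUP BY hours
 ORDER BY hours;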
14.4. Aggregation with bins
If you want to aggregate data by an interval that is not equal to one unit of a date/time field,
you can create bins. Recall this strategy from working with numeric data. Let's count sales in 3
hour intervals during the day. First, create two series, one for the lower bound of each bin and
one for the upper. The series for the upper bound starts and ends 3 hours after the lower bound.
This is the amount of the interval. We alias the result as bins. Then, join bins to the sales data,
where the sales date is greater than or equal to the lower bin and less than the upper bin. Then
group and order by the bin bounds.
The result is the count of sales made during each of the three hour intervals.
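A sketch of the same idea with 3-hour bins, again assuming the hypothetical sales table:
WITH bins AS (
     SELECT generate_series('2018-04-23 09:00:00',
                            '2018-04-23 15:00:00',
                            '3 hours'::interval) AS lower,
            generate_series('2018-04-23 12:00:00',
                            '2018-04-23 18:00:00',
                            '3 hours'::interval) AS upper)
SELECT lower, upper, count(date)
  FROM bins
       LEFT JOIN sales
       ON date >= lower
          AND date < upper
 GROUP BY lower, upper
 ORDER BY lower;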
Exercise
Find missing dates
The generate_series() function can be useful for identifying missing dates.
Recall:
generate_series(from, to, interval)
where from and to are dates or timestamps, and interval can be specified as a string with a
number and a unit of time, such as '1 month'.
Are there any days in the Evanston 311 data where no requests were created?
Instructions:
• Write a subquery using generate_series() to get all dates between the min() and max()
date_created in evanston311.
• Write another subquery to select all values of date_created as dates from
evanston311.
• Both subqueries should produce values of type date (look for the ::).
• Select dates (day) from the first subquery that are NOT IN the results of the second
subquery. This gives you days that are not in date_created.
SELECT day
-- 1) Subquery to generate all dates
-- from min to max date_created
  FROM (SELECT generate_series(min(date_created),
                               max(date_created),
                               '1 day')::date AS day
          -- What table is date_created in?
          FROM evanston311) AS all_dates
 -- 4) Select dates (day from above) that are NOT IN the subquery
 WHERE day NOT IN
       -- 2) Subquery to select all date_created values as dates
       (SELECT date_created::date
          FROM evanston311);
Exercise
Custom aggregation periods
Find the median number of Evanston 311 requests per day in each six month period from 2016-
01- 01 to 2018-06-30. Build the query following the three steps below.
Recall that to aggregate data by non-standard date/time intervals, such as six months, you can
use generate_series() to create bins with lower and upper bounds of time, and then summarize
observations that fall in each bin.
Remember: you can access the slides with an example of this type of query using the PDF icon
link in the upper right corner of the screen.
Instructions 1/3:
• Use generate_series() to create bins of 6 month intervals. Recall that the upper bin
values are exclusive, so the values need to be one day greater than the last day to be
included in the bin.
• Notice how in the sample code, the first bin value of the upper bound is July 1st, and
not June 30th.
• Use the same approach when creating the last bin values of the lower and upper
bounds (i.e. for 2018).
-- Generate 6 month bins covering 2016-01-01 to 2018-06-30
Instructions 2/3:
• Count the number of requests created per day, including days with no requests. Note that
because this step does not generate bins, you can use June 30th as your series end date.
Instructions 3/3:
• Assign each daily count to a single 6 month bin by joining bins to daily_counts.
• Compute the median value per bin using percentile_disc().
-- Bins from Step 1
WITH bins AS (
     SELECT generate_series('2016-01-01',
                            '2018-01-01',
                            '6 months'::interval) AS lower,
            generate_series('2016-07-01',
                            '2018-07-01',
                            '6 months'::interval) AS upper),
-- Daily counts from Step 2
daily_counts AS (
     SELECT day, count(date_created) AS count
       FROM (SELECT generate_series('2016-01-01',
                                    '2018-06-30',
                                    '1 day'::interval)::date AS day) AS daily_series
            LEFT JOIN evanston311
            ON day = date_created::date
      GROUP BY day)
-- Select bin bounds
SELECT lower,
       upper,
       -- Compute median of count for each bin
       percentile_disc(0.5) WITHIN GROUP (ORDER BY count) AS median
-- Join bins and daily_counts
  FROM bins
       LEFT JOIN daily_counts
       -- Where the day is between the bin bounds
       ON day >= lower
          AND day < upper
 -- Group by bin bounds
 GROUP BY lower, upper
 ORDER BY lower;
Exercise
Monthly average with missing dates
Find the average number of Evanston 311 requests created per day for each month of the data.
Instructions:
• Generate a series of dates from 2016-01-01 to 2018-06-30.
• Join the series to a subquery to count the number of requests created per day.
• Use date_trunc() to get months from date, which has all dates, NOT day.
• Use coalesce() to replace NULL count values with 0. Compute the average of this
value.
-- Generate a series with all days from 2016-01-01 to 2018-06-30
WITH all_days AS
     (SELECT generate_series('2016-01-01',
                             '2018-06-30',
                             '1 day'::interval) AS date),
     -- Subquery to compute daily counts
     daily_count AS
     (SELECT date_trunc('day', date_created) AS day,
             count(*) AS count
        FROM evanston311
       GROUP BY day)
-- Aggregate daily counts by month using date_trunc
SELECT date_trunc('month', date) AS month,
       -- Use coalesce to replace NULL count values with 0
       avg(coalesce(count, 0)) AS average
  FROM all_days
       LEFT JOIN daily_count
       -- Joining condition
       ON all_days.date = daily_count.day
 GROUP BY month
 ORDER BY month;
15. Time between events
You know how to subtract one date from another. But how do you find out how much
time has passed between events, when the dates or timestamps are all saved in the same column?
15.1. The problem
For example, here is data from a sales table with a timestamp for each sale. Our question
is: how much time passes on average between each sale?
15.2. Lead and lag
The lead and lag functions let us offset the ordered values in a column by 1 row by default.
Then we can subtract the original values from the lead or lag of the values to get the difference
between events. Before we talk about the syntax, let's look at the results of the function calls.
The lag function pushes all of the values down one row. NULL is inserted at the beginning of the
lag column, so that the first sales time, at 9:07, is now the second value. The last sales time is
discarded. The lead function does the opposite, pulling all values up one row. The second sales
time of 9:13 becomes the first value, and NULL is added at the end of the lead column. The first
sales time is discarded.
Okay, back to the syntax. For the lead and lag functions to work, you have to specify how
rows should be ordered. Remember that the rows in a database table have no inherent order to
them - they are only ordered when you explicitly specify an order. Lead and lag are window
functions. You start with the function name, and supply the column you want to apply the lead or
lag to as the argument. You then add an "over" clause with the keyword OVER and an "order
by" statement specifying how the rows should be ordered. The "order by" statement goes in
parentheses. In this example, we have ordered by the same column that we want to lead and lag:
date, but this isn't a requirement. We'll see an example where these columns are different shortly.
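A sketch of the syntax, assuming a hypothetical sales table with a timestamp column named date:
SELECT date,
       lag(date)  OVER (ORDER BY date) AS previous,
       lead(date) OVER (ORDER BY date) AS next
  FROM sales;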
15.3. Time between events
But first, how do we use lead and lag to compute the time between sales? If we order
dates from oldest to newest, we want to subtract the lagged date from the current date to compute
the gap between each sale and the previous sale. We have one less gap value than the number of
sales.
15.4. Average time between events
To compute the average gap, we need to use a subquery. We cannot simply wrap the
average function around the difference between sales because window functions can't be used
inside aggregation functions like average. The average time between sales is 32 minutes and 15
seconds.
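A sketch of that subquery approach, using the same hypothetical sales table:
-- Average the gaps computed in a subquery; window functions can't be nested inside avg()
SELECT avg(gap)
  FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
          FROM sales) AS gaps;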
15.5. Change in a time series
The lead and lag functions are not limited to date/time data. As mentioned before, you can
order the rows of the table by one column while getting the lead or lag of a different column. We
often want to do this to compute changes in a time series. A time series is any variable that has a
date or time associated with each value. Here, we want to see not how much time passes between
each sale, but how the amount sold changes from one sale to the next. We can't compute a
change for the first value in a time series because there is no previous value. Looking at the
change column, for the second sale, the amount was 19 less than the first sale. The last sale was
35 more than the previous sale.
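A sketch, assuming the hypothetical sales table also has an amount column:
-- Order rows by date, but lag the amount column to get the change from the previous sale
SELECT date,
       amount,
       amount - lag(amount) OVER (ORDER BY date) AS change
  FROM sales;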
Exercise
Longest gap
What is the longest time between Evanston 311 requests being submitted?
Instructions:
• Select date_created and the date_created of the previous request using lead() or lag()
as appropriate.
• Compute the gap between each request and the previous request.
• Select the row with the maximum gap.
-- Compute the gaps
WITH request_gaps AS (
     SELECT date_created,
            -- lead or lag
            lag(date_created) OVER (ORDER BY date_created) AS previous,
            -- compute gap as date_created minus lead or lag
            date_created - lag(date_created) OVER (ORDER BY date_created) AS gap
       FROM evanston311)
-- Select the row with the maximum gap
SELECT *
  FROM request_gaps
 -- Subquery to select maximum gap from request_gaps
 WHERE gap = (SELECT max(gap)
                FROM request_gaps);
Exercise
Rats!
Investigate in 4 steps:
1. Why is the average so high? Check the distribution of completion times. Hint: date_trunc() can be used on intervals.
2. See how excluding outliers influences average completion times.
3. Do requests made in busy months take longer to complete? Check the correlation between the average completion time and requests per month.
4. Compare the number of requests created per month to the number completed.
Instructions 1/4:
• Use date_trunc() to examine the distribution of rat request completion times by
number of days.
-- Truncate the time to complete requests to the day
SELECT date_trunc('day', date_completed - date_created) AS completion_time,
       -- Count requests with each truncated time
       count(*)
  FROM evanston311
 -- Where category is rats
 WHERE category = 'Rodents- Rats'
 -- Group and order by the variable of interest
 GROUP BY completion_time
 ORDER BY completion_time;
Instructions 2/4:
• Compute average completion time per category excluding the longest 5% of requests
(outliers).
SELECT category,
       -- Compute average completion time per category
       avg(date_completed - date_created) AS avg_completion_time
  FROM evanston311
 -- Where completion time is less than the 95th percentile value
 WHERE date_completed - date_created <
       -- Compute the 95th percentile of completion time in a subquery
       (SELECT percentile_disc(0.95) WITHIN GROUP (ORDER BY date_completed - date_created)
          FROM evanston311)
 GROUP BY category
 -- Order the results
 ORDER BY avg_completion_time DESC;
Instructions 3/4:
• Get corr() between avg. completion time and monthly requests. EXTRACT(epoch
FROM interval) returns seconds in interval.
-- Compute correlation (corr) between
-- avg_completion time and count from the subquery
SELECT corr(avg_completion, count)
-- Convert date_created to its month with date_trunc
  FROM (SELECT date_trunc('month', date_created) AS month,
               -- Compute average completion time in number of seconds