Exploratory Data Analysis in SQL - Edited
One note before we start. This course uses PostgreSQL. Many of the functions we'll use are
also available in other SQL database systems, but their names or syntax may be different. If you're
using another database system, you should refer to the system's documentation to learn the correct
syntax. With that, let's get started.
You've finally been granted access to your company's database. Yay! But
where do you begin? What are the tables? How are they related? What columns exist in the tables? A
database client is a program used to connect to, and work with, a database. There are many different
database clients. Each one has a different way to retrieve information on the table names, the columns in
each table, and the formal relationships between the tables. Refer to your client program's
documentation to find the commands to extract this information.
1.3. Entity relationship diagram
You may also be given information about the structure of the database from the database
owner or creator. One type of documentation is an entity-relationship diagram that shows the tables,
their columns, and the relationships between the tables. Here is the entity-relationship diagram for the
database for this course. There are six tables.
1.3.1. ER diagram: Evanston311
The evanston311 table contains help requests sent to the city of Evanston, Illinois.
1.3.2. ER diagram: fortune500
fortune500 contains information on the 500 largest US companies by revenue from 2017.
1.3.3. ER diagram: stackoverflow
stackoverflow contains data from the popular programming question and answer site. It
includes daily counts of the number of questions that were tagged as being related to select technology
companies.
1.3.4. ER diagram: supporting
Once you know the names of the tables in the database, one way to get a sense of what's in a
table is to simply select a few rows from it. Here we use the star to select all columns from the company
table and use limit to return only five rows. Remember that the rows returned from a table are in no
particular order by default.
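For example, a quick first look might use a query like this (the company table comes from this course's database):
-- Select all columns from the company table, limiting the result to five rows
SELECT *
FROM company
LIMIT 5;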
1.5. A few reminders
Code | Note
NULL | Missing value
IS NULL, IS NOT NULL | Don't use = NULL to check for NULL
count(*) | Number of rows
count(column_name) | Number of non-NULL values
count(DISTINCT column_name) | Number of different non-NULL values
SELECT DISTINCT column_name … | Distinct values, including NULL
As you start to explore the contents of a table, keep a few additional things in mind. NULL
indicates missing data in a database. To check which values are NULL, use "is NULL" or "is not
NULL", not an equals sign. The count function with a star counts the number of rows. If you instead
supply a column name to the count function, it counts the number of non-NULL observations in the
column. This is equal to the total number of rows, minus the number of NULL values. If you count the
distinct values of a column, you'll get the number of different non-NULL values in the column. But if
you select those distinct values directly, NULL will be included as a value if it exists in the column,
even though it isn't counted by the count function.
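As a quick sketch of these reminders, using the fortune500 table from this course:
-- Total rows, non-NULL industry values, and distinct non-NULL industry values
SELECT count(*) AS total_rows,
       count(industry) AS non_null_industry,
       count(DISTINCT industry) AS distinct_industry
FROM fortune500;

-- Distinct industry values; NULL appears in this result if the column contains it
SELECT DISTINCT industry
FROM fortune500;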
Exercise
Count missing values
Which column of fortune500 has the most missing values? To find out, you'll need to check each
column individually, although here we'll check just two: ticker and industry.
Course Note: While you're unlikely to encounter this issue during this exercise, note that if you
run a query that takes more than a few seconds to execute, your session may expire or you may
be disconnected from the server. You will not have this issue with any of the exercise solutions,
so if your session expires or disconnects, there's an error with your query.
Instructions 1/2:
• Subtract the count of the non-null ticker values from the total number of rows in
fortune500; alias the difference as missing.
Instructions 2/2:
• Repeat for the industry column: subtract the count of the non-null industry values from
the total number of rows in fortune500; alias the difference as missing.
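A possible solution for these two steps (the solution code isn't included in these notes) is sketched below:
-- Missing (NULL) values of ticker
SELECT count(*) - count(ticker) AS missing
FROM fortune500;

-- Missing (NULL) values of industry
SELECT count(*) - count(industry) AS missing
FROM fortune500;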
Exercise
Join tables
Part of exploring a database is figuring out how tables relate to each other.
The company and fortune500 tables don't have a formal relationship between them in the
database, but this doesn't prevent you from joining them.
To join the tables, you need to find a column that they have in common where the values are
consistent across the tables. Remember: just because two tables have a column with the same
name, it doesn't mean those columns necessarily contain compatible data. If you find more than
one pair of columns with similar data, you may need to try joining with each in turn to see if you
get the same number of results.
Instructions
• Closely inspect the contents of the company and fortune500 tables to find a column
present in both tables that can also be considered to uniquely identify each company.
• Join the company and fortune500 tables with an INNER JOIN.
• SELECT company.name
• -- Table(s) to select from
• FROM company
• INNER JOIN fortune500
• ON company.ticker=fortune500.ticker;
2. The keys to the database
Foreign keys are the formal way that database tables are linked together. In this example, the actor_id
column in the film_actor table is a foreign key that references the id column of the actor table.
• Reference another row
o In a different table or the same table
o Via a unique ID
➢ Primary key column containing unique, non-NULL values
• Values restricted to values in referenced column OR NULL
A foreign key is a column that references a single, specific row in the database. The referenced row is
usually in a different table, but foreign keys can reference rows in the same table as well. Foreign keys
reference other rows using a unique identifier for the row. The unique ID often comes from a primary
key column in the referenced table. Primary keys are specially designated columns where each row has
a unique, non-null value. Foreign key columns are restricted to contain either a value that is in the
referenced column, or null. If the value is null, it indicates that there's no relationship for that row.
2.2. ER diagram
Let's look at the entity relationship diagram for our database. In the diagram, foreign keys are indicated
on the arrows between tables.
The value before the colon is the name of the column in the table from which the arrow
originates. The value after the colon is the name of the referenced column in the table
the arrow is pointing to. So the company_id column in the tag_company table refers to
the id column in the company table.
When an arrow points from and to the same table, this is a self reference. parent_id in the company table
references the id column in the same table.
Note that there's no foreign key linking the company table to the fortune500 table. But this doesn't
prevent us from joining these tables. Both tables have ticker columns with comparable values that can
be used to join the tables. The lack of a foreign key relationship just means that the values in the ticker
columns aren't restricted to the set of values in the other table.
2.3. Primary Keys
The diagram also shows which columns are primary keys. Primary keys have a
border around them at the top of each list of columns. Primary keys uniquely identify
the rows in the table.
2.4. Coalesce function
Before you return to the exercises, let's add the coalesce function to your toolkit. coalesce takes two or
more values or column names as arguments. The three dots in square brackets here indicate that
additional values can be supplied as inputs. The coalesce function operates row-wise on the input. It
returns the first non-NULL value in each row, checking the columns in the order they're supplied to
the function.
Here's an example. We have a table called prices with two columns. Remember that blanks are null
values. We can use coalesce to combine these two columns. If column_1 is not null, coalesce
returns that value. If column_1 is null, coalesce returns the value of column_2. In this example, the first
value returned by coalesce is 10. This is because, in the first row of prices, the value of column_1 is
NULL. So coalesce returns the value of column_2. Coalesce returned four values because there
were four rows in the input. Coalesce is useful for specifying default or backup values when selecting
a column that might contain NULL values.
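A sketch of the example above, assuming the slide's table is named prices with columns column_1 and column_2:
-- Return column_1 when it isn't NULL; otherwise fall back to column_2
SELECT coalesce(column_1, column_2) AS combined
FROM prices;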
Exercise
Read an entity relationship diagram
The information you need is sometimes split across multiple tables in the database.
What is the most common stackoverflow tag_type? What companies have a tag of that
type? To generate a list of such companies, you'll need to join three tables together.
Reference the entity relationship diagram as needed when determining which columns to use
when joining tables.
Instructions 1/2:
• First, using the tag_type table, count the number of tags with each type.
• Order the results to find the most common tag type.
• -- Count the number of tags with each type
• SELECT type, COUNT(*) AS count
• FROM tag_type
• -- To get the count for each type, what do you need to do?
• GROUP BY type
-- Order the results with the most common tag types listed first
ORDER BY count DESC;
Instructions 2/2:
• Join the tag_company, company, and tag_type tables, keeping only mutually occurring
records.
• Select company.name, tag_type.tag, and tag_type.type for tags with the most common
type from the previous step.
• -- Select the 3 columns desired
• SELECT company.name, tag_type.tag, tag_type.type
• FROM company
• -- Join to the tag_company table
• INNER JOIN tag_company
• ON company.id = tag_company.company_id
• -- Join to the tag_type table
• INNER JOIN tag_type
• ON tag_company.tag = tag_type.tag
• -- Filter to most common type
• WHERE type='cloud';
Exercise
Coalesce
The coalesce() function can be useful for specifying a default or backup value when a column
contains NULL values.
coalesce() checks arguments in order and returns the first non-NULL value, if one exists.
• coalesce(NULL, 1, 2) = 1
• coalesce(NULL, NULL) = NULL
• coalesce(2, 3, NULL) = 2
In the fortune500 data, industry contains some missing values. Use coalesce() to use the value of
sector as the industry when industry is NULL. Then find the most common industry.
Instructions:
• Use coalesce() to select the first non-NULL value from industry, sector, or 'Unknown' as
a fallback value.
• Alias the result of the call to coalesce() as industry2.
• Count the number of rows with each industry2 value.
• Find the most common value of industry2.
• -- Use coalesce
• SELECT COALESCE(industry, sector, 'Unknown') AS industry2,
• -- Don't forget to count!
• COUNT(*)
• FROM fortune500
• -- Group by what? (What are you counting by?)
• GROUP BY industry2
• -- Order results to see most common first
• ORDER BY COUNT(*) DESC
• -- Limit results to get just the one value you want
• LIMIT 1;
3. Column types and constraints
Now it's time to turn to the contents of individual columns: the data types and the constraints on what
values can exist in each column.
Foreign keys and primary keys are two types of constraints that limit the values in a column, but
columns can also be constrained in other ways. Unique means that each value except NULL must be
different from the values in all other rows. Not NULL means what it says - the column cannot contain
null values. Check constraints are a way of implementing additional conditions on the values of a
column, such as requiring the column only contain positive values, or ensuring that the value of one
column is greater than the value of another column.
Common
• Numeric
• Character
• Date/Time
• Boolean
Special
• Arrays
• Monetary
• Binary
• Geometric
• Network Address
• XML
• JSON
• And more!
Constraints can limit the values in a column, but the main thing that determines what values
are allowed is the column's type. Each column in the database can only store one type of data. In this
course, we're talking about three of the most common types of data: numeric, character, and date/time.
These three, along with boolean - which holds true or false values - are the most common types you'll
encounter, but they're not the only ones. There are also special data types to hold monetary values,
geometric data like points or lines, and structured data types like XML and JSON. These special types
differ more across database implementations than the four common ones.
Within the broad categories of numeric, character, or date/time data, there are multiple column
types with different details. For example, different numeric types require different amounts of memory
per row and can store different ranges of values. In the upcoming chapters, we'll talk more about these
specific types, so no need to worry about the details at this point.
3.2.2. Types in entity relationship diagrams
You can find the type of each column in the entity relationship diagram. Here is the
fortune500 table. There are three different numeric data types used in the table: integer,
real, and numeric. Even if you don't have an entity relationship diagram, the column
type is a core piece of information you can expect to find in other kinds of
documentation.
Values can be converted temporarily from one type to another through a process called
casting. When you cast a column as a different type, the data is converted to the new type only for the
current query. To change a value's type, use the cast function. First, specify the value you want to cast.
This can be a single value or the name of a column. Then use the keyword AS. Finally, specify the name
of the type you want to convert the data to. Here's an example of casting the single numeric value 3-
point-7 as an integer. Casting from numeric to integer rounds the value to the nearest integer, which is
4. To convert the type of an entire column, enter the name of the column as the value. Here, a column
called total is converted to type integer. We need a from clause to specify which table the column
comes from.
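Sketches of both forms of the cast described above; the table name used for the column example is a placeholder, since the notes don't name it:
-- Cast a single value: numeric 3.7 is rounded to integer 4
SELECT CAST(3.7 AS integer);

-- Cast an entire column (prices is a placeholder table name)
SELECT CAST(total AS integer) AS total_int
FROM prices;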
3.4. Casting with ::
There's an alternate notation for casting values: a double colon. It does the same thing as the
cast function, but it's more compact. Put the value to convert before the double colon and the type to
cast it as after the double colon. The examples here are the same as those on the previous slide, except
with the double colon notation instead of the cast function.
Exercise
Effects of casting
When you cast data from one type to another, information can be lost or changed. See how the
casting changes values and practice casting data using the CAST() function and the :: syntax.
SELECT value::new_type;
Instructions 1/3:
• Select profits_change and profits_change cast as integer from fortune500.
• Look at how the values were converted.
• -- Select the original value
• SELECT profits_change,
• -- Cast profits_change
• CAST(profits_change AS integer) AS profits_change_int
• FROM fortune500;
Instructions 2/3:
• Compare the result of dividing the integer value 10 by 3 to the result of dividing the
numeric value 10 by 3.
-- Divide 10 by 3
SELECT 10/3,
-- Cast 10 as numeric and divide by 3
10::numeric/3;
Instructions 3/3:
• Now cast numbers that appear as text as numeric.
• Note: 1e3 is scientific notation.
SELECT '3.2'::numeric,
'-123'::numeric,
'1e3'::numeric,
'1e-3'::numeric,
'02314'::numeric,
'0002'::numeric;
Exercise
Summarize the distribution of numeric values
Was 2017 a good or bad year for revenue of Fortune 500 companies? Examine how
revenue changed from 2016 to 2017 by first looking at the distribution of revenues_change and
then counting companies whose revenue increased.
Instructions 1/3:
• Use GROUP BY and count() to examine the values of revenues_change.
• Order the results by revenues_change to see the distribution.
-- Select the count of each value of revenues_change
SELECT revenues_change, COUNT(*)
FROM fortune500
GROUP BY revenues_change
-- order by the values of revenues_change
ORDER BY revenues_change;
Instructions 2/3:
• Repeat step 1, but this time, cast revenues_change as an integer to reduce the number of
different values.
• -- Select the count of each revenues_change integer value
• SELECT revenues_change::integer, count(*)
• FROM fortune500
• GROUP BY revenues_change::integer
• -- order by the values of revenues_change
• ORDER BY revenues_change;
Instructions 3/3:
• How many of the Fortune 500 companies had revenues increase in 2017 compared to
2016? To find out, count the rows of fortune500 where revenues_change indicates an
increase.
-- Count rows
SELECT COUNT(*)
FROM fortune500
-- Where...
WHERE revenues_change > 0;
In this chapter, we'll focus on numeric data. This includes both columns, or variables,
that only take on integer whole number values and variables with decimal values.
4.2. Division
The most notable example is division. When you divide integers, the result is truncated to also be an
integer. So integer 10 divided by integer 4 returns integer value 2. But integer 10 divided by numeric 4-
point-0 returns 2-point-5. Now that we've covered the different data types, how do we start exploring
numeric data?
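As a quick illustration of integer versus numeric division:
SELECT 10/4,      -- integer division truncates: 2
       10/4.0;    -- numeric division keeps the decimal part: 2.5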
4.3. Range: min and max
It's always good to check the range and summary statistics of the values in a column. Get the
range with the min and max functions, which return the minimum and maximum values of their
input respectively. Here, we take the min and max of the question_pct column in the
stackoverflow table. The column tells us the proportion of total questions for a day with the
specified tag.
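A minimal sketch of the range check described above:
-- Minimum and maximum of question_pct
SELECT min(question_pct),
       max(question_pct)
FROM stackoverflow;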
4.4. Average or mean
The avg function computes the mean, or average, of the non-NULL values in its input.
4.5. Variance
Population variance: sum of squared differences from the mean, divided by the number of values n
Sample variance: sum of squared differences from the mean, divided by n - 1
Variance is a statistical measure of the amount of dispersion in a set of values. It tells you
how far spread values are from their mean. Larger values indicate greater dispersion. Variance
can be computed for a sample of data or for the population. The formula is the same except that
population variance divides by the number of values, while the sample variance divides by the
number of values minus one. The var_pop function computes population variance. The var_samp
function computes sample variance. The sample variance will always be slightly larger than the
population variance. The variance function is an alias for var_samp.
4.6. Standard deviation
Standard deviation is another measure of variance. It is the square root of the variance. Like variance,
there are also functions for both sample and population versions of standard deviation.
4.7. Round
Functions can return results with many decimal places. To make results easier to read, use the
round function to round a value of numeric type to a specified number of decimal places. The
round function takes a numeric value or column as the first argument, and the number of decimal
places to keep as the second argument.
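A sketch combining these summary functions on the fortune500 profits column; the casts are needed because round with a decimal-places argument expects a numeric input:
SELECT round(avg(profits)::numeric, 2) AS mean_profit,
       round(var_samp(profits)::numeric, 2) AS var_profit,
       round(stddev(profits)::numeric, 2) AS sd_profit
FROM fortune500;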
4.8. Summarize by group
In addition to computing summary measures for entire columns, it's also good practice to
summarize variables by groups in the data. For example, in addition to summarizing the
question_pct column in the stackoverflow table overall, we also want to compute summary
measures for each tag. The output here is truncated. The numbers with an e in them are in
scientific notation.
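A sketch of the per-tag summary described above:
-- Summary measures of question_pct for each tag
SELECT tag,
       min(question_pct),
       avg(question_pct),
       max(question_pct),
       stddev(question_pct)
FROM stackoverflow
GROUP BY tag;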
Exercise
Division
Compute the average revenue per employee for Fortune 500 companies by sector.
Instructions:
• Compute revenue per employee by dividing revenues by employees; casting is used here
to produce a numeric result.
• Take the average of revenue per employee with avg(); alias this as avg_rev_employee.
• Group by sector.
• Order by the average revenue per employee.
• -- Select average revenue per employee by sector
• SELECT sector,
• AVG(revenues/employees::numeric) AS avg_rev_employee
• FROM fortune500
• GROUP BY sector
• -- Use the column alias to order the results
• ORDER BY avg_rev_employee;
Exercise
Explore with division
In exploring a new database, it can be unclear what the data means and how columns are related
to each other.
What information does the unanswered_pct column in the stackoverflow table contain? Is it the
percent of questions with the tag that are unanswered (unanswered ?s with tag/all ?s with tag)?
Or is it something else, such as the percent of all unanswered questions on the site with the tag
(unanswered ?s with tag/all unanswered ?s)?
Divide unanswered_count (unanswered ?s with tag) by question_count (all ?s with tag) to see if
the value matches that of unanswered_pct to determine the answer.
Instructions:
• Exclude rows where question_count is 0 to avoid a divide by zero error.
• Limit the result to 10 rows.
• -- Divide unanswered_count by question_count
• SELECT unanswered_count/question_count::numeric AS computed_pct,
• -- What are you comparing the above quantity to?
• unanswered_pct
• FROM stackoverflow
• -- Select rows where question_count is not 0
• WHERE question_count > 0
• LIMIT 10;
Exercise
Summarize numeric columns
Summarize the profit column in the fortune500 table using the functions you've learned.
You can access the course slides for reference using the PDF icon in the upper right corner of the
screen.
Instructions 1/2
• Compute the min(), avg(), max(), and stddev() of profits; don't use any aliases here.
-- Select min, avg, max, and stddev of fortune500 profits
SELECT min(profits),
avg(profits),
max(profits),
stddev(profits)
FROM fortune500;
Instructions 2/2
• Repeat Step 1, but this time, creating a grouped summary of profits by sector, ordering
the results by the average profits for each sector; don't use any aliases here.
• -- Select sector and summary measures of fortune500 profits
• SELECT sector, min(profits),
• avg(profits),
• max(profits),
• stddev(profits)
•
• FROM fortune500
• -- What to group by?
• GROUP BY sector
• -- Order by the average profits
• ORDER BY avg;
Exercise
Summarize group statistics
Sometimes you want to understand how a value varies across groups. For example, how does the
maximum value per group vary across groups?
To find out, first summarize by group, and then compute summary statistics of the group results.
One way to do this is to compute group values in a subquery, and then summarize the results of
the subquery.
For this exercise, what is the standard deviation across tags in the maximum number of Stack
Overflow questions per day? What about the mean, min, and max of the maximums as well?
Instructions
• Start by writing a subquery to compute the max() of question_count per tag; alias the
subquery result as maxval.
• Then compute the standard deviation of maxval with stddev().
• Compute the min(), max(), and avg() of maxval too.
• -- Compute standard deviation of maximum values
• SELECT stddev(maxval),
• -- min
• min(maxval),
• -- max
• max(maxval),
• -- avg
• avg(maxval)
• -- Subquery to compute max of question_count by tag
• FROM (SELECT max(question_count) AS maxval
• FROM stackoverflow
-- Compute max by...
GROUP BY tag) AS max_results; -- alias for subquery
5. Exploring distributions
Understanding the distribution of a variable is crucial for finding errors, outliers, and
other anomalies in the data.
For columns with a small number of discrete values, we can view the distribution by
counting the number of observations with each distinct value. We group by, and order the results
by, the column of interest. There are 20 distinct values in the unanswered_count column in the
stackoverflow data with the tag amazon-ebs. Only partial results are shown here. Twenty values
are manageable to examine, but when the variable you're interested in takes on many different
values, binning or grouping the values can make the output more useful.
5.2. Truncate
One way to do this is with the trunc function. Trunc is short for truncate. The trunc
function reduces the precision of a number. This means replacing the smallest numeric places -
the right-most digits - with zeros. Truncating is not the same as rounding: you'll never get a
result with a larger absolute value than the original number. Trunc takes two arguments: the
value to truncate and the number of places to truncate it to. Positive values for the second
argument indicate the number of digits after the decimal to keep. For example, truncating 42-
point-1256 to 2 places keeps only the first two digits after the decimal. Negative values for the
second argument indicate places before the decimal to replace with zero. For example, truncating
12,345 to -3 replaces the three digits to the
left of the decimal with zero.
We can use the trunc function to group values in the unanswered_count column into three
groups based on the digit in the tens place of the number. Note that the second argument to the trunc
function here is a -1. There are 74 values between 30 and 39.
5.3. Generate series
What if you want to group values by a quantity other than the place value of a number, such as by
units of 5 or 20? The generate_series function can help. It generates a series of numbers from a
starting value to an ending value, inclusive, by steps of a third value.
For example, we can generate a series from 1 to 10 by steps of 2, or a series from 0 to 1 by steps of
1/10th.
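For example:
SELECT generate_series(1, 10, 2);     -- 1, 3, 5, 7, 9
SELECT generate_series(0, 1, 0.1);    -- 0, 0.1, 0.2, ... 1.0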
5.4. Create bins: output
generate_series can be used to group values into bins. Here's an example of what we want to
create: a series of lower and upper values, and the count of the number of observations falling in each
bin.
5.4.1. Create bins: query
Let's build the query to create that output. A WITH clause allows us to alias the results of
a subquery to use later in the query. Here, we generate two series: one for the lower bounds of
the bins and another for the upper. We name this "bins." Because we're only summarizing data
for tag amazon-ebs, we also create that subset of the stackoverflow table and call it ebs. Then
write the main select query to join the results of the subqueries we created and count the values.
We join ebs to bins where the column unanswered_count is greater than or equal to the lower
bound and strictly less than the upper bound. A left join keeps all bins in the result, even those
with no values in them. Finally, group by the lower and upper bin values to count the values in
each bin.
Each row in the output has the count of days where the number of unanswered questions
was greater than or equal to the lower bound and strictly less than the upper bound. Note that the
result contains bins with 0 values. This is because we counted non-null values of
unanswered_count instead of just the number of rows.
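A sketch of the binning query described above; the bin bounds (0 to 150 by 30s) are made up here, since the slide's exact bounds aren't in these notes:
WITH bins AS (
      SELECT generate_series(0, 120, 30) AS lower,
             generate_series(30, 150, 30) AS upper),
     ebs AS (
      SELECT unanswered_count
      FROM stackoverflow
      WHERE tag = 'amazon-ebs')
-- Count the values falling in each bin; the LEFT JOIN keeps empty bins
SELECT lower, upper, count(unanswered_count)
FROM bins
LEFT JOIN ebs
       ON unanswered_count >= lower
      AND unanswered_count < upper
GROUP BY lower, upper
ORDER BY lower;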
Exercise
Truncate
Use trunc() to examine the distributions of attributes of the Fortune 500 companies.
Remember that trunc() truncates numbers by replacing lower place value digits with zeros:
trunc(value_to_truncate, places_to_truncate)
Negative values for places_to_truncate indicate digits to the left of the decimal to replace, while
positive values indicate digits to the right of the decimal to keep.
Instructions 1/2:
• Use trunc() to truncate employees to the 100,000s (5 zeros).
• Count the number of observations with each truncated value.
• -- Truncate employees
• SELECT trunc(employees, -5) AS employee_bin,
-- Count number of companies with each truncated value
COUNT(*)
FROM fortune500
-- Use alias to group
GROUP BY employee_bin
-- Use alias to order
ORDER BY employee_bin;
Instructions 2/2:
• Repeat step 1 for companies with < 100,000 employees (most common).
• This time, truncate employees to the 10,000s place.
• -- Truncate employees
• SELECT TRUNC(employees, -4) AS employee_bin,
• -- Count number of companies with each truncated value
• COUNT(*)
• FROM fortune500
• -- Limit to which companies?
• WHERE employees < 100000
• -- Use alias to group
• GROUP BY employee_bin
• -- Use alias to order
• ORDER BY employee_bin;
Exercise
Generate series
Summarize the distribution of the number of questions with the tag "dropbox" on Stack
Overflow per day by binning the data.
Recall:
You can reference the slides using the PDF icon in the upper right corner of the screen.
Instructions 1/3:
• Start by selecting the minimum and maximum of the question_count column for the tag
'dropbox' so you know the range of values to cover with the bins.
-- Select the min and max of question_count
SELECT min(question_count),
max(question_count)
-- From what table?
FROM stackoverflow
-- For tag dropbox
WHERE tag = 'dropbox';
Instructions 2/3:
• Next, use generate_series() to create bins of size 50 from 2200 to 3100.
o To do this, you need an upper and lower bound to define a bin.
o This will require you to modify the stopping value of the lower bound and the
starting value of the upper bound by the bin width.
• -- Create lower and upper bounds of bins
• SELECT generate_series(2200, 3050, 50) AS lower,
• generate_series(2250, 3100, 50) AS upper;
Instructions 3/3:
• Select lower and upper from bins, along with the count of values within each bin
bounds.
• To do this, you'll need to join 'dropbox', which contains the question_count for tag
"dropbox", to the bins created by generate_series().
• The join should occur where the count is greater than or equal to the lower bound,
and strictly less than the upper bound.
• -- Bins created in Step 2
• WITH bins AS (
• SELECT generate_series(2200, 3050, 50) AS lower,
• generate_series(2250, 3100, 50) AS upper),
• -- Subset stackoverflow to just tag dropbox (Step 1)
• dropbox AS (
• SELECT question_count
• FROM stackoverflow
• WHERE tag='dropbox')
• -- Select columns for result
• -- What column are you counting to summarize?
• SELECT lower, upper, count(question_count)
• FROM bins -- Created above
• -- Join to dropbox (created above), keeping all rows from the bins table in the join
• LEFT JOIN dropbox
• -- Compare question_count to lower and upper
• ON question_count >= lower
• AND question_count < upper
• -- Group by lower and upper to count values in each bin
• GROUP BY lower, upper
• -- Order by lower to put bins in order
• ORDER BY lower;
6. More summary functions
You've learned several functions to help you explore numeric data. Now it's time to add a few more.
6.1. Correlation
So far, we've summarized individual columns. But sometimes we want to understand the
relationship between two columns. Correlation is one measure of the relationship between two
variables. A correlation coefficient can range from 1 to -1, with larger values indicating a
stronger positive relationship, and more negative values indicating a stronger negative
relationship.
6.1.1. Correlation function
The corr function takes the names of two columns as arguments and returns the
correlation between them. Rows with a null value in either column are excluded.
6.2. Median
Another common summary measure is the median. The median is the 50th percentile,
or midpoint, in a sorted list of values.
To get the median, use a percentile function. The syntax for the percentile functions is
different than for other functions you've seen because the data must be ordered to do the
computation. It's called ordered-set aggregate syntax. The only argument to the function is a
number between 0 and 1 corresponding to the percentile you want. You then type "within
group", and then, inside parentheses, order by and the name of the column you want to compute
the percentile for. percentile d-i-s-c, or discrete, always returns a value that exists in the column.
percentile c-o-n-t, or continuous, interpolates between values around the specified percentile. It
can return a value that is not in the original data.
6.2.2. Percentile examples
Here's an example. We have four numbers: 1, 3, 4, and 5. The two percentile functions return
different values for the median. The discrete percentile function returns 3, while the
continuous percentile function interpolates between 3 and 4, to return 3-point-5. The
formula used to compute percentiles is fairly complex, and sometimes the results may
not be intuitive. In particular, you may be used to computing the median of an even
number of values as the average of the two middle
values. Be aware that these functions may not always return that value as the 50th percentile.
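A sketch of the slide's example with the values 1, 3, 4, and 5, using a VALUES list in place of a real table:
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY val) AS median_disc,   -- 3
       percentile_cont(0.5) WITHIN GROUP (ORDER BY val) AS median_cont    -- 3.5
FROM (VALUES (1), (3), (4), (5)) AS t(val);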
Exercise
Correlation
What's the relationship between a company's revenue and its other financial attributes? Compute
the correlation between revenues and other financial variables with the corr() function.
Instructions:
• Compute the correlation between revenues and profits.
• Compute the correlation between revenues and assets.
• Compute the correlation between revenues and equity.
-- Correlation between revenues and profit
SELECT corr(revenues,profits) AS rev_profits,
-- Correlation between revenues and assets
corr(revenues,assets) AS rev_assets,
-- Correlation between revenues and equity
corr(revenues,equity) AS rev_equity
FROM fortune500;
Exercise
Mean and Median
Compute the mean (avg()) and median assets of Fortune 500 companies by sector.
percentile_disc(0.5)
WITHIN GROUP (ORDER BY column_name)
Instructions:
• Select the mean and median of assets.
• Group by sector.
• Order the results by the mean.
-- What groups are you computing statistics by?
SELECT sector,
-- Select the mean of assets with the avg function
avg(assets) AS mean,
-- Select the median
percentile_disc(0.5) WITHIN GROUP (ORDER BY assets)
AS median
FROM fortune500
-- Computing statistics for each what?
GROUP BY sector
-- Order results by a value of interest
ORDER BY mean;
7. Creating temporary tables
Up to this point, you've run queries and viewed the results. But what if you want to keep
the results of a query around for reference? You need special permissions in a database to create
or update tables, but most users can create temporary tables that only they can see and that only
last for the duration of a database session.
7.1. Syntax
One way to create a temporary table is with a select query. The results of the query are saved
as a table that you can use later. To do this, we preface any select query with the words create
temp table, then a name for the table we're creating, and finally the keyword as. This copies the
result of the select query into a new table that has no connection to the original table. There are
other ways to create temporary tables as well. You may have seen the "select into" syntax before.
You add a special clause into the middle of a select query to direct the results into a new temp
table. In this example, the added clause is the middle line of code. Both of these queries do the
same thing, just with different syntax. We're going to use the create table syntax in this course.
It's the method recommended by Postgres, and it allows you to use options not available with the
"select into" syntax.
7.2. Create a table
As an example let's make a temporary table called top_companies with just the rank and title of
the top 10 companies in fortune500. We preface our select query with the create temp table
syntax. After we've created the table, we can then select from it. Note that the column names are
taken from the column names of the query result.
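A sketch of the top_companies example described above, using the rank and title columns of fortune500:
CREATE TEMP TABLE top_companies AS
SELECT rank, title
FROM fortune500
WHERE rank <= 10;

-- The new temp table can now be queried like any other table
SELECT *
FROM top_companies;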
7.3. Insert into table
We can also insert new rows into a table after we've created it. We use an "insert into"
statement with the name of the table, followed by a select query that will generate the rows we
want to add to the table. The columns generated by the select query must match those already in
the table. Here we add companies with ranks 11 to 20 to the table. In many database clients, after
you run the command,you'll get a confirmation message that 10 rows were inserted into the table.
In the DataCamp editor, you won't see any message when rows are inserted. Now if we select
from the temp table top_companies again, you can see the new rows have been added.
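A sketch of the insert described above:
-- Add companies ranked 11 through 20 to the existing temp table
INSERT INTO top_companies
SELECT rank, title
FROM fortune500
WHERE rank BETWEEN 11 AND 20;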
7.4. Delete(drop) table
To delete a table, use the drop table command. The table will be deleted immediately
without warning. Dropping a table can be useful if you made a mistake when creating it or when
inserting values into it. Temporary tables will also be deleted automatically when you disconnect
from the database. A variation on the drop table command adds the clause if exists before the
table name. This means to only try to delete the table after confirming that such a table exists.
This variation is often used in scripts because it won't cause an error if the table doesn't exist.
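For example:
-- Delete the table immediately
DROP TABLE top_companies;

-- Script-friendly variation: no error if the table doesn't exist
DROP TABLE IF EXISTS top_companies;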
Exercise
Create a temp table
Find the Fortune 500 companies that have profits in the top 20% for their sector (compared to
other Fortune 500 companies).
To do this, first, find the 80th percentile of profit for each sector with
percentile_disc(fraction)
WITHIN GROUP (ORDER BY sort_expression)
Then join fortune500 to the temporary table to select companies with profits greater than the
80th percentile cut-off.
Instructions 1/2:
• Create a temporary table called profit80 containing the sector and 80th percentile of
profits for each sector.
• Alias the percentile column as pct80.
-- To clear table if it already exists; fill in name of temp table
DROP TABLE IF EXISTS profit80;
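The rest of this exercise's solution isn't in these notes; a sketch following the instructions might look like this (title is the company-name column used elsewhere in the course):
-- Step 1 (continued): 80th percentile of profits for each sector
CREATE TEMP TABLE profit80 AS
SELECT sector,
       percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
FROM fortune500
GROUP BY sector;

-- Step 2 (sketch): companies with profits above their sector's 80th percentile
SELECT title, fortune500.sector, profits, pct80
FROM fortune500
INNER JOIN profit80
   ON fortune500.sector = profit80.sector
WHERE profits > pct80;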
Exercise
Create a temp table to simplify a query
Find out how many questions had each tag on the first date for which data for the tag is available,
as well as how many questions had the tag on the last day. Also, compute the difference between
these two values.
Then use the minimum dates to select the question_count on both the first and last day. To do
this, join the temp table startdates to two different copies of the stackoverflow table: one for each
column - first day and last day - aliased with different names.
Instructions 1/2:
• First, create a temporary table called startdates with each tag and the min() date for the
tag in stackoverflow.
• -- To clear table if it already exists
• DROP TABLE IF EXISTS startdates;
•
• -- Create temp table syntax
• CREATE TEMP TABLE startdates AS
• -- Compute the minimum date for each what?
• SELECT tag,
• min(date) AS mindate
• FROM stackoverflow
• -- What do you need to compute the min date for each tag?
• GROUP BY tag;
•
• -- Look at the table you created
• SELECT *
• FROM startdates;
Instructions 2/2:
• Join startdates to stackoverflow twice using different table aliases.
• For each tag, select mindate, question_count on the mindate, and question_count on
2018-09-25 (the max date).
• Compute the change in question_count over time.
• -- To clear table if it already exists
• DROP TABLE IF EXISTS startdates;
•
• CREATE TEMP TABLE startdates AS
• SELECT tag, min(date) AS mindate
• FROM stackoverflow
• GROUP BY tag;
•
• -- Select tag (Remember the table name!) and mindate
• SELECT so_min.tag,
• mindate,
• -- Select question count on the min and max days
• so_min.question_count AS min_date_question_count,
• so_max.question_count AS max_date_question_count,
• -- Compute the change in question_count (max- min)
• so_max.question_count - so_min.question_count AS change
• FROM startdates
• -- Join startdates to stackoverflow with alias so_min
• INNER JOIN stackoverflow AS so_min
• -- What needs to match between tables?
• ON startdates.tag = so_min.tag
• AND startdates.mindate = so_min.date
• -- Join to stackoverflow again with alias so_max
• INNER JOIN stackoverflow AS so_max
• -- Again, what needs to match between tables?
• ON so_min.tag = so_max.tag
• AND so_max.date = '2018-09-25';
Exercise
Insert into a temp table
While you can join the results of multiple similar queries together with UNION, sometimes it's
easier to break a query down into steps. You can do this by creating a temporary table and
inserting rows into it.
Compute the correlations between each pair of profits, profits_change, and revenues_change
from the Fortune 500 data.
                 profits   profits_change   revenues_change
profits          1.00      #                #
profits_change   #         1.00             #
revenues_change  #         #                1.00
round(column_name::numeric, decimal_places)
Note that Steps 1 and 2 do not produce output. It is normal for the query result pane to say "Your
query did not generate any results."
Instructions 1/3:
Create a temp table correlations.
• Compute the correlation between profits and each of the three variables (i.e. correlate
profits with profits, profits with profits_change, etc).
• Alias columns by the name of the variable for which the correlation with profits is being
computed.
Instructions 2/3:
• Insert rows into the correlations table for profits_change and revenues_change.
• DROP TABLE IF EXISTS correlations;
•
• CREATE TEMP TABLE correlations AS
• SELECT 'profits'::varchar AS measure,
• corr(profits, profits) AS profits,
• corr(profits, profits_change) AS profits_change,
• corr(profits, revenues_change) AS revenues_change
• FROM fortune500;
•
• -- Add a row for profits_change
• -- Insert into what table?
• INSERT INTO correlations
• -- Follow the pattern of the select statement above, using profits_change instead of profits
• SELECT 'profits_change'::varchar AS measure,
• corr(profits_change,profits) AS profits,
• corr(profits_change,profits_change) AS profits_change,
• corr(profits_change,revenues_change) AS revenues_change
• FROM fortune500;
•
• -- Repeat the above, but for revenues_change
• INSERT INTO correlations
• SELECT 'revenues_change'::varchar AS measure,
• corr(revenues_change,profits) AS profits,
• corr(revenues_change,profits_change) AS profits_change,
• corr(revenues_change,revenues_change) AS revenues_change
• from fortune500;
Instructions 3/3:
• Select all rows and columns from the correlations table to view the correlation matrix.
• First, you will need to round each correlation to 2 decimal places.
• The output of corr() is of type double precision, so you will need to also cast columns to
numeric.
• DROP TABLE IF EXISTS correlations;
•
• CREATE TEMP TABLE correlations AS
• SELECT 'profits'::varchar AS measure,
• corr(profits, profits) AS profits,
• corr(profits, profits_change) AS profits_change,
• corr(profits, revenues_change) AS revenues_change
• FROM fortune500;
•
• INSERT INTO correlations
• SELECT 'profits_change'::varchar AS measure,
• corr(profits_change, profits) AS profits,
• corr(profits_change, profits_change) AS profits_change,
• corr(profits_change, revenues_change) AS revenues_change
• FROM fortune500;
•
• INSERT INTO correlations
• SELECT 'revenues_change'::varchar AS measure,
• corr(revenues_change, profits) AS profits,
• corr(revenues_change, profits_change) AS profits_change,
• corr(revenues_change, revenues_change) AS revenues_change
• FROM fortune500;
•
• -- Select each column, rounding the correlations
• SELECT measure,
• ROUND(profits::numeric,2) AS profits,
• ROUND(profits_change::numeric,2) AS profits_change,
• ROUND(revenues_change::numeric,2) AS revenues_change
• FROM correlations;
8. Character data types and common issues
The next type of data we’ll be exploring is character or text data.
Character(n) or char(n)
• Fixed length n
• Trailing spaces ignored in comparisons
Varchar(n)
• Up to a maximum length of n characters
Text or varchar
• Unlimited length
There are three types of character columns to store strings of text: character (which can be
shortened to char), character varying (which can be shortened to varchar), and text. They differ
in the length of the string of text they store. The length of a string is defined as the number of
characters in it. Character columns store a fixed length string; spaces are added to the end of
shorter strings to make up any difference in length. Spaces at the end of char fields are ignored
when comparing values. Varchar columns can optionally specify a maximum string length; they
allow strings of any size up to the specified maximum. Text, or varchar columns without a
maximum length specified, can store strings of unlimited length.
8.2. Types of text data
Categorical: short, repeated values such as days of the week or product categories
Unstructured text: longer, unique values, for example:
• I really like this product. I use it every day. It's my favorite color.
• We've redesigned your favorite t-shirt to make it even better. You'll love…
• Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal…
Regardless of the formal column type, for analysis, we want to distinguish between two types
of text data: categorical variables and unstructured text. Categorical variables are short strings of
text with values that are repeated across multiple rows. They take on a finite and manageable set
of distinct values. Days of the week, product categories, and multiple-choice survey question
responses are all examples of categorical variables. Unstructured text consists of longer strings of
unique values, such as answers to open-ended survey questions or product reviews. To analyze
unstructured text, we can create new variables that extract features from the text or indicate
whether the text has particular characteristics. For example, we could create binary indicator
variables that denote whether the text contains keywords of particular interest.
For now, we'll focus on categorical variables. The first things to check with categorical
variables are the set of distinct categories and the number of observations, or rows, for each
category. We do this with GROUP BY and count. Without ordering the results, it's hard to tell
which categories are commonly used and whether any categories should be grouped together.
8.4. Order: most frequent values
Ordering by the count of each value helps us see the most, and least, frequent categories.
It's good to check whether categories with only a few observations have errors - such as spelling,
capitalization, or spacing mistakes.
8.4.1. Order: category value
It's also a good idea to try ordering the results by the category. Doing so can help us
identify possible duplicates and other errors in the data. Does the order of the categories in the
results match what you were expecting?
8.4.2. Alphabetical order
Character types are sorted in alphabetical order. Spaces come before letters, and
uppercase letters come before lowercase letters. Looking at the first character of each category
shows that the results are in alphabetical order.
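A sketch of these grouping and ordering checks; the table and column names are placeholders, since the slides' example table isn't named here:
-- Count rows for each category, most frequent values first
SELECT category, count(*)
FROM some_table
GROUP BY category
ORDER BY count(*) DESC;

-- Order by the category itself to spot near-duplicate values
SELECT category, count(*)
FROM some_table
GROUP BY category
ORDER BY category;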
8.5. Common issues
Case matters
Spaces count
Punctuation differences
So what are you looking for when grouping and counting values? Common inconsistencies
and issues with character data include: Differences in case: for example, when there are both
lower and upper case versions of the same value. White space differences, such as when values
only differ in the number or placement of spaces. One exception here is that when comparing
values of type char, trailing spaces are ignored. An empty string, which is a string of length zero,
is not the same as a string of all spaces. An empty string is also not the same as null. These are
distinct values. And finally punctuation differences. Punctuation differences can sometimes be
subtle. For example, there are multiple types of hyphens and dashes that look similar but are
different characters.
Exercise
Count the categories
In this chapter, we'll be working mostly with the Evanston 311 data in table evanston311. This is
data on help requests submitted to the city of Evanston, IL.
This data has several character columns. Start by examining the most frequent values in some of
these columns to get familiar with the common categories.
Instructions 1/4:
• How many rows does each priority level have?
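A possible solution for this step (not shown in these notes):
-- Count rows for each priority level
SELECT priority, count(*)
FROM evanston311
GROUP BY priority;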
Instructions 3/4:
• How many distinct values of source appear in at least 100 rows?
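A possible solution for this step (not shown in these notes), using HAVING to filter on the group counts:
-- Values of source appearing in at least 100 rows
SELECT source, count(*)
FROM evanston311
GROUP BY source
HAVING count(*) >= 100;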
Instructions 4/4:
• Find the five most common values of street and the count of each.
• -- Find the 5 most common values of street and the count of each
• SELECT street, COUNT(*)
• FROM evanston311
• GROUP BY street
• ORDER BY COUNT(*) DESC
• LIMIT 5;
9. Cases and spaces
Two of the most common inconsistencies in text data are differences in the case of
characters and in the spaces in a string. We can deal with these issues by using functions to
change character case or remove spaces and by querying data with the LIKE operator.
9.1. Converting case
First, one of the easiest ways to handle inconsistencies in case is to convert character data
to either be all upper or all lower case. The upper and lower functions do just that. The functions
have no effect on punctuation or numbers.
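For example:
SELECT upper('Hello World! 2'),   -- HELLO WORLD! 2
       lower('Hello World! 2');   -- hello world! 2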
9.2. Case insensitive comparisons
You can use the lower or upper function to make comparisons case insensitive. For
example, the fruit data here has 8 entries corresponding to apple, but there are 6 different ways
the data is entered. To select rows from the fruit table with the value apple - regardless of case -
we can convert all fav_fruit values to lower case with the lower function. Then select rows where
the result of the function is equal to 'apple', all lower case. Note that while we got both upper and
lower case versions of apple in our 5 results, we are still missing 3 values with spaces at the
beginning or end of the word apple, or with the plural apples instead of apple.
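A sketch of the query described above, assuming the slide's table and column names (fruit and fav_fruit):
-- Case-insensitive match on apple
SELECT *
FROM fruit
WHERE lower(fav_fruit) = 'apple';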
9.3. Case insensitive searches
The LIKE operator can help us match values of apple that might have extra spaces or s-es
at the end. By using a LIKE pattern with a percentage sign before and after apple, we match
fav_fruit entries where apple is anywhere in the string. Remember that with LIKE, percentage
matches any number of characters, including 0, while an underscore matches exactly one
character. Now we have values of apple with spaces and s-es, but only lower case. To make this
query case insensitive, we can use ILIKE instead of LIKE. The I stands for insensitive. ILIKE
queries take longer to run than LIKE queries, so only use them when you need to. Using ILIKE
we also select variations of apple with upper case characters. All 8 variations of apple are now in
the result.
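Sketches of the LIKE and ILIKE patterns described above, with the same assumed names:
-- Case-sensitive: lower case values containing apple anywhere in the string
SELECT *
FROM fruit
WHERE fav_fruit LIKE '%apple%';

-- Case-insensitive version of the same pattern
SELECT *
FROM fruit
WHERE fav_fruit ILIKE '%apple%';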
9.3.1. Watch out!
Remember though that LIKE searches can match more than you may intend. Our query
to select apple values would also select pineapple!
9.4. Trimming spaces
While the trim functions remove spaces by default, you can specify other characters that
should be removed instead. You can remove a single character, such as an exclamation point, or
a set of characters, all together in a single string. The trim functions are case sensitive, so in the
second example, we include both an upper and lower case W.
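A sketch of trimming other characters; the input string 'Wow!' is an assumption, since the slide's exact value isn't in these notes:
SELECT trim('  apple  '),     -- removes spaces from both ends by default: 'apple'
       trim('Wow!', '!'),     -- removes a single character: 'Wow'
       trim('Wow!', '!wW');   -- removes a set of characters (case sensitive): 'o'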
9.6. Combining functions
Instead of specifying both lower and upper case versions of the same letter, we can combine
functions. Remember that we can nest the call to one function inside another function. The inner
function is executed first, then the result is sent to the outer function. Here, we first convert all of
the characters to lower case with the lower function, then we use the trim function to remove
exclamation points and lower case w's.
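A sketch of the nested call described above:
-- lower runs first, so only the lower case w needs to be listed for trim
SELECT trim(lower('Wow!'), '!w');   -- 'o'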
Exercise
Trimming
Some of the street values in evanston311 include house numbers with # or / in them. In addition,
some street values end in a ..
Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the
street values as a first attempt at cleaning up the values.
Instructions:
• Trim digits 0-9, #, /, ., and spaces from the beginning and end of street.
• Select distinct original street value and the corrected street value.
• Order the results by the original street value.
• SELECT distinct street,
• -- Trim off unwanted characters from street
• trim(street, '0123456789 #/.') AS cleaned_street
• FROM evanston311
• ORDER BY street;
Exercise
Exploring unstructured text
The description column of evanston311 has the details of the inquiry, while the category column
groups inquiries into different types. How well does the category capture what's in the
description?
LIKE and ILIKE queries will help you find relevant descriptions and categories. Remember that
with LIKE queries, you can include a % on each side of a word to find values that contain the
word. For example: description ILIKE '%word%'.
Instructions 1/4:
• Count rows in evanston311 where the description contains 'trash' or 'garbage', regardless of case.
-- Count rows
SELECT COUNT(*)
FROM evanston311
-- Where description includes trash or garbage
WHERE description ILIKE '%trash%'
OR description ILIKE '%garbage%';
Instructions 2/4:
• category values are in title case. Use LIKE to find category values with 'Trash' or
'Garbage' in them.
-- Select categories containing Trash or Garbage
SELECT category
FROM evanston311
-- Use LIKE
WHERE category LIKE '%Trash%'
OR category LIKE '%Garbage%';
Instructions 3/4:
• Count rows where the description includes 'trash' or 'garbage' but the category does not.
-- Count rows
SELECT COUNT(*)
FROM evanston311
-- description contains trash or garbage (any case)
WHERE (description ILIKE '%trash%'
OR description ILIKE '%Garbage%')
-- category does not contain Trash or Garbage
AND category NOT LIKE '%Trash%'
AND category NOT LIKE '%Garbage%';
Instructions 4/4:
• Find the most common categories for rows with a description about trash that don't have
a trash-related category
• -- Count rows with each category
• SELECT category, COUNT(*)
• FROM evanston311
• WHERE (description ILIKE '%trash%'
• OR description ILIKE '%garbage%')
• AND category NOT LIKE '%Trash%'
• AND category NOT LIKE '%Garbage%'
• -- What are you counting?
• GROUP BY category
• ORDER BY count DESC
• LIMIT 10;
10. Splitting and concatenating text
When working with text values, you often need to break strings apart into multiple
pieces, extract part of a string to a new variable, or join, or concatenate, strings together. There
are functions to help us with these operations.
10.1. Substring
First, how do we extract just part of a string? The left and right functions take as arguments a
string, or the name of a column of strings, and the number of characters to keep. Left keeps characters
starting at the left, while right keeps characters counting from the end. Here, the first two characters in
the string abcde are a and b, while the last two characters are d and e. If the string contains fewer than the
requested number of characters, only the available characters are returned.
To extract characters from the middle of a string, use the substring function. The function takes
a string or column to operate on, and then the keyword FROM. Next comes the index of the character to
start with, counting from 1. Then the keyword FOR followed by the number of characters to include in
the substring. For example, if we take the substring of abcdef starting from position 2 and going for 3
characters, we get bcd. B was the second character in the string, and the function extracted 3
characters. You may also see an abbreviated version of substring with a shortened function name and
comma-separated arguments. It works the same way. The left, right, and substring functions can be
useful in situations such as extracting a snippet from a long unstructured text field, displaying just the
first or last few digits of an account number, or limiting a zip code to only the first 5 digits.
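For example:
SELECT left('abcde', 2),                    -- 'ab'
       right('abcde', 2),                   -- 'de'
       substring('abcdef' FROM 2 FOR 3),    -- 'bcd'
       substr('abcdef', 2, 3);              -- abbreviated form, also 'bcd'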
10.2. Delimiters
Fields/chunks:
1. Some text
2. More text
3. Still more text
The second string operation to know is how to split a string into parts based on a
delimiter. A delimiter is a character, such as a comma, or a string that separates fields or
chunks of text.
The function split_part takes a string, the delimiter to split the string on, and the number
position of the part of the split string to return, counting from one. For example, if we split the
string a-comma- bc-comma-d with a comma as the delimiter, the string would be split into 3
parts: a, bc, and d. If we ask for the second part, we get bc. Note that the delimiter is not included
in the returned value.
The delimiter can be a single character or a string of multiple characters. For example, if
we split the string "cats and dogs and fish" on "and" surrounded by spaces, the first group is cats.
Note that the string was split on the delimiter exactly as it appears, not on the set of characters
included in the delimiter. It is common to split strings on a delimiter value when multiple pieces
of information have been stored together in a single column.
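For example:
SELECT split_part('a,bc,d', ',', 2),                        -- 'bc'
       split_part('cats and dogs and fish', ' and ', 1);    -- 'cats'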
10.3. Concatenating text
The third string operation is concatenation. The concat function takes any number of arguments. It joins
the text representation of all of the values together in a single string. You can concatenate both
character types and non-character types. Values can also be concatenated with a double pipe, which
looks like two vertical bars. This operator is the SQL standard for string concatenation. It works the
same as the concat function except when null values are included. The concat function omits null
values, while the double pipe will return null if any component is null. One example of when you
might concatenate strings is to join a first name and last name stored in separate columns to get a
person's full name.
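For example, note how the two forms treat NULL differently:
SELECT concat('Ada', ' ', 'Lovelace'),    -- 'Ada Lovelace'
       'Ada' || ' ' || 'Lovelace',        -- 'Ada Lovelace'
       concat('Ada', NULL, 'Lovelace'),   -- 'AdaLovelace' (NULL is omitted)
       'Ada' || NULL || 'Lovelace';       -- NULL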
Exercise
Concatenate strings
House number (house_num) and street are in two separate columns in evanston311. Concatenate
them together with concat() with a space in between the values.
Instructions:
• Concatenate house_num, a space ' ', and street into a single value using the concat().
• Use a trim function to remove any spaces from the start of the concatenated value.
-- Concatenate house_num, a space, and street and trim spaces from the start of the result
SELECT ltrim(CONCAT(house_num,' ', street)) AS address
FROM evanston311;
Exercise
Split strings on a delimiter
The street suffix is the part of the street name that gives the type of street, such as Avenue, Road,
or Street. In the Evanston 311 data, sometimes the street suffix is the full word, while other times
it is the abbreviation.
Extract just the first word of each street value to find the most common streets regardless of the
suffix.
To do this, use the split_part() function with a space as the delimiter.
Exercise
Shorten long strings
For displaying or quickly reviewing the data, you might want to only display the first few
characters. You can use the left() function to get a specified number of characters at the start of
each value.
To indicate that more data is available, concatenate '...' to the end of any shortened description.
To do this, you can use a CASE WHEN statement to add '...' only when the string length is
greater than 50.
Select the first 50 characters of description when description starts with the word "I".
Instructions:
• Select the first 50 characters of description with '...' concatenated on the end where the
length() of the description is greater than 50 characters. Otherwise just select the
description as is.
• Select only descriptions that begin with the word 'I' and not the letter 'I'.
o For example, you would want to select "I like using SQL!", but would not want to
select "In this course we use SQL!".
-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description) > 50
            THEN left(description, 50) || '...'
       -- otherwise just select description
       ELSE description
       END
  FROM evanston311
 -- limit to descriptions that start with the word I
 WHERE description LIKE 'I %'
 ORDER BY description;
11. Strategies for multiple transformations
You've learned several ways to transform character data. But what do you do when you need to
use different transformations on different observations?
11.1. Multiple transformations
Here's an example of data where different delimiter characters were used to separate the
major industry categories of Agriculture and Education, from the industry subcategories.
Sometimes there's a colon, other times there's a pipe character or a dash. We can use the
split_part function to separate the category into its two parts, but how can we apply different
delimiters to different rows?
11.2. CASE WHEN
One option when you need to apply multiple transformations to subsets of the data that don't
overlap is to use a CASE WHEN statement. We want to extract just the initial major category
from the category column - the part before a delimiter. To do that, we have a case for each
different delimiter: a colon followed by space, a dash surrounded by spaces, and a pipe
surrounded by spaces. The last case here goes in the else clause. We use LIKE statements to
select the rows with each type of delimiter, then apply the split_part function with the delimiter
for those rows. We alias the result of the CASE WHEN statement as major_category. We can
then use the major_category we extracted to group and aggregate the data. This allows us to get
the number of businesses in each of the two major categories.
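A minimal sketch of this pattern (the table name naics and the businesses column are assumptions for illustration; the example data itself isn't shown here):
SELECT CASE WHEN category LIKE '%: %' THEN split_part(category, ': ', 1)
            WHEN category LIKE '% - %' THEN split_part(category, ' - ', 1)
            -- remaining rows use a pipe surrounded by spaces
            ELSE split_part(category, ' | ', 1)
       END AS major_category,
       sum(businesses)
  FROM naics
 GROUP BY major_category;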
11.3. Recoding table
When there are many messy values to clean up, another option is to build a recoding table that
maps each original value to a standardized one. As an example, imagine a fav_fruit survey
column whose values differ in case, leading and trailing spaces, spelling, and trailing s's.
The first step is to create a temporary table with two columns: original, containing the
distinct values of fav_fruit and standardized, which will eventually contain the recoded values.
We initially populate the standardized column with the original values.
11.4.1. Initial table
Here, we need three update statements. In the first, we set the standardized value to be the
lower case version of the original value, with spaces trimmed from both ends. In the second, we
set the standardized value to banana only for rows that contained a double n. The third statement
updates the standardized value by removing s's from the end with the trim function.
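A sketch of these steps, assuming a hypothetical fruit table with a fav_fruit column:
-- Create the recode table from the distinct original values
CREATE TEMP TABLE recode AS
SELECT DISTINCT fav_fruit AS original,
       fav_fruit AS standardized
  FROM fruit;

-- 1) lower case the values and trim spaces from both ends
UPDATE recode SET standardized = trim(lower(original));
-- 2) rows containing a double n become 'banana'
UPDATE recode SET standardized = 'banana' WHERE standardized LIKE '%nn%';
-- 3) remove s from the end of values
UPDATE recode SET standardized = trim(trailing 's' from standardized);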
11.5.1. Resulting recode table
Exercise
Group and recode values
There are almost 150 distinct values of evanston311.category. But some of these categories are
similar, with the form "Main Category - Details". We can get a better sense of what requests are
common if we aggregate by the main category.
To do this, create a temporary table recode mapping distinct category values to new,
standardized values. Make the standardized values the part of the category before a dash ('-').
Extract this value with the split_part() function:
You'll also need to do some additional cleanup of a few cases that don't fit this pattern.
Then the evanston311 table can be joined to recode to group requests by the new standardized
category values.
Instructions 1/4:
• Create recode with a standardized column; use split_part() and then rtrim() to remove any
remaining whitespace on the result of split_part().
-- Fill in the command below with the name of the temp table
DROP TABLE IF EXISTS recode;
Exercise
Create a table with indicator variables
• Emails contain an @.
• Phone numbers have the pattern of three characters, dash, three characters, dash, four
characters. For example: 555-555-1212.
Use LIKE to match these patterns. Remember % matches any number of characters (even 0), and
_ matches a single character. Enclosing a pattern in % (i.e. before and after your pattern) allows
you to locate it within other text.
For example, '%___.com%' would allow you to search for a reference to a website with the top-
level domain '.com' and at least three characters preceding it.
Create and store indicator variables for email and phone in a temporary table. LIKE produces
True or False as a result, but casting a boolean (True or False) as an integer converts True to 1
and False to 0. This makes the values easier to summarize later.
Instructions 1/2:
• Create a temp table indicators from evanston311 with three columns: id, email, and
phone.
• Use LIKE comparisons to detect the email and phone patterns that are in the
description, and cast the result as an integer with CAST().
o Your phone indicator should use a combination of underscores _ and dashes - to
represent a standard 10-digit phone number format.
o Remember to start and end your patterns with % so that you can locate the pattern
within other text!
-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;
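A possible completion of this solution, following the patterns described in the instructions:
-- Create the indicators temp table with id, email, and phone columns
CREATE TEMP TABLE indicators AS
SELECT id,
       CAST(description LIKE '%@%' AS integer) AS email,
       CAST(description LIKE '%___-___-____%' AS integer) AS phone
  FROM evanston311;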
12. Date/time types and formats
The last type of data we're exploring is date/time data. As the name suggests, date/time refers to
columns that store dates and/or times.
Date
• YYYY-MM-DD
• Example: 2018-12-30
Timestamp
• YYYY-MM-DD HH:MM:SS
• Example: 2018-12-30 13:10:04.3
There are two main types: date and timestamp. Dates only include year, month, and day.
Timestamps include a date plus a time. Times are specified in terms of hours from 0 to 24, minutes, and
seconds. Seconds can be fractional down to microseconds.
12.2. Intervals
Interval examples:
There is also a third date/time type you should know: an interval. Intervals represent time
durations. For example, 6 days, 1 hour, 48 minutes, and 8 seconds, or 51 minutes and 3 seconds.
Columns can be of type interval, but it's more common to encounter intervals as a result of
subtracting one date or timestamp from another. Intervals will default to display the number of
days, if any, and the time.
12.3. Date/time format examples
• 01/10/18 1:00
• 10/01/18 01:00:00
• 01/10/2018 1pm
• January 10th, 2018 1pm
• 10 Jan 2018 1:00
• 01/10/18 01:00:00
• 01/10/18 13:00:00
Date/time data can be difficult to work with because people record dates in many different
formats. Consider some of the different ways people might write 1pm on January 10th, 2018.
They might write the date with either the month or day first. They can use two digits for the year
or four. They could spell out the month name, abbreviate it, or use numbers. They might specify
the time using a 12 hour clock or 24 hour clock.
12.4. ISO 8601
YYYY-MM-DD HH:MM:SS
Example: 2018-01-05 09:35:15
12.5. Timezones
YYYY-MM-DD HH:MM:SS+HH
Example: 2018-01-05 09:35:15+02
Timezones are another way datetime information can get complicated. Postgres stores
timestamps according to UTC, or Coordinated Universal Time. Timezones are defined in terms
of their offset from UTC. Timestamps in Postgres can include timezone information or not.
When timezones are included, they appear at the end with a plus or minus, followed by the
number of hours the timezone is offset from UTC. The example timestamp here is 2 hours ahead
of UTC.
12.6. Date and time comparisons
So how do we work with dates and timestamps? Date/time entries can be compared to each
other as numbers can: with greater than, less than, and equals signs. You can get the current timestamp
with the now function. This can be useful when comparing values to the current date and time. Note
how dates in these examples are specified in ISO 8601 format. They are surrounded by single quotes
like character data.
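For example:
-- Date/time values compare like numbers; now() returns the current timestamp
SELECT '2018-01-01'::date > '2017-12-31'::date;  -- true
SELECT now() > '2018-01-01';                     -- true: the literal is cast to a timestamp
SELECT now();                                    -- the current date and time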
12.7. Date subtraction
In addition to comparing dates, you can also subtract them from each other. The result is of type
interval.
12.8. Date addition
You can also add time to or subtract time from existing dates. Adding an integer value to a
date will add days. Adding an integer to a timestamp, however, will cause an error. Other
amounts of time, from years to seconds, can be added with intervals. You specify an interval
with a combination of numbers and words inside single quotes, then cast this as an interval. For
example, you can add an interval of one year. Or, you can specify the interval in terms of
multiple units, such as 1 year, 2 days, and 3 minutes.
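For example:
-- Subtracting timestamps produces an interval
SELECT now() - '2018-01-01';
SELECT '2018-01-31'::timestamp - '2018-01-01'::timestamp;         -- 30 days

-- Adding an integer to a date adds days; other amounts of time use intervals
SELECT '2018-12-10'::date + 1;                                    -- 2018-12-11
SELECT '2018-12-10'::date + '1 year'::interval;                   -- 2019-12-10 00:00:00
SELECT '2018-12-10'::date + '1 year 2 days 3 minutes'::interval;  -- 2019-12-12 00:03:00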
Exercise
Date comparisons
When working with timestamps, sometimes you want to find all observations on a given
day. However, if you specify only a date in a comparison, you may get unexpected results. This
query:
SELECT count(*)
FROM evanston311
WHERE date_created = '2018-01-02';
returns 0 even though requests were created on that day. This is because dates are automatically converted to timestamps when compared to a timestamp.
The time fields are all set to zero:
SELECT '2018-01-02'::timestamp;
2018-01-02 00:00:00
When working with both timestamps and dates, you'll need to keep this in mind.
Instructions 1/3:
• Count the number of Evanston 311 requests created on January 31, 2017 by casting
date_created to a date.
-- Count requests created on January 31, 2017
SELECT count(*)
FROM evanston311
WHERE date_created::date='2017-01-31';
Instructions 2/3:
• Count the number of Evanston 311 requests created on February 29, 2016 by using
>= and < operators.
-- Count requests created on February 29, 2016
SELECT count(*)
FROM evanston311
WHERE date_created >= '2016-02-29'
AND date_created < '2016-03-01' ;
Instructions 3/3:
Count the number of requests created on March 13, 2017.
Specify the upper bound by adding 1 to the lower bound.
-- Count requests created on March 13, 2017
SELECT count(*)
  FROM evanston311
 WHERE date_created >= '2017-03-13'
   AND date_created < '2017-03-13'::date + 1;
Exercise
Date arithmetic
You can subtract dates or timestamps from each other.
You can add time to dates or timestamps using intervals. An interval is specified with a number
of units and the name of a datetime field. For example:
• '3 days'::interval
• '6 months'::interval
• '1 month 2 years'::interval
• '1 hour 30 minutes'::interval
Practice date arithmetic with the Evanston 311 data and now().
Instructions 1/4:
• Subtract the minimum date_created from the maximum date_created.
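One way to write this (a sketch of the intended query):
-- Subtract the min date_created from the max date_created
SELECT max(date_created) - min(date_created)
  FROM evanston311;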
Instructions 3/4:
• Add 100 days to the current timestamp.
-- Add 100 days to the current timestamp
SELECT now() + '100 days'::interval;
Instructions 4/4
• Select the current timestamp and the current timestamp plus 5 minutes.
Exercise
Completion time by category
Instructions:
• Compute the average difference between the completion timestamp and the creation
timestamp by category.
• Order the results with the largest average time to complete the request first.
-- Select the category and the average completion time by category
SELECT category,
       avg(date_completed - date_created) AS completion_time
  FROM evanston311
 GROUP BY category
 -- Order the results
 ORDER BY completion_time DESC;
13. Date/time components and aggregation
As with numerical and character data, sometimes we need to extract components of a
date/time, or truncate the value, to aggregate the data in a meaningful way.
13.1. Common date/time fields
Functions exist to extract individual components of date/time data. These components are
called fields. The fields are defined in the Postgres documentation. Many are based on the ISO
8601 standard. Let's look at some common fields starting with the largest unit of time. First, we
can get the century or decade that a timestamp belongs in. January 1st, 2019 is in century 21 and
decade 201. Date/time field definitions can be complicated and sometimes counterintuitive. It's
always a good idea to read the documentation before using unfamiliar fields. Next, we can get
the year, month, and day fields that make up a date. We can also get the hour, minute, and
second fields that make up a time. Week is the week number in the year, based on the ISO 8601
definition. D-O-W is day of week. The week starts with Sunday, which has a value of 0, and
ends on Saturday with a value of 6.
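For example:
-- Extract individual fields with date_part() or the equivalent EXTRACT syntax
SELECT date_part('century', now()) AS century,
       date_part('year', now())    AS year,
       date_part('month', now())   AS month,
       date_part('dow', now())     AS day_of_week;  -- 0 = Sunday ... 6 = Saturday
SELECT EXTRACT(MONTH FROM now());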
Individual sales
By month
Extracting fields from dates is useful when looking at how data varies by one unit of time across
a larger unit of time. For example, how do sales vary by month across years? Using sales from
2010- 2016, are sales in January usually higher than those in March?
13.4. Truncating dates
Instead of extracting single fields, you can also truncate dates and timestamps to a specified
level of precision. Remember that dates and timestamps are ordered from left to right, largest units to
smallest. You can use the date_trunc function, which is short for date truncate, to specify how much of
a timestamp to keep, as you might with a numeric value. Valid field types include all of those we
discussed except day of week. Date_trunc replaces fields smaller than, or less significant than, the one
specified with zero, or one, as appropriate. Month and day are set to 1, while time fields are set to 0.
Here, the year and month remain, and the rest of the fields are set to 0 or 1. The timezone remains
unchanged.
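For example:
-- Truncate a timestamp to month precision
SELECT date_trunc('month', '2018-12-10 13:10:04'::timestamp);
-- Result: 2018-12-01 00:00:00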
13.5. Truncate to keep large units
Individual sales
By month with year
Truncating dates is useful when you want to count, average, or sum data associated with
timestamps or dates by larger units of time. For example, starting from individual timestamped
sales transactions, what is the monthly trend in sales from June 2017 to January 2019?
Exercise
Date parts
The date_part() function is useful when you want to aggregate data by a unit of time across
multiple larger units of time. For example, aggregating data by month across different years, or
aggregating by hour across different days.
In this exercise, you'll use date_part() to gain insights about when Evanston 311 requests are
submitted and completed.
Instructions 1/3:
• How many requests are created in each of the 24 months during 2016-2017?
-- Extract the month from date_created and count requests
SELECT date_part('month', date_created) AS month,
       count(*)
  FROM evanston311
 -- Limit the date range
 WHERE date_created >= '2016-01-01'
   AND date_created < '2018-01-01'
 -- Group by month to get monthly counts
 GROUP BY month;
Instructions 2/3:
• What is the most common hour of the day for requests to be created?
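A sketch of one way to answer this:
-- Count requests by hour created; show the most common hour first
SELECT date_part('hour', date_created) AS hour,
       count(*) AS count
  FROM evanston311
 GROUP BY hour
 ORDER BY count DESC
 LIMIT 1;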
Exercise
Variation by day of week
Does the time required to complete a request vary by the day of the week on which the request
was created?
We can get the name of the day of the week by converting a timestamp to character data:
to_char(date_created, 'day')
But character names for the days of the week sort in alphabetical, not chronological, order. To
get the chronological order of days of the week with an integer value for each day, we can use:
EXTRACT(DOW FROM date_created)
Exercise
Date truncation
Recall the syntax for truncating a timestamp:
date_trunc('field', timestamp)
Using date_trunc(), find the average number of Evanston 311 requests created per day for each
month of the data. Ignore days with no requests when taking the average.
Instructions:
• Write a subquery to count the number of requests created per day.
• Select the month and average count per month from the daily_count subquery.
-- Aggregate daily counts by month
SELECT date_trunc('month', day) AS month,
       avg(count)
  -- Subquery to compute daily counts
  FROM (SELECT date_trunc('day', date_created) AS day,
               count(*) AS count
          FROM evanston311
         GROUP BY day) AS daily_count
 GROUP BY month
 ORDER BY month;
14. Aggregating with date/time series
When counting observations by month or day, the result only includes rows for values
that appear in your data. How do you find periods of time with no observations?
14.1. Generate series
Recall the generate_series function, which you used with numeric data. The same
function can be used with date/time data. generate_series expects timestamps for the from and to
arguments. Dates will automatically be cast to a timestamp. The last argument is an interval. For
example, here we have an interval of two days. The result is a series of timestamps between the
start and end values separated by the interval.
Here's an example with an interval of hours. The last value in the series will be less than
or equal to the ending timestamp specified. For example, here the series ends at 8pm on January
1st, because the next value in the series would be greater than the 0th hour of January 2nd.
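For example:
-- Timestamps from Jan 1 to Jan 15, 2018, two days apart
SELECT generate_series('2018-01-01', '2018-01-15', '2 days'::interval);

-- With an interval of hours, the series stops at or before the ending value
SELECT generate_series('2018-04-23 09:00:00',
                       '2018-04-23 14:00:00',
                       '1 hour'::interval);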
14.2. Generate series from the beginning
To get consistent values, generate series using the beginning of a month or year, not the end. For
example, attempting to generate a series for the last day in each month produces unexpected results.
When you add one month to January 31st, you get the last day in February, the 28th, because there is
no 31st. But then 1 month after February 28th is March 28th, not March 31st.
To correctly generate a series for the last day of each month, generate a series using the
beginning of each month, then subtract 1 day from the result.
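For example:
-- Last day of each month: generate the first of the following months, then subtract 1 day
SELECT generate_series('2018-02-01',
                       '2019-01-01',
                       '1 month'::interval) - '1 day'::interval AS last_day_of_month;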
Normal aggregation
Series can also be used to find units of time with no observations. For example, you
might want to count sales by the hour of the day they occurred. Here's some sample sales data in
its original form. Then with the number of sales counted by hour. Looking at the counts, it's hard
to tell at a glance that there were no sales in the 11 o'clock hour.
14.3. Aggregation with series
To include hours with no sales, generate a series of hours, and then join this to the original
data to introduce rows for the missing hours. First, use a WITH clause to create the series of
hours from 9am to 2pm and call this hour_series. Then, join this to the sales data, matching the
hour from the series to the sales date truncated to the hour. Count the date column, instead of
counting the rows, because we don't want to count null values. Group and order by hours to get
the count of sales per hour.
The result now includes all hours between 9am and 2pm, with zeros for hours with no sales.
We're less likely to overlook that some hours have no sales.
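A sketch of the approach, assuming a hypothetical sales table with a timestamp column named date:
WITH hour_series AS (
     SELECT generate_series('2018-04-23 09:00:00',
                            '2018-04-23 14:00:00',
                            '1 hour'::interval) AS hours)
-- Count date, not *, so hours with no sales get a count of 0
SELECT hours, count(date)
  FROM hour_series
       LEFT JOIN sales
       ON hours = date_trunc('hour', date)
 GROUP BY hours
 ORDER BY hours;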
14.4. Aggregation with bins
If you want to aggregate data by an interval that is not equal to one unit of a date/time field,
you can create bins. Recall this strategy from working with numeric data. Let's count sales in 3
hour intervals during the day. First, create two series, one for the lower bound of each bin and
one for the upper. The series for the upper bound starts and ends 3 hours after the lower bound.
This is the amount of the interval. We alias the result as bins. Then, join bins to the sales data,
where the sales date is greater than or equal to the lower bin and less than the upper bin. Then
group and order by the bin bounds.
The result is the count of sales made during each of the three hour intervals.
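A sketch of the same idea with 3-hour bins, again assuming the hypothetical sales table:
WITH bins AS (
     SELECT generate_series('2018-04-23 09:00:00',
                            '2018-04-23 15:00:00',
                            '3 hours'::interval) AS lower,
            generate_series('2018-04-23 12:00:00',
                            '2018-04-23 18:00:00',
                            '3 hours'::interval) AS upper)
SELECT lower, upper, count(date)
  FROM bins
       LEFT JOIN sales
       ON date >= lower
          AND date < upper
 GROUP BY lower, upper
 ORDER BY lower;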
Exercise
Find missing dates
The generate_series() function can be useful for identifying missing dates.
Recall:
generate_series(from, to, interval)
where from and to are dates or timestamps, and interval can be specified as a string with a
number and a unit of time, such as '1 month'.
Are there any days in the Evanston 311 data where no requests were created?
Instructions:
• Write a subquery using generate_series() to get all dates between the min() and max()
date_created in evanston311.
• Write another subquery to select all values of date_created as dates from
evanston311.
• Both subqueries should produce values of type date (look for the ::).
• Select dates (day) from the first subquery that are NOT IN the results of the second
subquery. This gives you days that are not in date_created.
SELECT day
-- 1) Subquery to generate all dates
-- from min to max date_created
  FROM (SELECT generate_series(min(date_created),
                               max(date_created),
                               '1 day')::date AS day
          -- What table is date_created in?
          FROM evanston311) AS all_dates
 -- 4) Select dates (day from above) that are NOT IN the subquery
 WHERE day NOT IN
       -- 2) Subquery to select all date_created values as dates
       (SELECT date_created::date
          FROM evanston311);
Exercise
Custom aggregation periods
Find the median number of Evanston 311 requests per day in each six month period from 2016-
01- 01 to 2018-06-30. Build the query following the three steps below.
Recall that to aggregate data by non-standard date/time intervals, such as six months, you can
use generate_series() to create bins with lower and upper bounds of time, and then summarize
observations that fall in each bin.
Remember: you can access the slides with an example of this type of query using the PDF icon
link in the upper right corner of the screen.
Instructions 1/3:
• Use generate_series() to create bins of 6 month intervals. Recall that the upper bin
values are exclusive, so the values need to be one day greater than the last day to be
included in the bin.
• Notice how in the sample code, the first bin value of the upper bound is July 1st, and
not June 30th.
• Use the same approach when creating the last bin values of the lower and upper
bounds (i.e. for 2018).
-- Generate 6 month bins covering 2016-01-01 to 2018-06-30
Instructions 2/3:
• Count the number of requests created per day, including days with no requests. Note that
because this step does not generate bins, you can use June 30th as your series end date.
Instructions 3/3:
• Assign each daily count to a single 6 month bin by joining bins to daily_counts.
• Compute the median value per bin using percentile_disc().
-- Bins from Step 1
WITH bins AS (
     SELECT generate_series('2016-01-01',
                            '2018-01-01',
                            '6 months'::interval) AS lower,
            generate_series('2016-07-01',
                            '2018-07-01',
                            '6 months'::interval) AS upper),
-- Daily counts from Step 2
daily_counts AS (
     SELECT day, count(date_created) AS count
       FROM (SELECT generate_series('2016-01-01',
                                    '2018-06-30',
                                    '1 day'::interval)::date AS day) AS daily_series
            LEFT JOIN evanston311
            ON day = date_created::date
      GROUP BY day)
-- Select bin bounds
SELECT lower,
       upper,
       -- Compute median of count for each bin
       percentile_disc(0.5) WITHIN GROUP (ORDER BY count) AS median
-- Join bins and daily_counts
  FROM bins
       LEFT JOIN daily_counts
       -- Where the day is between the bin bounds
       ON day >= lower
          AND day < upper
 -- Group by bin bounds
 GROUP BY lower, upper
 ORDER BY lower;
Exercise
Monthly average with missing dates
Find the average number of Evanston 311 requests created per day for each month of the data.
Instructions:
• Generate a series of dates from 2016-01-01 to 2018-06-30.
• Join the series to a subquery to count the number of requests created per day.
• Use date_trunc() to get months from date, which has all dates, NOT day.
• Use coalesce() to replace NULL count values with 0. Compute the average of this
value.
-- Generate a series with all days from 2016-01-01 to 2018-06-30
WITH all_days AS
     (SELECT generate_series('2016-01-01',
                             '2018-06-30',
                             '1 day'::interval) AS date),
     -- Subquery to compute daily counts
     daily_count AS
     (SELECT date_trunc('day', date_created) AS day,
             count(*) AS count
        FROM evanston311
       GROUP BY day)
-- Aggregate daily counts by month using date_trunc
SELECT date_trunc('month', date) AS month,
       -- Use coalesce to replace NULL count values with 0
       avg(coalesce(count, 0)) AS average
  FROM all_days
       LEFT JOIN daily_count
       -- Joining condition
       ON all_days.date = daily_count.day
 GROUP BY month
 ORDER BY month;
15. Time between events
You know how to subtract one date from another. But how do you find out how much
time has passed between events, when the dates or timestamps are all saved in the same column?
15.1. The problem
For example, here is data from a sales table with a timestamp for each sale. Our question
is: how much time passes on average between each sale?
15.2. Lead and lag
The lead and lag functions let us offset the ordered values in a column by 1 row by default.
Then we can subtract the original values from the lead or lag of the values to get the difference
between events. Before we talk about the syntax, let's look at the results of the function calls.
The lag function pushes all of the values down one row. NULL is inserted at the beginning of the
lag column, so that the first sales time, at 9:07, is now the second value. The last sales time is
discarded. The lead function does the opposite, pulling all values up one row. The second sales
time of 9:13 becomes the first value, and NULL is added at the end of the lead column. The first
sales time is discarded.
Okay, back to the syntax. For the lead and lag functions to work, you have to specify how
rows should be ordered. Remember that the rows in a database table have no inherent order to
them - they are only ordered when you explicitly specify an order. Lead and lag are window
functions. You start with the function name, and supply the column you want to apply the lead or
lag to as the argument. You then add an "over" clause with the keyword OVER and an "order
by" statement specifying how the rows should be ordered. The "order by" statement goes in
parentheses. In this example, we have ordered by the same column that we want to lead and lag:
date, but this isn't a requirement. We'll see an example where these columns are different shortly.
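A sketch of the syntax, assuming a hypothetical sales table with a timestamp column named date:
SELECT date,
       lag(date)  OVER (ORDER BY date) AS previous,
       lead(date) OVER (ORDER BY date) AS next
  FROM sales;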
15.3. Time between events
But first, how do we use lead and lag to compute the time between sales? If we order
dates from oldest to newest, we want to subtract the lagged date from the current date to compute
the gap between each sale and the previous sale. We have one less gap value than the number of
sales.
15.4. Average time between events
To compute the average gap, we need to use a subquery. We cannot simply wrap the
average function around the difference between sales because window functions can't be used
inside aggregation functions like average. The average time between sales is 32 minutes and 15
seconds.
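A sketch of that subquery approach, using the same hypothetical sales table:
-- Average the gaps computed in a subquery; window functions can't be nested inside avg()
SELECT avg(gap)
  FROM (SELECT date - lag(date) OVER (ORDER BY date) AS gap
          FROM sales) AS gaps;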
15.5. Change in a time series
The lead and lag functions are not limited to date/time data. As mentioned before, you can
order the rows of the table by one column while getting the lead or lag of a different column. We
often want to do this to compute changes in a time series. A time series is any variable that has a
date or time associated with each value. Here, we want to see not how much time passes between
each sale, but how the amount sold changes from one sale to the next. We can't compute a
change for the first value in a time series because there is no previous value. Looking at the
change column, for the second sale, the amount was 19 less than the first sale. The last sale was
35 more than the previous sale.
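A sketch, assuming the hypothetical sales table also has an amount column:
-- Order rows by date, but lag the amount column to get the change from the previous sale
SELECT date,
       amount,
       amount - lag(amount) OVER (ORDER BY date) AS change
  FROM sales;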
Exercise
Longest gap
What is the longest time between Evanston 311 requests being submitted?
Instructions:
• Select date_created and the date_created of the previous request using lead() or lag()
as appropriate.
• Compute the gap between each request and the previous request.
• Select the row with the maximum gap.
-- Compute the gaps
WITH request_gaps AS (
     SELECT date_created,
            -- lead or lag
            lag(date_created) OVER (ORDER BY date_created) AS previous,
            -- compute gap as date_created minus lead or lag
            date_created - lag(date_created) OVER (ORDER BY date_created) AS gap
       FROM evanston311)
-- Select the row with the maximum gap
SELECT *
  FROM request_gaps
 -- Subquery to select maximum gap from request_gaps
 WHERE gap = (SELECT max(gap)
                FROM request_gaps);
Exercise
Rats!
Investigate in 4 steps:
1. Why is the average so high? Check the distribution of completion times. Hint: date_trunc() can be used on intervals.
2. See how excluding outliers influences average completion times.
3. Do requests made in busy months take longer to complete? Check the correlation between the average completion time and requests per month.
4. Compare the number of requests created per month to the number completed.
Instructions 1/4:
• Use date_trunc() to examine the distribution of rat request completion times by
number of days.
-- Truncate the time to complete requests to the day
SELECT date_trunc('day', date_completed - date_created) AS completion_time,
       -- Count requests with each truncated time
       count(*)
  FROM evanston311
 -- Where category is rats
 WHERE category = 'Rodents- Rats'
 -- Group and order by the variable of interest
 GROUP BY completion_time
 ORDER BY completion_time;
Instructions 2/4:
• Compute average completion time per category excluding the longest 5% of requests
(outliers).
SELECT category,
       -- Compute average completion time per category
       avg(date_completed - date_created) AS avg_completion_time
  FROM evanston311
 -- Where completion time is less than the 95th percentile value
 WHERE date_completed - date_created <
       -- Compute the 95th percentile of completion time in a subquery
       (SELECT percentile_disc(0.95) WITHIN GROUP (ORDER BY date_completed - date_created)
          FROM evanston311)
 GROUP BY category
 -- Order the results
 ORDER BY avg_completion_time DESC;
Instructions 3/4:
• Get corr() between avg. completion time and monthly requests. EXTRACT(epoch
FROM interval) returns seconds in interval.
-- Compute correlation (corr) between
-- avg_completion time and count from the subquery
SELECT corr(avg_completion, count)
-- Convert date_created to its month with date_trunc
  FROM (SELECT date_trunc('month', date_created) AS month,
               -- Compute average completion time in number of seconds