SQL Cleaning Data
1. LEFT
2. RIGHT
3. LENGTH
LEFT pulls a specified number of characters for each row in a specified column starting at the
beginning (or from the left). As you saw here, you can pull the first three digits of a phone number
using LEFT(phone_number, 3).
RIGHT pulls a specified number of characters for each row in a specified column starting at the end
(or from the right). As you saw here, you can pull the last eight digits of a phone number
using RIGHT(phone_number, 8).
LENGTH provides the number of characters for each row of a specified column. Here, you saw that
we could use this to get the length of each phone number as LENGTH(phone_number).
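As a minimal sketch of the three functions together, the query below applies them to a single made-up phone number supplied inline with VALUES, so it does not depend on any table. The sample value and the column aliases are assumptions for illustration, and the syntax is PostgreSQL.

```sql
-- Hypothetical sample row; phone_number is an illustrative column name.
SELECT phone_number,
       LEFT(phone_number, 3)  AS area_code,     -- first 3 characters: '301'
       RIGHT(phone_number, 8) AS local_number,  -- last 8 characters: '555-0142'
       LENGTH(phone_number)   AS n_characters   -- 12
FROM (VALUES ('301-555-0142')) AS t(phone_number);
```

Against a real table, the FROM clause would name that table instead of the inline VALUES list.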
2. There is much debate about how much the name (or even the first letter of a company
name) matters. Use the accounts table to pull the first letter of each company name to see the
distribution of company names that begin with each letter (or number).
3. Use the accounts table and a CASE statement to create two groups: one group of company
names that start with a number and a second group of those company names that start with a letter.
What proportion of company names start with a letter?
4. Consider vowels as a, e, i, o, and u. What proportion of company names start with a
vowel, and what percent start with anything else?
1. POSITION
2. STRPOS
3. LOWER
4. UPPER
POSITION takes a character and a column, and provides the index where that
character is for each row. The index of the first position is 1 in SQL, unlike
many other programming languages, which begin indexing at 0. Here, you saw that
you can pull the index of a comma as POSITION(',' IN city_state).
STRPOS provides the same result as POSITION, but the syntax for achieving those
results is a bit different as shown here: STRPOS(city_state, ',').
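The four functions in this list can be sketched in one query. The example below uses a single made-up city_state value supplied inline, so the sample data and the aliases are assumptions; it also shows how the comma index combines with LEFT to pull out the city.

```sql
-- Hypothetical sample row; 'Oakland, CA' has its comma at index 8.
SELECT city_state,
       POSITION(',' IN city_state) AS comma_pos,         -- 8
       STRPOS(city_state, ',')     AS comma_pos_strpos,  -- 8
       LEFT(city_state, STRPOS(city_state, ',') - 1) AS city,  -- 'Oakland'
       LOWER(city_state) AS lowercased,                  -- 'oakland, ca'
       UPPER(city_state) AS uppercased                   -- 'OAKLAND, CA'
FROM (VALUES ('Oakland, CA')) AS t(city_state);
```

POSITION and STRPOS are case-sensitive, so searching for 'A' and for 'a' can give different indexes. Wrapping the column in LOWER or UPPER first is one way to make the search ignore case.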
2. Now see if you can do the same thing for every rep name in the sales_reps table. Again
provide first and last name columns.
1. CONCAT
2. Piping ||
Each of these will allow you to combine the values of several columns into one, row
by row. In this video, you saw how first and last names stored in separate columns
could be combined to create a full name: CONCAT(first_name, ' ', last_name) or with
piping as first_name || ' ' || last_name.
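A minimal sketch of both forms, run against one made-up name supplied inline (the sample values and aliases are assumptions):

```sql
-- Hypothetical sample row; both expressions build the same full name.
SELECT CONCAT(first_name, ' ', last_name) AS full_name_concat,  -- 'Ada Lovelace'
       first_name || ' ' || last_name     AS full_name_piped    -- 'Ada Lovelace'
FROM (VALUES ('Ada', 'Lovelace')) AS t(first_name, last_name);
```

In PostgreSQL the two differ when a value is NULL: CONCAT skips NULL arguments, while || returns NULL if either side is NULL. Other databases may behave differently, so it is worth checking before relying on either form.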
Quizzes CONCAT
1. Each company in the accounts table wants to create an email address for
each primary_poc. The email address should take the form
firstname.lastname@companyname.com, using the first and last name of the
primary_poc and the company name.
2. You may have noticed that in the previous solution some of the company names include
spaces, which will certainly not work in an email address. See if you can create an email address
that will work by removing all of the spaces in the account name, but otherwise your solution
should be just as in question 1. Some helpful documentation is here.
3. We would also like to create an initial password, which they will change after their first
log in. The first password will be the first letter of the primary_poc's first name (lowercase),
then the last letter of their first name (lowercase), the first letter of their last name (lowercase),
the last letter of their last name (lowercase), the number of letters in their first name, the number
of letters in their last name, and then the name of the company they are working with, all
capitalized with no spaces.
CONCAT Solutions
-- Question 1: email address from the POC's first name, last name, and company name.
WITH t1 AS (
    SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') - 1) first_name,
           RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name,
           name
    FROM accounts)
SELECT first_name, last_name,
       CONCAT(first_name, '.', last_name, '@', name, '.com')
FROM t1;

-- Question 2: same as question 1, with spaces removed from the company name.
WITH t1 AS (
    SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') - 1) first_name,
           RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name,
           name
    FROM accounts)
SELECT first_name, last_name,
       CONCAT(first_name, '.', last_name, '@', REPLACE(name, ' ', ''), '.com')
FROM t1;

-- Question 3: email address plus the initial password.
WITH t1 AS (
    SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') - 1) first_name,
           RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name,
           name
    FROM accounts)
SELECT first_name, last_name,
       CONCAT(first_name, '.', last_name, '@', name, '.com'),
       LEFT(LOWER(first_name), 1) || RIGHT(LOWER(first_name), 1) ||
       LEFT(LOWER(last_name), 1) || RIGHT(LOWER(last_name), 1) ||
       LENGTH(first_name) || LENGTH(last_name) || REPLACE(UPPER(name), ' ', '')
FROM t1;
In this video, you saw additional functionality for working with dates including:
1. TO_DATE
2. CAST
3. Casting with ::
DATE_PART('month', TO_DATE(month, 'month')) here changed a month name into
the number associated with that particular month.
You can then change a string to a date using CAST. CAST is useful for changing
many column types. The most common case is the one you saw here, changing a
string to a date with CAST(date_column AS DATE), but you may also want to
convert columns to other data types. You can see other examples here.
In this example, you also saw that instead of CAST(date_column AS DATE), you can
use date_column::DATE.
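The three techniques can be sketched side by side. The query below uses one made-up row supplied inline; the column names month_name and date_text and their values are assumptions for illustration, and the syntax is PostgreSQL.

```sql
-- Hypothetical sample row: a month name and a date stored as text.
SELECT DATE_PART('month', TO_DATE(month_name, 'month')) AS month_number,  -- 1
       CAST(date_text AS DATE) AS cast_date,       -- the text as a DATE value
       date_text::DATE         AS shorthand_date   -- same result as CAST
FROM (VALUES ('January', '2014-01-31')) AS t(month_name, date_text);
```

The :: shorthand is a PostgreSQL convention; CAST(... AS ...) is the more portable spelling if the query may run on another database.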
Expert Tip
Most of the functions presented in this lesson are specific to strings. They won’t
work directly with dates, integers or floating-point numbers. Depending on the
database, applying one of these functions to such a value either converts it to a
string automatically or requires an explicit cast to a string first.
LEFT, RIGHT, and TRIM are all used to select only certain elements of strings, but
using them to select elements of a number or date will treat them as strings for the
purpose of the function. Though we didn't cover TRIM in this lesson explicitly, it
can be used to remove characters from the beginning and end of a string. This can
remove unwanted spaces at the beginning or end of a row that often happen with
data being moved from Excel or other storage systems.
There are a number of variations of these functions, as well as several other string
functions not covered here. Different databases use subtle variations on these
functions, so be sure to look up the appropriate database’s syntax if you’re
connected to a private database. The PostgreSQL documentation covers many of the
related functions.
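Since TRIM was not covered explicitly, here is a minimal sketch of it, together with an explicit cast that lets LEFT work on a number. The sample values are made up and the aliases are assumptions; the brackets are added only to make the removed spaces visible.

```sql
-- Hypothetical sample row with stray spaces around a company name.
SELECT '[' || raw_name || ']'       AS before_trim,       -- '[  Walmart  ]'
       '[' || TRIM(raw_name) || ']' AS after_trim,        -- '[Walmart]'
       TRIM(BOTH 'x' FROM 'xxWalmartxx') AS custom_trim,  -- 'Walmart'
       LEFT(CAST(94107 AS TEXT), 3)      AS first_digits  -- '941'
FROM (VALUES ('  Walmart  ')) AS t(raw_name);
```

Casting the number to text before calling LEFT keeps the query working whether or not the database converts the value automatically.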
CAST Solutions
SELECT *
FROM sf_crime_data
LIMIT 10;

yyyy-mm-dd

The format of the date column is mm/dd/yyyy, with times that are not correct at the
end of the date.
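One way to turn a string in that mm/dd/yyyy layout into a proper date is to rearrange its pieces into yyyy-mm-dd and then cast. The sketch below is an assumption about how this could be done, not the lesson's stated solution: the sample string and the column name raw_date are made up, and SUBSTR (which takes a start index and a length) is a PostgreSQL function not covered in these notes.

```sql
-- Hypothetical sample value in mm/dd/yyyy format with a trailing time.
SELECT raw_date,
       (SUBSTR(raw_date, 7, 4) || '-' ||   -- year:  '2014'
        LEFT(raw_date, 2)      || '-' ||   -- month: '01'
        SUBSTR(raw_date, 4, 2)             -- day:   '31'
       )::DATE AS cleaned_date             -- 2014-01-31 as a DATE
FROM (VALUES ('01/31/2014 08:00:00 AM +0000')) AS t(raw_date);
```

The trailing time is simply ignored here because only the first ten characters are used.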
In this video, you learned how to use COALESCE to work with NULL values.
Unfortunately, our dataset does not contain the NULL values that were fabricated
for the video, so you will work through a different example in the next concept to
get used to the COALESCE function.
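COALESCE returns the first of its arguments that is not NULL, which makes it a convenient way to substitute a default value. A minimal sketch with two made-up rows supplied inline (the names, values, and the 'no POC' placeholder are assumptions for illustration):

```sql
-- Hypothetical sample rows; the second row has NULLs to fill in.
SELECT account_name,
       COALESCE(primary_poc, 'no POC') AS primary_poc_filled,  -- 'Jane Doe', then 'no POC'
       COALESCE(total, 0)              AS total_filled         -- 120, then 0
FROM (VALUES ('Acme',   'Jane Doe', 120),
             ('Globex', NULL,       NULL)) AS t(account_name, primary_poc, total);
```

Rows that already have a value pass through unchanged; only the NULLs are replaced.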
COALESCE Quizzes
In this quiz, we will walk through the previous example using the following task list. We will use the
COALESCE function to complete the orders record for the row in the table output.
Tasks to complete:
Task List
1. Run the query entered below in the SQL workspace to notice the row with missing data.
2. Use COALESCE to fill in the accounts.id column with the accounts.id for the NULL value
for the table in 1.
3. Use COALESCE to fill in the orders.account_id column with the accounts.id for the
NULL value for the table in 1.
4. Use COALESCE to fill in each of the qty and usd columns with 0 for the table in 1.
5. Run the query in 1 with the WHERE removed and COUNT the number of ids.
6. Run the query in 5, but with the COALESCE function used in questions 2 through 4.
COALESCE Solutions
-- Task 1: find the row with missing order data.
SELECT *
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;

-- Task 2: fill in the id column.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long,
       a.primary_poc, a.sales_rep_id, o.*
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;

-- Task 3: also fill in orders.account_id.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long,
       a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id) account_id,
       o.occurred_at, o.standard_qty, o.gloss_qty, o.poster_qty, o.total,
       o.standard_amt_usd, o.gloss_amt_usd, o.poster_amt_usd, o.total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;

-- Task 4: also fill in each qty and usd column with 0.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long,
       a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id) account_id,
       o.occurred_at, COALESCE(o.standard_qty, 0) standard_qty,
       COALESCE(o.gloss_qty, 0) gloss_qty, COALESCE(o.poster_qty, 0) poster_qty,
       COALESCE(o.total, 0) total,
       COALESCE(o.standard_amt_usd, 0) standard_amt_usd,
       COALESCE(o.gloss_amt_usd, 0) gloss_amt_usd,
       COALESCE(o.poster_amt_usd, 0) poster_amt_usd,
       COALESCE(o.total_amt_usd, 0) total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;

-- Task 5: count the rows with the WHERE removed.
SELECT COUNT(*)
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id;

-- Task 6: the query from task 4 with the WHERE removed.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long,
       a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id) account_id,
       o.occurred_at, COALESCE(o.standard_qty, 0) standard_qty,
       COALESCE(o.gloss_qty, 0) gloss_qty, COALESCE(o.poster_qty, 0) poster_qty,
       COALESCE(o.total, 0) total,
       COALESCE(o.standard_amt_usd, 0) standard_amt_usd,
       COALESCE(o.gloss_amt_usd, 0) gloss_amt_usd,
       COALESCE(o.poster_amt_usd, 0) poster_amt_usd,
       COALESCE(o.total_amt_usd, 0) total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id;