Course 5
You have learned about four phases of analysis:
Organize data
Format and adjust data
Get input from others
Transform data
The organization of datasets is really important for data analysts. Most of the datasets you will use
will be organized as tables. Tables are helpful because they let you manipulate your data and
categorize it. Having distinct categories and classifications lets you focus on, and differentiate
between, your data quickly and easily.
Data analysts also need to format and adjust data when performing an analysis. Sorting and
filtering are two ways you can keep things organized when you format and adjust data to work
with it. For example, a filter can help you find errors or outliers so you can fix or flag them before
your analysis. Outliers are data points that are very different from similarly collected data and
might not be reliable values. The benefit of filtering the data is that after you fix errors or identify
outliers, you can remove the filter and return the data to its original organization.
In this reading, you will learn the difference between sorting and filtering. You will also be introduced
to how a particular form of sorting is done in a pivot table.
Sorting is when you arrange data into a meaningful order to make it easier to understand, analyze,
and visualize. It ranks your data based on a specific metric you choose. You can sort data in
spreadsheets, SQL databases (when your dataset is too large for spreadsheets), and tables in
documents.
For example, if you need to rank things or create chronological lists, you can sort by ascending or
descending order. If you are interested in figuring out a group’s favorite movies, you might sort the data by movie title. Sorting will arrange the data in a meaningful way and give you immediate
insights. Sorting also helps you to group similar data together by a classification. For movies, you
could sort by genre -- like action, drama, sci-fi, or romance.
Filtering is used when you are only interested in seeing data that meets specific criteria, and
hiding the rest. Filtering is really useful when you have lots of data. You can save time by zeroing in
on the data that is really important or the data that has bugs or errors. Most spreadsheets and SQL
databases allow you to filter your data in a variety of ways. Filtering gives you the ability to find what
you are looking for without too much effort.
For example, if you are only interested in finding out who watched movies in October, you could use
a filter on the dates so only the records for movies watched in October are displayed. Then, you
could check out the names of the people to figure out who watched movies in October.
To recap, the easiest way to remember the difference between sorting and filtering is that you can
use sort to quickly order the data, and filter to display only the data that meets the criteria that you
have chosen. Use filtering when you need to reduce the amount of data that is displayed.
It is important to point out that, after you filter data, you can sort the filtered data, too. If you
revisit the example of finding out who watched movies in October, after you have filtered for the
movies seen in October, you can then sort the names of the people who watched those movies in
alphabetical order.
If the items aren’t in a custom list, they will be sorted in ascending order by default. But if you sort in descending order, you are setting up a rule that controls how the field is sorted even after new data is added.
Common conversions
The following table summarizes some of the more common conversions made with the CAST
function. Refer to Conversion Rules in Standard SQL for a full list of functions and associated rules.
The general syntax is CAST(expression AS typename), where expression is the data to be converted and typename is the data type to be returned.
SELECT
  CAST(MyCount AS STRING)
FROM
  MyTable
In the above SQL statement, the CAST function converts the values in the MyCount column to the STRING data type before they are returned.
The syntax for SAFE_CAST is the same as for CAST. Simply substitute the function directly in your
queries. The following SAFE_CAST statement returns a string from a date.
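That statement might look like this sketch (the column and table names, MyDate and MyTable, are assumptions); unlike CAST, SAFE_CAST returns NULL instead of an error when a conversion fails:

```sql
SELECT
  SAFE_CAST(MyDate AS STRING)
FROM
  MyTable
```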
A string is a set of characters used to represent text in programming languages such as SQL. SQL string functions are used to obtain information about strings or, in this case, to manipulate them. One such function, CONCAT, is commonly used. Review the table below to
learn more about the CONCAT function and its variations.
For example, CONCAT('Data', 'analysis') returns:

Dataanalysis

Sometimes, depending on the strings, you will need to add a space character, so your function should actually be CONCAT('Data', ' ', 'analysis'), which returns:

Data analysis
The same rule applies when combining three strings together.
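As a sketch, a three-string concatenation joining hypothetical first_name and last_name columns with a space might look like this (the column and table names are assumptions):

```sql
SELECT
  CONCAT(first_name, ' ', last_name) AS full_name
FROM
  customers
```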
These parameters follow the syntax =VLOOKUP(search_key, range, index, is_sorted):

search_key
The value to search for.
For example, 42, "Cats", or I24.
range
The range to consider for the search.
The first column in the range is searched to locate data matching the value specified by search_key.
index
The column index of the value to be returned, where the first column in range is numbered 1.
If index is not between 1 and the number of columns in range, #VALUE! is returned.
is_sorted
Indicates whether the column to be searched (the first column of the specified range) is sorted.
TRUE by default.
It’s recommended to set is_sorted to FALSE. If set to FALSE, an exact match is returned. If there
are multiple matching values, the content of the cell corresponding to the first value found is
returned, and #N/A is returned if no such value is found.
If is_sorted is TRUE or omitted, the nearest match (less than or equal to the search key) is returned.
If all values in the search column are greater than the search key, #N/A is returned.
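Putting the four parameters together, a VLOOKUP call might look like this sketch (the cell range and column index are assumptions):

```
=VLOOKUP("Cats", A2:C100, 3, FALSE)
```

This searches the first column of A2:C100 for an exact match of "Cats" and returns the value from the third column of the matching row.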
In both cases, you now have a new name that you can use to refer to the column or table that was
aliased.
Aliasing in action
Let’s check out an example of a SQL query that uses aliasing. Let’s say that you are working with
two tables: one of them has employee data and the other one has department data. The FROM
statement to alias those tables could be:
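For example (the underlying table names, employee_data and department_data, are assumptions based on the description):

```sql
FROM
  employee_data AS employees,
  department_data AS departments
```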
These aliases still let you know exactly what is in these tables, but now you don’t have to manually
input those long table names. Aliases can be really helpful for long, complicated queries. It is easier
to read and write your queries when you have aliases that tell you what is included within your
tables.
If you need a refresher on primary and foreign keys, refer to the glossary for this course, or go back
to Databases in data analytics.
Types of JOINs
There are four general ways in which to conduct JOINs in SQL queries: INNER, LEFT, RIGHT, and
FULL OUTER.
The circles represent the left and right tables, and where they are joined is highlighted in blue.

INNER JOIN
You may also see this written simply as JOIN. INNER JOIN returns only the records that match in both tables.
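The INNER JOIN query being described might look like this sketch (the customer_id join key is an assumption; the table and column names come from the surrounding text):

```sql
SELECT
  customers.customer_name,
  orders.product_id,
  orders.ship_date
FROM
  customers
INNER JOIN
  orders ON customers.customer_id = orders.customer_id
```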
The results from the query might look like the following, where customer_name is from the
customers table and product_id and ship_date are from the orders table:
LEFT JOIN
You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax.
LEFT JOIN returns all the records from the left table and only the matching records from the right
table. Use LEFT JOIN whenever you need the data from the entire first table and values from the
second table, if they exist. For example, in the query below, LEFT JOIN will return customer_name
with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a
sales representative, that customer would still show up in the query results but with a NULL value for
sales_rep.
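A sketch of that query, joined on the customer_id shared by both tables:

```sql
SELECT
  customers.customer_name,
  sales.sales_rep
FROM
  customers
LEFT JOIN
  sales ON customers.customer_id = sales.customer_id
```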
The results from the query might look like the following where customer_name is from the customers
table and sales_rep is from the sales table. Again, the data from both tables was joined together by
matching the customer_id common to both tables even though customer_id wasn't returned in the
query results.
customer_name | sales_rep
Martin's Ice Cream | Luis Reyes
Beachside Treats | NULL
Mona's Natural Flavors | Geri Hall
...etc. | ...etc.
RIGHT JOIN
You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the
right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is
rarely used. Most people simply switch the tables and stick with LEFT JOIN. But using the previous
example for LEFT JOIN, the query using RIGHT JOIN would look like the following:
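A sketch, with the tables switched so that all customers are still returned (table and column names follow the LEFT JOIN example):

```sql
SELECT
  customers.customer_name,
  sales.sales_rep
FROM
  sales
RIGHT JOIN
  customers ON sales.customer_id = customers.customer_id
```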
The query results are the same as the previous LEFT JOIN example.
customer_name | sales_rep
Martin's Ice Cream | Luis Reyes
Beachside Treats | NULL
Mona's Natural Flavors | Geri Hall
...etc. | ...etc.
FULL OUTER JOIN
You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the
specified tables. You can combine tables this way, but remember that this can potentially be a large
data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn’t
populated in one of the tables. For example, in the query below, you will get all customers and their
products’ shipping dates. Because you are using a FULL OUTER JOIN, you may get customers
returned without corresponding shipping dates or shipping dates without corresponding customers.
A NULL value is returned if corresponding data doesn’t exist in either table.
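A sketch of that query (the customer_id join key is an assumption; the table and column names come from the surrounding text):

```sql
SELECT
  customers.customer_name,
  orders.ship_date
FROM
  customers
FULL OUTER JOIN
  orders ON customers.customer_id = orders.customer_id
```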
The results from the query might look like the following.
customer_name | ship_date
Martin's Ice Cream | 2021-02-23
Beachside Treats | 2021-02-25
NULL | 2021-02-25
The Daily Scoop | NULL
Mountain Ice Cream | NULL
Mona's Natural Flavors | 2021-02-28
...etc. | ...etc.
You will find that, within the first SELECT statement, there is another SELECT statement. The second SELECT marks the start of the subquery in this statement. There are many different ways you can make use of subqueries, and the resources referenced at the end of this reading will provide additional guidance as you learn. But first, let’s recap the subquery rules.
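As an illustrative sketch (the table and column names are assumptions), a subquery can supply a value that the outer query filters against:

```sql
SELECT
  employee_name,
  salary
FROM
  employees
WHERE salary > (
  SELECT AVG(salary)
  FROM employees
)
```

Here the inner SELECT computes the average salary, and the outer SELECT returns only the employees above that average.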
Refer to the resources at the end of this reading for information about similar functions in Microsoft
Excel.
SUMIF to SUMIFS
The basic syntax of a SUMIF function is: =SUMIF(range, criterion, sum_range)
The first range is where the function will search for the condition that you have set. The criterion is
the condition you are applying and the sum_range is the range of cells that will be included in the
calculation.
For example, you might have a table with a list of expenses, their cost, and the date they occurred.
Arranged as a table, the data in cells A1:C9 is:

Expense | Price | Date
Fuel | $48.00 | 12/14/2020
Food | $12.34 | 12/14/2020
Taxi | $21.57 | 12/14/2020
Coffee | $2.50 | 12/15/2020
Fuel | $36.00 | 12/15/2020
Taxi | $15.88 | 12/15/2020
Coffee | $4.15 | 12/15/2020
Food | $6.75 | 12/15/2020
You could use SUMIF to calculate the total price of fuel in this table, like this:
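Using the expense table above, that formula might be:

```
=SUMIF(A1:A9, "Fuel", B1:B9)
```

It returns $84.00, the sum of the two fuel expenses ($48.00 + $36.00).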
But, you could also build in multiple conditions by using the SUMIFS function. SUMIF and SUMIFS
are very similar, but SUMIFS can include multiple conditions.
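SUMIFS lists the sum range first, followed by pairs of criteria ranges and criteria: =SUMIFS(sum_range, criteria_range1, criterion1, ...). Applied to the expense table above (assuming the dates are stored as shown), the formula might be:

```
=SUMIFS(B1:B9, A1:A9, "Fuel", C1:C9, "12/15/2020")
```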
This formula gives you the total cost of every fuel expense from the date listed in the conditions. In
this example, C1:C9 is our second criterion_range and the date 12/15/2020 is the second condition.
As long as you follow the basic syntax, you can add up to 127 conditions to a SUMIFS statement!
COUNTIF to COUNTIFS
Just like the SUMIFS function, COUNTIFS allows you to create a COUNTIF function with multiple
conditions.
Just like SUMIF, you set the range and then the condition that needs to be met. For example, if you
wanted to count the number of times Food came up in the Expenses column, you could use a
COUNTIF function like this:
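Using the expense table above, that formula might be:

```
=COUNTIF(A1:A9, "Food")
```

It returns 2, since Food appears twice in the Expenses column.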
The criteria_range and criterion are in the same order, and you can add more conditions to the end
of the function. So, if you wanted to find the number of times Coffee appeared in the Expenses
column on 12/15/2020, you could use COUNTIFS to apply those conditions, like this:
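A sketch using the expense table above (assuming the dates are stored as shown):

```
=COUNTIFS(A1:A9, "Coffee", C1:C9, "12/15/2020")
```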
This formula follows the basic syntax to create conditions for “Coffee” and the specific date. Now we
can find every instance where both of these conditions are true.
Pivot tables make it possible to view data in multiple ways in order to identify insights and trends.
They can help you quickly make sense of larger data sets by comparing metrics, performing
calculations, and generating reports. They’re also useful for answering specific questions about your
data.
A pivot table has four basic parts: rows, columns, values, and filters.
The rows of a pivot table organize and group data you select horizontally. For example, in the
Working with pivot tables video, the Release Date values were used to create rows that grouped the
data by year.
The columns organize and display values from your data vertically. Similar to rows, columns can
be pulled directly from the data set or created using values. Values are used to calculate and
count data. This is where you input the variables you want to measure. This is also how you create
calculated fields in your pivot table. As a refresher, a calculated field is a new field within a pivot table that carries out certain calculations based on the values of other fields.
In the previous movie data example, the Values editor created columns for the pivot table, including
the SUM of Box Office Revenue, the AVERAGE of Box Office Revenue, and the COUNT of Box
Office Revenue columns.
Finally, the filters section of a pivot table enables you to apply filters based on specific criteria —
just like filters in regular spreadsheets! For example, a filter was added to the movie data pivot table
so that it only included movies that generated less than $10 million in revenue.
Being able to use all four parts of the pivot table editor will allow you to compare different metrics
from your data and execute calculations, which will help you gain valuable insights.
Instead of making changes to the original spreadsheet data, they used a pivot table to answer these
questions and easily compare the sales revenue and number of products sold by each department.
They used the department as the rows for this pivot table to group and organize the rest of the sales
data. Then, they input two Values as columns: the SUM of sales and a count of the products sold.
They also sorted the data by the SUM of sales column in order to determine which department
generated the most revenue.
Now they know that the Toys department generated the most revenue!
Pivot tables are an effective tool for data analysts working with spreadsheets because they highlight
key insights from the spreadsheet data without having to make changes to the spreadsheet. Coming
up, you will create your own pivot table to analyze data and identify trends that will be highly
valuable to stakeholders.
As a junior data analyst, you might not perform all of these validations. But you could ask if and how
the data was validated before you begin working with a dataset. Data validation helps to ensure the
integrity of data. It also gives you confidence that the data you are using is clean. The following list
outlines six types of data validation and the purpose of each, and includes examples and limitations.
Data type
Purpose: Check that the data matches the data type defined for a field.
Example: Data values for school grades 1-12 must be a numeric data type.
Limitations: The data value 13 would pass the data type validation but would be an unacceptable
value. For this case, data range validation is also needed.
Data range
Purpose: Check that the data falls within an acceptable range of values defined for the field.
Example: Data values for school grades should be values between 1 and 12.
Limitations: The data value 11.5 would be in the data range and would also pass as a numeric
data type. But, it would be unacceptable because there aren't half grades. For this case, data
constraint validation is also needed.
Data constraint
Purpose: Check that the data meets certain conditions or criteria for a field. This includes the type
of data entered as well as other attributes of the field, such as number of characters.
Example: Content constraint: Data values for school grades 1-12 must be whole numbers.
Limitations: The data value 13 is a whole number and would pass the content constraint
validation. But, it would be unacceptable since 13 isn’t a recognized school grade. For this case,
data range validation is also needed.
Data consistency
Purpose: Check that the data makes sense in the context of other related data.
Example: Data values for product shipping dates can’t be earlier than product production dates.
Limitations: Data might be consistent but still incorrect or inaccurate. A shipping date could be
later than a production date and still be wrong.
Data structure
Purpose: Check that the data follows or conforms to a set structure.
Example: Web pages must follow a prescribed structure to be displayed properly.
Limitations: A data structure might be correct with the data still incorrect or inaccurate. Content
on a web page could be displayed properly and still contain the wrong information.
Code validation
Purpose: Check that the application code systematically performs any of the previously mentioned
validations during user data input.
Example: Common problems discovered during code validation include: more than one data type
allowed, data range checking not done, or ending of text strings not well defined.
Limitations: Code validation might not validate all possible variations with data input.
The statement begins with the WITH clause, followed by the name of the new temporary table you want to create.
The AS clause appears after the name of the new table. This clause instructs the database to put all
of the data identified in the next part of the statement into the new table.
The opening parenthesis after the AS clause creates the subquery that filters the data from an
existing table. The subquery is a regular SELECT statement along with a WHERE clause to specify
the data to be filtered.
The closing parenthesis ends the subquery created by the AS clause.
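Putting those pieces together, the statement might look like this sketch (existing_table and the filter condition are assumptions; new_table_data is the temporary table named in the walkthrough):

```sql
WITH new_table_data AS (
  SELECT *
  FROM existing_table
  WHERE column1 = 'some_value'
)
SELECT *
FROM new_table_data
```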
When the database executes this query, it will first complete the subquery and assign the values that
result from that subquery to “new_table_data,” which is the temporary table. You can then run
multiple queries on this filtered data without having to filter the data every time.
Note: BigQuery uses CREATE TEMP TABLE instead of CREATE TABLE, but the general
syntax is the same.
CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  column3 datatype,
  ....
)
After you have completed working with your temporary table, you can remove the table from the
database using the DROP TABLE clause. The general syntax is as follows:
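A sketch of that statement:

```sql
DROP TABLE table_name
```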
Google Sheets, on the other hand, is a spreadsheet tool that is easy to use and shareable with a
familiar interface. It also allows simple and flexible analysis with tools like pivot tables, charts, and
formulas.
Connected Sheets integrates both BigQuery and Google Sheets, allowing the user to analyze
billions of rows of data in Sheets without any need for specialized knowledge, such as SQL.
Additionally, Connected Sheets is built to handle big data. Users won’t experience the same
limitations or performance issues they’ve had in the past (such as data loss) when working with large
data sets in spreadsheets.
Many teams, such as finance, marketing, and operations, benefit from Connected Sheets.
Business planning: A user can build and prepare datasets, and then find insights from the data.
For example, a data analyst can analyze sales data to determine which products sell better in
different locations.
Customer service: A user can find out which stores have the most complaints per 10,000
customers.
Sales: A user can create internal finance and sales reports, and then share revenue reports with sales reps.
Logistics, fulfillment, and delivery: A user can run real-time inventory management and
intelligent analytics tools.
Connected Sheets benefits
Collaborate with teammates and stakeholders
Since Connected Sheets lives in Google Workspace, you can easily collaborate with other teammates
and stakeholders in your company. If you’d like to limit access, you also control permissions for who
can view, edit, or share the data.
Up-to-date data
With Connected Sheets, data professionals can ensure they are making decisions based on a single
source of truth by setting up automatic refreshes of BigQuery data in Sheets.