
Keeping data organized with sorting and

filters
You have learned about four phases of analysis:

 Organize data
 Format and adjust data
 Get input from others
 Transform data
The organization of datasets is really important for data analysts. Most of the datasets you will use
will be organized as tables. Tables are helpful because they let you manipulate your data and
categorize it. Having distinct categories and classifications lets you focus on, and differentiate
between, your data quickly and easily.

Data analysts also need to format and adjust data when performing an analysis. Sorting and
filtering are two ways you can keep things organized when you format and adjust data to work
with it. For example, a filter can help you find errors or outliers so you can fix or flag them before
your analysis. Outliers are data points that are very different from similarly collected data and
might not be reliable values. The benefit of filtering the data is that after you fix errors or identify
outliers, you can remove the filter and return the data to its original organization.

In this reading, you will learn the difference between sorting and filtering. You will also be introduced
to how a particular form of sorting is done in a pivot table.

Sorting versus filtering

Sorting is when you arrange data into a meaningful order to make it easier to understand, analyze,
and visualize. It ranks your data based on a specific metric you choose. You can sort data in
spreadsheets, SQL databases (when your dataset is too large for spreadsheets), and tables in
documents.
For example, if you need to rank things or create chronological lists, you can sort by ascending or
descending order. If you are interested in figuring out a group’s favorite movies, you might sort by
movie title to figure it out. Sorting will arrange the data in a meaningful way and give you immediate
insights. Sorting also helps you to group similar data together by a classification. For movies, you
could sort by genre -- like action, drama, sci-fi, or romance.

Filtering is used when you are only interested in seeing data that meets specific criteria, and
hiding the rest. Filtering is really useful when you have lots of data. You can save time by zeroing in
on the data that is really important or the data that has bugs or errors. Most spreadsheets and SQL
databases allow you to filter your data in a variety of ways. Filtering gives you the ability to find what
you are looking for without too much effort.

For example, if you are only interested in finding out who watched movies in October, you could use
a filter on the dates so only the records for movies watched in October are displayed. Then, you
could check out the names of the people to figure out who watched movies in October.

To recap, the easiest way to remember the difference between sorting and filtering is that you can
use sort to quickly order the data, and filter to display only the data that meets the criteria that you
have chosen. Use filtering when you need to reduce the amount of data that is displayed.

It is important to point out that, after you filter data, you can sort the filtered data, too. If you
revisit the example of finding out who watched movies in October, after you have filtered for the
movies seen in October, you can then sort the names of the people who watched those movies in
alphabetical order.
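The filter-then-sort workflow described above can be sketched in plain Python. The watch-log records and field names here are hypothetical, invented purely for illustration:

```python
# A minimal sketch of "filter, then sort" using a plain Python list of dicts.
# The records and field names are invented for illustration only.
watch_log = [
    {"name": "Casey", "movie": "Alien", "month": "October"},
    {"name": "Ada",   "movie": "Up",    "month": "September"},
    {"name": "Blake", "movie": "Coco",  "month": "October"},
]

# Filter: keep only the records for movies watched in October.
october = [r for r in watch_log if r["month"] == "October"]

# Sort: arrange the filtered viewers' names in alphabetical order.
october_names = sorted(r["name"] for r in october)
```

Removing the filter (here, simply going back to `watch_log`) restores the data to its original organization, just as described above.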

Sorting in a pivot table


Items in the row and column areas of a pivot table are sorted in ascending order by any custom list
first. For example, if your list contains the days of the week, the pivot table sorts the weekday names
in their natural order, like this: Monday, Tuesday, Wednesday, etc. rather than alphabetically, like
this: Friday, Monday, Saturday, etc.

If the items aren’t in a custom list, they will be sorted in ascending order by default. But, if you sort in
descending order, you are setting up a rule that controls how the field is sorted even after new data
fields are added.

Transforming data in SQL


Data analysts usually need to convert data from one format to another to complete an analysis. But
what if you are using SQL rather than a spreadsheet? Just like spreadsheets, SQL uses standard
rules to convert one type of data to another. If you are wondering why data transformation is an
important skill to have as a data analyst, think of it like being a driver who is able to change a flat tire.
Being able to convert data to the right format speeds you along in your analysis. You don’t have to
wait for someone else to convert the data for you.
In this reading, you will go over the conversions that can be done using the CAST function. There
are also more specialized functions like COERCION to work with big numbers, and UNIX_DATE
to work with dates. UNIX_DATE returns the number of days that have passed since January 1,
1970 and is used to compare and work with dates across multiple time zones. You will likely use
CAST most often.

Common conversions
The following table summarizes some of the more common conversions made with the CAST
function. Refer to Conversion Rules in Standard SQL for a full list of functions and associated rules.

Starting with: Numeric (number)
CAST function can convert to: Integer, Numeric (number), Big number, Floating integer, String

Starting with: String
CAST function can convert to: Boolean, Integer, Numeric (number), Big number, Floating integer, String, Bytes, Date, Date time, Time, Timestamp

Starting with: Date
CAST function can convert to: String, Date, Date time, Timestamp

The CAST function (syntax and examples)


CAST is an American National Standards Institute (ANSI) function used in many programming
languages, including the SQL dialect used by BigQuery. This section provides the BigQuery syntax and examples of
converting the data types in the first column of the previous table. The syntax for the CAST function
is as follows:

CAST (expression AS typename)

Where expression is the data to be converted and typename is the data type to be returned.

Converting a number to a string


The following CAST statement returns a string from a numeric identified by the variable MyCount in
the table called MyTable.

SELECT
  CAST(MyCount AS STRING)
FROM
  MyTable
In the above SQL statement, the following occurs:

 SELECT indicates that you will be selecting data from a table


 CAST indicates that you will be converting the data you select to a different data type
 AS comes before and identifies the data type which you are casting to
 STRING indicates that you are converting the data to a string
 FROM indicates which table you are selecting the data from
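As a rough, runnable illustration, the same conversion can be tried with Python's built-in sqlite3 module. Note the assumptions in this sketch: SQLite follows the same ANSI CAST(expression AS typename) shape, but its string type is named TEXT rather than BigQuery's STRING, and the table contents are invented.

```python
import sqlite3

# Demonstrate CAST with Python's built-in sqlite3 module. The table and
# column names (MyTable, MyCount) mirror the reading; SQLite's type name
# for strings is TEXT, unlike BigQuery's STRING.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyCount INTEGER)")
conn.execute("INSERT INTO MyTable VALUES (42)")

row = conn.execute("SELECT CAST(MyCount AS TEXT) FROM MyTable").fetchone()
# row[0] is now the string '42' rather than the integer 42
```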
Converting a string to a number
The following CAST statement returns an integer from a string identified by the variable
MyVarcharCol in the table called MyTable. (An integer is any whole number.)

SELECT CAST(MyVarcharCol AS INT) FROM MyTable

In the above SQL statement, the following occurs:

 SELECT indicates that you will be selecting data from a table


 CAST indicates that you will be converting the data you select to a different data type
 AS comes before and identifies the data type which you are casting to
 INT indicates that you are converting the data to an integer
 FROM indicates which table you are selecting the data from
Converting a date to a string
The following CAST statement returns a string from a date identified by the variable MyDate in the
table called MyTable.

SELECT
  CAST(MyDate AS STRING)
FROM
  MyTable

In the above SQL statement, the following occurs:

 SELECT indicates that you will be selecting data from a table


 CAST indicates that you will be converting the data you select to a different data type
 AS comes before and identifies the data type which you are casting to
 STRING indicates that you are converting the data to a string
 FROM indicates which table you are selecting the data from
Converting a date to a datetime
Datetime values use the format YYYY-MM-DD hh:mm:ss, so the date and time are retained
together. The following CAST statement returns a datetime value from a date.

SELECT
  CAST(MyDate AS DATETIME)
FROM
  MyTable

In the above SQL statement, the following occurs:


 SELECT indicates that you will be selecting data from a table
 CAST indicates that you will be converting the data you select to a different data type
 AS comes before and identifies the data type which you are casting to
 DATETIME indicates that you are converting the data to a datetime value
 FROM indicates which table you are selecting the data from

The SAFE_CAST function


If the CAST function can't perform a requested conversion, the query fails with an error in BigQuery.
To avoid errors in the event of a failed conversion, use the SAFE_CAST function instead. The
SAFE_CAST function returns a value of NULL instead of an error when a conversion fails.

The syntax for SAFE_CAST is the same as for CAST. Simply substitute the function directly in your
queries. The following SAFE_CAST statement returns a string from a date.

SELECT SAFE_CAST (MyDate AS STRING) FROM MyTable
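SAFE_CAST itself is a BigQuery function, but its behavior is easy to mimic. Here is a plain-Python analogue (a hypothetical helper, not part of any SQL engine) that returns None, standing in for NULL, when a conversion fails:

```python
# A plain-Python analogue of SAFE_CAST's behavior: attempt the conversion
# and return None (standing in for SQL NULL) instead of raising an error.
# This is an illustration, not the BigQuery function itself.
def safe_cast_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

For example, `safe_cast_int("12")` succeeds, while `safe_cast_int("apple")` quietly yields None rather than stopping the whole query with an error.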

Manipulating strings in SQL


Knowing how to convert and manipulate your data for an accurate analysis is an important part of a
data analyst’s job. In this reading, you will learn about different SQL functions and their usage,
especially regarding string combinations.

A string is a set of characters used to represent text in programming languages such as
SQL. SQL string functions are used to obtain information about the characters in a string or, in this
case, manipulate them. One such function, CONCAT, is commonly used. Review the table below to
learn more about the CONCAT function and its variations.

Function: CONCAT
Usage: A function that adds strings together to create new text strings that can be used as unique keys
Example: CONCAT (‘Google’, ‘.com’)

Function: CONCAT_WS
Usage: A function that adds two or more strings together with a separator
Example: CONCAT_WS (‘ . ’, ‘www’, ‘google’, ‘com’) *The separator (being the period) gets input before and after google when you run the SQL function

Function: CONCAT with +
Usage: Adds two or more strings together using the + operator
Example: ‘Google’ + ‘.com’
CONCAT at work
When adding two strings together such as ‘Data’ and ‘analysis’, it will be input like this:
 SELECT CONCAT (‘Data’, ‘analysis’);
The result will be:

 Dataanalysis
Sometimes, depending on the strings, you will need to add a space character, so your function
should actually be:

 SELECT CONCAT (‘Data’, ‘ ’, ‘analysis’);


And the result will be:

 Data analysis
The same rule applies when combining three strings together. For example,

 SELECT CONCAT (‘Data’, ‘ ’, ‘analysis’, ‘ ’, ‘is’, ‘ ’, ‘awesome!’);


And the result will be

 Data analysis is awesome!
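For intuition, here are plain-Python analogues of CONCAT and CONCAT_WS. The function names mirror the SQL ones, but these are illustrations of the behavior, not the SQL functions themselves:

```python
# Plain-Python analogues of the SQL string functions described above.
def concat(*strings):
    # CONCAT: join strings with nothing in between.
    return "".join(strings)

def concat_ws(separator, *strings):
    # CONCAT_WS: join strings with the given separator between each pair.
    return separator.join(strings)

result = concat("Data", " ", "analysis")       # 'Data analysis'
site = concat_ws(".", "www", "google", "com")  # 'www.google.com'
```

As in the SQL examples, `concat("Data", "analysis")` runs the strings together as 'Dataanalysis', which is why the explicit space argument is needed.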

VLOOKUP core concepts


Functions can be used to quickly find information and perform calculations using specific values. In
this reading, you will learn about the importance of one such function, VLOOKUP, or Vertical
Lookup, which searches for a certain value in a spreadsheet column and returns a corresponding
piece of information from the row in which the searched value is found.

When do you need to use VLOOKUP?


Two common reasons to use VLOOKUP are:

 Populating data in a spreadsheet


 Merging data from one spreadsheet with data in another
VLOOKUP syntax
A VLOOKUP function is available in both Microsoft Excel and Google Sheets. You will be introduced
to the general syntax in Google Sheets. (You can refer to the resources at the end of this reading for
more information about VLOOKUP in Microsoft Excel.)

Here is the syntax:

=VLOOKUP(search_key, range, index, [is_sorted])

search_key
 The value to search for.
 For example, 42, "Cats", or I24.
range
 The range to consider for the search.
 The first column in the range is searched to locate data matching the value specified by search_key.
index
 The column index of the value to be returned, where the first column in range is numbered 1.
 If index is not between 1 and the number of columns in range, #VALUE! is returned.
is_sorted
 Indicates whether the column to be searched (the first column of the specified range) is sorted.
TRUE by default.
 It’s recommended to set is_sorted to FALSE. If set to FALSE, an exact match is returned. If there
are multiple matching values, the content of the cell corresponding to the first value found is
returned, and #N/A is returned if no such value is found.
 If is_sorted is TRUE or omitted, the nearest match (less than or equal to the search key) is returned.
If all values in the search column are greater than the search key, #N/A is returned.
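The lookup behavior described above can be sketched in Python. This hypothetical `vlookup` helper mimics the is_sorted set to FALSE case: it scans the first column of the range for an exact match, returns the value at the requested 1-based column index, and produces #N/A when nothing matches. The inventory rows are invented sample data:

```python
# A sketch of VLOOKUP with is_sorted = FALSE: exact match on the first
# column, return the value from the 1-based column index, '#N/A' otherwise.
def vlookup(search_key, rows, index):
    for row in rows:
        if row[0] == search_key:
            return row[index - 1]   # spreadsheet columns count from 1
    return "#N/A"

inventory = [  # hypothetical range: item, price, quantity
    ("Pencil", 0.25, 120),
    ("Stapler", 6.10, 8),
]
```

For example, `vlookup("Stapler", inventory, 2)` returns the price 6.10, while searching for an item that isn't present returns "#N/A", just as the spreadsheet function would.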

What if you get #N/A?


As you have just read, #N/A indicates that a matching value can't be returned as a result of the
VLOOKUP. The error doesn’t mean that anything is actually wrong with the data, but people might
have questions if they see the error in a report. You can use the IFNA function to replace the #N/A
error with something more descriptive, like “Does not exist.”

Here is the syntax:

=IFNA(value, value_if_na)


value
 This is a required value.
 The function checks whether the cell value matches this value, such as #N/A.
value_if_na
 This is a required value.
 The function returns this value if the cell value matches the value in the first argument; that is, the
value returned when the cell value is #N/A.

Helpful VLOOKUP reminders


 TRUE means an approximate match, FALSE means an exact match on the search key. If the data
used for the search key is sorted, TRUE can be used.
 You want the column that matches the search key in a VLOOKUP formula to be on the left side of
the data. VLOOKUP only looks at data to the right after a match is found. In other words, the index
for VLOOKUP indicates columns to the right only. This may require you to move columns around
before you use VLOOKUP.
 After you have populated data with the VLOOKUP formula, you may copy and paste the data as
values only to remove the formulas so you can manipulate the data again.

Secret identities: The importance of


aliases
In this reading, you will learn about using aliasing to simplify your SQL queries. Aliases are used in
SQL queries to create temporary names for a column or table. Aliases make referencing tables and
columns in your SQL queries much simpler when you have table or column names that are too long
or complex to make use of in queries. Imagine a table name like
special_projects_customer_negotiation_mileages. That would be difficult to retype every time you
use that table. With an alias, you can create a meaningful nickname that you can use for your
analysis. In this case “special_projects_customer_negotiation_mileages” can be aliased to simply
“mileage.” Instead of having to write out the long table name, you can use a meaningful nickname
that you decide.

Basic syntax for aliasing


Aliasing is the process of using aliases. In SQL queries, aliases are implemented by making use
of the AS command. The basic syntax for the AS command can be seen in the following query for
aliasing a table:

SELECT column_name(s)
FROM table_name AS alias_name

Notice that AS is preceded by the table name and followed by the new nickname. It is a similar
approach to aliasing a column:

SELECT column_name AS alias_name
FROM table_name

In both cases, you now have a new name that you can use to refer to the column or table that was
aliased.

Alternate syntax for aliases


If using AS results in an error when running a query because the SQL database you are working
with doesn't support it, you can leave it out. In the previous examples, the alternate syntax for
aliasing a table or column would be:

 FROM table_name alias_name


 SELECT column_name alias_name
The key takeaway is that queries can run with or without using AS for aliasing, but using AS has the
benefit of making queries more readable. It helps to make aliases stand out more clearly.

Aliasing in action
Let’s check out an example of a SQL query that uses aliasing. Let’s say that you are working with
two tables: one of them has employee data and the other one has department data. The FROM
statement to alias the employee table could be:

FROM work_day.employees AS employees

These aliases still let you know exactly what is in these tables, but now you don’t have to manually
input those long table names. Aliases can be really helpful for long, complicated queries. It is easier
to read and write your queries when you have aliases that tell you what is included within your
tables.
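A runnable sketch of the idea, using Python's sqlite3 module. The long table name and its contents are hypothetical stand-ins for a real production table:

```python
import sqlite3

# Aliasing a long table name and a column name with AS, using sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE special_projects_customer_negotiation_mileages (miles INTEGER)"
)
conn.execute(
    "INSERT INTO special_projects_customer_negotiation_mileages VALUES (250)"
)

# Alias the long table name to 'mileage' and the column to 'total_miles'.
row = conn.execute(
    "SELECT mileage.miles AS total_miles "
    "FROM special_projects_customer_negotiation_mileages AS mileage"
).fetchone()
```

Inside the SELECT clause, the short alias `mileage` stands in for the full table name, exactly as the reading describes.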

Using JOINs effectively


In this reading, you will review how JOINs are used and will be introduced to some resources that
you can use to learn more about them. A JOIN combines tables by using a primary or foreign key to
align the information coming from both tables in the combination process. JOINs use these keys to
identify relationships and corresponding values across tables.

If you need a refresher on primary and foreign keys, refer to the glossary for this course, or go back
to Databases in data analytics.

The general JOIN syntax


SELECT
  column_name(s)
FROM
  table_1
JOIN
  table_2 ON table_1.column_name = table_2.column_name

As you can see from the syntax, the JOIN statement is part of the FROM clause of the query. JOIN
in SQL indicates that you are going to combine data from two tables. ON in SQL identifies how the
tables are to be matched for the correct information to be combined from both.

Types of JOINs
There are four general ways in which to conduct JOINs in SQL queries: INNER, LEFT, RIGHT, and
FULL OUTER.

The circles represent left and right tables, and where they are joined is highlighted in blue

Here is what these different JOIN queries do.


INNER JOIN
INNER is optional in an INNER JOIN query because it is the default as well as the most commonly used
JOIN operation. You may see this written as JOIN only. INNER JOIN returns records if the data lives in both
tables. For example, if you use INNER JOIN for the 'customers' and 'orders' tables and match the
data using the customer_id key, you would combine the data for each customer_id that exists in both
tables. If a customer_id exists in the customers table but not the orders table, data for that
customer_id isn’t joined or returned by the query.

The results from the query might look like the following, where customer_name is from the
customers table and product_id and ship_date are from the orders table:

customer_name product_id ship_date


Martin's Ice Cream 043998 2021-02-23
Beachside Treats 872012 2021-02-25
Mona's Natural Flavors 724956 2021-02-28
... etc. ... etc. ... etc.
The data from both tables was joined together by matching the customer_id common to both tables.
Notice that customer_id doesn’t show up in the query results. It is simply used to establish the
relationship between the data in the two tables so the data can be joined and returned.

LEFT JOIN
You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax.
LEFT JOIN returns all the records from the left table and only the matching records from the right
table. Use LEFT JOIN whenever you need the data from the entire first table and values from the
second table, if they exist. For example, in the following query, LEFT JOIN will return customer_name
with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a
sales representative, that customer would still show up in the query results but with a NULL value for
sales_rep.

SELECT
  customers.customer_name,
  sales.sales_rep
FROM
  customers
LEFT JOIN
  sales ON customers.customer_id = sales.customer_id
The results from the query might look like the following where customer_name is from the customers
table and sales_rep is from the sales table. Again, the data from both tables was joined together by
matching the customer_id common to both tables even though customer_id wasn't returned in the
query results.

customer_name sales_rep
Martin's Ice Cream Luis Reyes
Beachside Treats NULL
Mona's Natural Flavors Geri Hall
...etc. ...etc.
RIGHT JOIN
You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the
right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is
rarely used. Most people simply switch the tables and stick with LEFT JOIN. But using the previous
example for LEFT JOIN, the query using RIGHT JOIN would look like the following:

SELECT
  customers.customer_name,
  sales.sales_rep
FROM
  sales
RIGHT JOIN
  customers ON sales.customer_id = customers.customer_id

The query results are the same as the previous LEFT JOIN example.

customer_name sales_rep
Martin's Ice Cream Luis Reyes
Beachside Treats NULL
Mona's Natural Flavors Geri Hall
...etc. ...etc.
FULL OUTER JOIN
You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the
specified tables. You can combine tables this way, but remember that this can potentially be a large
data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn’t
populated in one of the tables. For example, in the following query, you will get all customers and their
products’ shipping dates. Because you are using a FULL OUTER JOIN, you may get customers
returned without corresponding shipping dates or shipping dates without corresponding customers.
A NULL value is returned if corresponding data doesn’t exist in either table.

SELECT
  customers.customer_name,
  orders.ship_date
FROM
  customers
FULL OUTER JOIN
  orders ON customers.customer_id = orders.customer_id

The results from the query might look like the following.

customer_name ship_date
Martin's Ice Cream 2021-02-23
Beachside Treats 2021-02-25
NULL 2021-02-25
The Daily Scoop NULL
Mountain Ice Cream NULL
Mona's Natural Flavors 2021-02-28
...etc. ...etc.
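To see INNER JOIN and LEFT JOIN side by side, here is a small runnable sketch using Python's sqlite3 module. The customers and sales tables loosely mirror the examples above with invented rows; RIGHT and FULL OUTER JOIN are omitted because older SQLite versions don't support them:

```python
import sqlite3

# INNER JOIN vs LEFT JOIN on a customer_id key, with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE sales (customer_id INTEGER, sales_rep TEXT);
    INSERT INTO customers VALUES (1, 'Martin''s Ice Cream'), (2, 'Beachside Treats');
    INSERT INTO sales VALUES (1, 'Luis Reyes');
""")

# INNER JOIN: only customers with a matching sales record are returned.
inner = conn.execute("""
    SELECT c.customer_name, s.sales_rep
    FROM customers AS c
    INNER JOIN sales AS s ON c.customer_id = s.customer_id
""").fetchall()

# LEFT JOIN: every customer is returned; missing sales_rep becomes NULL (None).
left = conn.execute("""
    SELECT c.customer_name, s.sales_rep
    FROM customers AS c
    LEFT JOIN sales AS s ON c.customer_id = s.customer_id
""").fetchall()
```

Here `inner` contains only Martin's Ice Cream, while `left` also includes Beachside Treats paired with None, matching the NULL shown in the LEFT JOIN results table above.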

SQL functions and subqueries: A


functional friendship
In this reading, you will learn about SQL functions and how they are sometimes used with
subqueries. SQL functions are tools built into SQL to make it possible to perform calculations. A
subquery (also called an inner or nested query) is a query within another query.

How do SQL functions, function?


SQL functions are what help make data aggregation possible. (As a reminder, data aggregation is
the process of gathering data from multiple sources in order to combine it into a single, summarized
collection.) So, how do SQL functions work? Going back to W3Schools, let’s review some of these
functions to get a better understanding of how to run these queries:
 SQL HAVING: This is an overview of the HAVING clause, including what it is and a tutorial on how
and when it works.
 SQL CASE: Explore the usage of the CASE statement and examples of how it works.
 SQL IF: This is a tutorial of the IF function and offers examples that you can practice with.
 SQL COUNT: The COUNT function is just as important as all the rest, and this tutorial offers
multiple examples to review.

Subqueries - the cherry on top


Think of a query as a cake. A cake can have multiple layers contained within it and even layers
within those layers. Each of these layers is a subquery, and when you put all of the layers
together, you get a cake (the full query). Usually, you will find subqueries nested in the SELECT, FROM,
and/or WHERE clauses. There is no general syntax for subqueries, but the syntax for a basic
subquery is as follows:

SELECT
  account_table.*
FROM (
  SELECT *
  FROM transaction.sf_model_feature_2014_01
  WHERE day_of_week = 'Friday'
) account_table
WHERE account_table.availability = 'YES'

Notice that within the first SELECT clause is another SELECT clause. The second SELECT
clause marks the start of the subquery in this statement. There are many different ways in which you
can make use of subqueries, and resources referenced will provide additional guidance as you
learn. But first, let’s recap the subquery rules.

There are a few rules that subqueries must follow:

 Subqueries must be enclosed within parentheses


 A subquery can have only one column specified in the SELECT clause. But if you want a subquery
to compare multiple columns, those columns must be selected in the main query.
 Subqueries that return more than one row can only be used with multiple value operators, such as
the IN operator which allows you to specify multiple values in a WHERE clause.
 A subquery can’t be nested in a SET command. The SET command is used with UPDATE to specify
which columns (and values) are to be updated in a table.
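A runnable sketch of a FROM-clause subquery following the shape of the statement above, using Python's sqlite3 module with simplified, hypothetical table and column names:

```python
import sqlite3

# A FROM-clause subquery: filter to Friday rows first, then filter the
# subquery's results on availability. Table contents are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (day_of_week TEXT, availability TEXT, amount INTEGER);
    INSERT INTO transactions VALUES
        ('Friday', 'YES', 100),
        ('Friday', 'NO',  40),
        ('Monday', 'YES', 75);
""")

rows = conn.execute("""
    SELECT account_table.*
    FROM (
        SELECT * FROM transactions WHERE day_of_week = 'Friday'
    ) AS account_table
    WHERE account_table.availability = 'YES'
""").fetchall()
```

Only the row that satisfies both the inner condition (Friday) and the outer condition (availability = 'YES') survives, and note that the subquery is enclosed in parentheses, per the first rule above.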

Functions with multiple conditions


In this reading, you will learn more about conditional functions and how to construct functions with
multiple conditions. Recall that conditional functions and formulas perform calculations according to
specific conditions. Previously, you learned how to use functions like SUMIF and COUNTIF that
have one condition. You can use the SUMIFS and COUNTIFS functions if you have two or more
conditions. You will learn their basic syntax in Google Sheets, and check out an example.

Refer to the resources at the end of this reading for information about similar functions in Microsoft
Excel.

SUMIF to SUMIFS
The basic syntax of a SUMIF function is: =SUMIF(range, criterion, sum_range)

The first range is where the function will search for the condition that you have set. The criterion is
the condition you are applying and the sum_range is the range of cells that will be included in the
calculation.

For example, you might have a table with a list of expenses, their cost, and the date they occurred.

   A        B        C
1  Expense  Price    Date
2  Fuel     $48.00   12/14/2020
3  Food     $12.34   12/14/2020
4  Taxi     $21.57   12/14/2020
5  Coffee   $2.50    12/15/2020
6  Fuel     $36.00   12/15/2020
7  Taxi     $15.88   12/15/2020
8  Coffee   $4.15    12/15/2020
9  Food     $6.75    12/15/2020

You could use SUMIF to calculate the total price of fuel in this table, like this:

=SUMIF(A1:A9, "Fuel", B1:B9)

But, you could also build in multiple conditions by using the SUMIFS function. SUMIF and SUMIFS
are very similar, but SUMIFS can include multiple conditions.

The basic syntax is: =SUMIFS(sum_range, criteria_range1, criterion1, [criteria_range2, criterion2, ...])

The square brackets let you know that those arguments are optional. The ellipsis at the end of the statement lets
you know that you can have as many repetitions of these parameters as needed. For example, if you
wanted to calculate the sum of the fuel costs for one date in this table, you could create a SUMIFS
statement with multiple conditions, like this:

=SUMIFS(B1:B9, A1:A9, "Fuel", C1:C9, "12/15/2020")

This formula gives you the total cost of every fuel expense from the date listed in the conditions. In
this example, C1:C9 is our second criterion_range and the date 12/15/2020 is the second condition.
As long as you follow the basic syntax, you can add up to 127 conditions to a SUMIFS statement!
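What SUMIFS computes can be mirrored in plain Python. This sketch sums the Price column where the expense is Fuel and the date is 12/15/2020, using the sample expense table from above:

```python
# SUMIFS as a filtered sum: add up prices where BOTH conditions hold.
# Rows mirror the sample expense table (item, price, date).
expenses = [
    ("Fuel",   48.00, "12/14/2020"),
    ("Food",   12.34, "12/14/2020"),
    ("Taxi",   21.57, "12/14/2020"),
    ("Coffee",  2.50, "12/15/2020"),
    ("Fuel",   36.00, "12/15/2020"),
    ("Taxi",   15.88, "12/15/2020"),
    ("Coffee",  4.15, "12/15/2020"),
    ("Food",    6.75, "12/15/2020"),
]

fuel_on_15th = sum(
    price for item, price, date in expenses
    if item == "Fuel" and date == "12/15/2020"
)
```

Only the single Fuel row dated 12/15/2020 matches both conditions, so the result is 36.00.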

COUNTIF to COUNTIFS
Just like the SUMIFS function, COUNTIFS allows you to create a COUNTIF function with multiple
conditions.

The basic syntax for COUNTIF is: =COUNTIF(range, criterion)

Just like SUMIF, you set the range and then the condition that needs to be met. For example, if you
wanted to count the number of times Food came up in the Expenses column, you could use a
COUNTIF function like this:

=COUNTIF(A1:A9, "Food")

COUNTIFS has the same basic syntax as SUMIFS: =COUNTIFS(criteria_range1, criterion1, [criteria_range2, criterion2, ...])

The criteria_range and criterion are in the same order, and you can add more conditions to the end
of the function. So, if you wanted to find the number of times Coffee appeared in the Expenses
column on 12/15/2020, you could use COUNTIFS to apply those conditions, like this:

=COUNTIFS(A1:A9, "Coffee", C1:C9, "12/15/2020")

This formula follows the basic syntax to create conditions for “Coffee” and the specific date. Now we
can find every instance where both of these conditions are true.
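Likewise, the COUNTIFS logic is just a filtered count. This plain-Python sketch counts rows from the same sample table where the expense is Coffee and the date is 12/15/2020:

```python
# COUNTIFS as a filtered count: count rows where BOTH conditions hold.
expenses = [
    ("Fuel", "12/14/2020"), ("Food", "12/14/2020"), ("Taxi", "12/14/2020"),
    ("Coffee", "12/15/2020"), ("Fuel", "12/15/2020"), ("Taxi", "12/15/2020"),
    ("Coffee", "12/15/2020"), ("Food", "12/15/2020"),
]

coffee_on_15th = sum(
    1 for item, date in expenses
    if item == "Coffee" and date == "12/15/2020"
)
```

Two rows satisfy both conditions, so the count is 2.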

Elements of a pivot table


Previously, you learned that a pivot table is a tool used to sort, reorganize, group, count, total, or
average data in spreadsheets. In this reading, you will learn more about the parts of a pivot table
and how data analysts use them to summarize data and answer questions about their data.

Pivot tables make it possible to view data in multiple ways in order to identify insights and trends.
They can help you quickly make sense of larger data sets by comparing metrics, performing
calculations, and generating reports. They’re also useful for answering specific questions about your
data.

A pivot table has four basic parts: rows, columns, values, and filters.

The rows of a pivot table organize and group data you select horizontally. For example, in the
Working with pivot tables video, the Release Date values were used to create rows that grouped the
data by year.

The columns organize and display values from your data vertically. Similar to rows, columns can
be pulled directly from the data set or created using values. Values are used to calculate and
count data. This is where you input the variables you want to measure. This is also how you create
calculated fields in your pivot table. As a refresher, a calculated field is a new field within a pivot
table that carries out certain calculations based on the values of other fields.

In the previous movie data example, the Values editor created columns for the pivot table, including
the SUM of Box Office Revenue, the AVERAGE of Box Office Revenue, and the COUNT of Box
Office Revenue columns.
Finally, the filters section of a pivot table enables you to apply filters based on specific criteria —
just like filters in regular spreadsheets! For example, a filter was added to the movie data pivot table
so that it only included movies that generated less than $10 million in revenue.

Being able to use all four parts of the pivot table editor will allow you to compare different metrics
from your data and execute calculations, which will help you gain valuable insights.

Using pivot tables for analysis


Pivot tables can be a useful tool for answering specific questions about a dataset so you can quickly
share answers with stakeholders. For example, a data analyst working at a department store was
asked to determine the total sales for each department and the number of products they each sold.
They were also interested in knowing exactly which department generated the most revenue.

Instead of making changes to the original spreadsheet data, they used a pivot table to answer these
questions and easily compare the sales revenue and number of products sold by each department.
They used the department as the rows for this pivot table to group and organize the rest of the sales
data. Then, they input two Values as columns: the SUM of sales and a count of the products sold.
They also sorted the data by the SUM of sales column in order to determine which department
generated the most revenue.
Now they know that the Toys department generated the most revenue!
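The department-store pivot described above amounts to grouping rows and aggregating per group. This plain-Python sketch (with invented sales figures) mirrors the SUM, COUNT, and sort steps:

```python
from collections import defaultdict

# Group sales rows by department, then total revenue and count products
# per group, like a pivot table with department as rows and two Values
# columns. Sales figures are invented for illustration.
sales = [  # (department, sale_amount)
    ("Toys", 120.0), ("Garden", 45.0), ("Toys", 80.0), ("Garden", 30.0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for department, amount in sales:
    totals[department] += amount   # SUM of sales per department
    counts[department] += 1        # COUNT of products sold per department

# Sort departments by total revenue, highest first.
ranked = sorted(totals, key=totals.get, reverse=True)
```

With these sample numbers, Toys tops the ranking, which is the kind of at-a-glance answer the pivot table delivers without touching the original data.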

Pivot tables are an effective tool for data analysts working with spreadsheets because they highlight
key insights from the spreadsheet data without having to make changes to the spreadsheet. Coming
up, you will create your own pivot table to analyze data and identify trends that will be highly
valuable to stakeholders.

Types of data validation


This reading describes the purpose, examples, and limitations of six types of data validation. The
first five are validation types associated with the data (type, range, constraint, consistency, and
structure) and the sixth type focuses on the validation of application code used to accept data from
user input.

As a junior data analyst, you might not perform all of these validations. But you could ask if and how
the data was validated before you begin working with a dataset. Data validation helps to ensure the
integrity of data. It also gives you confidence that the data you are using is clean. The following list
outlines six types of data validation and the purpose of each, and includes examples and limitations.

Data type
 Purpose: Check that the data matches the data type defined for a field.
 Example: Data values for school grades 1-12 must be a numeric data type.
 Limitations: The data value 13 would pass the data type validation but would be an unacceptable
value. For this case, data range validation is also needed.

Data range
 Purpose: Check that the data falls within an acceptable range of values defined for the field.
 Example: Data values for school grades should be values between 1 and 12.
 Limitations: The data value 11.5 would be in the data range and would also pass as a numeric
data type. But, it would be unacceptable because there aren't half grades. For this case, data
constraint validation is also needed.

Data constraint
 Purpose: Check that the data meets certain conditions or criteria for a field. This includes the type
of data entered as well as other attributes of the field, such as number of characters.
 Example: Content constraint: Data values for school grades 1-12 must be whole numbers.
 Limitations: The data value 13 is a whole number and would pass the content constraint
validation. But, it would be unacceptable since 13 isn’t a recognized school grade. For this case,
data range validation is also needed.

Data consistency
 Purpose: Check that the data makes sense in the context of other related data.
 Example: Data values for product shipping dates can’t be earlier than product production dates.
 Limitations: Data might be consistent but still incorrect or inaccurate. A shipping date could be
later than a production date and still be wrong.

Data structure
 Purpose: Check that the data follows or conforms to a set structure.
 Example: Web pages must follow a prescribed structure to be displayed properly.
 Limitations: A data structure might be correct with the data still incorrect or inaccurate. Content
on a web page could be displayed properly and still contain the wrong information.

Code validation
 Purpose: Check that the application code systematically performs any of the previously mentioned
validations during user data input.
 Example: Common problems discovered during code validation include: more than one data type
allowed, data range checking not done, or ending of text strings not well defined.
 Limitations: Code validation might not validate all possible variations with data input.

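Several of these checks can also be expressed directly as queries. As a sketch, assuming a hypothetical students table with a numeric grade column, the following query surfaces rows that would fail the range and constraint validations described above:

```sql
-- Flag grade values that fail range or constraint validation.
SELECT *
FROM students
WHERE grade < 1
   OR grade > 12              -- range check: grades must be between 1 and 12
   OR grade <> FLOOR(grade);  -- constraint check: grades must be whole numbers
```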
Working with temporary tables


Temporary tables are exactly what they sound like—temporary tables in a SQL database that
aren’t stored permanently. In this reading, you will learn the methods to create temporary tables
using SQL commands. You will also learn a few best practices to follow when working with
temporary tables.

A quick refresher on what you have already learned about temporary tables
 They are automatically deleted from the database when you end your SQL session.
 They can be used as a holding area for storing values if you are making a series of calculations. This
is sometimes referred to as pre-processing of the data.
 They can collect the results of multiple, separate queries. This is sometimes referred to as data
staging. Staging is useful if you need to perform a query on the collected data or merge the
collected data.
 They can store a filtered subset of the database. You don’t need to select and filter the data each
time you work with it. In addition, using fewer SQL commands helps to keep your data clean.
It is important to point out that each database has its own unique set of commands to create and
manage temporary tables. We have been working with BigQuery, so we will focus on the commands
that work well in that environment. The rest of this reading will go over the ways to create temporary
tables, primarily in BigQuery.

Temporary table creation in BigQuery


Temporary tables can be created using different clauses. In BigQuery, the WITH clause can be
used to create a temporary table. The general syntax for this method is as follows:
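A sketch of that syntax, using placeholder table, column, and filter names (in BigQuery, the WITH clause must be followed by a query that uses the temporary table; a simple SELECT is shown here):

```sql
WITH new_table_data AS (
    SELECT *
    FROM existing_table
    WHERE condition    -- placeholder filter, e.g. Region = "Africa"
)
SELECT *
FROM new_table_data
```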
Breaking down this query a bit, notice the following:

 The statement begins with the WITH clause followed by the name of the new temporary table you
want to create
 The AS clause appears after the name of the new table. This clause instructs the database to put all
of the data identified in the next part of the statement into the new table.
 The opening parenthesis after the AS clause creates the subquery that filters the data from an
existing table. The subquery is a regular SELECT statement along with a WHERE clause to specify
the data to be filtered.
 The closing parenthesis ends the subquery created by the AS clause.
When the database executes this query, it will first complete the subquery and assign the values that
result from that subquery to “new_table_data,” which is the temporary table. You can then run
multiple queries on this filtered data without having to filter the data every time.

Temporary table creation in other databases (not supported in BigQuery)
The following method isn’t supported in BigQuery, but most other SQL databases support it, including SQL Server. (MySQL doesn’t support SELECT ... INTO for creating a new table; it uses CREATE TABLE ... AS SELECT for the same purpose.) Using SELECT and INTO, you can create a temporary table based on conditions defined by a WHERE clause to locate the information you need for the temporary table. The general syntax for this method is as follows:
SELECT *
INTO AfricaSales
FROM GlobalSales
WHERE Region = "Africa"
This SELECT statement uses standard clauses like FROM and WHERE, but the INTO clause tells the database to store the data that is being requested in a new temporary table named, in this case, “AfricaSales.” (Note that in SQL Server, a table created this way is only temporary if its name is prefixed with #, as in #AfricaSales.)

User-managed temporary table creation


So far, we have explored ways of creating temporary tables that the database is responsible for
managing. But, you can also create temporary tables that you can manage as a user. As an analyst,
you might decide to create a temporary table for your analysis that you can manage yourself. You
would use the CREATE TABLE statement to create this kind of temporary table. After you have
finished working with the table, you would then delete or drop it from the database at the end of your
session.

Note: BigQuery uses CREATE TEMP TABLE instead of CREATE TABLE, but the general
syntax is the same.
CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
    ....
)
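For example, in BigQuery, a user-managed temporary table holding the filtered African sales data from the earlier example might be created like this (the table and column names are illustrative):

```sql
-- Create a session-scoped temporary table from a filtered query.
CREATE TEMP TABLE AfricaSales AS (
    SELECT *
    FROM GlobalSales
    WHERE Region = "Africa"
);
```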
After you have completed working with your temporary table, you can remove the table from the
database using the DROP TABLE clause. The general syntax is as follows:
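The syntax is simply the DROP TABLE clause followed by the name of the table to remove:

```sql
DROP TABLE table_name
-- In BigQuery, DROP TABLE IF EXISTS table_name avoids an error
-- if the table has already been removed.
```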

Best practices when working with temporary tables


 Global vs. local temporary tables: Global temporary tables are made available to all
database users and are deleted when all connections that use them have closed. Local temporary
tables are made available only to the user whose query or connection established the temporary
table. You will most likely be working with local temporary tables. If you have created a local
temporary table and are the only person using it, you can drop the temporary table after you are
done using it.
 Dropping temporary tables after use: Dropping a temporary table is a little different from
deleting a temporary table. Dropping a temporary table not only removes the information contained
in the rows of the table, but removes the table variable definitions (columns) themselves. Deleting a
temporary table removes the rows of the table but leaves the table definition and columns ready to
be used again. Although local temporary tables are dropped after you end your SQL session, it may
not happen immediately. If a lot of processing is happening in the database, dropping your
temporary tables after using them is a good practice to keep the database running smoothly.

Using Connected Sheets with BigQuery


In this reading, you will learn about Connected Sheets, a tool that allows data professionals to use
basic spreadsheet functions to analyze large datasets housed in BigQuery. With Connected Sheets, users don’t need to know SQL. Instead, anyone, not just data professionals, can generate insights
with basic spreadsheet operations such as formulas, charts, and pivot tables.

What is Connected Sheets?


Recall that BigQuery allows users to analyze petabytes of data (one petabyte is a million gigabytes) using complex
queries. A benefit of BigQuery is that it reduces the time needed to develop insights from large
datasets.

Google Sheets, on the other hand, is a spreadsheet tool that is easy to use and shareable with a
familiar interface. It also allows simple and flexible analysis with tools like pivot tables, charts, and
formulas.

Connected Sheets integrates both BigQuery and Google Sheets, allowing the user to analyze
billions of rows of data in Sheets without any need for specialized knowledge, such as SQL.

Additionally, Connected Sheets is built to handle big data. Users won’t experience the same
limitations or performance issues they’ve had in the past (such as data loss) when working with large
data sets in spreadsheets.

Why would a data analytics professional use Connected Sheets?
As a data analytics professional, you can use Connected Sheets to help with several tasks, such as:

 Collaborating with partners, analysts, or other stakeholders in a familiar spreadsheet interface;
 Ensuring a single source of truth for data analysis without additional .csv exports;
 Defining variables so that all users are working with the same data;
 Sharing insights with your team in a secure environment; and
 Streamlining your reporting and dashboard workflows.

Many teams and industries, such as finance, marketing, and operations teams, benefit from Connected Sheets.

A few example use cases of Connected Sheets include:

 Business planning: A user can build and prepare datasets, and then find insights from the data.
For example, a data analyst can analyze sales data to determine which products sell better in
different locations.
 Customer service: A user can find out which stores have the most complaints per 10,000
customers.
 Sales: A user can create internal finance and sales reports. After completing them, they can share revenue reports with sales reps.
 Logistics, fulfillment, and delivery: A user can run real-time inventory management and
intelligent analytics tools.
Connected Sheets benefits
Collaborate with teammates and stakeholders
Since Connected Sheets lives in Google Workspace, you can easily collaborate with other teammates
and stakeholders in your company. If you’d like to limit access, you also control permissions for who
can view, edit, or share the data.

Do more with familiar tools


With Connected Sheets, you can access billions of rows of BigQuery data directly in Sheets. This
direct access makes it easier for all employees to track, forecast, and analyze their data to get to
better decisions faster.

Easily visualize data


You can unlock insights from your BigQuery datasets using features you’re already familiar with in
Sheets, such as pivot tables, charts, and formulas. These features help visualize large datasets
more easily than using a more advanced language such as SQL. However, if you know SQL, you
may prefer to use it in certain situations.

Up-to-date data
With Connected Sheets, data professionals can ensure they are making decisions based on a single
source of truth by setting up automatic refreshes of BigQuery data in Sheets.

Less data integrity and security risk


While users can access big data with Connected Sheets, they won’t be able to accidentally manipulate or jeopardize the integrity of the data. There’s also less security risk because data isn’t stored on individual workstations; it’s stored in the cloud.

Connected Sheets shortcomings


Limited free pricing tier
A shortcoming of Connected Sheets is that for the free pricing tier, users only receive 1 terabyte (TB)
of processed query data each month. To process more data, you will need to move to a paid tier.

Data must be housed in BigQuery


Another shortcoming is that you will need access to your data set in BigQuery. Without access to
BigQuery, you won’t be able to analyze data in Connected Sheets.
Query will fail with large results
A third shortcoming is that a Connected Sheets query will fail if the results are too large. Your query will fail if your pivot table produces a very large number of results, typically somewhere between 30,000 and 50,000. To reduce your results, you can use filters or limit the number of rows per breakout.
