Big Query Help
Big Query Help
Question 1
Activity overview
In previous activities, you learned about and practiced SQL. In this activity, you’ll work with SQL
queries of different sizes.
By the time you complete this activity, you’ll be familiar with the different sizes used to measure
data storage. This will help you understand how data size affects the amount of time queries take
to run and how valuable tools like SQL can be to data analysts.
Data is measured by the number of bits it takes to represent it. All information in a computer can
be represented as a binary number consisting solely of 0’s and 1’s. Each 0 or 1 in a number is a
bit. A bit is the smallest unit of storage in computers. Since computers work in binary (Base 2),
this means that all the important numbers that differentiate between different data sizes will be
powers of 2.
A byte is a collection of 8 bits. Take a moment to examine the table below to get a feel for the
difference between data measurements and their relative sizes to one another.
The size of the dataset you’re working with usually determines which tool, spreadsheets or SQL,
is best suited for the task. Spreadsheets often start to have performance issues as dataset sizes
increase beyond a few megabytes. SQL databases are much better at working with larger
datasets that have billions of rows with sizes measured in gigabytes. The dataset’s size still
matters here--larger datasets will take longer for queries to complete, depending on the query’s
content and the number of rows SQL has to process to complete the query.
You’ll now discover for yourself how these runtimes change with dataset size by running some
queries on a huge dataset—Wikipedia!
1. Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that
instead. On the BigQuery page, click the Go to BigQuery button.
Note: BigQuery Sandbox frequently updates its user interface. The latest changes may
not be reflected in the screenshots presented in this activity, but the principles remain
the same. Adapting to changes in software updates is an essential skill for data
analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to
your community of learners on the discussion forum for help.
2. If you have never created a BigQuery project before, click CREATE PROJECT on the right
side of the screen. If you have created a project before, you can use an existing one or create a
new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.
3. Name your project something that will help you identify it later. You can give it a unique project
ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know
what to put.
4. Now, you’ll see the Editor interface. In the middle of the screen is a window where you can
type code, and to the left is the Explorer menu where you can search for datasets.
5. Copy and paste the following query into the editor and run it. The formatting is just cosmetic,
so don’t worry if it changes when copied over. The query should take 10-15 seconds to run:
13
views DESC;
Note: This query sorts and filters a dataset. You don't need to understand each detail yet.
Coming up, you will learn what each part of this query means and how to use its functions in your
own work.
After the query finishes, your screen should appear like this:
This query returns a table that displays the total number of times each Wikipedia page with
“Google” in the title has been viewed in each language. Note the information that BigQuery
provides on the query you just ran. As you can infer from the dataset’s title in the query, this
dataset is a sample consisting of 10 billion rows from the Wikipedia public dataset.
You’ll find that the query processes over 415 gigabytes of data when run—pretty impressive for
15 seconds! Note that if you run the query again, the runtime will be almost instant (as long as
you haven’t changed the default caching settings). This is because BigQuery caches the query
results to avoid extra work if the query needs to be rerun.