Big Query Help

This document provides an overview of an activity where the learner will query large datasets using SQL to understand how data size affects query runtime. It begins by defining units of digital data storage from bits to zettabytes. The learner is then instructed to log into BigQuery and run a sample SQL query on a 10 billion row sample of Wikipedia page view data. The query takes 10-15 seconds to run and processes over 415 gigabytes of data, demonstrating how SQL can efficiently query very large datasets.



Activity overview

In previous activities, you learned about and practiced SQL. In this activity, you’ll work with SQL
queries of different sizes.

By the time you complete this activity, you’ll be familiar with the different sizes used to measure
data storage. This will help you understand how data size affects the amount of time queries take
to run and how valuable tools like SQL can be to data analysts. 

Understand how data is measured

Data is measured by the number of bits it takes to represent it. All information in a computer can
be represented as a binary number consisting solely of 0s and 1s. Each 0 or 1 in a number is a
bit. A bit is the smallest unit of storage in computers. Because computers work in binary (base 2),
the numbers that distinguish one data size from the next are all powers of 2.

A byte is a collection of 8 bits. Take a moment to examine the table below to get a feel for the
difference between data measurements and their relative sizes to one another. 

Unit       Equivalent to    Abbreviation   Real-World Example
Byte       8 bits           B              1 character of text
Kilobyte   1024 bytes       KB             A page of text
Megabyte   1024 Kilobytes   MB             1 song in MP3 format
Gigabyte   1024 Megabytes   GB             ~300 songs in MP3 format
Terabyte   1024 Gigabytes   TB             ~500 hours of HD video
Petabyte   1024 Terabytes   PB             10 billion Facebook photos
Exabyte    1024 Petabytes   EB             ~500 million hours of HD video
Zettabyte  1024 Exabytes    ZB             All the data on the internet
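
To get a feel for how these powers of 2 multiply out, here is a small query you can paste into the
BigQuery editor you'll open later in this activity. The column names are just illustrative labels,
not part of any dataset:

SELECT
  CAST(POW(2, 10) AS INT64) AS bytes_per_kilobyte,  -- 1,024
  CAST(POW(2, 20) AS INT64) AS bytes_per_megabyte,  -- 1,048,576
  CAST(POW(2, 30) AS INT64) AS bytes_per_gigabyte,  -- 1,073,741,824
  415 * CAST(POW(2, 30) AS INT64) AS bytes_in_415_gigabytes;  -- the size of the query you'll run below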


The amount of data in the world is growing at an incredible pace every year. This growth is
largely driven by the more than 4.6 billion people around the world connected to the Internet.
Now that smartphones and other Internet-connected devices are commonplace, they generate a
staggering amount of new data. Many experts believe that the size of all the data on the Internet
will swell to 175 ZB by the end of 2025!

The size of the dataset you're working with usually determines which tool, spreadsheets or SQL,
is best suited for the task. Spreadsheets often start to have performance issues as dataset sizes
grow beyond a few megabytes. SQL databases are much better at working with larger datasets
that have billions of rows and sizes measured in gigabytes. Dataset size still matters here:
larger datasets take longer to query, depending on the query's content and the number of rows
SQL has to process to complete it.
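
Because BigQuery stores tables by column and reports how much data each query scans, one practical
habit is to select only the columns you need. A hypothetical sketch (the table name here is a
placeholder, not one used in this activity):

-- Scans every column in the table: the most data processed.
SELECT *
FROM `my_project.my_dataset.my_table`;

-- Scans only two columns: fewer bytes processed, even for the same rows.
SELECT title, views
FROM `my_project.my_dataset.my_table`;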

Query a large dataset

You’ll now discover for yourself how these runtimes change with dataset size by running some
queries on a huge dataset—Wikipedia!

1. Log in to BigQuery Sandbox. If you have a free trial version of BigQuery, you can use that
instead. On the BigQuery page, click the Go to BigQuery button.

 Note: BigQuery Sandbox frequently updates its user interface. The latest changes may
not be reflected in the screenshots presented in this activity, but the principles remain
the same. Adapting to changes in software updates is an essential skill for data
analysts, and it’s helpful for you to practice troubleshooting. You can also reach out to
your community of learners on the discussion forum for help.

2. If you have never created a BigQuery project before, click CREATE PROJECT on the right
side of the screen. If you have created a project before, you can use an existing one or create a
new one by clicking the project dropdown in the blue header bar and selecting NEW PROJECT.

3. Name your project something that will help you identify it later. You can give it a unique project
ID or use an auto-generated one. Don’t worry about selecting an organization if you don’t know
what to put.

4. Now, you’ll see the Editor interface. In the middle of the screen is a window where you can
type code, and to the left is the Explorer menu where you can search for datasets.

5. Copy and paste the following query into the editor and run it. The formatting is just cosmetic,
so don’t worry if it changes when copied over. The query should take 10-15 seconds to run:

SELECT
  language,
  title,
  SUM(views) AS views
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
  title LIKE '%Google%'
GROUP BY
  language,
  title
ORDER BY
  views DESC;

Note: This query sorts and filters a dataset. You don't need to understand each detail yet.
Coming up, you will learn what each part of this query means and how to use its functions in your
own work.

After the query finishes, examine the results and the job information BigQuery reports.

This query returns a table that displays the total number of times each Wikipedia page with
“Google” in the title has been viewed in each language. Note the information that BigQuery
provides on the query you just ran. As you can infer from the dataset’s title in the query, this
dataset is a sample consisting of 10 billion rows from the Wikipedia public dataset. 
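
If you're curious in the meantime, here is one way to read that query clause by clause (the
comments are added for this overview; the query itself is unchanged):

SELECT
  language,              -- the Wikipedia language edition
  title,                 -- the page title
  SUM(views) AS views    -- total views, added up across rows
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10B`  -- the 10-billion-row sample
WHERE
  title LIKE '%Google%'  -- keep only titles containing "Google"
GROUP BY
  language, title        -- one result row per language/title pair
ORDER BY
  views DESC;            -- most-viewed pages first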

You’ll find that the query processes over 415 gigabytes of data when run—pretty impressive for
15 seconds! Note that if you run the query again, the runtime will be almost instant (as long as
you haven’t changed the default caching settings). This is because BigQuery caches the query
results to avoid extra work if the query needs to be rerun.
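
One way to see the cache in action, assuming the cache is keyed on the exact query text (so that
even an added comment changes it):

-- Rerun the activity's query unchanged and the cached result returns
-- almost instantly. Adding a comment like this one alters the query
-- text, so BigQuery reprocesses the data instead of using the cache.
SELECT language, title, SUM(views) AS views
FROM `bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE title LIKE '%Google%'
GROUP BY language, title
ORDER BY views DESC;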
