Activity Overview - Course 3 Module 3 Google Data Analytics

Activity overview

Recently, you’ve been thinking about identifying good data sources that would be useful for analysis. You also spent some time in a previous activity exploring a public dataset in BigQuery and writing some basic SQL queries. In addition to using public data in BigQuery, your future data career will involve importing data from other sources. In this activity, you will create a new dataset, load your own data into a custom table, and query it.

By the time you complete this activity, you will be able to load your own data into BigQuery for analysis. Importing your own data sources is a skill that will help you analyze data from different sources more effectively.

What you will need

To get started, download the baby names data zip file. This file contains about 7 MB of data about
popular baby names from the U.S. Social Security Administration website.

Select the link to the baby names data zip file to download it.

Link to baby names data: names.zip

Create a custom table

Once you have downloaded the zip file, import its data into BigQuery to query and analyze. To do that, create a new dataset and a custom table.

Step 1: Unzip the file

Unzip the downloaded file on your computer so you can upload its contents to BigQuery. Inside the unzipped folder, you will find a NationalReadMe.pdf file that contains more information about the dataset. This dataset tracks the popularity of baby names for each year; the text files are labeled by the year they contain. Open yob2014.txt to preview the data. You will notice that it’s a comma-separated values (CSV) file with three columns. Remember where you saved this folder so you can reference it later.

Step 2: Create a dataset

Before uploading your .txt file and creating a table to query, you will need to create a dataset to upload
your data into and store your tables.

1. From the BigQuery console, go to the Explorer pane in your workspace and select the three dots to
open a menu. From here, select Create dataset.
2. This will open the Create dataset menu. This is where you will fill out information about the dataset.
Input the Dataset ID as babynames and set the Data location to Multi-region (US). Once you have
finished filling out this information, select the blue CREATE DATASET button at the bottom of the
menu.
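If you prefer to work in the query editor, you can also create a dataset with a SQL DDL statement. This is just an optional alternative to the console menu steps above; the result is the same.

CREATE SCHEMA IF NOT EXISTS babynames
OPTIONS (
  location = 'US'  -- corresponds to the Multi-region (US) data location selected in the console
);
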
Step 3: Create a table

Now that you have a custom dataset stored in your project space, this is where you will add your table.

1. Select the newly created babynames dataset. Check the tabs in your Dataset info window and select
the first blue + CREATE TABLE button. This will open another menu in your console.
2. In the Source section, select the Upload option under Create table from. Then select Browse to open your files. Find and open the yob2014.txt file. Set the file format to CSV. In the Destination section, in the Table text box, name your table names_2014. For Schema, select Edit as text and input the following code:

name:string,
gender:string,
count:integer

This will establish the data types of the three columns in the table. Leave the other parameters as they
are, and select Create table.
3. Once you have created your table titled names_2014, it will appear in your explorer pane under the
dataset babynames that you created earlier.
Select the table to open it in your workspace. Here, you can check the table schema. Then, go to the
Preview tab to explore your data. The table should have three columns: name, gender, and count.
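
To spot-check the upload, you could also run a quick query against the new table. Replace your project name with your own BigQuery project name, just as described in the next section.

SELECT
  *
FROM
  `your project name.babynames.names_2014`
LIMIT
  10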

Query your custom table

Now that your table is set up, you’re ready to start writing queries and answering questions about this
data. For example, let’s say you were interested in the top five baby names for boys in the United States
in 2014.

Select COMPOSE NEW QUERY to start a new query for this table. Then, copy and paste this code:

SELECT
  name,
  count
FROM
  `your project name.babynames.names_2014`
WHERE
  gender = 'M'
ORDER BY
  count DESC
LIMIT
  5

NOTE: Making sure that your FROM statement is correct is key to making this query run! The database
needs the query to tell it the location of the table you just uploaded so that it can fetch the data. It’s like
giving the query a map to your table. That map will include your unique BigQuery project name, the
dataset name (“babynames”), and the table name (“names_2014”). The location names for each of these
elements are separated by periods. Then, wrap the whole location in backticks. The final result will be
something like this:

`loyal-glass-371423.babynames.names_2014`

Note that loyal-glass-371423 is just an example of a project name; your FROM statement will use your own project’s name instead.

This query selects the name and count columns from the names_2014 table. Using the WHERE clause,
you are filtering for a specific gender for your results. Then, you’re sorting how you want your results to
appear with ORDER BY. Because you are ordering by the count in descending order, you will get names
and the corresponding count from largest to smallest. Finally, LIMIT tells SQL to only return the top five
most popular names and their counts.

Once you have input this in your console, select RUN to get your query results.
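
As a quick variation, you could change the WHERE filter to return the top five girls’ names instead; the project name is again a placeholder for your own:

SELECT
  name,
  count
FROM
  `your project name.babynames.names_2014`
WHERE
  gender = 'F'
ORDER BY
  count DESC
LIMIT
  5
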
Set up your data

1. Log in to BigQuery Sandbox. On the BigQuery page, click the Go to BigQuery button.

If you have a free-of-charge trial version of BigQuery, you can use that instead.

Note: BigQuery Sandbox frequently updates its user interface. The latest changes may not be reflected
in the screenshots presented in this activity, but the principles remain the same. Adapting to changes in
software updates is an essential skill for data analysts, and it’s helpful for you to practice
troubleshooting. You can also reach out to your community of learners on the discussion forum for help.

2. If you haven’t done so already, create a BigQuery project. (If you have a project, select it in the
Explorer pane.)

a. In the BigQuery console, select the dropdown list to the right of the Google Cloud logo to open the
Select a project dialog box.

b. In the Select a project dialog box, select the CREATE PROJECT button.

c. Give your project a name that will help you identify it later. You can enter a unique project ID or use the auto-generated one. You do not need to select an organization.

d. Select the CREATE button to create the project.

3. The three main sections of BigQuery are now onscreen: the BigQuery navigation menu; the Explorer
pane, which you can use to search for public datasets and open projects; and the Details pane, which
shows details of the database or dataset you’ve selected in the Explorer pane and displays windows for
you to enter queries.

Notice that you can use the <| symbol in the BigQuery navigation menu section to collapse it. There is a
similar symbol to collapse the Explorer pane.
Choose a dataset

Follow these steps to find and select the NYC Trees dataset for this activity:

1. In the Explorer pane, select the + ADD button.

2. In the Add box that pops up, scroll down the Additional sources list. Select Public Datasets.

3. A new box opens where you can search public datasets that are available through Google Cloud. In the
Search Marketplace text box, search for New York City Trees.
4. Select the search result NYC Street Trees, then select the View Dataset button.

Screenshot: The NYC Street Trees dataset page, with a link to the City of New York, the subheading New York City Street Tree Census data, and a View dataset button.
5. Google Cloud opens a new browser tab displaying BigQuery with the bigquery-public-data collection
open in the Explorer pane. To ensure the bigquery-public-data database remains in your project’s
Explorer pane, select the star next to the dataset.
6. The BigQuery Details pane contains information about the new_york_trees dataset. This information
includes the date the dataset was created, when it was last modified, and the Dataset ID.

Screenshot: The Details pane displays the new_york_trees dataset description, including the dataset ID, creation time, default table expiration, last modified time, data location, description, default collation, default rounding mode, case insensitivity, labels, and tags.
Choose a table

1. In the Explorer pane, select the arrow next to the new_york_trees dataset to display the tables it
contains.

Note: If the new_york_trees dataset is not in the Explorer pane, type new_york_trees into the Search
text box in the Explorer pane. (This will work if you have pinned bigquery-public-data in the Explorer
pane.) If search doesn’t return the needed results, follow the steps above to search for the
new_york_trees dataset.

2. Notice that the new_york_trees dataset contains three tree census tables from 1995, 2005, and 2015. It
also contains a table that lists the tree species.
Screenshot: In the Explorer pane, bigquery-public-data is open and the new_york_trees dataset is expanded to show its tables: tree_census_1995, tree_census_2005, tree_census_2015, and tree_species.
These are all tables contained in the dataset. Now, examine the data for all trees cataloged in New York
City for three specific years.

3. Select the tree_census_2005 table. BigQuery displays the table’s structure in the Details pane.

4. In the Details pane, select Query > In new tab to open a new query window.
5. Notice that BigQuery populates the Query Window with a SELECT query. This query is incomplete
because it doesn’t contain anything in between SELECT and FROM.

Query the data

This SELECT statement in the query window is incomplete because the columns to display have not been specified. So, either list the columns separated by commas or use an asterisk to have BigQuery return all of the columns in the table.

1. Type an asterisk * after the SELECT command in line one of the Query Editor. Your query should
now read SELECT * FROM followed by your table location. This command tells BigQuery to return all
of the columns in the tree_census_2005 table.
2. In the Query Editor, select the Run button to run the query. The results will be displayed as a table in
the Query Results pane below the Query Editor.

Screenshot: Query results showing populated columns, including row, object ID, cen_year, tree_dbh, tree_loc, pit_type, soil_lvl, status, spc_latin, and spc_common.
This query returns all columns for the first 1,000 rows from the table. BigQuery returns only the first 1,000
rows because the SELECT query includes a LIMIT 1000 clause. This limits the rows returned to reduce
the processing time required.
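
The completed query should look something like this:

SELECT
  *
FROM
  `bigquery-public-data.new_york_trees.tree_census_2005`
LIMIT
  1000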

3. Next, write a query to find out the average diameter of all NYC trees in 2005. On line 1, replace the *
after the SELECT command with AVG(tree_dbh). Select the Run button to execute the query.

This returns your answer, 12.833 (which means the average diameter of NYC trees in 2005 was 12.833
inches).
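
After that change, the query looks something like this (the generated LIMIT 1000 clause can stay or be removed, since an aggregate query returns a single row):

SELECT
  AVG(tree_dbh)
FROM
  `bigquery-public-data.new_york_trees.tree_census_2005`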

Write your own queries

Now, come up with some questions and answer them with your own SQL queries. For example, query
the 1995 and the 2015 tables to find the average diameter of trees. You can then compare the average
diameter of the trees in all three datasets to determine whether the trees in NYC have grown on average.
Note that the field name for tree diameter in the tree_census_1995 table is diameter.
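
As a sketch of that comparison, you might run queries like the following. The tree_dbh column name for the 2015 table is an assumption based on the 2005 table, so confirm it in the table’s schema first.

-- Average tree diameter in 1995 (this table names the column diameter)
SELECT
  AVG(diameter)
FROM
  `bigquery-public-data.new_york_trees.tree_census_1995`

-- Average tree diameter in 2015 (assumes the column is tree_dbh, as in the 2005 table)
SELECT
  AVG(tree_dbh)
FROM
  `bigquery-public-data.new_york_trees.tree_census_2015`
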
Terms and definitions for Course 3, Module 3
Administrative metadata: Metadata that indicates the technical source of a digital asset

CSV (comma-separated values) file: A delimited text file that uses a comma to separate values

Data governance: A process for ensuring the formal management of a company’s data assets

Descriptive metadata: Metadata that describes a piece of data and can be used to identify it at a later point
in time

Foreign key: A field within a database table that is a primary key in another table (Refer to primary key)

FROM: The section of a query that indicates where the selected data comes from

Geolocation: The geographical location of a person or device by means of digital information

Metadata: Data about data

Metadata repository: A database created to store metadata

Naming conventions: Consistent guidelines that describe the content, creation date, and version of a file in
its name

Normalized database: A database in which only related data is stored in each table

Notebook: An interactive, editable programming environment for creating data reports and showcasing data
skills

Primary key: An identifier in a database that references a column in which each value is unique (Refer to
foreign key)

Redundancy: When the same piece of data is stored in two or more places

Schema: A way of describing how something, such as data, is organized

SELECT: The section of a query that indicates the subset of a dataset

Structural metadata: Metadata that indicates how a piece of data is organized and whether it is part of one
or more than one data collection

WHERE: The section of a query that specifies criteria that the requested data must meet

World Health Organization: An organization whose primary role is to direct and coordinate international
health within the United Nations system
