HANDS-ON LAB GUIDE FOR DATA LAKE
To be used with the Snowflake free 30-day trial at:
https://trial.snowflake.com
Table of Contents
Lab Overview
Module 1: Setup
Module 2: External Table
Module 3: Materialized View on External Table
Module 4: Data Lake Export
Lab Overview
This hands-on lab is intended to help you get up to speed on the features available in
Snowflake to augment your cloud data lake. This lab does not require that you have an
existing data lake; all data will be provided in a publicly available cloud storage location.
The datasets we’ll be using contain trip data from Citi Bike, the bike-share system in
New York.
Target Audience
Data engineers, Database and Cloud architects, and Data Warehouse administrators
Prerequisites
● Access to a free 30-day Snowflake trial environment
● Basic knowledge of SQL, and database concepts and objects
● Familiarity with CSV comma-delimited files and JSON semi-structured data
Module 1: Setup
1.1 Steps to Prepare Your Lab Environment
If not yet done, register for a Snowflake free 30-day trial at https://trial.snowflake.com
● The Snowflake edition (e.g., Standard or Enterprise), cloud provider (e.g., AWS or
Azure), and region (e.g., US East or EU) do not matter for this lab. We do suggest you
select the region that is physically closest to you, and select the Enterprise edition
so you can leverage some advanced capabilities, such as materialized views, that are
not available in the Standard edition.
● After registering, you will receive an email with an activation link and your Snowflake
account URL. Bookmark this URL for easy future access. After activation, you will
create a username and password; write down these credentials.
Resize your browser windows so you can view this lab guide PDF and your web
browser side-by-side to more easily follow the lab instructions. If possible, even better is
to use a secondary display dedicated to the lab guide.
Click on this link and download the “lab_scripts_DataLake.sql” file to your local
machine. This file contains pre-written SQL commands and we will use this file later in
the lab. Note: attempting to copy and paste the SQL scripts from this lab guide can
result in syntax errors. Please open the SQL file in a text editor, and copy the
commands from there.
Start by navigating to the Worksheets tab and creating four new, empty worksheets. Name
them “Module 1” through “Module 4”, as shown below.
The SQL command file is broken up into 4 separate modules. Copy each section of
code from the SQL file into its corresponding worksheet tab. When all four worksheet
tabs are filled, you can close the text editor.
Navigate back to the Module 1 worksheet, and we’ll begin executing each statement. To
execute a single statement, just position the cursor anywhere within the statement and
click the Run button. To execute several statements, they must be highlighted through
the final semi-colon prior to clicking the Run button.
Create the empty CITIBIKE database that will be used by this lab.
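A minimal sketch of this step (the DEMO and UTILS schemas shown here are assumptions based on the objects referenced later in the lab; the lab script has the exact statements):

create or replace database CITIBIKE;
create schema if not exists CITIBIKE.DEMO;   -- used from Module 2 onward
create schema if not exists CITIBIKE.UTILS;  -- will hold the external function created next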
Create an external function that is bound to a REST-based API.
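The CREATE statements for this step are in the lab script. As a hedged sketch of the general shape only (the integration name, IAM role ARN, and endpoint URL below are placeholders, not values from the lab):

-- An external function needs an API integration that points at a deployed
-- proxy endpoint (e.g., AWS API Gateway). Placeholders shown here.
create or replace api integration fetch_http_data_int
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::<account-id>:role/<role-name>'
  api_allowed_prefixes = ('https://<api-id>.execute-api.<region>.amazonaws.com/')
  enabled = true;

-- The function forwards a URL to the proxy endpoint and returns the
-- HTTP response as a VARIANT.
create or replace external function citibike.utils.fetch_http_data(url varchar)
  returns variant
  api_integration = fetch_http_data_int
  as 'https://<api-id>.execute-api.<region>.amazonaws.com/prod/fetch_http_data';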
-- Call the external function to fetch the two GBFS feeds and land the
-- raw JSON payloads in a table
create or replace table GBFS_JSON as
select
    $1 data,
    $2 url,
    citibike.utils.fetch_http_data( url ) payload,
    current_timestamp() row_inserted
from (values
    ('stations', 'https://gbfs.citibikenyc.com/gbfs/en/station_information.json'),
    ('regions',  'https://gbfs.citibikenyc.com/gbfs/en/system_regions.json'));
-- Now refine that raw JSON data by extracting out the STATIONS nodes
create or replace table STATION_JSON as
with s as (
    select payload, row_inserted
    from gbfs_json
    where data = 'stations'
      and row_inserted = (select max(row_inserted) from gbfs_json)
)
select
    value station_v,
    payload:response.last_updated::timestamp last_updated,
    row_inserted
from s,
lateral flatten (input => payload:response.data.stations);
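Note: the STATIONS_VW view created next also reads from a REGION_JSON table, whose creation statement is in the lab script. A sketch by analogy with STATION_JSON (the exact JSON path to the regions array is an assumption):

create or replace table REGION_JSON as
with r as (
    select payload, row_inserted
    from gbfs_json
    where data = 'regions'
      and row_inserted = (select max(row_inserted) from gbfs_json)
)
select
    value region_v,
    payload:response.last_updated::timestamp last_updated,
    row_inserted
from r,
lateral flatten (input => payload:response.data.regions);  -- assumed path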
create or replace view STATIONS_VW as
with s as (
    select * from station_json
    where row_inserted = (select max(row_inserted) from station_json)
),
r as (
    select * from region_json
    where row_inserted = (select max(row_inserted) from region_json)
)
select
    station_v:station_id::number station_id,
    station_v:name::string station_name,
    station_v:lat::float station_lat,
    station_v:lon::float station_lon,
    station_v:station_type::string station_type,
    station_v:capacity::number station_capacity,
    station_v:rental_methods rental_methods,
    region_v:name::string region_name
from s
left outer join r
  on station_v:region_id::integer = region_v:region_id::integer;
Snowflake has provided the Citibike TRIPS data in an AWS S3 bucket. The data files are in
Parquet format and are partitioned into folders by year. The bucket URL is:
s3://snowflake-corp-citibike-demo-master/V2/external/
Select the worksheet labeled “Module 2”. The first step is to set our context to the
Citibike database and the Demo schema.
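In SQL terms that looks like the following (assuming the DEMO schema was created during the Module 1 setup):

use database CITIBIKE;
use schema DEMO;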
create or replace stage CITIBIKE_STAGE
url = 's3://snowflake-corp-citibike-demo-master/V2/external/'
credentials = (aws_key_id = 'AKIAXSJGLDPF3ZICVOPB'
aws_secret_key = 'pfUJgyX4ukVmfMQr0qXLZ8Vmh5Q2yVnXXkr+CQQZ')
file_format=(type=parquet);
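Before defining any external table, you can query the staged Parquet files directly. A hedged example of the kind of statement that produces the results described below (the exact query in the lab script may differ):

-- Each $1 value is one Parquet row, returned as a semi-structured object
select $1 from @citibike_stage limit 100;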
Click on row 1 in the results pane. This opens a dialog box showing the contents of a
single row from the Parquet file.
Let’s create a basic external table on the Parquet files in the bucket.
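The statement is in the lab script; a minimal sketch of a basic external table over the stage (the table name is illustrative):

create or replace external table TRIPS_XT
  location = @citibike_stage
  file_format = (type = parquet);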
In this basic external table definition, there are only two columns available - a variant
named VALUE that contains the file data, and a pseudocolumn called
metadata$filename.
A single column named value is not going to be very user-friendly. It would be much
more useful to break out the individual fields into separate columns. And performance
against this external table will be less than optimal, because the query engine can’t
effectively prune out unneeded files from the execution plan. We can introduce a
partitioning scheme that makes the query engine much more efficient.
Let’s create a new external table on the same set of Parquet files, but this time we’ll
define each column separately, and partition the files on the date portion of their folder
name.
-- 2.1.5 Create a new external table on the same set of Parquet
-- files, but this time we’ll define each column
-- separately, and partition the files on the date
-- portion of their folder names.
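A hedged sketch of the shape of that definition, showing only a few of the columns (the full column list, and the exact expression that derives STARTDATE from the folder name, are in the lab script and depend on the bucket's folder layout):

create or replace external table TRIPS_BIG_XT (
    startdate        date    as to_date(split_part(metadata$filename, '/', 1)),  -- assumed folder position
    tripduration     integer as (value:tripduration::integer),
    start_station_id integer as (value:start_station_id::integer),
    end_station_id   integer as (value:end_station_id::integer)
  )
  partition by (startdate)
  location = @citibike_stage
  file_format = (type = parquet);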
Notice that every column definition consists of three parts: the column name, its
datatype, and the transformation clause following the “as” keyword. The most basic
transformation clause is just a reference to an element in the file as
value:”elementName”, followed by an explicit datatype casting as ::datatype.
Let’s see what the data looks like in the new external table.
-- 2.1.6 Let’s see what the data looks like in the new
-- external table.
select * from trips_big_xt limit 100;
Now it’s time to start querying. This first query will be relatively slow, as we are not
filtering on the partition column.
-- 2.1.7 Aggregate over the full table (no partition filter)
select
    start_station_id,
    count(*) num_trips,
    avg(tripduration) avg_duration
from trips_big_xt
group by 1
order by 2 desc;

-- 2.1.8 The same aggregation, filtered on the STARTDATE partition column
select
    start_station_id,
    count(*) num_trips,
    avg(tripduration) avg_duration
from trips_big_xt
where startdate between to_date('2014-01-01') and to_date('2014-06-30')
group by 1;
Open the History panel (next to the Results pane) to compare the query profile for the
two queries we just executed.
Query 2.1.7 – no partition filter vs. Query 2.1.8 – partition filter on STARTDATE
In query 2.1.7, the query engine was forced to read all 8,913 files and “spilled” nearly
3.15GB to local storage. The second query was able to effectively prune out over 90%
of the file reads and only spilled 148MB to local storage. Consequently, the second
query executed about 90% faster.
An external table acts just like a normal table, so we can join it with other tables.
-- 2.1.9 External tables act just like regular tables,
--       so we can join them with other tables
with t as (
    select
        start_station_id,
        end_station_id,
        count(*) num_trips
    from trips_big_xt
    where startdate between to_date('2014-01-01') and to_date('2014-12-30')
    group by 1, 2
)
select
    ss.station_name start_station,
    es.station_name end_station,
    num_trips
from t
inner join stations_vw ss on t.start_station_id = ss.station_id
inner join stations_vw es on t.end_station_id = es.station_id
order by 3 desc;
In this section, we’ll explore two important options for External Tables. There will not be any
additional scripts to execute, since demonstrating them in action would require access to an
AWS Console.
REFRESH_ON_CREATE = TRUE | FALSE
Specifies whether to automatically refresh the external table metadata once, immediately
after the external table is created. The metadata for an external table is the list of files
that exist in the specified storage location. Setting this option to FALSE essentially
creates an “empty” external table definition. The metadata must then be refreshed
manually by executing ALTER EXTERNAL TABLE <table_name> REFRESH;
Default: TRUE
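For example, a hedged sketch of deferring the metadata load and refreshing it on demand (the table name is illustrative):

create or replace external table TRIPS_DEFERRED_XT
  location = @citibike_stage
  refresh_on_create = false
  file_format = (type = parquet);

-- Load the file list later, on demand
alter external table TRIPS_DEFERRED_XT refresh;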
Module 3: Materialized Views over External Tables
External tables offer the ability to have an SQL interface on top of a file-based data lake,
without having to maintain an additional copy of the data in the Snowflake storage layer.
Automatic refresh can keep the external metadata in sync with the contents of the storage
location, eliminating many complex data engineering workflows. Defining the external table
with an effective partitioning scheme can greatly improve query performance against external
tables.
Materialized Views are pre-computed data sets derived from a query specification (the
SELECT in the view definition) and stored for later use. Because the data is pre-computed,
querying a materialized view is faster than executing the original query.
Combining these two techniques, i.e., creating a Materialized View over an External Table,
provides optimal SQL query performance while keeping the original data source in an
external stage.
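The CREATE statement for the materialized view is in the lab script (step 3.1.1). A hedged sketch that is consistent with the queries below, which reference a pre-aggregated NUM_TRIPS column:

create or replace materialized view TRIPS_MV as
select
    startdate,
    start_station_id,
    end_station_id,
    count(*) num_trips
from trips_big_xt
group by startdate, start_station_id, end_station_id;

The first query simply confirms how many rows, and how many underlying trips, the materialized view holds: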
select
count(*) num_rows,
sum(num_trips) num_trips
from trips_mv;
3.1.2 Let’s re-run our join query, replacing the external table with the new materialized view.
-- 3.1.2 Let’s re-run our join query, replacing the
-- external table with the new materialized view.
with t as (
    select
        start_station_id,
        end_station_id,
        sum(num_trips) num_trips
    from trips_mv
    where startdate between to_date('2014-01-01') and to_date('2014-12-30')
    group by 1, 2
)
select
    ss.station_name start_station,
    es.station_name end_station,
    num_trips
from t
inner join stations_vw ss on t.start_station_id = ss.station_id
inner join stations_vw es on t.end_station_id = es.station_id
order by 3 desc;
3.1.3 Open the History panel and review the query profile. Look how much faster this was
than the same query against the External Table.
Query 2.1.9 – external table vs. Query 3.1.2 – materialized view
Module 4: Data Lake Export
4.1.1 Let’s create an internal stage to unload the data into. We use an internal stage here
to simplify the demonstration, in case you do not have access to your own S3 bucket.
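A hedged sketch of such an internal stage (the stage name is an assumption; the lab script has the exact statement):

create or replace stage CITIBIKE_EXPORT_STAGE
  file_format = (type = parquet);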
Optional: Here are the commands to create an external stage on S3 for those who have
access to an AWS account and can create an S3 bucket. This section creates a Snowflake
Storage Integration object, which encapsulates the AWS IAM credentials needed to access
the storage location. Snowflake documentation on the creation and use of Storage
Integrations can be found here: https://docs.snowflake.com/en/user-guide/data-load-s3-config.html
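If you go that route, the general shape is sketched below; the integration name, IAM role ARN, and bucket URL are placeholders you would replace with your own values after completing the IAM setup described in the documentation:

create or replace storage integration citibike_export_int
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::<account-id>:role/<role-name>'
  storage_allowed_locations = ('s3://<your-bucket>/citibike-export/');

create or replace stage CITIBIKE_EXPORT_S3_STAGE
  url = 's3://<your-bucket>/citibike-export/'
  storage_integration = citibike_export_int
  file_format = (type = parquet);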
show stages;
4.1.2 Unload the data to the internal stage, partitioning it by date, and placing a cap on the file
size.
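A hedged sketch of such an unload (the stage name, target path, partition expression, and size cap are illustrative; the lab script has the exact statement):

copy into @citibike_export_stage/trips/
from (select * from trips_mv)
partition by (to_varchar(startdate, 'YYYY-MM-DD'))
file_format = (type = parquet)
max_file_size = 32000000;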