
HANDS-ON LAB GUIDE FOR DATA LAKE
To be used with the Snowflake free 30-day trial at:
https://trial.snowflake.com

Works for any Snowflake edition or cloud provider


Approximate duration: 90 minutes. Approximately 7 credits used.
 

Table of Contents

Lab Overview
Module 1: Setup
Module 2: External Tables
Module 3: Materialized Views over External Tables
Module 4: Data Lake Export
Lab Overview
This hands-on lab is intended to help you get up to speed on the features available in
Snowflake to augment your cloud data lake. This lab does not require that you have an
existing data lake. All data will be provided in a publicly available cloud storage location.
The datasets we'll be using contain trip data from the Citibike transportation company in
New York.

Target Audience
Data engineers, Database and Cloud architects, and Data Warehouse administrators

What you'll learn

Features demonstrated in this lab include:

● Query partitioned data in your data lake with External Tables
● Improve performance of queries with Materialized Views on External Tables
● Unload curated data back to your data lake using COPY INTO <location>

Prerequisites
● Access to a free 30-day Snowflake trial environment
● Basic knowledge of SQL, and database concepts and objects
● Familiarity with CSV comma-delimited files and JSON semi-structured data

Module 1: Setup 
1.1 Steps to Prepare Your Lab Environment

If not yet done, register for a Snowflake free 30-day trial at https://trial.snowflake.com
● The Snowflake edition (e.g., Standard, Enterprise), cloud provider (e.g., AWS, Azure),
and region (e.g., US East, EU) do *not* matter for this lab. But we suggest you
select the region which is physically closest to you, and select the Enterprise edition
so you can leverage some advanced capabilities that are not available in the
Standard edition.

● After registering, you will receive an email with an activation link and your Snowflake
account URL. Bookmark this URL for easy, future access. After activation, you will
create a user name and password. Write down these credentials.

Resize your browser windows so you can view this lab guide PDF and your web
browser side-by-side to more easily follow the lab instructions. If possible, even better is
to use a secondary display dedicated to the lab guide.

Click on this link and download the “lab_scripts_DataLake.sql” file to your local
machine. This file contains pre-written SQL commands that we will use later in
the lab. Note: attempting to copy and paste the SQL scripts from this lab guide can
result in syntax errors. Please open the SQL file in a text editor and copy the
commands from there.

1.2 Setup the Environment

Start by navigating to the Worksheets tab and creating four new empty worksheets. Name
them “Module 1” through “Module 4”, as shown below.
The SQL command file is broken up into 4 separate modules. Copy each section of
code from the SQL file into its corresponding worksheet tab. When all four worksheet
tabs are filled, you can close the text editor.

Navigate back to the Module 1 worksheet, and we’ll begin executing each statement. To
execute a single statement, just position the cursor anywhere within the statement and
click the Run button. To execute several statements, highlight them through the final
semicolon before clicking the Run button.

Create a warehouse for use during this lab.

-- 1.2.1 Create a virtual warehouse cluster


use role SYSADMIN;
create or replace warehouse LOAD_WH with
warehouse_size = 'xlarge'
auto_suspend = 300
initially_suspended = true;

use warehouse LOAD_WH;

Create the empty CITIBIKE database that will be used by this lab.

-- 1.2.2 Create the new empty CITIBIKE database


create or replace database CITIBIKE;
create or replace schema DEMO;
create or replace schema UTILS;

Create an external function that calls a REST-based API.

-- 1.2.3 Create an external function call to a REST API


use schema UTILS;
use role accountadmin;

-- Create an API Integration object


create or replace api integration fetch_http_data
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::148887191972:role/ExecuteLambdaFunction'
  enabled = true
  api_allowed_prefixes = ('https://dr14z5kz5d.execute-api.us-east-1.amazonaws.com/prod/fetchhttpdata');

grant usage on integration fetch_http_data to role sysadmin;

-- Now create the external function that uses the API
-- integration object

use role sysadmin;

-- create the function


create or replace external function fetch_http_data(v varchar)
  returns variant
  api_integration = fetch_http_data
  as 'https://dr14z5kz5d.execute-api.us-east-1.amazonaws.com/prod/fetchhttpdata';

Create a few reference tables and populate them with data.

-- 1.2.4 Create a few reference tables and populate them with data.

use schema DEMO;


create or replace table GBFS_JSON (
data varchar,
url varchar,
payload variant,
row_inserted timestamp_ltz);

-- Populate it with raw JSON data through the External Function call

insert into GBFS_JSON
select
  $1 data,
  $2 url,
  citibike.utils.fetch_http_data( url ) payload,
  current_timestamp() row_inserted
from
  (values
    ('stations', 'https://gbfs.citibikenyc.com/gbfs/en/station_information.json'),
    ('regions', 'https://gbfs.citibikenyc.com/gbfs/en/system_regions.json'));

-- Now refine that raw JSON data by extracting out the STATIONS nodes

create or replace table STATION_JSON as
with s as (
  select payload, row_inserted
  from gbfs_json
  where data = 'stations'
    and row_inserted = (select max(row_inserted) from gbfs_json)
)
select
  value station_v,
  payload:response.last_updated::timestamp last_updated,
  row_inserted
from s,
  lateral flatten (input => payload:response.data.stations);

-- extract the individual region records

create or replace table REGION_JSON as
with r as (
  select payload, row_inserted
  from gbfs_json
  where data = 'regions'
    and row_inserted = (select max(row_inserted) from gbfs_json)
)
select
  value region_v,
  payload:response.last_updated::timestamp last_updated,
  row_inserted
from r,
  lateral flatten (input => payload:response.data.regions);

-- Lastly, create a view that "flattens" the JSON into a standard table structure

create or replace view STATIONS_VW as
with s as (
  select * from station_json
  where row_inserted = (select max(row_inserted) from station_json)
),
r as (
  select * from region_json
  where row_inserted = (select max(row_inserted) from region_json)
)
select
  station_v:station_id::number station_id,
  station_v:name::string station_name,
  station_v:lat::float station_lat,
  station_v:lon::float station_lon,
  station_v:station_type::string station_type,
  station_v:capacity::number station_capacity,
  station_v:rental_methods rental_methods,
  region_v:name::string region_name
from s
left outer join r
  on station_v:region_id::integer = region_v:region_id::integer;

Module 2: External Tables

Snowflake has provided the Citibike TRIPS data in an AWS S3 bucket. The data files are in
Parquet format and are partitioned into folders by year. The bucket URL is:
s3://snowflake-corp-citibike-demo-master/V2/external/

2.1 Create an External Table linked to an S3 bucket

Select the worksheet labeled “Module 2”. The first step is to set our context to the
CITIBIKE database and the DEMO schema.

-- 2.1.1 Set context


use schema citibike.demo;
use role SYSADMIN;

Create an external stage that points to the public S3 bucket.

-- 2.1.2 Create an EXTERNAL STAGE that points to the
--       public S3 bucket

create or replace stage CITIBIKE_STAGE
  url = 's3://snowflake-corp-citibike-demo-master/V2/external/'
  credentials = (aws_key_id = 'AKIAXSJGLDPF3ZICVOPB'
                 aws_secret_key = 'pfUJgyX4ukVmfMQr0qXLZ8Vmh5Q2yVnXXkr+CQQZ')
  file_format = (type=parquet);

Let’s see what data is available:

-- 2.1.3 Let’s see what data is available:


-- Show the list of files in the external stage
list @citibike_stage/trips/2019;

-- Let’s take a peek inside the files themselves


select $1 from @citibike_stage/trips/2019 limit 100;

Click on row 1 in the results pane. This opens a dialog box showing the contents of a
single row from the Parquet file.

Let’s create a basic external table on the Parquet files in the bucket.

-- 2.1.4 Create a basic External Table on the Parquet files
--       in the bucket

create or replace external table TRIPS_BASIC_XT
  location = @citibike_stage/trips
  auto_refresh = false
  file_format = (type=parquet);

In this basic external table definition, there are only two columns available - a variant
named VALUE that contains the file data, and a pseudocolumn called
metadata$filename.

select metadata$filename, value
from TRIPS_BASIC_XT
limit 100;

A single column named ​value​ is not going to be very user-friendly. It would be much
more useful to break out the individual fields into separate columns. And performance
against this external table will be less than optimal, because the query engine can’t
effectively prune out unneeded files from the execution plan. We can introduce a
partitioning scheme that makes the query engine much more efficient.

Let’s create a new external table on the same set of Parquet files, but this time we’ll
define each column separately, and partition the files on the date portion of their folder
name.

-- 2.1.5 Create a new external table on the same set of Parquet
--       files, but this time we'll define each column
--       separately, and partition the files on the date
--       portion of their folder names.

create or replace external table TRIPS_BIG_XT (
  tripduration integer as (value:"TRIPDURATION"::integer),
  startdate date as
    to_date(split_part(metadata$filename, '/', 4) || '-' ||
            split_part(metadata$filename, '/', 5) || '-01'),
  starttime timestamp as (value:"STARTTIME"::timestamp),
  stoptime timestamp as (value:"STOPTIME"::timestamp),
  start_station_id integer as
    (case when value:"START_STATION_ID" = 'NULL' then null
          else value:"START_STATION_ID"::integer end),
  end_station_id integer as
    (case when value:"END_STATION_ID" = 'NULL' then null
          else value:"END_STATION_ID"::integer end),
  bikeid integer as (value:"BIKEID"::integer),
  usertype string as (value:"USERTYPE"::string),
  birth_year integer as
    (case when value:"BIRTH_YEAR" = 'NULL' then null
          else value:"BIRTH_YEAR"::integer end),
  gender integer as (value:"GENDER"::integer),
  program_id integer as (value:"PROGRAM_ID"::integer)
)
partition by (startdate)
location = @citibike_stage/trips
auto_refresh = false
file_format = (type=parquet);

Notice that every column definition consists of three parts: the column name, its
datatype, and the transformation clause following the "as" keyword. The most basic
transformation clause is just a reference to an element in the file as
value:"elementName", followed by an explicit datatype cast as ::datatype.
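
For example, a virtual column that surfaces just the hour of the trip start could be defined with the same three-part pattern (a hypothetical column for illustration only; it is not part of the lab's TRIPS_BIG_XT definition):

-- Hypothetical example, not in the lab script: a derived column using the
-- name / datatype / "as" transformation pattern over the VALUE variant
start_hour integer as (hour(value:"STARTTIME"::timestamp))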

Let’s see what the data looks like in the new external table.

-- 2.1.6 Let’s see what the data looks like in the new
-- external table.
select * from trips_big_xt limit 100;

Now it’s time to start querying. This first query will be relatively slow, as we are not
filtering on the partition column.

-- 2.1.7 Now it's time to start running some real queries.

select
start_station_id,
count(*) num_trips,
avg(tripduration) avg_duration
from trips_big_xt
group by 1
order by 2 desc;

Let’s add the partition column to the filter.

-- 2.1.8 Add the partition column into the query

select
start_station_id,
count(*) num_trips,
avg(tripduration) avg_duration
from trips_big_xt
where startdate between to_date('2014-01-01') and
to_date('2014-06-30')
group by 1;

Open the History panel (next to the Results pane) to compare the query profile for the
two queries we just executed.

Query Profile comparison: Query 2.1.7 (no partition filter) vs. Query 2.1.8 (partition filter on startdate)

In query 2.1.7, the query engine was forced to read all 8,913 files and "spilled" nearly
3.15GB to local storage. The second query was able to effectively prune out over 90%
of the file reads and spilled only 148MB to local storage. Consequently, the second
query executed about 90% faster.
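
If you prefer to compare the two runs in SQL rather than in the Query Profile UI, a query like the one below surfaces the same pruning and spill metrics. This is an optional sketch, not part of the lab script; it assumes your role can read the SNOWFLAKE.ACCOUNT_USAGE share, and that view can lag behind recent queries by up to about 45 minutes.

-- Optional sketch (not in the lab script): pruning and spill metrics for
-- recent queries against TRIPS_BIG_XT, from the ACCOUNT_USAGE share
select
  query_text,
  partitions_scanned,
  partitions_total,
  bytes_spilled_to_local_storage,
  total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history
where query_text ilike '%trips_big_xt%'
order by start_time desc
limit 10;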

An external table acts just like a normal table, so we can join it with other tables.

-- 2.1.9 External tables act just like regular tables,
--       so we can join them with other tables

with t as (
  select
    start_station_id,
    end_station_id,
    count(*) num_trips
  from trips_big_xt
  where startdate between to_date('2014-01-01') and to_date('2014-12-30')
  group by 1, 2)

select
  ss.station_name start_station,
  es.station_name end_station,
  num_trips
from t
  inner join stations_vw ss on t.start_station_id = ss.station_id
  inner join stations_vw es on t.end_station_id = es.station_id
order by 3 desc;

2.2 Additional Options for External Tables

In this section, we’ll explore two important options for External Tables. There will not be any
additional scripts to execute, since demonstrating them in action would require access to an
AWS Console.

REFRESH_ON_CREATE = < TRUE | FALSE >

Specifies whether to automatically refresh the external table metadata once, immediately
after the external table is created. The metadata for an external table is the list of files
that exist in the specified storage location. Setting this option to FALSE essentially
creates an "empty" external table definition. Refreshing the metadata then requires
executing the command ALTER EXTERNAL TABLE <table_name> REFRESH;
Default: TRUE
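
As a sketch of how REFRESH_ON_CREATE = FALSE plays out in practice (illustration only, not part of the lab script; it reuses the lab's CITIBIKE_STAGE with a hypothetical table name):

-- Hypothetical example: create the external table without registering any
-- files, then register them later with a manual metadata refresh
create or replace external table TRIPS_MANUAL_XT
  location = @citibike_stage/trips
  refresh_on_create = false
  auto_refresh = false
  file_format = (type=parquet);

-- The table returns no rows until its file metadata is refreshed
alter external table TRIPS_MANUAL_XT refresh;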

AUTO_REFRESH = < TRUE | FALSE >

Specifies whether Snowflake should enable triggering automatic refreshes of the
external table metadata when new or updated data files are available in the named
external stage. Setting this option to TRUE will keep the external table in sync
with the contents of the related storage location.
Default: TRUE
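
Note that on AWS, automatic refresh also requires configuring the S3 bucket to send event notifications to the external table's SQS queue. A brief sketch of how you would locate that queue (illustration only; TRIPS_AUTO_XT is a hypothetical table created with AUTO_REFRESH = TRUE):

-- Hypothetical example: find the SQS queue ARN for an auto-refreshing
-- external table, then point the S3 bucket's event notifications at it
show external tables like 'TRIPS_AUTO_XT';
-- The "notification_channel" column in the result holds the SQS ARN to use
-- when creating the S3 event notification in the AWS Console.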

Module 3: Materialized Views over External Tables
External tables offer the ability to have an SQL interface on top of a file-based data lake,
without having to maintain an additional copy of the data in the Snowflake storage layer.
Automatic refresh can keep the external metadata in sync with the contents of the storage
location, eliminating many complex data engineering workflows. Defining the external table
with an effective partitioning scheme can greatly improve query performance against external
tables.

Materialized Views are pre-computed data sets derived from a query specification (the
SELECT in the view definition) and stored for later use. Because the data is pre-computed,
querying a materialized view is faster than executing the original query.

Combining these two techniques, i.e., creating a Materialized View over an External Table,
provides optimal SQL query performance while keeping the original data
source in external storage.

3.1 Create a Materialized View on the External Table

3.1.1 Create the Materialized View.

-- 3.1.1 Create the materialized view

use role SYSADMIN;
use schema CITIBIKE.DEMO;

create or replace materialized view TRIPS_MV as
select
  startdate,
  start_station_id,
  end_station_id,
  count(*) num_trips
from trips_big_xt
group by 1, 2, 3;

Let’s see how many rows we have in the materialized view.

select
count(*) num_rows,
sum(num_trips) num_trips
from trips_mv;

3.1.2 Let’s re-run our join query, replacing the external table with the new materialized view.

-- 3.1.2 Let’s re-run our join query, replacing the
-- external table with the new materialized view.

with t as (
select
start_station_id,
end_station_id,
sum(num_trips) num_trips
from trips_mv
where startdate between to_date('2014-01-01') and
to_date('2014-12-30')
group by 1, 2)

select
ss.station_name start_station,
es.station_name end_station,
num_trips
from t
inner join stations_vw ss
on t.start_station_id = ss.station_id
inner join stations_vw es
on t.end_station_id = es.station_id
order by 3 desc;

3.1.3 Open the History panel and review the query profile. Note how much faster this ran
than the same query against the External Table.

Query Profile comparison: Query 2.1.9 (external table) vs. Query 3.1.2 (materialized view)

Module 4: Data Lake Export 


Now that we have a query that joins our materialized view of the Trip data with the Station and
Region tables, we would like to unload that enhanced dataset back to the data lake​. This
exercise will unload the data from the materialized view into Parquet files, partitioned by date,
into a Snowflake internal stage.
Navigate to the worksheet labeled “Module 4”.

4.1.1 Let’s create an internal stage to unload the data. This is used as a simplification for
demonstration in the event you do not have access to your own S3 bucket.

-- 4.1.1 Let’s create an internal stage to unload the data.

use role SYSADMIN;
use schema CITIBIKE.DEMO;

create or replace stage CITIBIKE_UNLOAD
  file_format = (type=parquet);

Optional: Here are the commands to create an external stage on S3 for those who have
access to an AWS account and can create an S3 bucket. This section creates a Snowflake
Storage Integration object, which encapsulates the AWS IAM credentials to access the storage
location. Snowflake documentation on the creation and use of Storage Integrations can be
found here: https://docs.snowflake.com/en/user-guide/data-load-s3-config.html

-- The first two steps occur in the AWS Console.

-- 1. Create an IAM policy that governs access to the S3 bucket.

-- 2. Create an IAM role and assign the policy from step 1.

-- 3. This step uses the Snowflake worksheets tab to create a new
--    IAM user in the AWS account.

use role accountadmin;

CREATE OR REPLACE STORAGE INTEGRATION citibike_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = ' '          -- Paste the ARN of the IAM Role you created here
  STORAGE_ALLOWED_LOCATIONS = (' ');  -- This is the URL of the bucket/folder(s)

-- 4. Note the STORAGE_AWS_ROLE_ARN and STORAGE_AWS_EXTERNAL_ID
--    in the result from this next statement.
describe integration citibike_int;

-- 5. Back in the AWS Console, edit the Trust Relationship of
--    the IAM role to add the Snowflake IAM user.

-- 6. Create an External Stage that points to the S3 bucket
--    and uses the credentials in the Storage Integration.
grant usage on integration citibike_int to role sysadmin;

use role sysadmin;
use database citibike;
use schema demo;

-- create the external stage

create or replace stage citibike_unload
  storage_integration = citibike_int
  url = ' '   -- The URL of the S3 bucket
  file_format = (type=parquet);

show stages;

4.1.2 Unload the data to the internal stage, partitioning it by date, and placing a cap on the file
size. 

-- 4.1.2 Execute the partitioned unload statement into
--       the internal stage

copy into @citibike_unload
from trips_mv
partition by (startdate::string)
file_format = (type=parquet)
max_file_size = 32000000
header = true;

-- Look at the contents of the stage.


list @citibike_unload;
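
As a quick sanity check (optional, not part of the lab script), you can also query the unloaded Parquet files directly from the stage, since the stage's file format is already set to Parquet:

-- Optional check: peek at a few rows of the unloaded Parquet files
select $1
from @citibike_unload
limit 10;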

Congratulations, you are now done with this lab!
