HANDS-ON LAB GUIDE FOR DATA LAKE
To be used with the Snowflake free 30-day trial at:
https://trial.snowflake.com
Table of Contents
Lab Overview
Module 1: Setup
Module 2: External Table
Module 3: Materialized View on External Table
Module 4: Data Lake Export
Lab Overview
This hands-on lab is intended to help you get up to speed on the features available in
Snowflake to augment your cloud data lake. This lab does not require that you have an
existing data lake; all data will be provided in a publicly available cloud storage location.
The datasets we’ll be using contain trip data from Citi Bike, the bike-share system in
New York.
Target Audience
Data engineers, Database and Cloud architects, and Data Warehouse administrators
Prerequisites
● Access to a free 30-day Snowflake trial environment
● Basic knowledge of SQL, and database concepts and objects
● Familiarity with CSV comma-delimited files and JSON semi-structured data
Module 1: Setup
1.1 Steps to Prepare Your Lab Environment
If not yet done, register for a Snowflake free 30-day trial at https://trial.snowflake.com
● The Snowflake edition (e.g., Standard or Enterprise), cloud provider (e.g., AWS or
Azure), and region (e.g., US East or EU) do not matter for this lab. We do suggest you
select the region that is physically closest to you, and select the Enterprise edition
so you can leverage some advanced capabilities, such as materialized views, that are
not available in the Standard edition.
● After registering, you will receive an email with an activation link and your Snowflake
account URL. Bookmark this URL for easy future access. After activation, you will
create a username and password; write down these credentials.
Resize your browser windows so you can view this lab guide PDF and your web
browser side-by-side to more easily follow the lab instructions. If possible, even better is
to use a secondary display dedicated to the lab guide.
Click on this link and download the “lab_scripts_DataLake.sql” file to your local
machine. This file contains pre-written SQL commands and we will use this file later in
the lab. Note: attempting to copy and paste the SQL scripts from this lab guide can
result in syntax errors. Please open the SQL file in a text editor, and copy the
commands from there.
Start by navigating to the Worksheets tab and creating four new, empty worksheets. Name
them “Module 1” through “Module 4”, as shown below.
The SQL command file is broken up into 4 separate modules. Copy each section of
code from the SQL file into its corresponding worksheet tab. When all four worksheet
tabs are filled, you can close the text editor.
Navigate back to the Module 1 worksheet, and we’ll begin executing each statement. To
execute a single statement, just position the cursor anywhere within the statement and
click the Run button. To execute several statements, they must be highlighted through
the final semi-colon prior to clicking the Run button.
Create the empty CITIBIKE database that will be used by this lab.
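A minimal sketch of this step (the DEMO and UTILS schemas shown here are assumptions based on the objects referenced later in the lab; the lab script has the exact statements):

create or replace database CITIBIKE;
create schema if not exists CITIBIKE.DEMO;   -- used from Module 2 onward
create schema if not exists CITIBIKE.UTILS;  -- will hold the external function created next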
Create an external function that is bound to a REST-based API.
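The CREATE statements for this step are in the lab script. As a hedged sketch of the general shape only (the integration name, IAM role ARN, and endpoint URL below are placeholders, not values from the lab):

-- An external function needs an API integration that points at a deployed
-- proxy endpoint (e.g., AWS API Gateway). Placeholders shown here.
create or replace api integration fetch_http_data_int
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::<account-id>:role/<role-name>'
  api_allowed_prefixes = ('https://<api-id>.execute-api.<region>.amazonaws.com/')
  enabled = true;

-- The function forwards a URL to the proxy endpoint and returns the
-- HTTP response as a VARIANT.
create or replace external function citibike.utils.fetch_http_data(url varchar)
  returns variant
  api_integration = fetch_http_data_int
  as 'https://<api-id>.execute-api.<region>.amazonaws.com/prod/fetch_http_data';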
-- Call the external function to fetch the two GBFS feeds and land the
-- raw JSON payloads in a table
create or replace table GBFS_JSON as
select
    $1 data,
    $2 url,
    citibike.utils.fetch_http_data( url ) payload,
    current_timestamp() row_inserted
from (values
    ('stations', 'https://gbfs.citibikenyc.com/gbfs/en/station_information.json'),
    ('regions',  'https://gbfs.citibikenyc.com/gbfs/en/system_regions.json'));
-- Now refine that raw JSON data by extracting out the STATIONS nodes
create or replace table STATION_JSON as
with s as (
    select payload, row_inserted
    from gbfs_json
    where data = 'stations'
      and row_inserted = (select max(row_inserted) from gbfs_json)
)
select
    value station_v,
    payload:response.last_updated::timestamp last_updated,
    row_inserted
from s,
lateral flatten (input => payload:response.data.stations);
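Note: the STATIONS_VW view created next also reads from a REGION_JSON table, whose creation statement is in the lab script. A sketch by analogy with STATION_JSON (the exact JSON path to the regions array is an assumption):

create or replace table REGION_JSON as
with r as (
    select payload, row_inserted
    from gbfs_json
    where data = 'regions'
      and row_inserted = (select max(row_inserted) from gbfs_json)
)
select
    value region_v,
    payload:response.last_updated::timestamp last_updated,
    row_inserted
from r,
lateral flatten (input => payload:response.data.regions);  -- assumed path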
create or replace view STATIONS_VW as
with s as (
    select * from station_json
    where row_inserted = (select max(row_inserted) from station_json)
),
r as (
    select * from region_json
    where row_inserted = (select max(row_inserted) from region_json)
)
select
    station_v:station_id::number station_id,
    station_v:name::string station_name,
    station_v:lat::float station_lat,
    station_v:lon::float station_lon,
    station_v:station_type::string station_type,
    station_v:capacity::number station_capacity,
    station_v:rental_methods rental_methods,
    region_v:name::string region_name
from s
left outer join r
  on station_v:region_id::integer = region_v:region_id::integer;
Snowflake has provided the Citibike TRIPS data in an AWS S3 bucket. The data files are in
Parquet format and are partitioned into folders by year. The bucket URL is:
s3://snowflake-corp-citibike-demo-master/V2/external/
Select the worksheet labeled “Module 2”. The first step is to set our context to the
Citibike database and the Demo schema.
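In SQL terms that looks like the following (assuming the DEMO schema was created during the Module 1 setup):

use database CITIBIKE;
use schema DEMO;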
create or replace stage CITIBIKE_STAGE
url = 's3://snowflake-corp-citibike-demo-master/V2/external/'
credentials = (aws_key_id = 'AKIAXSJGLDPF3ZICVOPB'
aws_secret_key = 'pfUJgyX4ukVmfMQr0qXLZ8Vmh5Q2yVnXXkr+CQQZ')
file_format=(type=parquet);
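Before defining any external table, you can query the staged Parquet files directly. A hedged example of the kind of statement that produces the results described below (the exact query in the lab script may differ):

-- Each $1 value is one Parquet row, returned as a semi-structured object
select $1 from @citibike_stage limit 100;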
Click on row 1 in the results pane. This opens a dialog box showing the contents of a
single row from the Parquet file.
Let’s create a basic external table on the Parquet files in the bucket.
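The statement is in the lab script; a minimal sketch of a basic external table over the stage (the table name is illustrative):

create or replace external table TRIPS_XT
  location = @citibike_stage
  file_format = (type = parquet);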
In this basic external table definition, there are only two columns available - a variant
named VALUE that contains the file data, and a pseudocolumn called
metadata$filename.
A single column named value is not going to be very user-friendly. It would be much
more useful to break out the individual fields into separate columns. And performance
against this external table will be less than optimal, because the query engine can’t
effectively prune out unneeded files from the execution plan. We can introduce a
partitioning scheme that makes the query engine much more efficient.
Let’s create a new external table on the same set of Parquet files, but this time we’ll
define each column separately, and partition the files on the date portion of their folder
name.
-- 2.1.5 Create a new external table on the same set of Parquet
-- files, but this time we’ll define each column
-- separately, and partition the files on the date
-- portion of their folder names.
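A hedged sketch of the shape of that definition, showing only a few of the columns (the full column list, and the exact expression that derives STARTDATE from the folder name, are in the lab script and depend on the bucket's folder layout):

create or replace external table TRIPS_BIG_XT (
    startdate        date    as to_date(split_part(metadata$filename, '/', 1)),  -- assumed folder position
    tripduration     integer as (value:tripduration::integer),
    start_station_id integer as (value:start_station_id::integer),
    end_station_id   integer as (value:end_station_id::integer)
  )
  partition by (startdate)
  location = @citibike_stage
  file_format = (type = parquet);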
Notice that every column definition consists of three parts: the column name, its
datatype, and the transformation clause following the “as” keyword. The most basic
transformation clause is just a reference to an element in the file as
value:”elementName”, followed by an explicit datatype casting as ::datatype.
Let’s see what the data looks like in the new external table.
-- 2.1.6 Let’s see what the data looks like in the new
-- external table.
select * from trips_big_xt limit 100;
Now it’s time to start querying. This first query will be relatively slow, as we are not
filtering on the partition column.
-- 2.1.7 Aggregate over the full table (no partition filter)
select
    start_station_id,
    count(*) num_trips,
    avg(tripduration) avg_duration
from trips_big_xt
group by 1
order by 2 desc;

-- 2.1.8 The same aggregation, filtered on the STARTDATE partition column
select
    start_station_id,
    count(*) num_trips,
    avg(tripduration) avg_duration
from trips_big_xt
where startdate between to_date('2014-01-01') and to_date('2014-06-30')
group by 1;
Open the History panel (next to the Results pane) to compare the query profile for the
two queries we just executed.
Query 2.1.7 – no partition filter vs. Query 2.1.8 – partition filter on STARTDATE
In query 2.1.7, the query engine was forced to read all 8,913 files and “spilled” nearly
3.15GB to local storage. The second query was able to effectively prune out over 90%
of the file reads and only spilled 148MB to local storage. Consequently, the second
query executed about 90% faster.
An external table acts just like a normal table, so we can join it with other tables.
-- 2.1.9 External tables act just like regular tables,
--       so we can join them with other tables
with t as (
    select
        start_station_id,
        end_station_id,
        count(*) num_trips
    from trips_big_xt
    where startdate between to_date('2014-01-01') and to_date('2014-12-30')
    group by 1, 2
)
select
    ss.station_name start_station,
    es.station_name end_station,
    num_trips
from t
inner join stations_vw ss on t.start_station_id = ss.station_id
inner join stations_vw es on t.end_station_id = es.station_id
order by 3 desc;
In this section, we’ll explore two important options for External Tables. There will not be any
additional scripts to execute, since demonstrating them in action would require access to an
AWS Console.
REFRESH_ON_CREATE = TRUE | FALSE
Specifies whether to automatically refresh the external table metadata once, immediately
after the external table is created. The metadata for an external table is the list of files
that exist in the specified storage location. Setting this option to FALSE essentially
creates an “empty” external table definition. The metadata must then be refreshed
manually by executing ALTER EXTERNAL TABLE <table_name> REFRESH;
Default: TRUE
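For example, a hedged sketch of deferring the metadata load and refreshing it on demand (the table name is illustrative):

create or replace external table TRIPS_DEFERRED_XT
  location = @citibike_stage
  refresh_on_create = false
  file_format = (type = parquet);

-- Load the file list later, on demand
alter external table TRIPS_DEFERRED_XT refresh;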
Module 3: Materialized Views over External Tables
External tables offer the ability to have an SQL interface on top of a file-based data lake,
without having to maintain an additional copy of the data in the Snowflake storage layer.
Automatic refresh can keep the external metadata in sync with the contents of the storage
location, eliminating many complex data engineering workflows. Defining the external table
with an effective partitioning scheme can greatly improve query performance against external
tables.
Materialized Views are pre-computed data sets derived from a query specification (the
SELECT in the view definition) and stored for later use. Because the data is pre-computed,
querying a materialized view is faster than executing the original query.
Combining these two techniques, i.e., creating a Materialized View over an External Table,
provides optimal SQL query performance while keeping the original data source in an
external stage.
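The CREATE statement for the materialized view is in the lab script (step 3.1.1). A hedged sketch that is consistent with the queries below, which reference a pre-aggregated NUM_TRIPS column:

create or replace materialized view TRIPS_MV as
select
    startdate,
    start_station_id,
    end_station_id,
    count(*) num_trips
from trips_big_xt
group by startdate, start_station_id, end_station_id;

The first query simply confirms how many rows, and how many underlying trips, the materialized view holds: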
select
count(*) num_rows,
sum(num_trips) num_trips
from trips_mv;
3.1.2 Let’s re-run our join query, replacing the external table with the new materialized view.
-- 3.1.2 Let’s re-run our join query, replacing the
-- external table with the new materialized view.
with t as (
    select
        start_station_id,
        end_station_id,
        sum(num_trips) num_trips
    from trips_mv
    where startdate between to_date('2014-01-01') and to_date('2014-12-30')
    group by 1, 2
)
select
    ss.station_name start_station,
    es.station_name end_station,
    num_trips
from t
inner join stations_vw ss on t.start_station_id = ss.station_id
inner join stations_vw es on t.end_station_id = es.station_id
order by 3 desc;
3.1.3 Open the History panel and review the query profile. Look how much faster this was
than the same query against the External Table.
Query 2.1.9 – external table vs. Query 3.1.2 – materialized view
Module 4: Data Lake Export
4.1.1 Let’s create an internal stage to unload the data into. We use an internal stage here
to simplify the demonstration, in case you do not have access to your own S3 bucket.
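A hedged sketch of such an internal stage (the stage name is an assumption; the lab script has the exact statement):

create or replace stage CITIBIKE_EXPORT_STAGE
  file_format = (type = parquet);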
Optional: Here are the commands to create an external stage on S3 for those who have
access to an AWS account and can create an S3 bucket. This section creates a Snowflake
Storage Integration object, which encapsulates the AWS IAM credentials needed to access
the storage location. Snowflake documentation on the creation and use of Storage
Integrations can be found here: https://docs.snowflake.com/en/user-guide/data-load-s3-config.html
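If you go that route, the general shape is sketched below; the integration name, IAM role ARN, and bucket URL are placeholders you would replace with your own values after completing the IAM setup described in the documentation:

create or replace storage integration citibike_export_int
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::<account-id>:role/<role-name>'
  storage_allowed_locations = ('s3://<your-bucket>/citibike-export/');

create or replace stage CITIBIKE_EXPORT_S3_STAGE
  url = 's3://<your-bucket>/citibike-export/'
  storage_integration = citibike_export_int
  file_format = (type = parquet);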
show stages;
4.1.2 Unload the data to the internal stage, partitioning it by date, and placing a cap on the file
size.
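A hedged sketch of such an unload (the stage name, target path, partition expression, and size cap are illustrative; the lab script has the exact statement):

copy into @citibike_export_stage/trips/
from (select * from trips_mv)
partition by (to_varchar(startdate, 'YYYY-MM-DD'))
file_format = (type = parquet)
max_file_size = 32000000;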