
The Missing Manual

Everything you need to know about Snowflake optimization

Ian Whitestone & Niall Woodward


Data Council - March 28, 2023
👋 Hello!

Ian Whitestone Niall Woodward


[email protected] [email protected]
Why are we here?

● 🌮 Tacos
● End of the longest bull run in history
● Data teams are increasingly being asked to better understand, monitor, and reduce their warehouse spend
● Snowflake is the market leader, with many cost and performance levers available

“sleeping bull” by Midjourney


Agenda

● Snowflake architecture overview

● How to lower costs

● How to optimize performance

● Next steps

Slides will be posted to select.dev


Snowflake Architecture

“arctic cloud data warehouse” by Midjourney


💸
💸
How to lower costs

“different sized computers in a row” by Midjourney
How to lower costs

1. Understand Snowflake billing model


2. Optimize virtual warehouse configuration
3. Consolidate warehouses
Compute Billing Model

● Only pay while virtual warehouses are active


● Per-second billing ($2-$4/credit)
○ X-Small consumes 1 credit / hour
○ Small consumes 2 credits / hour
○ …doubles with each size
● Minimum of 60 seconds billed each time a warehouse is resumed
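As a rough worked example: a Medium warehouse consumes 4 credits/hour, so keeping it resumed for 15 minutes costs about 1 credit, i.e. $2-$4 depending on your contracted rate. A minimal sketch for checking actual consumption from the standard ACCOUNT_USAGE view follows; the 3.00 $/credit rate is an assumed placeholder, substitute your own.

-- Credits and approximate spend per warehouse over the last 30 days
-- (3.00 $/credit is an assumption; use your contracted rate)
select
    warehouse_name,
    sum(credits_used) as credits_used,
    round(sum(credits_used) * 3.00, 2) as approx_dollars
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_timestamp())
group by 1
order by credits_used desc;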
How to lower costs

1. Understand Snowflake billing model


2. Optimize virtual warehouse configuration
3. Consolidate warehouses
What are virtual warehouses?

● Abstraction over compute instances


● Each instance has 8 cores/threads, 16GB of RAM, and local SSD
● T-shirt sizes - XS -> 6XL
● Each size doubles compute resources and cost - scaling ‘up’
Multi-cluster warehouses
Scale ‘out’ to process variable query volumes, e.g. peak hours
Recommended Warehouse Configuration

● Start with an X-Small, single cluster warehouse


● Set max_cluster_count to satisfy peak concurrency needs
● 60s auto-suspend
● Set a query timeout (default is 2 days!)
● Resource monitor to alert on spikes (see the sketch after this slide)

select.dev/posts/snowflake-warehouse-sizing
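A minimal sketch of this configuration in SQL; the warehouse name, cluster count, timeout, and credit quota are placeholder assumptions, and multi-cluster settings require Enterprise edition or above.

-- X-Small, single cluster by default, scaling out only under concurrency
create warehouse if not exists transforming_xs
    warehouse_size = 'XSMALL'
    min_cluster_count = 1
    max_cluster_count = 4        -- sized to peak concurrency needs (Enterprise+)
    auto_suspend = 60            -- suspend after 60s of inactivity
    auto_resume = true
    initially_suspended = true;

-- Query timeout: the default statement_timeout_in_seconds is 172800 (2 days)
alter warehouse transforming_xs set statement_timeout_in_seconds = 3600;

-- Resource monitor to alert on spend spikes
create resource monitor transforming_xs_monitor
    with credit_quota = 100
    triggers on 80 percent do notify
             on 100 percent do notify;
alter warehouse transforming_xs set resource_monitor = transforming_xs_monitor;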
Warehouse Sizing

● Reduce warehouse size and max_cluster_count for workloads which can tolerate some queueing, e.g. data loading
● Use per-model warehouse configuration in dbt instead of increasing the warehouse size for the entire project (see the sketch after this slide)
● Larger warehouses can improve performance at minimal additional cost, especially when queries spill to remote disk

select.dev/posts/snowflake-warehouse-sizing
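A hedged sketch of a per-model override in dbt via the snowflake_warehouse model config; the model, ref, and warehouse names are assumptions.

-- models/marts/fct_orders.sql: only this model runs on the larger warehouse
{{ config(
    materialized = 'table',
    snowflake_warehouse = 'transforming_l'
) }}

select *
from {{ ref('stg_orders') }}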
Warehouse Sizing
● Larger warehouses improve performance at low additional cost – up to a point

select.dev/posts/snowflake-warehouse-sizing
How to lower costs

1. Understand Snowflake billing model


2. Optimize virtual warehouse configuration
3. Consolidate warehouses
Consolidate Warehouses

● Fewer warehouses -> less idle time
● Speeds up queries due to caching
● Separate by workload requirements, not domain
Optimizing performance
Pruning and clustering

“thousands of tiny files” by Midjourney


Optimizing performance
Pruning and clustering

1. Micro-partitions
2. Pruning
3. Clustering
Micro-partitions
● Tables are stored in cloud storage as micro-partitions
● Micro-partitions are a proprietary, closed-source file format created by Snowflake
● Heavily compressed and ~16MB each
● DML operations (updates/inserts/deletes) add/remove entire files

select.dev/posts/introduction-to-snowflake-micro-partitions
Micro-partition metadata
Snowflake stores column level statistics in the cloud services layer

select.dev/posts/introduction-to-snowflake-micro-partitions
Optimizing performance
Pruning and clustering

1. Micro-partitions
2. Pruning
3. Clustering
Pruning - every fast query’s secret

● Snowflake checks which partitions contain the relevant data
● In this example, only 3 micro-partitions are read

select.dev/posts/introduction-to-snowflake-micro-partitions
Check for pruning using the Query Profile

● Query profile shows only 5 partitions scanned out of the 3,242 present for the table
● Info also available in the query history view (sketch below)
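A hedged sketch of the same check against the standard QUERY_HISTORY view; the 7-day window and row limit are arbitrary choices.

-- Partitions scanned vs. total per query; low ratios indicate effective pruning
select
    query_id,
    partitions_scanned,
    partitions_total,
    round(100 * partitions_scanned / nullif(partitions_total, 0), 1) as pct_scanned
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -7, current_timestamp())
  and partitions_total > 0
order by partitions_total desc
limit 50;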
Optimizing performance
Pruning and clustering

1. Micro-partitions
2. Pruning
3. Clustering
Clustering

● Describes the distribution of data across a table’s micro-partitions
● A ‘well-clustered’ column has a small range of values per micro-partition
● Snowflake can prune well when queries filter on that column

select.dev/posts/introduction-to-snowflake-clustering
Clustering methods

● Natural Clustering
○ Leverage wherever possible
● Automatic Clustering Service
○ Use where a table is commonly filtered by a column which isn’t the ‘natural’ clustering key
● Manual Sorting
○ Useful for one-off clustering at lowest cost (sketches below)

select.dev/posts/introduction-to-snowflake-clustering
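Hedged sketches of the last two methods; the table and column names are assumptions.

-- Automatic Clustering Service: Snowflake keeps the table sorted on this key over time
alter table orders cluster by (order_date);

-- Manual sorting: a one-off re-sort at the lowest cost, with no ongoing clustering charges
insert overwrite into orders
select * from orders order by order_date;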
Finding good candidates for clustering

● Columns used frequently in ‘where’ clauses
● Column should have a large enough number of distinct values to enable effective pruning on the table
○ e.g. clustering on a categorical column with 2 distinct values will only achieve ~50% pruning
● Use the query history + access history views to determine usage patterns (see the checks after this slide)

select.dev/posts/introduction-to-snowflake-clustering
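Two hedged checks for a candidate column (table and column names are assumptions): its cardinality, and how the table is currently clustered on it.

-- Does the column have enough distinct values to prune effectively?
select approx_count_distinct(order_date) from orders;

-- How well is the table already clustered on it? (returns depth/overlap statistics)
select system$clustering_information('orders', '(order_date)');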
Optimizing performance
Query design

“fast running computer” by Midjourney


Optimizing performance
Query design

1. Before you begin…


2. Fastest way to process data? Don’t!
3. Use clustered columns in join predicates
4. Explicitly list columns in CTEs
5. Filter early
Before you begin…

● What’s the expected ROI?


● Does your query need to run every hour?
○ Is anyone looking at the dashboard multiple times per day?
○ If a data model costs $10,000/year running hourly, switching to daily can drop costs by ~95%
Optimizing performance
Query design

1. Before you begin…


2. Fastest way to process data? Don’t!
3. Use clustered columns in join predicates
4. Explicitly list columns in CTEs
5. Filter early
Fastest way to process data? Don’t!

1. Ensure query is pruning out unneeded micro-partitions
○ Pruning works with CTEs & subqueries
○ Can fail when functions are applied to predicate columns, with type conversions, deeply nested views, or when the table’s clustering health has degraded (see the sketch after this slide)
○ Always validate by checking the query profile/history

2. Use incremental materializations for larger datasets
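A hedged before/after sketch of a filter that defeats pruning vs. one that allows it; the table and column names are assumptions.

-- ⚠ Function applied to the clustered column: Snowflake may not prune
select count(*)
from orders
where to_varchar(order_date, 'YYYY-MM') = '2023-01';

-- ✅ Range filter on the raw column: partition metadata can be used to skip files
select count(*)
from orders
where order_date >= '2023-01-01'
  and order_date <  '2023-02-01';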


Optimizing performance
Query design

1. Before you begin…


2. Fastest way to process data? Don’t!
3. Use clustered columns in join predicates
4. Explicitly list columns in CTEs
5. Filter early
Use clustered columns in join predicates

● Snowflake uses values from one side of the join to enable pruning on the other
● Applies to joins and merges (sketch below)
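A hedged sketch (tables and columns are assumptions): the extra equality on the clustered column lets Snowflake prune the large table using values from the small side.

-- orders is large and well clustered by order_date; new_orders is small
select o.*
from orders o
join new_orders n
    on o.order_key = n.order_key
   -- doesn’t change the join result, but enables pruning on orders
   and o.order_date = n.order_date;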
Optimizing performance
Query design

1. Before you begin…


2. Fastest way to process data? Don’t!
3. Use clustered columns in join predicates
4. Explicitly list columns in CTEs
5. Filter early
Column pruning doesn’t always work with CTEs

● Column pruning prevents unneeded columns from being read
● Column pruning stops working when CTEs are referenced more than once or when they are used in a join
○ Ensure required columns are explicitly listed in CTEs (sketch below)

select.dev/posts/should-you-use-ctes-in-snowflake
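A hedged sketch (table and column names are assumptions): listing only the needed columns keeps the scan narrow even when the CTE is referenced twice.

-- ⚠ select * in a CTE referenced more than once can disable column pruning
with orders as (
    select * from raw_orders
)
select a.order_date, count(*)
from orders a
join orders b on a.order_key = b.parent_order_key
group by 1;

-- ✅ explicitly list only the columns you need
with orders as (
    select order_key, parent_order_key, order_date from raw_orders
)
select a.order_date, count(*)
from orders a
join orders b on a.order_key = b.parent_order_key
group by 1;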
Optimizing performance
Query design

1. Before you begin…


2. Fastest way to process data? Don’t!
3. Use clustered columns in join predicates
4. Explicitly list columns in CTEs
5. Filter early
Filter early

● Most of the time, Snowflake pushes down filters
● In certain cases it can’t
○ A QUALIFY filter happens post-join, but should be applied beforehand in a CTE (sketch below)
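A hedged sketch (names are assumptions): deduplicating in a CTE before the join, rather than relying on QUALIFY after it.

-- ⚠ QUALIFY is evaluated after the join, so every row is joined first
select o.*, c.region
from orders o
join customers c on o.customer_key = c.customer_key
qualify row_number() over (partition by o.order_key order by o.updated_at desc) = 1;

-- ✅ filter early: deduplicate in a CTE, then join the smaller result
with latest_orders as (
    select *
    from orders
    qualify row_number() over (partition by order_key order by updated_at desc) = 1
)
select o.*, c.region
from latest_orders o
join customers c on o.customer_key = c.customer_key;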
Next Steps

“polar bear on a computer” by Midjourney


Bootstrap Cost & Performance Observability

● Understanding virtual warehouse cost drivers is critical


● Use our dbt package dbt-snowflake-monitoring
○ Cost per query, cost per dbt model, etc.
● Create dashboards for monitoring, alerts for big spikes
● Review monthly/quarterly

select.dev/posts/cost-per-query
github.com/get-select/dbt-snowflake-monitoring
Use SELECT

Lower Costs
Save Time
Optimize Performance

Reach out to join early access or book a demo → [email protected]


Thanks for listening!

“data nerds socializing” by Midjourney


Choosing the right warehouse size
● Start with an X-SMALL warehouse
● Test with representative production queries
● If execution time is within SLO, leave as is. Otherwise, increase the warehouse size until the SLO is met (sketch below).
● If on Enterprise, configure the maximum cluster count on the warehouse to meet peak concurrency needs. Simulate using historical production data if available.

select.dev/posts/snowflake-warehouse-sizing
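A hedged sketch of one iteration of that loop; the warehouse name and target size are assumptions.

-- Try the next size up, then re-run the representative queries
alter warehouse transforming_xs set warehouse_size = 'SMALL';

-- Disable the result cache so repeated test runs reflect real execution time
alter session set use_cached_result = false;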
Impact of warehouse size on query execution time

● Compute, memory, and disk space (cache size + space available for local spillage) double with each size increase
● Generally speaking, query execution time will also halve, until…
○ A certain point where performance will either stop improving (Snowflake won't parallelize further) or get worse, as added communication costs outweigh the performance benefits
select.dev/posts/snowflake-warehouse-sizing
Before you start, can you reduce the frequency?

● Does your query need to run every hour?
○ Is anyone looking at the dashboard multiple times per day?
● If a data model costs $10,000/year running hourly, switching to daily can drop costs by ~95%
Include additional join predicate to force pruning

● Static pruning vs. dynamic pruning
● During a join, Snowflake creates a hash table on the "build side" (smaller table, on the left of the query profile)
● Statistics are collected for the distribution of join keys in build-side records
● These are pushed to the probe side (bigger table) and can be used to filter or skip entire files
Regular merge forces join

● A merge results in a join
● Table is well clustered by order timestamp, not order key
● A regular merge forces a full table scan!
Add additional join condition on our clustered column

● Table is well clustered on order date
● Adding an additional join key doesn’t change the validity of the join
● Adding a date predicate to the join forces dynamic pruning; the query now scans <0.2% of the table! (sketch below)
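A hedged sketch of the same pattern; table, key, and column names are assumptions.

-- ⚠ Merging only on the surrogate key forces a full scan of the target table
merge into orders t
using new_orders s
    on t.order_key = s.order_key
when matched then update set t.status = s.status
when not matched then insert (order_key, order_date, status)
    values (s.order_key, s.order_date, s.status);

-- ✅ Adding the clustered order_date to the join condition enables dynamic pruning
merge into orders t
using new_orders s
    on t.order_key = s.order_key
   and t.order_date = s.order_date
when matched then update set t.status = s.status
when not matched then insert (order_key, order_date, status)
    values (s.order_key, s.order_date, s.status);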
Should you use CTEs?

● Yes
● CTEs are computed once in Snowflake
● In certain scenarios where a CTE is referenced more than once, it can be faster to repeat the logic in subqueries rather than use a CTE

select.dev/posts/should-you-use-ctes-in-snowflake
