
BEST PRACTICES FOR OPTIMIZING YOUR DBT AND SNOWFLAKE DEPLOYMENT

WHITE PAPER
TABLE OF CONTENTS

Introduction
What Is Snowflake?
  Snowflake architecture
  Benefits of using Snowflake
What Is dbt?
  dbt Cloud
Customer Use Case
Optimizing Snowflake
  Automated resource optimization for dbt query tuning
  - Automatic clustering
  - Materialized views
  - Query acceleration services
  Resource management and monitoring
  - Auto-suspend policies
  - Resource monitors
  - Naming conventions
  Role-based access control (RBAC)
  - Monitoring
  - Monitoring credit usage
  - Monitoring storage usage
  Individual dbt workload elasticity
  - Scaling up for performance
  - Scaling out for concurrency
  Writing effective SQL statements
  - Query order of execution
  - Applying filters as early as possible
  - Querying only what you need
  - Joining on unique keys
  - Avoiding complex functions and UDFs in WHERE clauses
Optimizing dbt
  Use environments
  Use the ref() function and sources
  Write modular, DRY code
  Use dbt tests and documentation
  Use packages
  Be intentional about your materializations
  Optimize for scalability
  - Plan for project scalability from the outset
  - Follow a process for upgrading dbt versions
Conclusion
Contributors
Reviewers
Document Revisions
About dbt Labs
About Snowflake
INTRODUCTION

Companies in every industry acknowledge that data is one of their most important assets. And yet, companies consistently fall short of realizing the potential of their data.

Why is this the case? One key reason is the proliferation of data silos, which create expensive and time-consuming bottlenecks, erode trust, and render governance and collaboration nearly impossible.

This is where Snowflake and dbt come in.

The Snowflake Data Cloud is one global, unified system connecting companies and data providers to relevant data for their business. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds, eliminating previous silos.

dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices such as modularity, portability, CI/CD, and documentation. With dbt, anyone who knows SQL can contribute to production-grade data pipelines.

By combining dbt with Snowflake, data teams can collaborate on data transformation workflows while operating out of a central source of truth. Snowflake and dbt form the backbone of a data infrastructure designed for collaboration, agility, and scalability.

When Snowflake is combined with dbt, customers can operationalize and automate Snowflake's hallmark scalability within dbt as part of their analytics engineering workflow. The result is that Snowflake customers pay only for the resources they need, when they need them, which maximizes efficiency and results in minimal waste and lower costs.

This paper will provide some best practices for using dbt with Snowflake to create this efficient workflow.

WHAT IS SNOWFLAKE?

Snowflake's Data Cloud is a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations have a single unified view of data so they can easily discover and securely share governed data, and execute diverse analytics workloads. Snowflake provides a tightly integrated analytics data platform as a service, billed based on consumption. It is faster, easier to use, and far more flexible than traditional data warehouse offerings.

Snowflake uses a SQL database engine and a unique architecture designed specifically for the cloud. There is no hardware (virtual or physical) or software for you to select, install, configure, or manage. In addition, ongoing maintenance, management, and tuning are handled by Snowflake.

All components of Snowflake's service (other than optional customer clients) run in a secure cloud infrastructure.

Snowflake is cloud-agnostic and uses virtual compute instances from each cloud provider (Amazon EC2, Azure VM, and Google Compute Engine). In addition, it uses object or file storage from Amazon S3, Azure Blob Storage, or Google Cloud Storage for persistent storage of data. Due to Snowflake's unique architecture and cloud independence, you can seamlessly replicate data and operate from any of these clouds simultaneously.

SNOWFLAKE ARCHITECTURE

Snowflake's architecture is a hybrid of traditional shared-disk database architectures and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using massively parallel processing (MPP) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.

As shown in Figure 1, Snowflake's unique architecture consists of three layers built upon a public cloud infrastructure:

• Cloud services: Cloud services coordinate activities across Snowflake, processing user requests from login to query dispatch. This layer provides optimization, management, security, sharing, and other features.

• Multi-cluster compute: Snowflake processes queries using virtual warehouses. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from Amazon EC2, Azure VM, or Google Cloud Compute. Each virtual warehouse has independent compute resources, so high demand in one virtual warehouse has no impact on the performance of other virtual warehouses. For more information, see "Virtual Warehouses" in the Snowflake documentation.
• Centralized storage: Snowflake uses Amazon S3, Azure Blob Storage, or Google Cloud Storage to store data in its internal optimized, compressed, columnar format using micro-partitions. Snowflake manages the data organization, file size, structure, compression, metadata, statistics, and replication. Data objects stored by Snowflake are not directly visible to customers, but they are accessible through SQL query operations that are run using Snowflake.

Figure 1: Three layers of Snowflake's architecture

BENEFITS OF USING SNOWFLAKE

Snowflake is a cross-cloud platform, which means there are several things users coming from a more traditional on-premises solution will no longer need to worry about:

• Installing, provisioning, and maintaining hardware and software: All you need to do is create an account and load your data. You can then immediately connect from dbt and start transforming data.

• Determining the capacity of a data warehouse: Snowflake has scalable compute and storage, so it can accommodate all of your data and all of your users. You can adjust the count and size of your virtual warehouses to handle peaks and lulls in your data usage. You can even turn your warehouses completely off to stop incurring costs when you are not using them.

• Learning new tools and expanded SQL capabilities: Snowflake is fully compliant with ANSI-SQL, so you can use the skills and tools you already have. Snowflake provides connectors for ODBC, JDBC, Python, Spark, and Node.js, as well as web and command-line interfaces. On top of that, Snowpark is an initiative that will provide even more options for data engineers to express their business logic by directly working with Scala, Java, and Python DataFrames.

• Siloed structured and semi-structured data: Business users increasingly need to work with both traditionally structured data (for example, data in VARCHAR, INT, and DATE columns in tables) and semi-structured data in formats such as XML, JSON, and Parquet. Snowflake provides a special data type called VARIANT that enables you to load your semi-structured data natively and then query it with SQL (see the brief example after this list).

• Optimizing and maintaining your data: You can run analytic queries quickly and easily without worrying about managing how your data is indexed or distributed across partitions. Snowflake also provides built-in data protection capabilities, so you don't need to worry about snapshots, backups, or other administrative tasks such as running VACUUM jobs.

• Securing data and complying with international privacy regulations: All data is encrypted when it is loaded into Snowflake, and it is kept encrypted at all times at rest and in transit. If your business requirements include working with data that requires HIPAA, PII, PCI DSS, FedRAMP compliance, and more, Snowflake's Business Critical edition and higher editions can support these validations.
• Sharing data securely: Snowflake Secure Data Sharing enables you to share near real-time data internally and externally between Snowflake accounts without copying and moving data sets. Data providers provide secure data shares to their data consumers, who can view and seamlessly combine the data with their own data sources. Snowflake Data Marketplace includes many data sets that you can incorporate into your existing business data—such as data for weather, demographics, or traffic—for greater data-driven insights.
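As a brief illustration of the VARIANT data type mentioned above, the following is a minimal sketch; the raw_events table, its payload column, and the JSON field names are hypothetical, and it assumes JSON documents have already been loaded into that column.

select
    payload:event_type::string as event_type,       -- colon path notation into the JSON document
    payload:user.id::number    as user_id,
    item.value:sku::string     as sku
from raw_events,
     lateral flatten(input => payload:items) item    -- explode the nested items array into rows
where payload:event_type::string = 'purchase';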

WHAT IS DBT?

When data teams work in silos, data quality suffers. dbt provides a common space for analysts, data engineers, and data scientists to collaborate on transformation workflows using their shared knowledge of SQL.

By applying proven software development best practices such as modularity, portability, version control, testing, and documentation, dbt's analytics engineering workflow helps data teams build trusted data, faster.

dbt transforms the data already in your data warehouse. Transformations are expressed in simple SQL SELECT statements and, when executed, dbt compiles the code, infers dependency graphs, runs models in order, and writes the necessary DDL/DML to execute against your Snowflake instance. This makes it possible for users to focus on writing SQL and not worry about the rest. For writing code that is DRY (don't repeat yourself), users can use Jinja alongside SQL to express repeated logic using control structures such as loops and statements.

DBT CLOUD

dbt Cloud is the fastest and most reliable way to deploy dbt. It provides a centralized experience for teams to develop, test, schedule, and investigate data models—all in one web-based UI (see Figure 2). This is made possible through features such as an intuitive IDE, automated testing and documentation, in-app scheduling and alerting, access control, and a native Git integration.

dbt Cloud also eliminates the setup and maintenance work required to manage data transformations in Snowflake at scale. A turn-key adapter establishes a secure connection built to handle enterprise loads, while allowing for fine-grained policies and permissions.

Figure 2: dbt Cloud provides a centralized experience for developing, testing, scheduling, and investigating data models.
CUSTOMER USE CASE

When Ben Singleton joined JetBlue as its Director of Data Science & Analytics, he stepped into a whirlpool of demands that his team struggled to keep up with. The data team was facing a barrage of concerns and low stakeholder trust.

"My welcome to JetBlue involved a group of senior leaders making it clear that they were frustrated with the current state of data," Singleton said.

What made matters worse was that the experts were not empowered to take ownership of their own data due to the inaccessibility of the data stack.

As Singleton dug in, he realized the solution wasn't incremental performance improvement but rather a complete infrastructure overhaul. By pairing Snowflake with dbt, JetBlue was able to transform the data team from being a bottleneck to being the enablers of a data democracy.

"Every C-level executive wants more questions answered with data, they want that data faster, and they want it in many different ways. It's critical for us," Singleton said. All of this was done without an increase in infrastructure costs. To read more about JetBlue's success story, see the JetBlue case study.¹

The remainder of this paper dives into the exact dbt and Snowflake best practices that JetBlue and thousands of other clients have implemented to optimize performance.

OPTIMIZING SNOWFLAKE

Your business logic is defined in dbt, but dbt ultimately pushes down all processing to Snowflake. For that reason, optimizing the Snowflake side of your deployment is critical to maximizing your query performance and minimizing deployment costs. The table below summarizes the main areas and relevant best practices for Snowflake and serves as a checklist for your deployment.
AREA: Automated resource optimization for dbt query tuning
• Automatic clustering: automated table maintenance
• Materialized views: pre-compute complex logic
• Query acceleration services: automated scale-out of part of a query to speed up performance without resizing the warehouse

AREA: Resource management and monitoring
• Auto-suspend policies: automatic stop of a warehouse to reduce costs
• Resource monitors: control of resource utilization and cost
• Naming conventions: ease of tracking, allocation, and reporting
• Role-based access control: governance and cost allocation
• Monitoring: resource and cost consumption monitoring

AREA: Individual dbt workload elasticity
• Scaling up for performance: resizing a warehouse to increase performance for a complex workload
• Scaling out for concurrency: spinning up additional warehouses to support a spike in concurrency

AREA: Writing effective SQL statements
• Applying filters as early as possible: optimizing row operations and reducing records in subsequent operations
• Querying only what you need: selecting only the columns needed to optimize the columnar store
• Joining on unique keys: optimizing JOIN operations and avoiding cross-joins
• Avoiding complex functions and UDFs in WHERE clauses: preserving pruning
AUTOMATED RESOURCE OPTIMIZATION FOR DBT QUERY TUNING

Performance and scale are core to Snowflake. Snowflake's functionality is designed such that users can focus on core analytical tasks instead of on tuning the platform or investing in complicated workload management.

Automatic clustering

Traditionally, legacy on-premises and cloud data warehouses relied on static partitioning of large tables to achieve acceptable performance and enable better scaling. In these systems, a partition is a unit of management that is manipulated independently using specialized DDL and syntax; however, static partitioning has a number of well-known limitations, such as maintenance overhead and data skew, which can result in disproportionately sized partitions. It was the user's responsibility to constantly optimize the underlying data storage. This involved work such as updating indexes and statistics, post-load vacuuming procedures, choosing the right distribution keys, dealing with slow partitions due to growing skews, and manually reordering data as new data arrived or was modified.

In contrast to a traditional data warehouse, Snowflake implements a powerful and unique form of partitioning called micro-partitioning, which delivers all the advantages of static partitioning without the known limitations, as well as providing additional significant benefits. Snowflake's scalable, multi-cluster virtual warehouse technology automates the maintenance of micro-partitions. This means Snowflake efficiently and automatically executes the re-clustering in the background. There's no need to create, size, or resize a virtual warehouse. The compute service continuously monitors the clustering quality of all registered clustered tables. It starts with the most unclustered micro-partitions and iteratively performs the clustering until an optimal clustering depth is achieved.

With Snowflake, you can define clustered tables if the natural ingestion order is not sufficient in the presence of varying data access patterns. Automatic clustering is a Snowflake service that seamlessly and continually manages all reclustering, as needed, of clustered tables. Its benefits include the following:

• You no longer need to run manual operations to re-cluster data.
• Incremental clustering is done as new data arrives or a large amount of data is modified.
• Data pipelines consisting of DML operations (INSERT, DELETE, UPDATE, MERGE) can run concurrently and are not blocked.
• Snowflake performs automatic reclustering in the background, and you do not need to specify a warehouse to use.
• You can resume and suspend automatic clustering on a per-table basis, and you are billed by the second for only the compute resources used.
• Snowflake internally manages the state of clustered tables, as well as the resources (servers, memory, and so on) used for all automated clustering operations. This allows Snowflake to dynamically allocate resources as needed, resulting in the most efficient and effective reclustering. The Automatic Clustering service does not perform any unnecessary reclustering. Reclustering is triggered only when a table would benefit from the operation.

dbt supports table clustering on Snowflake. To control clustering for a table or incremental model, use the cluster_by configuration. Refer to the Snowflake configuration guide for more details.
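For example, a dbt model can declare the cluster_by configuration in its config block; this is a minimal sketch in which the model and column names (stg_events, event_date, account_id) are hypothetical.

{{ config(
    materialized = 'table',
    cluster_by = ['event_date', 'account_id']
) }}

select
    event_date,
    account_id,
    event_type,
    payload
from {{ ref('stg_events') }}

With this configuration, dbt adds a matching CLUSTER BY clause to the generated DDL, so Snowflake can maintain clustering on those keys in the background.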
Materialized views

A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use. Because the data is pre-computed, querying a materialized view (MV) is faster than executing a query against the base table of the view. This performance difference can be significant when a query is run frequently or is sufficiently complex. As a result, MVs can speed up expensive aggregation, projection, and selection operations, especially those that run frequently and that run on large data sets. dbt does not support MVs out of the box as materializations; therefore, we recommend using custom materializations as a solution to achieve similar purposes. The dbt materializations section in this white paper explains how MVs can be used in dbt via a custom materialization.
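For reference, this is what a materialized view looks like in plain Snowflake SQL; a minimal sketch in which the orders table and its columns are hypothetical (note that Snowflake MVs require Enterprise Edition or higher and are defined over a single base table).

create materialized view daily_order_totals as
select
    order_date,
    count(*)         as order_count,
    sum(order_total) as total_revenue
from orders
group by order_date;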

MVs are particularly useful when:
• Query results contain a small number of rows and/or columns relative to the base table (the table on which the view is defined)
• Query results contain results that require significant processing, including:
  – Analysis of semi-structured data
  – Aggregates that take a long time to calculate
• The query is on an external table (that is, data sets stored in files in an external stage), which might have slower performance compared to querying native database tables
• The view's base table does not change frequently

In general, when deciding whether to create an MV or a regular view, use the following criteria:
• Create an MV when all of the following are true:
  – The query results from the view don't change often. This almost always means that the underlying/base table for the view doesn't change often, or at least that the subset of base table rows used in the MV doesn't change often.
  – The results of the view are used often (typically, significantly more often than the query results change).
  – The query consumes a lot of resources. Typically, this means that the query consumes a lot of processing time or credits, but it could also mean that the query consumes a lot of storage space for intermediate results.
• Create a regular view when any of the following are true:
  – The results of the view change often.
  – The results are not used often (relative to the rate at which the results change).
  – The query is not resource-intensive, so it is not costly to re-run it.

These criteria are just guidelines. An MV might provide benefits even if it is not used often—especially if the results change less frequently than the usage of the view.

There are also other factors to consider when deciding whether to use a regular view or an MV. One such example is the cost of storing and maintaining the MV. If the results are not used very often (even if they are used more often than they change), the additional storage and compute resource costs might not be worth the performance gain.

Snowflake's compute service monitors the base tables for MVs and kicks off refresh statements for the corresponding MVs if significant changes are detected. This maintenance process of all dependent MVs is asynchronous. In scenarios where a user is accessing an MV that has yet to be updated, Snowflake's query engine will perform a combined execution with the base table to always ensure consistent query results. Similar to Snowflake's automatic clustering with the ability to resume or suspend per table, a user can resume and suspend the automatic maintenance on a per-MV basis. The automatic refresh process consumes resources and can result in increased credit usage. However, Snowflake ensures efficient credit usage by billing your account only for the actual resources used. Billing is calculated in one-second increments.

You can control the cost of maintaining MVs by carefully choosing how many views to create, which tables to create them on, and each view's definition (including the number of rows and columns in that view).

You can also control costs by suspending or resuming an MV; however, suspending maintenance typically only defers costs rather than reducing them. The longer that maintenance has been deferred, the more maintenance there is to do.

If you are concerned about the cost of maintaining MVs, we recommend you start slowly with this feature (that is, create only a few MVs on selected tables) and monitor the costs over time.

It's a good idea to carefully evaluate these guidelines based on your dbt deployment to see if querying from MVs will boost performance compared to base tables or regular views without cost overhead.

Query acceleration services

Sizing the warehouse just right for a workload is generally a hard trade-off between minimizing cost and maximizing query performance. You'll usually have to monitor, measure, and pick an acceptable point in this price-performance spectrum and readjust as required. Workloads that are unpredictable in terms of either the number of concurrent queries or the amount of data required for a given query make this challenging.
Multi-cluster warehouses handle the first case well and scale out only when there are enough queries to justify it. For the case where there is an unpredictable amount of data in the queries, you usually have to either wait longer for queries that look at larger data sets or resize the entire warehouse, which affects all clusters in the warehouse and the entire workload.

Snowflake's Query Acceleration Service provides a good default for the price-performance spectrum by automatically identifying and scaling out parts of the query plan that are easily parallelizable (for example, per-file operations such as filters, aggregations, scans, and join probes using bloom filters). The benefit is a much reduced query runtime at a lower cost than would result from just using a larger warehouse.

The Query Acceleration Service achieves this by elastically recruiting ephemeral worker nodes to lend a helping hand to the warehouse. Parallelizable fragments of the query plan are queued up for processing on leased workers, and the output of this fragment execution is materialized and consumed by the warehouse workers as a stream. As a result, a query over a large data set can finish faster, use fewer resources on the warehouse, and, potentially, cost fewer total credits than it would with the current model.

What makes this feature unique is:
• It supports filter types, including joins
• No specialized hardware is required
• You can enable, disable, or configure the service without disrupting your workload

This is a great feature to use in your dbt deployment if you are looking to:
• Accelerate long-running dbt queries that scan a lot of data
• Reduce the impact of scan-heavy outliers
• Scale performance beyond the largest warehouse size
• Speed up performance without changing the warehouse size

Please note that this feature is currently managed outside of dbt.

This feature is in private preview at the time of this white paper's first publication; please reach out to your Snowflake representative if you are interested in experiencing this feature with your dbt deployment.

RESOURCE MANAGEMENT AND MONITORING

A virtual warehouse consumes Snowflake credits while it runs, and the amount consumed depends on the size of the warehouse and how long it runs. Snowflake provides a rich set of resource management and monitoring capabilities to help control costs and avoid unexpected credit usage, not just for dbt transformation jobs but for all workloads.

Auto-suspend policies

The very first resource control that you should implement is setting auto-suspend policies for each of your warehouses. This feature automatically stops warehouses after they've been idle for a predetermined amount of time.

We recommend setting auto-suspend according to your workload and your requirements for warehouse availability:
• If you enable auto-suspend for your dbt workload, we recommend setting a more aggressive policy, with the standard recommendation being 60 seconds, because there is little benefit from caching.
• You might want to consider disabling auto-suspend for a warehouse if:
  – You have a heavy, steady workload for the warehouse.
  – You require the warehouse to be available with no delay or lag time. While warehouse provisioning is generally very fast (for example, 1 or 2 seconds), it's not entirely instant; depending on the size of the warehouse and the availability of compute resources to provision, it can take longer.

If you do choose to disable auto-suspend, you should carefully consider the costs associated with running a warehouse continually even when the warehouse is not processing queries. The costs can be significant, especially for larger warehouses (X-Large, 2X-Large, or larger).

We recommend that you customize auto-suspend thresholds for warehouses assigned to different workloads to assist in warehouse responsiveness:
• Warehouses used for queries that benefit from caching should have a longer auto-suspend period to allow for the reuse of results in the query cache.
• Warehouses used for data loading can be suspended immediately after queries are completed. Enabling auto-resume will restart a virtual warehouse as soon as it receives a query.
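As an example, auto-suspend and auto-resume can be set per warehouse with ALTER WAREHOUSE; the warehouse names below are hypothetical, and the thresholds simply illustrate the guidance above.

alter warehouse DBT_TRANSFORM_WH set
  auto_suspend = 60      -- seconds; aggressive suspend for dbt jobs, which gain little from caching
  auto_resume = true;

alter warehouse BI_REPORTING_WH set
  auto_suspend = 600;    -- longer idle window so interactive users can reuse the query cache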

Resource monitors

Resource monitors can be used by account administrators to impose limits on the number of credits that are consumed by different workloads, including dbt jobs, within each monthly billing period, by:
• User-managed virtual warehouses
• Virtual warehouses used by cloud services

When these limits are either close to being reached or have been reached, the resource monitor can send alert notifications or suspend the warehouses.

It is essential to be aware of the following rules about resource monitors:
• A monitor can be assigned to one or more warehouses.
• Each warehouse can be assigned to only one resource monitor.
• A monitor can be set at the account level to control credit usage for all warehouses in your account.
• An account-level resource monitor does not override resource monitor assignment for individual warehouses.
• If either the warehouse-level or account-level resource monitor reaches its defined threshold, the warehouse is suspended. This enables controlling global credit usage while also providing fine-grained control over credit usage for individual or specific warehouses.
• In addition, an account-level resource monitor does not control credit usage by the Snowflake-provided warehouses (used for Snowpipe, automatic reclustering, and MVs); the monitor controls only the virtual warehouses created in your account.

Considering these rules, the following are some recommendations on resource monitoring strategy:
• Define an account-level budget.
• Define priority warehouse(s), including warehouses for dbt workloads, and carve from the master budget for priority warehouses.
• Create a resource allocation story and map.

Figure 3 illustrates an example scenario for a resource monitoring strategy in which one resource monitor is set at the account level, and individual warehouses are assigned to two other resource monitors.

Figure 3: Example scenario for a resource monitoring strategy. Resource Monitor 1 (credit quota = 5,000) is set for the account; Resource Monitor 2 (credit quota = 1,000) is assigned to Warehouse 3; Resource Monitor 3 (credit quota = 2,500) is assigned to Warehouses 4 and 5.
In the example shown in Figure 3, the credit quota for the entire account is 5,000 per month; if this quota is reached within the interval, the actions defined for the resource monitor (Suspend, Suspend Immediate, and so on) are enforced for all five warehouses.

Warehouse 3 performs ETL, including ETL for dbt jobs. From historical ETL loads, we estimated it can consume a maximum of 1,000 credits for the month. We assigned this warehouse to Resource Monitor 2.

Warehouses 4 and 5 are dedicated to the business intelligence and data science teams. Based on their historical usage, we estimated they can consume a maximum combined total of 2,500 credits for the month. We assigned these warehouses to Resource Monitor 3.

Warehouses 1 and 2 are for development and testing. Based on historical usage, we don't need to place a specific resource monitor on them.

The credits consumed by Warehouses 3, 4, and 5 may be less than their quotas if the account-level quota is reached first.
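This scenario could be set up with statements along the following lines; a sketch that assumes the ACCOUNTADMIN role and hypothetical warehouse names (WH_ETL, WH_BI, WH_DS).

create resource monitor rm_account_monthly with
  credit_quota = 5000
  frequency = monthly
  start_timestamp = immediately
  triggers on 90 percent do notify
           on 100 percent do suspend;
alter account set resource_monitor = rm_account_monthly;

create resource monitor rm_etl with
  credit_quota = 1000
  frequency = monthly
  start_timestamp = immediately
  triggers on 100 percent do suspend;
alter warehouse WH_ETL set resource_monitor = rm_etl;

create resource monitor rm_bi_ds with
  credit_quota = 2500
  frequency = monthly
  start_timestamp = immediately
  triggers on 100 percent do suspend;
alter warehouse WH_BI set resource_monitor = rm_bi_ds;
alter warehouse WH_DS set resource_monitor = rm_bi_ds;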
The used credits for a resource monitor reflect the sum of all credits consumed by all assigned warehouses within the specified interval. If a monitor has a Suspend or Suspend Immediately action defined and its used credits reach the threshold for the action, any warehouses assigned to the monitor are suspended and cannot be resumed until one of the following conditions is met:
• The next interval, if any, starts, as dictated by the start date for the monitor.
• The credit quota for the monitor is increased.
• The credit threshold for the suspend action is increased.
• The warehouses are no longer assigned to the monitor.
• The monitor is dropped.

Resource monitors are not intended for strictly controlling consumption on an hourly basis; they are intended for tracking and controlling credit consumption per interval (day, week, month, and so on). Also, they are not intended for setting precise limits on credit usage (that is, down to the level of individual credits). For example, when credit quota thresholds are reached for a resource monitor, the assigned warehouses may take some time to suspend, even when the action is Suspend Immediate, thereby consuming additional credits.

If you wish to strictly enforce your quotas, we recommend the following:
• Utilize buffers in the quota thresholds for actions (for example, set a threshold to 90% instead of 100%). This will help ensure that your credit usage doesn't exceed the quota.
• To more strictly control credit usage for individual warehouses, assign only a single warehouse to each resource monitor. When multiple warehouses are assigned to the same resource monitor, they share the same quota thresholds, which may result in credit usage for one warehouse impacting the other assigned warehouses.

When a resource monitor reaches the threshold for an action, it generates one of the following notifications, based on the action performed:
• The assigned warehouses will be suspended after all running queries complete.
• All running queries in the assigned warehouses will be canceled and the warehouses suspended immediately.
• A threshold has been reached, but no action has been performed.

Notifications are disabled by default and can be received only by account administrators with the ACCOUNTADMIN role. To receive notifications, each account administrator must explicitly enable notifications through their preferences in the web interface. In addition, if an account administrator chooses to receive email notifications, they must provide (and verify) a valid email address before they will receive any emails.

We recommend having well-defined naming conventions to separate warehouses between hub and spokes for tracking, governance (RBAC), and resource monitors for consumption alerts.

Naming conventions

Your resource monitor naming conventions are a foundation for tracking, allocation, and reporting. They should follow an enterprise plan for the domain (that is, function/market + environment). They should also align to your virtual warehouse naming convention when more granularity is needed.
The following is a sample naming convention:

<domain>_<team>_<function>_<base_name>

<team>: The name of the team (for example, engineering, analytics, data science, service, and so on) that the warehouses being monitored have been allocated to. When used, this should be the same as the team name used within the names of the warehouses.

<function>: The processing function (for example, development, ELT, reporting, ad hoc, and so on) generally being performed by the warehouses to be monitored. When used, this should be the same as the processing function name used within the names of the warehouses.

<base_name>: A general-purpose name segment to further distinguish one resource monitor from another. When used, this may be aligned with the base names used within the names of the warehouses, or it may be something more generic to represent the group of warehouses.

An example of applying the naming convention above might look something like this:

USA_WAREHOUSES: A resource monitor set to monitor and send alerts for all warehouses allocated to the USA spoke.

USA_PRD_DATASCIENCE_ADHOC: A resource monitor set to monitor and send alerts for just the single production data science warehouse for the USA.

USA_PRD_SERVICE_WAREHOUSES: A resource monitor set to monitor and send alerts for all production services (for example, ELT, reporting tools, and so on) warehouses for the USA.

Role-based access control (RBAC)

Team members have access only to their assigned database and virtual warehouse resources to ensure accurate cost allocation.

Monitoring

An important first step to managing credit consumption is to monitor it. Snowflake offers several capabilities to closely monitor resource consumption.

The first such resource is the Admin Billing and Usage page in the web interface, which offers a breakdown of consumption by day and hour for individual warehouses as well as for cloud services. This data can be downloaded for further analysis. Figure 4 through Figure 6 show example credit, storage, and data transfer consumption from the Snowsight dashboard.

Figure 4: Example credit consumption from the Snowsight dashboard
Figure 5: Example storage consumption from the Snowsight dashboard

Figure 6: Example data transfers consumption from the Snowsight dashboard

For admins who are interested in diving even deeper into resource optimization, Snowflake provides the account usage and information schemas. These tables offer granular details on every aspect of account usage, including for roles, sessions, users, individual queries, and even the performance or "load" on each virtual warehouse. This historical data can be used to build advanced forecasting models to predict future credit consumption. This trove of data is especially important to customers who have complex multiaccount organizations.

Monitoring credit usage

The following views are useful for monitoring credit consumption:
• METERING_DAILY_HISTORY: Daily credit usage and rebates across all service types within the last year
• WAREHOUSE_METERING_HISTORY: Hourly credit usage per warehouse within the last year
• QUERY_HISTORY: A record of every query (including SQL text), elapsed and compute time, and key statistics

Monitoring storage usage

The following views are useful for monitoring storage:
• DATABASE_STORAGE_USAGE_HISTORY: Average daily usage (bytes) by database
• TABLE_STORAGE_METRICS: Detailed storage records for tables
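As a small illustration (a sketch that assumes the SNOWFLAKE.ACCOUNT_USAGE share is accessible to your role), daily credit consumption per warehouse over the last 30 days can be pulled from WAREHOUSE_METERING_HISTORY:

select
    warehouse_name,
    date_trunc('day', start_time) as usage_date,
    sum(credits_used)             as credits_used
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_timestamp())
group by 1, 2
order by 1, 2;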

The account usage and information schemas can be queried directly using SQL or analyzed and charted using Snowsight. The example provided below is of a load monitoring chart. To view the chart, click Warehouses in the web interface. As shown in Figure 7, the Warehouse Load Over Time page provides a bar chart and a slider for selecting the window of time to view in the chart.

Figure 7: Warehouse Load Over Time page

dbt offers a package called the Snowflake spend package that can be used to monitor Snowflake usage. Refer to the dbt package section of this white paper for more details.

Many third-party BI vendors offer pre-built dashboards that can be used to automatically visualize this data, including the ability to forecast future usage. We recommend sharing the account usage dashboards offered by your customers' preferred BI vendors to help them gain visibility on their Snowflake usage and easily forecast future usage. Figure 8 shows an example from Tableau.²

Figure 8: Tableau dashboard for monitoring performance
INDIVIDUAL DBT WORKLOAD ELASTICITY

Snowflake supports two ways to scale warehouses:
• Scale up by resizing a warehouse.
• Scale out by adding warehouses to a multi-cluster warehouse (requires Snowflake Enterprise Edition or higher).

Resizing a warehouse generally improves query performance, particularly for larger, more complex queries. It can also help reduce the queuing that occurs if a warehouse does not have enough compute resources to process all the queries that are submitted concurrently. Note that warehouse resizing is not intended for handling concurrency issues. Instead, in such cases, we recommend you use additional warehouses or use a multi-cluster warehouse (if this feature is available for your account).

Snowflake supports resizing a warehouse at any time, even while running. If a query is running slowly and you have additional queries of similar size and complexity that you want to run on the same warehouse, you might choose to resize the warehouse while it is running; however, note the following:
• As stated earlier, larger is not necessarily faster; for smaller, basic queries that are already executing quickly, you may not see any significant improvement after resizing.
• Resizing a running warehouse does not impact queries that are already being processed by the warehouse; the additional compute resources, once fully provisioned, are used only for queued and new queries.
• Resizing from a 5XL or 6XL warehouse to a 4XL or smaller warehouse will result in a brief period during which you are charged for both the new warehouse and the old warehouse while the old warehouse is quiesced.
Figure 9: User running an X-Small virtual warehouse

Figure 10: User resizes the warehouse to X-Large

Scaling up for performance

The purpose of scaling up is to improve query performance and save cost. Let's look at an example to illustrate this.

A user running an X-Small virtual warehouse is illustrated in Figure 9. The user executes an ALTER WAREHOUSE statement to resize the warehouse to X-Large, as shown in Figure 10. As a result, the number of nodes increases from 1 to 16. During the resize operation, any currently running queries are allowed to complete, and any queued or subsequent queries are started on the newly allocated virtual warehouse.

Note that if you start a massive task and amend the warehouse size while the query is executing, it will continue to execute on the original warehouse size. This means you may need to kill and restart a large running task to gain the benefits of the larger warehouse. Also note that it is not possible to automatically adjust warehouse size. However, you could script the ALTER WAREHOUSE statement to automate the process as part of a batch ETL operation, for example.
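For example, a batch job could wrap a heavy transformation in a pair of resize statements; the warehouse name TRANSFORM_WH is hypothetical.

alter warehouse TRANSFORM_WH set warehouse_size = 'XLARGE';   -- scale up before the heavy step

-- ... run the large transformation here ...

alter warehouse TRANSFORM_WH set warehouse_size = 'XSMALL';   -- scale back down once it completes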

Let's now look at some benchmark data. Below is a simple query, similar to many ETL queries in practice, to load 1.3 TB of data. It was executed on various warehouse sizes.

create table terabyte_sized_copy as
select *
from sample_data.tpcds_sf10tcl.store_sales;

The table below shows the elapsed time and cost for different warehouses.

T-SHIRT SIZE    ELAPSED TIME                 COST (CREDITS)
X-Small         5 hours and 30 minutes       5.5
Small           1 hour and 53 minutes        3.7
Medium          1 hour and 0 minutes         4.0
Large           37 minutes and 57 seconds    5.0
X-Large         16 minutes and 7 seconds     4.2
2X-Large        7 minutes and 41 seconds     4.0
3X-Large        4 minutes and 52 seconds     5.1
4X-Large        2 minutes and 32 seconds     5.4
Improvement     132x                         Same

Here are some interesting observations from the table above:
• For a large operation, as the warehouse size increases, the elapsed time drops by approximately half.
• Each step up in warehouse size doubles the cost per hour.
• However, since the warehouse can be suspended after the task is completed, the actual cost of each operation is approximately the same.
• Going from X-Small to 4X-Large yields a 132x performance improvement with the same cost. This clearly illustrates how and why scaling up helps to improve performance and save cost.
• Compute resources can be dynamically scaled up, down, or out for each individual workload based on demand, and they can also suspend automatically to stop incurring cost, which is based on per-second billing.
• New 5XL and 6XL virtual warehouse sizes are now available on AWS and in public preview on Azure at the time of this white paper's first publication. These sizes give users the ability to add more compute power to their workloads and enable faster data loading, transformations, and querying. Previously, customers who needed to support compute-intensive workloads for data processing had to do batch processing and use multiple 4XL warehouses to accomplish their tasks. The new 5XL and 6XL virtual warehouse sizes give users the ability to run larger compute-intensive workloads in a performant fashion without any batching.

For a dbt workload, you should be strategic about what warehouse size you use. By default, dbt will use the warehouse declared in the connection. If you want to adjust the warehouse size, you can either declare a static warehouse configuration at the model or project level or use a dynamic macro such as the one shared in the snowflake_utils package. This allows you to automate selection of the warehouse used for your models without manually updating your connection. Our recommendation is to use a larger warehouse for incremental full-refresh runs where you are rebuilding a large table from scratch.
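One way to do this, sketched below under the assumption that your dbt-snowflake adapter supports the snowflake_warehouse model config and with placeholder warehouse, model, and column names, is to pick a bigger warehouse only when a model is being fully refreshed:

{{ config(
    materialized = 'incremental',
    unique_key = 'order_id',
    snowflake_warehouse = ('TRANSFORM_XL_WH' if flags.FULL_REFRESH else 'TRANSFORM_XS_WH')
) }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Normal incremental runs then stay on the small warehouse, while a --full-refresh rebuild of the large table runs on the X-Large warehouse.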

Scaling out for concurrency

Multi-cluster warehouses are best utilized for scaling resources to improve concurrency for users and queries. They are not as beneficial for improving the performance of slow-running queries or data loading; for those types of operations, resizing the warehouse provides more benefits.

Figure 11 illustrates a customer running queries against an X-Small warehouse. The performance is satisfactory, but in this example, the customer anticipates that there will soon be a dramatic change in the number of users online. In Figure 12, the customer executes an ALTER WAREHOUSE command to enable the multi-cluster warehouse feature. This command might look like:

alter warehouse PROD_VWH set
  min_cluster_count = 1
  max_cluster_count = 10;

Figure 11: Customer runs queries against an X-Small warehouse
Figure 12: Customer executes an ALTER WAREHOUSE command to enable the multi-cluster warehouse feature (automatically scale out: 1–10 same-size clusters)

The system will automatically scale out by adding additional clusters of the same size as additional concurrent users run queries. The system also will automatically scale back as demand is reduced. As a result, the customer pays only for resources that were active during the period.

In cases where a large load is anticipated from a pipeline or from usage patterns, the min_cluster_count parameter can be set beforehand to bring all compute resources online. This will reduce the delays in bringing compute online, which usually happens only after query queuing and only gradually, with a cluster every 20 seconds.

WRITING EFFECTIVE SQL STATEMENTS

To optimize performance, it's crucial to write effective SQL queries in dbt for execution on Snowflake.

Query order of execution

A query is often written in this order:

SELECT
FROM
JOIN
WHERE
GROUP BY
ORDER BY
LIMIT
Figure 13: The order of query execution: row operations (FROM, JOIN, WHERE), then group operations (GROUP BY, HAVING), then the result (SELECT, ORDER BY, LIMIT)

The order of execution for this query in Snowflake is shown in Figure 13 above. Accordingly, the example above would execute in the following order:

Step 1: FROM clause (cross-product and join operators)
Step 2: WHERE clause (row conditions)
Step 3: GROUP BY clause (sort on grouping columns, compute aggregates)
Step 4: HAVING clause (group conditions)
Step 5: ORDER BY clause
Step 6: Columns not in SELECT eliminated (projection operation)

SQL first determines which data tables it will work with, then it applies the filters, after which it groups the data. Finally it retrieves the data—and, if necessary, sorts it and returns only the first <X> rows.

Applying filters as early as possible

As you can see from the order of execution, ROW operations are performed before GROUP operations. Thus, it's important to think about optimizing ROW operations before GROUP operations in your query. It's recommended to apply filters early, at the WHERE-clause level.

Querying only what you need

Snowflake uses a columnar format to store data, so the number of columns retrieved from a query matters a great deal for performance. Best practice is to select only the columns you need. You should:
• Avoid using SELECT * to return all columns
• Avoid selecting long string columns or entire VARIANT columns that you don't need

Joining on unique keys

Joining on nonunique keys can make your data output explode in magnitude, for example, where each row in table1 matches multiple rows in table2. Figure 14 shows an example execution plan where this happens, wherein the JOIN operation is the most costly operation.

Figure 14: Example execution plan in which the JOIN is costly

Best practices for JOIN operations are:
• Ensuring keys are distinct (deduplicate)
• Understanding the relationships between your tables before joining
• Avoiding many-to-many joins
• Avoiding unintentional cross-joins
Avoiding complex functions and UDFs in WHERE clauses

While built-in functions and UDFs can be tremendously useful, they can also impact performance when used in query predicates. Figure 15 is an example of this scenario, in which a log function is used where it should not be used.

Figure 15: Example of using a log function inappropriately
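Pulling these guidelines together, a query that follows them might look like the sketch below; the tables and columns (orders, customers, and so on) are hypothetical.

select
    o.order_id,                                  -- select only the columns you need
    o.order_date,
    c.customer_name
from orders o
join (
    select distinct customer_id, customer_name  -- deduplicate so the join key is unique
    from customers
) c
    on o.customer_id = c.customer_id            -- join on a unique key, not a many-to-many relationship
where o.order_date >= '2021-01-01'              -- filter as early as possible, at the WHERE-clause level
  and o.order_date <  '2021-02-01';             -- compare the raw column rather than wrapping it in a
                                                -- function, which preserves micro-partition pruning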
OPTIMIZING DBT

dbt is a transformation workflow that lets analytics engineers transform data by simply writing SQL statements. At its core, the way it operates with Snowflake is by compiling the SQL for Snowflake to execute. This means you can perform all of your data transformations inside of your data warehouse, making your process more efficient because there's no need for data transference. You also get full access to Snowflake's extensive analytics functionalities, now framed by the dbt workflow. In this section, we discuss specific dbt best practices that optimize Snowflake resources and functionalities. For broader dbt best practices, check out this Discourse post.

Use environments

Mitigate risk by defining environments in Snowflake and dbt. Making use of distinct environments may not be new in the world of software engineering, but it definitely can be in the world of data. The primary benefit of using clearly defined production and development environments is the mitigation of risk: in particular, the risk of costly rebuilds if anything breaks in production. With dbt and Snowflake, you can define cohesive environments and operate in them with minimal friction. Before even beginning development work in dbt, you should create and strictly implement these environments.

On the Snowflake layer, the account should be set up with, at a minimum, separation of raw and analytics databases, as well as with clearly defined production and development schemas. There are different iterations of this setup, and you should create what meets the needs of your workflow. The goal here is to remove any confusion as to where objects should be built during the different stages of development and deployment.

dbt developers should have control of their own development sandboxes so they can safely build any objects they have permissions to build. A sandbox often takes the form of a personal schema to ensure that other developers don't accidentally delete or update objects. To learn more, check out this blog post.

On the dbt layer, environment definitions consist of two things: the connection details and a dbt concept called target.

When setting up your connection, you provide a data warehouse and schema. Those will be the default Snowflake schema and database you will be building objects into.

Meanwhile, how your target comes into play differs slightly depending on the dbt interface you are using. If you're using the command line, the target is the connection you wish to connect to (and thus the default schema/database). You can also use the target to apply Jinja conditions in your code, allowing you to adjust the compiled code based on the target. If you're using dbt Cloud, the target can be used only to apply conditional logic; the default schema/database will be defined in the environment settings.

As a best practice for development, the default schema should be your sandbox schema, while for production, the default should be a production schema. As a project grows in size, you should define custom databases/schemas either via hard-coding or via dynamic logic using targets so that, depending on the environment you're working in, the database/schema changes to the associated Snowflake environment.

When you combine environments with the ref function, code promotion is dramatically simplified. The ref function dynamically changes the object being referenced based on the environment, without you having to write conditional logic.
This means that when you select from a referenced object, dbt will automatically know the appropriate schema and/or database to interpolate.

This makes it possible for your code to never have to change as it's promoted from development to production, because dbt is always aware of the underlying environment. Figure 16 (below) shows an example of how a dbt model relates to a Snowflake database. You can configure the dbt model df_model to explicitly build into the Snowflake database df_{environment} every time or based on conditional logic.

In addition to creating clearly defined environments, there is an additional cost (and time) saving measure that target makes possible. During development, you may find that you often need only a subset of your data set to test and iterate over. A good way to limit your data set in this way is to implement conditional logic to limit data in dev.

Such macros can automate when a filter is applied and ensure only a limited amount of data is run. This allows you to do away with the hassle of remembering to apply and remove data limitations through environments.

To more systematically apply this through a project, a good practice is to put the conditional logic into a macro and then call the macro across models. This allows updating the logic in only one place. You can also implement a variable in the logic to adjust the time period specified in the WHERE clause (with a default date that can be overridden in a run).

Here is sample code for a macro that you can call in a dbt model to add the WHERE clause when the target is dev:

{% macro limit_in_dev(timestamp) %}
-- this filter will only apply during a dev run
{% if target.name == 'dev' %}
where {{ timestamp }} > dateadd('day', -{{ var('development_days_of_data') }}, current_date)
{% endif %}
{% endmacro %}
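A model can then call the macro in place of a hard-coded filter; this is a minimal sketch that assumes the macro above lives in your project's macros directory, that development_days_of_data is defined under vars in dbt_project.yml, and that the model and column names are hypothetical.

select
    order_id,
    customer_id,
    order_date
from {{ ref('stg_orders') }}
{{ limit_in_dev('order_date') }}

When the target is dev, the compiled SQL includes the WHERE clause that limits the date range; in production the macro renders to nothing and the full data set is processed.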
For larger projects, you can also use macros to limit rebuilding existing objects. By operationalizing the Snowflake Zero-Copy Cloning feature, you can ensure that your environments are synced up by cloning from another environment to stay up to date. This is fantastic for developers who prefer to simply clone from an existing development schema or from the production schema to have all the necessary objects to run the entire project and update only what is necessary. By putting this macro into your project, you ensure that developers are writing the correct DDL every time, because all they have to do is execute it rather than manually write it every time.

Figure 16: Example of how a dbt model relates to a Snowflake database
USE THE REF() FUNCTION AND SOURCES

Always use the ref function and sources in combination with threads.

To best leverage Snowflake's resources, it's important to carefully consider the design of your dbt project. One key way to do that is to ensure you are using the ref() and source() functions in every dbt model, rather than hard-coding database objects.

The ref function is a keystone of dbt's functionality. By using the function, dbt is able to infer dependencies and ensure that the correct upstream tables and views are selected based on your environment. Simply put, it makes sense to always use the ref function when selecting from another model, rather than using the direct relation reference (for example, my_schema.my_table).

When you use the ref function, dbt automatically establishes a lineage from the model being referenced to the model where that reference is declared, and then uses it to optimize the build order and document lineage.

After the ref() function creates the directed acyclic graph (DAG), dbt is able to optimally execute models based on the DAG and the number of threads, or maximum number of paths through the graph dbt is allowed to work on. As you increase the number of threads, dbt increases the number of paths in the graph that it can work on at the same time, thus reducing the runtime of your project.

Our recommendation is to start with eight threads (meaning up to eight parallel models that do not violate dependencies can be run at the same time), and then increase the number of threads as your project expands. While there is no maximum number of threads you can declare, it's important to note that increasing the number of threads increases the load on your warehouse, potentially constraining other usage.

The number of concurrent models being run is also a factor of your project's dependencies. For that reason, we recommend structuring your code as multiple models, maximizing the number that can be run simultaneously.

As your project expands, you should continue to increase the number of threads while keeping an eye on your Snowflake compute. Hitting compute limitations as you increase the number of threads may be a good signal that it's time to increase the Snowflake warehouse size as well.

Figure 17 shows a sample dbt DAG. In this example, if a user declared three threads, dbt would know to run the first three staging models prior to running dim_suppliers. By specifying three threads, dbt will work on up to three models at once without violating dependencies; the actual number of models it can work on is constrained by the available paths through the dependency graph.
Figure 17: A sample dbt DAG
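A minimal sketch of the pattern: a downstream model selects from upstream models with ref() rather than hard-coded relations, which is what lets dbt build the DAG above. The model and column names here are illustrative, loosely following the TPC-H naming used in this paper.

-- models/marts/dim_suppliers.sql (illustrative)
select
    suppliers.supplier_key,
    suppliers.supplier_name,
    nations.nation_name
from {{ ref('stg_tpch_suppliers') }} as suppliers
left join {{ ref('stg_tpch_nation') }} as nations
    on suppliers.nation_key = nations.nation_key

At compile time, dbt resolves each ref() to the correct database and schema for the current target, so the same code runs unchanged across environments.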

Sources work similarly to the ref() function, with the key distinction being that rather than telling dbt how a model relates to another model, sources tell dbt how a model relates to a source object. Declaring a dependency from a model to a source in this way enables a couple of important things: It allows you to select from source tables in your models, and it opens the door to more extensive project testing and documentation involving your source data. Figure 18 (below) shows a sample dbt DAG including a source node. The green node represents the source table that stg_tpch_nation has a dependency on.

WRITE MODULAR, DRY CODE

Use Jinja to write DRY code and operationalize Snowflake administrative workflows.

dbt allows you to use Jinja, a Pythonic templating language that can expand on SQL's capabilities. Jinja gives you the ability to use control structures and apply environment variables.

Pieces of code written with Jinja that can be reused throughout a dbt project are called macros. They are analogous to functions in other programming languages, allowing you to define code in one central location and reuse it in other places. The ref and source functions mentioned above are examples of Jinja.

In addition to being helpful for environmental logic, macros can help operationalize Snowflake administrative tasks such as grant statements, or they can systematically remove deprecated objects.

In the past, during object creation, there often needed to be a parallel administrative workflow alongside development that ensured proper permissions were granted on Snowflake objects. Today all of this can be done via Snowflake GRANT statements. dbt adds another layer of functionality here: It allows you to take all your GRANT statements, ensure they are run consistently, and version control them for simple auditability.

Consider, for example, a macro written to apply GRANT statements. Such a macro, once implemented as a dbt hook, ensures that the GRANT statements are run after every dbt run, thus ensuring the right roles have access to newly created or future objects.

Similarly, as projects grow in maturity, it's common for them to have deprecated or unused objects in Snowflake. dbt allows you to maintain a standardized approach for culling such objects using a similar macro. This allows you to operationalize how you tidy up your instance and to ensure that it is done on a schedule (via a dbt job).

Macros, in addition to making your SQL more flexible, allow you to compartmentalize your Snowflake administrative code and run it in a systematic fashion.
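As a hedged sketch of what such a grant macro might look like (the schema and role names below are placeholders, and this is an illustration rather than an exact, production-ready implementation):

{% macro grant_select(schema, role) %}
  {% if execute %}
    {% set grant_statements = [
        "grant usage on schema " ~ schema ~ " to role " ~ role,
        "grant select on all tables in schema " ~ schema ~ " to role " ~ role,
        "grant select on all views in schema " ~ schema ~ " to role " ~ role
    ] %}
    {% for grant_statement in grant_statements %}
      {% do run_query(grant_statement) %}
    {% endfor %}
    {{ log("Granted select on schema " ~ schema ~ " to role " ~ role, info=True) }}
  {% endif %}
{% endmacro %}

Calling the macro from an on-run-end hook in dbt_project.yml (for example, on-run-end: "{{ grant_select(target.schema, 'reporter') }}") runs the grants automatically at the end of every dbt invocation.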
Figure 18: A sample dbt DAG including a source node
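To make the source pattern behind Figure 18 concrete, a staging model such as stg_tpch_nation selects from a declared source with the source() function. This sketch assumes a source named tpch, containing a nation table, has been declared in a sources .yml file in the project (not shown here).

-- models/staging/stg_tpch_nation.sql (illustrative)
select
    n_nationkey as nation_key,
    n_name      as nation_name,
    n_regionkey as region_key
from {{ source('tpch', 'nation') }}

With the source declared, dbt can run tests, documentation, and source freshness checks against the raw table, and the source-to-model lineage appears in the DAG.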

USE DBT TESTS AND DOCUMENTATION

Have at least one dbt test and one model-level description associated with each model.

Robust systems of quality assurance and discovery are key to establishing organizational trust in data. This is where dbt tests and documentation are invaluable.

dbt tests allow you to validate assumptions about your data. Tests are an integral part of a CI/CD workflow, allowing you to mitigate downtime and prevent costly rebuilds. Over time, tests not only save you debugging time, but they also help optimize your usage of Snowflake resources, so you're using them where they are most valuable rather than to fix preventable mistakes.

We recommend that, unless there is a compelling reason not to, every dbt model have a test associated with it. Primary key tests are a good default, as a failure there points to a granularity change.

When you implement a CI/CD process, be sure to use Slim CI builds for systematic quality checks. With Slim CI, you don't have to rebuild and test all your models; you can instead instruct dbt to run jobs on only modified or new resources. This allows you to use the node selector state:modified to run only models that have changes, which is much more resource-efficient.

dbt Documentation ensures that your data team and your data stakeholders have the resources they need for effective data discovery. The documentation brings clarity and consistency to the data models your team ships, so you can collectively focus on extracting value from the models instead of trying to understand them.

Every dbt model should be documented with a model description and, when possible, column-level descriptions. Use doc blocks to create a description in one file and apply it throughout the project; these are particularly useful for column descriptions that appear on multiple models.

If you're interested in documentation on the Snowflake side, apply query tags to your models. Query tags let you conveniently identify in Snowflake's query history which model generated a given query. You can get as granular as is convenient, either by implementing model-specific query tags that attribute each query to a specific dbt model, or by setting one automatically at the project level, such as with the following macro:
{% macro set_query_tag() -%}

  {% set new_query_tag = model.name %} {# always use model name #}

  {% if new_query_tag %}
    {% set original_query_tag = get_current_query_tag() %}
    {{ log("Setting query_tag to '" ~ new_query_tag ~ "'. Will reset to '" ~ original_query_tag ~ "' after materialization.") }}
    {% do run_query("alter session set query_tag = '{}'".format(new_query_tag)) %}
    {{ return(original_query_tag) }}
  {% endif %}

  {{ return(none) }}

{% endmacro %}
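Returning to the testing recommendation above: a primary key check is typically declared as unique and not_null tests in a model's .yml file, but it can also be written as a singular test. The following is a minimal sketch with hypothetical model and column names; a singular test fails if it returns any rows.

-- tests/assert_stg_orders_order_id_is_unique.sql (hypothetical)
select
    order_id,
    count(*) as occurrences
from {{ ref('stg_orders') }}
group by order_id
having count(*) > 1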

USE PACKAGES

Don't reinvent the wheel. Use packages to help scale up your dbt project quickly.

Packages can be described as dbt's version of Python libraries. They are shareable pieces of code that you can incorporate into your own project to help you tackle a problem someone else has already solved or to share your knowledge with others. Packages allow you to free up your time and energy to focus on implementing your own unique business logic.

Some key packages live on the dbt Package Hub. There, you can find packages that simplify things such as:

• Transforming data from a consistently structured SaaS data set

• Writing dbt macros that solve the question "How do I write this in SQL?"

• Navigating models and macros for a particular tool in your data stack

Every dbt project on Snowflake should have at least the dbt_utils package installed. This is an invaluable package that provides macros to help you write common data logic, such as creating a surrogate key or a list of dates to join on. This package will help you scale up your dbt project much faster.

If you're using the Snowflake Dynamic Data Masking feature, we recommend using the dbt_snow_mask package. This package provides pre-written macros to operationalize your dynamic masking application in a way that's scalable and follows best practices.

The snowflake_spend package is another great package that allows you to easily implement analytics for your Snowflake usage. You can use it to model in detail how your warehouses are being used, so you can make sure you use only the resources you actually want to use. We recommend pairing this package with a job in dbt Cloud, so you can easily set up alerting in case your usage crosses a certain threshold.

At larger organizations, it is not uncommon to create custom internal packages that are shared among teams. This is a great way to standardize logic and definitions as projects expand across multiple repositories (something we discuss in a later section). Doing this ensures that projects are aligned on companywide definitions of, for example, what a customer is, and it limits the amount of WET (write every time) code.

BE INTENTIONAL ABOUT YOUR MATERIALIZATIONS

Choose the right materialization for your current needs and scale.

One of the easiest ways to fine-tune performance and control your runtimes is via materializations. Materializations are build strategies for how your dbt models persist in Snowflake. Four materializations are supported out of the box by dbt: view, table, incremental, and ephemeral.

By default, dbt models are materialized as views. Views are saved queries that are always up to date, but they do not store results for faster querying later. If an alternative materialization is not declared, dbt will create a view. View materializations are a very natural starting point in a new project.

As the volume of your data increases, however, you will want to look into alternative materializations that store results and thus front-load the time spent when you query an object. The next step up is a table materialization, which stores results as a queryable table. We recommend this materialization for any models queried by BI tools, or simply when you are querying a larger data set.

Incremental materialization offers a way to improve build time without compromising query speed. Incremental models materialize as tables in Snowflake, but they have more complex underlying DDL and therefore more complex configuration. They reduce build time by transforming only the records you declare to be new (via logic you supply).

In addition to the materializations outlined above, you also have the option of writing your own custom materializations in your project and then using them in the same way as you would use the materializations that come with dbt. This enables you, for example, to declare a model to be materialized as a materialized_view while retaining the same abilities: maintaining lineage with the ref function, testing, and documentation.
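As a sketch of the incremental pattern described above (the model and column names are hypothetical, and the filter shown is one common approach rather than the only one):

{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- on incremental runs, only transform rows that arrived since the last build of this model
where order_date > (select max(order_date) from {{ this }})
{% endif %}

On the first run (or with --full-refresh), dbt builds the whole table; on subsequent runs it processes only the filtered rows and merges them into the existing table on the unique_key.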
OPTIMIZE FOR SCALABILITY

Even when they start lean, dbt projects can expand in scale very quickly. We have seen dbt projects with about 100 models expand, with good reason, to over 1,000 for large enterprises. Because of this, we recommend the following approaches to help you avoid issues down the line.

Plan for project scalability from the outset

Being proactive about project scalability requires that you have a good understanding of how your team members work with each other and what your desired workflow looks like. We recommend reading this Discourse post as an overview of the factors involved and then considering what options are right for your team.

Generally speaking, we recommend maintaining the mono-repository approach as long as possible. This allows you to have the simplest possible git workflow and provides a single pane through which to oversee your project.

As your project and data team scale, you may want to consider breaking the project up into multiple repositories to simplify the processes of approval and code promotion. If you do this, we recommend you make sure your Snowflake environments are aligned with this approach and that there is a continual, clear distinction regarding which project, git branch, and users are building into which Snowflake database or schema.

Follow a process for upgrading dbt versions

One of the ways teams get caught off guard is by not establishing how they plan to go about upgrading dbt. This can lead to teams deciding to forgo upgrading entirely, or to different team members running different versions, which has downstream effects on which dbt features can be leveraged in the project. Staying on top of upgrading your dbt version ensures you have access to the latest dbt functionality, including support for new Snowflake features.

Our recommended method for upgrading dbt is to use a timeboxed approach. Start by reading the relevant changelog and migration guides to get a sense of what changes might be needed for your dbt project. Next, implement a "timebox" for testing the upgrade and, if possible, require either every user or a group of power users to upgrade to the latest dbt version for a set amount of time (such as one hour). During that time, make clear that there should be no merges to production and that users should develop only on the updated version. If the test goes smoothly, you can then have everyone on your team upgrade to the latest version both in the IDE and locally (or continue with their updated version, as the case may be). If, on the other hand, the test was not successful, you can make an informed decision on whether your team will stay on the newest release or roll back to the previous dbt version, and then plan the next steps to upgrade at a later date.

CONCLUSION

Modern businesses need a modern data strategy built on platforms that support agility, scalability, and operational efficiency. dbt and Snowflake are two technologies that work together to provide just such a platform, and they are capable of unlocking tremendous value when used together. Following the best practices highlighted in this white paper allows you to capture as much of that value as possible while minimizing the resources expended.
CONTRIBUTORS

Contributors to this document include:

• BP Yau
Senior Partner Sales Engineer, Snowflake

• Amy Chen
Partner Solutions Architect, dbt Labs

REVIEWERS

Thanks to the following individuals and


organizations for reviewing this document:

• Dmytro Yaroshenko
Principal Data Platform Architect, Snowflake

• Jeremiah Hansen
Principal Data Platform Architect, Snowflake

• Brad Culberson
Principal Data Platform Architect, Snowflake

• Azzam Aijazi
Senior Product Marketing Manager, dbt Labs

DOCUMENT REVISIONS

Date: September 2021


Description: First publication

ABOUT DBT LABS
dbt Labs was founded to solve the workflow problem in analytics, and created dbt to help.
With dbt, anyone on the data team can model, test, and deploy data sets using just SQL.

By applying proven software development best practices like modularity, version control,
testing, and documentation, dbt’s analytics engineering workflow helps data teams work
faster and more efficiently to bring order to organizational knowledge. Getdbt.com.

ABOUT SNOWFLAKE
Snowflake delivers the Data Cloud—a global network where thousands of organizations mobilize
data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations
unite their siloed data, easily discover and securely share governed data, and execute diverse analytic
workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across
multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the
Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data
application development, and data sharing. Join Snowflake customers, partners, and data providers
already taking their businesses to new frontiers in the Data Cloud. Snowflake.com.

©2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).

