Best Practices For Optimizing Your DBT and Snowflake Deployment
WHITE PAPER
TABLE OF CONTENTS

Automated resource optimization for dbt query tuning
- Automatic clustering
- Materialized views
- Query acceleration services
Resource management and monitoring
- Auto-suspend policies
- Resource monitors
- Naming conventions
- Monitoring
- Plan for project scalability from the outset
- Follow a process for upgrading dbt versions
Conclusion
Contributors
Reviewers
Document Revisions
About dbt Labs
About Snowflake
INTRODUCTION

Companies in every industry acknowledge that data is one of their most important assets. And yet, companies consistently fall short of realizing the potential of their data.

Why is this the case? One key reason is the proliferation of data silos, which create expensive and time-consuming bottlenecks, erode trust, and render governance and collaboration nearly impossible.

This is where Snowflake and dbt come in.

The Snowflake Data Cloud is one global, unified system connecting companies and data providers to relevant data for their business. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds, eliminating previous silos.

dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices such as modularity, portability, CI/CD, and documentation. With dbt, anyone who knows SQL can contribute to production-grade data pipelines.

By combining dbt with Snowflake, data teams can collaborate on data transformation workflows while operating out of a central source of truth. Snowflake and dbt form the backbone of a data infrastructure designed for collaboration, agility, and scalability.

When Snowflake is combined with dbt, customers can operationalize and automate Snowflake's hallmark scalability within dbt as part of their analytics engineering workflow. The result is that Snowflake customers pay only for the resources they need, when they need them, which maximizes efficiency and results in minimal waste and lower costs.

This paper will provide some best practices for using dbt with Snowflake to create this efficient workflow.

WHAT IS SNOWFLAKE?

Snowflake's Data Cloud is a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations have a single unified view of data so they can easily discover and securely share governed data, and execute diverse analytics workloads. Snowflake provides a tightly integrated analytics data platform as a service, billed based on consumption. It is faster, easier to use, and far more flexible than traditional data warehouse offerings.

Snowflake uses a SQL database engine and a unique architecture designed specifically for the cloud. There is no hardware (virtual or physical) or software for you to select, install, configure, or manage. In addition, ongoing maintenance, management, and tuning are handled by Snowflake.

All components of Snowflake's service (other than optional customer clients) run in a secure cloud infrastructure.

Snowflake is cloud-agnostic and uses virtual compute instances from each cloud provider (Amazon EC2, Azure VM, and Google Compute Engine). In addition, it uses object or file storage from Amazon S3, Azure Blob Storage, or Google Cloud Storage for persistent storage of data. Due to Snowflake's unique architecture and cloud independence, you can seamlessly replicate data and operate from any of these clouds simultaneously.

SNOWFLAKE ARCHITECTURE

Snowflake's architecture is a hybrid of traditional shared-disk database architectures and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using massively parallel processing (MPP) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.

As shown in Figure 1, Snowflake's unique architecture consists of three layers built upon a public cloud infrastructure:

• Cloud services: Cloud services coordinate activities across Snowflake, processing user requests from login to query dispatch. This layer provides optimization, management, security, sharing, and other features.

• Multi-cluster compute: Snowflake processes queries using virtual warehouses. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from Amazon EC2, Azure VM, or Google Cloud Compute. Each virtual warehouse has independent compute resources, so high demand in one virtual warehouse has no impact on the performance of other virtual warehouses. For more information, see "Virtual Warehouses" in the Snowflake documentation.
• Centralized storage: Snowflake uses Amazon S3, Azure Blob Storage, or Google Cloud Storage to store data in its internal optimized, compressed, columnar format using micro-partitions. Snowflake manages the data organization, file size, structure, compression, metadata, statistics, and replication. Data objects stored by Snowflake are not directly visible to customers, but they are accessible through SQL query operations that are run using Snowflake.

BENEFITS OF USING SNOWFLAKE

Snowflake is a cross-cloud platform, which means there are several things users coming from a more traditional on-premises solution will no longer need to worry about:

• Installing, provisioning, and maintaining hardware and software: All you need to do is create an account and load your data. You can then immediately connect from dbt and start transforming data.

• Determining the capacity of a data warehouse: Snowflake has scalable compute and storage, so it can accommodate all of your data and all of your users. You can adjust the count and size of your virtual warehouses to handle peaks and lulls in your data usage. You can even turn your warehouses completely off to stop incurring costs when you are not using them.

• Learning new tools and expanded SQL capabilities: Snowflake is fully compliant with ANSI-SQL, so you can use the skills and tools you already have. Snowflake provides connectors for ODBC, JDBC, Python, Spark, and Node.js, as well as web and command-line interfaces. On top of that, Snowpark is an initiative that will provide even more options for data engineers to express their business logic by directly working with Scala, Java, and Python DataFrames.

• Siloed structured and semi-structured data: Business users increasingly need to work with both traditionally structured data (for example, data in VARCHAR, INT, and DATE columns in tables) as well as semi-structured data in formats such as XML, JSON, and Parquet. Snowflake provides a special data type called VARIANT that enables you to load your semi-structured data natively and then query it with SQL.

• Optimizing and maintaining your data: You can run analytic queries quickly and easily without worrying about managing how your data is indexed or distributed across partitions. Snowflake also provides built-in data protection capabilities, so you don't need to worry about snapshots, backups, or other administrative tasks such as running VACUUM jobs.

• Securing data and complying with international privacy regulations: All data is encrypted when it is loaded into Snowflake, and it is kept encrypted at all times when at rest and in transit. If your business requirements include working with data that requires HIPAA, PII, PCI DSS, FedRAMP compliance, and more, Snowflake's Business Critical edition and higher editions can support these validations.
• Sharing data securely: Snowflake Secure Data Sharing enables you to share near real-time data internally and externally between Snowflake accounts without copying and moving data sets. Data providers provide secure data shares to their data consumers, who can view and seamlessly combine the data with their own data sources. Snowflake Data Marketplace includes many data sets that you can incorporate into your existing business data—such as data for weather, demographics, or traffic—for greater data-driven insights.

WHAT IS DBT?

When data teams work in silos, data quality suffers. dbt provides a common space for analysts, data engineers, and data scientists to collaborate on transformation workflows using their shared knowledge of SQL.

By applying proven software development best practices such as modularity, portability, version control, testing, and documentation, dbt's analytics engineering workflow helps data teams build trusted data, faster.

dbt transforms the data already in your data warehouse. Transformations are expressed in simple SQL SELECT statements and, when executed, dbt compiles the code, infers dependency graphs, runs models in order, and writes the necessary DDL/DML to execute against your Snowflake instance. This makes it possible for users to focus on writing SQL and not worry about the rest. For writing code that is DRY (don't repeat yourself), users can use Jinja alongside SQL to express repeated logic using control structures such as loops and statements.

DBT CLOUD

dbt Cloud is the fastest and most reliable way to deploy dbt. It provides a centralized experience for teams to develop, test, schedule, and investigate data models—all in one web-based UI (see Figure 2). This is made possible through features such as an intuitive IDE, automated testing and documentation, in-app scheduling and alerting, access control, and a native Git integration.

dbt Cloud also eliminates the setup and maintenance work required to manage data transformations in Snowflake at scale. A turn-key adapter establishes a secure connection built to handle enterprise loads, while allowing for fine-grained policies and permissions.

Figure 2: dbt Cloud provides a centralized experience for developing, testing, scheduling, and investigating data models.
CUSTOMER USE CASE
When Ben Singleton joined JetBlue as its Director of
Data Science & Analytics, he stepped into a whirlpool
of demands that his team struggled to keep up with.
The data team was facing a barrage of concerns and
low stakeholder trust.
OPTIMIZING SNOWFLAKE
Your business logic is defined in dbt, but dbt
ultimately pushes down all processing to Snowflake.
For that reason, optimizing the Snowflake side of
your deployment is critical to maximizing your query
performance and minimizing deployment costs. The table below summarizes the main areas and relevant best practices for Snowflake and serves as a checklist for your deployment.
AREA: Writing effective SQL statements
BEST PRACTICE: Applying filters as early as possible
WHY: Optimizing row operations and reducing records in subsequent operations
BEST PRACTICE: Querying only what you need
WHY: Selecting only the columns needed to optimize the columnar store
AUTOMATED RESOURCE OPTIMIZATION FOR DBT QUERY TUNING

Performance and scale are core to Snowflake. Snowflake's functionality is designed such that users can focus on core analytical tasks instead of on tuning the platform or investing in complicated workload management.

Automatic clustering

Snowflake's automatic clustering service transparently maintains the clustering of tables that have a clustering key defined:

• You no longer need to run manual operations to re-cluster data.
• Incremental clustering is done as new data arrives or a large amount of data is modified.
• Data pipelines consisting of DML operations (INSERT, DELETE, UPDATE, MERGE) can run concurrently and are not blocked.
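For illustration, a clustering key might be defined and its automatic maintenance controlled with statements along the following lines (the table and column names are hypothetical):

alter table analytics.page_events cluster by (event_date, account_id);

-- Automatic clustering can be suspended or resumed per table
alter table analytics.page_events suspend recluster;
alter table analytics.page_events resume recluster;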
Materialized views

MVs are particularly useful when:

• Query results contain a small number of rows and/or columns relative to the base table (the table on which the view is defined)
• Query results contain results that require significant processing, including:
  – Analysis of semi-structured data
  – Aggregates that take a long time to calculate
• The query is on an external table (that is, data sets stored in files in an external stage), which might have slower performance compared to querying native database tables
• The view's base table does not change frequently

In general, when deciding whether to create an MV or a regular view, use the following criteria:

• Create an MV when all of the following are true:
  – The query results from the view don't change often. This almost always means that the underlying/base table for the view doesn't change often or at least the subset of base table rows used in the MV doesn't change often.
  – The results of the view are used often (typically, significantly more often than the query results change).
  – The query consumes a lot of resources. Typically, this means that the query consumes a lot of processing time or credits, but it could also mean that the query consumes a lot of storage space for intermediate results.

• Create a regular view when any of the following are true:
  – The results of the view change often.
  – The results are not used often (relative to the rate at which the results change).
  – The query is not resource-intensive so it is not costly to re-run it.

These criteria are just guidelines. An MV might provide benefits even if it is not used often—especially if the results change less frequently than the usage of the view.

There are also other factors to consider when deciding whether to use a regular view or an MV. One such example is the cost of storing and maintaining the MV. If the results are not used very often (even if they are used more often than they change), the additional storage and compute resource costs might not be worth the performance gain.

Snowflake's compute service monitors the base tables for MVs and kicks off refresh statements for the corresponding MVs if significant changes are detected. This maintenance process of all dependent MVs is asynchronous. In scenarios where a user is accessing an MV that has yet to be updated, Snowflake's query engine will perform a combined execution with the base table to always ensure consistent query results. Similar to Snowflake's automatic clustering with the ability to resume or suspend per table, a user can resume and suspend the automatic maintenance on a per-MV basis. The automatic refresh process consumes resources and can result in increased credit usage. However, Snowflake ensures efficient credit usage by billing your account only for the actual resources used. Billing is calculated in one-second increments.

You can control the cost of maintaining MVs by carefully choosing how many views to create, which tables to create them on, and each view's definition (including the number of rows and columns in that view).

You can also control costs by suspending or resuming an MV; however, suspending maintenance typically only defers costs rather than reducing them. The longer that maintenance has been deferred, the more maintenance there is to do.

If you are concerned about the cost of maintaining MVs, we recommend you start slowly with this feature (that is, create only a few MVs on selected tables) and monitor the costs over time.

It's a good idea to carefully evaluate these guidelines based on your dbt deployment to see if querying from MVs will boost performance compared to base tables or regular views without cost overhead.
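As a minimal sketch (object names and logic are hypothetical), a materialized view and its maintenance could be managed as follows:

create materialized view analytics.mv_daily_orders as
select order_date, count(*) as order_count, sum(order_amount) as total_amount
from analytics.orders
group by order_date;

-- Maintenance can be suspended and resumed per materialized view
alter materialized view analytics.mv_daily_orders suspend;
alter materialized view analytics.mv_daily_orders resume;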
Query acceleration services

Sizing the warehouse just right for a workload is generally a hard trade-off between minimizing cost and maximizing query performance. You'll usually have to monitor, measure, and pick an acceptable point in this price-performance spectrum and readjust as required. Workloads that are unpredictable in terms of either the number of concurrent queries or the amount of data required for a given query make this challenging.

Multi-cluster warehouses handle the first case well and scale out only when there are enough queries to justify it. For the case where there is an unpredictable amount of data in the queries, you usually have to either wait longer for queries that look at larger data sets or resize the entire warehouse, which affects all clusters in the warehouse and the entire workload.

Snowflake's Query Acceleration Service provides a good default for the price-performance spectrum by automatically identifying and scaling out parts of the query plan that are easily parallelizable (for example, per-file operations such as filters, aggregations, scans, and join probes using bloom filters). The benefit is a much reduced query runtime at a lower cost than would result from just using a larger warehouse.

The Query Acceleration Service achieves this by elastically recruiting ephemeral worker nodes to lend a helping hand to the warehouse. Parallelizable fragments of the query plan are queued up for processing on leased workers, and the output of this fragment execution is materialized and consumed by the warehouse workers as a stream. As a result, a query over a large data set can finish faster, use fewer resources on the warehouse, and potentially, cost fewer total credits than it would with the current model.

What makes this feature unique is:

• It supports filter types, including joins
• No specialized hardware is required
• You can enable, disable, or configure the service without disrupting your workload

This is a great feature to use in your dbt deployment if you are looking to:

• Accelerate long-running dbt queries that scan a lot of data
• Reduce the impact of scan-heavy outliers
• Scale performance beyond the largest warehouse size
• Speed up performance without changing the warehouse size

Please note that this feature is currently managed outside of dbt. It is in private preview at the time of this white paper's first publication; please reach out to your Snowflake representative if you are interested in experiencing this feature with your dbt deployment.

RESOURCE MANAGEMENT AND MONITORING

A virtual warehouse consumes Snowflake credits while it runs, and the amount consumed depends on the size of the warehouse and how long it runs. Snowflake provides a rich set of resource management and monitoring capabilities to help control costs and avoid unexpected credit usage, not just for dbt transformation jobs but for all workloads.

Auto-suspend policies

The very first resource control that you should implement is setting auto-suspend policies for each of your warehouses. This feature automatically stops warehouses after they've been idle for a predetermined amount of time.

We recommend setting auto-suspend according to your workload and your requirements for warehouse availability:

• If you enable auto-suspend for your dbt workload, we recommend setting a more aggressive policy, with the standard recommendation being 60 seconds, because there is little benefit from caching.

• You might want to consider disabling auto-suspend for a warehouse if:
  – You have a heavy, steady workload for the warehouse.
  – You require the warehouse to be available with no delay or lag time. While warehouse provisioning is generally very fast (for example, 1 or 2 seconds), it's not entirely instant; depending on the size of the warehouse and the availability of compute resources to provision, it can take longer.

If you do choose to disable auto-suspend, you should carefully consider the costs associated with running a warehouse continually even when the warehouse is not processing queries. The costs can be significant, especially for larger warehouses (X-Large, 2X-Large, or larger).

We recommend that you customize auto-suspend thresholds for warehouses assigned to different workloads to assist in warehouse responsiveness:

• Warehouses used for queries that benefit from caching should have a longer auto-suspend period to allow for the reuse of results in the query cache.

• Warehouses used for data loading can be suspended immediately after queries are completed. Enabling auto-resume will restart a virtual warehouse as soon as it receives a query.
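For example, an aggressive auto-suspend policy for a dedicated dbt transformation warehouse might be set as follows (the warehouse name is illustrative):

alter warehouse transforming_wh set
  auto_suspend = 60    -- suspend after 60 seconds of inactivity
  auto_resume = true;  -- resume as soon as a new query arrives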
Resource monitors

Resource monitors can be used by account administrators to impose limits on the number of credits that are consumed by different workloads, including dbt jobs, within each monthly billing period, by:

• User-managed virtual warehouses
• Virtual warehouses used by cloud services

When these limits are either close to being reached or have been reached, the resource monitor can send alert notifications or suspend the warehouses.

It is essential to be aware of the following rules about resource monitors:

• A monitor can be assigned to one or more warehouses.
• Each warehouse can be assigned to only one resource monitor.
• A monitor can be set at the account level to control credit usage for all warehouses in your account.
• An account-level resource monitor does not override resource monitor assignment for individual warehouses.
• If either the warehouse-level or account-level resource monitor reaches its defined threshold, the warehouse is suspended. This enables controlling global credit usage while also providing fine-grained control over credit usage for individual or specific warehouses.
• In addition, an account-level resource monitor does not control credit usage by the Snowflake-provided warehouses (used for Snowpipe, automatic reclustering, and MVs); the monitor controls only the virtual warehouses created in your account.

Considering these rules, the following are some recommendations on resource monitoring strategy:

• Define an account-level budget.
• Define priority warehouse(s), including warehouses for dbt workloads, and carve from the master budget for priority warehouses.
• Create a resource allocation story and map.

Figure 3 illustrates an example scenario for a resource monitoring strategy in which one resource monitor is set at the account level, and individual warehouses are assigned to two other resource monitors.

Figure 3: One resource monitor is set for the account; individual warehouses are assigned to two warehouse-level resource monitors.
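The monitors in a scenario like this could be created with statements along the following lines (the monitor and warehouse names are hypothetical; the quotas match the example described below):

-- Account-level monitor with a 5,000-credit monthly quota
create resource monitor account_rm with credit_quota = 5000
  triggers on 90 percent do notify
           on 100 percent do suspend;
alter account set resource_monitor = account_rm;

-- Warehouse-level monitor for the ETL warehouse that runs dbt jobs
create resource monitor etl_rm with credit_quota = 1000
  triggers on 90 percent do notify
           on 100 percent do suspend;
alter warehouse etl_wh set resource_monitor = etl_rm;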
In the example illustrated in Figure 3, the credit quota for the entire account is 5,000 per month; if this quota is reached within the interval, the actions defined for the resource monitor (Suspend, Suspend Immediate, and so on) are enforced for all five warehouses.

Warehouse 3 performs ETL, including ETL for dbt jobs. From historical ETL loads, we estimated it can consume a maximum of 1,000 credits for the month. We assigned this warehouse to Resource Monitor 2.

Warehouses 4 and 5 are dedicated to the business intelligence and data science teams. Based on their historical usage, we estimated they can consume a maximum combined total of 2,500 credits for the month. We assigned these warehouses to Resource Monitor 3.

Warehouses 1 and 2 are for development and testing. Based on historical usage, we don't need to place a specific resource monitor on them.

The credits consumed by Warehouses 3, 4, and 5 may be less than their quotas if the account-level quota is reached first.

The used credits for a resource monitor reflect the sum of all credits consumed by all assigned warehouses within the specified interval. If a monitor has a Suspend or Suspend Immediate action defined and its used credits reach the threshold for the action, any warehouses assigned to the monitor are suspended and cannot be resumed until one of the following conditions is met:

• The next interval, if any, starts, as dictated by the start date for the monitor.
• The credit quota for the monitor is increased.
• The credit threshold for the suspend action is increased.
• The warehouses are no longer assigned to the monitor.

Note that warehouses can take some time to suspend, even when the action is Suspend Immediate, thereby consuming additional credits.

If you wish to strictly enforce your quotas, we recommend the following:

• Utilize buffers in the quota thresholds for actions (for example, set a threshold to 90% instead of 100%). This will help ensure that your credit usage doesn't exceed the quota.
• To more strictly control credit usage for individual warehouses, assign only a single warehouse to each resource monitor. When multiple warehouses are assigned to the same resource monitor, they share the same quota thresholds, which may result in credit usage for one warehouse impacting the other assigned warehouses.

When a resource monitor reaches the threshold for an action, it generates one of the following notifications, based on the action performed:

• The assigned warehouses will be suspended after all running queries complete.
• All running queries in the assigned warehouses will be canceled and the warehouses suspended immediately.
• A threshold has been reached, but no action has been performed.

Notifications are disabled by default and can be received only by account administrators with the ACCOUNTADMIN role. To receive notifications, each account administrator must explicitly enable notifications through their preferences in the web interface. In addition, if an account administrator chooses to receive email notifications, they must provide (and verify) a valid email address before they will receive any emails.

Naming conventions

We recommend having well-defined naming conventions to separate warehouses between hub and spokes for tracking, governance (RBAC), and resource monitors for consumption alerts.
The following is a sample naming convention:

<domain>_<team>_<function>_<base_name>

<team>: The name of the team (for example, engineering, analytics, data science, service, and so on) that the warehouses being monitored have been allocated to. When used, this should be the same as the team name used within the names of the warehouses.

<function>: The processing function (for example, development, ELT, reporting, ad hoc, and so on) generally being performed by the warehouses to be monitored. When used, this should be the same as the processing function name used within the names of the warehouses.

<base_name>: A general-purpose name segment to further distinguish one resource monitor from another. When used, this may be aligned with the base names used within the names of the warehouses or it may be something more generic to represent the group of warehouses.

An example of applying the naming conventions above might look something like this:

USA_WAREHOUSES: A resource monitor set to monitor and send alerts for all warehouses allocated to the USA spoke.

USA_PRD_DATASCIENCE_ADHOC: A resource monitor set to monitor and send alerts for just the single production data science warehouse for the USA.

USA_PRD_SERVICE_WAREHOUSES: A resource monitor set to monitor and send alerts for all production services (for example, ELT, reporting tools, and so on) warehouses for the USA.

Role-based access control (RBAC)

Team members have access only to their assigned database and virtual warehouse resources to ensure accurate cost allocation.

Monitoring

An important first step to managing credit consumption is to monitor it. Snowflake offers several capabilities to closely monitor resource consumption.

The first such resource is the Admin Billing and Usage page in the web interface, which offers a breakdown of consumption by day and hour for individual warehouses as well as for cloud services. This data can be downloaded for further analysis. Figure 4 through Figure 6 show example credit, storage, and data transfer consumption from the Snowsight dashboard.
Figure 5: Example storage consumption from the Snowsight dashboard
For admins who are interested in diving even deeper into resource optimization, Snowflake provides the account usage and information schemas. These tables offer granular details on every aspect of account usage, including for roles, sessions, users, individual queries, and even the performance or "load" on each virtual warehouse. This historical data can be used to build advanced forecasting models to predict future credit consumption. This trove of data is especially important to customers who have complex multiaccount organizations.

VIEW: WAREHOUSE_METERING_HISTORY
DESCRIPTION: Hourly credit usage per warehouse within the last year
The account usage and information schemas can be queried directly using SQL or analyzed and charted using Snowsight. The example provided below is of a load monitoring chart. To view the chart, click Warehouses in the web interface. As shown in Figure 7, the Warehouse Load Over Time page provides a bar chart and a slider for selecting the window of time to view in the chart.
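As a sketch of querying this data directly, the following statement aggregates daily credit consumption per warehouse over the last 30 days:

select warehouse_name,
       date_trunc('day', start_time) as usage_date,
       sum(credits_used) as credits_used
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -30, current_timestamp())
group by 1, 2
order by 1, 2;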
dbt offers a package called the Snowflake spend package that can be used to monitor Snowflake usage. Refer to the dbt package section of this white paper for more details.

Many third-party BI vendors offer pre-built dashboards that can be used to automatically visualize this data, including the ability to forecast future usage. We recommend sharing the account usage dashboards offered by your customers' preferred BI vendors to help them gain visibility on their Snowflake usage and easily forecast future usage. Figure 8 shows an example from Tableau.²
INDIVIDUAL DBT WORKLOAD ELASTICITY

Snowflake supports two ways to scale warehouses:

• Scale up by resizing a warehouse.
• Scale out by adding warehouses to a multi-cluster warehouse (requires Snowflake Enterprise Edition or higher).

Resizing a warehouse generally improves query performance, particularly for larger, more complex queries. It can also help reduce the queuing that occurs if a warehouse does not have enough compute resources to process all the queries that are submitted concurrently. Note that warehouse resizing is not intended for handling concurrency issues. Instead, in such cases, we recommend you use additional warehouses or use a multi-cluster warehouse (if this feature is available for your account).

Snowflake supports resizing a warehouse at any time, even while running. If a query is running slowly and you have additional queries of similar size and complexity that you want to run on the same warehouse, you might choose to resize the warehouse while it is running; however, note the following:

• As stated earlier, larger is not necessarily faster; for smaller, basic queries that are already executing quickly, you may not see any significant improvement after resizing.
• Resizing a running warehouse does not impact queries that are already being processed by the warehouse; the additional compute resources, once fully provisioned, are used only for queued and new queries.
• Resizing from a 5XL or 6XL warehouse to a 4XL or smaller warehouse will result in a brief period during which you are charged for both the new warehouse and the old warehouse while the old warehouse is quiesced.
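A warehouse can be resized with a single statement, for example (the warehouse name and sizes are illustrative):

-- Scale up ahead of a heavy dbt run
alter warehouse transforming_wh set warehouse_size = 'XLARGE';

-- Scale back down once the run is complete
alter warehouse transforming_wh set warehouse_size = 'XSMALL';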
Figure 9: User running an X-Small virtual warehouse
Let's now look at some benchmark data. Below is a simple query, similar to many ETL queries in practice, to load 1.3 TB of data. It was executed on various warehouse sizes.

create table terabyte_sized_copy as
select *
from sample_data.tpcds_sf10tcl.store_sales;

The table below shows the elapsed time and cost for different warehouses.

Here are some interesting observations from the table above:

• For a large operation, as the warehouse size increases, the elapsed time drops by approximately half.
• Each step up in warehouse size doubles the cost per hour.
• However, since the warehouse can be suspended after the task is completed, the actual cost of each operation is approximately the same.
• Going from X-Small to 4X-Large yields a 132x performance improvement with the same cost. This clearly illustrates how and why scaling up helps to improve performance and save cost.
• Compute resources can be dynamically scaled up, down, or out for each individual workload based on demand, and they suspend automatically to stop incurring cost, which is based on per-second billing.
• New 5XL and 6XL virtual warehouse sizes are now available on AWS and in public preview on Azure at the time of this white paper's first publication. These sizes give users the ability to add more compute power to their workloads and enable faster data loading, transformations, and querying. Previously, customers who needed to support compute-intensive workloads for data processing had to do batch processing and use multiple 4XL warehouses to accomplish their tasks. The new 5XL and 6XL virtual warehouse sizes give users the ability to run larger compute-intensive workloads in a performant fashion without any batching.

For a dbt workload, you should be strategic about what warehouse size you use. By default, dbt will use the warehouse declared in the connection. If you want to adjust the warehouse size, you can either declare a static warehouse configuration at the model or project level or use a dynamic macro such as the one shared in the snowflake_utils package. This allows you to automate selection of the warehouse used for your models without manually updating your connection. Our recommendation is to use a larger warehouse for incremental full-refresh runs where you are rebuilding a large table from scratch.
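As a sketch of a static, model-level override using the snowflake_warehouse configuration provided by the dbt-snowflake adapter (the model and warehouse names are hypothetical):

-- models/marts/fct_large_table.sql
{{ config(
    snowflake_warehouse = 'TRANSFORMING_XL_WH'
) }}

select *
from {{ ref('stg_large_table') }}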
Scaling out for concurrency

Multi-cluster warehouses are best utilized for scaling resources to improve concurrency for users and queries. They are not as beneficial for improving the performance of slow-running queries or data loading; for those types of operations, resizing the warehouse provides more benefits.

Figure 11 illustrates a customer running queries against an X-Small warehouse. The performance is satisfactory, but in this example, the customer anticipates that there will soon be a dramatic change in the number of users online. In Figure 12, the customer executes an ALTER WAREHOUSE command to enable the multi-cluster warehouse feature. This command might look like:

alter warehouse PROD_VWH set
min_cluster_count = 1
max_cluster_count = 10;

Figure 11: Customer runs queries against an X-Small warehouse
Figure 12: Customer executes an ALTER WAREHOUSE command to enable the multi-cluster warehouse feature

The system will automatically scale out by adding additional clusters of the same size as additional concurrent users run queries. The system also will automatically scale back as demand is reduced. As a result, the customer pays only for resources that were active during the period.

In cases where a large load is anticipated from a pipeline or from usage patterns, the min_cluster_count parameter can be set beforehand to bring all compute resources online. This will reduce the delays in bringing compute online, which usually happens only after query queuing and only gradually with a cluster every 20 seconds.

WRITING EFFECTIVE SQL STATEMENTS

To optimize performance, it's crucial to write effective SQL queries in dbt for execution on Snowflake.

Query order of execution

A query is often written in this order:

SELECT
FROM
JOIN
WHERE
GROUP BY
ORDER BY
LIMIT
Figure 13: Query order of execution in Snowflake. ROWS: FROM, JOIN, WHERE. GROUPS: GROUP BY, HAVING. RESULT: SELECT, ORDER BY, LIMIT.
The order of execution for this query in Snowflake is shown in Figure 13 above. Accordingly, the example above would execute in the following order:

Step 1: FROM clause (cross-product and join operators)
Step 2: WHERE clause (row conditions)
Step 3: GROUP BY clause (grouping rows)
Step 4: HAVING clause (group conditions)
Step 5: SELECT (column selection)
Step 6: ORDER BY (sorting)
Step 7: LIMIT (row count cap)

Joining on unique keys

Joining on nonunique keys can make your data output explode in magnitude, for example, where each row in table1 matches multiple rows in table2. Figure 14 shows an example execution plan where this happens, wherein the JOIN operation is the most costly operation.
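Before joining, it is worth confirming that the join key is unique on at least one side. A quick check might look like this (the table and column names are illustrative); in dbt, the same assumption is typically codified as a unique test on the model's key column:

select order_id, count(*) as occurrences
from analytics.stg_orders
group by order_id
having count(*) > 1;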
Avoiding complex functions and UDFs in WHERE clauses

While built-in functions and UDFs can be tremendously useful, they can also impact performance when used in query predicates. Figure 15 is an example of this scenario in which a log function is used where it should not be used.
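As an illustrative sketch of the general idea (the table, column, and threshold are hypothetical, not the exact example from Figure 15), a filter that wraps a column in a function must evaluate that function for every row, whereas an equivalent filter on the raw column keeps the predicate simple and cheaper to evaluate:

-- Function applied inside the predicate
select * from web_events where log(10, page_views) > 3;

-- Equivalent filter expressed directly on the column
select * from web_events where page_views > 1000;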
CREATE CLEARLY DEFINED ENVIRONMENTS

On the Snowflake layer, the account should be set up with, at a minimum, a separation of raw and analytics databases, as well as with clearly defined production and development schemas. There are different iterations of this setup, and you should create what meets the needs of your workflow. The goal here is to remove any confusion as to where objects should be built during the different stages of development and deployment.
dbt knows which object should be referenced based on the environment, without you having to write conditional logic. This means that when you select from a referenced object, dbt will automatically know the appropriate schema and/or database to interpolate.

This makes it possible for your code to never have to change as it's promoted from development to production because dbt is always aware of the underlying environment. Figure 16 (below) shows an example of how a dbt model relates to a Snowflake database. You can configure the dbt model df_model to explicitly build into the Snowflake database df_{environment} every time or based on conditional logic.

In addition to creating clearly defined environments, there is an additional cost (and time) saving measure that target makes possible. During development, you may find that you often need only a subset of your data set to test and iterate over. A good way to limit your data set in this way is to implement conditional logic to limit data in dev.

Such macros can automate when a filter is applied and ensure only a limited amount of data is run. This allows you to do away with the hassle of remembering to apply and remove data limitations through environments.

To more systematically apply this through a project, a good practice is to put the conditional logic into a macro and then call the macro across models. This allows updating the logic only in one place. You can also implement a variable in the logic to adjust the time period specified in the WHERE clause (with a default date that can be overridden in a run).

Here is sample code for such a macro; calling it in a dbt model adds the WHERE clause only when the target is dev:

{% macro limit_in_dev(timestamp) %}
-- this filter will only apply during a dev run
{% if target.name == 'dev' %}
where {{ timestamp }} > dateadd('day', -{{ var('development_days_of_data') }}, current_date)
{% endif %}
{% endmacro %}
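Calling the macro in a model might then look like this (the model and column names are illustrative); in a production run the macro renders nothing, so the full data set is selected:

select *
from {{ ref('stg_orders') }}
{{ limit_in_dev('created_at') }}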
For larger projects, you can also use macros to limit rebuilding existing objects. By operationalizing the Snowflake Zero-Copy Cloning feature, you can ensure that your environments are synced up by cloning from another environment to stay up to date. This is fantastic for developers who prefer to simply clone from an existing development schema or from the production schema to have all the necessary objects to run the entire project and update only what is necessary. By putting this macro into your project, you ensure that developers are writing the correct DDL every time because all they have to do is execute it rather than manually write it every time.
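A sketch of the cloning step itself, which could be wrapped in such a macro (the database and schema names are hypothetical):

-- Refresh a developer schema from production using Zero-Copy Cloning
create or replace schema analytics.dbt_dev_jane clone analytics.prod;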
USE THE REF() FUNCTION AND SOURCES

Always use the ref function and sources in combination with threads.

To best leverage Snowflake's resources, it's important to carefully consider the design of your dbt project. One key way to do that is to ensure you are using the ref() and source() functions in every dbt model, rather than hard-coding database objects.

The ref function is a keystone of dbt's functionality. By using the function, dbt is able to infer dependencies and ensure that the correct upstream tables and views are selected based on your environment. Simply put, it makes sense to always use the ref function when selecting from another model, rather than using the direct relation reference (for example, my_schema.my_table).

When you use the ref function, dbt automatically establishes a lineage from the model being referenced to the model where that reference is declared, and then it uses it to optimize the build order and document lineage.

After the ref() function creates the directed acyclic graph (DAG), dbt is able to optimally execute models based on the DAG and the number of threads, or maximum number of paths through the graph, dbt is allowed to work on. As you increase the number of threads, dbt increases the number of paths in the graph that it can work on at the same time, thus reducing the runtime of your project.

Our recommendation is to start with eight threads (meaning up to eight parallel models that do not violate dependencies can be run at the same time), and then increase the number of threads as your project expands. While there is no maximum number of threads you can declare, it's important to note that increasing the number of threads increases the load on your warehouse, potentially constraining other usage.

The number of concurrent models being run is also a factor of your project's dependencies. For that reason, we recommend structuring your code as multiple models, maximizing the number that can be run simultaneously.

As your project expands, you should continue to increase the number of threads while keeping an eye on your Snowflake compute. Hitting compute limitations as you increase the number of threads may be a good signal that it's time to increase the Snowflake warehouse size as well.

Figure 17 shows a sample dbt DAG. In this example, if a user declared three threads, dbt would know to run the first three staging models prior to running dim_suppliers. By specifying three threads, dbt will work on up to three models at once without violating dependencies; the actual number of models it can work on is constrained by the available paths through the dependency graph.
Sources work similarly to the ref() function, with the key distinction being that rather than telling dbt how a model relates to another model, sources tell dbt how a model relates to a source object. Declaring a dependency from a model to a source in this way enables a couple of important things: It allows you to select from source tables in your models, and it opens the door to more extensive project testing and documentation involving your source data. Figure 18 (below) shows a sample dbt DAG including a source node. The green node represents the source table that stg_tpch_nation has a dependency on.
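A minimal sketch of the pattern, using the TPC-H example from Figure 18 (the source declaration and column names are assumed for illustration):

-- models/staging/stg_tpch_nation.sql: select from a declared source
select
    n_nationkey as nation_key,
    n_name as nation_name
from {{ source('tpch', 'nation') }}

-- models/marts/dim_suppliers.sql: select from an upstream model with ref()
select *
from {{ ref('stg_tpch_nation') }}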
WRITE MODULAR, DRY CODE

Use Jinja to write DRY code and operationalize Snowflake administrative workflows.

dbt allows you to use Jinja, a Pythonic templating language that can expand on SQL's capabilities. Jinja gives you the ability to use control structures and apply environment variables.

Pieces of code written with Jinja that can be reused throughout a dbt project are called macros. They are analogous to functions in other programming languages, allowing you to define code in one central location and reuse it in other places. The ref and source functions mentioned above are examples of Jinja. Macros can also operationalize administrative tasks such as grant statements, or they can systemically remove deprecated objects.

In the past, during object creation, there often needed to be a parallel administrative workflow alongside development that ensured proper permissions were granted on Snowflake objects. Today all of this can be done via Snowflake GRANT statements. dbt adds another layer of functionality here: It allows you to take all your GRANT statements, ensure they are run consistently, and version control them for simple auditability.

See this example of a macro written to run GRANT statements. This macro, once implemented as a dbt hook, ensures that the GRANT statements are run after every dbt run, thus ensuring the right roles have access to newly created objects or future objects; a sketch of this pattern appears at the end of this section.

Similarly, as projects grow in maturity, it's common for them to have deprecated or unused objects in Snowflake. dbt allows you to maintain a standardized approach for culling such objects, using macros such as the one mentioned here. This allows you to operationalize how you tidy up your instance and to ensure that it is done on a schedule (via a dbt job).

Macros, in addition to making your SQL more flexible, allow you to compartmentalize your Snowflake administrative code and run it in a systematic fashion.
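A minimal sketch of such a grant macro (the role name and grant scope are assumptions to adapt to your own RBAC design):

{% macro grant_select(role='reporter') %}
    {% do run_query("grant usage on schema " ~ target.schema ~ " to role " ~ role) %}
    {% do run_query("grant select on all tables in schema " ~ target.schema ~ " to role " ~ role) %}
    {% do run_query("grant select on all views in schema " ~ target.schema ~ " to role " ~ role) %}
{% endmacro %}

Invoking it from the on-run-end hook in dbt_project.yml, for example on-run-end: "{{ grant_select(role='reporter') }}", ensures the grants run at the end of every dbt invocation.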
USE DBT TESTS AND DOCUMENTATION

Have at least one dbt test and one model-level description associated with each model.

Robust systems of quality assurance and discovery are key to establishing organizational trust in data. This is where dbt tests and documentation are invaluable.

dbt tests allow you to validate assumptions about your data. Tests are an integral part of a CI/CD workflow, allowing you to mitigate downtime and prevent costly rebuilds. Over time, tests not only save you debugging time, but they also help optimize your usage of Snowflake resources so you're using them where they are most valuable rather than to fix preventable mistakes.

We recommend that, unless there is a compelling reason not to, every dbt model has a test associated with it. Primary key tests are a good default, as failure there points to a granularity change.
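A primary key assumption can be declared as generic unique and not_null tests on the key column, or written as a singular test; a minimal singular-test sketch might look like this (the file path, model, and column names are illustrative), where any rows returned are reported as failures:

-- tests/assert_fct_orders_key_is_unique.sql
select order_id, count(*) as occurrences
from {{ ref('fct_orders') }}
group by order_id
having count(*) > 1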
When you implement a CI/CD process, be sure to use Slim CI builds for systemic quality checks. With Slim CI, you don't have to rebuild and test all your models; you can instead instruct dbt to run jobs on only modified or new resources. This allows you to use the node selector state:modified to run only models that have changes, which is much more resource-efficient.

dbt Documentation ensures that your data team and your data stakeholders have the resources they need for effective data discovery. The documentation brings clarity and consistency to the data models your team ships, so you can collectively focus on extracting value from the models instead of trying to understand them.

Every dbt model should be documented with a model description and, when possible, a column-level description. Use doc blocks to create a description in one file to be applied throughout the project; these are particularly useful for column descriptions that appear on multiple models.

If you're interested in documentation on the Snowflake side, apply query tags to your models. These allow you to conveniently tag in Snowflake's query history where a model was run. You can get as granular as is convenient, by either implementing model-specific query tags that allow you to see the query run attributed to a specific dbt model or by having one automatically set at the project level, such as with the following macro:
{% macro set_query_tag() -%}
  {%- set new_query_tag = model.name -%}
  {%- if new_query_tag %}
  {%- set original_query_tag = get_current_query_tag() -%}
  {%- do run_query("alter session set query_tag = '{}'".format(new_query_tag)) -%}
  {{ return(original_query_tag) }}
  {% endif %}
  {{ return(none) }}
{% endmacro %}
USE PACKAGES

Don't reinvent the wheel. Use packages to help scale up your dbt project quickly.

Packages can be described as dbt's version of Python libraries. They are shareable pieces of code that you can incorporate into your own project to help you tackle a problem someone else has already solved or to share your knowledge with others. Packages allow you to free up your time and energy to focus on implementing your own unique business logic.

Some key packages live on the dbt Package Hub. There, you can find packages that simplify things such as:

• Transforming data from a consistently structured SaaS data set
• Writing dbt macros that solve the question "How do I write this in SQL?"
• Navigating models and macros for a particular tool in your data stack

Every dbt project on Snowflake should have at least the dbt_utils package installed. This is an invaluable package that provides macros that help you write common data logic, such as creating a surrogate key or a list of dates to join. This package will help you scale up your dbt project much faster.

If you're using the Snowflake Dynamic Data Masking feature, we recommend using the dbt_snow_mask package. This package provides pre-written macros to operationalize your dynamic masking application in a way that's scalable and follows best practices.

The snowflake spend package is another great package that allows you to easily implement analytics for your Snowflake usage. You can use it to model how your warehouses are being used in detail, so you can make sure you use only the resources you actually want to use. We recommend using this package with a job in dbt Cloud, so you can easily set up alerting in case your usage crosses a certain threshold.

At larger organizations, it is not uncommon to create custom internal packages that are shared among teams. This is a great way to standardize logic and definitions as projects expand across multiple repositories (something we discuss in a later section). Doing this ensures that projects are aligned on companywide definitions of, for example, what a customer is, and it limits the amount of WET (write every time) code.

BE INTENTIONAL ABOUT YOUR MATERIALIZATIONS

Choose the right materialization for your current needs and scale.

One of the easiest ways to fine-tune performance and control your runtimes is via materializations. Materializations are build strategies for how your dbt models persist in Snowflake. Four materializations are supported out of the box by dbt: view, table, incremental, and ephemeral.

By default, dbt models are materialized as views. Views are saved queries that are always up to date, but they do not store results for faster querying later. If an alternative materialization is not declared, dbt will create a view. View materializations are a very natural starting point in a new project.

As the volume of your data increases, however, you will want to look into alternative materializations that store results and thus front-load the time spent when you query from an object. The next step up is a table materialization, which stores results as a queryable table. We recommend this materialization for any models queried by BI tools, or simply when you are querying a larger data set.

Incremental materialization offers a way to improve build time without compromising query speed. Incremental models materialize as tables in Snowflake, but they have more-complex underlying DDL, making them more complex to configure. They reduce build time by transforming only what has been declared to be a new record (via logic you supply); a sketch appears at the end of this section.

In addition to the materializations outlined above, you also have the option of writing your own custom materializations in your project and then using them in the same way as you would use materializations that come with dbt. This enables you, for example, to declare a model to be materialized as a materialized_view while keeping the same abilities as other models: maintaining lineage with the ref function, testing, and documentation.
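A minimal incremental model sketch (the model, column, and key names are illustrative) that processes only records newer than what is already in the target table:

{{ config(
    materialized = 'incremental',
    unique_key = 'event_id'
) }}

select *
from {{ ref('stg_events') }}
{% if is_incremental() %}
  -- only transform rows that arrived since the last run of this model
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}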
OPTIMIZE FOR SCALABILITY

Even when they start lean, dbt projects can expand in scale very quickly. We have seen dbt projects with about 100 models expand, with good reason, to over 1,000 for large enterprises. Because of this, we recommend the following approaches to help you avoid issues down the line.

Plan for project scalability from the outset

Being proactive about project scalability requires that you have a good understanding of how your team members work with each other and what your desired workflow looks like. We recommend reading this Discourse post as an overview of factors and then considering what options are right for your team.

Follow a process for upgrading dbt versions

When a new dbt version is released, start by testing the upgrade against your dbt project. Next, implement a "timebox" for testing the upgrade and, if possible, require either every user or a group of power users to upgrade to the latest dbt version for a set amount of time (such as 1 hour).

In that time, you should make clear there should be no merges to production and users should develop only on the updated version. If the test goes smoothly, you can then have everyone on your team upgrade to the latest version both in the IDE and locally (or continue with their updated version, as the case may be). On the other hand, if for some reason the test was not successful, you can make an informed decision on whether your team will stay on the newest release or roll back to the previous dbt version, and then plan for the next steps to upgrade at a later date.
CONTRIBUTORS
• BP Yau
Senior Partner Sales Engineer, Snowflake
• Amy Chen
Partner Solutions Architect, dbt Labs
REVIEWERS
• Dmytro Yaroshenko
Principal Data Platform Architect, Snowflake
• Jeremiah Hansen
Principal Data Platform Architect, Snowflake
• Brad Culberson
Principal Data Platform Architect, Snowflake
• Azzam Aijazi
Senior Product Marketing Manager, dbt Labs
DOCUMENT REVISIONS
ABOUT DBT LABS
dbt Labs was founded to solve the workflow problem in analytics, and created dbt to help.
With dbt, anyone on the data team can model, test, and deploy data sets using just SQL.
By applying proven software development best practices like modularity, version control,
testing, and documentation, dbt’s analytics engineering workflow helps data teams work
faster and more efficiently to bring order to organizational knowledge. Getdbt.com.
ABOUT SNOWFLAKE
Snowflake delivers the Data Cloud—a global network where thousands of organizations mobilize
data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations
unite their siloed data, easily discover and securely share governed data, and execute diverse analytic
workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across
multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the
Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data
application development, and data sharing. Join Snowflake customers, partners, and data providers
already taking their businesses to new frontiers in the Data Cloud. Snowflake.com.
©2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).
ENDNOTES