Snowflake Optimization Best Practices: A Guide To Balancing Cost and Performance at Scale
TABLE OF CONTENTS
GETTING STARTED
BEST PRACTICES
SCALE SNOWFLAKE CONFIDENTLY
GETTING STARTED WITH SNOWFLAKE
Snowflake’s AI Data Cloud is a global network that connects customers to the data,
applications and models that are most critical for their businesses. Organizations
successfully use Snowflake as a data warehouse to run complex analytics on large volumes
of data concurrently without sacrificing performance. Due to its unique architecture,
Snowflake allows organizations to scale their data operations while paying for only what
they use.
Understanding Snowflake’s architecture, the pricing structure and how it can be optimized
is vital to maximizing the use of Snowflake for a well-managed data platform, including
managing costs and minimizing operational inefficiencies.
Understanding Snowflake architecture: Separation of compute and storage
A distinct characteristic of Snowflake's architecture is the separation of processing power (compute) and storage, allowing users to scale each component independently of the other based on need. This decoupling of compute and storage, which have traditionally been tied together in on-premises systems, allows for great flexibility and elasticity, such as the near-infinite scaling of data storage.
Massively parallel processing (MPP)
The decoupling of storage and compute is achieved through a hybrid of the standard shared-disk and shared-nothing database approaches. In a shared-disk architecture, all servers or processors (nodes) in the database cluster have access to centrally located disk storage. Data is stored in a single, common location that is accessible to all the nodes simultaneously. In contrast, a shared-nothing architecture separates data across nodes, and each node operates independently with its own disk storage, resources and set of data. Such systems scale horizontally by simply adding more nodes.
[Figure: a control node routes application/user requests through a distributed query processing engine and network (N/W) layer to the cluster nodes.]
[Figure: multiple, independent compute clusters process queries against a shared storage layer; warehouses scale up for larger workloads and scale out for concurrency.]
Provisioning warehouses
[Figure: example of provisioning dedicated virtual warehouses per line of business, e.g., Banks, US Card and Auto, with separate warehouses for workloads such as Risk BI users and Fraud Analysis.]
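As a sketch of what provisioning dedicated warehouses can look like in SQL (the names and sizes below are illustrative, not recommendations):

    -- Separate warehouses per workload let each line of business scale
    -- and suspend independently, and keep their costs attributable.
    CREATE WAREHOUSE IF NOT EXISTS risk_bi_wh
      WAREHOUSE_SIZE = 'SMALL'
      AUTO_SUSPEND = 120
      AUTO_RESUME = TRUE;

    CREATE WAREHOUSE IF NOT EXISTS fraud_analysis_wh
      WAREHOUSE_SIZE = 'LARGE'
      AUTO_SUSPEND = 120
      AUTO_RESUME = TRUE;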
BALANCING COST AND PERFORMANCE IN SNOWFLAKE
Data warehouses are crucial for data storage, running queries and data manipulation. Because of Snowflake's flexibility and scalability, it's easy to under- or over-utilize resources in Snowflake warehouses. Understanding Snowflake optimization techniques and best practices can help teams manage consumption costs while maximizing benefits to the business.
1 RIGHTSIZE WAREHOUSES

Determining the correct size for warehouses is one powerful way organizations can strike the right balance between cost and performance while reducing waste. A business wants to use compute only when necessary and reallocate costs from poorly optimized configurations. In general, the goal is configuring the smallest warehouse size that will complete queries while meeting the organization's performance needs.

How does a business determine the right size for its needs? There are many factors to consider in selecting the right warehouse size that minimizes waste while improving overall performance. Key metrics to consider include query size, data spillage, query load and queued queries.

Examining query size can reveal the organization has been paying for an XL warehouse when all that was needed was a medium. Viewing patterns in data spillage will help identify the queries that lead to disk spillage and the actual amount of disk storage needed to complete requests. Query load, or the time it takes for a query to run in a warehouse, is also an important consideration in the trade-off between sizing up (decreasing query load times) and increasing costs. Lastly, examining the number of queued queries due to the load on the warehouse will help determine the right scaling configurations for clustering.

An important principle in optimizing warehouses is that one warehouse size does not fit all. One warehouse size should not be running at all times; it should be scheduled to change based on the query load on a certain day of the week or at a given time of day.
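One way to surface these metrics is to query Snowflake's ACCOUNT_USAGE views. The sketch below is a minimal, illustrative starting point rather than a complete rightsizing method; the seven-day window is an assumption:

    -- Recent queries that spilled to local or remote storage: a signal
    -- that the warehouse may be undersized for those workloads.
    SELECT query_id,
           warehouse_name,
           warehouse_size,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage,
           queued_overload_time   -- time spent queued due to warehouse load
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND (bytes_spilled_to_local_storage > 0
           OR bytes_spilled_to_remote_storage > 0)
    ORDER BY bytes_spilled_to_remote_storage DESC
    LIMIT 50;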
3 SET THE RIGHT AUTO-SUSPEND TIME

An idle warehouse consumes Snowflake credits even if there are no queries or workloads running. Snowflake's auto-suspend setting helps minimize the cost of unused warehouses by suspending idle warehouses, and it is set at 10 minutes by default. At Capital One, we've found that setting auto-suspend times in smaller increments, for example reducing the auto-suspend time from 10 minutes to 2 minutes, decreases idleness while saving costs. Some cache is lost with this method, but we found the initial decrease in query performance is made up for by the warehouse running at optimal levels subsequently.
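For reference, auto-suspend is configured per warehouse in seconds; a minimal example of the 2-minute setting described above (the warehouse name is hypothetical):

    -- Suspend the warehouse after 2 minutes (120 seconds) of inactivity.
    ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 120;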
4 CHOOSE THE RIGHT SCALING POLICY

In Snowflake, you can configure the multi-cluster feature to automatically scale compute resources and support the concurrency of users and queries. Configuring multi-cluster in Snowflake with the right scaling policy is important to ensure the environment handles the demands of the workload while managing costs. There are two types of scaling policies: standard, which automatically adds or removes clusters based on query demand, and economy, which fully utilizes current clusters and avoids spinning up new clusters to save credits. In our experience, we have found that if the workload can afford a delay of more than five minutes, there are significant cost savings in choosing the economy setting.
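As a sketch, a multi-cluster warehouse (an Enterprise Edition feature) with the economy policy might be created as follows; the name, size and cluster counts are illustrative, not recommendations:

    -- The economy policy favors fully loading existing clusters over
    -- spinning up new ones, trading some delay for credit savings.
    CREATE OR REPLACE WAREHOUSE etl_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'ECONOMY';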
5 IMPROVE QUERY PERFORMANCE

Queries are instrumental to the health and success of data-driven organizations, but they can also be costly. Inefficient queries can strain any business with heavier use of resources such as CPU and memory, and with performance slowdowns that affect teams across the organization. By taking the time to improve the quality of queries through query optimization, companies can gain significant cost advantages.

There are a number of ways to optimize queries for cost savings and better performance in Snowflake. Companies should have alerting mechanisms in place that give notice when a query is running when it should not be. Alerts provide an opportunity for intervention to avoid an unwanted large bill at the end of the month. At Capital One, we've also invested in education on queries so that users can avoid writing bad queries from the outset or identify these queries before they run. A few of the mistakes in inefficient queries we have identified, and educate users to avoid, include the following (see the sketch after this list):

• Single-row inserts
• Selecting * from a large table
• Cartesian product joins
• Deeply nested views
• Queries that spill to disk
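To make the list concrete, here is a minimal before-and-after sketch; the orders table and its columns are hypothetical, and the rewrite illustrates only the first two items:

    -- Hypothetical table for illustration.
    CREATE OR REPLACE TABLE orders (
      order_id INT, customer_id INT, amount NUMBER(10,2)
    );

    -- Anti-pattern: one INSERT statement per row.
    --   INSERT INTO orders VALUES (1, 101, 25.00);
    --   INSERT INTO orders VALUES (2, 102, 40.00);
    -- Better: batch rows into a single multi-row insert.
    INSERT INTO orders (order_id, customer_id, amount)
    VALUES (1, 101, 25.00),
           (2, 102, 40.00),
           (3, 103, 13.50);

    -- Anti-pattern: SELECT * on a large table.
    -- Better: project only the columns the consumer needs.
    SELECT order_id, amount
    FROM orders
    WHERE customer_id = 101;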
6 SET UP MONITORING AND REPORTING

Snowflake users need the ability to monitor their Snowflake usage continuously and to be notified of credit usage as it happens. Snowflake's powerful elasticity and by-the-second charging structure mean costs can quickly rise and get out of hand, and the business may not find out until the monthly bill. These events can occur through faulty configurations or inefficient queries. Identifying these errors and inefficiencies and then remediating them can be time-consuming and costly to do manually, pulling valuable developer resources away from strategic tasks. Putting in place alerts and limits will bring greater visibility and control over Snowflake spend so the organization is not overspending or blindsided.
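Within Snowflake itself, resource monitors are one native way to implement such alerts and limits; a minimal sketch (the quota, thresholds and names are examples, and creating monitors requires the ACCOUNTADMIN role):

    -- Notify account admins at 80% of a monthly credit quota and
    -- suspend the assigned warehouse at 100%.
    CREATE RESOURCE MONITOR monthly_spend_rm
      WITH CREDIT_QUOTA = 100
           FREQUENCY = MONTHLY
           START_TIMESTAMP = IMMEDIATELY
           TRIGGERS ON 80 PERCENT DO NOTIFY
                    ON 100 PERCENT DO SUSPEND;

    ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_spend_rm;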
7 STANDARDIZE PROCESSES AND AUTOMATE GOVERNANCE

As a business scales, the traditional way of managing data through a central data team breaks down in the face of exponentially growing volumes of data and requests, leading to bottlenecks and inefficiencies. Ensuring data is managed responsibly becomes critical. By standardizing data processes and automating data governance across the organization, the business can empower teams to perform tasks independently, such as provisioning warehouses on their own, in a secure and well-governed way.
SNOWPARK-OPTIMIZED WAREHOUSES

Choosing the right warehouse type and size for each workload is key to optimizing Snowflake costs. In Snowflake, there are two warehouse types that determine the memory-to-CPU ratio: Standard and Snowpark-optimized warehouses. There are also multiple sizes (e.g., XS, S, M, L … 6XL), which set the total amount of CPU and memory available.
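For illustration, the warehouse type is set at creation; Snowpark-optimized warehouses provide more memory per node and start at size medium (the warehouse name is hypothetical):

    -- Memory-heavy workloads (e.g., Snowpark ML training) benefit from
    -- the higher memory-to-CPU ratio of a Snowpark-optimized warehouse.
    CREATE WAREHOUSE snowpark_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';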
SNOWFLAKE CACHING
Query performance, or the time to execute a query and return results, can suffer in traditional data warehouses when complex queries
require large datasets and heavy resources. In Snowflake, where using resources efficiently is necessary to manage costs, caching is a
key performance tuning feature that allows quick access to frequently used data. Caching enhances query performance by reducing the
time it takes to get results while using fewer resources, which in turn leads to cost savings. Let’s take a look at two types of caches in
Snowflake: warehouse cache and query cache.
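As a small illustration of the query (result) cache, repeating an identical query can return results without re-running on a warehouse; the sales table is hypothetical, and disabling the cache is useful mainly for benchmarking:

    -- First run executes on the warehouse and populates the result cache.
    SELECT COUNT(*) FROM sales;
    -- An identical re-run is typically served from the result cache,
    -- consuming no warehouse compute.
    SELECT COUNT(*) FROM sales;

    -- Turn the result cache off for the session, e.g., when benchmarking
    -- warehouse sizes.
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;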
SCALE SNOWFLAKE CONFIDENTLY WITH CAPITAL ONE SLINGSHOT
At Capital One, we found we needed additional ways to manage our Snowflake costs
and streamline work processes at scale across our teams. To minimize bottlenecks
and streamline governance, we developed internally what would later become
Capital One Slingshot, a cloud-based data management solution that optimizes
Snowflake usage and helps businesses maximize their Snowflake investment. For an
organization that uses both, Snowflake is the data warehouse platform for storing
and analyzing data sets, while Slingshot is the tool that builds on Snowflake to
enhance its capabilities for the business.
COST VISIBILITY
Visibility into Snowflake usage is crucial to optimizing spend in the ways that most benefit the business. With the
ability to view usage at a granular level through tagging and attribute spending to specific teams and business
units, Slingshot users can gain greater insight and understanding into Snowflake costs and usage. Additionally,
while Snowflake provides information by account, Slingshot’s Cost Breakdown Report is valuable for organizations
looking to deepen their understanding with a view of costs across accounts.
Organizations gain deep visibility into cost drivers and line-of-
business allocations within Snowflake with detailed dashboards
of all the key metrics impacting cost, performance and usage of
Snowflake. Tagging enables a granular level of visibility, breaking
down costs by user, accounts, warehouses, lines of business and
queries. The Cost Breakdown Report allows admins to further
customize the way they attribute spending to specific teams or
business units, enabling chargebacks.
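Under the hood, this kind of attribution builds on Snowflake's object tagging; a minimal sketch of tagging a warehouse and reading tag assignments back (the tag name, value and warehouse are examples, and ACCOUNT_USAGE views can lag by up to a couple of hours):

    -- Define a tag and attach it to a warehouse for cost attribution.
    CREATE TAG IF NOT EXISTS cost_center;
    ALTER WAREHOUSE risk_bi_wh SET TAG cost_center = 'risk';

    -- List objects carrying the tag via the account usage view.
    SELECT object_name, tag_value
    FROM snowflake.account_usage.tag_references
    WHERE tag_name = 'COST_CENTER';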
WAREHOUSE GOVERNANCE
Users can provision new warehouses using pre-configured
templates and fine tune size and scaling policies to dynamically
adjust warehouses based on need. Users can adjust warehouse
sizes based on time and day of the week, allowing organizations
to optimize Snowflake warehouses to run as efficiently as
possible. Once a warehouse request is complete, Slingshot sends
it to the right owner for approval.
WAREHOUSE OPTIMIZATION
For organizations looking for guidance on warehouse usage,
Slingshot provides data-driven recommendations to right-size
warehouses and dashboards that track warehouse performance.
Slingshot analyzes historical usage patterns and warehouse
metadata to determine the best schedule and sizing for balancing
cost and performance, taking the guesswork out of optimization.
QUERY OPTIMIZATION
Based on best practices for writing efficient queries,
Query Advisor analyzes queries to identify inefficiencies.
It surfaces the costliest queries and the most frequently executed
queries for analysis. Then the tool provides opportunities to
improve the query for better performance. Slingshot also provides
the query impact before and after so users can confidently
apply the recommendations. At Capital One, this tool helped us
decrease cost per query by 43%.
Visit capitalone.com/software/solutions to learn how
Slingshot can help you optimize Snowflake and
request a demo today.