
7 BEST PRACTICES FOR BUILDING DATA APPLICATIONS ON SNOWFLAKE

Break out of scalability, concurrency, and performance barriers

EBOOK
TABLE OF CONTENTS

Begin with the basics
Invest in a modern architecture
Discover the Snowflake difference
Best practice #1: Select strategic virtual warehouse sizes by service or feature
Best practice #2: Adjust minimum and maximum cluster numbers to match expected workloads
Best practice #3: Target workloads to the right technologies
Best practice #4: Reduce SRE/DevOps burden with self-tuning and self-healing
Best practice #5: Benefit from a range of integrations and partnerships
Best practice #6: Be strategic with materialized views
Best practice #7: Use defaults for suspension
Conclusion
About Snowflake Inc.
BEGIN WITH THE BASICS

There has never been a better time to build SaaS data applications. International Data Corporation (IDC) predicts that big data and business analytics solutions will generate revenues of $189 billion in 2019, experiencing double-digit growth through 2022.¹ Startups and independent software vendors must meet two basic requirements for their apps to be competitive in this market:

THEY MUST INGEST LARGE VOLUMES OF DATA WITH SPEED.
THEY MUST ANALYZE ALL THAT DATA QUICKLY AND EASILY.

These requirements hold true for all data app types, including business intelligence (BI), Internet of Things (IoT), marketing and sales automation, customer relationship management (CRM), and machine learning, to name a few. It no longer matters if you’re building a data app for the financial, retail, insurance, or healthcare industry or any other industry: The demands on data ingestion and the need for real-time analytics are universal.

Once they have met these fundamental needs, data app builders must demonstrate their product’s strong performance for a large number of concurrent users on a global scale. Adding to the challenge, it’s imperative for builders to keep their own expenses in line while growing the business and do the best job possible future-proofing their technology investments.

This ebook explains how data apps, and the customers they serve, benefit from development on a cloud-built data platform, and it provides seven best practices around architectural, deployment, and operational settings to ensure you customize and maximize those benefits.
INVEST IN A MODERN ARCHITECTURE

Data app providers must carefully consider their data stack architecture. A massive-scale SaaS application simply cannot handle modern customer demands if it was not built for the cloud.

Yet many data apps today are built on traditional data stacks, including legacy on-premises and “cloud-washed” data warehouses. These technologies were either created before the cloud existed or shoehorned into the cloud. As a result, they lack the cloud-native attributes that make modern apps successful.

Traditional data stacks have limitations around scalability, concurrency, and performance. Supporting multiple separate workloads is almost impossible in a single-cluster architecture where everyone fights for resources. Even if different clusters are used for concurrent workloads, the risk of data inconsistency is high when data changes. The outcome is inaccurate analytics and unhappy customers. Data app builders who attempt to scale within these architectures require large capital investments, which can be a death knell for a startup with limited financial resources.

The same challenges hold true for open-source databases, where components don’t scale and the underlying infrastructure is limited. Rather than building new features, app developers find themselves constantly rearchitecting to solve intrinsic problems with fragmented open-source technology. These challenges include latency, incomplete data analysis, and overhead requirements around system maintenance, upgrades, and security.

The heart of the problem is that typical massively parallel processing (MPP) data warehouse architectures or cluster-based solutions (such as Hadoop) tie storage and compute resources together. This coupling makes it extremely challenging to support concurrent workloads. Users get frustrated by slow and disruptive scaling, and engineers waste time working around these issues for users.

You can avoid all these scenarios by adopting a modern data stack with unlimited and automatic scalability, concurrency, instant elasticity, and support for semi-structured data. Cloud-built technologies have these strengths built into their core architecture, which enables customers to extract maximum business insights and value from their data.

To keep data app developers focused on new customer features, the underlying technology should also be a fully managed service with a secure data environment. This modern architecture keeps costs under control by providing smarter query execution and a “pay for what you use” model.
DISCOVER THE SNOWFLAKE DIFFERENCE

Snowflake’s cloud-built data platform provides the modern stack you need to develop and scale modern data applications. Snowflake is built on and for the cloud and therefore includes fundamental cloud benefits that become clear when you examine its architecture, deployment, and operations.

ARCHITECTURE

Snowflake’s modern architecture corrects the flaws inherent in legacy systems to deliver a new type of data platform that enables scalability, elasticity, and concurrency.

SEPARATION OF RESOURCES

When we designed Snowflake, one of the most important decisions we made was to physically separate, but logically integrate, compute and storage. This eliminates the cluster-building efforts that other systems must perform to make separate layers work together. As a result of this architectural decision, Snowflake provides a multi-cluster, shared data architecture where three main components work together seamlessly:

• Storage: All data is stored in a persistent storage layer, which resides in a scalable cloud storage service (such as Amazon S3, Microsoft Azure, or Google Cloud Storage) for maximum data replication, scaling, and availability without customer management.

• Compute: Independent compute resources execute data processing tasks, such as loading, transformation, and querying. Snowflake provides “virtual warehouses,” or compute clusters, that access databases in the storage layer and execute queries with automatically cached data. Virtual warehouses can be created, resized, and deleted dynamically.

• Services: The cloud services layer handles system services such as infrastructure, security, automatic metadata management and resilience, access control, and optimization. Services also coordinate query processing and return results by communicating with client applications via connectors (including JDBC, ODBC, and Kafka clients) and “plug in” services via APIs.

Separating compute and storage means the same data can be used by multiple users simultaneously through multiple compute clusters. Workload contention is eliminated with dedicated and independent compute resources, so slowdowns or disruption to queries never happen.
BEST PRACTICE #1:
SELECT STRATEGIC VIRTUAL WAREHOUSE SIZES BY SERVICE OR FEATURE

Dedicate separate Snowflake virtual warehouses (compute clusters) for queries and workloads. This practice helps enable lower compute usage by allocating right-sized compute resources to specific services, features, or workloads.

For example, rather than use a large virtual warehouse (eight credits per hour), you may discover that a medium-size virtual warehouse (four credits per hour) and a small-size virtual warehouse (two credits per hour) match your application’s needs better. This strategy saves two credits per hour without any sacrifice in performance.

And, for those times when a heavy one-time analysis is needed, you can run queries on a separate right-sized warehouse that doesn’t impact other queries. If there’s a fixed amount of work, it often makes sense to use the biggest warehouse size. Query performance tends to scale linearly, so a large warehouse will end up delivering faster analysis at the same cost as a smaller warehouse that takes more time.

AUTO-SCALING WITH MULTI-TENANCY

With Snowflake, scalability is automatic and without limits, thanks again to a multi-cluster, shared data architecture that separates compute from storage. All users experience seamless, nondisruptive scaling and instant elasticity without the need to redistribute data.

With the Snowflake point-and-click user interface, or with a few short SQL statements, you can create virtual warehouses in seconds (see the sketch below). As part of the warehouse creation process, simply tell Snowflake whether you want it to auto-scale and auto-suspend; the Snowflake service then orchestrates the entire cloud infrastructure for you, detecting when scaling is needed and performing the operation automatically. No administrator intervention is required.

That means data app providers have the ability to instantly and infinitely scale compute up, down, and out to any and all workloads. Provisioning is immediate for on-demand performance. Snowflake enables any number of virtual warehouses (compute engines) to be spun up to support any number of customers—hundreds or thousands. Each customer can be assigned its own virtual warehouse and have complete control over the size and amount of compute resources. It takes less than a minute to launch a Snowflake virtual warehouse, and warehouses can be set to scale automatically based on workload demand.

The same holds true with storage: Resources can be scaled to virtually any capacity at any time, with the added benefit of not incurring charges for unnecessary compute resources.
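As a minimal sketch, dedicating right-sized warehouses per workload takes only a few statements; the warehouse names and suspend thresholds here are illustrative, and the credit rates are those cited above:

  -- Dedicate a right-sized warehouse per workload (names are illustrative).
  -- Sizes map to credits per hour: SMALL = 2, MEDIUM = 4, LARGE = 8.
  CREATE WAREHOUSE ingest_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 60      -- suspend after 60 seconds of inactivity
    AUTO_RESUME = TRUE;    -- wake automatically on the next query

  CREATE WAREHOUSE analytics_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

  -- Route a heavy one-time analysis to its own warehouse so it cannot
  -- slow down other workloads.
  USE WAREHOUSE analytics_wh;

Splitting one large warehouse (eight credits per hour) into the medium and small pair above (four plus two) realizes the two-credit-per-hour saving described earlier.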
[Figure: Snowflake is a highly scalable, multi-tenant environment built for the cloud. Built-in cloud services (authentication and access control, infrastructure manager, optimizer, transaction manager, security, and automatic metadata management) coordinate multiple compute engines, each with its own cache, over a single integrated storage layer. Security, data protection and retention, and availability are built in; compute and storage scale independently and nondisruptively, sized when needed, with automatic scaling up, down, and suspend.]

Snowflake is a data platform powered by a unique auto-scaling architecture that delivers high performance.
BEST PRACTICE #2:
ADJUST MINIMUM AND MAXIMUM CLUSTER NUMBERS TO MATCH EXPECTED WORKLOADS

Select a warehouse size (small, medium, large) that provides adequate performance for each individual query that runs, keeping in mind that a given warehouse size can run individual queries twice as fast as the size below it, and each additional cluster allows the warehouse to run more queries in parallel to increase concurrency.

Then, to maximize performance and minimize costs, adjust a virtual warehouse’s minimum and maximum number of clusters based on the concurrent throughput you expect for the workload. Keep in mind that, as workloads subside, clusters shut off one at a time, so you pay only for the resources needed at any given moment. This strategy provides consistent performance regardless of the number of queries.
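As a sketch, the minimum and maximum cluster counts are plain warehouse parameters; the name app_wh and the counts below are illustrative and should be tuned to your observed concurrency:

  -- Multi-cluster warehouse sized for expected concurrent throughput.
  CREATE WAREHOUSE app_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1     -- scale in to a single cluster as load subsides
    MAX_CLUSTER_COUNT = 4     -- add clusters under concurrent load
    SCALING_POLICY = 'STANDARD';

  -- Revisit the ceiling as observed throughput changes.
  ALTER WAREHOUSE app_wh SET MAX_CLUSTER_COUNT = 6;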
BEST PRACTICE #3:
TARGET WORKLOADS TO THE RIGHT TECHNOLOGIES

If you’re ingesting multiple types of data from multiple sources, it’s important to recognize what your data needs and set up your architecture to support separate workloads targeted at the technologies that make the most sense. For example, you might want to process some streaming data in near real time and take action on it, while other data types might not need immediate attention and instead should be sent straight to storage for future complex analytical segmentation.

By building these capabilities into your architecture early on, you accelerate the ability to manage your data and derive fast insights exactly where they are needed. With Snowflake, every piece of data can be sent to two places, which makes it easy to immediately process and store it, and it’s easy to handle unstructured data without forcing schemas on customers.

SUPPORT FOR SEMI-STRUCTURED AND STREAMING DATA

Traditional data stacks fall short due to singular support for structured data, limited processing capabilities, and inadequate memory. Snowflake’s architecture shines by enabling the consolidation of all data into one platform (both capabilities below are illustrated in the sketch at the end of this best practice):

• Semi-structured data: For semi-structured data types such as JSON, Avro, Parquet, and XML, Snowflake’s patented VARIANT data type loads, transforms, and integrates semi-structured data natively alongside structured data. Other platforms require multiple data stores and query grids, but Snowflake makes it easy to query semi-structured data immediately in a fully relational manner.

• Streaming data: Streaming data from sources such as IoT devices, mobile devices, and advertising technology is necessary if you want to perform complete data analysis. That’s why Snowflake built a service called Snowpipe to provide continuous loading for streaming data and serverless computing for data loading. Traditional systems often rely on latent tactics such as batch loading; Snowflake automatically loads data into target tables through a programmatic REST API or within a minute of receiving AWS S3 event notifications.

With Snowflake, analytics always run against a complete data set, which represents a single source of truth. New data that’s already in Amazon S3, Azure Data Lake Storage, or Google Cloud Platform can be loaded in mere minutes, and customers can execute complex queries, including joins, without performing any pretransformations. That means data app providers can deliver real-time insights to customers using all of their data—without any additional work or wasted effort.

SELF-SERVICE

Traditional infrastructures have a dependency: Requests for data access and resource scaling must go through the data team, which is responsible for orchestrating all resources of the data platform. Time to value slows down because the data team becomes a natural bottleneck. This situation is especially frustrating for data scientists, integration developers, business analysts, data stewards, executives, and the finance and sales teams, all of whom need fast access to data and resources for real-time decision-making and collaboration opportunities.

In contrast, Snowflake provides a data platform where self-service is enabled at any scale. The complexities and workflows that create bottlenecks are completely removed to provide higher value for organizations and faster time to value.
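The following sketch illustrates both VARIANT and Snowpipe; the table, pipe, and JSON field names are hypothetical, and @events_stage is assumed to be an existing stage pointing at your cloud storage:

  -- Land raw JSON events natively alongside structured data.
  CREATE TABLE raw_events (payload VARIANT);

  -- Query semi-structured fields relationally, with no pretransformation.
  SELECT
    payload:device_id::STRING AS device_id,
    payload:reading.temp::FLOAT AS temperature
  FROM raw_events;

  -- Continuously load arriving files with Snowpipe; AUTO_INGEST is
  -- triggered by S3 event notifications.
  CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_events
    FROM @events_stage
    FILE_FORMAT = (TYPE = 'JSON');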
BEST PRACTICE #4:
REDUCE SRE/DEVOPS BURDEN WITH SELF-TUNING AND SELF-HEALING

When traditional systems break down due to software or hardware defects or intensive workloads, manual intervention is needed, which in turn requires Site Reliability Engineering (SRE) or DevOps teams to intervene. Administrators have to be prepared to analyze workloads and tweak any of the hundreds or thousands of controls that might be available.

In contrast, Snowflake builds high availability and the ability to self-tune and self-heal into every layer of the system. A great example is the way Snowflake handles data. Any time data is written to a table, it is synchronously written to highly durable cloud storage in three different data centers. If a compute cluster in one data center starts losing machines, or if the entire data center goes down, Snowflake instantly provisions a cluster in another data center that has access to all your data.

DEPLOYMENT

Many data app providers adopt a continuous integration/continuous delivery (CI/CD) pipeline model that enables apps to be improved and code to be deployed to customers on a regular basis. That’s why Snowflake provides data recovery features to ensure deployments always go smoothly, even when they don’t.

DATA RECOVERY

Traditional data warehouses struggle with CI/CD because it requires backups before a release and complex table isolation and ETL data loading during a release. And that’s assuming all goes well: If a database needs to be rolled back after a bad release, DevOps is looking at another costly and time-consuming process.

To ensure all releases and data are backed up properly, Snowflake provides continuous data protection (CDP) through two built-in features that eliminate the need for traditional backup scripts and processes (a short sketch of Time Travel follows this list):

• Time Travel: Snowflake provides a fully updatable relational database and uses data manipulation language (DML) operations to capture updates or deletions of data rows. These changes are written internally to a new storage object that automatically retains the previous storage object. As such, all data and data objects are fully recoverable during a retention period (the length of which is determined by a service agreement), which ensures that accidental errors or bad releases can be rolled back with ease.

• Fail-safe: After the retention period has passed, Snowflake has a built-in seven-day “fail-safe” period during which data can still be recovered upon request. After this extended time has passed, an automated process deletes the data.
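A minimal sketch of rolling back a bad release with Time Travel, assuming a hypothetical orders table and an illustrative query ID:

  -- Query the table as it existed one hour ago (offset in seconds).
  SELECT * FROM orders AT (OFFSET => -3600);

  -- Recover a table dropped during a bad release, within the retention period.
  UNDROP TABLE orders;

  -- Snapshot the pre-release state for inspection or restore.
  CREATE TABLE orders_before_release CLONE orders
    BEFORE (STATEMENT => '01a2b3c4-0000-1111-2222-333344445555');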
BEST PRACTICE #5:
BENEFIT FROM A RANGE OF INTEGRATIONS AND PARTNERSHIPS

Snowflake’s ecosystem is designed for flexibility, openness, and extensibility without exposure to security threats. That means you benefit from partnerships and seamless integrations with vendors of your choice, including vendors for BI tools, data science technologies, marketing and sales automation software, CRM solutions, and machine learning.

Another area you might use vendors for is security analytics. Snowflake enables you to connect to existing SIEM software, anti-fraud products, and threat intelligence feeds. In addition, Snowflake provides built-in options from leading BI partners, including Tableau, Looker, Sigma Computing, and Periscope Data, so you can create a wide range of user interfaces, visualizations, and reports that align to your needs, processes, and workflows.

OPERATIONS

Snowflake is a fully managed service that requires near-zero maintenance and provides complete security. The burden to build and maintain complex data infrastructures is removed when there’s nothing to manage or optimize, and no downtime is required for software updates. Data app engineers can focus all their energies on building new features rather than on managing the data stack.

COMPLETE SECURITY

Every aspect of Snowflake’s cloud data platform is designed to deliver end-to-end data security and protect data against current and evolving security threats. Snowflake follows best-in-class standards and practices, leverages NIST 800-53 and the CIS Critical Security Controls, and provides always-on secure data environments with ACID compliance, including multi-statement transaction support.

Snowflake protects against security threats and ensures consistency and data integrity by providing:

• Encryption of data stored in the cloud
• Automatic protection against accidental or malicious loss of data
• Fine-grained, role-based access control for data and actions (see the sketch below)
• Isolation of query processing and data storage

Snowflake is SOC 2 Type 2 certified, HIPAA and PCI DSS compliant, and FedRAMP Ready. In addition, Snowflake integrates with existing security information and event management (SIEM) software, as well as with anti-fraud products, BI tools, ticketing systems, and data science technologies.
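As a sketch of the fine-grained, role-based access control listed above (the database, schema, role, and user names are all hypothetical):

  -- Create a read-only role scoped to one schema.
  CREATE ROLE analyst;
  GRANT USAGE ON DATABASE appdb TO ROLE analyst;
  GRANT USAGE ON SCHEMA appdb.public TO ROLE analyst;
  GRANT SELECT ON ALL TABLES IN SCHEMA appdb.public TO ROLE analyst;

  -- Assign the role; the user gets exactly these privileges, nothing more.
  GRANT ROLE analyst TO USER jane;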
BEST PRACTICE #6:
BE STRATEGIC WITH MATERIALIZED VIEWS

Another way to save money is to be strategic about creating materialized views: Measure the performance benefits you get before deciding whether to keep them. If it’s inexpensive to run a query on the base table, or if a big query runs infrequently, the cost of materialized view maintenance will not be worth it. The only recommended candidates for materialized views are extremely expensive aggregations, projections, and selections that must be run frequently, as sketched below.

PER-SECOND PRICING

Snowflake provides a clear pricing model with only two items to consider:

• Storage is charged per terabyte, compressed, per month.
• Compute is based on how many processing units are consumed to run queries or perform a service. Compute charges are billed on actual usage, per second, and only active clusters accrue charges.
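As an illustrative sketch, a materialized view only earns its maintenance cost when the underlying aggregation is expensive and queried often (the table and view names are hypothetical):

  -- Precompute an expensive, frequently queried aggregation.
  CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date;

  -- Measure the benefit before deciding to keep it: compare the cost and
  -- latency of this query against running it on the base table directly.
  SELECT * FROM daily_revenue WHERE order_date >= '2020-01-01';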
BEST PRACTICE #7:
USE DEFAULTS FOR SUSPENSION

A traditional data warehouse must run at full capacity 24/7, regardless of whether it’s being used. In contrast, Snowflake is built to match usage: Simply enable all virtual warehouses to suspend automatically when they are idle and resume automatically when they are queried. That means you won’t pay for these virtual warehouses when your users aren’t running queries.

There are no added usage quotas or hidden price premiums. You simply pay for what you use. With Snowflake’s cloud-built architecture, data app providers can start small and increase usage as needed with auto-scaling. Instant elasticity enables Snowflake’s cloud-built data platform to scale up, down, and out for complete efficiency and affordability.

Snowflake also includes integrated usage tracking by time or by accumulated usage, which allows you to easily administer cost allocations and chargebacks. It also lets you stay on top of usage for all virtual warehouses and monitor hot spots throughout any period.
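A short sketch of these suspension defaults plus usage tracking for chargebacks; the warehouse name and credit quota are illustrative:

  -- Suspend idle warehouses quickly; resume transparently on the next query.
  ALTER WAREHOUSE app_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

  -- Track accumulated usage and get notified before a monthly budget is hit.
  CREATE RESOURCE MONITOR app_monitor
    WITH CREDIT_QUOTA = 100
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS ON 90 PERCENT DO NOTIFY;

  ALTER WAREHOUSE app_wh SET RESOURCE_MONITOR = app_monitor;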
CONCLUSION

The best way to differentiate your data application is to provide customers with a highly performant service that analyzes all data together and delivers insights with speed and agility. By future-proofing your data stack with a cloud-built data platform, you deliver remarkable customer experiences while guaranteeing the right framework and support for your own organic growth. You’ll achieve all that without having to worry about planning for and performing any of the menial, time- and cost-heavy tasks that were once required to scale your systems, product, and business.
ABOUT SNOWFLAKE INC.
Snowflake’s cloud data platform shatters the barriers that have prevented organizations of all sizes from
unleashing the true value of their data. More than 2,000 customers deploy Snowflake to advance
their businesses beyond what was once possible by deriving all the insights from all their data by all
their business users. Snowflake equips organizations with a single, integrated platform that offers the
only data warehouse built for the cloud; instant, secure, and governed access to their entire network
of data; and a core architecture to enable many types of data workloads, including a single platform for
developing modern data applications. Snowflake: Data without limits. Find out more at snowflake.com

© 2020 Snowflake. All rights reserved.

CITATIONS

1. https://www.idc.com/getdoc.jsp?containerId=prUS44998419
