
Caching in the Snowflake Cloud Data Platform

This article explains how each layer of caching works in Snowflake while a query is executed.

July 31, 2023


Solution
This article was originally published on analytics.today/blog

Snowflake offers relatively few options for performance tuning. However, it is worth understanding how the Snowflake architecture uses several levels of caching to help speed up your queries. This article provides an overview of the techniques used, and some best-practice tips on how to maximize system performance using caching.

Snowflake Database Architecture

Before starting, it's worth considering the underlying Snowflake architecture, and when Snowflake caches data. The diagram below illustrates the overall architecture, which consists of three layers:


1. Service Layer: Accepts SQL requests from users, and coordinates queries, transactions and results. Logically, this layer can be assumed to hold the result
cache – a cached copy of the results of every query executed.
2. Compute Layer: Does the heavy lifting. This is where the SQL is actually executed across the nodes of a Virtual Warehouse. This layer holds a
cache of the data queried, often referred to as the Local Disk cache, although in reality it is implemented using SSD storage. All data in the compute layer is
temporary, and is held only as long as the virtual warehouse is active.
3. Storage Layer: Provides long-term storage of data. This is often referred to as Remote Disk, and is implemented on cloud object storage such as Amazon S3 or
Microsoft Azure Blob Storage.

Snowflake Cache Layers

The diagram below illustrates the levels at which data and results are cached for subsequent use. These are:

1. Result Cache: Holds the results of every query executed in the past 24 hours. These are available across virtual warehouses, so query results returned to one
user are available to any other user on the system who executes the same query, provided the underlying data has not changed.
2. Local Disk Cache: Used to cache data read by SQL queries. Whenever data is needed for a given query, it is retrieved from Remote Disk storage and
cached in SSD and memory on the virtual warehouse.
3. Remote Disk: Holds the long-term storage. This level is responsible for data resilience, which in the case of Amazon Web Services means 99.999999999%
durability, even in the event of an entire data centre failure.
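Which of these caches served a given query can be checked after the fact. As a sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share, the query history view records the fraction of scanned data that came from the warehouse's local cache:

```sql
-- What fraction of scanned data came from the warehouse's local (SSD) cache?
-- Requires access to the SNOWFLAKE.ACCOUNT_USAGE share.
SELECT query_id,
       total_elapsed_time,
       percentage_scanned_from_cache
FROM   snowflake.account_usage.query_history
ORDER  BY start_time DESC
LIMIT  10;
```

A value of 100 in `percentage_scanned_from_cache` indicates the query was served entirely from the local disk cache; 0 means everything came from remote storage.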
Snowflake Benchmark Performance

Every Snowflake account is delivered with a pre-built and populated set of Transaction Processing Council (TPC) benchmark tables. To test the effect of caching, I set up a series of test queries against a small subset of the data, illustrated below.

All the queries were executed on a MEDIUM sized virtual warehouse (4 nodes), and joined the tables. The tables were queried exactly as delivered, without any performance tuning.

The following query was executed multiple times, and the elapsed time and query plan were recorded each time. The screenshot below illustrates the results of the query, which summarises the data by Region and Country. In total, the SQL queried, summarised and counted over 1.5 billion rows. The screenshot shows the first eight lines returned.
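The exact benchmark query is shown only as a screenshot in the original post. As an illustration only, a query of the kind described – summarising and counting rows by region and country – might look like the following, assuming the standard SNOWFLAKE_SAMPLE_DATA TPC-H share (the tables and scale factor used in the original test are not stated):

```sql
-- Hypothetical sketch of the benchmark query: summarise and count rows
-- by region and country. Table names assume the SNOWFLAKE_SAMPLE_DATA
-- TPC-H sample share; the original test's exact query is not reproduced here.
SELECT r.r_name          AS region,
       n.n_name          AS country,
       COUNT(*)          AS row_count,
       SUM(l.l_quantity) AS total_quantity
FROM   snowflake_sample_data.tpch_sf100.lineitem l
JOIN   snowflake_sample_data.tpch_sf100.orders   o ON l.l_orderkey  = o.o_orderkey
JOIN   snowflake_sample_data.tpch_sf100.customer c ON o.o_custkey   = c.c_custkey
JOIN   snowflake_sample_data.tpch_sf100.nation   n ON c.c_nationkey = n.n_nationkey
JOIN   snowflake_sample_data.tpch_sf100.region   r ON n.n_regionkey = r.r_regionkey
GROUP BY r.r_name, n.n_name
ORDER BY region, country;
```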

Benchmark Test Sequence


The test sequence was as follows:

1. Run from cold: Which meant starting a new virtual warehouse (with no local disk caching), and executing the query.
2. Run from warm: Which meant disabling the result caching, and repeating the query. This makes use of the local disk caching, but not the result cache.
3. Run from hot: Which again repeated the query, but with the result caching switched on.
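Result caching can be toggled per session, which is how the warm runs above avoided the result cache:

```sql
-- Disable the result cache for this session (used for the "warm" runs).
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- ... run the benchmark query here ...

-- Re-enable it for the "hot" run (TRUE is the default).
ALTER SESSION SET USE_CACHED_RESULT = TRUE;
```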

Each query ran against 60 GB of data, although as Snowflake returns only the columns queried, and was able to automatically compress the data, the actual data transfers were around 12 GB. Because Snowflake is a columnar data warehouse, it automatically reads only the columns needed rather than the entire row, which further helps maximize query performance.

Performance Run from Cold

This query returned in around 20 seconds, and the profile shows it scanned around 12 GB of compressed data, with 0% from the local disk cache. This means it had no benefit from disk caching.
The bar chart above shows that around 50% of the time was spent on local or remote disk I/O, and only 2% on actually processing the data. Clearly, any design change that reduces disk I/O will help this query.

The results also show the query was unable to perform any partition pruning, which might otherwise improve performance. We'll cover the effect of partition pruning and clustering in the next article.

Run from Warm

This query was executed immediately afterwards, but with the result cache disabled, and it completed in 1.2 seconds – around 16 times faster. In this case, the Local Disk cache (which is actually SSD on Amazon Web Services) was used to return results, and disk I/O is no longer a concern.
In the above case, disk I/O has been reduced to around 11% of the total elapsed time, and 99% of the data came from the (local disk) cache. For a query over 1.5 billion rows, this is clearly an excellent result.

Run from Hot

This run returned results in milliseconds, and involved re-executing the query, but this time with the result cache enabled. Normally this is the default situation, but it was disabled purely for testing purposes.


The above profile indicates the entire query was served directly from the result cache (taking around 2 milliseconds). Although not immediately obvious, many dashboard applications repeatedly refresh a series of screens and dashboards by re-executing the same SQL. In these cases, the results are returned in milliseconds.
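Beyond this transparent reuse, a prior result set can also be addressed explicitly with RESULT_SCAN, which reads a previous query's results as if they were a table:

```sql
-- Re-read the result set of the most recent query in this session
-- without re-executing it against the warehouse.
SELECT *
FROM   TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```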

Although more information is available in the Snowflake documentation, a series of tests demonstrated that the result cache will be reused unless the underlying data (or the SQL query) has changed. Additional tests demonstrated that inserts, updates and deletes which don't affect the queried data are ignored, and the result cache is still used, provided data in the micro-partitions remains unchanged.

Finally, results are normally retained for 24 hours, although the clock is reset every time the query is re-executed, up to a limit of 30 days, after which queries must read from the remote disk again.
Snowflake Performance Summary

The sequence of tests was designed purely to illustrate the effect of data caching on Snowflake. The tests included:

 Raw Data: Over 1.5 billion rows of TPC-generated data, a total of over 60 GB of raw data.

 Initial Query: Took 20 seconds to complete, and ran entirely from the remote disk. Quite impressive.

 Second Query: Was 16 times faster at 1.2 seconds, and used the Local Disk (SSD) cache.

 Result Set Query: Returned results in 130 milliseconds from the result cache (intentionally disabled on the prior query).

To put the above results in context, I repeatedly ran the same query on an Oracle 11g production database server for a tier-one investment bank, and it took over 22 minutes to complete.

Finally, unlike Oracle, where additional care and effort must be taken to ensure correct partitioning, indexing, statistics gathering and data compression, Snowflake caching is entirely automatic and available by default. Absolutely no effort was made to tune either the queries or the underlying design, although there are a small number of options available, which I'll discuss in the next article.

System Performance Tuning Best Practice

Clearly, data caching makes a massive difference to Snowflake query performance, but what can you do to maintain performance when you cannot control the cache?

Here are a few best-practice tips:

 Auto-Suspend: By default, Snowflake will auto-suspend a virtual warehouse (the compute resources, with the SSD cache) after 10 minutes of idle time. Best
practice? Leave this alone. Keep in mind, you should be trying to balance the cost of providing compute resources against fast query performance. To illustrate the
point, consider these two extremes:
1. Suspend after 60 seconds: When the warehouse is re-started, it will (most likely) start with a clean cache, and will take a few queries to build up the
relevant cached data. (Note: Snowflake will try to restore the same cluster with the cache intact, but this is not guaranteed.)
2. Suspend never: Your cache will always be warm, but you will pay for compute resources even if nobody is running any queries. However,
provided you set up a script to shut down the warehouse when it is not being used, this may make sense.
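The auto-suspend timeout is set per warehouse (the warehouse name below is illustrative):

```sql
-- Suspend after 10 minutes (600 seconds) of inactivity -- the default
-- discussed above. Setting AUTO_SUSPEND = 0 disables auto-suspend entirely.
ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 600;
```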

 Scale up for large data volumes: If you have a sequence of large queries to perform against massive (multi-terabyte) data volumes, you can improve query
performance by scaling up. Simply execute a SQL statement to increase the virtual warehouse size, and new queries will start on the larger (faster) cluster. While
this will start with a clean (empty) cache, you should normally find performance doubles at each size, and this extra performance boost will more than outweigh the
cost of refreshing the cache.

 Scale down - but not too soon: Once your large task has completed, you can reduce costs by scaling down or even suspending the virtual warehouse. Be aware,
however, that the cache will start clean again on the smaller cluster. By all means tune the warehouse size dynamically, but don't keep adjusting it, or you'll lose the
benefit.
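Scaling up and back down is a single statement each way (again, the warehouse name is illustrative):

```sql
-- Scale up before a batch of large queries ...
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ... and back down once the heavy work has finished.
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```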
