Caching in The Snowflake Cloud Data Platform
This article explains how each layer of caching works in Snowflake while a query is executed.
In terms of performance tuning in Snowflake, there are very few options available. However, it is worth understanding how the Snowflake architecture includes various levels of caching to help speed up your queries. This article provides an overview of the techniques used, and some best practice tips on how to maximize system performance using caching.
Before starting, it's worth considering the underlying Snowflake architecture, and explaining when Snowflake caches data. The diagram below illustrates the levels at which data and results are cached for subsequent use. These are:-
1. Result Cache: Which holds the results of every query executed in the past 24 hours. These are available across virtual warehouses, so query results returned to one user are available to any other user on the system who executes the same query, provided the underlying data has not changed.
2. Local Disk Cache: Which is used to cache data used by SQL queries. Whenever data is needed for a given query it's retrieved from the Remote Disk storage, and
cached in SSD and memory.
3. Remote Disk: Which holds the long term storage. This level is responsible for data resilience, which in the case of Amazon Web Services means 99.999999999% durability, even in the event of an entire data centre failure.
Snowflake Benchmark Performance
Every Snowflake account is delivered with a pre-built and populated set of Transaction Processing Performance Council (TPC) benchmark tables. To test the effect of caching, I set up a series of test queries against a small sub-set of the data, which is illustrated below.
All the queries were executed on a MEDIUM sized virtual warehouse (a 4-node cluster), and joined the benchmark tables exactly as delivered, without any performance tuning.
The following query was executed multiple times, and the elapsed time and query plan were recorded each time.
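The exact benchmark query isn't reproduced here, but the sketch below shows a query of the same shape (a join summarised by region and nation) written against the TPC-H sample schema shared with Snowflake accounts. The database, schema and column names follow the standard TPC-H layout, and are illustrative assumptions rather than the precise query used in the tests:

    -- Illustrative only: summarise order counts and value by region and nation,
    -- using the standard TPC-H sample tables.
    SELECT r.r_name            AS region,
           n.n_name            AS nation,
           COUNT(*)            AS order_count,
           SUM(o.o_totalprice) AS total_value
    FROM   snowflake_sample_data.tpch_sf100.orders   o
    JOIN   snowflake_sample_data.tpch_sf100.customer c ON c.c_custkey   = o.o_custkey
    JOIN   snowflake_sample_data.tpch_sf100.nation   n ON n.n_nationkey = c.c_nationkey
    JOIN   snowflake_sample_data.tpch_sf100.region   r ON r.r_regionkey = n.n_regionkey
    GROUP BY r.r_name, n.n_name
    ORDER BY r.r_name, n.n_name;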
The screenshot below illustrates the results of the query, which summarises the data by Region and Country. In total, the SQL queried, summarised and counted over 1.5 billion rows. The query was executed under three conditions (the session settings used to reproduce each are sketched after this list):-
1. Run from cold: Which meant starting a new virtual warehouse (with no local disk caching), and executing the query.
2. Run from warm: Which meant disabling the result caching, and repeating the query. This makes use of the local disk caching, but not the result cache.
3. Run from hot: Which again repeated the query, but with the result caching switched on.
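For reference, these three conditions can be reproduced with a couple of warehouse and session level controls. A minimal sketch, assuming an illustrative warehouse named TEST_WH:

    -- Cold run: suspend and resume the warehouse, so it starts with an
    -- empty local disk (SSD) cache. TEST_WH is an illustrative name.
    ALTER WAREHOUSE test_wh SUSPEND;
    ALTER WAREHOUSE test_wh RESUME;

    -- Warm run: keep the local disk cache, but switch off result re-use
    -- for this session before repeating the query.
    ALTER SESSION SET USE_CACHED_RESULT = FALSE;

    -- Hot run: re-enable the result cache (the default) and repeat the query.
    ALTER SESSION SET USE_CACHED_RESULT = TRUE;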
Each query ran against 60Gb of data, although as Snowflake returns only the columns queried and automatically compresses the data, the actual data transfers were around 12Gb. As Snowflake is a columnar data warehouse, it automatically returns only the columns needed rather than the entire row, to further help maximize query performance.
This query returned in around 20 seconds, and the query profile demonstrates it scanned around 12Gb of compressed data, with 0% from the local disk cache. This means it had no benefit from disk caching.
The bar chart above demonstrates around 50% of the time was spent on local or remote disk I/O, and only 2% on actually processing the data. Clearly, any design changes we can make to reduce the disk I/O will improve query performance. The results also demonstrate the query was unable to perform any partition pruning, which might otherwise improve performance. We'll cover the effect of partition pruning and data clustering in a later article.
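You can check where a query's data actually came from after the event. The sketch below assumes access to the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view (which can lag real time by up to 45 minutes); the filter on the query text is purely illustrative:

    -- List recent queries with elapsed time, bytes scanned, and the
    -- percentage of data served from the local disk cache.
    SELECT query_id,
           total_elapsed_time / 1000 AS elapsed_seconds,
           bytes_scanned,
           percentage_scanned_from_cache
    FROM   snowflake.account_usage.query_history
    WHERE  query_text ILIKE '%group by r_name%'   -- illustrative filter
    ORDER BY start_time DESC
    LIMIT  10;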
This query was executed immediately afterwards, but with the result cache disabled, and it completed in 1.2 seconds, around 16 times faster. In this case, the Local Disk cache (which is actually SSD on Amazon Web Services) was used to return results, and disk I/O was no longer a concern.
In the above case, the disk I/O was reduced to around 11% of the total elapsed time, and 99% of the data came from the (local disk) cache: an impressive result while querying 1.5 billion rows.
This query returned results in milliseconds, and involved re-executing the query, but this time with the result cache enabled. Normally, this is the default situation, but it was deliberately disabled for the earlier tests. Many BI and reporting applications involve repeatedly refreshing a series of screens and dashboards by re-executing the same SQL. In these cases, the results are returned in milliseconds.
Although more information is available in the Snowflake Documentation, a series of tests demonstrated the result cache will be reused unless the underlying data (or the SQL query) has changed. Additional tests demonstrated that inserts, updates and deletes which don't affect the data being queried are ignored, and the result cache is still used, provided the data needed by the query has not changed.
Finally, results are normally retained for 24 hours, although the clock is reset every time the query is re-executed, up to a limit of 31 days, after which the query must be re-executed against the data on remote disk.
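As an aside, a cached result set can also be queried directly, which is handy for post-processing results without re-running the original query. A minimal sketch using the RESULT_SCAN table function:

    -- Re-use the result set of the previous query in this session,
    -- without re-executing it against the underlying tables.
    SELECT *
    FROM   TABLE(RESULT_SCAN(LAST_QUERY_ID()));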
Snowflake Performance Summary
The sequence of tests was designed purely to illustrate the effect of data caching on Snowflake. The tests included:-
Raw Data: Including over 1.5 billion rows of TPC generated data, a total of over 60Gb of raw data
Initial Query: Took 20 seconds to complete, and ran entirely from the remote disk. Quite impressive.
Second Query: Was 16 times faster at 1.2 seconds and used the Local Disk (SSD) cache.
Result Set Query: Returned results in 130 milliseconds from the result cache (which was intentionally disabled on the prior query).
To put the above results in context, I repeatedly ran the same query on an Oracle 11g production database server for a tier-one investment bank, and it took over 22 minutes to complete.
Finally, unlike Oracle, where additional care and effort must be taken to ensure correct partitioning, indexing, statistics gathering and data compression, Snowflake caching is entirely automatic, and available by default. Absolutely no effort was made to tune either the queries or the underlying design, although there are a small number of options available, which I'll discuss in the next article.
Clearly, data caching makes a massive difference to Snowflake query performance, but what can you do to maintain that performance when you cannot change the cache?
Auto-Suspend: By default, Snowflake will auto-suspend a virtual warehouse (the compute resources with the SSD cache) after 10 minutes of idle time. Best practice? Leave this alone. Keep in mind, you should be trying to balance the cost of providing compute resources with fast query performance. To illustrate the point, consider these two extremes (a sketch of the relevant settings follows this list):
1. Suspend after 60 seconds: When the warehouse is re-started, it will (most likely) start with a clean cache, and it will take a few queries before the relevant data is held in memory. (Note: Snowflake will try to restore the same cluster, with the cache intact, but this is not guaranteed.)
2. Suspend Never: And your cache will always be warm, but you will pay for compute resources, even if nobody is running any queries. However, provided you set up a script to shut down the warehouse when it's not being used, this may make sense.
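Both behaviours are controlled by warehouse parameters. A minimal sketch, again assuming an illustrative warehouse named TEST_WH:

    -- AUTO_SUSPEND is specified in seconds: 600 = suspend after 10 minutes idle.
    -- Setting it to 0 disables auto-suspend entirely ("Suspend Never").
    ALTER WAREHOUSE test_wh SET AUTO_SUSPEND = 600 AUTO_RESUME = TRUE;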
Scale up for large data volumes: If you have a sequence of large queries to perform against massive (multi-terabyte) data volumes, you can improve query performance by scaling up. Simply execute a SQL statement to increase the virtual warehouse size, and new queries will start on the larger (faster) cluster. While this will start with a clean (empty) cache, you should normally find performance doubles at each size, and this extra performance boost will more than outweigh the cost of refreshing the cache.
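The resize itself is a one-line statement; a sketch with the same illustrative warehouse name:

    -- Resize the warehouse; queries already running finish on the existing
    -- resources, while new queries start on the larger cluster.
    ALTER WAREHOUSE test_wh SET WAREHOUSE_SIZE = 'XLARGE';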
Scale down - but not too soon: Once your large task has completed, you could reduce costs by scaling down or even suspending the virtual warehouse. Be aware again, however, that the cache will start clean on the smaller cluster. By all means tune the warehouse size dynamically, but don't keep adjusting it, or you'll lose the benefit.
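And the corresponding wind-down, using the same illustrative warehouse name:

    -- Scale back down once the heavy workload has finished...
    ALTER WAREHOUSE test_wh SET WAREHOUSE_SIZE = 'MEDIUM';

    -- ...or suspend the warehouse entirely if it's no longer needed.
    ALTER WAREHOUSE test_wh SUSPEND;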