Snowflake Streams 101 Guide (2024)
Capturing changes within tables efficiently remains a critical challenge for data engineers.
methods like updated timestamps, row versioning, or complex log scanning. These
methods like updated timestamps, row versioning, or complex log scanning. These
approaches can become difficult to manage and scale, especially for large datasets and real-
time applications. Fortunately, Snowflake offers a simple and elegant solution: Snowflake
Stream Objects, which take a fundamentally different approach to CDC by introducing
three dedicated meta-columns to your existing tables. These columns seamlessly track all
data modifications.
In this article, we will provide an in-depth overview of Snowflake streams. We will dive
into key concepts, syntax, a step-by-step process for creating a Snowflake Stream, types of
Snowflake streams, real-world examples, best practices, limitations, disadvantages of
Snowflake Streams, the distinction between streams and tasks—and so much more!
What Are Snowflake Streams?
Snowflake streams are objects that track all DML operations against a source object. Under the hood, a Snowflake stream exposes three metadata columns—METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID—alongside the source table's columns when queried. These additional columns allow the stream to describe inserts, updates, and deletes without having to store all the table data. In
essence, a stream sets an offset, which acts as a bookmark on the source table's version
timeline. When the stream is queried, it accesses the native versioning history of the source
table and returns only the minimal set of row changes that occurred after the stream's offset
pointer, joining this data with the current table contents to reconstruct the changes.
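To make this concrete, here is a minimal sketch (table, stream, and column names are illustrative, not from any specific example in this guide) of creating a stream and reading its change records:
-- Create a source table and a stream that tracks changes to it
CREATE OR REPLACE TABLE orders (id INT, amount NUMBER);
CREATE OR REPLACE STREAM orders_stream ON TABLE orders;

-- Make some changes after the stream's offset was set
INSERT INTO orders VALUES (1, 100), (2, 250);

-- The stream returns only the rows changed since the offset,
-- plus the three metadata columns
SELECT id, amount, METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID
FROM orders_stream;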
Snowflake streams can be created on the following objects:
Standard tables
Directory tables
External tables
Views
Snowflake streams cannot be created on the following objects:
Materialized views
Secure objects (like secure views or secure UDFs)
Some key things to know about Snowflake streams:
There is no hard limit on the number of Snowflake streams you can create
Standard tables, views, directory tables, and external tables are supported source objects
The overhead for streams is minimal—they reference source table data
As previously mentioned, in addition to the data columns from the source table, Snowflake streams also return additional metadata. Let's delve into these extra columns in brief:
METADATA$ACTION: indicates whether the change record represents an INSERT or a DELETE.
METADATA$ISUPDATE: TRUE when the record is part of an UPDATE (an update is recorded as a DELETE and INSERT pair).
METADATA$ROW_ID: a unique, immutable ID for the row, which you can use to track changes to a specific row over time.
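As a quick illustration (continuing the hypothetical orders example from earlier, and assuming the initial inserts were already consumed so the offset sits after them), an UPDATE surfaces as a pair of change records:
-- Update a row after the stream's current offset
UPDATE orders SET amount = 150 WHERE id = 1;

-- The stream reports the update as a DELETE of the old row values and
-- an INSERT of the new values, both with METADATA$ISUPDATE = TRUE
SELECT id, amount, METADATA$ACTION, METADATA$ISUPDATE
FROM orders_stream;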
A few factors determine how much overhead Snowflake streams introduce:
Size + frequency of the data changes in the source objects. If the source objects have a high volume or velocity of data changes, Snowflake streams may grow rapidly and consume more resources.
Consumption rate and pattern of the streams. Streams are designed to be consumed in a
transactional fashion, meaning that the stream advances to the next offset after each query
or load operation. If the streams are not consumed regularly or completely, the change
records may accumulate and cause the streams to lag behind the source objects, which may
result in staleness.
Number + complexity of the queries or tasks that use streams. Snowflake streams
support querying and time travel modes, which allow you to access historical data and
changes at any point in time. But remember that these modes may incur additional
processing overhead and latency, especially if the streams have a large number of change
records or a long retention period.
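One lightweight way to keep an eye on consumption (a sketch; the stream name is illustrative) is to check whether a stream has pending change records before kicking off downstream work:
-- Returns TRUE if the stream currently contains change records
SELECT SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM');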
An offset is a point in time that marks the starting point for a stream. When you create a
stream, Snowflake takes a logical snapshot of every row in the source object and assigns a
transactional version to it. This version is the offset for the stream. The stream then records
information about the DML changes that occur after this offset. For example, if you insert,
update, or delete rows in the source object, the stream will return the state of the rows before and after the change, along with the metadata columns that describe the change event.
A stream does not store any actual data from the source object. It only stores the offset and
returns the change records by using the versioning history of the source object. The
versioning history is maintained by Snowflake using some hidden columns that are added to
the source object when the first stream is created. These columns store the change tracking
metadata. The change records returned by the stream depend on the combination of the
offset and the change tracking metadata.
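If you prefer to be explicit rather than relying on stream creation to do it, change tracking can also be enabled directly on a table; a sketch:
-- Enable change tracking explicitly (creating the first stream on a
-- table turns this on implicitly)
ALTER TABLE orders SET CHANGE_TRACKING = TRUE;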
An offset is like a bookmark that indicates where you are in a book (i.e. the source object).
You can move the bookmark to different places in the book by creating new streams or
using Time Travel. This way, you can access the change records for the source object at
different points in time. For example, you can create a stream with an offset of one hour
ago, and see the changes that happened in the last hour. Or, you can use Time Travel to
query a stream with an offset of yesterday, and see the changes that happened yesterday.
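In SQL terms, a Time Travel clause on CREATE STREAM is what sets the initial offset; a sketch with illustrative names:
-- Create a stream whose offset starts one hour in the past, so it
-- immediately returns the changes from the last hour
CREATE OR REPLACE STREAM orders_stream_1h
ON TABLE orders
AT (OFFSET => -3600);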
An offset is useful for consuming the change records in a consistent and accurate way. It
ensures that the stream returns the same set of records until you consume or advance the
stream. It also supports repeatable read isolation, which means that the stream does not reflect any concurrent changes that happen in the source object while you are querying the stream.
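Consuming the stream in a DML statement is what advances the offset; here is a minimal sketch (the target table is illustrative):
-- A plain SELECT does not advance the offset; using the stream as the
-- source of a DML statement inside a committed transaction does
CREATE TABLE IF NOT EXISTS orders_history (id INT, amount NUMBER, action STRING);

BEGIN;
INSERT INTO orders_history
SELECT id, amount, METADATA$ACTION FROM orders_stream;
COMMIT;
-- After the commit, the consumed change records are no longer returned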
Snowflake offers three flavors of streams to match different needs for capturing data
changes:
Standard streams
Append-only streams
Insert-only streams
Let's quickly cover what each one does.
1) Standard Streams
First up is the Standard Snowflake stream. As its name suggests, this type tracks all modifications made to the source table, including inserts, updates, and deletes. If you need full change data capture capability, standard Snowflake streams are the way to go.
2) Append-Only Streams
Next is append-only Snowflake streams. These types of Snowflake streams strictly record
new rows added to the table—so just INSERTS. Update and delete operations (including
table truncates) are not recorded. Append-only streams are great when you just need to see
new data as it arrives.
CREATE OR REPLACE STREAM my_append_stream
ON TABLE my_table
APPEND_ONLY = TRUE;
Snowflake Stream Example: Creating an append-only Snowflake stream
TL;DR: Append-only streams track only INSERT events and are supported on standard tables, directory tables, and views.
3) Insert-Only Streams
Last but not least is insert-only Snowflake streams, which are supported on external tables only. As the name hints, these track row inserts only; they do not record delete operations that remove rows from an inserted set.
To create an insert-only stream on an external table, you can use the following syntax:
CREATE OR REPLACE STREAM my_ext_stream
ON EXTERNAL TABLE my_ext_table
INSERT_ONLY = TRUE;
Cloning Snowflake streams is a piece of cake! You can clone a stream to create a copy of an
existing stream with the same definition and offset. A clone inherits the current
transactional table version from the source stream, which means it returns the same set of
records as the source stream until the clone is consumed or advanced.
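For example (names illustrative, reusing the earlier hypothetical stream):
-- The clone starts at the same offset as the source stream
CREATE OR REPLACE STREAM orders_stream_clone CLONE orders_stream;
You can then verify both streams exist: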
SHOW STREAMS;
Snowflake Stream Example: Listing streams with SHOW STREAMS
Working with Snowflake streams is super easy once you get familiar with it. Let’s walk you
through different examples to showcase how to create and use Snowflake streams on tables.
We'll start from the absolute basics.
CREATE OR REPLACE TABLE my_table (
  id INT,
  name VARCHAR,
  created_date DATE
);
CREATE OR REPLACE STREAM my_stream ON TABLE my_table;
Creating a Snowflake stream on a standard table
Boom! This stream will now track inserts, updates, and deletes on my_table. We can easily query it to see changes:
SELECT * FROM my_stream;
CREATE OR REPLACE STREAM my_append_stream
ON TABLE my_table
APPEND_ONLY = TRUE;
Creating an append-only Snowflake stream
This type ignores updates/deletes and is more lightweight. To test, we update an existing row and then query the stream:
UPDATE my_table SET name = 'updated' WHERE id = 1;
SELECT * FROM my_append_stream;
Since append-only streams ignore updates, this query returns no change records for the update.
Streams work nicely on transient tables too. Transient tables are similar to permanent tables
with the key difference that they do not have a Fail-safe period.
First, create a transient table (the schema here is illustrative):
CREATE OR REPLACE TRANSIENT TABLE my_temp_table (
  id INT,
  name VARCHAR
);
Now let's create a stream on the transient table:
CREATE OR REPLACE STREAM my_temp_stream ON TABLE my_temp_table;
Creating a Snowflake stream on a transient table
This stream will now track inserts, updates, and deletes on my_temp_table. We can query it to see changes:
SELECT * FROM my_temp_stream;
Next up, we'll demonstrate creating a stream on an external table. To do this, you'll need an existing external table. The process remains mostly the same as for a standard table, except that you replace the TABLE keyword with EXTERNAL TABLE and set INSERT_ONLY = TRUE.
CREATE OR REPLACE EXTERNAL TABLE my_ext_table
LOCATION = @MY_AWS_STAGE
FILE_FORMAT = (FORMAT_NAME = 'my_format');

CREATE OR REPLACE STREAM my_ext_stream
ON EXTERNAL TABLE my_ext_table
INSERT_ONLY = TRUE;
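Note that a stream on an external table picks up new rows only after the external table's metadata is refreshed, either automatically (via the AUTO_REFRESH parameter) or manually; a sketch of a manual refresh:
-- Refresh the external table metadata so newly added files
-- show up in the insert-only stream
ALTER EXTERNAL TABLE my_ext_table REFRESH;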
Snowflake streams do have some limitations worth keeping in mind:
Scalability: Creating and querying many streams concurrently could impact performance. Throttling may be required.
Data Volumes + Throughput: Ingesting high volumes of data into source tables could
overload stream change tracking capabilities leading to lag or data loss scenarios.
Latency: There is often a slight delay in change data appearing in streams due to
transactional and consistency model differences. Tuning may help produce lower latency
CDC.
Complexity: Managing many Snowflake streams, monitoring for issues, and maintaining
robust change consuming pipelines requires strong data engineering skills.
Support: Streams may not fully capture changes for all data types and formats, such as semi-structured or geospatial data.
Make sure to carefully test Snowflake streams against your specific data and use case.
Snowflake streams also have some disadvantages that you should be aware of before using
them. Some of these disadvantages are:
Snowflake streams are not compatible with some types of objects and operations in
Snowflake. For example, streams cannot track changes in materialized views.
Snowflake streams may incur additional storage and performance costs. Streams store
the offset and change tracking metadata for the source object in hidden internal tables,
which consume some storage space.
Streams can only track DML changes, such as insert, update, or delete, but not DDL changes such as altering or dropping the source object.
Streams may require more maintenance and monitoring. Streams are designed to be
consumed regularly and completely, meaning that the stream advances to the next offset
after each query or load operation. If the streams are not consumed or advanced, the change
records may accumulate and cause the streams to lag behind the source object.
What Is the Difference Between a Stream and a Task in Snowflake?
A stream and a task solve different problems, and they are frequently used together. A stream records the DML changes made to a source object so they can be consumed later; it tells you what changed. A task is a schedulable object that executes a SQL statement or calls a stored procedure; it determines when work runs. In a typical CDC pipeline, a task runs on a schedule, checks whether a stream has change records, and merges those records into a target table, advancing the stream's offset in the process.
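A common pattern (a sketch; the warehouse, task, and table names are illustrative and reuse the earlier hypothetical stream) pairs the two:
-- Run every 5 minutes, but only do work when the stream has data
CREATE OR REPLACE TASK process_orders_task
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO orders_history
  SELECT id, amount, METADATA$ACTION FROM orders_stream;

-- Tasks are created suspended; resume the task to start the schedule
ALTER TASK process_orders_task RESUME;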
What Is Stream Staleness in Snowflake?
A Snowflake stream can become stale if its offset moves outside the data retention period for the source table/view.
At that point, the historical changes required to recreate stream results are no longer
accessible. Any unconsumed records in the stream are lost.
To avoid staleness, Snowflake streams should be read frequently enough to keep the offset
within the table's retention window. If the data retention period for a table is less than 14
days, and a stream has not been consumed, Snowflake temporarily extends this period to
prevent it from going stale. The period is extended to the stream’s offset, up to a maximum
of 14 days by default.
The STALE_AFTER metadata field predicts when a stale state will occur based on
consumption patterns. Monitor and consume streams prior to that timestamp.
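You can inspect this from SQL; a quick sketch:
-- The output of SHOW STREAMS includes STALE and STALE_AFTER columns
SHOW STREAMS LIKE 'orders_stream';
DESCRIBE STREAM orders_stream;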
If a stream does go stale, it must be dropped and recreated to reset the offset to the end of
the current table timeline. The stream will then resume change capture from that point
forward.
Conclusion
And that’s a wrap! Snowflake streams provide a really slick way to stay on top of your ever-changing data. Instead of cobbling together custom CDC solutions, Snowflake streams give you full access to the changes happening across your tables/views.
Snowflake streams are like smart bookmarks—they remember a position in the ongoing
history of your table. So by storing just this tiny offset metadata rather than duplicating
masses of data, streams let you travel through time to fetch row changes on demand!
FAQs
What are Snowflake streams?
Snowflake streams are objects that track DML changes to source tables or views by adding metadata columns. They act as a bookmark to access the native versioning history.

Which objects can Snowflake streams be created on?
Snowflake streams can be created on standard tables, directory tables, external tables, and views.

Is there a limit on the number of streams you can create?
No, there is no hard limit on the number of streams, but performance may degrade with a very high number.

How do you query a Snowflake stream?
You can query a stream just like a table, using the SELECT statement. You can apply filters, predicates, and joins on the stream columns.

What are the different types of Snowflake streams?
Standard streams track inserts, updates, and deletes. Append-only streams only track inserts. Insert-only streams track inserts for external tables.

Do transient tables support streams?
Yes, both permanent and transient tables support Snowflake streams to track DML changes.

Can a Snowflake stream become stale?
Yes, if a stream's offset moves outside the source table's data retention period, it can become stale and must be recreated.

Why should streams be consumed regularly?
To prevent accumulation of change records, which causes streams to lag behind source tables, as the offset is only advanced upon consumption.

What is the benefit of using Snowflake streams?
Snowflake streams enable you to query and consume the changed data in a transactional and consistent manner, ensuring data accuracy and freshness.

How can you predict when a stream will become stale?
The STALE_AFTER metadata field provides a timestamp for predicted staleness based on past consumption.