Best Practices For Data Load
The number of load operations that run in parallel cannot exceed the number of data files to be
loaded. To optimize the number of parallel operations for a load, we recommend aiming to
produce data files roughly 100-250 MB (or larger) in size compressed.
Note
If you must load a large file, carefully consider the ON_ERROR copy option value. Aborting or
skipping a file due to a small number of errors could result in delays and wasted credits. In
addition, if a data loading operation continues beyond the maximum allowed duration of 24
hours, it could be aborted without any portion of the file being committed.
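For example, rather than relying on the default error behavior for a single large file, you might set the ON_ERROR copy option explicitly. A minimal sketch, with table, stage, and file names that are placeholders:

-- Skip individual bad records instead of aborting the whole file.
COPY INTO my_table
  FROM @my_stage/large_file.csv.gz
  FILE_FORMAT = (TYPE = CSV)
  ON_ERROR = 'CONTINUE';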
Aggregate smaller files to minimize the processing overhead for each file. Split larger files into a
greater number of smaller files to distribute the load among the compute resources in an active
warehouse. The number of data files that are processed in parallel is determined by the amount
of compute resources in a warehouse. We recommend splitting large files by line to avoid
records that span chunks.
If your source database does not allow you to export data files in smaller chunks, you can use a
third-party utility to split large CSV files.
Linux or macOS
The split utility enables you to split a CSV file into multiple smaller files.
Syntax:
split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]
Example:
This example splits a file named pagecounts-20151201.csv by line count. Suppose the single large
file is 8 GB in size and contains 10 million lines. Splitting it into chunks of 100,000 lines yields
100 smaller files of roughly 80 MB each (10 million / 100,000 = 100). The split files are named pages<suffix>.
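A command of the following form would produce that result (the output file prefix pages matches the example above):

split -l 100000 pagecounts-20151201.csv pages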
Windows
Windows does not include a native file split utility; however, Windows supports many third-
party tools and scripts that can split large data files.
In general, JSON data sets are a simple concatenation of multiple documents. The JSON output
from some software is composed of a single huge array containing multiple records. There is no
need to separate the documents with line breaks or commas, though both are supported.
If the data exceeds 16 MB, enable the STRIP_OUTER_ARRAY file format option for
the COPY INTO <table> command to remove the outer array structure and load the records into
separate table rows:
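A sketch of such a load, with table, stage, and file names that are placeholders:

-- Remove the outer array so that each element is loaded as its own row.
COPY INTO my_json_table
  FROM @my_stage/large_array_file.json
  FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE);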
In addition to resource consumption, an overhead to manage files in the internal load queue is
included in the utilization costs charged for Snowpipe. This overhead increases in relation to the
number of files queued for loading. This overhead appears as Snowpipe charges on your billing
statement because Snowpipe is used for the event notifications that trigger automatic external
table refreshes.
For the most efficient and cost-effective load experience with Snowpipe, we recommend
following the file sizing recommendations in File Sizing Best Practices and Limitations (in this
topic). Loading data files roughly 100-250 MB in size or larger reduces the overhead charge
relative to the amount of total data loaded to the point where the overhead cost is immaterial.
If it takes longer than one minute to accumulate MBs of data in your source application, consider
creating a new (potentially smaller) data file once per minute. This approach typically leads to a
good balance between cost (i.e. resources spent on Snowpipe queue management and the actual
load) and performance (i.e. load latency).
Creating smaller data files and staging them in cloud storage more often than once per minute
has the following disadvantages:
A reduction in latency between staging and loading the data cannot be guaranteed.
An overhead to manage files in the internal load queue is included in the utilization costs
charged for Snowpipe. This overhead increases in relation to the number of files queued
for loading.
Various tools can aggregate and batch data files. One convenient option is Amazon Kinesis
Firehose. Firehose allows defining both the desired file size, called the buffer size, and the wait
interval after which a new file is sent (to cloud storage in this case), called the buffer interval.
For more information, see the Kinesis Firehose documentation. If your source application
typically accumulates enough data within a minute to populate files larger than the recommended
maximum for optimal parallel processing, you could decrease the buffer size to trigger delivery
of smaller files. Keeping the buffer interval setting at 60 seconds (the minimum value) helps
avoid creating too many files or increasing latency.
UTF-8 is the default character set; however, additional encodings are supported. Use the
ENCODING file format option to specify the character set for the data files. For more
information, see CREATE FILE FORMAT.
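For example, a file format for Latin-1 encoded CSV files might look like this (the format name is arbitrary):

-- Declare the character set of the source files explicitly.
CREATE OR REPLACE FILE FORMAT my_latin1_csv
  TYPE = 'CSV'
  ENCODING = 'ISO-8859-1';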
Fields that contain delimiter characters should be enclosed in quotes (single or double). If
the data contains single or double quotes, then those quotes must be escaped.
Carriage returns are commonly introduced on Windows systems in conjunction with a
line feed character to mark the end of a line (\r\n). Fields that contain carriage returns
should also be enclosed in quotes (single or double).
The number of columns in each row should be consistent.
By default, Snowflake extracts a maximum of 200 elements per partition, per table. To increase
this limit, contact Snowflake Support.
Elements with the following characteristics are not extracted into a column:
Elements that contain even a single “null” value are not extracted into a column. This
applies to elements with “null” values and not to elements with missing values, which are
represented in columnar form.
This rule ensures that no information is lost (that is, that the difference between
VARIANT “null” values and SQL NULL values is not lost).
{"foo":1}
{"foo":"1"}
When you query a semi-structured element, Snowflake’s execution engine behaves differently
according to whether an element was extracted.
If the element was extracted into a column, the engine scans only the extracted column.
If the element was not extracted into a column, the engine must scan the entire JSON
structure, and then for each row traverse the structure to output values. This impacts
performance.
To avoid the performance impact for elements that were not extracted, do the following:
Extract semi-structured data elements containing “null” values into relational columns
before you load them.
Alternatively, if the “null” values in your files indicate missing values and have no other
special meaning, we recommend setting the file format option STRIP_NULL_VALUES
to TRUE when you load the semi-structured data files (see the sketch after this list). This
option removes OBJECT elements or ARRAY elements containing “null” values.
Ensure each unique element stores values of a single data type that is native to the format
(for example, string or number for JSON).
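A sketch of the STRIP_NULL_VALUES approach mentioned above, with table and stage names that are placeholders:

-- Drop OBJECT and ARRAY elements whose value is "null" so that the
-- remaining elements can still be extracted into columns.
COPY INTO my_events
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = 'JSON' STRIP_NULL_VALUES = TRUE);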
Note
Snowflake checks temporal data values at load time. Invalid date, time, and timestamp values
(e.g. 0000-00-00) produce an error.
The number of data files that can be processed in parallel is determined by the amount of
compute resources in a warehouse. If you follow the file sizing guidelines described in Preparing
your data files, a data load requires minimal resources. Splitting larger data files allows the load
to scale linearly. Unless you are bulk loading a large number of files concurrently (i.e. hundreds
or thousands of files), a smaller warehouse (Small, Medium, Large) is generally sufficient. Using
a larger warehouse (X-Large, 2X-Large, etc.) will consume more credits and may not result in
any performance increase.
Staging data
This topic provides best practices, general guidelines, and important considerations
for staging your data files prior to loading them.
Organizing your data files by path lets you copy any fraction of the partitioned data into
Snowflake with a single command. This allows you to execute concurrent COPY statements that
match a subset of files, taking advantage of parallel operations.
For example, if you were storing data for a North American company by geographical location,
you might include identifiers such as continent, country, and city in paths along with data write
dates:
Canada/Ontario/Toronto/2016/07/10/05/
United_States/California/Los_Angeles/2016/06/01/11/
United_States/New_York/New_York/2016/12/21/03/
United_States/California/San_Francisco/2016/08/03/17/
When you create a named stage, you can specify any part of a path. For example, create an
external stage using one of the above example paths:
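For instance, a stage scoped to the Los Angeles path might be created along these lines (the bucket name and credentials are placeholders):

-- The stage URL can include as much of the path as you want to scope to.
CREATE STAGE my_stage
  URL = 's3://mybucket/United_States/California/Los_Angeles/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');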
You can also add a path when you stage files in an internal user or table stage. For example,
stage mydata.csv in a specific path in the t1 table stage:
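For example (the local file path is illustrative):

-- Stage the local file under a specific path in the t1 table stage.
PUT file:///data/mydata.csv @%t1/United_States/California/Los_Angeles/2016/06/01/11/;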
Use any of the following options to further confine the list of files to load:
If the file names match except for a suffix or extension, include the matching part of the
file names in the path, e.g.:
COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/mydata;
Add the FILES or PATTERN options (see Options for selecting staged data files), e.g.:
COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/ FILES=('mydata1.csv', 'mydata2.csv');
COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/ PATTERN='.*mydata[^[0-9]{1,3}$$].csv';
Loading data
This topic provides best practices, general guidelines, and important considerations
for loading staged data.
The COPY command supports several options for loading data files from a stage:
By path (internal stages) / prefix (Amazon S3 bucket). See Organizing data by path for
information.
Specifying a list of specific files to load.
Using pattern matching to identify specific files by pattern.
These options enable you to copy a fraction of the staged data into Snowflake with a single
command. This allows you to execute concurrent COPY statements that match a subset of files,
taking advantage of parallel operations.
Lists of files
The COPY INTO <table> command includes a FILES parameter to load files by specific name.
Tip
Of the three options for identifying/specifying data files to load from a stage, providing a discrete
list of files is generally the fastest; however, the FILES parameter supports a maximum of 1,000
files, meaning a COPY command executed with the FILES parameter can only load up to 1,000
files.
For example:
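A sketch, assuming a table named load1 loading from its own table stage:

-- Load only the two named files from the data1/ path.
COPY INTO load1
  FROM @%load1/data1/
  FILES = ('test1.csv', 'test2.csv');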
File lists can be combined with paths for further control over data loading.
Pattern matching
The COPY INTO <table> command includes a PATTERN parameter to load files using a
regular expression.
For example:
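A sketch, again with assumed table and file names:

-- Load every staged file whose name matches the regular expression.
COPY INTO load1
  FROM @%load1/data1/
  PATTERN = '.*sales.*[.]csv';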
Pattern matching using a regular expression is generally the slowest of the three options for
identifying/specifying data files to load from a stage; however, this option works well if you
exported your files in named order from your external application and want to batch load the
files in the same order.
Pattern matching can be combined with paths for further control over data loading.
Note
The regular expression is applied differently to bulk data loads versus Snowpipe data loads.
Snowpipe trims any path segments in the stage definition from the storage location and
applies the regular expression to any remaining path segments and filenames. To view the
stage definition, execute the DESCRIBE STAGE command for the stage. The URL
property consists of the bucket or container name and zero or more path segments. For
example, if the FROM location in a COPY INTO <table> statement is @s/path1/path2/ and
the URL value for stage @s is s3://mybucket/path1/, then Snowpipe trims /path1/ from the
storage location in the FROM clause and applies the regular expression to path2/ plus the
filenames in the path.
Bulk data load operations apply the regular expression to the entire storage location in the
FROM clause.
Snowflake recommends that you enable cloud event filtering for Snowpipe to reduce costs, event
noise, and latency. Only use the PATTERN option when your cloud provider’s event filtering
feature is not sufficient. For more information about configuring event filtering for each cloud
provider, see the following pages:
Configuring event notifications using object key name filtering - Amazon S3
Understand event filtering for Event Grid subscriptions - Azure
Filtering messages - Google Pub/Sub
When processing of the COPY statement is completed, Snowflake adjusts the load status for the
data files as appropriate. If one or more data files fail to load, Snowflake sets the load status for
those files as load failed. These files are available for a subsequent COPY statement to load.
Load metadata
Snowflake maintains detailed metadata for each table into which data is loaded, including the
name of each loaded file, its size and ETag, the number of rows parsed, the timestamp of the last
load, and information about any errors encountered in the file during loading.
This load metadata expires after 64 days. If the LAST_MODIFIED date for a staged data file is
within the last 64 days, the COPY command can determine its load status for a given table
and prevent reloading (and data duplication). The LAST_MODIFIED date is the timestamp
when the file was initially staged or when it was last modified, whichever is later.
If the LAST_MODIFIED date is older than 64 days, the load status is still known if either of the
following events occurred within the 64 days prior to the current date:
The file was loaded successfully.
The initial set of data for the table (i.e. the first batch after the table was created) was loaded.
However, the COPY command cannot definitively determine whether a file has been loaded
already if the LAST_MODIFIED date is older than 64 days and the initial set of data was loaded
into the table more than 64 days earlier (and if the file was loaded into the table, that also
occurred more than 64 days earlier). In this case, to prevent accidental reload, the command
skips the file by default.
Workarounds
To load files whose metadata has expired, set the LOAD_UNCERTAIN_FILES copy option to
true. The copy option references load metadata, if available, to avoid data duplication, but also
attempts to load files with expired load metadata.
Alternatively, set the FORCE option to load all files, ignoring load metadata if it exists. Note that
this option reloads files, potentially duplicating data in a table.
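For example (table and stage names are placeholders):

-- Attempt files whose load metadata has expired, while still consulting
-- any available metadata to avoid duplicates.
COPY INTO my_table FROM @my_stage/archive/
  LOAD_UNCERTAIN_FILES = TRUE;

-- Or reload all files regardless of load metadata (may duplicate data).
COPY INTO my_table FROM @my_stage/archive/
  FORCE = TRUE;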
Examples
In this example:
A table is created on January 1, and the initial table load occurs on the same day.
64 days pass. On March 7, the load metadata expires.
A file is staged and loaded into the table on July 27 and 28, respectively. Because the file
was staged one day prior to being loaded, the LAST_MODIFIED date was within 64
days. The load status was known. There are no data or formatting issues with the file, and
the COPY command loads it successfully.
64 days pass. On September 28, the LAST_MODIFIED date for the staged file exceeds
64 days. On September 29, the load metadata for the successful file load expires.
An attempt is made to reload the file into the same table on November 1. Because the
COPY command cannot determine whether the file has been loaded already, the file is
skipped. The LOAD_UNCERTAIN_FILES copy option (or the FORCE copy option) is
required to load the file.
Use the TRIM_SPACE file format option to remove undesirable spaces during the data load.
For example, suppose each field in an example CSV file is enclosed in quotation marks and
preceded by an undesirable leading space. The following COPY command trims the leading
space and removes the quotation marks enclosing each field:
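A sketch of such a command, assuming a table named mytable loading from its own table stage:

-- TRIM_SPACE removes the leading spaces; FIELD_OPTIONALLY_ENCLOSED_BY
-- strips the enclosing quotation marks.
COPY INTO mytable
  FROM @%mytable
  FILE_FORMAT = (TYPE = CSV TRIM_SPACE = TRUE FIELD_OPTIONALLY_ENCLOSED_BY = '"');

Querying the table (SELECT * FROM mytable;) then returns the trimmed, unquoted values: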
+--------+--------+--------+
| col1 | col2 | col3 |
+--------+--------+--------+
| value1 | value2 | value3 |
+--------+--------+--------+
When staging data in Amazon S3, consider paths such as the following, which identify the source
application and the date when the data was written:
s3://bucket_name/application_one/2016/07/01/11/
s3://bucket_name/application_two/location_one/2016/07/01/14/
Here, application_one, application_two, location_one, and so on provide identifying details for
the source of all data in the path, and the data is further organized by the date when it was
written. An optional 24-hour directory reduces the amount of data in each directory.
Note
S3 transmits a directory list with each COPY statement used by Snowflake, so reducing
the number of files in each directory improves the performance of your COPY
statements. You may even consider creating subfolders of 10-15 minute increments
within the folders for each hour.
Similarly, you can also add a path when you stage files in an internal stage. For example:
PUT file:///tmp/file_20160701.11*.csv @my_stage/<application_one>/<location_one>/2016/07/01/11/;
Note
Do not remove the staged files until the data has been loaded successfully. To check whether the data
has been loaded successfully, query the COPY_HISTORY function. Check the STATUS column to
determine whether the data from the file has been loaded. Note that if the status is Load in progress,
removing the staged file can result in partial loads and data loss.
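For example, a quick check of recent load activity for a hypothetical table MY_TABLE:

-- Review the load status of files loaded into MY_TABLE over the last hour.
SELECT file_name, status, last_load_time
  FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'MY_TABLE',
    START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())));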
Staged files can be deleted from a Snowflake stage (user stage, table stage, or named stage)
using the following methods:
Files that were loaded successfully can be deleted from the stage during a load by
specifying the PURGE copy option in the COPY INTO <table> command.
After the load completes, use the REMOVE command to remove the files in the stage.
Removing files ensures they aren’t inadvertently loaded again. It also improves load
performance, because it reduces the number of files that COPY commands must scan to verify
whether existing files in a stage were loaded already.
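For example, either approach could be used for a hypothetical stage and table:

-- Option 1: delete each file automatically once it loads successfully.
COPY INTO my_table FROM @my_stage/daily/ PURGE = TRUE;

-- Option 2: remove the files yourself after confirming the load completed.
REMOVE @my_stage/daily/;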