Best Practices For Data Load

https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare

Preparing your data files


This topic provides best practices, general guidelines, and important considerations
for preparing your data files for loading.

File sizing best practices and limitations


For best load performance and to avoid size limitations, consider the following data file sizing
guidelines. Note that these recommendations apply to bulk data loads as well as continuous
loading using Snowpipe.

General file sizing recommendations

The number of load operations that run in parallel cannot exceed the number of data files to be
loaded. To optimize the number of parallel operations for a load, we recommend aiming to
produce data files roughly 100-250 MB (or larger) in size compressed.

Note

Loading very large files (e.g. 100 GB or larger) is not recommended.

If you must load a large file, carefully consider the ON_ERROR copy option value. Aborting or
skipping a file due to a small number of errors could result in delays and wasted credits. In
addition, if a data loading operation continues beyond the maximum allowed duration of 24
hours, it could be aborted without any portion of the file being committed.
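
For example, a minimal sketch of a COPY statement that continues past individual bad records instead of aborting the whole file (the table, stage, and file names here are illustrative):

COPY INTO my_table
FROM @my_stage/large_file.csv.gz
FILE_FORMAT = (TYPE = 'CSV')
ON_ERROR = 'CONTINUE';   -- skip bad rows rather than aborting the load

With ON_ERROR = 'CONTINUE', rows that fail to parse are skipped while the rest of the file is committed, which avoids repeating a long-running load because of a handful of bad records.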

Aggregate smaller files to minimize the processing overhead for each file. Split larger files into a
greater number of smaller files to distribute the load among the compute resources in an active
warehouse. The number of data files that are processed in parallel is determined by the amount
of compute resources in a warehouse. We recommend splitting large files by line to avoid
records that span chunks.

If your source database does not allow you to export data files in smaller chunks, you can use a
third-party utility to split large CSV files.

Linux or macOS

The split utility enables you to split a CSV file into multiple smaller files.

Syntax:
split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]

For more information, type man split in a terminal window.

Example:

split -l 100000 pagecounts-20151201.csv pages

This example splits a file named pagecounts-20151201.csv into chunks of 100,000 lines each. Suppose the large single
file is 8 GB in size and contains 10 million lines. Splitting it every 100,000 lines produces 100 smaller files
(10 million / 100,000 = 100), each roughly 80 MB in size. The split files are named pages<suffix>.

Windows

Windows does not include a native file split utility; however, Windows supports many third-
party tools and scripts that can split large data files.

Semi-structured data size limitations

A VARIANT can have a maximum size of up to 16 MB of uncompressed data. However, in
practice, the maximum size is usually smaller due to internal overhead. The maximum size is
also dependent on the object being stored.

For more information, see VARIANT.

In general, JSON data sets are a simple concatenation of multiple documents. The JSON output
from some software is composed of a single huge array containing multiple records. There is no
need to separate the documents with line breaks or commas, though both are supported.

If the data exceeds 16 MB, enable the STRIP_OUTER_ARRAY file format option for
the COPY INTO <table> command to remove the outer array structure and load the records into
separate table rows:

COPY INTO <table>
FROM @~/<file>.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = true);

Continuous data loads (i.e. Snowpipe) and file sizing


Snowpipe is designed to load new data typically within a minute after a file notification is sent;
however, loading can take significantly longer for really large files or in cases where an unusual
amount of compute resources is necessary to decompress, decrypt, and transform the new data.

In addition to resource consumption, an overhead to manage files in the internal load queue is
included in the utilization costs charged for Snowpipe. This overhead increases in relation to the
number of files queued for loading. This overhead charge appears as Snowpipe charges in your
billing statement because Snowpipe is also used for event notifications for the automated
refreshes of external table metadata.

For the most efficient and cost-effective load experience with Snowpipe, we recommend
following the file sizing recommendations in File Sizing Best Practices and Limitations (in this
topic). Loading data files roughly 100-250 MB in size or larger reduces the overhead charge
relative to the amount of total data loaded to the point where the overhead cost is immaterial.

If it takes longer than one minute to accumulate MBs of data in your source application, consider
creating a new (potentially smaller) data file once per minute. This approach typically leads to a
good balance between cost (i.e. resources spent on Snowpipe queue management and the actual
load) and performance (i.e. load latency).

Creating smaller data files and staging them in cloud storage more often than once per minute
has the following disadvantages:

 A reduction in latency between staging and loading the data cannot be guaranteed.
 An overhead to manage files in the internal load queue is included in the utilization costs
charged for Snowpipe. This overhead increases in relation to the number of files queued
for loading.

Various tools can aggregate and batch data files. One convenient option is Amazon Kinesis
Firehose. Firehose allows defining both the desired file size, called the buffer size, and the wait
interval after which a new file is sent (to cloud storage in this case), called the buffer interval.
For more information, see the Kinesis Firehose documentation. If your source application
typically accumulates enough data within a minute to populate files larger than the recommended
maximum for optimal parallel processing, you could decrease the buffer size to trigger delivery
of smaller files. Keeping the buffer interval setting at 60 seconds (the minimum value) helps
avoid creating too many files or increasing latency.

Preparing delimited text files


Consider the following guidelines when preparing your delimited text (CSV) files for loading:

 UTF-8 is the default character set; however, additional encodings are supported. Use the
ENCODING file format option to specify the character set for the data files. For more
information, see CREATE FILE FORMAT.
 Fields that contain delimiter characters should be enclosed in quotes (single or double). If
the data contains single or double quotes, then those quotes must be escaped.
 Carriage returns are commonly introduced on Windows systems in conjunction with a
line feed character to mark the end of a line (\r\n). Fields that contain carriage returns
should also be enclosed in quotes (single or double).
 The number of columns in each row should be consistent.
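
A minimal file format sketch that reflects these guidelines (the format name is illustrative; adjust delimiters and quoting to match your files):

CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  ENCODING = 'UTF8'                        -- character set of the data files
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'       -- fields containing delimiters or \r\n are quoted
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE;   -- enforce a consistent column count per row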

Semi-structured data files and columnarization


When semi-structured data is inserted into a VARIANT column, Snowflake uses certain rules to
extract as much of the data as possible to a columnar form. The rest of the data is stored as a
single column in a parsed semi-structured structure.

By default, Snowflake extracts a maximum of 200 elements per partition, per table. To increase
this limit, contact Snowflake Support.

Elements that are not extracted

Elements with the following characteristics are not extracted into a column:

 Elements that contain even a single “null” value are not extracted into a column. This
applies to elements with “null” values and not to elements with missing values, which are
represented in columnar form.

This rule ensures that no information is lost (that is, that the difference between
VARIANT “null” values and SQL NULL values is not lost).

 Elements that contain multiple data types. For example:

The foo element in one row contains a number:

{"foo":1}

The same element in another row contains a string:

{"foo":"1"}

How extraction impacts queries

When you query a semi-structured element, Snowflake’s execution engine behaves differently
according to whether an element was extracted.

 If the element was extracted into a column, the engine scans only the extracted column.
 If the element was not extracted into a column, the engine must scan the entire JSON
structure, and then for each row traverse the structure to output values. This impacts
performance.
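
For reference, a typical element access looks like the following (the table and VARIANT column names are illustrative); whether the engine scans only an extracted column or the full JSON structure depends on whether foo was extracted:

SELECT src:foo::NUMBER AS foo
FROM my_json_table
WHERE src:foo::NUMBER > 100;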

To avoid the performance impact for elements that were not extracted, do the following:

 Extract semi-structured data elements containing “null” values into relational
columns before you load them.

Alternatively, if the “null” values in your files indicate missing values and have no other
special meaning, we recommend setting the file format option STRIP_NULL_VALUES
to TRUE when you load the semi-structured data files. This option removes OBJECT
elements or ARRAY elements containing “null” values.

 Ensure each unique element stores values of a single data type that is native to the format
(for example, string or number for JSON).

Numeric data guidelines


 Avoid embedded characters, such as commas (e.g. 123,456).
 If a number includes a fractional component, it should be separated from the whole
number portion by a decimal point (e.g. 123456.789).
 Oracle only. The Oracle NUMBER or NUMERIC types allow for arbitrary scale,
meaning they accept values with decimal components even if the data type was not
defined with a precision or scale. In Snowflake, by contrast, columns designed for values
with decimal components must be defined with a scale to preserve the decimal portion.
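
For example, a minimal sketch of a Snowflake column definition that preserves two decimal places (the table and column names are illustrative):

CREATE TABLE invoice_amounts (
  amount NUMBER(38,2)   -- an explicit scale preserves the decimal portion of loaded values
);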

Date and timestamp data guidelines


 For information on the supported formats for date, time, and timestamp data, see Date
and Time Input and Output Formats.
 Oracle only. The Oracle DATE data type can contain date or timestamp information. If
your Oracle database includes DATE columns that also store time-related information,
map these columns to a TIMESTAMP data type in Snowflake rather than DATE.
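
For example, a hypothetical Snowflake target column for an Oracle DATE that also carries time-of-day values:

CREATE TABLE orders (
  order_id NUMBER,
  order_ts TIMESTAMP_NTZ   -- preserves the time portion that an Oracle DATE can store
);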

Note
Snowflake checks temporal data values at load time. Invalid date, time, and timestamp values
(e.g. 0000-00-00) produce an error.

Planning a data load


This topic provides best practices, general guidelines, and important considerations
for planning a data load.

Dedicating separate warehouses to load and query operations


Loading large data sets can affect query performance. We recommend dedicating separate
warehouses for loading and querying operations to optimize performance for each.

The number of data files that can be processed in parallel is determined by the amount of
compute resources in a warehouse. If you follow the file sizing guidelines described in Preparing
your data files, a data load requires minimal resources. Splitting larger data files allows the load
to scale linearly. Unless you are bulk loading a large number of files concurrently (i.e. hundreds
or thousands of files), a smaller warehouse (Small, Medium, Large) is generally sufficient. Using
a larger warehouse (X-Large, 2X-Large, etc.) will consume more credits and may not result in
any performance increase.
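
As a sketch, a small warehouse dedicated to loading might be created as follows (the warehouse name and suspend interval are illustrative):

CREATE WAREHOUSE load_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND = 300      -- suspend after 5 minutes of inactivity to save credits
  AUTO_RESUME = TRUE;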

Staging data
This topic provides best practices, general guidelines, and important considerations
for staging data.

Organizing data by path


Both internal (i.e. Snowflake) and external (Amazon S3, Google Cloud Storage, or Microsoft
Azure) stage references can include a path (or prefix in AWS terminology). When staging
regular data sets, we recommend partitioning the data into logical paths that include identifying
details such as geographical location or other source identifiers, along with the date when the
data was written.

Organizing your data files by path lets you copy any fraction of the partitioned data into
Snowflake with a single command. This allows you to execute concurrent COPY statements that
match a subset of files, taking advantage of parallel operations.

For example, if you were storing data for a North American company by geographical location,
you might include identifiers such as continent, country, and city in paths along with data write
dates:

 Canada/Ontario/Toronto/2016/07/10/05/
 United_States/California/Los_Angeles/2016/06/01/11/
 United_States/New_York/New_York/2016/12/21/03/
 United_States/California/San_Francisco/2016/08/03/17/

When you create a named stage, you can specify any part of a path. For example, create an
external stage using one of the above example paths:

CREATE STAGE my_stage URL='s3://mybucket/United_States/California/Los_Angeles/'
CREDENTIALS=(AWS_KEY_ID='1a2b3c' AWS_SECRET_KEY='4x5y6z');

You can also add a path when you stage files in an internal user or table stage. For example,
stage mydata.csv in a specific path in the t1 table stage:

PUT file:///data/mydata.csv @%t1/United_States/California/Los_Angeles/2016/06/01/11/


When loading your staged data, narrow the path to the most granular level that includes your
data for improved data load performance.

Use any of the following options to further confine the list of files to load:

 If the file names match except for a suffix or extension, include the matching part of the
file names in the path, e.g.:

COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/mydata;

 Add the FILES or PATTERN options (see Options for selecting staged data files), e.g.:

COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/
FILES=('mydata1.csv', 'mydata2.csv');

COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/
PATTERN='.*mydata[^[0-9]{1,3}$$].csv';

Loading data
This topic provides best practices, general guidelines, and important considerations
for loading staged data.

Options for selecting staged data files


The COPY command supports several options for loading data files from a stage:

 By path (internal stages) / prefix (Amazon S3 bucket). See Organizing data by path for
information.
 Specifying a list of specific files to load.
 Using pattern matching to identify specific files by pattern.

These options enable you to copy a fraction of the staged data into Snowflake with a single
command. This allows you to execute concurrent COPY statements that match a subset of files,
taking advantage of parallel operations.

Lists of files

The COPY INTO <table> command includes a FILES parameter to load files by specific name.

Tip
Of the three options for identifying/specifying data files to load from a stage, providing a discrete
list of files is generally the fastest; however, the FILES parameter supports a maximum of 1,000
files, meaning a COPY command executed with the FILES parameter can only load up to 1,000
files.

For example:

COPY INTO load1 FROM @%load1/data1/ FILES=('test1.csv', 'test2.csv', 'test3.csv')

File lists can be combined with paths for further control over data loading.

Pattern matching

The COPY INTO <table> command includes a PATTERN parameter to load files using a
regular expression.

For example:

COPY INTO people_data FROM @%people_data/data1/
PATTERN='.*person_data[^0-9{1,3}$$].csv';

Pattern matching using a regular expression is generally the slowest of the three options for
identifying/specifying data files to load from a stage; however, this option works well if you
exported your files in named order from your external application and want to batch load the
files in the same order.

Pattern matching can be combined with paths for further control over data loading.

Note

The regular expression is applied differently to bulk data loads versus Snowpipe data loads.

 Snowpipe trims any path segments in the stage definition from the storage location and
applies the regular expression to any remaining path segments and filenames. To view the
stage definition, execute the DESCRIBE STAGE command for the stage. The URL
property consists of the bucket or container name and zero or more path segments. For
example, if the FROM location in a COPY INTO <table> statement is @s/path1/path2/ and
the URL value for stage @s is s3://mybucket/path1/, then Snowpipe trims /path1/ from the
storage location in the FROM clause and applies the regular expression to path2/ plus the
filenames in the path.
 Bulk data load operations apply the regular expression to the entire storage location in the
FROM clause.
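
As a concrete sketch of the Snowpipe case above (the stage, pipe, and table names, URL, and pattern are illustrative):

CREATE STAGE s URL = 's3://mybucket/path1/';

CREATE PIPE p AUTO_INGEST = TRUE AS
  COPY INTO t FROM @s/path2/
  PATTERN = '.*sales.*[.]csv';

Here Snowpipe trims /path1/ (taken from the stage URL) and applies the pattern to path2/ plus the filenames, whereas a bulk COPY with the same pattern would match against the entire storage location in the FROM clause.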

Snowflake recommends that you enable cloud event filtering for Snowpipe to reduce costs, event
noise, and latency. Only use the PATTERN option when your cloud provider’s event filtering
feature is not sufficient. For more information about configuring event filtering for each cloud
provider, see the following pages:
 Configuring event notifications using object key name filtering - Amazon S3
 Understand event filtering for Event Grid subscriptions - Azure
 Filtering messages - Google Pub/Sub

Executing parallel COPY statements that reference the same data files


When a COPY statement is executed, Snowflake sets a load status in the table metadata for the
data files referenced in the statement. This prevents parallel COPY statements from loading the
same files into the table, avoiding data duplication.

When processing of the COPY statement is completed, Snowflake adjusts the load status for the
data files as appropriate. If one or more data files fail to load, Snowflake sets the load status for
those files as load failed. These files are available for a subsequent COPY statement to load.

Loading older files


This section describes how the COPY INTO <table> command prevents data duplication
differently based on whether the load status for a file is known or unknown. If you partition your
data in stages using logical, granular paths by date (as recommended in Organizing data by path)
and load data within a short period of time after staging it, this section largely does not apply to
you. However, if the COPY command skips older files (i.e. historical data files) in a data load,
this section describes how to bypass the default behavior.

Load metadata

Snowflake maintains detailed metadata for each table into which data is loaded, including:

 Name of each file from which data was loaded
 File size
 ETag for the file
 Number of rows parsed in the file
 Timestamp of the last load for the file
 Information about any errors encountered in the file during loading

This load metadata expires after 64 days. If the LAST_MODIFIED date for a staged data file is
less than or equal to 64 days, the COPY command can determine its load status for a given table
and prevent reloading (and data duplication). The LAST_MODIFIED date is the timestamp
when the file was initially staged or when it was last modified, whichever is later.

If the LAST_MODIFIED date is older than 64 days, the load status is still known if either of the
following events occurred less than or equal to 64 days prior to the current date:

 The file was loaded successfully.
 The initial set of data for the table (i.e. the first batch after the table was created)
was loaded.

However, the COPY command cannot definitively determine whether a file has been loaded
already if the LAST_MODIFIED date is older than 64 days and the initial set of data was loaded
into the table more than 64 days earlier (and if the file was loaded into the table, that also
occurred more than 64 days earlier). In this case, to prevent accidental reload, the command
skips the file by default.

Workarounds

To load files whose metadata has expired, set the LOAD_UNCERTAIN_FILES copy option to
true. The copy option references load metadata, if available, to avoid data duplication, but also
attempts to load files with expired load metadata.

Alternatively, set the FORCE option to load all files, ignoring load metadata if it exists. Note that
this option reloads files, potentially duplicating data in a table.
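
For example, hedged sketches of both workarounds (the table, stage, and path names are illustrative):

COPY INTO my_table
FROM @my_stage/2016/
LOAD_UNCERTAIN_FILES = TRUE;   -- also attempts files whose load metadata has expired

COPY INTO my_table
FROM @my_stage/2016/
FORCE = TRUE;                  -- reloads all files, regardless of load metadata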

Examples

In this example:

 A table is created on January 1, and the initial table load occurs on the same day.
 64 days pass. On March 7, the load metadata expires.
 A file is staged and loaded into the table on July 27 and 28, respectively. Because the file
was staged one day prior to being loaded, the LAST_MODIFIED date was within 64
days. The load status was known. There are no data or formatting issues with the file, and
the COPY command loads it successfully.
 64 days pass. On September 28, the LAST_MODIFIED date for the staged file exceeds
64 days. On September 29, the load metadata for the successful file load expires.
 An attempt is made to reload the file into the same table on November 1. Because the
COPY command cannot determine whether the file has been loaded already, the file is
skipped. The LOAD_UNCERTAIN_FILES copy option (or the FORCE copy option) is
required to load the file.
In this example:

 A file is staged on January 1.
 64 days pass. On March 7, the LAST_MODIFIED date for the staged file exceeds 64
days.
 A new table is created on September 29, and the staged file is loaded into the table.
Because the initial table load occurred less than 64 days prior, the COPY command can
determine that the file had not been loaded already. There are no data or formatting issues
with the file, and the COPY command loads it successfully.

JSON data: Removing “null” values


In a VARIANT column, NULL values are stored as a string containing the word “null,” not the
SQL NULL value. If the “null” values in your JSON documents indicate missing values and
have no other special meaning, we recommend setting the file format option
STRIP_NULL_VALUES to TRUE for the COPY INTO <table> command when loading the
JSON files. Retaining the “null” values often wastes storage and slows query processing.
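
For example, a minimal sketch (the table, stage, and file names are illustrative):

COPY INTO my_table
FROM @my_stage/events.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_NULL_VALUES = TRUE);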

CSV data: Trimming leading spaces


If your external software exports fields enclosed in quotes but inserts a leading space before the
opening quotation character for each field, Snowflake reads the leading space rather than the
opening quotation character as the beginning of the field. The quotation characters are
interpreted as string data.

Use the TRIM_SPACE file format option to remove undesirable spaces during the data load.

For example, each of the following fields in an example CSV file includes a leading space:

"value1", "value2", "value3"

The following COPY command trims the leading space and removes the quotation marks
enclosing each field:

COPY INTO mytable
FROM @%mytable
FILE_FORMAT = (TYPE = CSV TRIM_SPACE=true
FIELD_OPTIONALLY_ENCLOSED_BY = '0x22');

SELECT * FROM mytable;

+--------+--------+--------+
| col1 | col2 | col3 |
+--------+--------+--------+
| value1 | value2 | value3 |
+--------+--------+--------+

Managing regular data loads


This topic provides best practices, general guidelines, and important considerations
for managing regular data loads.

Partitioning staged data files


When planning regular data loads such as ETL (Extract, Transform, Load) processes or regular
imports of machine-generated data, it is important to partition the data in your internal (i.e.
Snowflake) stage or external locations (S3 buckets or Azure containers) using logical, granular
paths. Create a partitioning structure that includes identifying details such as application or
location, along with the date when the data was written. You can then copy any fraction of the
partitioned data into Snowflake with a single command. You can copy data into Snowflake by
the hour, day, month, or even year when you initially populate tables.

Some examples of partitioned S3 buckets using paths:

s3://bucket_name/application_one/2016/07/01/11/

s3://bucket_name/application_two/location_one/2016/07/01/14/

Where:

application_one, application_two, location_one, etc.

Identifying details for the source of all data in the path. The data can be organized by the
date when it was written. An optional 24-hour directory reduces the amount of data in
each directory.

Note
S3 transmits a directory list with each COPY statement used by Snowflake, so reducing
the number of files in each directory improves the performance of your COPY
statements. You may even consider creating subfolders of 10-15 minute increments
within the folders for each hour.

Similarly, you can also add a path when you stage files in an internal stage. For example:

PUT file:///tmp/file_20160701.11*.csv
@my_stage/<application_one>/<location_one>/2016/07/01/11/;

Loading staged data


Load organized data files into Snowflake tables by specifying the precise path to the staged files.
For more information, see Organizing data by path.

Removing loaded data files


When data from staged files is loaded successfully, consider removing the staged files to ensure
the data is not inadvertently loaded again (duplicated).

Note
Do not remove the staged files until the data has been loaded successfully. To check whether the data
has been loaded successfully, query the COPY_HISTORY table function. Check the STATUS column to
determine if the data from the file has been loaded. Note that if the status is Load in progress,
removing the staged file can result in partial loads and data loss.
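
For example, a sketch of a COPY_HISTORY check for loads into a hypothetical table MYTABLE over the last day:

SELECT file_name, status, last_load_time
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'MYTABLE',
  START_TIME => DATEADD(day, -1, CURRENT_TIMESTAMP())));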

Staged files can be deleted from a Snowflake stage (user stage, table stage, or named stage)
using the following methods:

 Files that were loaded successfully can be deleted from the stage during a load by
specifying the PURGE copy option in the COPY INTO <table> command.
 After the load completes, use the REMOVE command to remove the files in the stage.

Removing files ensures they aren’t inadvertently loaded again. It also improves load
performance, because it reduces the number of files that COPY commands must scan to verify
whether existing files in a stage were loaded already.
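
For example, hedged sketches of both approaches (the table, stage, and path are illustrative):

COPY INTO my_table
FROM @my_stage/2016/07/01/11/
PURGE = TRUE;   -- delete staged files automatically after a successful load

REMOVE @my_stage/2016/07/01/11/ PATTERN = '.*[.]csv[.]gz';   -- clean up remaining files after the load completes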
