Snowflake
1. Storage Layer:
● Responsible for storing data in an efficient and scalable manner.
● Cloud based: integrates with major cloud providers such as AWS, GCP, and
Microsoft Azure.
● Columnar format: Snowflake stores data in a columnar format, which is
optimized for analytical queries and well-suited for data aggregation.
● Micro-partitioning: Snowflake automatically divides tables into small,
contiguous storage units called micro-partitions, which speeds up execution
by letting queries skip partitions that are not relevant.
● Zero-copy cloning: Snowflake has a unique feature that lets it create
virtual clones of data. A clone consumes no additional storage unless
changes are made to it.
● The storage layer scales horizontally, which means it can handle
increasing data volumes by adding more servers to distribute the load.
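As a quick illustration of zero-copy cloning, a clone is created with a single statement (the table names here are hypothetical):

```sql
-- Create a full, queryable copy of the orders table.
-- No data is physically duplicated; storage is only consumed
-- once rows in the clone (or the original) are modified.
CREATE TABLE orders_dev CLONE orders;
```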
2. Compute Layer:
The compute layer is the engine that executes your queries.
● Virtual warehouses: You can think of virtual warehouses as teams of
computers (compute nodes) designed to handle query processing.
● Virtual warehouses come in different sizes, and consequently at different
prices (the sizes include XS, S, M, L, XL, and larger).
● Multi-cluster, multi-node architecture: The compute layer uses multiple
clusters with multiple nodes for high concurrency, allowing several users
to access and query the data simultaneously.
● Automatic query optimization: Snowflake's system analyzes queries and
uses historical data to identify patterns it can optimize.
● Results cache: The compute layer includes a cache that stores the results
of frequently executed queries.
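A virtual warehouse is created and sized in SQL; a minimal sketch (the warehouse name is hypothetical):

```sql
-- Create an extra-small warehouse that suspends after 60 seconds
-- of inactivity and resumes automatically when a query arrives.
CREATE WAREHOUSE my_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```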
A stage in Snowflake is a storage area where you can upload your local files. These can
be structured or semi-structured data files. For example, you might create a stage named
my_local_files.
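Creating the stage and uploading a file can be sketched as follows (the local file path is hypothetical; PUT is run from a client such as SnowSQL):

```sql
-- Create an internal named stage for uploaded files.
CREATE STAGE my_local_files;

-- Upload a local file into the stage.
PUT file:///tmp/employees.csv @my_local_files;
```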
Approach for Data Modelling (designing and maintaining data
structures in Snowflake)
a. Schema Design
b. Data Types
● Appropriate Data Types: Choose the correct data types for columns to optimize
storage and performance. Snowflake supports a wide range of data types,
including STRING, NUMBER, BOOLEAN, VARIANT (for semi-structured data like
JSON), and more.
● Variable vs. Fixed Data Types: Use variable-length data types (e.g., VARCHAR)
where the length of the data varies significantly, and fixed-length types (e.g.,
CHAR) when the length is consistent.
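A sketch of a table definition using several of these types (table and column names are hypothetical):

```sql
CREATE TABLE events (
    event_id    NUMBER,        -- fixed-precision numeric
    event_name  VARCHAR(100),  -- variable-length string
    is_active   BOOLEAN,
    payload     VARIANT        -- semi-structured data such as JSON
);
```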
2. Loading Data
a. Data Ingestion
● COPY Command: Use the COPY INTO command for bulk loading data from
various sources (e.g., S3, Azure Blob Storage, local files). Snowflake
automatically handles parallel loading and ensures efficient ingestion.
● File Formats: Snowflake supports multiple file formats like CSV, JSON, AVRO,
ORC, and PARQUET. Choose the format that best suits your data characteristics
and performance requirements.
● Staging: Stage your data files before loading them into Snowflake tables. This
allows for data validation and quality checks before the data is made available
for querying.
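The bulk-load step described above can be sketched as (the stage, table, and format options are assumptions):

```sql
-- Bulk-load CSV files from a stage into a table.
-- Snowflake parallelizes the load across files automatically.
COPY INTO employees_raw
  FROM @my_local_files
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```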
a. Micro-Partitions
a. Query Performance
b. Resource Management
a. Access Control
b. Data Governance
● Data Lineage: Implement data lineage tracking to understand the flow and
transformation of data within Snowflake. This aids in compliance and auditing.
● Metadata Management: Use Snowflake's INFORMATION_SCHEMA to manage
and query metadata about your data structures, which helps with ongoing maintenance.
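For instance, table metadata can be inspected directly (the schema name here is hypothetical):

```sql
-- List tables in a schema along with their row counts and sizes.
SELECT table_name, row_count, bytes
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PUBLIC';
```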
6. Monitoring and Maintenance
a. Monitoring
b. Maintenance
● Regular Audits: Conduct regular audits of your data structures and usage
patterns to ensure they still meet business requirements and perform optimally.
● Automated Maintenance: Leverage Snowflake's automated maintenance
features, such as Automatic Clustering, to keep your data structures optimized
without manual intervention.
Important Points:
1) Schema-on-Read in Snowflake
Here we fetch the data directly from the stage where the file is stored, without
defining a schema up front; the schema is applied at read time.
Note: t.$1, t.$2, etc., refer to the columns in the CSV file, with $1
corresponding to the first column (id), $2 to the second column (name), and so on.
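A minimal sketch of querying a staged CSV file positionally (the stage, file, and named file format are assumptions):

```sql
-- Read columns by position straight from the staged file;
-- no table or schema is defined beforehand.
SELECT t.$1 AS id, t.$2 AS name
FROM @my_local_files/employees.csv
  (FILE_FORMAT => 'my_csv_format') t;
```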
Benefits
● Flexibility: Changes can be made to the data structure without requiring
changes to the storage format.
● Adaptability: It is particularly useful for handling semi-structured and
unstructured data, where the schema might not be known upfront or might
change frequently.
● Efficiency: It enables efficient querying and transformation of raw data without
the need for upfront schema definitions, simplifying the process of integrating
and analyzing diverse data sources.
2) Clustering Keys in Snowflake
● Clustering keys are used to optimize the physical layout of data in large tables,
improving query performance by minimizing the amount of data scanned during
query execution.
● Clustering keys define one or more columns that are used to sort and organize
the data within micro-partitions, enhancing data locality and reducing the need
for full table scans.
Benefits
● Improved Query Performance
● Efficient Storage
● Easier Maintenance
For clustering, always choose a column that is frequently used to filter the data.
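Defining a clustering key on a filter column can be sketched as (the table and column are hypothetical):

```sql
-- Cluster a large table on the column most often used in WHERE clauses,
-- so queries can prune micro-partitions that don't match the filter.
ALTER TABLE orders CLUSTER BY (order_date);
```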
Streams: capture changes to tables and provide change data to consumers in near
real-time.
Tasks: help run asynchronous pieces of code, such as ETL transformations, on a
schedule.
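A minimal sketch combining the two (all object names, the warehouse, and the schedule are assumptions):

```sql
-- Track row-level changes on the source table.
CREATE OR REPLACE STREAM employees_stream ON TABLE employees_raw;

-- Every 5 minutes, move newly captured rows into the refined table.
CREATE OR REPLACE TASK refine_employees
  WAREHOUSE = my_wh
  SCHEDULE  = '5 MINUTE'
AS
  INSERT INTO employees_refined
  SELECT id, name
  FROM employees_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; start the task explicitly.
ALTER TASK refine_employees RESUME;
```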
Staging in Snowflake refers to the process of temporarily storing data files before
loading them into database tables.
It is a step where you can validate, transform, and manage data before it is ingested into
your Snowflake tables.
Types of Stages
● User stages: a personal stage allocated to each user (referenced as @~).
● Table stages: a stage tied to a specific table (referenced as @%table_name).
● Named internal stages: stages you create inside Snowflake.
● Named external stages: stages that point to external cloud storage such as S3,
Azure Blob Storage, or GCS.
A file format can be added inline in the query without defining it separately. Alternatively,
a file format can be defined once as a named object and then referenced.
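Both styles can be sketched as follows (the object names are hypothetical):

```sql
-- Option 1: inline file format, defined directly in the query.
COPY INTO employees_raw
  FROM @my_local_files
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Option 2: a named file format, defined once and reused.
CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  SKIP_HEADER = 1;

COPY INTO employees_raw
  FROM @my_local_files
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');
```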
● Data lineage is the ability to track and visualize the flow of data through the
various stages of its lifecycle, from initial ingestion to transformation, and
finally to its use in reports and analyses.
● It helps in understanding where the data comes from, how it is transformed, and
where it goes.
e.g. Suppose you have a CSV file "employees.csv".
You stage and load this data into a table named "employees_raw".
Then you transform it from the raw table into a more refined table,
"employees_refined".
The steps are: put the data into a stage, load it into the raw table, clean it into
the refined table, and finally use that table for reporting.
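The steps above can be sketched end-to-end (the file path and column list are assumptions; PUT is run from a client such as SnowSQL):

```sql
-- 1. Upload the file to a stage.
PUT file:///tmp/employees.csv @my_local_files;

-- 2. Load the staged file into the raw table.
COPY INTO employees_raw
  FROM @my_local_files/employees.csv
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- 3. Clean the raw data into the refined table used for reporting.
CREATE OR REPLACE TABLE employees_refined AS
SELECT id, TRIM(name) AS name
FROM employees_raw
WHERE id IS NOT NULL;
```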
Dynamic Data Masking is a security feature that allows you to hide sensitive
information in your database, so that unauthorized users see masked data instead
of the actual values.
e.g. Here we create a table containing sensitive information such as SSNs (social
security numbers).
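A minimal sketch of masking an SSN column (the role and object names are assumptions):

```sql
-- Only members of HR_ADMIN see the real value; everyone else
-- sees a fixed masked string.
CREATE OR REPLACE MASKING POLICY ssn_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('HR_ADMIN') THEN val
    ELSE 'XXX-XX-XXXX'
  END;

-- Attach the policy to the sensitive column.
ALTER TABLE employees MODIFY COLUMN ssn
  SET MASKING POLICY ssn_mask;
```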