Snowflake

What is Snowflake

● Snowflake is a massively popular cloud-based data warehouse management platform.
● It has a unique architecture that separates the storage and compute layers, allowing it to be incredibly flexible and scalable.

1. Storage Layer:
● Responsible for storing data in an efficient and scalable manner.
● Cloud-based: integrates with major cloud providers such as AWS, GCP, and Microsoft Azure.
● Columnar format: Snowflake stores data in a columnar format, which is optimized for analytical queries and well-suited for data aggregation.
● Micro-partitioning: Snowflake uses a technique called micro-partitioning, which stores table data in small contiguous units of storage. Queries become faster because partitions that are not needed can be skipped.
● Zero-copy cloning: Snowflake has a unique feature that allows it to create virtual clones of data, which consume no additional storage unless changes are made to the clone.
● The storage layer scales horizontally, which means it can handle increasing data volumes by adding more servers to distribute the load.
2. Compute Layer:
The compute layer is the engine that executes your queries.
● Virtual warehouses: You can think of virtual warehouses as teams of computers (compute nodes) designed to handle query processing.
● Virtual warehouses come in different sizes, and consequently at different prices (the sizes include XS, S, M, L, and XL).
● Multi-cluster, multi-node architecture: The compute layer uses multiple clusters with multiple nodes for high concurrency, allowing several users to access and query the data simultaneously.
● Automatic query optimization: Snowflake's system analyzes queries and identifies patterns in historical data to optimize execution.
● Results cache: The compute layer includes a cache that stores the results of frequently executed queries.

3. Cloud Services Layer:
● Security and access control: This layer enforces security measures, including authentication, authorization, and encryption.
● Data sharing: This layer implements secure data-sharing protocols across different accounts and even third-party organizations.
● Semi-structured data support: Another unique benefit of Snowflake is its ability to handle semi-structured data, such as JSON and Parquet, despite being a data warehouse management platform.

We must manually define a file format and name it, because Snowflake cannot always infer the schema and structure of data files such as CSV, JSON, or XML.

A stage in Snowflake is a storage area where you can upload your local files. These can
be structured or semi-structured data files. For example, we can create a stage named
my_local_files.
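The original example code was shown as an image and is missing. A minimal sketch of defining a named file format and a stage (the stage name my_local_files comes from the text; the file format name and options are assumptions):

```sql
-- Define a named file format for CSV files (name and options are illustrative)
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- Create an internal stage for uploading local files
CREATE OR REPLACE STAGE my_local_files
  FILE_FORMAT = my_csv_format;

-- Upload a local file to the stage (PUT is run from SnowSQL, not the web UI)
PUT file:///tmp/data.csv @my_local_files;
```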
Approach for Data Modelling (designing and maintaining data
structures in Snowflake)

1. Designing Data Structures

a. Schema Design

● Use Snowflake's Schema-on-Read Approach: Snowflake's unique architecture allows you to design flexible schemas. Utilize this by designing schemas that accommodate changes and evolution in data requirements.
● Normalized vs. Denormalized Schemas: Depending on your use case, choose
between normalized schemas (for OLTP-like operations) and denormalized
schemas (for OLAP and reporting). Snowflake can handle both efficiently.
● Star and Snowflake Schemas: For data warehousing and analytics, consider
using star or snowflake schemas. Star schemas are typically simpler and faster
for querying, while snowflake schemas normalize data into multiple related
tables, reducing redundancy.

b. Data Types

● Appropriate Data Types: Choose the correct data types for columns to optimize
storage and performance. Snowflake supports a wide range of data types,
including STRING, NUMBER, BOOLEAN, VARIANT (for semi-structured data like
JSON), and more.
● Variable vs. Fixed Data Types: Use variable-length data types (e.g., VARCHAR)
where the length of the data varies significantly, and fixed-length types (e.g.,
CHAR) when the length is consistent.

2. Loading Data

a. Data Ingestion

● COPY Command: Use the COPY INTO command for bulk loading data from
various sources (e.g., S3, Azure Blob Storage, local files). Snowflake
automatically handles parallel loading and ensures efficient ingestion.
● File Formats: Snowflake supports multiple file formats like CSV, JSON, AVRO,
ORC, and PARQUET. Choose the format that best suits your data characteristics
and performance requirements.
● Staging: Stage your data files before loading them into Snowflake tables. This
allows for data validation and quality checks before they are made available for
querying.
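The bulk-loading pattern described above can be sketched as follows (the table and stage names are assumptions, not from the original):

```sql
-- Bulk-load staged CSV files into a target table; Snowflake parallelizes the load
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'ABORT_STATEMENT';  -- fail fast so bad files can be fixed before loading
```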

3. Partitioning and Clustering

a. Micro-Partitions

● Automatic Partitioning: Snowflake automatically partitions data into micro-partitions, which optimizes query performance. Ensure your data model and queries leverage this feature.
● Clustering Keys: For large tables, define clustering keys to improve the
performance of selective queries. Clustering keys help Snowflake maintain data
locality and reduce the need for full table scans.
4. Performance Optimization

a. Query Performance

● Result Caching: Utilize Snowflake's result caching to speed up repeated queries. Snowflake automatically caches query results for 24 hours by default.
● Materialized Views: Use materialized views to reuse precomputed query results without re-running the underlying queries.
● Query Optimization: Use query profiling and optimization techniques like proper
JOINs, filtering early in the query, and avoiding unnecessary complex operations.

b. Resource Management

● Warehouses: Configure virtual warehouses based on workload requirements. Scale warehouses up or down to match the query load, ensuring optimal performance and cost management.
● Auto-Suspend and Auto-Resume: Enable auto-suspend and auto-resume
features for virtual warehouses to minimize costs by suspending idle
warehouses and resuming them automatically when new queries are submitted.
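The auto-suspend and auto-resume settings above can be sketched as a warehouse definition (the name and exact values are illustrative):

```sql
-- Create a warehouse that suspends after 60 seconds of inactivity
-- and resumes automatically when a new query arrives
CREATE OR REPLACE WAREHOUSE my_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;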

5. Security and Governance

a. Access Control

● Role-Based Access Control (RBAC): Implement RBAC to manage permissions and access to data and resources in Snowflake. Define roles and assign them to users or groups based on their responsibilities.
● Data Masking: Use dynamic data masking to protect sensitive data by obscuring
it from unauthorized users, while still allowing authorized users to access the full
data.
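A minimal RBAC sketch for the pattern described above (all role, database, and user names are assumptions):

```sql
-- Create a role and grant it read access to a schema
CREATE ROLE IF NOT EXISTS analyst_role;
GRANT USAGE ON DATABASE my_db TO ROLE analyst_role;
GRANT USAGE ON SCHEMA my_db.public TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA my_db.public TO ROLE analyst_role;

-- Assign the role to a user based on their responsibilities
GRANT ROLE analyst_role TO USER some_user;
```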

b. Data Governance

● Data Lineage: Implement data lineage tracking to understand the flow and
transformation of data within Snowflake. This aids in compliance and auditing.
● Metadata Management: Use Snowflake's INFORMATION_SCHEMA to manage and query metadata about your data structures, which helps with ongoing maintenance.
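For example, a typical metadata query against INFORMATION_SCHEMA (the schema name here is an assumption) looks like:

```sql
-- List tables in a schema along with row counts and storage size
SELECT table_name, row_count, bytes
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PUBLIC'
ORDER BY bytes DESC;
```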
6. Monitoring and Maintenance

a. Monitoring

● Snowflake's Monitoring Tools: Utilize Snowflake's built-in monitoring tools and dashboards to track the performance and health of your data structures and queries.
● Third-Party Tools: Integrate third-party monitoring tools, such as AWS CloudWatch or Azure Monitor, for more comprehensive monitoring and alerting.

b. Maintenance

● Regular Audits: Conduct regular audits of your data structures and usage
patterns to ensure they still meet business requirements and perform optimally.
● Automated Maintenance: Leverage Snowflake's automated maintenance features, such as automatic clustering, to keep your data structures optimized without manual intervention.

Important Points:

1) Schema-on-Read in Snowflake

Schema-on-read is a data management pattern, supported by Snowflake for designing data structures, where the data schema is applied at the time of reading the data rather than when the data is ingested or stored.
E.g., suppose we have a CSV file. We stage the file and upload it, then create a file format, and finally fetch the data directly from the stage where the file is stored, without defining a table schema up front, setting the schema on read instead.
Note: t.$1, t.$2, etc. are used to refer to the columns in the CSV file, with $1 corresponding to the first column (id), $2 to the second column (name), and so on.
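The example code was shown as images and is missing from the text. A sketch of the steps (the file, stage, and format names are assumptions; the columns id and name come from the note above):

```sql
-- 1. Create a stage and upload the local CSV file (PUT is run from SnowSQL)
CREATE OR REPLACE STAGE my_stage;
PUT file:///tmp/employees.csv @my_stage;

-- 2. Create a file format describing the CSV layout
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  SKIP_HEADER = 1;

-- 3. Query the staged file directly, applying the schema on read
SELECT t.$1 AS id, t.$2 AS name
FROM @my_stage (FILE_FORMAT => 'my_csv_format') t;
```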

Benefits
● Flexibility: You can change the data structure without requiring changes to the storage format.
● Adaptability: It is particularly useful for handling semi-structured and
unstructured data, where the schema might not be known upfront or might
change frequently.
● Efficiency: It enables efficient querying and transformation of raw data without
the need for upfront schema definitions, simplifying the process of integrating
and analyzing diverse data sources.

2) Clustering Keys in Snowflake

● used to optimize the physical layout of data in large tables, improving query
performance by minimizing the amount of data scanned during query execution.
● Clustering keys define one or more columns that are used to sort and organize
the data within micro-partitions, enhancing data locality and reducing the need
for full table scans.

Benefits
● Improved Query Performance
● Efficient Storage:
● Better maintenance

For clustering, always choose a column that is frequently used to filter the data.

You can define clustering keys when creating a table, add clustering keys to an existing table, or check the clustering information for a table.
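The original example queries were shown as images; a sketch of each operation (the table and column names are assumptions):

```sql
-- Create a table with a clustering key
CREATE OR REPLACE TABLE orders (
  order_id   NUMBER,
  order_date DATE,
  amount     NUMBER(10,2)
) CLUSTER BY (order_date);

-- Add a clustering key to an existing table
ALTER TABLE orders CLUSTER BY (order_date);

-- Check clustering information for the table
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');
```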

3) Snowflake tasks vs streams

● Streams capture changes to tables (inserts, updates, and deletes) and provide the change data to consumers in near real-time; they record changes continuously.
● Tasks run asynchronous pieces of code, such as ETL transformations, on a defined schedule or when a predecessor task completes, rather than continuously.
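A common pattern combines the two: a stream tracks changes on a source table, and a scheduled task consumes them (all object names and the schedule below are assumptions):

```sql
-- Create a stream that tracks changes on a source table
CREATE OR REPLACE STREAM orders_stream ON TABLE orders;

-- Create a task that runs every 5 minutes, but only when the stream has data
CREATE OR REPLACE TASK process_orders
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO orders_history
  SELECT * FROM orders_stream;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK process_orders RESUME;
```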


4) Staging in Snowflake

Staging in Snowflake refers to the process of temporarily storing data files before
loading them into database tables.
It is a step where you can validate, transform, and manage data before it is ingested into
your Snowflake tables.

Types of Stages

1. Internal Stage: Managed entirely by Snowflake.
2. External Stage: Uses external cloud storage locations.

E.g., consider we have a CSV file. We create an internal stage, upload the CSV file to it, create a table, and then load the data from the stage into the table.

Here, the file format is specified directly in the load query without defining it beforehand. We can also define the file format separately and then reference it.
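The original screenshots are missing; a sketch of the full sequence (the file, stage, table, and column names are assumptions):

```sql
-- 1. Create an internal stage
CREATE OR REPLACE STAGE my_internal_stage;

-- 2. Upload the CSV file to the stage (PUT is run from SnowSQL)
PUT file:///tmp/employees.csv @my_internal_stage;

-- 3. Create the target table
CREATE OR REPLACE TABLE employees (
  id   NUMBER,
  name STRING
);

-- 4. Load from the stage, specifying the file format inline
COPY INTO employees
FROM @my_internal_stage/employees.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```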

5) Data Lineage in Snowflake

● Data lineage is the ability to track and visualize the flow of data through the various stages of its lifecycle, from initial ingestion through transformation to its use in reports and analyses.
● It helps in understanding where the data comes from, how it is transformed, and where it goes.
E.g., suppose you have a CSV file "employees.csv". You stage and load this data into a table named "employees_raw", then transform it from the raw table into a more refined table, "employees_cleaned". Finally, a table for reporting purposes is created from the cleaned table.

The steps go: put the data into a stage, load it into the raw table, transform it into the cleaned table, and then build the table used for reporting.

1. Source Data: employees.csv file.
2. Stage and Load: Data loaded into the employees_raw table.
3. Transformation: Data transformed into the employees_cleaned table.
4. Reporting: Summary data generated in the department_summary table.
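The SQL for these steps was shown as images; a sketch using the table names above (the column names and transformation logic are assumptions):

```sql
-- 1-2. Stage and load the raw data (PUT is run from SnowSQL)
CREATE OR REPLACE STAGE emp_stage;
PUT file:///tmp/employees.csv @emp_stage;

CREATE OR REPLACE TABLE employees_raw (
  id NUMBER, name STRING, department STRING, salary NUMBER
);
COPY INTO employees_raw
FROM @emp_stage/employees.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- 3. Transform the raw data into a cleaned table
CREATE OR REPLACE TABLE employees_cleaned AS
SELECT id, INITCAP(name) AS name, UPPER(department) AS department, salary
FROM employees_raw
WHERE id IS NOT NULL;

-- 4. Build the reporting table
CREATE OR REPLACE TABLE department_summary AS
SELECT department, COUNT(*) AS employee_count, AVG(salary) AS avg_salary
FROM employees_cleaned
GROUP BY department;
```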
6) Data Masking in Snowflake

Data masking is a security feature that allows you to hide sensitive information in your database, so that unauthorized users see masked data instead of the actual values.

E.g., here we create a table with sensitive information such as SSNs (social security numbers).

We then create a masking policy to mask the sensitive data. In the policy:

● CURRENT_ROLE() is a function that returns the role of the current user.
● If the current role of the user is 'AUTHORIZED_ROLE', then the policy returns the
original value (val).
● 'XXX-XX-' is a string that represents the masked part of the SSN.
● RIGHT(val, 4) is a function that takes the last four characters of the input string
val.
● || is the concatenation operator used to combine the masked part with the last
four characters of the SSN.
Finally, we apply the masking policy to the table.
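The screenshots with the actual SQL are missing; a sketch consistent with the bullet points above (the table, column, and policy names are assumptions; AUTHORIZED_ROLE comes from the text):

```sql
-- Table containing sensitive SSNs
CREATE OR REPLACE TABLE employees_pii (
  id  NUMBER,
  ssn STRING
);

-- Masking policy: the authorized role sees the real value;
-- everyone else sees 'XXX-XX-' followed by the last four digits
CREATE OR REPLACE MASKING POLICY ssn_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'AUTHORIZED_ROLE' THEN val
    ELSE 'XXX-XX-' || RIGHT(val, 4)
  END;

-- Apply the masking policy to the column
ALTER TABLE employees_pii
  MODIFY COLUMN ssn SET MASKING POLICY ssn_mask;
```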
