Snowflake notes

Our client provides an industry-leading tax engine that is comprehensive and configurable for indirect tax
calculations. The client also assists with other aspects of indirect taxes, namely compliance, returns, and
reporting. Our team works on the data domain, which involves the client configuration data in our
customer's private cloud. I mostly work on administering data as well as building data pipelines. We have
built two types of pipelines so far:

E-invoice:

1. E-Invoicing requires feeding invoicing data into the tax engine, which reconciles it with another
team's data. We transform XML files and merge them with e-invoice metadata. We have four layers in
each environment: RAW, STG, CURATED, SEMANTIC.
2. We have two sources: the e-invoice XML files in an S3 bucket in AWS, and the e-invoice metadata
delivered through Fivetran. Fivetran is an ingestion tool that loads data from a source to a destination.
3. The e-invoices received from Vertex clients are parsed by the E-Invoicing application and the
results of data extraction are saved in the metadata databases. This metadata is extracted from the
tenant metadata database by Fivetran connectors.
1. In the RAW database we receive tenant metadata from Fivetran and the S3 replication of e-invoice
XML files.
2. Snowpipe loads the XML payload from S3 into the RAW layer of the Data Platform Snowflake
account; an SQS notification signals that new file(s) have arrived in the S3 folder.
3. Data from the e-invoice metadata database schemas is joined and merged in the stage data
layer, transformed in the curated layer (for example, adding a column or changing data types), and
finally formatted in the semantic data layer. These transformations are implemented with dbt (data
build tool), in which we materialize dbt models using SQL SELECT statements.
4. In stage, we have two scheduled tasks that perform operations such as merging and inserting data
into the STG layer table. The data is then transformed in the curated layer (for example, adding a
column or changing data types), and on top of this table we implemented views in the final and
semantic data layers (see the task sketch after this list).
5. Another team, mostly data scientists, connects to the Data Platform Snowflake account to query
the e-invoice semantic data set.
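A minimal sketch of how such a scheduled STG merge task could look; the warehouse, schema, table, and column names (transform_wh, raw.einvoice, stg.einvoice, invoice_id, payload, loaded_at) are hypothetical placeholders, not the actual project objects.

-- Hypothetical scheduled task that merges newly landed RAW rows into an STG table
CREATE OR REPLACE TASK merge_einvoice_stg
  WAREHOUSE = transform_wh            -- assumed warehouse name
  SCHEDULE = '15 MINUTE'
AS
  MERGE INTO stg.einvoice AS tgt
  USING raw.einvoice AS src
    ON tgt.invoice_id = src.invoice_id
  WHEN MATCHED THEN UPDATE SET tgt.payload = src.payload, tgt.loaded_at = src.loaded_at
  WHEN NOT MATCHED THEN INSERT (invoice_id, payload, loaded_at)
    VALUES (src.invoice_id, src.payload, src.loaded_at);

-- Tasks are created in a suspended state; resume to start the schedule
ALTER TASK merge_einvoice_stg RESUME;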

FX rate, i.e. foreign exchange:


The customer's transaction currency may not be the same as the filing currency, so we need to
provide customers with the ability to calculate or recalculate the correct filing currency amounts. The main
purpose is to provide exchange rate data. The source data is available in an S3 bucket, and we load it into
Snowflake through Snowpipe. This data is in Parquet format. It is a simple pipeline: we built one task and
one table in the STG layer, and on top of that table we created views in the next layers.
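A hedged sketch of that FX-rate load: Parquet files in S3 exposed through an external stage and surfaced as a view in the next layer. The names (fxrate_parquet_ff, fxrate_stage, my_s3_integration, stg.fxrate, curated.fxrate_v) and columns are illustrative, not the real objects.

-- Illustrative Parquet file format and stage over the S3 bucket
CREATE OR REPLACE FILE FORMAT fxrate_parquet_ff TYPE = PARQUET;

CREATE OR REPLACE STAGE fxrate_stage
  URL = 's3://my-bucket/fxrate/'            -- assumed bucket path
  STORAGE_INTEGRATION = my_s3_integration   -- assumed integration
  FILE_FORMAT = fxrate_parquet_ff;

-- A view in the next layer simply reshapes the STG table
CREATE OR REPLACE VIEW curated.fxrate_v AS
SELECT currency_from, currency_to, rate, rate_date
FROM stg.fxrate;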
DATA:
This client data covers our vendor's customers: Taxpayer, Customer, Vendor, and their related data
that is used to set up taxability. Each client is licensed to a subscription, which gives the client access
to a set of features (such as O Series or Data Integrity). Client Cloud Configuration information provides
the key to access feature data.
Snowflake is a cloud-native data platform delivered as a service:
• purpose-built for the cloud
• no management of hardware, updates, and patches
• no software to install

Features:
1. OLAP data warehouse
2. Multi-cluster, shared data architecture
3. Pay only for what you use: storage and compute
4. Virtual Warehouses – scale up/down (vertically) by resizing the VWH; scale in/out (horizontally) by
adding/removing VWHs
5. Time travel and fail-safe
6. Data sharing
7. Zero-copy cloning

Data Warehouse: a database that combines data from different data sources into one consumable DB
for reporting and analysis purposes.

Cloud Computing: Historically, if an organization wanted to work with data, it needed its own data
center, which caused a lot of overhead: a building or data center room (the infrastructure), plus
electricity and security, plus maintenance of hardware and software. All of this adds significant cost.

Snowflake is Software-as-a-Service, so we focus only on our application, not on physical servers.

Storage:
1. The layer where data is stored.
2. Data is not actually stored in Snowflake itself; an external cloud provider is used.
3. For storage, Snowflake uses cloud object storage such as AWS S3 buckets. Snowflake's storage layer is
built on top of cloud-based storage systems such as Amazon S3, depending on the chosen cloud provider.
4. This is all under the hood; we don't have to do anything or maintain it.
5. The storage format is called hybrid columnar storage.
6. This means that instead of saving data in rows, data is stored by column and compressed into blobs.
This is how the data is stored with the cloud providers.

Query Processing Layer:
1. This is called the muscle of the system, because it provides the actual compute power to execute queries.
2. This is the layer where queries are processed and where the warehouses live. Warehouses are virtual
compute resources that are used to process the queries and operations that we execute.

Cloud Services:
1. The brain of the system.
2. The layer that manages the infrastructure: access control, security, metadata, and query
optimization.
Warehouse

1. Provides compute capacity, i.e. virtual compute instances/servers used to process queries.
2. Warehouses come in different sizes.
3. The smallest size is X-Small, which consists of one server. If queries are not complex, use a smaller
size; as queries become more complex, use a warehouse with more servers (e.g. Large) to run them
faster.
4. Each size increase doubles the number of servers, up to 4X-Large (see the sizing sketch below).
5. Larger warehouses are more expensive.
6. Multi-clustering: if one warehouse can't handle the query load, additional clusters can be activated
and clustered together in one warehouse, giving more compute power to process a large number of
queries simultaneously.
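A minimal sketch of creating and resizing a warehouse (scaling up/down); the name demo_wh is illustrative only.

-- Create a warehouse and later resize it (scale up/down)
CREATE OR REPLACE WAREHOUSE demo_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60        -- suspend after 60 s of inactivity to save credits
  AUTO_RESUME = TRUE;

-- Scale up for a known heavy workload, then back down afterwards
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'XSMALL';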

Multi-cluster WH: If we have one warehouse and at a certain time there are more queries than the single
warehouse can process, users have to wait a long time until all queries are processed. The solution is to
automatically scale the number of clusters up and down dynamically depending on the workload. By doing
this, queries that are queued and cannot currently be processed get redistributed to the additional
clusters.
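A short illustrative definition of a multi-cluster warehouse that scales out under concurrency; the name reporting_wh and the limits are placeholders.

-- Sketch of a multi-cluster warehouse (multi-cluster requires Enterprise edition or higher)
CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4          -- Snowflake adds clusters as queries queue up
  SCALING_POLICY = 'STANDARD';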

Scaling up:
• for a known pattern of high workload
• we increase the size of the virtual warehouse
• suited to more complex queries, to handle an increased workload for a specific period

Scaling out:
• for an unknown pattern of workload; scales dynamically
• we create additional clusters (warehouses) of the same size
• suited to more concurrent users/queries

Snowflake pricing: compute and storage costs are decoupled.

1. Compute cost depends on warehouse size, the number of active warehouses, and the time the
warehouses are running.
2. Storage cost is based on the average storage used per month (at the cloud provider's rates);
cost is calculated after compression.
3. Pay only for what you use.
4. Pricing depends on the region and cloud provider.

What is a Snowflake Credit?

It is the unit of payment for the consumption of Snowflake resources, usually Virtual Warehouses,
Cloud Services, and serverless features.
Storage cost is measured as the average amount of data stored in Snowflake on a monthly basis.
Credit consumption is calculated based on 1. warehouse size, 2. the number of VWHs, and 3. the time
spent executing queries.

Resource monitor: controls and monitors credit usage of warehouses and the account.
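A small illustrative resource monitor; the name, quota, and warehouse are placeholders.

-- Illustrative resource monitor that suspends warehouses once a monthly quota is reached
CREATE OR REPLACE RESOURCE MONITOR monthly_limit
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 80 PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE demo_wh SET RESOURCE_MONITOR = monthly_limit;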


ROLES:
ORGADMIN: Manages actions at the organization level, e.g. create accounts, view all accounts.
ACCOUNTADMIN: Top-level role. Can manage all objects in the account.
SECURITYADMIN: Can manage any object grant globally; manages grants and privileges.
USERADMIN: Inherited by SECURITYADMIN; creates and manages users and roles.
SYSADMIN: Creates warehouses, databases, and other objects.
1. All custom roles should be assigned to SYSADMIN.
2. Can grant privileges on warehouses, databases, and other objects.
PUBLIC: Automatically granted to every user by default.

User - An entity that enables a person (or service) to connect to Snowflake


Object - An entity that a User can access (e.g. table, view, database), if they have the right privileges.
Privilege - An operation that a User could execute on an Object, if their Role has been granted it.
Role - A bridging entity between Users and Privileges. Privileges are granted to Roles, and Roles are
granted to Users
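A minimal, illustrative grant flow tying these four concepts together; the role, database, schema, and user names are placeholders.

-- Privileges are granted to a role, and the role is granted to a user
CREATE ROLE IF NOT EXISTS analyst_role;
GRANT USAGE ON DATABASE demo_db TO ROLE analyst_role;
GRANT USAGE ON SCHEMA demo_db.semantic TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA demo_db.semantic TO ROLE analyst_role;
GRANT ROLE analyst_role TO USER some_user;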

LOADING:

Bulk loading (COPY command, bulk data loaded periodically):
• uses the compute power of a warehouse
• loads from stages
• uses the COPY command
• transformations are possible

Continuous loading (Snowpipe):
• serverless (does not use a dedicated warehouse)
• loads small volumes of data continuously/quickly
• provides the latest results for analysis

STAGES: A database object used as the location of data files from which data can be loaded.

FILE FORMAT: A pre-defined format structure that describes a set of staged data to access or load into
Snowflake tables, for CSV, JSON, AVRO, ORC, PARQUET, and XML input types.
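A hedged sketch combining a file format, a stage, and a bulk COPY load; the bucket path, storage integration, and table names are placeholders.

-- Illustrative bulk load: define a file format and a stage, then COPY into a table
CREATE OR REPLACE FILE FORMAT csv_ff TYPE = CSV SKIP_HEADER = 1;

CREATE OR REPLACE STAGE my_stage
  URL = 's3://my-bucket/landing/'           -- assumed bucket path
  STORAGE_INTEGRATION = my_s3_integration   -- assumed integration
  FILE_FORMAT = csv_ff;

COPY INTO raw.my_table
FROM @my_stage
FILE_FORMAT = (FORMAT_NAME = 'csv_ff')
ON_ERROR = 'CONTINUE';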

Snowpipe:
Enables loading as soon as a file appears in the bucket.
Serverless: instead of using a VWH, the compute resources are automatically managed by Snowflake itself.
Workflow: create STAGE → test the COPY command → create the pipe → create the event notification
(so that a file added to the S3 bucket triggers Snowpipe).
Enables near-real-time loading of frequent, small volumes of data into Snowflake.
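A minimal sketch of that workflow; the pipe reuses the illustrative stage, file format, and table from the COPY example above.

-- Illustrative pipe: AUTO_INGEST = TRUE lets S3 event notifications (via SQS) trigger the load
CREATE OR REPLACE PIPE my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.my_table
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff');

-- SHOW PIPES displays the NOTIFICATION_CHANNEL (SQS ARN) to configure on the S3 bucket
SHOW PIPES;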

Micro-partitions: contiguous units of storage.

Each micro-partition holds a small chunk of data, consisting of immutable and compressed data files,
manageable in size (50 MB to 500 MB of uncompressed data).
Snowflake automatically creates micro-partitions as we add more data.
The benefit: when we search for something specific, Snowflake only needs to look at the relevant
micro-partitions, saving time.
Snowflake clustering is a performance optimization technique that organizes data within tables based
on one or more clustering keys.

We can choose a specific column in the table to be the clustering key; Snowflake then sorts and stores
related data together within the micro-partitions.

Benefit: If your queries often search for data based on the chosen clustering key, Snowflake can quickly
find the relevant information within the micro-partitions because similar data is grouped together.

Cluster Keys: (Snowflake maintains clustering automatically, or we can specify keys ourselves)
1. A subset of columns used to co-locate data in micro-partitions.
2. For large tables this improves scan efficiency in our queries.

How to cluster:
1. Use the columns most often used in WHERE clauses and joins.

CREATE TABLE <name> … CLUSTER BY (columns/expressions)
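A short illustrative example of defining and inspecting a clustering key; the table and column names are placeholders.

-- Cluster a large fact table on a column that is frequently filtered on
CREATE OR REPLACE TABLE sales_fact (
  sale_date DATE,
  region STRING,
  amount NUMBER(12,2)
) CLUSTER BY (sale_date);

-- Clustering can also be added or changed later
ALTER TABLE sales_fact CLUSTER BY (sale_date, region);

-- Check how well the table is clustered on a key
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date)');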

Partition Pruning:
Only the partitions that contain the relevant data are scanned; all other partitions are skipped/eliminated.

Performance optimization:
Snowflake automatically manages micro-partitions, so the following traditional tuning tasks are already
handled for us:
• adding indexes and primary keys
• creating table partitions
• analyzing query execution plans
• removing unnecessary full table scans
What we can still do:
• assign appropriate data types
• size and dedicate VWHs, separated according to different workloads
• use cluster keys for large tables

Caching: query results are stored for up to 24 hours. To maximize cache reuse, similar queries should be run in the same warehouse.

Time travel: query historical data / recover objects that have been dropped, within the retention period.
What is possible?
1. Query deleted/updated data
2. Restore tables, schemas, and databases that have been dropped
3. Create clones of tables, schemas, and databases from a previous state
4. Contributes to storage cost

AT | BEFORE clause which can be specified in SELECT statements and CREATE … CLONE commands
(immediately after the object name). The clause uses one of the following parameters to pinpoint the
exact historical data you wish to access:
o TIMESTAMP
o OFFSET (time difference in seconds from the present time)
o STATEMENT (identifier for statement, e.g. query ID)

UNDROP command for tables, schemas, and databases.

SELECT * FROM my_table AT(TIMESTAMP => 'Fri, 01 May 2015 16:20:00 -0700'::timestamp_tz);
SELECT * FROM my_table AT(OFFSET => -60*5);
SELECT * FROM my_table BEFORE(STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');

Retention Period: the window of time we can travel back in Snowflake. It depends on the edition
(Standard: up to 1 day; Enterprise and above: up to 90 days). Default is 1 day.
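Small illustrative commands for adjusting retention and recovering a dropped table; my_table is a placeholder.

-- Extend time travel retention on a table (up to 90 days on Enterprise edition and above)
ALTER TABLE my_table SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Recover a dropped object within its retention period
DROP TABLE my_table;
UNDROP TABLE my_table;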

Fail-safe: protection of historical data in case of disaster.

1. Non-configurable 7-day period for permanent tables
2. The period starts immediately after time travel ends
3. Data can only be recovered by Snowflake
4. Contributes to storage cost

Table types (creation examples below):

1. Permanent Table
• persists until deleted or dropped from the database
• the default table type
• time travel is possible for up to 90 days
• fail-safe applies, so data can be recovered if lost due to a failure
2. Temporary Table
• persists only for the session
• time travel is possible, but only 0 to 1 day
• not fail-safe
3. Transient Table
• persists until the user drops or deletes it
• used where data persistence is required but longer data retention is not needed
• time travel is possible, but only 0 to 1 day
• not fail-safe
4. External Table
• allows you to query data stored in an external stage as if the data were inside a table in
Snowflake
• external tables are read-only; we can't perform DML operations
• time travel is not possible for external tables
• not fail-safe inside the Snowflake environment
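Hedged, minimal creation examples for each table type; the names, columns, and the external stage @ext_stage are placeholders.

CREATE OR REPLACE TABLE sales_perm (id INT, amount NUMBER);            -- permanent (default)
CREATE OR REPLACE TEMPORARY TABLE sales_tmp (id INT, amount NUMBER);   -- dropped at end of session
CREATE OR REPLACE TRANSIENT TABLE sales_trn (id INT, amount NUMBER);   -- no fail-safe, shorter retention

-- External table over files in a stage; without a column list the data is exposed via the VALUE column
CREATE OR REPLACE EXTERNAL TABLE sales_ext
  LOCATION = @ext_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;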

Task: can be used to schedule SQL statements.

1. Standalone tasks and trees of tasks
2. Can be used to perform actions based on the data changes captured by streams (see the
stream-triggered task example below).

Stream: an object that records DML changes (INSERT, UPDATE, DELETE) made to a table. This process is
called CDC (Change Data Capture).
Types: standard, append-only, and insert-only (insert-only is for external tables only).

Streams add three metadata columns: METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID.

-- Create a stream for the sales table
CREATE OR REPLACE STREAM sales_stream ON TABLE sales;

-- Check the stream metadata
SHOW STREAMS;
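A minimal illustrative task that consumes the stream above; the warehouse and the target table sales_history (and its columns) are assumed placeholders.

-- Run every 5 minutes, but only when the stream actually has new rows
CREATE OR REPLACE TASK apply_sales_changes
  WAREHOUSE = transform_wh                      -- assumed warehouse name
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('sales_stream')
AS
  INSERT INTO sales_history (id, amount, change_type, changed_at)
  SELECT id, amount, metadata$action, CURRENT_TIMESTAMP()
  FROM sales_stream;                             -- reading the stream advances its offset

ALTER TASK apply_sales_changes RESUME;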

View: stores the underlying query logic. We can create a view on top of a table. The results are not
precomputed; the query runs against the base table each time the view is queried.

Materialized View: stores the query result like a table and is updated automatically on a frequent basis,
so when there are changes in the underlying base table, they are automatically reflected in the MV.

Stored procedure: A stored procedure in Snowflake is a block of code that performs a series of database
operations. It allows you to encapsulate logic that you can execute on demand.

CREATE OR REPLACE PROCEDURE my_procedure()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  RETURN 'Hello, World!';
END;
$$;

CALL my_procedure();

A User-Defined Function (UDF) in Snowflake is a custom function that you define to perform specific
operations. It can return a single value or a table, and you can use it in your queries just like built-in
functions.
CREATE OR REPLACE FUNCTION add_numbers(a INT, b INT)
RETURNS INT
LANGUAGE SQL
AS
$$
  a + b
$$;

SELECT add_numbers(5, 3);

A data warehouse is a database that combines data from different data sources into one consumable DB
for reporting and analysis purposes.

A data mart is a subset of a data warehouse that has the same characteristics but is usually smaller and
is focused on the data for one division or one workgroup within an enterprise.

Database: a DB is an organized structure of data.

DBMS: helps you interact with the data you have stored.

RDBMS: in a relational DBMS, whatever information we store is kept in the form of tables with
multiple columns and multiple rows. We have multiple tables, and the tables are related to one another.

SQL: a query language.

So we have a database where data is stored; the next step is to actually use the data, i.e. access or
manage it. For that we need commands to operate on it. SQL is the query language that gives us these
basic commands to access, manage, and retrieve data.

MySQL vs SQL: MySQL is system software (a DBMS) for working with databases, and SQL is the query
language that provides the commands.

Constraints: SQL constraints are used to specify rules for the data in a table

• NOT NULL - Ensures that a column cannot have a NULL value


• UNIQUE - Ensures that all values in a column are different
• PRIMARY KEY - A combination of a NOT NULL and UNIQUE. Uniquely identifies each row in a table
• FOREIGN KEY - FOREIGN KEY is a field (or collection of fields) in one table, that refers to the PRIMARY
KEY in another table.
• CHECK - Ensures that the values in a column satisfy a specific condition
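A small illustrative table definition using these constraints (standard SQL; the table and column names are made up, and note that some platforms, including Snowflake, do not enforce all constraint types).

CREATE TABLE orders (
  order_id INT NOT NULL PRIMARY KEY,
  customer_id INT NOT NULL REFERENCES customers(customer_id),  -- FOREIGN KEY
  order_ref VARCHAR(20) UNIQUE,
  quantity INT CHECK (quantity > 0)
);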

HAVING: Filter at group level, based on aggregate function


WHERE: filter at row level, checks individual rows.
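A quick example contrasting the two, on an illustrative staff table.

-- WHERE filters individual rows before grouping; HAVING filters the groups afterwards
SELECT department, COUNT(*) AS employees
FROM staff
WHERE active = TRUE
GROUP BY department
HAVING COUNT(*) > 10;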

JOIN clause is used to combine rows from two or more tables, based on a related column between them.
1. (INNER) JOIN: Returns records that have matching values in both tables
2. LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right
3. RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left
4. FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
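An illustrative LEFT JOIN on made-up customers and orders tables.

-- All customers, with their orders where they exist (NULLs where they don't)
SELECT c.customer_id, c.name, o.order_id
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id;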
DELETE vs DROP vs TRUNCATE.
Delete: It is used to delete one or more rows of a table. We can specify a condition to delete specific rows.
DELETE FROM table_name WHERE condition;
DROP: It is used to drop the whole table (structure and data). DROP TABLE table_name;
TRUNCATE: It is used to delete all the rows of a table in one go.
TRUNCATE TABLE table_name;

How to implement one-to-one, one-to-many and many-to-many relationships while designing tables?
The one-to-one relationship can be implemented as a single table and rarely as two tables with primary
and foreign key relationships.
One-to-Many relationships are implemented by splitting the data into two tables with primary key and
foreign key relationships.
Many-to-Many relationships are implemented using a junction table with the keys from both the tables
forming the composite primary key of the junction table.
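An illustrative many-to-many junction table; the students/courses tables and columns are made up.

-- Many-to-many: students and courses linked through a junction table
CREATE TABLE student_course (
  student_id INT REFERENCES students(student_id),
  course_id INT REFERENCES courses(course_id),
  PRIMARY KEY (student_id, course_id)   -- composite primary key formed from both foreign keys
);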

Why use a subquery in SQL Server, and what are the types of subqueries?
Answer: Subqueries are queries within a query. The parent or outer query is called the main query and
the inner query is called the subquery. The different types of subqueries are:
• Correlated – not an independent subquery; it is an inner query that refers to the outer
query.
• Non-Correlated – an independent subquery; it can be executed even without the outer query.
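Two short examples on made-up employees/departments tables, one of each type.

-- Non-correlated: the inner query runs independently of the outer query
SELECT name FROM employees
WHERE dept_id IN (SELECT dept_id FROM departments WHERE region = 'EU');

-- Correlated: the inner query references the outer row (alias e)
SELECT e.name
FROM employees e
WHERE e.salary > (SELECT AVG(salary) FROM employees x WHERE x.dept_id = e.dept_id);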

What is the difference between UNION and UNION ALL in SQL Server?
UNION combines the result sets and removes duplicate records from the final result set.
UNION ALL returns all the rows, irrespective of whether rows are duplicated or not.

What are the differences between ROW_NUMBER(), RANK() and DENSE_RANK()?

ROW_NUMBER – returns a sequential number for each row within a given partition.
RANK – returns a rank for each row; in the case of ties it assigns the same rank and leaves a gap in the
numbering.
DENSE_RANK – returns a rank for each row; in the case of ties it assigns the same rank but does not
leave any gap in the numbering.
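A small illustrative comparison over a made-up scores table.

-- For score values 90, 85, 85, 80:
-- ROW_NUMBER gives 1,2,3,4; RANK gives 1,2,2,4; DENSE_RANK gives 1,2,2,3
SELECT score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores;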

Alternate Key – to identify a row uniquely we can have multiple keys; one of them is chosen as the
primary key and the rest are called alternate keys.
Candidate Key – a set of fields or columns that uniquely identifies a row; these constitute the
candidate keys.
Composite Key – a key formed by combining two or more columns or fields.

The COALESCE() function returns the first non-null value in a list.
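For example:

SELECT COALESCE(NULL, NULL, 'third');  -- returns 'third'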


Why are you switching jobs?
My father is planning to retire soon, and I want to become the backbone of my family and support them
financially. I don't have any issues with my current employer; I am just looking for better compensation.

What was the most challenging part of your experience?


1. At the start of my current project, we needed to create over 300 database objects across four
environments. To streamline this, we decided to use Terraform, an Infrastructure-as-Code tool. As this
technology was new to me, I faced several challenges, particularly during debugging.

2. Initially, our data asset used ON FUTURE grants for all database objects, implemented via Terraform.
However, a requirement emerged to switch to ON ALL grants for each object. During a deployment by
another team member, all grants in the development environment were accidentally removed. This is
when I realized that once ON ALL grants are applied, reverting to ON FUTURE grants is no longer possible.

3. Losing all grants in the development environment was a significant challenge, and identifying the root
cause took time. Fortunately, Terraform’s tfvars files allowed me to restore the changes.

4. To prevent such issues in the future, our team agreed to always coordinate and communicate before
deploying any changes in any environment. This ensures we avoid similar errors going forward.

Why this role?

This role will give me more exposure to database administration and to data-driven technologies.
BlackRock is an organization I have always wanted to work for; it will boost my career and help me
understand more about the investment banking and asset management domain.

Relocation?
No, I don't have any issues with relocation. I am ready to relocate.

Shift?
In my current organization I am already working in shifts, so I don't have any issues working in shifts.

Office?
Yes, I can work from the office.
