SF Notes Anuja
calculations. We also assist this client with other aspects of indirect taxes, namely compliance, returns, and reporting. Our team works on the data domain, which involves our customer private cloud's Client Configuration data. I mostly work on administering data as well as building data pipelines. We have built two types of pipelines so far:
E-Invoice:
1. E-Invoicing requires inputting invoicing data into a tax engine that reconciles with another team's data. We transform XML files and merge them with e-invoice metadata. We have four layers in each environment: RAW, STG, CURATED, and SEMANTIC.
2. We have two sources: e-invoice XML files in an S3 bucket in AWS, and e-invoice metadata from Fivetran. Fivetran is an ingestion tool that loads data from a source to a destination.
3. The e-invoices received from Vertex clients are parsed by the E-Invoicing application, and the results of data extraction are saved in the metadata databases. This metadata is extracted from the tenant metadata database by Fivetran connectors.
1. In the RAW database we receive tenant metadata from Fivetran and an S3 replication of the e-invoice XML files.
2. Snowpipe loads the XML payload from S3 into the RAW layer of the Data Platform Snowflake account, triggered by an SQS notification that new file(s) have arrived in the S3 folder.
3. Data from the E-Invoice metadata database schemas is joined and merged in the stage data layer, transformed in the curated layer (for example, adding another column or changing data types), and finally formatted in the semantic data layer. This transformation is implemented with dbt (data build tool), in which we materialize dbt models using SQL SELECT statements.
4. In the stage layer, we have two scheduled tasks that perform operations like merging and inserting data into the STG layer table (see the task sketch after this list). The data is then transformed in the curated layer, for example by adding another column or changing data types, and on top of this table we implemented views in the final and semantic data layers.
5. Another team, mostly data scientists, connects to the Data Platform Snowflake account to query the e-invoice semantic data set.
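A minimal sketch of what one of those scheduled STG merge tasks could look like (all object names here – stg.einvoice, raw.einvoice_xml, transform_wh, and the columns – are hypothetical, not the real pipeline objects):
-- Hypothetical sketch: a task that periodically merges newly landed RAW rows into an STG table.
CREATE OR REPLACE TASK stg_merge_einvoice
  WAREHOUSE = transform_wh
  SCHEDULE = '15 MINUTE'
AS
MERGE INTO stg.einvoice t
USING raw.einvoice_xml s
  ON t.invoice_id = s.invoice_id
WHEN MATCHED THEN UPDATE SET payload = s.payload, loaded_at = s.loaded_at
WHEN NOT MATCHED THEN INSERT (invoice_id, payload, loaded_at)
  VALUES (s.invoice_id, s.payload, s.loaded_at);
-- Tasks are created suspended; resume to start the schedule.
ALTER TASK stg_merge_einvoice RESUME;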
Snowflake Features:
1. OLAP data warehouse
2. Multi-cluster, shared data architecture
3. Pay only for what you use: storage and compute
4. Virtual Warehouses – scale up/down (vertically) by resizing the VWH; scale in/out (horizontally) by adding/removing VWH clusters
5. Time travel and fail-safe
6. Data sharing
7. Zero copy cloning
Data Warehouse: a database that combines data from different data sources into one consumable DB for reporting and analysis purposes.
Cloud Computing: Historically, if an organization wanted to work with data, it needed its own data center, which caused a lot of overhead. It needed a building / data center room (i.e. infrastructure), and then had to provide electricity and security for it. On top of that, the maintenance of hardware and software added significant cost.
Storage:
1. The layer where we store data.
2. Data is not actually stored in Snowflake itself; we use an external cloud provider.
3. So for storage we use AWS S3 buckets. Snowflake's storage layer is built on top of cloud-based storage systems such as Amazon S3, depending on the chosen cloud provider.
4. This is all under the hood; we don't have to do anything or maintain it.
5. The storage format is also called Hybrid Columnar Storage.
6. This means that instead of saving data in rows, it is stored in columns, compressed into blobs. This is how the data is stored at the cloud providers.
Compute (Virtual Warehouses):
1. Provides compute capacity, i.e. virtual compute instances/servers that are used to process queries.
2. They come in different sizes.
3. The smallest size is X-Small, consisting of one server. If queries are not complex, use a smaller size; as queries become more complex, we use a warehouse with a higher number of servers (i.e. a larger warehouse) to run them fast.
4. Each size increase doubles the number of servers, up to 4XL.
5. Larger warehouses are more expensive.
6. Multi-clustering: if one warehouse can't handle the query load, additional clusters can be activated and clustered together in one warehouse. This gives more compute power to process a large number of queries simultaneously.
Multi-cluster WH: If we have one WH and at a certain time there are more queries than a single WH can process, it causes a problem because users have to wait a long time until all queries are processed. The solution is that we can automatically scale the number of clusters up and down dynamically depending on the workload. By doing this we can redistribute the queries that are queued and cannot currently be processed.
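A minimal sketch of creating a multi-cluster warehouse (the name and limits are hypothetical, just to show the relevant parameters):
CREATE OR REPLACE WAREHOUSE reporting_wh   -- hypothetical name
  WAREHOUSE_SIZE = 'XSMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3                    -- extra clusters start only when queries queue up
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 60                        -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;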
LOADING:
STAGES: a DB object used as the location of data files from which data can be loaded.
FILE FORMAT: a pre-defined format structure that describes a set of staged data to access or load into Snowflake tables, for CSV, JSON, AVRO, ORC, PARQUET, and XML input types.
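A minimal sketch of the two objects together with a COPY command (the bucket path, format, stage, and table names are hypothetical; credentials/storage integration are omitted):
CREATE OR REPLACE FILE FORMAT csv_ff
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;
CREATE OR REPLACE STAGE landing_stage
  URL = 's3://my-bucket/landing/'          -- hypothetical bucket/path
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff');
COPY INTO raw.customers                    -- hypothetical target table
  FROM @landing_stage
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff');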
Snowpipe:
Enables loading as soon as a file appears in the bucket.
Serverless: instead of using a VWH, compute resources are automatically managed by Snowflake itself.
Steps: create STAGE → test the COPY command → create the pipe → create an event notification (as a file is added to the S3 bucket, it can trigger Snowpipe).
Enables near-real-time loading of frequent, small volumes of data into Snowflake.
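A minimal sketch of the pipe itself (the pipe, stage, and table names are hypothetical; the target table would hold the XML in a VARIANT column, and the S3 event notification is configured separately on the bucket):
CREATE OR REPLACE PIPE einvoice_pipe
  AUTO_INGEST = TRUE                        -- load automatically on S3 event notifications
AS
COPY INTO raw.einvoice_xml                  -- hypothetical RAW-layer table
  FROM @einvoice_stage                      -- hypothetical external stage on the S3 bucket
  FILE_FORMAT = (TYPE = 'XML');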
We can choose a specific column in a table to be the clustering key; Snowflake then sorts and stores related data together within the micro-partitions.
Benefit: if your queries often filter on the chosen clustering key, Snowflake can quickly find the relevant information within the micro-partitions because similar data is grouped together.
Cluster Keys: (Snowflake automatically maintains these keys, or we can specify them ourselves)
1. A subset of columns used to co-locate data within micro-partitions
2. For large tables this improves scan efficiency in our queries.
How to cluster:
1. Choose columns used most often in WHERE clauses and joins (see the example below).
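A minimal sketch of specifying a clustering key (the table and column names are hypothetical):
-- Cluster a large fact table on the column most queries filter on.
CREATE OR REPLACE TABLE sales_fact (
  sale_date DATE,
  region VARCHAR,
  amount NUMBER
) CLUSTER BY (sale_date);
-- Or add/change the key on an existing table:
ALTER TABLE sales_fact CLUSTER BY (sale_date, region);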
Partition Pruning:
Only scan the partitions that contain the data we need, skipping/eliminating all other partitions.
Performance optimization:
Snowflake automatically manages micro-partitions, so the following tasks are already handled for us:
Add indexes, primary keys
Create table partitions
Analyze the query execution plan
Remove unnecessary full table scans
What we can do ourselves:
Assign appropriate data types
Size and dedicate VWHs, separated according to different workloads
Use cluster keys for large tables
Caching: query results are stored for up to 24 hours; to make the most of warehouse caching, run related queries on the same warehouse
Time travel: query historical data / recover objects that have been dropped, within the retention time.
What is possible?
1. Query deleted / updated data
2. Restore tables, schemas, and DBs that have been dropped
3. Create clones of tables, schemas, and DBs from a previous state
4. Note: it contributes to storage cost
AT | BEFORE clause which can be specified in SELECT statements and CREATE … CLONE commands
(immediately after the object name). The clause uses one of the following parameters to pinpoint the
exact historical data you wish to access:
o TIMESTAMP
o OFFSET (time difference in seconds from the present time)
o STATEMENT (identifier for statement, e.g. query ID)
SELECT * FROM my_table AT(TIMESTAMP => 'Fri, 01 May 2015 16:20:00 -0700'::timestamp_tz);
SELECT * FROM my_table AT(OFFSET => -60*5);
SELECT * FROM my_table BEFORE(STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');
Retention Period: how far back in time we can travel in Snowflake. It depends on the edition (Standard: 1 day, other editions: up to 90 days). Default is 1 day.
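A minimal sketch of recovering objects with time travel (object names are hypothetical):
-- Bring back a dropped table within the retention period.
UNDROP TABLE my_table;
-- Clone a table as it looked one hour ago.
CREATE TABLE my_table_restored CLONE my_table AT(OFFSET => -3600);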
Table types:
1. Permanent Table
• persist until deleted or dropped from database
• default table type.
• Time travel is possible in these tables up to 90 days
• It has fail-safe, and data can be recovered if lost due to a failure
2. Temporary Table
• persist for a session
• Time travel is possible but only 0 to 1 day.
• It is not fail-safe
3. Transient Table
• persist until the users drop or delete them.
• It is used where "data persistence" is required but doesn't need "data retention" for a longer
period
• Time travel is possible but only 0 to 1 day.
• It is not fail-safe
4. External Table
• allows you to query data stored in an external stage as if the data were inside a table in
Snowflake.
• External tables are read-only. We can’t perform DML operations.
• Time travel is not possible for external tables.
• It is not fail-safe inside Snowflake environment.
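A minimal sketch of creating each table type (all names are hypothetical):
-- Permanent (default)
CREATE OR REPLACE TABLE sales_perm (id INT, amount NUMBER);
-- Temporary: visible only in the current session
CREATE OR REPLACE TEMPORARY TABLE sales_tmp (id INT, amount NUMBER);
-- Transient: persists, but only 0–1 day of time travel and no fail-safe
CREATE OR REPLACE TRANSIENT TABLE sales_trans (id INT, amount NUMBER);
-- External: read-only table over files in an external stage (stage name is hypothetical)
CREATE OR REPLACE EXTERNAL TABLE sales_ext
  LOCATION = @landing_stage
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = 'PARQUET');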
Stream: an object that records DML (DELETE, INSERT, UPDATE) changes made to a table. This process is called CDC (Change Data Capture).
Types: standard (default), append-only, and insert-only (insert-only is for external tables only).
-- Create a stream for the sales table
CREATE OR REPLACE STREAM sales_stream ON TABLE sales;
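A minimal sketch of consuming the stream (the sales_history table and its columns are hypothetical); reading a stream inside a DML statement advances its offset:
-- Move the captured changes into a history table.
INSERT INTO sales_history (id, amount, action, is_update)
SELECT id, amount, METADATA$ACTION, METADATA$ISUPDATE
FROM sales_stream;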
View: stores the underlying logic of a query. We can create a view on top of a table; the view itself stores no data, and its query is re-executed every time the view is selected.
Materialized View: stores the result like a table and is maintained automatically, so when there are changes in the underlying base table they are automatically reflected in the MV.
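A minimal sketch of both (the sales table and its columns are hypothetical):
-- Regular view: just stored query logic, re-run on every select
CREATE OR REPLACE VIEW sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
-- Materialized view: the result is stored and maintained by Snowflake
CREATE OR REPLACE MATERIALIZED VIEW sales_by_region_mv AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;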
Stored Procedure: A Stored Procedure in Snowflake is a block of code that performs a series of database operations. It allows you to encapsulate logic that you can execute on demand.
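A minimal sketch of a stored procedure in Snowflake Scripting (the sales table, its sale_date column, and the procedure name are hypothetical):
CREATE OR REPLACE PROCEDURE purge_old_sales(days_to_keep INTEGER)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
  -- delete rows older than the requested retention window
  DELETE FROM sales WHERE sale_date < DATEADD('day', -1 * :days_to_keep, CURRENT_DATE());
  RETURN 'purge complete';
END;
$$;
CALL purge_old_sales(30);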
A User-Defined Function (UDF) in Snowflake is a custom function that you define to perform specific
operations. It can return a single value or a table, and you can use it in your queries just like built-in
functions.
CREATE OR REPLACE FUNCTION add_numbers(a INT, b INT)
RETURNS INT
LANGUAGE SQL
AS $$
  a + b
$$;
SELECT add_numbers(5, 3);
A data warehouse is a database that combines data from different data sources into one consumable DB for reporting and analysis purposes.
A data mart is a subset of a data warehouse that has the same characteristics but is usually smaller and is focused on the data for one division or one workgroup within an enterprise.
RDBMS: In a relational DBMS, whatever information we store is kept in the form of tables, each with multiple columns and multiple rows. We have multiple tables, and there are relations between these tables.
MySQL vs SQL: MySQL is system software (a database management system) used to work with databases, and SQL is the query language used to give it commands.
Constraints: SQL constraints are used to specify rules for the data in a table
JOIN clause is used to combine rows from two or more tables, based on a related column between them.
1. (INNER) JOIN: Returns records that have matching values in both tables
2. LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table
3. RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table
4. FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
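A minimal sketch (the customers and orders tables are hypothetical):
-- Customers with their orders; LEFT JOIN also keeps customers that have no orders.
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id;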
DELETE vs DROP vs TRUNCATE
DELETE: used to delete one or more rows of a table; we can give a condition to delete specific rows.
DELETE FROM table_name WHERE condition;
DROP: used to drop the whole table (structure and data).
DROP TABLE table_name;
TRUNCATE: used to delete all the rows of a table in one go, keeping the table structure.
TRUNCATE TABLE table_name;
Why use a sub query in SQL Server, and what are the types of sub queries?
Answer: Sub queries are queries within a query. The parent or outer query is called the main query, and the inner query is called the subquery. The different types of sub queries are (see the examples below):
• Correlated – not an independent subquery; it is an inner query that refers to the outer query.
• Non-Correlated – an independent subquery; it can be executed even without an outer query.
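A minimal sketch of both kinds (the employees and departments tables are hypothetical):
-- Non-correlated: the inner query can run on its own.
SELECT name FROM employees
WHERE department_id IN (SELECT department_id FROM departments WHERE region = 'EU');
-- Correlated: the inner query references the outer query's row (e.department_id).
SELECT e.name, e.salary FROM employees e
WHERE e.salary > (SELECT AVG(x.salary) FROM employees x WHERE x.department_id = e.department_id);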
What is the difference between UNION and UNION ALL in SQL Server?
UNION is used to combine result sets and removes duplicate records from the final result set obtained.
UNION ALL returns all the rows, irrespective of whether rows are duplicated or not (see the example below).
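A minimal sketch (the customers and suppliers tables are hypothetical):
-- UNION returns distinct cities only; UNION ALL keeps any duplicates.
SELECT city FROM customers
UNION
SELECT city FROM suppliers;
SELECT city FROM customers
UNION ALL
SELECT city FROM suppliers;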
Alternate Key – to identify a row uniquely we can have multiple keys; one of them is chosen as the primary key and the rest are called alternate keys.
Candidate Key – a set of fields or columns that uniquely identifies a row; each such set is a candidate key.
Composite Key – one key formed by combining at least two or more columns or fields (see the example below).
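A minimal sketch of a composite primary key (the order_lines table is hypothetical; note that Snowflake records but does not enforce primary key constraints):
CREATE OR REPLACE TABLE order_lines (
  order_id INT,
  line_number INT,
  product_id INT,
  quantity INT,
  PRIMARY KEY (order_id, line_number)   -- the combination identifies a row uniquely
);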
2. Initially, our data asset used ON FUTURE grants for all database objects, implemented via Terraform.
However, a requirement emerged to switch to ON ALL grants for each object. During a deployment by
another team member, all grants in the development environment were accidentally removed. This is
when I realized that once ON ALL grants are applied, reverting to ON FUTURE grants is no longer possible.
3. Losing all grants in the development environment was a significant challenge, and identifying the root
cause took time. Fortunately, Terraform’s tfvars files allowed me to restore the changes.
4. To prevent such issues in the future, our team agreed to always coordinate and communicate before
deploying any changes in any environment. This ensures we avoid similar errors going forward.
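For reference, a minimal sketch of the two grant styles discussed above (the role, database, and schema names are hypothetical):
-- ON FUTURE: applies to objects created after the grant is issued.
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.curated TO ROLE reporting_role;
-- ON ALL: applies to the objects that already exist at the time of the grant.
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.curated TO ROLE reporting_role;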
Relocation?
No, I don't have any issues with relocation. I am ready to relocate.
Shift?
In my current organization, I am working in shifts, so I do not have any issues working in shifts.
Office?
Yes, I can work from the office.