
Building a data processing pipeline using a directory table

You can combine a directory table, which tracks and stores file-level metadata on a stage, with
other Snowflake objects such as streams and tasks to build a data processing pipeline.

A stream records data manipulation language (DML) changes made to a directory table, table,
external table, or the underlying tables in a view. A task executes a single action, such as a
SQL statement, which can include a call to a UDF or stored procedure. You can schedule a task or run it on demand.
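
For example, when you query a stream, each row includes change-tracking metadata columns such as METADATA$ACTION and METADATA$ISUPDATE alongside the tracked table's own columns. A minimal sketch, assuming a hypothetical stream named my_stream defined on a stage's directory table:

SELECT METADATA$ACTION, METADATA$ISUPDATE, relative_path
FROM my_stream
WHERE METADATA$ACTION = 'INSERT';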

Example: Create a simple pipeline to process PDFs


This example builds a simple data processing pipeline that does the following:

1. Detects PDF files added to a stage.
2. Extracts data from the files.
3. Inserts the data into a Snowflake table.

The pipeline uses a stream to detect changes to a directory table on the stage, and a task that
executes a user-defined function (UDF) to process the files.

In summary, the pipeline works as follows: files added to the stage are recorded in the directory table, the stream captures those directory table changes, and the scheduled task calls the UDF on each new file and inserts the parsed data into a table.


Step 1: Create a stage with a directory table enabled

Create an internal stage with a directory table enabled. The example statement sets
the ENCRYPTION type to SNOWFLAKE_SSE to enable unstructured data access on the stage.

CREATE OR REPLACE STAGE my_pdf_stage
  ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' )
  DIRECTORY = ( ENABLE = TRUE );
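
Optionally, you can confirm the stage configuration before continuing; the output of DESCRIBE STAGE lists the DIRECTORY and ENCRYPTION properties set above:

DESCRIBE STAGE my_pdf_stage;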

Step 2: Create a stream on the directory table

Next, create a stream on the directory table by specifying the stage that the directory table
belongs to. The stream will track changes to the directory table. In step 5 of this example, we
reference this stream in the task definition.

CREATE STREAM my_pdf_stream ON STAGE my_pdf_stage;
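
At any point, you can check whether the stream has captured changes by calling SYSTEM$STREAM_HAS_DATA, the same function the task in step 5 uses in its WHEN clause:

SELECT SYSTEM$STREAM_HAS_DATA('my_pdf_stream');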


Step 3: Create a user-defined function to parse PDFs

Create a user-defined function (UDF) that extracts data from PDF files. The task that we create
in step 5 will call this UDF to process newly added files on the stage.

The following example statement creates a UDF named PDF_PARSE that processes PDF files
containing product review data. The UDF extracts form field data using the PyPDF2 library. It
returns a dictionary that contains the form names and values as key-value pairs.

Note
The UDF reads dynamically-specified files using the SnowflakeFile class. To learn more
about SnowflakeFile, see Reading a Dynamically-Specified File with SnowflakeFile.
CREATE OR REPLACE FUNCTION PDF_PARSE(file_path string)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'parse_pdf_fields'
PACKAGES = ('typing-extensions', 'PyPDF2', 'snowflake-snowpark-python')
AS
$$
from io import BytesIO

import PyPDF2 as pypdf
from snowflake.snowpark.files import SnowflakeFile

def parse_pdf_fields(file_path):
    # Open the staged file by its scoped URL and read it into memory.
    with SnowflakeFile.open(file_path, 'rb') as f:
        buffer = BytesIO(f.readall())
    reader = pypdf.PdfFileReader(buffer)
    fields = reader.getFields()
    field_dict = {}
    # Keep only form fields that have a value ("/V"), stripping the leading
    # slash that PyPDF2 prepends to choice values.
    for k, v in fields.items():
        if "/V" in v.keys():
            field_dict[v["/T"]] = v["/V"].replace("/", "") if v["/V"].startswith("/") else v["/V"]
    return field_dict
$$;
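
Once a PDF file is available on the stage (files are uploaded in step 6), you can test the UDF on a single file by passing it a scoped file URL. A quick sanity check, assuming prod_review1.pdf has already been uploaded and the stage refreshed:

SELECT PDF_PARSE(build_scoped_file_url('@my_pdf_stage', 'prod_review1.pdf'));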

Step 4: Create a table to store the file contents

Next, create a table where each row stores information about a file on the stage in columns
named file_name and file_data. The task that we create in step 5 of this example will load data
into this table.

CREATE OR REPLACE TABLE prod_reviews (
  file_name varchar,
  file_data variant
);
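
If you want to confirm the table structure, you can describe it:

DESCRIBE TABLE prod_reviews;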

Step 5: Create a task

Create a scheduled task that checks the stream for new files on the stage and inserts the file data
into the prod_reviews table.

The following statement creates a scheduled task using the stream created in step 2. The task
uses the SYSTEM$STREAM_HAS_DATA function to check whether the stream contains
change data capture (CDC) records.

CREATE OR REPLACE TASK load_new_file_data
  WAREHOUSE = 'MY_WAREHOUSE'
  SCHEDULE = '1 minute'
  COMMENT = 'Process new files on the stage and insert their data into the prod_reviews table.'
WHEN
  SYSTEM$STREAM_HAS_DATA('my_pdf_stream')
AS
INSERT INTO prod_reviews (
  SELECT relative_path AS file_name,
         PDF_PARSE(build_scoped_file_url('@my_pdf_stage', relative_path)) AS file_data
  FROM my_pdf_stream
  WHERE METADATA$ACTION = 'INSERT'
);
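
Note that a newly created task is suspended by default. The manual EXECUTE TASK command in step 6 works on a suspended task, but if you want the task to run on its one-minute schedule, resume it first:

ALTER TASK load_new_file_data RESUME;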

Step 6: Run the task to test the pipeline

To check that the pipeline works, you can add files to the stage, manually execute the task, and
then query the prod_reviews table.

Start by adding some PDF files to the my_pdf_stage stage, and then refresh the stage.

Note
This example uses PUT commands, which cannot be executed from a worksheet in the
Snowflake web interface. To upload files with Snowsight, see Upload files onto a named internal
stage.
PUT file:///my/file/path/prod_review1.pdf @my_pdf_stage AUTO_COMPRESS = FALSE;
PUT file:///my/file/path/prod_review2.pdf @my_pdf_stage AUTO_COMPRESS = FALSE;

ALTER STAGE my_pdf_stage REFRESH;
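
You can also query the directory table itself to confirm that the refresh registered the uploaded files:

SELECT relative_path, size, last_modified
FROM DIRECTORY(@my_pdf_stage);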

You can query the stream to verify that it has recorded the two PDF files that we added to the
stage.

SELECT * FROM my_pdf_stream;


Now, execute the task to process the PDF files and update the prod_reviews table.

EXECUTE TASK load_new_file_data;


+----------------------------------------------------------+
| status |
|----------------------------------------------------------|
| Task LOAD_NEW_FILE_DATA is scheduled to run immediately. |
+----------------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.178s

Query the prod_reviews table to see that the task has added a row for each PDF file.

SELECT * FROM prod_reviews;


+------------------+----------------------------------+
| FILE_NAME        | FILE_DATA                        |
|------------------+----------------------------------|
| prod_review1.pdf | {                                |
|                  |   "FirstName": "John",           |
|                  |   "LastName": "Johnson",         |
|                  |   "Middle Name": "Michael",      |
|                  |   "Product": "Tennis Shoes",     |
|                  |   "Purchase Date": "03/15/2022", |
|                  |   "Recommend": "Yes"             |
|                  | }                                |
| prod_review2.pdf | {                                |
|                  |   "FirstName": "Emily",          |
|                  |   "LastName": "Smith",           |
|                  |   "Middle Name": "Ann",          |
|                  |   "Product": "Red Skateboard",   |
|                  |   "Purchase Date": "01/10/2023", |
|                  |   "Recommend": "MayBe"           |
|                  | }                                |
+------------------+----------------------------------+

Finally, you can create a view that parses the objects in the FILE_DATA column into separate
columns. You can then query the view to analyze and work with the file contents.

CREATE OR REPLACE VIEW prod_review_info_v
AS
WITH file_data
AS (
  SELECT
    file_name
    , parse_json(file_data) AS file_data
  FROM prod_reviews
)
SELECT
  file_name
  , file_data:FirstName::varchar AS first_name
  , file_data:LastName::varchar AS last_name
  , file_data:"Middle Name"::varchar AS middle_name
  , file_data:Product::varchar AS product
  , file_data:"Purchase Date"::date AS purchase_date
  , file_data:Recommend::varchar AS recommended
  , build_scoped_file_url(@my_pdf_stage, file_name) AS scoped_review_url
FROM file_data;

SELECT * FROM prod_review_info_v;

+------------------+------------+-----------+-------------+----------------+---------------+-------------+-------------------+
| FILE_NAME        | FIRST_NAME | LAST_NAME | MIDDLE_NAME | PRODUCT        | PURCHASE_DATE | RECOMMENDED | SCOPED_REVIEW_URL |
|------------------+------------+-----------+-------------+----------------+---------------+-------------+-------------------|
| prod_review1.pdf | John       | Johnson   | Michael     | Tennis Shoes   | 2022-03-15    | Yes         | https://mydeployment.us-west-2.aws.privatelink.snowflakecomputing.com/api/files/01aefcdc-0000-6f92-0000-012900fdc73e/1275606224902/RZ4s%2bJLa6iHmLouHA79b94tg%2f3SDA%2bOQX01pAYo%2bl6gAxiLK8FGB%2bv8L2QSB51tWP%2fBemAbpFd%2btKfEgKibhCXN2QdMCNraOcC1uLdR7XV40JRIrB4gDYkpHxx3HpCSlKkqXeuBll%2fyZW9Dc6ZEtwF19GbnEBR9FwiUgyqWjqSf4KTmgWKv5gFCpxwqsQgofJs%2fqINOy%2bOaRPa%2b65gcnPpY2Dc1tGkJGC%2fT110Iw30cKuMGZ2HU%3d |
| prod_review2.pdf | Emily      | Smith     | Ann         | Red Skateboard | 2023-01-10    | MayBe       | https://mydeployment.us-west-2.aws.privatelink.snowflakecomputing.com/api/files/01aefcdc-0000-6f92-0000-012900fdc73e/1275606224902/g3glgIbGik3VOmgcnltZxVNQed8%2fSBehlXbgdZBZqS1iAEsFPd8pkUNB1DSQEHoHfHcWLsaLblAdSpPIZm7wDwaHGvbeRbLit6nvE%2be2LHOsPR1UEJrNn83o%2fZyq4kVCIgKeSfMeGH2Gmrvi82JW%2fDOyZJITgCEZzpvWGC9Rmnr1A8vux47uZj9MYjdiN2Hho3uL9ExeFVo8FUtR%2fHkdCJKIzCRidD5oP55m9p2ml2yHOkDJW50%3d |
+------------------+------------+-----------+-------------+----------------+---------------+-------------+-------------------+
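
With the parsed fields exposed as columns, you can query the view like any other table. For example, a simple aggregation over the review data:

SELECT recommended, COUNT(*) AS review_count
FROM prod_review_info_v
GROUP BY recommended;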