Data Masters: Data Warehousing in the Cloud

Using Cloud Data Warehousing to Analyze Structured and Semi-Structured Data Sets

Kevin Bair
Solution Architect
[email protected]
Topics this presentation will cover
1. Cloud DW Architecture
2. ETL / Data Pipeline Architecture
3. Analytics on Semi-Structured Data
4. “Instant” Datamarts without replicating TB of data
5. Analyzing Structured with Semi-Structured Data
Introducing Snowflake: An experienced team of data experts with a vision to reinvent the data warehouse

• Bob Muglia, CEO: Former President of Microsoft’s Server and Tools Business
• Benoit Dageville, PhD, CTO & Founder: Lead architect of Oracle parallel execution and a key manageability architect
• Marcin Zukowski, PhD, Founder & VP of Engineering: Inventor of vectorized query execution in databases
• Thierry Cruanes, PhD, Founder & Architect: Leading expert in query optimization and parallel execution at Oracle
Today’s data: big, complex, moving to cloud

• … of workloads will be processed in cloud data centers (Cisco)
• Surge in cloud spending and supporting technology (IDC)
• Data in the cloud today is expected to grow … in the next two years (Gigaom)
Structured data
• Transactional data
• Relational
• Fixed schema
• OLTP / OLAP

Semi-structured data
• Machine-generated
• Non-relational
• Varying schema
• Most common in cloud environments
What does Semi-Structured mean?
• Data that may be of any type
• Data that is incomplete
• Structure that can rapidly and unpredictably change
• Usually self-describing

• Examples
• XML
• AVRO
• JSON
XML Example
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple
syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped
cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and
whipped cream</description>
<calories>900</calories>
</food>
</breakfast_menu>
JSON Example
{
"custkey": "450002",
"useragent": {
"devicetype": "pc",
"experience": "browser",
"platform": "windows"
},
"pagetype": "home",
"productline": "none",
"customerprofile": {
"age": 20,
"gender": "male",
"customerinterests": [
"movies",
"fashion",
"music"
]
}
}
Avro Example
• Schema: JSON
• Data: Binary
Why is this so hard for a traditional Relational DBMS?
• Pre-defined schema
• Stored in a Character Large Object (CLOB) data type
• Inefficient to query
• Constantly changing
Current architectures can’t keep up

Data Warehousing
• Complex: manage hardware, data distribution, indexes, …
• Limited elasticity: forklift upgrades, data redistribution, downtime
• Costly: overprovisioning, significant care & feeding

Hadoop
• Complex: specialized skills, new tools
• Limited elasticity: data redistribution, resource contention
• Not a data warehouse: batch-oriented, limited optimization, incomplete security
Data Pipeline / Data Lake Architecture – “ETL”

Sources: Website Logs, Operational Systems, External Providers, Stream Data
Flow: Sources → Stage (S3, 10 TB) → Data Lake (Hadoop, 30 TB) → Stage (S3, 5 TB) → EDW (MPP, 10 TB disk, summary data)
One System for all Business Data

The original diagram contrasts two stacks. On the left, separate systems: a data sink (HDFS storage), Map-Reduce jobs, relational databases, and other systems, splitting structured data (rows such as Apple 101.12 250 FIH-2316) from semi-structured data (a JSON document for "John Smith" with a nested address). On the right, Snowflake holds both.

Multiple systems → One system
Specialized skillset → One common skillset
Slower/more costly data conversion → Faster/less costly data conversion
One system handles both structured and semi-structured business data
How have other Big Data / DW vendors approached this?

Microsoft - SQL Server doesn't yet accommodate JSON queries, so instead the company announced Azure DocumentDB, a native document DBaaS (database as a service) for the Azure cloud (http://azure.microsoft.com/en-us/documentation/services/documentdb/)

Oracle Exadata - Oracle Exadata X5 has many new software capabilities, including faster pure columnar flash caching, database snapshots, flash cache resource management, near-instant server death detection, I/O latency capping, and offload of JSON and XML analytics (https://www.oracle.com/corporate/pressrelease/data-center-012115.html)

IBM Netezza - You can use the Jaql Netezza® module to read from or write to Netezza tables. (www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/r0057926.html)

Postgres/Redshift - We recommend using JSON sparingly. JSON is not a good choice for storing larger datasets because, by storing disparate data in a single column, JSON does not leverage Amazon Redshift's column store architecture. (http://docs.aws.amazon.com/redshift/latest/dg/json-functions.html)

Hadoop - Hive and/or Map Reduce, somewhat vendor specific
Relational Processing of Semi-Structured Data

1. Variant data type compresses storage of semi-structured data
2. Data is analyzed during load to discern repetitive attributes within the hierarchy
3. Repetitive attributes are columnar compressed and statistics are collected for relational query optimization
4. SQL extensions enable relational queries against both semi-structured and structured data
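A minimal sketch of what this looks like in practice, reusing the table, column, and stage names from the examples later in this deck (the JSON file format options are an assumption):

-- Store raw JSON in a VARIANT column; Snowflake shreds repetitive
-- attributes into compressed columnar storage during the load
CREATE TABLE json_data_table (fullrow VARIANT);

COPY INTO json_data_table
FROM @~/json/json_sample_data.gz
FILE_FORMAT = (TYPE = 'JSON');

-- SQL path syntax then queries the hierarchy relationally
SELECT fullrow:fullName::varchar, fullrow:age::int
FROM json_data_table;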
Why Support Semi-Structured Data via SQL?
• Integrate with existing data
• Reduced administrative costs
• Improved security
• Transaction management
• Better performance
• Better resource allocation
• Increased developer productivity
• SQL is a proven model for performing queries, especially joins (see the sketch below)
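For example, a join between structured and semi-structured data might look like this (a sketch only; the sales and tweets tables and their columns are hypothetical, anticipating the Twitter demo later in the deck):

SELECT s.order_date,
       COUNT(DISTINCT t.v:id)  AS tweet_count,
       SUM(s.amount)           AS sales
FROM sales s
JOIN tweets t
  ON t.v:created_at::date = s.order_date  -- VARIANT path cast joined to a structured column
GROUP BY s.order_date;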
Requirements for a Cloud-based Big Data / Data Warehouse Platform
• No contention (writers can't block readers)
• Continuous loading of data without "windows"
• Compress, and don't duplicate, the data
• Segment the data (datamarts) without replicating
• Ability to analyze structured and semi-structured data, together, at volume (TB-PB) using SQL
• Reduced complexity; easy to manage and develop with
• ELT vs. ETL, allowing processing to happen closer to the data
• Security; encrypt all data at rest
Architectural Evolution of the Data Warehouse

Scale Up (SMP – Single Server)
• RDBMS Software
• Storage
Architectural Evolution of the Data Warehouse

Scale Up (SMP – Single Server): one optimizer, metadata/schema, query engine, and storage on a single server.

Scale Out (MPP / Hadoop Cluster): a leader node holds the optimizer and metadata/schema; query engines run on data nodes, each owning a slice of storage (1/X, 2/X, 3/X, …).
Drawbacks: 1) Partition keys / OLAP, 2) Skew, 3) Redundancy, 4) Query inefficiency

Scale Up, Out or Down (Elastic / Cloud): the optimizer and metadata/schema sit in a shared service; multiple independent query engines scale against common storage.
Benefits: 1) No partitions, 2) Multiple clusters, 3) Only data needed is accessed, 4) Query efficient
Data Warehousing ETL & Data Loading

• Database is separate from Virtual Warehouse
• One Virtual Warehouse, multiple Databases
• One Database, multiple Virtual Warehouses
• Virtual Warehouse scales independently from Database
• Data loading does not impact query performance

The original diagram shows separate virtual warehouses for Marketing, Finance, Test/Dev, Sales, and Biz Dev users, all coordinated by the Cloud Service over shared databases.
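As a sketch (the warehouse names are hypothetical), creating independent compute against the same data uses standard Snowflake DDL:

-- Separate warehouses for loading and reporting; neither contends with the other
CREATE WAREHOUSE load_wh   WITH WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 300;
CREATE WAREHOUSE report_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300;

-- Resize compute without moving or copying any data
ALTER WAREHOUSE report_wh SET WAREHOUSE_SIZE = 'XLARGE';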
Data Pipeline / Snowflake Architecture – “ELT”

Sources: Website Logs (10 TB), Operational Systems, External Providers, Stream Data
Flow: Sources → Stage (S3) → EDW (Snowflake, 2 TB disk)
Amazon Cloud Data Pipeline Architecture

Stream data (“JSON”) flows through Amazon Kinesis; the application (under IAM) writes it to S3 buckets, which feed the EDW. Amazon SQS and Amazon SNS handle notification, and Amazon Glacier holds long-term storage files.
Amazon Cloud Data Pipeline Architecture (Near Real-time)

The same pipeline, with Storm / Spark consuming the Kinesis stream for near real-time processing, alongside the S3 / Amazon Glacier long-term storage path and Amazon SQS / SNS notification.
Typical customer environment

Data sources (OLTP databases, enterprise applications, third-party applications, web, and other sources such as Hadoop) feed an ETL layer into the EDW, which in turn feeds datamarts and BI / Analytics tools.
Demo Time!
Demo Scenarios
• Clickstream Analysis (load JSON, multi-table insert)
• Which product category is most clicked on?
• Which product line does the customer self-identify as having the most interest in?

• Twitter Feed (join Structured and Semi-Structured)
• From our twitter campaign, is there a correlation between twitter volume and sales?
Clickstream Example
{
"custkey": "450002",
"useragent": {
"devicetype": "pc",
"experience": "browser",
"platform": "windows"
},
"pagetype": "home",
"productline": "none",
"customerprofile": {
"age": 20,
"gender": "male",
"customerinterests": [
"movies",
"fashion",
"music"
]
}
}
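Given records shaped like the one above, a Snowflake multi-table insert can split each document into relational tables in one pass (a sketch; the clickstream_raw, clicks, and profiles names are hypothetical):

INSERT ALL
  INTO clicks   (custkey, pagetype, productline) VALUES (custkey, pagetype, productline)
  INTO profiles (custkey, age, gender)           VALUES (custkey, age, gender)
SELECT v:custkey::string                 AS custkey,
       v:pagetype::string                AS pagetype,
       v:productline::string             AS productline,
       v:customerprofile.age::int        AS age,
       v:customerprofile.gender::string  AS gender
FROM clickstream_raw;  -- one VARIANT column v per raw JSON record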
What makes Snowflake unique for
handling Semi-Structured Data?
• Compression
• Encryption / Role Based Authentication
• Shredding
• History/Results
• Clone
• Time Travel
• Flatten
• Regexp
• No Contention
• No Tuning
• Infinitely scalable
• SQL based with extremely high performance
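Two of these features combined give the “instant” datamarts from the topics slide. A sketch (the database and table names are hypothetical):

-- Zero-copy clone: a full datamart without duplicating TB of storage
CREATE DATABASE marketing_mart CLONE analytics_db;

-- Time Travel: query a table as it existed an hour ago
SELECT COUNT(*)
FROM clicks AT(OFFSET => -3600);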
Where to Get More Info

• Visit us: http://www.snowflake.net/

• Email us:
• Sales: [email protected]
• General: [email protected]

• Q&A
THANK YOU!
Functions

Regular expressions:
• REGEXP
• REGEXP_COUNT
• REGEXP_INSTR
• REGEXP_LIKE
• REGEXP_REPLACE
• REGEXP_SUBSTR
• RLIKE

Arrays:
• ARRAYAGG, ARRAY_AGG
• ARRAY_APPEND
• ARRAY_CAT
• ARRAY_COMPACT
• ARRAY_SIZE
• ARRAY_CONSTRUCT
• ARRAY_CONSTRUCT_COMPACT
• ARRAY_INSERT
• ARRAY_PREPEND
• ARRAY_SLICE
• ARRAY_TO_STRING

JSON / semi-structured:
• CHECK_JSON
• PARSE_JSON
• OBJECT_CONSTRUCT
• OBJECT_INSERT
• GET
• GET_PATH
• AS_type
• IS_type
• IS_NULL_VALUE
• TO_JSON
• TYPEOF
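A few of these in one query, run against the clickstream document shown earlier (a sketch; the clickstream_raw table and its VARIANT column v are hypothetical):

SELECT ARRAY_SIZE(v:customerprofile.customerinterests)  AS n_interests,
       GET_PATH(v, 'useragent.platform')                AS platform,
       TYPEOF(v:customerprofile.age)                    AS age_type,
       REGEXP_LIKE(v:custkey::string, '[0-9]+')         AS custkey_is_numeric
FROM clickstream_raw;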
Parsing JSON using Snowflake SQL
(After loading JSON file into Snowflake table)

Parsing JSON using SQL from a VARIANT column in a Snowflake table


SELECT 'The First Person is '||fullrow:fullName||' '||
'He is '||fullrow:age||' years of age.'||' '||
'His children are: '
||fullrow:children[0].name||' Who is a '||
fullrow:children[0].gender||' and is '||
fullrow:children[0].age||' year(s) old '
||fullrow:children[1].name||' Who is a '||
fullrow:children[1].gender||' and is '||
fullrow:children[1].age||' year(s) old '
FROM json_data_table
WHERE fullrow:fullName = 'John Doe';
FLATTEN() Function and its Pseudo-columns
FLATTEN() converts a repeated field into a set of rows.
FLATTEN() returns pseudo-columns in addition to the data result.

SELECT S.fullrow:fullName, t.value:name, t.value:age, t.SEQ, t.KEY, t.PATH, t.INDEX, t.VALUE
FROM json_data_table AS S, TABLE(FLATTEN(S.fullrow,'children')) t;

• SEQ: a unique sequence number related to the input expression
• KEY: for maps or objects, contains the key to the exploded value
• PATH: path expression of the exploded value in the input expression
• INDEX: for arrays, contains the index in the array of the exploded value
• VALUE: the expression contained in the collection
FLATTEN() in Snowflake SQL
(Removing one level of nesting)

FLATTEN() converts a repeated field into a set of rows:


SELECT S.fullrow:fullName, t.value:name, t.value:age
FROM json_data_table as S, TABLE(FLATTEN(S.fullrow,'children')) t
WHERE s.fullrow:fullName = 'Mike Jones'
AND t.value:age::integer > 6 ;
FLATTEN(): Two levels of un-nesting

Output
+---------------+-----+--------+-------------------+------------------------+
| name | age | gender | citiesLived.place | citiesLived.yearsLived |
+---------------+-----+--------+-------------------+------------------------+
| Mike Jones | 35 | Male | Los Angeles | 1989 |
| Mike Jones | 35 | Male | Los Angeles | 1993 |
| Mike Jones | 35 | Male | Los Angeles | 1998 |
| Mike Jones | 35 | Male | Los Angeles | 2002 |
| Mike Jones | 35 | Male | Washington DC | 1990 |
| Mike Jones | 35 | Male | Washington DC | 1993 |
| Mike Jones | 35 | Male | Washington DC | 1998 |
| Mike Jones | 35 | Male | Washington DC | 2008 |
| Mike Jones | 35 | Male | Portland | 1993 |
| Mike Jones | 35 | Male | Portland | 1998 |
| Mike Jones | 35 | Male | Portland | 2003 |
| Mike Jones | 35 | Male | Portland | 2005 |
| Mike Jones | 35 | Male | Austin | 1973 |
| Mike Jones | 35 | Male | Austin | 1998 |
| Mike Jones | 35 | Male | Austin | 2001 |
| Mike Jones | 35 | Male | Austin | 2005 |
+---------------+-----+--------+-------------------+------------------------+
FLATTEN() in Snowflake SQL
(Removing two levels of nesting)
Getting the cities Mike Jones lived in and when.

TABLE (Snowflake syntax):

SELECT
  p.fullrow:fullName::varchar as name,
  p.fullrow:age::int as age,
  p.fullrow:gender::varchar as gender,
  cl.value:place::varchar as city,
  yl.value::int as year
FROM json_data_table p,
  TABLE(FLATTEN(p.fullrow,'citiesLived')) cl,
  TABLE(FLATTEN(cl.value:yearsLived,'')) yl
WHERE name = 'Mike Jones';

LATERAL (ANSI syntax, also supported):

SELECT
  p.fullrow:fullName::varchar as name,
  p.fullrow:age::int as age,
  p.fullrow:gender::varchar as gender,
  cl.value:place::varchar as city,
  yl.value::int as year
FROM json_data_table p,
  LATERAL FLATTEN(p.fullrow,'citiesLived') cl,
  LATERAL FLATTEN(cl.value:yearsLived,'') yl
WHERE name = 'Mike Jones';

Output: the same rows shown on the previous slide (one row per city per year lived).
Parsing JSON using Snowflake SQL
(Without loading the JSON file into a Snowflake table)

Parsing JSON using SQL directly from the file without loading into Snowflake:

SELECT 'The First Person is '||S.$1:fullName||' '||
       'He is '||S.$1:age||' years of age.'||' '||
       'His children are: '||S.$1:children[0].name||' Who is a '||
       S.$1:children[0].gender||' and is '||S.$1:children[0].age||' year(s) old '
FROM @~/json/json_sample_data (FILE_FORMAT => 'json') as S
WHERE S.$1:fullName = 'John Doe';
Parsing JSON Records:
PARSE_JSON
Interprets an input string as a JSON document, producing a VARIANT value
SELECT s.fullrow:fullName Parent, c.value Children_Object,
c.value:name Child_Name, c.value:age Child_Age
FROM json_data_table AS S,
TABLE(FLATTEN(S.fullrow,'children')) c
WHERE PARSE_JSON(c.value:age) > 8;
Parsing JSON Records:
CHECK_JSON

Valid JSON will produce NULL:

SELECT CHECK_JSON('{"age": "15",
                    "gender": "Male",
                    "name": "John"}') ;
-- Valid JSON

Invalid JSON will produce an error message:

SELECT CHECK_JSON('{"age": "15",
                    "gender": "Male",
                    "name "John" ') ;
-- Missing :

SELECT CHECK_JSON('{"age": "15",
                    "gender": "Male",
                    "name": "John" ') ;
-- Missing }
Parsing JSON Records:
CHECK_JSON

Validate JSON records in the S3 file before loading it, using SELECT with a CSV file format:

SELECT S.$1, CHECK_JSON(S.$1)
FROM @~/json/json_sample_data (FILE_FORMAT => 'CSV') AS S ;

(Errors reported in the original output: missing matching ], missing : before [, missing attribute value.)

Validate JSON records in the S3 file before loading it, using COPY with a JSON file format:

COPY INTO json_data_table
FROM @~/json/json_sample_data.gz
FILE_FORMAT = 'JSON' VALIDATION_MODE = 'RETURN_ERRORS';
Backup Slides
Learn more at snowflake.net
Snowflake Architecture

User Interface: ODBC Driver, JDBC Driver, Web UI
Cloud Services: Optimization, Query Mgmt, Warehouse Mgmt, Security, Metadata
Virtual Warehouse Processing (EC2): Customer Service, Financial Analysts, Quality Control, Loading
Database Storage (S3): Data, Sales, Marketing, Materials
Cloud Infrastructure: Amazon AWS
Snowflake Architecture

User Interface: ODBC Driver, JDBC Driver, Web UI
Compute (EC2): a virtual warehouse (e.g., Financial Analysts) is a cluster of nodes running DML and DDL, each working on a subset of the data
Cloud Services: Optimization, Query Mgmt, Warehouse Mgmt, Security, backed by replicated Metadata and Database services
Storage (S3): the complete data set, shared by all compute clusters
Snowflake Architecture

User Interface: ODBC Driver, JDBC Driver, Web UI
Compute (EC2): multiple independent clusters (e.g., Loading and Financial Analysts) run DML and DDL concurrently against the same data
Cloud Services: Optimization, Query Mgmt, Warehouse Mgmt, Security, backed by replicated Metadata and Database services
Storage (S3): shared data (Data, Sales, Marketing) in the AWS cloud
Snowflake High Availability Architecture

A load balancer distributes SQL and REST traffic across Cloud Services clusters in every availability zone. Metadata is fully replicated, virtual warehouses run in each zone, and database storage is fully replicated across Availability Zones 1, 2, and 3.
Enterprise-class data warehouse: Security

Authentication
• Embedded multi-factor authentication server
• Federated authentication via SAML 2.0 (in development)

Access control
• Role-based access control model
• Granular privileges on objects & actions

Data encryption
• Encryption at rest for database data
• Encryption of Snowflake metadata
• Snowflake-managed keys

Controls & processes validated through SOC certification & audit
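As a sketch of the role-based model (the role, warehouse, database, and user names are hypothetical):

-- Grant an analyst role read access and compute, nothing more
CREATE ROLE analyst;
GRANT USAGE ON WAREHOUSE report_wh TO ROLE analyst;
GRANT USAGE ON DATABASE analytics_db TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.public TO ROLE analyst;
GRANT ROLE analyst TO USER kevin;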
