Data Masters - Datawarehousing in The Cloud
Warehousing to Analyze Structured and Semi-Structured data sets
Kevin Bair
Solution Architect
[email protected]
Topics this presentation will cover
1. Cloud DW Architecture
Introducing Snowflake: An experienced team of data
experts with a vision to reinvent the data warehouse
• Bob Muglia, CEO: Former President of Microsoft’s Server and Tools …
• Benoit Dageville, PhD, CTO & Founder: Lead architect of Oracle parallel execution and a key …
• Marcin Zukowski, PhD, Founder & VP of Engineering: Inventor of vectorized query execution in databases
• Thierry Cruanes, PhD, Founder Architect: Leading expert in query optimization and parallel …
Today’s data: big, complex, moving to cloud
Of workloads will be processed in cloud data centers (Cisco)
Surge in cloud spending and supporting technology (IDC)
Structured data
• Transactional data
• Relational
• Fixed schema
• OLTP / OLAP
Semi-Structured data
• Machine-generated
• Non-relational
• Varying schema
• Most common in cloud environments
What does Semi Structured
mean?
• Data that may be of any type
• Data that is incomplete
• Structure that can rapidly and unpredictably
change
• Usually Self Describing
• Examples
• XML
• AVRO
• JSON
XML Example
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple
syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped
cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and
whipped cream</description>
<calories>900</calories>
</food>
</breakfast_menu>
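As a minimal sketch of how a self-describing XML document like this can be turned into rows, the following uses Python's standard `xml.etree.ElementTree`. The shortened menu, the `menu_rows` helper, and the output field names are illustrative assumptions, not part of the original demo.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the breakfast menu above (descriptions omitted).
MENU_XML = """<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <calories>650</calories>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <calories>900</calories>
  </food>
</breakfast_menu>"""


def menu_rows(xml_text):
    """Return one dict per <food> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for food in root.findall("food"):
        rows.append({
            "name": food.findtext("name"),
            # The price carries a "$" prefix, so strip it before converting.
            "price": float(food.findtext("price").lstrip("$")),
            "calories": int(food.findtext("calories")),
        })
    return rows


rows = menu_rows(MENU_XML)
for row in rows:
    print(row)
```

The tag names carry the structure, which is why no separate schema is needed to read the file.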
JSON Example
{
"custkey": "450002",
"useragent": {
"devicetype": "pc",
"experience": "browser",
"platform": "windows"
},
"pagetype": "home",
"productline": "none",
"customerprofile": {
"age": 20,
"gender": "male",
"customerinterests": [
"movies",
"fashion",
"music"
]
}
}
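A small sketch of reading nested values out of the clickstream record above, using Python's standard `json` module. The `get_path` helper is a hypothetical illustration of path-style access; it is not a Snowflake function.

```python
import json

# The clickstream record from the JSON example above.
CLICK_EVENT = """
{
  "custkey": "450002",
  "useragent": {
    "devicetype": "pc",
    "experience": "browser",
    "platform": "windows"
  },
  "pagetype": "home",
  "productline": "none",
  "customerprofile": {
    "age": 20,
    "gender": "male",
    "customerinterests": ["movies", "fashion", "music"]
  }
}
"""


def get_path(doc, path, default=None):
    """Walk a list of keys / array indexes; return default if any step is missing."""
    current = doc
    for step in path:
        if isinstance(current, dict) and step in current:
            current = current[step]
        elif isinstance(current, list) and isinstance(step, int) and 0 <= step < len(current):
            current = current[step]
        else:
            return default
    return current


event = json.loads(CLICK_EVENT)
print(get_path(event, ["useragent", "platform"]))
print(get_path(event, ["customerprofile", "customerinterests", 0]))
print(get_path(event, ["customerprofile", "income"]))  # field absent in this record
```

Returning a default for a missing field, rather than failing, reflects the "data that is incomplete" property listed earlier.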
Avro Example
• Schema: JSON
• Data: Binary
Why is this so hard for a
traditional Relational DBMS?
• Pre-defined Schema
• Inefficient to Query
• Constantly Changing
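To illustrate the pre-defined schema problem, here is a rough sketch using Python's built-in `sqlite3` as a stand-in for a traditional relational database. The `clicks` table, the sample records, and both helper functions are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A fixed schema decided before any data arrives.
conn.execute("CREATE TABLE clicks (custkey TEXT, pagetype TEXT)")


def missing_columns(conn, table, record):
    """Keys in the record that the table has no column for."""
    known = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return [key for key in record if key not in known]


def insert_record(conn, table, record):
    """Insert a record, altering the table first whenever a new field shows up.

    Table and column names are interpolated directly, so this is only safe
    for trusted, illustrative input.
    """
    for column in missing_columns(conn, table, record):
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT")
    columns = ", ".join(record)
    placeholders = ", ".join("?" for _ in record)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
        list(record.values()),
    )


first = {"custkey": "450002", "pagetype": "home"}
second = {"custkey": "450003", "pagetype": "home", "productline": "none"}

needed_before = missing_columns(conn, "clicks", second)
insert_record(conn, "clicks", first)
insert_record(conn, "clicks", second)
rows = conn.execute(
    "SELECT custkey, productline FROM clicks ORDER BY custkey"
).fetchall()
print(needed_before, rows)
```

Every new field forces a schema change, and older rows end up with NULLs in the added column.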
Current architectures can’t keep up
Data Pipeline / Data Lake Architecture – “ETL”
Source (Website Logs) → Stage (S3, 10 TB) → Data Lake (Hadoop, 30 TB) → Stage (S3, 5 TB) → EDW (MPP, 10 TB Disk)
• Summary
Other sources: Operational Systems, External Providers, Stream Data
One System for all Business Data
Structured data
Apple 101.12 250 FIH-2316
Pear 56.22 202 IHO-6912
Orange 98.21 600 WHQ-6090
Semi-structured data
{ "firstName": "John",
"lastName": "Smith",
"height_cm": 167.64,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
Diagram labels: Data Sink, HDFS, Storage, Map-Reduce Jobs
Relational
Databases
Oracle Exadata - Oracle Exadata X5 has many new software capabilities, including faster pure
columnar flash caching, database snapshots, flash cache resource management, near-instant
server death detection, I/O latency capping, and offload of JSON and XML analytics
(https://www.oracle.com/corporate/pressrelease/data-center-012115.html)
IBM Netezza - You can use the Jaql Netezza® module to read from or write to Netezza tables.
(www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/r0057926.html)
Postgres/Redshift - We recommend using JSON sparingly. JSON is not a good choice for storing
larger datasets because, by storing disparate data in a single column, JSON does not leverage
Amazon Redshift’s column store architecture.
(http://docs.aws.amazon.com/redshift/latest/dg/json-functions.html)
Relational Processing of
Semi-Structured Data
Why Support Semi-structured
data via SQL?
• Integrate with existing data
• Reduced administrative costs
• Improved security
• Transaction management
• Better performance
• Better resource allocation
• Increased developer productivity
• SQL is a proven model for performing queries,
especially joins.
Requirements for a Cloud-based Big
Data / Data Warehouse Platform
• No Contention (Writers can’t block readers)
• Continuous loading of data without “Windows”
• Compress, and don’t duplicate the data
• Segment the data (Datamarts) without replicating
• Ability to analyze structured and semi-structured
data together, at volume (TB-PB), using SQL
• Reduce Complexity, easy to manage and develop
with
• ELT vs. ETL, allowing processing to happen closer
to the data
• Security, encrypt all data at rest
Architectural Evolution of the Data
Warehouse
Scale Up
(SMP – Single Server)
RDBMS
Software
Storage
Architectural Evolution of the Data
Warehouse
• Scale Up (SMP – Single Server)
• Scale Out (MPP / Hadoop Cluster)
• Scale Up, Out or Down (Elastic / Cloud)
Diagram labels: Optimizer, Metadata / Schema
Cloud Service
• Database is separate from Virtual Warehouse
• One Virtual Warehouse, multiple Databases
• One Database, multiple Virtual Warehouses
Diagram labels: Virtual Warehouses, Databases, Users (Marketing, Finance, Biz Dev)
Data Pipeline / Snowflake Architecture – “ELT”
Operational
Systems
External
Providers
Stream Data
Amazon Cloud Data Pipeline Architecture
Diagram components: Stream Data (“JSON”), IAM, “Application”, Amazon Kinesis, S3 buckets, Amazon SQS, Amazon SNS, Amazon Glacier, EDW
Amazon Cloud Data Pipeline Architecture (Near Real-time)
Diagram components: Stream Data (“JSON”), IAM, “Application”, Amazon Kinesis, Storm / Spark, S3 buckets, Amazon SQS, Amazon SNS, Amazon Glacier, EDW
Typical customer environment
Data sources: OLTP databases, Enterprise applications, Third-party, Web applications, Other
Components: ETL, EDW, Datamarts, BI / Analytics, Hadoop
Demo Time!
Demo Scenarios
• Clickstream Analysis (load JSON, multi-table insert)
• Which Product Category is most clicked on?
• Which Product line does the customer self identify as
having the most interest in?
• Email us:
• Sales: [email protected]
• General: [email protected]
• Q&A
THANK YOU!
Functions
• REGEXP
• REGEXP_COUNT
• REGEXP_INSTR
• REGEXP_LIKE
• REGEXP_REPLACE
• REGEXP_SUBSTR
• RLIKE
• ARRAYAGG, ARRAY_AGG
• ARRAY_APPEND
• ARRAY_CAT
• ARRAY_COMPACT
• ARRAY_SIZE
• ARRAY_CONSTRUCT
• ARRAY_CONSTRUCT_COMPACT
• ARRAY_INSERT
• ARRAY_PREPEND
• ARRAY_SLICE
• ARRAY_TO_STRING
• CHECK_JSON
• PARSE_JSON
• OBJECT_CONSTRUCT
• OBJECT_INSERT
• GET
• GET_PATH
• AS_type
• IS_type
• IS_NULL_VALUE
• TO_JSON
• TYPEOF
Parsing JSON using Snowflake SQL
(After loading JSON file into Snowflake table)
FLATTEN() in Snowflake SQL
(Removing one level of nesting)
• A unique sequence # related to the input expression
• For maps or objects, it contains the key to the exploded value
• Path expression of the exploded value in the input expression
• For arrays, it contains the index in the array of the exploded value
• The expression contained in the collection
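What "removing one level of nesting" means can be sketched in plain Python. This is only an approximation of the idea, not Snowflake's implementation: the row fields `seq`, `key`, `path`, `index` and `value` are modeled loosely on FLATTEN's output columns, and the `person` record is hypothetical sample data.

```python
def flatten(value, path="", seq=1):
    """Explode one level of a dict or list into rows (illustrative sketch)."""
    rows = []
    if isinstance(value, dict):
        # Objects: each row carries the key of the exploded value.
        for key, item in value.items():
            child_path = f"{path}.{key}" if path else key
            rows.append({"seq": seq, "key": key, "path": child_path,
                         "index": None, "value": item})
    elif isinstance(value, list):
        # Arrays: each row carries the index of the exploded value.
        for index, item in enumerate(value):
            rows.append({"seq": seq, "key": None, "path": f"{path}[{index}]",
                         "index": index, "value": item})
    # Scalars have nothing to explode, so they yield no rows.
    return rows


# Hypothetical record, shaped like the demo data used in the queries.
person = {
    "fullName": "John Doe",
    "children": [
        {"name": "Jane", "age": 6},
        {"name": "Joe", "age": 9},
    ],
}

child_rows = flatten(person["children"], path="children")
for row in child_rows:
    print(row["path"], row["index"], row["value"])
```

Only one level is exploded per call: each child is still a nested object in `value`, and would need a second pass to be broken into its own fields.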
Parsing JSON using SQL directly from the file without loading into Snowflake
-- Query the staged JSON file directly; $1 is the single column holding each JSON record.
SELECT 'The First Person is ' || S.$1:fullName || ' ' ||
       'He is ' || S.$1:age || ' years of age. ' ||
       'His children are: ' || S.$1:children[0].name ||
       ' Who is a ' || S.$1:children[0].gender ||
       ' and is ' || S.$1:children[0].age || ' year(s) old '
FROM @~/json/json_sample_data (FILE_FORMAT => 'json') AS S
WHERE S.$1:fullName = 'John Doe';
Parsing JSON Records:
PARSE_JSON
Interprets an input string as a JSON document, producing a VARIANT value
SELECT s.fullrow:fullName Parent, c.value Children_Object,
c.value:name Child_Name, c.value:age Child_Age
FROM json_data_table AS S,
TABLE(FLATTEN(S.fullrow,'children')) c
WHERE PARSE_JSON(c.value:age) > 8;
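The shape of this query (parse each row, explode the children array, filter on the child's age) can be mirrored in a short Python sketch. The raw rows and the `children_older_than` helper are hypothetical and only approximate what the SQL does.

```python
import json

# Hypothetical raw rows, standing in for the fullrow column of json_data_table.
RAW_ROWS = [
    '{"fullName": "John Doe", "children": '
    '[{"name": "Jane", "age": 6}, {"name": "Joe", "age": 9}]}',
    '{"fullName": "Mary Smith", "children": [{"name": "Ann", "age": "12"}]}',
    '{"fullName": "No Kids"}',
]


def children_older_than(raw_rows, min_age):
    """Return (parent, child_name, child_age) for children above min_age."""
    results = []
    for raw in raw_rows:
        doc = json.loads(raw)  # string in, nested structure out
        # A missing "children" key simply yields no rows for that parent.
        for child in doc.get("children", []):
            age = child.get("age")
            # Ages may be stored as numbers or as numeric strings.
            if age is not None and float(age) > min_age:
                results.append((doc.get("fullName"), child.get("name"), age))
    return results


older = children_older_than(RAW_ROWS, 8)
print(older)
```

The numeric conversion before comparing matters because a semi-structured field can hold a number in one record and a string in the next.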
Parsing JSON Records:
CHECK_JSON
Valid JSON will produce NULL
SELECT CHECK_JSON('{"age": "15",
"gender": "Male",
"name": "John"}') ;
Valid JSON
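The "NULL means valid" convention can be imitated in Python as a rough analogue. This `check_json` is a hypothetical helper built on the standard `json` module; its error text will not match Snowflake's messages.

```python
import json


def check_json(text):
    """Return None when text is valid JSON, otherwise a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"{exc.msg} (position {exc.pos})"
    return None


valid = check_json('{"age": "15", "gender": "Male", "name": "John"}')
broken = check_json('{"age": "15", "gender": "Male", "name": "John"')  # missing }

print(valid)
print(broken)
```

A check like this is handy as a filter before loading, to list only the records that failed validation.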
User Interface: ODBC Driver, JDBC Driver, Web UI
Cloud Services: Optimization, Query Mgmt, Warehouse Mgmt, Security, Metadata
Virtual Warehouse Processing (EC2): Customer Service, Financial Analysts, Quality Control, Loading
Database Storage (S3): Data, Sales, Marketing, Materials
Cloud Infrastructure: Amazon AWS
Snowflake Architecture
User Interface
ODBC Driver JDBC Driver Web UI
DML DDL
Nodes: 1, 4, 9, B, F, H, J, M
Metadata
Snowflake Architecture
User Interface
ODBC Driver JDBC Driver Web UI
DML DDL
AWS cloud
Snowflake High Availability Architecture
SQL
Fully Replicated
Metadata
Virtual
Warehouses
Fully Replicated
Database Storage
Enterprise-class data warehouse:
Security
Authentication
• Embedded multi-factor authentication server
• Federated authentication via SAML 2.0 (in development)
Access control
…
Data encryption
• Encryption at rest for database data
• Encryption of Snowflake metadata
• Snowflake-managed keys