Getting Started With Amazon Redshift
Amazon Redshift
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Introduction
• Benefits
• Use cases
• Getting started
• Q&A
What is Big Data?
Generate → Collect & Store → Analyze
• Generate: individual AWS customers generating over PB/day
• Collect & Store: store exabytes of data in S3
• Analyze: highly constrained
Chart: Generated Data vs. Available for Analysis, by year (1990 to 2020)
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
AWS Big Data Portfolio
Collect Store Analyze
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
150+ features
a lot faster
a lot simpler
a lot cheaper
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
• Interactive data analysis and recommendation engine
• Ride analytics for pricing and product development
• Ad prediction and on-demand analytics
Use Case: Business Applications
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can
focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several
data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Leader node
Compute nodes
• Local columnar storage
• Direct-attached storage
• 10 GigE (HPC)
• Parallel/distributed execution of all queries, loads, backups, restores, resizes
• Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
Operations shown in the diagram: ingestion, load, export, backup, restore, resize
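The leader-node / compute-node split above can be illustrated with a toy model. This is a minimal sketch, not Redshift's implementation: the hash function, the three-node count, the sample rows, and all function names are assumptions made for illustration. Each "compute node" aggregates only its local rows, and the "leader" merges the partial results.

```python
from collections import Counter

NUM_COMPUTE_NODES = 3  # assumption: mirrors the three compute nodes drawn on the slide


def distribute(rows, num_nodes):
    """Assign each (key, value) row to a node by hashing its key (a stand-in for a distribution key)."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        # Toy deterministic hash: sum of character codes. Real systems use stronger hashes.
        node_id = sum(ord(ch) for ch in key) % num_nodes
        nodes[node_id].append((key, value))
    return nodes


def compute_node_partial_sum(local_rows):
    """Each compute node sums values per key over the rows it stores locally."""
    partial = Counter()
    for key, value in local_rows:
        partial[key] += value
    return partial


def leader_merge(partials):
    """The leader combines the per-node partial sums into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Hypothetical sample data: (region, amount) pairs.
rows = [("us", 10), ("eu", 5), ("us", 7), ("jp", 3), ("eu", 1)]
nodes = distribute(rows, NUM_COMPUTE_NODES)
result = leader_merge(compute_node_partial_sum(n) for n in nodes)
print(result)
```

Because rows with the same key always hash to the same node, each node's partial sum for a key is already complete, and the merge step only has to combine disjoint results.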
Benefit #1: Amazon Redshift is fast
“Did I mention that it’s ridiculously fast? We’re using it to provide our analysts with an alternative to Hadoop”
“After investigating Redshift, Snowflake, and BigQuery, we found that Redshift offers top-of-the-line performance at best-in-market price points”
“We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily”
“We saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement”
And has gotten faster...
Efficient
The life of a query
Diagram: clients (BI tools, analytics tools, SQL clients) → leader node (Queue 1, Queue 2) → compute nodes in the Amazon Redshift cluster; steps numbered 1 to 3
Query monitoring rules
• Allows automatic handling of runaway (poorly written) queries
• Metrics with operators and values (e.g. query_cpu_time > 1000) create a predicate
• Multiple rules can be defined for a queue in WLM. These rules are OR-ed together
Pre-defined templates for common use cases
Query monitoring rules
Common use cases:
• Protect interactive queues
INTERACTIVE = { “query_execution_time > 15 sec” or
“query_cpu_time > 1500 uSec” or
“query_blocks_read > 18000 blocks” } [HOP]
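The rule structure on these two slides (a metric, an operator, and a value form a predicate; rules on a queue are OR-ed together) can be sketched as a small evaluator. This is an illustrative model only, not the WLM configuration format or any Redshift API: the `Rule` class, `triggered_action`, and the sample metric readings are hypothetical, and the thresholds are copied from the slide with its units left unconverted.

```python
import operator
from dataclasses import dataclass

# Comparison operators a predicate may use (assumed set for this sketch).
OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass
class Rule:
    metric: str   # e.g. "query_cpu_time"
    op: str       # one of the keys in OPERATORS
    value: float  # threshold, in whatever unit the metric is reported in
    action: str   # e.g. "hop", as on the slide

    def matches(self, metrics):
        """True when the metric is present and the predicate holds."""
        if self.metric not in metrics:
            return False
        return OPERATORS[self.op](metrics[self.metric], self.value)


def triggered_action(rules, metrics):
    """Rules are OR-ed: the first rule whose predicate holds decides the action."""
    for rule in rules:
        if rule.matches(metrics):
            return rule.action
    return None


# The INTERACTIVE example from the slide, one rule per predicate.
INTERACTIVE = [
    Rule("query_execution_time", ">", 15, "hop"),
    Rule("query_cpu_time", ">", 1500, "hop"),
    Rule("query_blocks_read", ">", 18000, "hop"),
]

# Hypothetical metric readings for two queries.
long_query = {"query_execution_time": 42, "query_cpu_time": 100, "query_blocks_read": 10}
short_query = {"query_execution_time": 2, "query_cpu_time": 100, "query_blocks_read": 10}

print(triggered_action(INTERACTIVE, long_query))
print(triggered_action(INTERACTIVE, short_query))
```

Here the long-running query exceeds the execution-time threshold and is hopped out of the interactive queue, while the short query triggers no rule and stays where it is.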
Continuous/incremental backups
Multiple copies within cluster
Amazon S3
Benefit #3: Amazon Redshift is fully managed
Fault tolerance
Disk failures
Node failures
Region 2
Amazon S3
Node fault tolerance
• Data-path monitoring agents: node-level monitoring can detect SW/HW issues and take action
• Failure is detected at one of the compute nodes
• Redshift parks the connections
• Queries are re-submitted
• Cluster-level monitoring agents: additional monitoring layer for the leader node and network
Benefit #4: Security is built-in
Customer VPC
• Machine learning
• Data science
Benefit #6: Amazon Redshift has a large ecosystem
Amazon Redshift
• EC2/SSH
• DynamoDB
• RDS/Aurora
• Amazon ML
• EMR
• CloudSearch
• Data Pipeline
• Amazon Mobile Analytics
• S3
• Amazon Kinesis
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• High concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Amazon Redshift SQL support
Life of a query
Query:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
JDBC/ODBC
Amazon Redshift
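Before a query like the one above can run, the S3 data has to be exposed as an external table. The following is a hedged sketch of that setup, held as SQL strings in Python so it can be sent through any DB-API cursor. Every name in it is a placeholder rather than a value from the slides: the schema `s3`, the database `spectrumdb`, the IAM role ARN, the bucket path, the column list, and the grouping column in the query (the slide's own `GROUP BY` clause is elided).

```python
# Hypothetical setup statements: all identifiers, the ARN, and the S3 path are placeholders.
SPECTRUM_SETUP = [
    """
    CREATE EXTERNAL SCHEMA s3
    FROM DATA CATALOG
    DATABASE 'spectrumdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
    """,
    """
    CREATE EXTERNAL TABLE s3.ext_table (
        user_id    BIGINT,
        event_type VARCHAR(64)
    )
    STORED AS PARQUET
    LOCATION 's3://example-bucket/events/'
    """,
]

# Illustrative query in the shape of the slide's example; the grouping column is assumed.
QUERY = """
SELECT event_type, COUNT(*)
FROM s3.ext_table
GROUP BY event_type
"""


def run_all(cursor):
    """Run the setup statements, then the query, on a DB-API style cursor."""
    for statement in SPECTRUM_SETUP:
        cursor.execute(statement)
    cursor.execute(QUERY)
    return cursor.fetchall()
```

In practice `run_all` would receive a cursor from a PostgreSQL-compatible driver connected over JDBC/ODBC-equivalent access to the cluster. The connection would likely need autocommit enabled, since external table DDL cannot run inside a transaction block.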
Storage
• Amazon S3: Exabyte-scale Object Storage
• AWS Glue Data Catalog: Hive-compatible Metastore
Serverless Compute
• Amazon Kinesis Firehose: Real-Time Data Streaming
• AWS Glue: ETL & Data Catalog
• Amazon Redshift Spectrum: Fast @ Exabyte scale
• AWS Lambda: Trigger-based Code Execution
Data Processing
• Amazon EMR: Managed Hadoop Applications
• Amazon Redshift: Petabyte-scale Data Warehousing
• Amazon Athena: Interactive Query
Over 20 customers helped preview Amazon Redshift Spectrum
Use cases
NTT Docomo: Japan’s largest mobile service provider
Greenplum on-premises
Previous solution
• Legacy DW (Oracle)—query across 1 week/hr
• Hadoop—query across 1 month/hr
Results with Amazon Redshift
• Query 15 months in 14 min
• 100 node DS2.8XL clusters
• 20% time of one DBA
Detail Pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
• https://aws.amazon.com/redshift/developer-resources/
• Amazon Redshift Utilities - GitHub
Best Practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Thank you!