AWS Certified Big Data Slides

aws big data exam course

NOT FOR DISTRIBUTION © Stephane Maarek www.datacumulus.com
Disclaimer: These slides are copyrighted
and strictly for personal use only
• This document is reserved for people enrolled into the
AWS Certified Big Data Specialty course by Stephane Maarek and Frank
Kane.

• Please do not share this document, it is intended for personal use and
exam preparation only, thank you.

• If you’ve obtained these slides for free on a website that is not the course’s
website, please reach out to [email protected]. Thanks!

• Best of luck for the exam and happy learning!

sundog-education.com
datacumulus.com
© 2019 All Rights Reserved Worldwide
NOT FOR DISTRIBUTION © Stephane Maarek www.datacumulus.com
AWS Certified Data Analytics
Specialty Course
DAS-C01

Welcome! We’re starting in 5 minutes
• We’re going to prepare for the Data Analytics Specialty exam –
DAS-C01
• It’s a challenging certification, so this course will be long and
interesting
• Recommended to have previous AWS knowledge (EC2,
networking…)
• Preferred to have some data / analytics background
• We will cover all the AWS Data Analytics services related to the
exam
• Take your time, it’s not a race!

My certification: 94%

BDS-C00

DAS-C01

About me
• I’m Stephane!
• Worked as an IT consultant and AWS Big Data Architect, Developer & SysOps
• Worked with AWS many years: built websites, apps, streaming platforms
• Veteran Instructor on AWS (Certifications, CloudFormation, Lambda, EC2…)

• You can find me on
• LinkedIn: https://www.linkedin.com/in/stephanemaarek
• Instagram: https://www.instagram.com/stephanemaarek/
• Twitter: https://twitter.com/stephanemaarek
• Medium: https://medium.com/@stephane.maarek
• GitHub: https://github.com/simplesteph

About me
• I’m Frank!
• 9 years at Amazon as a Sr. Software Engineer and Sr. Manager
• Focused on machine learning / recommender systems in big
data environment
• Owner of Sundog Education – Big Data & ML

• You can find me on
• LinkedIn: https://www.linkedin.com/in/fkane/
• Twitter: http://www.twitter.com/SundogEducation
• Facebook: https://www.facebook.com/SundogEdu

Services we’ll learn
• COLLECTION: Amazon Kinesis, AWS IoT Core, AWS Snowball, Amazon SQS, AWS DMS, AWS Direct Connect
• STORAGE: S3 + Glacier, DynamoDB, ElastiCache
• PROCESSING: AWS Lambda, Amazon ML, AWS Glue, Amazon SageMaker, Amazon EMR, AWS Data Pipeline
• ANALYSIS: Elasticsearch, Amazon Athena, Amazon Redshift
• VISUALIZATION: Amazon QuickSight
• SECURITY: AWS KMS, AWS CloudHSM

Course Cost

Introducing our case study

Our case study: cadabra.com

Requirement 1:
Order history app

Client app
Server logs

Requirement 2:
Product recommendations

Server logs

Requirement 3:
Transaction rate alarm

Server logs

Requirement 4:
Near-real-time log analysis

Server logs

Requirement 5: Data warehousing &
visualization
(serverless)

Server logs

(managed)

Putting it all together

Server logs

Client app

Collection
Moving data into AWS

Collection Introduction
• Real Time - Immediate actions
• Kinesis Data Streams (KDS)
• Simple Queue Service (SQS)
• Internet of Things (IoT)
• Near-real time - Reactive actions
• Kinesis Data Firehose (KDF)
• Database Migration Service (DMS)
• Batch - Historical Analysis
• Snowball
• Data Pipeline

AWS Kinesis Overview
• Kinesis is a managed alternative to Apache Kafka
• Great for application logs, metrics, IoT, clickstreams
• Great for “real-time” big data
• Great for streaming processing frameworks (Spark, NiFi, etc…)
• Data is automatically replicated synchronously to 3 AZ

• Kinesis Streams: low latency streaming ingest at scale


• Kinesis Analytics: perform real-time analytics on streams using
SQL
• Kinesis Firehose: load streams into S3, Redshift, ElasticSearch &
Splunk

Kinesis
[Diagram: click streams, IoT devices, and metrics & logs flow into Amazon Kinesis (Streams → Analytics → Firehose), which delivers to an Amazon S3 bucket and Amazon Redshift]

Kinesis Streams Overview
• Streams are divided into ordered Shards / Partitions
[Diagram: producers → Shard 1, Shard 2, Shard 3 → consumers]

• Data retention is 24 hours by default, can go up to 7 days


• Ability to reprocess / replay data
• Multiple applications can consume the same stream
• Real-time processing with scale of throughput
• Once data is inserted in Kinesis, it can’t be deleted (immutability)

Kinesis Streams Shards
• One stream is made of many different shards
• Billing is per shard provisioned, can have as many shards as
you want
• Batching available or per message calls.
• The number of shards can evolve over time (reshard / merge)
• Records are ordered per shard
[Diagram: producers → Shard 1, Shard 2, Shard 3, Shard 4 → consumers]

Kinesis Streams Records
• Data Blob: data being sent, serialized as bytes. Up to 1 MB. Can represent anything
• Record Key:
• Sent alongside a record, helps to group records in Shards. Same key = Same shard.
• Use a highly distributed key to avoid the “hot partition” problem
• Sequence number: Unique identifier for each record put in shards. Added by Kinesis after ingestion
[Diagram: a record is made of a Data Blob (bytes, up to 1 MB), a Record Key, and a Sequence Number]
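
As a rough illustration of "same key = same shard", the sketch below hashes a partition key with MD5 into a 128-bit integer and maps it onto hash key ranges, which is how Kinesis Data Streams assigns records to shards. The function name `shard_for_key` is hypothetical, and the even split of the hash space is an assumption: real streams can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the 0-based index of the shard that would own this key.

    Assumption: the hash key space is split evenly across shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the last shard absorbs the remainder of the hash space.
    return min(hash_key // range_size, shard_count - 1)


# One constant key sends every record to a single shard (hot partition),
# while a high-cardinality key such as a user ID spreads the load.
hot = {shard_for_key("all-orders", 4) for _ in range(1000)}
spread = {shard_for_key(f"user-{i}", 4) for i in range(1000)}
```

Here `hot` ends up containing a single shard index, while `spread` covers several shards.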

Kinesis Data Streams Limits to know
• Producer:
• 1MB/s or 1000 messages/s at write PER SHARD
• “ProvisionedThroughputExceededException” otherwise
• Consumer Classic:
• 2MB/s at read PER SHARD across all consumers
• 5 API calls per second PER SHARD across all consumers
• Consumer Enhanced Fan-Out:
• 2MB/s at read PER SHARD, PER ENHANCED CONSUMER
• No API calls needed (push model)
• Data Retention:
• 24 hours data retention by default
• Can be extended to 7 days
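
The per-shard limits above translate into a simple sizing calculation. The sketch below is only an estimate under the classic-consumer numbers quoted on this slide; the function name `shards_needed` and the example workload are made up for illustration.

```python
import math


def shards_needed(write_mb_per_s: float, records_per_s: float,
                  read_mb_per_s: float) -> int:
    """Estimate the shard count from the per-shard limits quoted above."""
    by_write_bytes = math.ceil(write_mb_per_s / 1.0)    # 1 MB/s in per shard
    by_write_count = math.ceil(records_per_s / 1000.0)  # 1000 records/s in per shard
    by_read_bytes = math.ceil(read_mb_per_s / 2.0)      # 2 MB/s out per shard (classic)
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


# Hypothetical workload: 3.5 MB/s written as 2500 records/s, 9 MB/s read in total.
estimate = shards_needed(write_mb_per_s=3.5, records_per_s=2500, read_mb_per_s=9)
```

In this example the read side is the bottleneck, so the estimate is driven by the 2 MB/s consumer limit rather than by the producer limits.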

Kinesis Producers
• Kinesis SDK
• Kinesis Producer Library (KPL)
• Kinesis Agent
• 3rd party libraries: Spark, Log4J Appenders, Flume, Kafka Connect, NiFi…
[Diagram: SDK, Kinesis Producer Library (KPL), and Kinesis Agent write to Amazon Kinesis Streams]

Kinesis Producer SDK - PutRecord(s)
• APIs that are used are PutRecord (one) and PutRecords (many
records)
• PutRecords uses batching and increases throughput => fewer HTTP requests
• ProvisionedThroughputExceeded if we go over the limits
• + AWS Mobile SDK: Android, iOS, etc...
• Use case: low throughput, higher latency, simple API, AWS Lambda

• Managed AWS sources for Kinesis Data Streams:


• CloudWatch Logs
• AWS IoT
• Kinesis Data Analytics

AWS Kinesis API – Exceptions
• ProvisionedThroughputExceeded Exceptions
• Happens when sending more data than a shard allows (exceeding MB/s or TPS for any shard)
• Make sure you don’t have a hot shard (for example, a poorly chosen partition key that sends too much data to one partition)

• Solution:
• Retries with backoff
• Increase shards (scaling)
• Ensure your partition key is a good one
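
"Retries with backoff" can be sketched as follows. This is a minimal illustration, not SDK code: `ThroughputExceeded` is a hypothetical stand-in for the throughput-exceeded error, and `send` is any callable that writes one record.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the SDK's throughput-exceeded error."""


def put_with_backoff(send, record, max_attempts=5, base_delay=0.1,
                     sleep=time.sleep):
    """Call send(record), retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return send(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Wait a random time between 0 and base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

The random jitter keeps many throttled producers from retrying at the same instant. Backoff only helps with short bursts; a persistently hot shard still needs more shards or a better partition key.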

Kinesis Producer Library (KPL)
• Easy to use and highly configurable C++ / Java library
• Used for building high performance, long-running producers
• Automated and configurable retry mechanism
• Synchronous or Asynchronous API (better performance for async)
• Submits metrics to CloudWatch for monitoring
• Batching (both turned on by default) – increase throughput, decrease
cost:
• Collect Records and Write to multiple shards in the same PutRecords API call
• Aggregate – increased latency
• Capability to store multiple records in one record (go over 1000 records per second limit)
• Increase payload size and improve throughput (maximize 1MB/s limit)
• Compression must be implemented by the user
• KPL Records must be de-coded with KCL or special helper library

Kinesis Producer Library (KPL)
[Diagram: Batching. User records (2 KB, 40 KB, 500 KB and 1 KB, 30 KB, 80 KB, 200 KB) are each aggregated into one record < 1 MB, then collected into a single PutRecords call]
• We can influence the batching efficiency by introducing some delay with RecordMaxBufferedTime (default 100ms)
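
The aggregation idea can be sketched as greedy packing of small user records into records that stay under the 1 MB limit. This is a toy model rather than the KPL's actual algorithm: it ignores the time-based flush controlled by RecordMaxBufferedTime, and the function name `aggregate` is hypothetical.

```python
MAX_RECORD_BYTES = 1024 * 1024  # 1 MB per Kinesis record


def aggregate(sizes, limit=MAX_RECORD_BYTES):
    """Greedily pack user-record sizes (in bytes) into aggregated records."""
    batches, current, used = [], [], 0
    for size in sizes:
        if size > limit:
            raise ValueError("a single user record exceeds the record limit")
        if used + size > limit and current:
            batches.append(current)   # current aggregate is full: close it
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        batches.append(current)       # flush the last partial aggregate
    return batches


KB = 1024
packed = aggregate([600 * KB, 500 * KB, 100 * KB])
```

Here the 600 KB record fills most of the first aggregate, so the 500 KB and 100 KB records share a second one.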

Kinesis Agent
• Monitors log files and sends them to Kinesis Data Streams
• Java-based agent, built on top of KPL
• Install in Linux-based server environments

• Features:
• Write from multiple directories and write to multiple streams
• Routing feature based on directory / log file
• Pre-process data before sending to streams (single line, csv to json, log to
json…)
• The agent handles file rotation, checkpointing, and retry upon failures
• Emits metrics to CloudWatch for monitoring

Kinesis Consumers - Classic
• Kinesis SDK
• Kinesis Client Library (KCL)
• Kinesis Connector Library
• 3rd party libraries: Spark, Log4J Appenders, Flume, Kafka Connect…
• Kinesis Firehose
• AWS Lambda
• (Kinesis Consumer Enhanced Fan-Out discussed in the next lecture)
[Diagram: Amazon Kinesis Streams read by the SDK, Kinesis Client Library (KCL), Kinesis Connector Library, Firehose, and AWS Lambda]
Kinesis Consumer SDK - GetRecords
• Classic Kinesis - Records are polled by consumers from a shard
• Each shard has 2 MB/s total aggregate throughput
• GetRecords returns up to 10MB of data (then throttle for 5 seconds) or up to 10000 records
• Maximum of 5 GetRecords API calls per shard per second = 200ms latency
• If 5 consumer applications consume from the same shard, every consumer can poll once a second and receive less than 400 KB/s
[Diagram: Producer → Kinesis Data Streams (Shard 1, Shard 2, … Shard N) ← GetRecords() ← Consumer Applications A, B, C]

Kinesis Client Library (KCL)
• Java-first library but exists for other languages too (Golang, Python, Ruby, Node, .NET …)
• Read records from Kinesis produced with the KPL (de-aggregation)
• Share multiple shards with multiple consumers in one “group”, shard discovery
• Checkpointing feature to resume progress
• Leverages DynamoDB for coordination and checkpointing (one row per shard)
• Make sure you provision enough WCU / RCU
• Or use On-Demand for DynamoDB
• Otherwise DynamoDB may slow down KCL
• Record processors will process the data
[Diagram: Amazon Kinesis–enabled apps consume messages and checkpoint progress in Amazon DynamoDB]

Kinesis Connector Library
• Older Java library (2016), leverages the KCL library
• Write data to:
• Amazon S3
• DynamoDB
• Redshift
• ElasticSearch
• Kinesis Firehose replaces the Connector Library for a few of these targets, Lambda for the others
[Diagram: Amazon Kinesis Data Streams → Connector Library (running on EC2) → Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon Elasticsearch Service]

AWS Lambda sourcing from Kinesis
• AWS Lambda can source records from Kinesis Data Streams
• Lambda consumer has a library to de-aggregate records from the KPL
• Lambda can be used to run lightweight ETL to:
• Amazon S3
• DynamoDB
• Redshift
• ElasticSearch
• Anywhere you want
• Lambda can be used to trigger notifications / send emails in real time
• Lambda has a configurable batch size (more in Lambda section)

Kinesis Enhanced Fan Out
• New game-changing feature from August 2018.
• Works with KCL 2.0 and AWS Lambda (Nov 2018)
• Each Consumer gets 2 MB/s of provisioned throughput per shard
• That means 20 consumers will get 40MB/s per shard aggregated
• No more 2 MB/s limit!
• Enhanced Fan Out: Kinesis pushes data to consumers over HTTP/2
• Reduce latency (~70 ms)
[Diagram: Producer → Kinesis Data Streams (Shard 1); Consumer Applications A and B each call SubscribeToShard() and receive pushed data at 2 MB/s]

Enhanced Fan-Out vs Standard
Consumers
• Standard consumers:
• Low number of consuming applications (1,2,3…)
• Can tolerate ~200 ms latency
• Minimize cost

• Enhanced Fan Out Consumers:


• Multiple Consumer applications for the same Stream
• Low Latency requirements ~70ms
• Higher costs (see Kinesis pricing page)
• Default limit of 5 consumers using enhanced fan-out per data stream
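
The trade-off between the two consumer types comes down to simple arithmetic on the per-shard limits. The sketch below works through the numbers quoted in these slides; the function names are hypothetical and 1 MB is treated as 1000 KB for readability.

```python
def classic_per_consumer(consumers: int):
    """Return (polls per second, max KB/s) for each consumer on one shard."""
    polls = 5 / consumers         # 5 GetRecords calls/s are shared per shard
    kb_per_s = 2000 / consumers   # 2 MB/s is shared per shard
    return polls, kb_per_s


def fan_out_aggregate_mb(consumers: int) -> int:
    """Total MB/s read from one shard with enhanced fan-out."""
    return 2 * consumers          # each consumer gets its own 2 MB/s


classic = classic_per_consumer(5)    # 1 poll/s and 400 KB/s each
fan_out = fan_out_aggregate_mb(20)   # 40 MB/s in aggregate
```

With standard consumers, adding applications divides a fixed budget. With enhanced fan-out the budget grows with the number of consumers, at a higher cost.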

Kinesis Operations – Adding Shards
• Also called “Shard Splitting”
• Can be used to increase the Stream capacity (1 MB/s data in per
shard)
• Can be used to divide a “hot shard”
• The old shard is closed and will be deleted once the data is expired

[Diagram: Shard 2 is split into Shard 4 (new) and Shard 5 (new); Shard 1 and Shard 3 are unchanged]

Kinesis Operations – Merging Shards
• Decrease the Stream capacity and save costs
• Can be used to group two shards with low traffic
• Old shards are closed and deleted based on data expiration

[Diagram: Shard 1 and Shard 4 are merged into Shard 6; Shard 5 and Shard 3 are unchanged]

Kinesis Operations – Auto Scaling
• Auto Scaling is not a native feature of Kinesis
• The API call to change the number of shards is UpdateShardCount
• We can implement Auto Scaling with AWS Lambda
• See: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/

Kinesis Scaling Limitations
• Resharding cannot be done in parallel. Plan capacity in advance
• You can only perform one resharding operation at a time and it takes a few seconds
• For 1000 shards, it takes 30K seconds (8.3 hours) to double the shards to 2000

• You can’t do the following:


• Scale more than twice for each rolling 24-hour period for each stream
• Scale up to more than double your current shard count for a stream
• Scale down below half your current shard count for a stream
• Scale up to more than 500 shards in a stream
• Scale a stream with more than 500 shards down unless the result is fewer than 500
shards
• Scale up to more than the shard limit for your account
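
Some of the restrictions above can be checked before calling UpdateShardCount. The sketch below covers only the double, half, and maximum-shard rules as stated on this slide. The function name `check_target` is hypothetical, and the default of 500 is taken from the slide and may not match the quota on a given account.

```python
def check_target(current: int, target: int, max_shards: int = 500):
    """Return the list of violated rules (empty means the request looks valid)."""
    problems = []
    if target > 2 * current:
        problems.append("more than double the current shard count")
    if target < current / 2:
        problems.append("below half the current shard count")
    if target > current and target > max_shards:
        problems.append(f"scaling up beyond {max_shards} shards")
    return problems


ok = check_target(10, 20)        # doubling is allowed
too_big = check_target(10, 21)   # one shard past double is rejected
```

The rolling 24-hour limit is not modelled here, since it depends on the history of previous scaling calls.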

Kinesis Security
• Control access / authorization using IAM policies
• Encryption in flight using HTTPS endpoints
• Encryption at rest using KMS
• Client side encryption must be manually implemented (harder)
• VPC Endpoints available for Kinesis to access within VPC

AWS Kinesis Data Firehose
• Fully Managed Service, no administration
• Near Real Time (60 seconds latency minimum for non full batches)
• Load data into Redshift / Amazon S3 / ElasticSearch / Splunk
• Automatic scaling
• Supports many data formats
• Data Conversions from JSON to Parquet / ORC (only for S3)
• Data Transformation through AWS Lambda (ex: CSV => JSON)
• Supports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)
• Only GZIP if the data is further loaded into Redshift
• Pay for the amount of data going through Firehose
• Spark / KCL do not read from KDF

Kinesis Data Firehose Diagram
[Diagram: sources (SDK, Kinesis Producer Library (KPL), Kinesis Agent, Kinesis Data Streams, CloudWatch Logs & Events, IoT rules actions) → Amazon Kinesis Data Firehose, with an optional Lambda function for transformation → destinations (Amazon S3, Redshift, ElasticSearch)]

Kinesis Data Firehose Delivery Diagram

[Diagram: Source → Delivery stream → Data Transformation (several “blueprint” templates available) → output to an Amazon S3 output bucket → COPY into Amazon Redshift; source records, transformation failures, and delivery failures can be written to another Amazon S3 bucket]

Firehose Buffer Sizing
• Firehose accumulates records in a buffer
• The buffer is flushed based on time and size rules

• Buffer Size (ex: 32MB): if that buffer size is reached, it’s flushed
• Buffer Time (ex: 2 minutes): if that time is reached, it’s flushed
• Firehose can automatically increase the buffer size to increase
throughput

• High throughput => Buffer Size will be hit


• Low throughput => Buffer Time will be hit
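
The two flush rules can be sketched with a toy buffer. This is only a model of the behaviour described above, not Firehose code: the class name `Buffer` and the caller-supplied clock are assumptions made for illustration.

```python
class Buffer:
    """Toy buffer that flushes on a size rule or a time rule."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records, self.size, self.started = [], 0, None
        self.flushes = []  # one (reason, number of records) entry per flush

    def add(self, record: bytes, now: float):
        self.tick(now)                  # the time rule may fire first
        if self.started is None:
            self.started = now          # first record opens a new buffer
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes:
            self._flush("size")         # high throughput: size rule wins

    def tick(self, now: float):
        if self.records and now - self.started >= self.max_seconds:
            self._flush("time")         # low throughput: time rule wins

    def _flush(self, reason):
        self.flushes.append((reason, len(self.records)))
        self.records, self.size, self.started = [], 0, None
```

Feeding this buffer a fast stream triggers only "size" flushes, while a trickle of records triggers only "time" flushes.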

Kinesis Data Streams vs Firehose
• Streams
• Going to write custom code (producer / consumer)
• Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
• Must manage scaling (shard splitting / merging)
• Data Storage for 1 to 7 days, replay capability, multi consumers
• Use with Lambda to insert data in real-time to ElasticSearch (for example)

• Firehose
• Fully managed, send to S3, Splunk, Redshift, ElasticSearch
• Serverless data transformations with Lambda
• Near real time (lowest buffer time is 1 minute)
• Automated Scaling
• No data storage

AWS SQS
What’s a queue?
[Diagram: producers send messages to an SQS queue; consumers poll messages from the queue]

AWS SQS – Standard Queue
• Oldest offering (over 10 years old)
• Fully managed
• Scales from 1 message per second to 10,000s per second
• Default retention of messages: 4 days, maximum of 14 days
• No limit to how many messages can be in the queue
• Low latency (<10 ms on publish and receive)
• Horizontal scaling in terms of number of consumers
• Can have duplicate messages (at least once delivery, occasionally)
• Can have out of order messages (best effort ordering)
• Limitation of 256KB per message sent

SQS – Producing Messages
• Define Body
• Add message attributes (metadata – optional)
• Provide Delay Delivery (optional)
• Get back …
• Message identifier
• MD5 hash of the body
[Diagram: a message sent to SQS consists of a Message Body (String, up to 256 KB) and Attributes (Name / Type / Value)]

SQS – Consuming Messages
• Consumers…
• Poll SQS for messages (receive up to 10 messages at a time)
• Process the message within the visibility timeout
• Delete the message using the message ID & receipt handle

[Diagram: the consumer polls messages, processes a message, then deletes the message]

AWS SQS – FIFO Queue
• Newer offering (First In - First out) – not available in all regions!
• Name of the queue must end in .fifo
• Lower throughput (up to 3,000 per second with batching, 300/s without)
• Messages are processed in order by the consumer
• Messages are sent exactly once
• 5-minute interval de-duplication using “Deduplication ID”
[Diagram: SQS FIFO Queue holding messages 1 to 5 in order]
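
The 5-minute de-duplication window can be modelled in a few lines. This is a toy in-memory sketch, not the SQS API: the class name `FifoDedup` and the boolean return value (whether the message was enqueued) are assumptions made for illustration.

```python
DEDUP_WINDOW_SECONDS = 5 * 60


class FifoDedup:
    """Toy model of a FIFO queue's 5-minute de-duplication window."""

    def __init__(self):
        self.seen = {}    # deduplication ID -> time it was first accepted
        self.queue = []   # accepted message bodies, in arrival order

    def send(self, dedup_id: str, body: str, now: float) -> bool:
        """Enqueue the message unless its ID was accepted within the window."""
        first = self.seen.get(dedup_id)
        if first is not None and now - first < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window: not enqueued again
        self.seen[dedup_id] = now
        self.queue.append(body)
        return True
```

A retry with the same ID inside the window is dropped, while the same ID sent after the window has passed is treated as a new message.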

SQS Extended Client
• Message size limit is 256KB, how to send large messages?
• Using the SQS Extended Client (Java Library)

[Diagram: the producer sends the large message to an Amazon S3 bucket and a small metadata message to the SQS queue; the consumer reads the small metadata message and retrieves the large message from S3]
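
The pattern behind the Extended Client (store the large body elsewhere, send a small pointer through the queue) can be sketched with in-memory stand-ins. This is not the Java library's API: `bucket` and `queue` are plain Python objects standing in for S3 and SQS, and the message layout is made up for illustration.

```python
import json
import uuid

SIZE_LIMIT = 256 * 1024  # SQS message size limit in bytes

bucket = {}   # stands in for an S3 bucket: key -> payload
queue = []    # stands in for an SQS queue of small JSON messages


def send(payload: bytes):
    """Send small payloads inline and large payloads via a pointer."""
    if len(payload) <= SIZE_LIMIT:
        queue.append(json.dumps({"inline": payload.decode("utf-8")}))
        return
    key = str(uuid.uuid4())
    bucket[key] = payload                       # large body goes to "S3"
    queue.append(json.dumps({"s3_key": key}))   # small pointer goes to "SQS"


def receive() -> bytes:
    """Read the next message, following the pointer if there is one."""
    message = json.loads(queue.pop(0))
    if "s3_key" in message:
        return bucket[message["s3_key"]]
    return message["inline"].encode("utf-8")
```

The queue only ever carries small messages, so the 256 KB limit applies to the pointer rather than to the payload.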

AWS SQS Use Cases
• Decouple applications
(for example to handle payments asynchronously)
• Buffer writes to a database
(for example a voting application)
• Handle large loads of messages coming in
(for example an email sender)

• SQS can be integrated with Auto Scaling through CloudWatch!

SQS Limits
• Maximum of 120,000 in-flight messages being processed by
consumers
• Batch Request has a maximum of 10 messages – max 256KB
• Message content is XML, JSON, Unformatted text
• Standard queues have an unlimited TPS
• FIFO queues support up to 3,000 messages per second (using
batching)
• Max message size is 256KB (or use Extended Client)
• Data retention from 1 minute to 14 days
• Pricing:
• Pay per API Request
• Pay per network usage

AWS SQS Security
• Encryption in flight using the HTTPS endpoint
• Can enable SSE (Server Side Encryption) using KMS
• Can set the CMK (Customer Master Key) we want to use
• SSE only encrypts the body, not the metadata
(message ID, timestamp, attributes)
• IAM policy must allow usage of SQS
• SQS queue access policy
• Finer grained control over IP
• Control over the time the requests come in

Kinesis Data Stream vs SQS
• Kinesis Data Stream:
• Data can be consumed many times
• Data is deleted after the retention period
• Ordering of records is preserved (at the shard level) – even during replays
• Build multiple applications reading from the same stream independently (Pub/Sub)
• “Streaming MapReduce” querying capability
• Checkpointing needed to track progress of consumption
• Shards (capacity) must be provided ahead of time
• SQS:
• Queue, decouple applications
• One application per queue
• Records are deleted after consumption (ack / fail)
• Messages are processed independently for standard queue
• Ordering for FIFO queues
• Capability to “delay” messages
• Dynamic scaling of load (no-ops)

Kinesis Data Streams vs SQS
Feature | Kinesis Data Streams | Kinesis Data Firehose | Amazon SQS Standard | Amazon SQS FIFO
Managed by AWS | yes | yes | yes | yes
Ordering | Shard / Key | No | No | Specify Group ID
Delivery | At least once | At least once | At least once | Exactly Once
Replay | Yes | No | No | No
Max Data Retention | 7 days | No | 14 days | 14 days
Scaling | Provision Shards: 1MB/s producer, 2MB/s consumer | No limit | No limit | ~3000 messages per second with batching (soft limit)
Max Object Size | 1MB | 128 MB at destination | 256KB (more if using extended lib) | 256KB (more if using extended lib)

SQS vs Kinesis – Use cases
• SQS Use cases :
• Order processing
• Image Processing
• Auto scaling queues according to messages.
• Buffer and Batch messages for future processing.
• Request Offloading

• Amazon Kinesis Data Streams Use cases :


• Fast log and event data collection and processing
• Real Time metrics and reports
• Mobile data capture
• Real Time data analytics
• Gaming data feed
• Complex Stream Processing
• Data Feed from “Internet of Things”

IoT Overview
• We deploy IoT devices (“Things”)
• We configure them and retrieve data from them
[Diagram: IoT Thing → messages → Device gateway → IoT Message Broker → IoT Rules Engine → Kinesis, SQS, Lambda, etc…; the AWS IoT Cloud also hosts the Thing Registry and the Device Shadow]
IoT Device Gateway
• Serves as the entry point for IoT devices connecting to AWS
• Allows devices to securely and efficiently communicate with
AWS IoT
• Supports the MQTT, WebSockets, and HTTP 1.1 protocols
• Fully managed and scales automatically to support over a billion
devices
• No need to manage any infrastructure

[Diagram: Thing → Device Gateway]

IoT Message Broker
• Pub/sub (publishers/subscribers) messaging pattern - low
latency
• Devices can communicate with one another this way
• Messages sent using the MQTT, WebSockets, or HTTP 1.1
protocols
• Messages are published into topics (just like SNS)
• Message Broker forwards messages to all clients connected to
the topic

[Diagram: Thing → Device Gateway → Message broker / Topic]
IoT Thing Registry = IAM of IoT

• All connected IoT devices are represented in the AWS IoT registry
• Organizes the resources associated with each device in the AWS Cloud
• Each device gets a unique ID
• Supports metadata for each device (ex: Celsius vs Fahrenheit, etc…)
• Can create X.509 certificate to help IoT devices connect to AWS
• IoT Groups: group devices together and apply permissions to the group

Authentication
• 3 possible authentication methods for Things:
• Create X.509 certificates and load them securely onto the Things
• AWS SigV4
• Custom tokens with Custom authorizers
• For mobile apps:
• Cognito identities (extension to Google, Facebook login, etc…)
• Web / Desktop / CLI:
• IAM
• Federated Identities
[Diagram: IoT Devices ↔ Mutual Authentication ↔ Device Gateway]

Authorization
• AWS IoT policies:
• Attached to X.509 certificates or Cognito Identities
• Able to revoke any device at any time
• IoT Policies are JSON documents
• Can be attached to groups instead of individual Things.

• IAM Policies:
• Attached to users, group or roles
• Used for controlling IoT AWS APIs

Device Shadow
• JSON document representing the state of a connected Thing
• We can set the state to a different desired state (ex: light on)
• The IoT thing will retrieve the state when online and adapt
[Diagram: an IoT lightbulb reports state (off) to its device shadow in the AWS Cloud; a mobile application changes the desired state to (on) through the AWS API; the shadow synchronizes the state and the lightbulb turns on]
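
The synchronization step can be sketched as computing the difference between the desired and reported sections of the shadow document. The `state.desired` / `state.reported` layout follows the shadow JSON format; the helper name `shadow_delta` and the lightbulb document are illustrative.

```python
def shadow_delta(shadow: dict) -> dict:
    """Return the desired values that differ from the reported state."""
    state = shadow.get("state", {})
    desired = state.get("desired", {})
    reported = state.get("reported", {})
    return {k: v for k, v in desired.items() if reported.get(k) != v}


# The app asked for the light to be on, but the bulb last reported off.
shadow = {"state": {"desired": {"light": "on"},
                    "reported": {"light": "off"}}}
pending = shadow_delta(shadow)

# Once the bulb comes online, applies the change, and reports it,
# nothing is left to synchronize.
shadow["state"]["reported"]["light"] = "on"
synced = shadow_delta(shadow)
```

When the device comes back online it only needs to apply the pending difference, not the whole document.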

Rules Engine
• Rules are defined on the MQTT topics
• Rules = when it’s triggered | Action = what it does
• Rules use cases:
• Augment or filter data received from a device
• Write data received from a device to a DynamoDB database
• Save a file to S3
• Send a push notification to all users using SNS
• Publish data to an SQS queue
• Invoke a Lambda function to extract data
• Process messages from a large number of devices using Amazon Kinesis
• Send data to the Amazon Elasticsearch Service
• Capture a CloudWatch metric and change a CloudWatch alarm
• Send the data from an MQTT message to Amazon Machine Learning to make predictions based on an Amazon ML model
• & more
• Rules need IAM Roles to perform their actions
[Diagram: IoT Topic → IoT Rules → IoT Rules Actions (Kinesis, DynamoDB, SQS, SNS, S3, AWS Lambda)]
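A minimal sketch of what a rule definition could look like, written as the `topicRulePayload` dictionary passed to the boto3 `create_topic_rule` call. The topic filter, table name, rule name, role ARN and account ID are placeholders invented for this example, and the API call is left commented out because it needs AWS credentials.

```python
# Rule = SQL statement over an MQTT topic filter; actions = what happens on a match.
rule_payload = {
    "sql": "SELECT temperature, timestamp() AS ts "
           "FROM 'sensors/+/temperature' WHERE temperature > 30",
    "awsIotSqlVersion": "2016-03-23",
    "ruleDisabled": False,
    "actions": [
        {
            # Write each matching message to a DynamoDB table.
            "dynamoDBv2": {
                # The rule assumes this IAM role to perform the action (placeholder ARN).
                "roleArn": "arn:aws:iam::123456789012:role/example-iot-rule-role",
                "putItem": {"tableName": "example-sensor-readings"},
            }
        }
    ],
}

# With credentials configured, the rule could be created along these lines:
# import boto3
# boto3.client("iot").create_topic_rule(
#     ruleName="example_high_temperature", topicRulePayload=rule_payload)

print(rule_payload["sql"])
```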

IoT Greengrass
• IoT Greengrass brings the compute layer to the device directly
• You can execute AWS Lambda functions on the devices:
• Pre-process the data
• Execute predictions based on ML models
• Keep device data in sync
• Communicate between local devices
• Operate offline
• Deploy functions from the cloud directly to the devices
[Diagram: a local device (IoT thing: coffee pot) running AWS IoT Greengrass with a Lambda function]

DMS – Database Migration Service
• Quickly and securely migrate databases to AWS, resilient, self healing
• The source database remains available during the migration
• Supports:
• Homogeneous migrations: ex Oracle to Oracle
• Heterogeneous migrations: ex Microsoft SQL Server to Aurora
• Continuous Data Replication using CDC
• You must create an EC2 instance to perform the replication tasks
[Diagram: Source DB → EC2 instance running DMS → Target DB]

DMS Sources and Targets
SOURCES:
• On-Premise and EC2 instances databases: Oracle, MS SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB, SAP, DB2
• Azure: Azure SQL Database
• Amazon RDS: all including Aurora
• Amazon S3

TARGETS:
• On-Premise and EC2 instances databases: Oracle, MS SQL Server, MySQL, MariaDB, PostgreSQL, SAP
• Amazon RDS
• Amazon Redshift
• Amazon DynamoDB
• Amazon S3
• ElasticSearch Service
• Kinesis Data Streams
• DocumentDB

AWS Schema Conversion Tool (SCT)
• Convert your Database’s Schema from one engine to another
• Example OLTP: (SQL Server or Oracle) to MySQL,
PostgreSQL, Aurora
• Example OLAP: (Teradata or Oracle) to Amazon Redshift

• You can use AWS SCT to create AWS DMS endpoints and
tasks.

Direct Connect
• Provides a dedicated private connection from a remote network to your VPC
• Can set up multiple 1 Gbps or 10 Gbps dedicated network connections
• Set up a dedicated connection between your data center and Direct Connect locations
• You need to set up a Virtual Private Gateway on your VPC
• Access public resources (S3) and private (EC2) on same connection
• Use Cases:
• Increase bandwidth throughput - working with large data sets – lower cost
• More consistent network experience - applications using real-time data feeds
• Hybrid Environments (on prem + cloud)
• Enhanced security (private connection)
• Supports both IPv4 and IPv6
• High-availability: two Direct Connect connections as failover, or use Site-to-Site VPN as a failover

Direct Connect Diagram

https://docs.aws.amazon.com/directconnect/latest/UserGuide/images/direct_connect_overview.png

Direct Connect Gateway
• If you want to set up a Direct Connect to one or more VPCs in many different regions (same account), you must use a Direct Connect Gateway

https://docs.aws.amazon.com/directconnect/latest/UserGuide/direct-connect-gateways.html
Snowball
• Physical data transport solution that helps move TBs or PBs of data in or out of AWS
• Alternative to moving data over the network (and paying network fees)
• Secure, tamper resistant, uses KMS 256-bit encryption
• Tracking using SNS and text messages. E-ink shipping label
• Pay per data transfer job
• Use cases: large data cloud migrations, data center decommission, disaster recovery
• If it takes more than a week to transfer over the network, use Snowball devices!
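The one-week rule of thumb can be checked with simple arithmetic. The sketch below estimates network transfer time from data size and link speed; the `transfer_days` helper, the decimal-terabyte convention and the sample sizes are assumptions made for illustration, and real transfers rarely reach full link utilization.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to push data_tb terabytes over a link of link_gbps gigabits/s."""
    bits = data_tb * 1e12 * 8                        # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                          # seconds per day


for size_tb, gbps in [(10, 1), (100, 1), (100, 10)]:
    days = transfer_days(size_tb, gbps)
    advice = "consider Snowball" if days > 7 else "network is fine"
    print(f"{size_tb} TB over {gbps} Gbps: {days:.1f} days -> {advice}")
```

Under these assumptions, 100 TB over a fully used 1 Gbps link takes a little over nine days, which is past the one-week threshold.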

Snowball Process
1. Request Snowball devices from the AWS console for delivery
2. Install the Snowball client on your servers
3. Connect the Snowball to your servers and copy files using the client
4. Ship back the device when you’re done (goes to the right AWS facility)
5. Data will be loaded into an S3 bucket
6. Snowball is completely wiped
7. Tracking is done using SNS, text messages and the AWS console

Snowball Diagrams
• Direct upload to S3: client → www (10Gbit/s) → Amazon S3 bucket
• With Snowball: client → AWS Snowball → ship → AWS Snowball → import/export → Amazon S3 bucket

Snowball Edge
• Snowball Edges add computational capability to the device
• 100 TB capacity with either:
• Storage optimized – 24 vCPU
• Compute optimized – 52 vCPU & optional GPU
• Supports a custom EC2 AMI so you can perform processing on the go
• Supports custom Lambda functions
• Very useful to pre-process the data while moving
• Use case: data migration, image collation, IoT capture, machine learning
[Diagram: Snowball Edge running an AMI and a Lambda function]

AWS Snowmobile

• Transfer exabytes of data (1 EB = 1,000 PB = 1,000,000 TBs)


• Each Snowmobile has 100 PB of capacity (use multiple in
parallel)
• Better than Snowball if you transfer more than 10 PB

MSK – Managed Streaming for Apache Kafka
• Fully managed Apache Kafka on AWS (alternative to Kinesis)
• Allows you to create, update, delete clusters (control plane)
• MSK creates & manages broker nodes & Zookeeper nodes for you
• Deploy the MSK cluster in your VPC, multi AZ (up to 3 for HA)
• Automatic recovery from common Apache Kafka failures
• Can create custom configurations for your clusters
• Data is stored on EBS volumes
• You can build producers and consumers of data (data plane)

Apache Kafka at a high level
[Diagram: producers (your code) pull from sources such as Kinesis, IoT, RDS, etc… and write to a topic on the MSK cluster (Broker 1, Broker 2, Broker 3); consumers (your code) poll from the topic and deliver to targets such as EMR, S3, SageMaker, Kinesis, RDS, etc…]

MSK – Configurations
• Choose the number of AZ (3 – recommended, or 2)
• Choose the VPC & Subnets
• The broker instance type (ex: kafka.m5.large)
• The number of brokers per AZ (can add brokers later)
• Size of your EBS volumes (1GB - 16 TB)
[Diagram: MSK cluster spread over three Availability Zones, with Brokers 1, 2, 3 in a first row and Brokers 4, 5, 6 in a second row, one per zone]

MSK – Override Kafka Configurations
• List of properties you can set:
https://docs.aws.amazon.com/msk/latest/developerguide/msk-configuration-properties.html

Important to note:
• Max message size in Kafka by default is 1MB
• Can override this with the broker message.max.bytes setting
• Must also change the consumer fetch.max.bytes setting
• Latency:
• By default it’s low in Kafka 10-40ms (way less than Kinesis)
• The producer can increase latency to increase batching using linger.ms
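A hedged sketch of how the overrides above might be collected for the broker, consumer and producer. The 5 MB and 20 ms values are arbitrary examples, the `to_properties` helper is hypothetical, and `max.partition.fetch.bytes` and `max.request.size` are standard Kafka client settings added here as an assumption about what a larger message size usually also requires.

```python
FIVE_MB = 5 * 1024 * 1024

# Broker-side override (would go into an MSK custom configuration).
broker_overrides = {"message.max.bytes": FIVE_MB}

# Consumer-side settings so fetches can carry the larger messages.
consumer_config = {
    "fetch.max.bytes": FIVE_MB,
    "max.partition.fetch.bytes": FIVE_MB,
}

# Producer-side settings: wait up to 20 ms so batches can fill up.
producer_config = {"linger.ms": 20, "max.request.size": FIVE_MB}


def to_properties(settings: dict) -> str:
    """Render settings in key=value form, one per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(to_properties(broker_overrides))
print(to_properties(consumer_config))
print(to_properties(producer_config))
```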

MSK – Security
• Can enable encryption in flight using TLS between the brokers
• Can allow PLAINTEXT and/or TLS-encrypted traffic between the clients and brokers
• Encryption at rest for your EBS volumes using KMS
• Supports TLS client authentication using a Private Certificate Authority (CA) from ACM
• Authorize specific security groups for your Apache Kafka clients
• Security and ACLs for clients are managed within the Apache Kafka cluster
[Diagram: Brokers 1, 2 and 3 connected to each other over TLS, each with an EBS volume with KMS encryption; EC2 instances in a security group connect with TLS encryption + TLS authentication; Kafka ACLs apply inside the cluster]

MSK - Monitoring
• CloudWatch Metrics
• Basic monitoring (cluster and broker metrics)
• Enhanced monitoring (++enhanced broker metrics)
• Topic-level monitoring (++enhanced topic-level metrics)
• Prometheus (Open-Source Monitoring)
• Opens a port on the broker to export cluster, broker and topic-level metrics
• Setup the JMX Exporter (metrics) or Node Exporter (CPU and disk metrics)
• Broker Log Delivery
• Delivery to CloudWatch Logs
• Delivery to Amazon S3
• Delivery to Kinesis Data Firehose

Storage

AWS S3 Overview - Buckets

• Amazon S3 allows people to store objects (files) in “buckets” (directories)
• Buckets must have a globally unique name
• Buckets are defined at the region level
• Naming convention
• No uppercase
• No underscore
• 3-63 characters long
• Not an IP
• Must start with lowercase letter or number
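The naming rules above can be turned into a small validator. This is a sketch that checks only the rules listed on this slide, not the complete S3 rule set; the function name `looks_like_valid_bucket_name` is hypothetical.

```python
import ipaddress
import re

# Lowercase letters, digits, dots and hyphens; 3-63 characters in total;
# the first character must be a lowercase letter or a digit.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{2,62}")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Check only the rules listed above (not every S3 naming rule)."""
    if not _NAME_RE.fullmatch(name):
        return False
    try:
        ipaddress.IPv4Address(name)   # names formatted as an IP address are rejected
        return False
    except ValueError:
        return True


for candidate in ["my-bucket-01", "My_Bucket", "ab", "192.168.5.4"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```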

AWS S3 Overview - Objects
• Objects (files) have a Key. The key is the FULL path:
• <my_bucket>/my_file.txt
• <my_bucket>/my_folder1/another_folder/my_file.txt
• There’s no concept of “directories” within buckets
(although the UI will trick you to think otherwise)
• Just keys with very long names that contain slashes (“/”)
• Object Values are the content of the body:
• Max Size is 5TB
• If uploading more than 5GB, must use “multi-part upload”
• Metadata (list of text key / value pairs – system or user metadata)
• Tags (Unicode key / value pair – up to 10) – useful for security / lifecycle
• Version ID (if versioning is enabled)

AWS S3 - Consistency Model
• Read after write consistency for PUTS of new objects
• As soon as an object is written, we can retrieve it
ex: (PUT 200 -> GET 200)
• This is true, except if we did a GET before to see if the object existed
ex: (GET 404 -> PUT 200 -> GET 404) – eventually consistent

• Eventual Consistency for DELETES and PUTS of existing objects
• If we read an object after updating, we might get the older version
ex: (PUT 200 -> PUT 200 -> GET 200 (might be older version))
• If we delete an object, we might still be able to retrieve it for a short time
ex: (DELETE 200 -> GET 200)

S3 Storage Classes
• Amazon S3 Standard - General Purpose
• Amazon S3 Standard-Infrequent Access (IA)
• Amazon S3 One Zone-Infrequent Access
• Amazon S3 Intelligent Tiering
• Amazon Glacier
• Amazon Glacier Deep Archive

• Amazon S3 Reduced Redundancy Storage (deprecated - omitted)

S3 Standard – General Purpose
• High durability (99.999999999%) of objects across multiple AZ
• If you store 10,000,000 objects with Amazon S3, you can on
average expect to incur a loss of a single object once every
10,000 years
• 99.99% Availability over a given year
• Sustain 2 concurrent facility failures

• Use Cases: Big Data analytics, mobile & gaming applications, content distribution…
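The "one object every 10,000 years" figure follows directly from the 11 nines of durability. A short worked check, treating the durability figure as an independent annual loss probability per object (a simplifying assumption):

```python
annual_loss_probability = 1e-11      # 1 - 0.99999999999 (11 nines), per object per year
objects = 10_000_000

expected_losses_per_year = objects * annual_loss_probability
years_per_lost_object = 1 / expected_losses_per_year

print(f"expected losses per year: {expected_losses_per_year:.6f}")
print(f"years per lost object: {round(years_per_lost_object)}")
```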

S3 Standard – Infrequent Access (IA)
• Suitable for data that is less frequently accessed, but requires
rapid access when needed
• High durability (99.999999999%) of objects across multiple AZs
• 99.9% Availability
• Low cost compared to Amazon S3 Standard
• Sustain 2 concurrent facility failures

• Use Cases: As a data store for disaster recovery, backups…

S3 One Zone - Infrequent Access (IA)
• Same as IA but data is stored in a single AZ
• High durability (99.999999999%) of objects in a single AZ; data lost
when AZ is destroyed
• 99.5% Availability
• Low latency and high throughput performance
• Supports SSL for data in transit and encryption at rest
• Low cost compared to IA (by 20%)

• Use Cases: Storing secondary backup copies of on-premise data, or storing data you can recreate

S3 Intelligent Tiering
• Same low latency and high throughput performance of S3
Standard
• Small monthly monitoring and auto-tiering fee
• Automatically moves objects between two access tiers based
on changing access patterns
• Designed for durability of 99.999999999% of objects across
multiple Availability Zones
• Resilient against events that impact an entire Availability Zone
• Designed for 99.9% availability over a given year

Amazon Glacier
• Low cost object storage meant for archiving / backup
• Data is retained for the longer term (10s of years)
• Alternative to on-premise magnetic tape storage
• Average annual durability is 99.999999999%
• Cost per storage per month ($0.004 / GB) + retrieval cost
• Each item in Glacier is called “Archive” (up to 40TB)
• Archives are stored in “Vaults”

Amazon Glacier & Glacier Deep Archive
• Amazon Glacier – 3 retrieval options:
• Expedited (1 to 5 minutes)
• Standard (3 to 5 hours)
• Bulk (5 to 12 hours)
• Minimum storage duration of 90 days

• Amazon Glacier Deep Archive – for long term storage – cheaper:
• Standard (12 hours)
• Bulk (48 hours)
• Minimum storage duration of 180 days

S3 Storage Classes Comparison
Metric | S3 Standard | S3 Intelligent-Tiering | S3 Standard-IA | S3 One Zone-IA | S3 Glacier | S3 Glacier Deep Archive
Designed for durability | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s)
Designed for availability | 99.99% | 99.9% | 99.9% | 99.5% | 99.99% | 99.99%
Availability SLA | 99.9% | 99% | 99% | 99% | 99.9% | 99.9%
Availability Zones | ≥3 | ≥3 | ≥3 | 1 | ≥3 | ≥3
Minimum capacity charge per object | N/A | N/A | 128KB | 128KB | 40KB | 40KB
Minimum storage duration charge | N/A | 30 days | 30 days | 30 days | 90 days | 180 days
Retrieval fee | N/A | N/A | per GB retrieved | per GB retrieved | per GB retrieved | per GB retrieved

https://aws.amazon.com/s3/storage-classes/
S3 Storage Classes – Price Comparison
Example us-east-2
Metric | S3 Standard | S3 Intelligent-Tiering | S3 Standard-IA | S3 One Zone-IA | S3 Glacier | S3 Glacier Deep Archive
Storage Cost (per GB per month) | $0.023 | $0.0125 - $0.023 | $0.0125 | $0.01 | $0.004 (Minimum 90 days) | $0.00099 (Minimum 180 days)
Retrieval Cost (per 1000 requests) | GET $0.0004 | GET $0.0004 | GET $0.001 | GET $0.001 | GET $0.0004 + Expedited - $10.00, Standard - $0.05, Bulk - $0.025 | GET $0.0004 + Standard - $0.10, Bulk - $0.025
Time to retrieve | Instantaneous | Instantaneous | Instantaneous | Instantaneous | Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours) | Standard (12 hours), Bulk (48 hours)
Monitoring Cost (per 1000 objects) | | $0.0025 | | | |

S3 – Moving between storage classes
• You can transition objects between storage classes
• For infrequently accessed objects, move them to STANDARD_IA
• For archive objects you don’t need in real-time, GLACIER or DEEP_ARCHIVE
• Moving objects can be automated using a lifecycle configuration

S3 Lifecycle Rules
• Transition actions: define when objects are transitioned to another storage class.
• Move objects to Standard IA class 60 days after creation
• Move to Glacier for archiving after 6 months

• Expiration actions: configure objects to expire (delete) after some time
• Access log files can be set to delete after 365 days
• Can be used to delete old versions of files (if versioning is enabled)
• Can be used to delete incomplete multi-part uploads

• Rules can be created for a certain prefix (ex - s3://mybucket/mp3/*)
• Rules can be created for certain object tags (ex - Department: Finance)
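The transition and expiration actions above could be expressed roughly as follows, using the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The rule IDs, the bucket name and the 7-day multipart cleanup window are made up for this sketch, and the API call is commented out because it needs credentials.

```python
lifecycle_configuration = {
    "Rules": [
        {   # Transition actions, scoped to a prefix
            "ID": "archive-mp3",
            "Status": "Enabled",
            "Filter": {"Prefix": "mp3/"},
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},   # roughly 6 months
            ],
        },
        {   # Expiration action, scoped to an object tag
            "ID": "expire-finance-logs",
            "Status": "Enabled",
            "Filter": {"Tag": {"Key": "Department", "Value": "Finance"}},
            "Expiration": {"Days": 365},
        },
        {   # Clean up incomplete multi-part uploads for the whole bucket
            "ID": "cleanup-multipart",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="mybucket", LifecycleConfiguration=lifecycle_configuration)

for rule in lifecycle_configuration["Rules"]:
    print(rule["ID"], rule["Status"])
```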

S3 Lifecycle Rules – Scenario 1
• Your application on EC2 creates image thumbnails after profile photos are uploaded to Amazon S3. These thumbnails can be easily recreated, and only need to be kept for 45 days. The source images should be able to be immediately retrieved for these 45 days, and afterwards, the user can wait up to 6 hours. How would you design this?

• S3 source images can be on STANDARD, with a lifecycle configuration to transition them to GLACIER after 45 days.
• S3 thumbnails can be on ONEZONE_IA, with a lifecycle configuration to expire them (delete them) after 45 days.

S3 Lifecycle Rules – Scenario 2
• A rule in your company states that you should be able to
recover your deleted S3 objects immediately for 15 days,
although this may happen rarely. After this time, and for up to
365 days, deleted objects should be recoverable within 48
hours.

• You need to enable S3 versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered
• You can transition these “noncurrent versions” of the object to STANDARD_IA
• You can afterwards transition these “noncurrent versions” to DEEP_ARCHIVE

AWS S3 - Versioning
• You can version your files in AWS S3
• It is enabled at the bucket level
• Same key overwrite will increment the “version”: 1, 2, 3….
• It is best practice to version your buckets
• Protect against unintended deletes (ability to restore a version)
• Easy roll back to previous version
• Any file that is not versioned prior to enabling versioning will
have version “null”
• You can “suspend” versioning

S3 Cross Region Replication
• Must enable versioning (source and destination)
• Buckets must be in different AWS regions
• Can be in different accounts
• Copying is asynchronous
• Must give proper IAM permissions to S3
• Use cases: compliance, lower latency access, replication across accounts
[Diagram: asynchronous replication from a bucket in eu-west-1 to a bucket in us-east-1]

AWS S3 – ETag (Entity Tag)
• How do you verify if a file has already been uploaded to S3?
• Names work, but how are you sure the file is exactly the same?

• For this, you can use AWS ETags:
• For simple uploads (less than 5GB), it’s the MD5 hash
• For multi-part uploads, it’s more complicated, no need to know the algorithm

• Using ETag, we can ensure integrity of files
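For the simple-upload case, the comparison can be sketched by hashing the local bytes and comparing the result with the ETag that S3 returns. The `expected_etag` helper is a hypothetical name, and the match is assumed to hold only for objects uploaded in a single PUT that are not encrypted with SSE-KMS or SSE-C.

```python
import hashlib


def expected_etag(data: bytes) -> str:
    """MD5 hex digest of the payload, as used for the ETag of a simple upload."""
    return hashlib.md5(data).hexdigest()


local_bytes = b"hello world\n"
local_etag = expected_etag(local_bytes)
print(local_etag)

# The ETag returned by S3 is wrapped in double quotes, so strip them before comparing.
remote_etag = f'"{local_etag}"'          # stand-in for a value read from an S3 response
print(remote_etag.strip('"') == local_etag)
```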

AWS S3 Performance – Key Names
Historic fact and current exam
• When you had > 100 TPS (transactions per second), S3 performance could degrade
• Behind the scenes, each object goes to an S3 partition and for the best performance, we want the highest partition distribution
• In the exam, and historically, it was recommended to have random characters in front of your key name to optimise performance:
• <my_bucket>/5r4d_my_folder/my_file1.txt
• <my_bucket>/a91e_my_folder/my_file2.txt
• …
• It was recommended never to use dates to prefix keys:
• <my_bucket>/2018_09_09_my_folder/my_file1.txt
• <my_bucket>/2018_09_10_my_folder/my_file2.txt

AWS S3 Performance – Key Names
Current performance (not yet exam)
• https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

• As of July 17th 2018, we can scale up to 3500 RPS for PUT and
5500 RPS for GET for EACH PREFIX
• “This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster
performance”

• It’s a “good to know”, until the exam gets updated ☺

AWS S3 Performance
• For faster upload of large objects (>5GB), use multipart upload:
• parallelizes PUTs for greater throughput
• maximize your network bandwidth
• decrease time to retry in case a part fails
• Use CloudFront to cache S3 objects around the world (improves reads)
• S3 Transfer Acceleration (uses edge locations) – just need to change the endpoint you write to, not the code.
• If using SSE-KMS encryption, you may be limited to your AWS limits for KMS usage (~100s – 1000s downloads / uploads per second)

S3 Encryption for Objects

• There are 4 methods of encrypting objects in S3


• SSE-S3: encrypts S3 objects using keys handled & managed by AWS
• SSE-KMS: leverage AWS Key Management Service to manage
encryption keys
• SSE-C: when you want to manage your own encryption keys
• Client Side Encryption

• It’s important to understand which ones are adapted to which situation for the exam

SSE-S3
• SSE-S3: encryption using keys handled & managed by AWS S3
• Object is encrypted server side
• AES-256 encryption type
• Must set header: "x-amz-server-side-encryption": "AES256"
[Diagram: the object is sent over HTTP/S with the header; AWS S3 encrypts it with an S3 Managed Data Key and stores it in the bucket]

SSE-KMS
• SSE-KMS: encryption using keys handled & managed by KMS
• KMS Advantages: user control + audit trail
• Object is encrypted server side
• Must set header: "x-amz-server-side-encryption": "aws:kms"
[Diagram: the object is sent over HTTP/S with the header; AWS S3 encrypts it with a KMS Customer Master Key (CMK) and stores it in the bucket]
SSE-C
• SSE-C: server-side encryption using data keys fully managed by the customer outside of AWS
• Amazon S3 does not store the encryption key you provide
• HTTPS must be used
• Encryption key must be provided in HTTP headers, for every HTTP request made
[Diagram: the object and the client side data key are sent over HTTPS only, with the data key in the header; AWS S3 encrypts the object with the client-provided data key and stores it in the bucket]
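A sketch of the headers an SSE-C request carries, built with the standard library only. The `sse_c_headers` helper is hypothetical and the key is randomly generated for the example; in practice the customer has to store the key safely, since S3 does not keep it.

```python
import base64
import hashlib
import os


def sse_c_headers(key: bytes) -> dict:
    """Build the three SSE-C request headers for a 256-bit customer key."""
    if len(key) != 32:
        raise ValueError("SSE-C with AES256 expects a 32-byte key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        # The key itself, base64-encoded (hence HTTPS only).
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode(),
        # Base64-encoded MD5 digest of the key, used as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(hashlib.md5(key).digest()).decode(),
    }


customer_key = os.urandom(32)
headers = sse_c_headers(customer_key)
for name in headers:
    print(name)
```

The same headers have to accompany every later request for the object, including reads.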

Client Side Encryption
• Client library such as the Amazon S3 Encryption Client
• Clients must encrypt data themselves before sending to S3
• Clients must decrypt data themselves when retrieving from S3
• Customer fully manages the keys and encryption cycle
[Diagram: the client encrypts the object with a client side data key using the S3 Encryption SDK, then sends the encrypted object over HTTP/S to the AWS S3 bucket]

Encryption in transit (SSL)
• AWS S3 exposes:
• HTTP endpoint: non encrypted
• HTTPS endpoint: encryption in flight

• You’re free to use the endpoint you want, but HTTPS is recommended
• HTTPS is mandatory for SSE-C
• Encryption in flight is also called SSL / TLS

S3 CORS (Cross-Origin Resource Sharing)
• If you request data from another website, you need to enable CORS
• Cross Origin Resource Sharing allows you to limit the number of websites that can request your files in S3 (and limit your costs)
• It’s a popular exam question
[Diagram: the client sends GET index.html to mywebsite.com, then GET coffee.jpg with ORIGIN: http://mywebsite.com/ to myimagebucket, which responds with Access-Control-Allow-Origin: <domain>]
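A minimal sketch of a CORS configuration for the image bucket, in the dictionary shape that boto3's `put_bucket_cors` accepts. The `allow_origin_header` function is a simplified, hypothetical stand-in for the origin check S3 performs (exact matches only, no wildcard handling), meant only to show the idea.

```python
cors_configuration = {
    "CORSRules": [
        {
            "AllowedOrigins": ["http://mywebsite.com"],
            "AllowedMethods": ["GET"],
            "AllowedHeaders": ["*"],
            "MaxAgeSeconds": 3000,
        }
    ]
}


def allow_origin_header(origin: str, method: str, config: dict):
    """Return the CORS response header for an allowed origin, else None."""
    for rule in config["CORSRules"]:
        if origin in rule["AllowedOrigins"] and method in rule["AllowedMethods"]:
            return {"Access-Control-Allow-Origin": origin}
    return None   # no matching rule: the browser blocks the cross-origin response


print(allow_origin_header("http://mywebsite.com", "GET", cors_configuration))
print(allow_origin_header("http://other.example", "GET", cors_configuration))
```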
S3 Access Logs
• For audit purpose, you may want to log all access to S3 buckets
• Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
• That data can be analyzed using data analysis tools…
• Or Amazon Athena as we’ll see later in this course!
• The log format is at: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html
[Diagram: user → requests → My-bucket → Log all requests → Logging Bucket]

S3 Security
• User based
• IAM policies - which API calls should be allowed for a specific user from
IAM console

• Resource Based
• Bucket Policies - bucket wide rules from the S3 console - allows cross
account
• Object Access Control List (ACL) – finer grain
• Bucket Access Control List (ACL) – less common

S3 Bucket Policies
• JSON based policies
• Resources: buckets and objects
• Actions: Set of API to Allow or Deny
• Effect: Allow / Deny
• Principal: The account or user to apply the policy to
• Use S3 bucket policy to:
• Grant public access to the bucket
• Force objects to be encrypted at upload
• Grant access to another account (Cross Account)

S3 Default Encryption vs Bucket Policies
• The old way to enable default encryption was to use a bucket policy
and refuse any HTTP command without the proper headers:

• The new way is to use the “default encryption” option in S3


• Note: Bucket Policies are evaluated before “default encryption”
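The bucket-policy approach can be illustrated with a policy that denies uploads lacking the SSE-S3 header. This is a sketch modelled on the commonly documented pattern, not the exact policy from the slide; the bucket name is a placeholder.

```python
import json

bucket = "example-bucket"   # placeholder bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Refuse uploads whose encryption header is not AES256
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        },
        {   # Refuse uploads that carry no encryption header at all
            "Sid": "DenyUnencryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

print(json.dumps(policy, indent=2))
```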
S3 Security - Other
• Networking:
• Supports VPC Endpoints (for instances in VPC without www internet)
• Logging and Audit:
• S3 access logs can be stored in other S3 bucket
• API calls can be logged in AWS CloudTrail
• User Security:
• MFA (multi factor authentication) can be required in versioned buckets
to delete objects
• Signed URLs: URLs that are valid only for a limited time (ex: premium
video service for logged in users)

Glacier
• Low cost object storage meant for archiving / backup
• Data is retained for the longer term (10s of years)
• Alternative to on-premise magnetic tape storage
• Average annual durability is 99.999999999%
• Cost per storage per month ($0.004 / GB) + retrieval cost

• Each item in Glacier is called “Archive” (up to 40TB)
• Archives are stored in “Vaults”

• Exam tip: archival from S3 after XXX days => use Glacier

Glacier Operations
• Restore links have an expiry date
• 3 retrieval options:
• Expedited (1 to 5 minutes retrieval) – $0.03 per GB and $0.01 per request
• Standard (3 to 5 hours) – $0.01 per GB and $0.05 per 1000 requests
• Bulk (5 to 12 hours) – $0.0025 per GB and $0.025 per 1000 requests
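To compare the three retrieval options, the per-GB and per-request prices listed above can be plugged into a small calculation. The 500 GB / 1000 requests workload is an arbitrary example, and the prices are the ones quoted on this slide, not necessarily current ones.

```python
# (price per GB, price per request), taken from the figures above.
RETRIEVAL_PRICING = {
    "expedited": (0.03, 0.01),
    "standard": (0.01, 0.05 / 1000),
    "bulk": (0.0025, 0.025 / 1000),
}


def retrieval_cost(tier: str, gigabytes: float, requests: int) -> float:
    """Total retrieval cost in dollars for one tier."""
    per_gb, per_request = RETRIEVAL_PRICING[tier]
    return gigabytes * per_gb + requests * per_request


for tier in RETRIEVAL_PRICING:
    print(f"{tier:>9}: ${retrieval_cost(tier, 500, 1000):.3f}")
```

For this workload the request fee dominates the Expedited cost, while it is negligible for Standard and Bulk.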

Glacier - Vault Policies & Vault Lock
• Vault is a collection of archives
• Each Vault has:
• ONE vault access policy
• ONE vault lock policy
• Vault Policies are written in JSON
• Vault Access Policy is similar to bucket policy (restrict user / account permissions)
• Vault Lock Policy is a policy you lock, for regulatory and compliance requirements.
• The policy is immutable, it can never be changed (that’s why it’s called LOCK)
• Example 1: forbid deleting an archive if less than 1 year old
• Example 2: implement WORM policy (write once read many)

S3 Select & Glacier Select
• Retrieve less data using SQL by performing server side filtering
• Can filter by rows & columns (simple SQL statements)
• Less network transfer, less CPU cost client-side

[Diagram: the client requests a CSV file with S3 Select; Amazon S3 performs server-side filtering and sends back the filtered dataset]
https://aws.amazon.com/blogs/aws/s3-glacier-select/

S3 Select with Hadoop
• Transfer some data from S3 before analyzing it with your cluster
• Load less data into Hadoop, save network costs, transfer the data faster
[Diagram: Amazon S3 filters a CSV file server-side using S3 Select and sends the filtered dataset to the cluster]

DynamoDB
• Fully Managed, Highly available with replication across 3 AZ
• NoSQL database - not a relational database
• Scales to massive workloads, distributed database
• Millions of requests per second, trillions of rows, 100s of TB of storage
• Fast and consistent in performance (low latency on retrieval)
• Integrated with IAM for security, authorization and administration
• Enables event driven programming with DynamoDB Streams
• Low cost and auto scaling capabilities

DynamoDB - Basics
• DynamoDB is made of tables
• Each table has a primary key (must be decided at creation time)
• Each table can have an infinite number of items (= rows)
• Each item has attributes (can be added over time – can be null)
• Maximum size of an item is 400 KB
• Data types supported are:
• Scalar Types: String, Number, Binary, Boolean, Null
• Document Types: List, Map
• Set Types: String Set, Number Set, Binary Set

DynamoDB – Primary Keys

• Option 1: Partition key only (HASH)
• Partition key must be unique for each item
• Partition key must be “diverse” so that the data is distributed
• Example: user_id for a users table

user_id (partition key, unique)   First Name   Age
12broiu45                         John         46
dfi7503df                         Katie        31

(First Name and Age are attributes)

DynamoDB – Primary Keys
• Option 2: Partition key + Sort Key
• The combination must be unique
• Data is grouped by partition key
• Sort key == range key
• Example: users-games table
• user_id for the partition key
• game_id for the sort key

user_id (partition key)   game_id (sort key)   Result (attribute)
12broiu45                 1234                 win
12broiu45                 3456                 lose

(Partition key + sort key = primary key)

DynamoDB – Partition Keys exercise
• We’re building a movie database
• What is the best partition key to maximize data distribution?
• movie_id
• producer_name
• leader_actor_name
• movie_language

• movie_id has the highest cardinality so it’s a good candidate


• movie_language doesn’t take many values and may be skewed towards English so it’s not a great partition key

DynamoDB in Big Data
• Common use cases include:
• Mobile apps
• Gaming
• Digital ad serving
• Live voting
• Audience interaction for live events
• Sensor networks
• Log ingestion
• Access control for web-based content
• Metadata storage for Amazon S3 objects
• E-commerce shopping carts
• Web session management

• Anti Pattern
• Prewritten application tied to a traditional relational database: use RDS instead
• Joins or complex transactions
• Binary Large Object (BLOB) data: store data in S3 & metadata in DynamoDB
• Large data with low I/O rate: use S3 instead

DynamoDB – Provisioned Throughput
• Table must have provisioned read and write capacity units
• Read Capacity Units (RCU): throughput for reads
• Write Capacity Units (WCU): throughput for writes
• Option to setup auto-scaling of throughput to meet demand
• Throughput can be exceeded temporarily using “burst credit”
• If burst credits are exhausted, you’ll get a “ProvisionedThroughputExceededException”.
• It’s then advised to do an exponential back-off retry
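The exponential back-off retry mentioned above can be sketched as follows. The exception class here is a hypothetical stand-in for the throttling error, and the delay values are illustrative assumptions rather than recommended settings.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the DynamoDB throttling exception."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call operation(), retrying with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt: base, 2*base, 4*base, ...
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The random jitter is a design choice: it spreads out retries so that many throttled clients do not all retry at the same instant.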

DynamoDB – Write Capacity Units
• One write capacity unit represents one write per second for an
item up to 1 KB in size.
• If the items are larger than 1 KB, more WCU are consumed

• Example 1: we write 10 objects per second of 2 KB each.


• We need 2 * 10 = 20 WCU
• Example 2: we write 6 objects per second of 4.5 KB each
• We need 6 * 5 = 30 WCU (4.5 gets rounded to the upper KB)
• Example 3: we write 120 objects per minute of 2 KB each
• We need 120 / 60 * 2 = 4 WCU
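The three examples above follow one rule, which can be written as a small helper. The function name is an assumption made for this sketch.

```python
import math


def required_wcu(writes_per_second, item_size_kb):
    """WCU needed: each write costs the item size rounded up to the next KB."""
    return math.ceil(writes_per_second * math.ceil(item_size_kb))


print(required_wcu(10, 2))        # Example 1
print(required_wcu(6, 4.5))       # Example 2: 4.5 KB is rounded up to 5 KB
print(required_wcu(120 / 60, 2))  # Example 3: 120 per minute = 2 per second
```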

Strongly Consistent Read vs Eventually Consistent Read
• Eventually Consistent Read: If we read just after a write, it’s possible we’ll get a stale response because replication has not completed yet
• Strongly Consistent Read: If we read just after a write, we will get the correct data
• By default: DynamoDB uses Eventually Consistent Reads, but GetItem, Query & Scan provide a “ConsistentRead” parameter you can set to True
[Diagram: the application sends writes and reads to DynamoDB; DynamoDB Server 1 replicates to DynamoDB Server 2 and DynamoDB Server 3]

DynamoDB – Read Capacity Units
• One read capacity unit represents one strongly consistent read per
second, or two eventually consistent reads per second, for an item
up to 4 KB in size.
• If the items are larger than 4 KB, more RCU are consumed

• Example 1: 10 strongly consistent reads per second of 4 KB each
• We need 10 * 4 KB / 4 KB = 10 RCU
• Example 2: 16 eventually consistent reads per second of 12 KB each
• We need (16 / 2) * (12 / 4) = 24 RCU
• Example 3: 10 strongly consistent reads per second of 6 KB each
• We need 10 * 8 KB / 4 KB = 20 RCU (we have to round up 6 KB to 8 KB)
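The same arithmetic as a short helper, assuming the rounding rule shown in the examples (item size rounded up to the next multiple of 4 KB). The function name is hypothetical.

```python
import math


def required_rcu(reads_per_second, item_size_kb, strongly_consistent=True):
    """RCU needed for a given read rate and item size."""
    units_per_read = math.ceil(item_size_kb / 4)  # 4 KB blocks, rounded up
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        total = total / 2  # eventually consistent reads cost half
    return math.ceil(total)


print(required_rcu(10, 4))                              # Example 1
print(required_rcu(16, 12, strongly_consistent=False))  # Example 2
print(required_rcu(10, 6))                              # Example 3
```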

DynamoDB - Throttling
• If we exceed our RCU or WCU, we get
ProvisionedThroughputExceededExceptions
• Reasons:
• Hot keys / partitions: one partition key is being read too many times (a popular item, for example)
• Very large items: remember RCU and WCU depend on the size of items
• Solutions:
• Exponential back-off when exception is encountered (already in SDK)
• Distribute partition keys as much as possible
• If RCU issue, we can use DynamoDB Accelerator (DAX)

DynamoDB – Partitions Internal
• You start with one partition
• Each partition:
• Max of 3000 RCU / 1000 WCU
• Max of 10 GB
• To compute the number of partitions:
• By capacity: (TOTAL RCU / 3000) + (TOTAL WCU / 1000)
• By size: Total Size / 10 GB
• Total partitions = CEILING(MAX(Capacity, Size))
• WCU and RCU are spread evenly between partitions
[Diagram: an item (user_id = 1) goes through a hashing algorithm on the partition key and lands in one of three partitions. Table settings: Total 6000 RCU / Total 2400 WCU, shown as 2000 RCU / 800 WCU per partition]
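The partition formula from this slide, as a small sketch. The function name and the sample figures are assumptions for illustration; the floor of one partition reflects "you start with one partition".

```python
import math


def estimate_partitions(total_rcu, total_wcu, total_size_gb):
    """Apply CEILING(MAX(capacity, size)) from the slide."""
    by_capacity = total_rcu / 3000 + total_wcu / 1000
    by_size = total_size_gb / 10
    # A table always has at least one partition.
    return max(1, math.ceil(max(by_capacity, by_size)))


# Hypothetical table: 3000 RCU, 1000 WCU, 25 GB of data.
# Capacity gives 2.0, size gives 2.5, so size drives the result.
print(estimate_partitions(3000, 1000, 25))
```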

DynamoDB – Writing Data
• PutItem - Write data to DynamoDB (create data or full replace)
• Consumes WCU

• UpdateItem – Update data in DynamoDB (partial update of attributes)
• Possibility to use Atomic Counters and increase them

• Conditional Writes:
• Accept a write / update only if conditions are respected, otherwise reject
• Helps with concurrent access to items
• No performance impact
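As a sketch of how an atomic counter and a conditional write can be combined, the helper below only builds the parameters for an UpdateItem call in the low-level attribute-value format. The table name, key, and attribute names are hypothetical.

```python
def build_counter_update(table, key, counter_attr, max_value):
    """Parameters for an UpdateItem that increments a counter below a cap."""
    return {
        "TableName": table,
        "Key": key,
        # ADD on a number attribute increments it atomically.
        "UpdateExpression": "ADD #c :one",
        # The write is rejected unless this condition holds at write time.
        "ConditionExpression": "attribute_not_exists(#c) OR #c < :max",
        "ExpressionAttributeNames": {"#c": counter_attr},
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":max": {"N": str(max_value)},
        },
    }


# Hypothetical table and key, for illustration only.
params = build_counter_update(
    "users", {"user_id": {"S": "12broiu45"}}, "login_count", 100
)
# With boto3 this could be sent as boto3.client("dynamodb").update_item(**params)
```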

DynamoDB – Deleting Data
• DeleteItem
• Delete an individual row
• Ability to perform a conditional delete

• DeleteTable
• Delete a whole table and all its items
• Much quicker deletion than calling DeleteItem on all items

DynamoDB – Batching Writes
• BatchWriteItem
• Up to 25 PutItem and / or DeleteItem in one call
• Up to 16 MB of data written
• Up to 400 KB of data per item

• Batching allows you to save in latency by reducing the number of API calls done against DynamoDB
• Operations are done in parallel for better efficiency
• It’s possible for part of a batch to fail, in which case we have to retry the failed items (using an exponential back-off algorithm)
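Because a single call accepts at most 25 operations, a larger list has to be split first. A minimal sketch of that splitting step, with a hypothetical helper name:

```python
def chunk_for_batch_write(items, batch_size=25):
    """Split items into lists small enough for one BatchWriteItem call."""
    if not 1 <= batch_size <= 25:
        raise ValueError("BatchWriteItem accepts between 1 and 25 operations")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


batches = chunk_for_batch_write(list(range(60)))
print([len(b) for b in batches])
```

Each batch would then be sent separately, and any items the service reports back as unprocessed would be retried with back-off.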

DynamoDB – Reading Data
• GetItem:
• Read based on Primary key
• Primary Key = HASH or HASH-RANGE
• Eventually consistent read by default
• Option to use strongly consistent reads (more RCU - might take longer)
• ProjectionExpression can be specified to include only certain attributes

• BatchGetItem:
• Up to 100 items
• Up to 16 MB of data
• Items are retrieved in parallel to minimize latency

DynamoDB – Query
• Query returns items based on:
• PartitionKey value (must be = operator)
• SortKey value (=, <, <=, >, >=, Between, begins_with) – optional
• FilterExpression to further filter (applied after the items are read and before results are returned, so it does not reduce the RCU consumed)
• Returns:
• Up to 1 MB of data
• Or number of items specified in Limit
• Able to do pagination on the results
• Can query table, a local secondary index, or a global secondary
index
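A sketch of Query parameters that follow these rules: equality on the partition key, an optional begins_with condition on the sort key, and an optional Limit. It assumes the users-games table from the earlier slides with string-typed user_id and game_id attributes; the table name is hypothetical.

```python
def build_query(table, user_id, game_id_prefix=None, limit=None):
    """Build parameters for a Query on a partition key and optional sort key."""
    params = {
        "TableName": table,
        # The partition key condition must use the = operator.
        "KeyConditionExpression": "user_id = :pk",
        "ExpressionAttributeValues": {":pk": {"S": user_id}},
    }
    if game_id_prefix is not None:
        params["KeyConditionExpression"] += " AND begins_with(game_id, :prefix)"
        params["ExpressionAttributeValues"][":prefix"] = {"S": game_id_prefix}
    if limit is not None:
        params["Limit"] = limit
    return params


params = build_query("users-games", "12broiu45", game_id_prefix="12", limit=10)
# With boto3 this could be sent as boto3.client("dynamodb").query(**params)
```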

DynamoDB - Scan
• Scan the entire table and then filter out data (inefficient)
• Returns up to 1 MB of data – use pagination to keep on reading
• Consumes a lot of RCU
• Limit impact using Limit or reduce the size of the result and pause
• For faster performance, use parallel scans:
• Multiple instances scan multiple partitions at the same time
• Increases the throughput and RCU consumed
• Limit the impact of parallel scans just like you would for Scans
• Can use a ProjectionExpression + FilterExpression (no change to
RCU)

DynamoDB – LSI (Local Secondary Index)
• Alternate range key for your table, local to the hash key
• Up to five local secondary indexes per table.
• The sort key consists of exactly one scalar attribute.
• The attribute that you choose must be a scalar String, Number, or Binary
• LSI must be defined at table creation time

user_id (partition key)   game_id (sort key)   game_ts (LSI)            Result   Duration
12broiu45                 1234                 “2018-03-15T17:43:08”    win      12
12broiu45                 3456                 “2018-06-20T19:02:32”    lose     33
34oiusd21                 4567                 “2018-02-11T04:11:31”    lose     45

(game_ts, Result and Duration are attributes; game_ts is the LSI sort key)
DynamoDB – GSI (Global Secondary Index)
• To speed up queries on non-key attributes, use a Global Secondary Index
• GSI = partition key + optional sort key
• The index is a new “table” and we can project attributes on it
• The partition key and sort key of the original table are always projected (KEYS_ONLY)
• Can specify extra attributes to project (INCLUDE)
• Can use all attributes from main table (ALL)
• Must define RCU / WCU for the index
• Possibility to add / modify GSI (not LSI)

DynamoDB – GSI (Global Secondary Index)

TABLE – query by user_id
user_id (partition key)   game_id (sort key)   game_ts (attribute)
12broiu45                 1234                 “2018-03-15T17:43:08”
12broiu45                 3456                 “2018-06-20T19:02:32”
34oiusd21                 4567                 “2018-02-11T04:11:31”

INDEX – queries by game_id
game_id (partition key)   game_ts (sort key)       user_id (attribute)
1234                      “2018-03-15T17:43:08”    12broiu45
3456                      “2018-06-20T19:02:32”    12broiu45
4567                      “2018-02-11T04:11:31”    34oiusd21


DynamoDB - DAX
• DAX = DynamoDB Accelerator
• Seamless cache for DynamoDB, no application re-write
• Writes go through DAX to DynamoDB
• Microsecond latency for cached reads & queries
• Solves the Hot Key problem (too many reads)
• 5 minutes TTL for cache by default
• Up to 10 nodes in the cluster
• Multi AZ (3 nodes minimum recommended for production)
• Secure (Encryption at rest with KMS, VPC, IAM, CloudTrail…)
[Diagram: applications → DynamoDB Accelerator (DAX) → Amazon DynamoDB tables]

DynamoDB Streams
• Changes in DynamoDB (Create, Update, Delete) can end up in a DynamoDB Stream
• This stream can be read by AWS Lambda, and we can then do:
• React to changes in real time (welcome email to new users)
• Create derivative tables / views
• Insert into ElasticSearch
• Could implement Cross Region Replication using Streams
• Stream has 24 hours of data retention
• Configurable batch size (up to 1,000 rows, 6 MB)
[Diagram: Create / Update / Delete on an Amazon DynamoDB table → changelog → Stream → batch of records → Lambda function]
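A minimal sketch of a Lambda function reacting to a batch of stream records, here picking out newly created items (the "welcome email" case). The handler name and the sample event are illustrative; the event below is trimmed to the fields the code reads.

```python
def handler(event, context):
    """Collect the keys of newly inserted items from a DynamoDB Streams batch."""
    new_keys = []
    for record in event.get("Records", []):
        # eventName is INSERT, MODIFY or REMOVE.
        if record.get("eventName") == "INSERT":
            new_keys.append(record["dynamodb"]["Keys"])
    return new_keys


# Hypothetical batch with one created and one deleted item.
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"user_id": {"S": "12broiu45"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"user_id": {"S": "dfi7503df"}}}},
    ]
}
print(handler(sample_event, None))
```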

DynamoDB Streams Kinesis Adapter
• Use the KCL library to directly consume from DynamoDB Streams
• You just need to add a “Kinesis Adapter” library
• The interface and programming is exactly the same as Kinesis Streams
• That’s the alternative to using AWS Lambda
[Diagram: Amazon DynamoDB table → DynamoDB Stream → KCL applications that consume messages and checkpoint progress]

DynamoDB TTL (Time to Live)
• TTL = automatically delete an item after an expiry date / time
• TTL is provided at no extra cost, deletions do not use WCU / RCU
• TTL is a background task operated by the DynamoDB service itself
• Helps reduce storage and manage the table size over time
• Helps adhere to regulatory norms
• TTL is enabled per row (you define a TTL column, and add a date
there)
• DynamoDB typically deletes expired items within 48 hours of
expiration
• Deleted items due to TTL are also deleted in GSI / LSI
• DynamoDB Streams can help recover expired items
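The value stored in the TTL column is a Number holding the expiry time as a Unix epoch timestamp in seconds. A small sketch of computing it; the helper name is an assumption.

```python
import time


def ttl_timestamp(days_to_live, now=None):
    """Return the epoch-seconds expiry value for an item's TTL attribute."""
    now = time.time() if now is None else now
    return int(now) + days_to_live * 24 * 60 * 60


# An item written now that should expire in 30 days.
expires_at = ttl_timestamp(30)
```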

DynamoDB – Security & Other Features
• Security:
• VPC Endpoints available to access DynamoDB without internet
• Access fully controlled by IAM
• Encryption at rest using KMS
• Encryption in transit using SSL / TLS
• Backup and Restore feature available
• Point in time restore like RDS
• No performance impact
• Global Tables
• Multi region, fully replicated, high performance
• AWS Database Migration Service (DMS) can be used to migrate to DynamoDB (from Mongo, Oracle, MySQL, S3, etc…)
• You can launch a local DynamoDB on your computer for development
purposes

DynamoDB – Storing large objects
• Max size of an item in DynamoDB = 400 KB
• For large objects, store them in S3 and reference them in
DynamoDB
[Diagram: a client stores the object in Amazon S3 and adds a row in DynamoDB (2. New Row in DynamoDB); another client reads that row (3. Read Row in DynamoDB)]

ID          FileName    S3Url                            FileSizeMb
12bd2397a   large.pdf   S3://mybucket/mypdfs/large.pdf   23

DynamoDB - Price comparison for 300 KB
[Diagram: Client → Amazon S3 + DynamoDB → Client, compared with Client → DynamoDB → Client]

Option A: S3 + DynamoDB
• Amazon S3 (300 KB of storage)
• $0.0000069 storage per month
• $0.0000050 initial PUT
• $0.0000004 per GET
• DynamoDB (< 1 KB of storage)
• $0.0006500 for one WCU per month
• $0.0001300 for one RCU per month
• $0.00000025 storage per month
• Assuming 1 write, 100 reads per month:
• $0.00119215 per month

Option B: DynamoDB only
• DynamoDB (300 KB of storage)
• $0.195 for 300 WCU per month
• $0.004940 for 38 RCU per month
• $0.000075 storage per month
• Assuming 1 write, 100 reads per month:
• Storage is 11x more expensive
• WCU + RCU are under-used

• Even for items that fit in DynamoDB, if under-used, S3 + DynamoDB is a solution

AWS ElastiCache Overview
• The same way RDS is to get managed Relational Databases…
• ElastiCache is to get managed Redis or Memcached
• Caches are in-memory databases with really high performance, low
latency
• Helps reduce load off of databases for read intensive workloads
• Helps make your application stateless
• Write Scaling using sharding
• Read Scaling using Read Replicas
• Multi AZ with Failover Capability
• AWS takes care of OS maintenance / patching, optimizations, setup,
configuration, monitoring, failure recovery and backups

Redis Overview
• Redis is an in-memory key-value store
• Super low latency (sub ms)
• Cache survives reboots by default (it’s called persistence)
• Great to host
• User sessions
• Leaderboard (for gaming)
• Distributed states
• Relieve pressure on databases (such as RDS)
• Pub / Sub capability for messaging
• Multi AZ with Automatic Failover for disaster recovery if you don’t
want to lose your cache data
• Support for Read Replicas

Memcached Overview
• Memcached is an in-memory object store
• Cache doesn’t survive reboots
• Use cases:
• Quick retrieval of objects from memory
• Cache often accessed objects
• Overall, Redis has largely grown in popularity and has better
feature sets than Memcached.
• I would personally only use Redis for caching needs.

AWS Lambda
Serverless data processing

What is Lambda?
• A way to run code snippets “in the
cloud”
• Serverless
• Continuous scaling
• Often used to process data as it’s
moved around

Example: Serverless Website

Example: Order history app
[Diagram labels: Server logs, Client app]

Example: Transaction rate alarm
[Diagram label: Server logs]

Why not just run a server?
• Server management (patches, monitoring, hardware failures,
etc.)

• Servers can be cheap, but scaling gets expensive really fast

• You don’t pay for processing time you don’t use

• Easier to split up development between front-end and back-end

Main uses of Lambda
• Real-time file processing
• Real-time stream processing
• ETL
• Cron replacement
• Process AWS events

Supported languages
• Node.js
• Python
• Java
• C#
• Go
• PowerShell
• Ruby

Lambda triggers

Lambda and Amazon Elasticsearch
Service

Lambda and Data Pipeline

Lambda and Redshift

Lambda + Kinesis
• Your Lambda code receives an event with a
batch of stream records
• You specify a batch size when setting up the trigger
(up to 10,000 records)
• Too large a batch size can cause timeouts!
• Batches may also be split beyond Lambda’s
payload limit (6 MB)
• Lambda will retry the batch until it succeeds or
the data expires
• This can stall the shard if you don’t handle errors
properly
• Use more shards to ensure processing isn’t totally
held up by errors
• Lambda processes shard data synchronously
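One way to avoid stalling a shard on a single bad record is to catch errors per record instead of letting the whole invocation fail. A minimal sketch, assuming JSON payloads; the handler name and the sample event (trimmed to the fields used) are illustrative.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record; set aside records that cannot be parsed."""
    processed, failed = [], []
    for record in event["Records"]:
        try:
            # Kinesis record data arrives base64-encoded.
            payload = base64.b64decode(record["kinesis"]["data"])
            processed.append(json.loads(payload))
        except ValueError:
            # Bad record: note it and move on so the batch is not retried forever.
            failed.append(record["kinesis"]["sequenceNumber"])
    return {"processed": processed, "failed": failed}


sample_event = {
    "Records": [
        {"kinesis": {"sequenceNumber": "1",
                     "data": base64.b64encode(b'{"action": "click"}').decode()}},
        {"kinesis": {"sequenceNumber": "2",
                     "data": base64.b64encode(b"not json").decode()}},
    ]
}
result = handler(sample_event, None)
```

In a real function the failed sequence numbers would be logged or sent somewhere durable; that part is left out of this sketch.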

Cost Model
• “Pay for what you use”
• Generous free tier (1M requests / month, 400K GB-seconds compute time)
• $0.20 / million requests
• $0.00001667 per GB-second
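A rough sketch of how these figures combine into a monthly bill, using only the prices and free tier quoted on this slide. The function name and the sample workload are assumptions.

```python
def monthly_lambda_cost(requests, gb_seconds):
    """Estimate a monthly Lambda bill from the slide's prices and free tier."""
    free_requests = 1_000_000
    free_gb_seconds = 400_000
    billable_requests = max(0, requests - free_requests)
    billable_compute = max(0, gb_seconds - free_gb_seconds)
    return billable_requests / 1_000_000 * 0.20 + billable_compute * 0.00001667


# Hypothetical month: 3M requests and 1M GB-seconds of compute.
print(round(monthly_lambda_cost(3_000_000, 1_000_000), 3))
```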

Other promises
• High availability
• Unlimited scalability*
• High performance
• But you do specify a timeout!
This can cause problems.
Max is 900 seconds.

Anti-patterns
• Long-running applications
• Dynamic websites
• Stateful applications

AWS Glue
Table definitions and ETL

What is Glue?
• Serverless discovery and definition
of table definitions and schema
• S3 “data lakes”
• RDS
• Redshift
• Most other SQL databases
• Custom ETL jobs
• Trigger-driven, on a schedule, or on
demand
• Fully managed

Glue Crawler / Data Catalog

Glue and S3 Partitions
• Glue crawler will extract partitions based on how your S3 data is organized
• Think up front about how you will be querying your data lake in S3
• Example: devices send sensor data every hour
• Do you query primarily by time ranges?
• If so, organize your buckets as yyyy/mm/dd/device
• Do you query primarily by device?
• If so, organize your buckets as device/yyyy/mm/dd

[Diagram: folder tree organized by device: Device 1 and Device 2 at the top, each with 2018 and 2019 below, each year with months 01, 02, …]
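The two layouts above can be sketched as a helper that builds the S3 key prefix for a reading. The function name, the query_by flag, and the device identifier are hypothetical.

```python
from datetime import date


def object_prefix(device_id, day, query_by="time"):
    """Build an S3 key prefix laid out for the dominant query pattern."""
    date_part = f"{day.year:04d}/{day.month:02d}/{day.day:02d}"
    if query_by == "time":
        return f"{date_part}/{device_id}/"   # yyyy/mm/dd/device
    if query_by == "device":
        return f"{device_id}/{date_part}/"   # device/yyyy/mm/dd
    raise ValueError("query_by must be 'time' or 'device'")


print(object_prefix("device1", date(2019, 2, 3)))
print(object_prefix("device1", date(2019, 2, 3), query_by="device"))
```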

Glue + Hive

Glue ETL
• Automatic code generation
• Scala or Python
• Encryption
• Server-side (at rest)
• SSL (in transit)
• Can be event-driven
• Can provision additional “DPU’s” (data processing units) to
increase performance of underlying Spark jobs
• Errors reported to CloudWatch

Glue ETL
• Transform data, Clean Data, Enrich Data (before doing analysis)
• Generate ETL code in Python or Scala, you can modify the code
• Can provide your own Spark or PySpark scripts
• Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
• Fully managed, cost effective, pay only for the resources consumed
• Jobs are run on a serverless Spark platform

• Glue Scheduler to schedule the jobs


• Glue Triggers to automate job runs based on “events”

Glue ETL - Transformations
• Bundled Transformations:
• DropFields, DropNullFields – remove (null) fields
• Filter – specify a function to filter records
• Join – to enrich data
• Map - add fields, delete fields, perform external lookups
• Machine Learning Transformations:
• FindMatches ML: identify duplicate or matching records in your
dataset, even when the records do not have a common unique
identifier and no fields match exactly.
• Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
• Apache Spark transformations (example: K-Means)

AWS Glue Development Endpoints
• Develop ETL scripts using a notebook
• Then create an ETL job that runs your
script (using Spark and Glue)
• Endpoint is in a VPC controlled by
security groups, connect via:
• Apache Zeppelin on your local machine
• Zeppelin notebook server on EC2 (via
Glue console)
• SageMaker notebook
• Terminal window
• PyCharm professional edition
• Use Elastic IP’s to access a private
endpoint address

Running Glue jobs
• Time-based schedules (cron style)
• Job bookmarks
• Persists state from the job run
• Prevents reprocessing of old data
• Allows you to process new data only when re-running on a
schedule
• Works with S3 sources in a variety of formats
• Works with relational databases via JDBC (if PK’s are in
sequential order)
• Only handles new rows, not updated rows
• CloudWatch Events
• Fire off a Lambda function or SNS notification when ETL
succeeds or fails
• Invoke EC2 run, send event to Kinesis, activate a Step Function

Glue cost model
• Billed by the minute for crawler
and ETL jobs
• First million objects stored and
accesses are free for the Glue
Data Catalog
• Development endpoints for
developing ETL code charged by
the minute

Glue Anti-patterns
• Multiple ETL engines

No longer an anti-pattern: streaming
• As of April 2020, Glue ETL supports serverless streaming ETL
• Consumes from Kinesis or Kafka
• Clean & transform in-flight
• Store results into S3 or other data stores
• Runs on Apache Spark Structured Streaming

EMR
Elastic MapReduce

What is EMR?
• Elastic MapReduce
• Managed Hadoop framework on EC2
instances
• Includes Spark, HBase, Presto, Flink,
Hive & more
• EMR Notebooks
• Several integration points with AWS

An EMR Cluster

• Master node: manages the cluster
• Single EC2 instance
• Core node: Hosts HDFS data and runs tasks
• Can be scaled up & down, but with some risk
• Task node: Runs tasks, does not host data
• No risk of data loss when removing
• Good use of spot instances
[Diagram: a master node connected to two core nodes and a task node]

EMR Usage
• Transient vs Long-Running Clusters
• Can spin up task nodes using Spot instances for temporary capacity
• Can use reserved instances on long-running clusters to save $
• Connect directly to master to run jobs
• Submit ordered steps via the console

sundog-education.com
datacumulus.com
© 2019 All Rights Reserved Worldwide
NOT FOR DISTRIBUTION © Stephane Maarek www.datacumulus.com
EMR / AWS Integration
• Amazon EC2 for the instances that comprise the nodes in the
cluster
• Amazon VPC to configure the virtual network in which you
launch your instances
• Amazon S3 to store input and output data
• Amazon CloudWatch to monitor cluster performance and
configure alarms
• AWS IAM to configure permissions
• AWS CloudTrail to audit requests made to the service
• AWS Data Pipeline to schedule and start your clusters

EMR Storage
• HDFS
• EMRFS: access S3 as if it were HDFS
• EMRFS Consistent View – Optional for
S3 consistency
• Uses DynamoDB to track consistency
• Local file system
• EBS for HDFS

EMR promises
• EMR charges by the hour
• Plus EC2 charges
• Provisions new nodes if a
core node fails
• Can add and remove task nodes on the fly
• Can resize a running cluster’s
core nodes

So… what’s Hadoop?

MapReduce

YARN

HDFS

Apache Spark

MapReduce Spark

YARN

HDFS

How Spark Works
[Diagram: Driver Program (Spark Context) ↔ Cluster Manager (Spark, YARN) ↔ Executors, each holding a Cache and running Tasks]

Spark Components

Spark Streaming Spark SQL MLLib GraphX

SPARK CORE

Spark Structured Streaming

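import org.apache.spark.sql.functions.window  // for window(...)
import spark.implicits._                      // for $"..."; assumes an existing SparkSession named spark

val inputDF = spark.readStream.json("s3://logs")

inputDF.groupBy($"action", window($"time", "1 hour")).count()
  .writeStream.format("jdbc").start("jdbc:mysql://...")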

Spark Streaming + Kinesis

[Diagram: Kinesis Producer(s) → Spark Dataset implemented from KCL]

Spark + Redshift
• The spark-redshift package lets Spark load datasets from Redshift
• It’s a Spark SQL data source
• Useful for ETL using Spark

(Spark)

Apache Hive

Hive

MapReduce Tez

Hadoop YARN

Why Hive?
• Uses familiar SQL syntax (HiveQL)
• Interactive
• Scalable – works with “big data” on a
cluster
• Really most appropriate for data warehouse
applications
• Easy OLAP queries – WAY easier than
writing MapReduce in Java
• Highly optimized
• Highly extensible
• User defined functions
• Thrift server
• JDBC / ODBC driver

The Hive Metastore
• Hive maintains a “metastore” that imparts a structure you
define on the unstructured data that is stored on HDFS
etc.
CREATE TABLE ratings (
    userID INT,
    movieID INT,
    rating INT,
    time INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '${env:HOME}/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;
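
To illustrate the idea of imparting structure on raw files at read time, here is a minimal plain-Python sketch. It is not Hive; the schema list mirrors the ratings columns, while the function name and the two sample lines are made up for the example.

```python
import csv
import io

# Column names and types mirroring the ratings table definition
RATINGS_SCHEMA = [("userID", int), ("movieID", int), ("rating", int), ("time", int)]

def read_with_schema(text, schema, delimiter="\t"):
    """Apply a schema to raw delimited text at read time (schema on read)."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows

# Made-up sample lines in a tab-separated userID / movieID / rating / time layout
raw = "1\t50\t5\t1000\n2\t50\t3\t1001\n"
ratings = read_with_schema(raw, RATINGS_SCHEMA)
print(ratings[0])
```

The raw text itself is never changed; only the interpretation applied when reading it is defined by the schema.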

External Hive Metastores
• Metastore is stored in MySQL
on the master node by default
• External metastores offer
better resiliency / integration
• AWS Glue Data Catalog
• Amazon RDS

[Diagram: Hive metastore hosted in the AWS Glue Data Catalog or Amazon RDS]

Other Hive / AWS integration points
• Load table partitions from S3
• Write tables in S3
• Load scripts from S3
• DynamoDB as an external table

Apache Pig
• Writing mappers and reducers
by hand takes a long time.
• Pig introduces Pig Latin, a
scripting language that lets
you use SQL-like syntax to
define your map and reduce
steps.
• Highly extensible with user-defined
functions (UDFs)
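
For a sense of what writing mappers and reducers by hand involves, here is a minimal plain-Python sketch of a word count split into map, shuffle, and reduce steps. It uses neither Pig Latin nor the Hadoop API, and the function names are illustrative assumptions.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reducer: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["pig hive spark", "hive spark", "spark"]
pairs = [pair for line in lines for pair in map_words(line)]
word_counts = reduce_counts(shuffle(pairs))
print(word_counts)
```

A higher-level language such as Pig Latin expresses the same grouping and counting in a couple of statements and generates these steps for you.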

How Pig Works

MapReduce

YARN

HDFS

Pig / AWS Integration
• Ability to use multiple file systems
(not just HDFS)
• i.e., query data in S3
• Load JAR’s and scripts from S3

HBase
• Non-relational, petabyte-scale database
• Based on Google’s BigTable, on top of HDFS
• In-memory
• Hive integration

Sounds a lot like DynamoDB
• Both are NoSQL databases intended for the same sorts of things
• But if you’re all-in with AWS anyhow, DynamoDB has advantages
• Fully managed (auto-scaling)
• More integration with other AWS services
• Glue integration
• HBase has some advantages though:
• Efficient storage of sparse data
• Appropriate for high frequency counters (consistent reads & writes)
• High write & update throughput
• More integration with Hadoop

HBase / AWS integration
• Can store data (StoreFiles
and metadata) on S3 via
EMRFS
• Can back up to S3

Presto
• It can connect to many different “big data” databases and data
stores at once, and query across them
• Interactive queries at petabyte scale
• Familiar SQL syntax
• Optimized for OLAP – analytical queries, data warehousing
• Developed, and still partially maintained, by Facebook
• This is what Amazon Athena uses under the hood
• Exposes JDBC, Command-Line, and Tableau interfaces

Presto connectors
• HDFS
• S3
• Cassandra
• MongoDB
• HBase
• SQL
• Redshift
• Teradata

Apache Zeppelin

• If you’re familiar with IPython notebooks – it’s like that


• Lets you interactively run scripts / code against your data
• Can interleave with nicely formatted notes
• Can share notebooks with others on your cluster
• Spark, Python, JDBC, HBase, Elasticsearch + more

Zeppelin + Spark
• Can run Spark code interactively (like
you can in the Spark shell)
• This speeds up your development cycle
• And allows easy experimentation and
exploration of your big data
• Can execute SQL queries directly
against SparkSQL
• Query results may be visualized in
charts and graphs
• Makes Spark feel more like a data
science tool!

EMR Notebook
• Similar concept to Zeppelin, with
more AWS integration
• Notebooks backed up to S3
• Provision clusters from the
notebook!
• Hosted inside a VPC
• Accessed only via AWS console

Hue
• Hadoop User Experience
• Graphical front-end for
applications on your EMR
cluster
• IAM integration: Hue superusers
inherit IAM roles
• S3: Can browse & move data
between HDFS and S3

Splunk
• Splunk / Hunk “makes machine data accessible, usable, and
valuable to everyone”
• Operational tool – can be used to visualize EMR and S3 data
using your EMR Hadoop cluster.

Flume
• Another way to stream data into your cluster
• Made from the start with Hadoop in mind
• Built-in sinks for HDFS and HBase
• Originally made to handle log aggregation

Web servers → Source → Channel → Sink → HBase

MXNet
• Like TensorFlow, a library for
building and accelerating neural
networks
• Included on EMR

S3DistCp
• Tool for copying large amounts of
data
• From S3 into HDFS
• From HDFS into S3
• Uses MapReduce to copy in a
distributed manner
• Suitable for parallel copying of
large numbers of objects
• Across buckets, across accounts

Other EMR / Hadoop Tools
• Ganglia (monitoring)
• Mahout (machine learning)
• Accumulo (another NoSQL database)
• Sqoop (relational database connector)
• HCatalog (table and storage management for Hive metastore)
• Kinesis Connector (directly access Kinesis streams in your scripts)
• Tachyon (accelerator for Spark)
• Derby (open-source relational DB in Java)
• Ranger (data security manager for Hadoop)
• Install whatever you want

EMR Security
• IAM policies
• Kerberos
• SSH
• IAM roles

EMR: Choosing Instance Types
• Master node:
• m4.large if < 50 nodes, m4.xlarge if > 50 nodes
• Core & task nodes:
• m4.large is usually good
• If cluster waits a lot on external dependencies (e.g., a web crawler), t2.medium
• Improved performance: m4.xlarge
• Computation-intensive applications: high CPU instances
• Database, memory-caching applications: high memory instances
• Network / CPU-intensive (NLP, ML) – cluster compute instances
• Spot instances
• Good choice for task nodes
• Only use on core & master if you’re testing or very cost-sensitive; you’re
risking partial data loss

Amazon Machine
Learning
ML with linear and logistic regression

Machine Learning 101
• Machine learning systems
predict some unknown property
of an item, given its other
properties
• Examples:
• How much will this house sell for?
• What is this a picture of?
• Is this biopsy result malignant?
• Is this financial transaction
fraudulent?

Supervised Learning
• Supervised machine learning systems
are trained
• The property we want to predict is called
a label
• Our training data set contains labels
known to be correct, together with the
other attributes of the data (e.g., known
house sale price given its location, # of
bedrooms, square feet, etc.)
• This training data is used to build a
model that can then make predictions of
unknown labels

Train / Test
• Your training data can be randomly split into a training set and
a test set
• Only the training set is used to train the model
• The model is then used on the test set
• We can then measure the accuracy of the predicted labels vs.
their actual labels

Training Set Test Set
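
The train / test procedure above can be sketched in plain Python. This is a toy illustration, not Amazon ML: the data set, the 80/20 split, the fixed seed, and the one-feature threshold "model" are all assumptions chosen to keep the example small.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and cut it into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows):
    """'Train' a one-feature classifier: midpoint between the two classes."""
    highest_negative = max(x for x, label in train_rows if label == 0)
    lowest_positive = min(x for x, label in train_rows if label == 1)
    return (highest_negative + lowest_positive) / 2

def accuracy(predicted, actual):
    """Fraction of predicted labels that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Toy labelled data: (feature, label), where the label is 1 when feature >= 5
data = [(x, int(x >= 5)) for x in range(10)]
train, test = train_test_split(data)

threshold = fit_threshold(train)                      # only the training set is used here
predictions = [int(x > threshold) for x, _ in test]   # the model is applied to the test set
test_accuracy = accuracy(predictions, [label for _, label in test])
print(len(train), len(test), test_accuracy)
```

Because the test rows were never seen during training, the measured accuracy is a fairer estimate of how the model behaves on new data.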

Types of models in Amazon ML
Example                                      Model type
What price will this house sell for?         Regression
What is this a picture of?                   Multiclass Classification
Is this biopsy result malignant?             Binary Classification
Is this financial transaction fraudulent?    Binary Classification

Confusion Matrix
• A way to visualize the accuracy of multiclass classification
predictive models

                Predicted label
True label      Dog     Cat     Fish
Dog             1.00    0.00    0.00
Cat             0.00    0.62    0.38
Fish            0.00    0.00    1.00
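
A matrix like the one above can be computed from pairs of true and predicted labels. The sketch below is a minimal plain-Python version that normalizes each row to fractions; the label lists are made-up counts, chosen so that the Cat row lands close to the values shown.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true labels, columns are predicted labels, values are row fractions."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([round(n / total, 2) if total else 0.0 for n in row])
    return matrix

classes = ["Dog", "Cat", "Fish"]
# Made-up example: 4 dogs, 8 cats (3 of them mistaken for fish), 4 fish
true_labels = ["Dog"] * 4 + ["Cat"] * 8 + ["Fish"] * 4
predicted_labels = ["Dog"] * 4 + ["Cat"] * 5 + ["Fish"] * 3 + ["Fish"] * 4
matrix = confusion_matrix(true_labels, predicted_labels, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

Values on the diagonal are correct predictions; anything off the diagonal shows which classes the model confuses with each other.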
Hyperparameters
• Machine learning models often depend on tuning the
parameters of the model itself
• This is called hyperparameter tuning
• Parameters in Amazon ML include:
• Learning rate
• Model size
• Number of passes
• Data shuffling
• Regularization

Amazon Machine Learning (ML)
• Provides visualization tools & wizards to
make creating a model easy
• You point it to training data in S3, Redshift,
or RDS
• It builds a model that can make predictions
using batches or a low-latency API
• Can do train/test and evaluate your model
• Fully managed
• Honestly it’s a bit outdated now

“Ideal Usage Patterns”
• Flag suspicious transactions (fraud
detection)
• Forecasting product demand
• Personalization – predict items a
user will be interested in
• Predict user activity (we’ll do this)
• Classify social media (does this
Tweet require my attention?)

Amazon ML: Cost Model
• “Pay for what you use”
• Charged for compute time
• Number of predictions
• Memory used to run your model
• Compute-hours for training

Amazon ML: Promises & Limitations
• No downtime
• Up to 100GB training data
(more via support ticket)
• Up to 5 simultaneous jobs
(more via support ticket)

Amazon ML: Anti-Patterns
• Terabyte-scale data
• Unsupported learning tasks
• Sequence prediction
• Unsupervised clustering
• Deep learning
• EMR / Spark is an (unmanaged)
alternative.

Amazon SageMaker
Scalable, fully-managed machine learning

SageMaker modules

Build

Train

Deploy
SageMaker is powerful
• TensorFlow
• Apache MXNet
• GPU accelerated deep
learning
• Scaling effectively
unlimited
• Hyperparameter tuning
jobs

Jupyter Notebooks

SageMaker Security
• Code stored in “ML storage
volumes”
• Controlled by security groups
• Optionally encrypted at rest
• All artifacts encrypted in transit
and at rest
• API & console secured by SSL
• IAM roles
• Encrypted S3 buckets for data
• KMS integration for SageMaker
notebooks, training jobs,
endpoints

SageMaker Operations

Deep Learning 101
And AWS Best Practices

The biological inspiration
• Neurons in your cerebral cortex are
connected via axons
• A neuron “fires” to the neurons it’s
connected to, when enough of its input
signals are activated.
• Very simple at the individual neuron
level – but layers of neurons connected
in this way can yield learning behavior.
• Billions of neurons, each with thousands
of connections, yields a mind

Cortical columns
• Neurons in your cortex seem to be
arranged into many stacks, or “columns”
that process information in parallel
• “mini-columns” of around 100 neurons
are organized into larger “hyper-columns”.
There are 100 million mini-columns in your cortex
• This is coincidentally similar to how GPUs work…
(credit: Marcel Oberlaender et al.)

Deep Neural Networks
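[Diagram: Input 1 and Input 2 connect through weighted links (Weight 1, Weight 2) to a layer of summing (Σ) neurons with a bias neuron (1.0); that layer feeds a second layer of Σ neurons, also with a bias neuron (1.0), whose outputs pass through softmax]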
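
The forward pass through a small network like this can be sketched in plain Python. Each neuron computes a weighted sum (Σ) of its inputs plus a bias term, and softmax turns the final sums into probabilities. The layer sizes, the weights, and the ReLU activation on the hidden layer are arbitrary assumptions for illustration.

```python
import math

def neuron(inputs, weights, bias_weight):
    """Weighted sum of the inputs, plus the bias neuron (fixed at 1.0) times its weight."""
    return sum(x * w for x, w in zip(inputs, weights)) + 1.0 * bias_weight

def relu(value):
    """Activation assumed for the hidden layer: pass positives, clamp negatives to zero."""
    return max(0.0, value)

def softmax(values):
    """Turn raw output sums into probabilities that add up to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def forward(inputs, hidden_layer, output_layer):
    """Run inputs through one hidden layer and one output layer, then softmax."""
    hidden = [relu(neuron(inputs, w, b)) for w, b in hidden_layer]
    outputs = [neuron(hidden, w, b) for w, b in output_layer]
    return softmax(outputs)

# Arbitrary illustrative weights: 2 inputs -> 3 hidden neurons -> 3 outputs
hidden_layer = [([0.5, -0.2], 0.1), ([0.3, 0.8], 0.0), ([-0.6, 0.4], 0.2)]
output_layer = [([1.0, 0.0, -1.0], 0.0), ([0.2, 0.5, 0.3], 0.1), ([-0.4, 0.9, 0.6], -0.1)]
probabilities = forward([1.0, 2.0], hidden_layer, output_layer)
print(probabilities)
```

Training consists of adjusting those weights so that the output probabilities match known labels; frameworks do this automatically.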

Deep Learning Frameworks
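• TensorFlow / Keras
• MXNet
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

# Two hidden layers of 64 units with dropout, then a 10-class softmax output
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# Stochastic gradient descent with Nesterov momentum
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd, metrics=['accuracy'])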

Types of Neural Networks
• Feedforward Neural Network
• Convolutional Neural Networks
(CNN)
• Image classification (is there a
stop sign in this image?)
• Recurrent Neural Networks
(RNNs)
• Deals with sequences in time
(predict stock prices, understand
words in a sentence, etc.)
• LSTM, GRU

Deep Learning on EC2 / EMR
• EMR supports Apache MXNet and
GPU instance types
• Appropriate instance types for deep
learning:
• P3: 8 Tesla V100 GPU’s
• P2: 16 K80 GPU’s
• G3: 4 M60 GPU’s (all Nvidia chips)
• Deep Learning AMI’s

AWS Data Pipeline
A high-level overview

Data Pipeline example

Data Pipeline Features
• Destinations include S3, RDS,
DynamoDB, Redshift and EMR
• Manages task dependencies
• Retries and notifies on failures
• Cross-region pipelines
• Precondition checks
• Data sources may be on-premises
• Highly available

Data Pipeline Activities
• EMR
• Hive
• Copy
• SQL
• Scripts

AWS Step Functions
A high-level overview

AWS Step Functions
• Use to design workflows
• Easy visualizations
• Advanced Error Handling and Retry
mechanism outside the code
• Audit of the history of workflows
• Ability to “Wait” for an arbitrary amount of
time
• Max execution time of a State Machine is 1
year
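
As a rough illustration of retries, error handling, and waiting living outside the task code, the sketch below builds a small state machine definition as a Python dictionary and serializes it to JSON. The field names (Retry, Catch, ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate, Wait, Seconds) follow the Amazon States Language; the state names and the Lambda ARN are hypothetical placeholders.

```python
import json

# Hypothetical ARN used only for illustration
TRAIN_FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:train-model"

state_machine = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",
            "Resource": TRAIN_FUNCTION_ARN,
            # Retry policy declared in the workflow, not in the function code
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # If retries are exhausted, route to a fallback state
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "WaitBeforeGivingUp"}],
            "End": True,
        },
        "WaitBeforeGivingUp": {"Type": "Wait", "Seconds": 60, "Next": "Failed"},
        "Failed": {"Type": "Fail", "Error": "TrainingFailed"},
    },
}
definition = json.dumps(state_machine, indent=2)
print(definition)
```

The resulting JSON is the kind of document one would supply as a state machine definition; the retry intervals and wait time here are arbitrary values.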

Step Functions – Examples
Train a Machine Learning Model

Step Functions – Examples
Tune a Machine Learning Model

Step Functions – Examples
Manage a Batch Job

Kinesis Analytics
Querying streams of data

Conceptually…

Analytics
Tools

In more depth…

[Diagram: Input Stream(s) and a Reference table feed the query below, which writes to Output Stream(s) and an Error Stream]

SELECT STREAM ItemID, COUNT(*)
FROM SourceStream
GROUP BY ItemID;

Common use-cases
• Streaming ETL
• Continuous metric generation
• Responsive analytics

Kinesis Analytics
• Pay only for resources consumed (but it’s not cheap)
• Serverless; scales automatically
• Use IAM permissions to access streaming source and
destination(s)
• Schema discovery

RANDOM_CUT_FOREST
• SQL function used for anomaly
detection on numeric columns in a
stream
• They’re especially proud of this
because they published a paper on it
• It’s a novel way to identify outliers in a
data set so you can handle them
however you need to
• Example: detect anomalous subway
ridership during the NYC marathon

Amazon Elasticsearch
Service
Petabyte-scale analysis and reporting

What is Elasticsearch?
• The Elastic Stack
• A search engine
• An analysis tool
• A visualization tool (Kibana)
• A data pipeline (Beats /
LogStash)
• You can use Kinesis too
• Horizontally scalable

What is Kibana?

Elasticsearch applications
• Full-text search
• Log analytics
• Application monitoring
• Security analytics
• Clickstream analytics

Elasticsearch concepts

documents
Documents are the things you’re searching for. They can be more than text – any structured JSON data works. Every document has a unique ID, and a type.

types
A type defines the schema and mapping shared by documents that represent the same sort of thing. (A log entry, an encyclopedia article, etc.)

indices
An index powers search into all documents within a collection of types. They contain inverted indices that let you search across everything within them at once.

An index is split into shards
Documents are hashed to a particular shard.

1 2 3 …

Shakespeare

Each shard may be on a different node in a cluster.


Every shard is a self-contained Lucene index of its own.
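
The idea of hashing a document to a shard can be sketched in a few lines of Python. This is only an illustration of stable hash-then-modulo routing: MD5 is used here as a stand-in hash, which is an assumption for the example and not the hash function Elasticsearch itself uses, and the document IDs are made up.

```python
import hashlib

def shard_for(doc_id, num_primary_shards):
    """Map a document ID to a shard number using a stable hash and a modulo."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# Made-up document IDs for a hypothetical index with 3 primary shards
doc_ids = ["hamlet-1", "hamlet-2", "macbeth-1", "othello-7"]
placement = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}
print(placement)
```

Because the shard is derived from the ID and the shard count, the same document always routes to the same shard, which is also why the number of primary shards cannot be changed casually after an index is created.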

Redundancy
This index has two primary shards and two replicas.
Your application should round-robin requests amongst nodes.

Node 1: Primary 1, Replica 0
Node 2: Replica 0, Replica 1
Node 3: Primary 0, Replica 1

Write requests are routed to the primary shard, then replicated


Read requests are routed to the primary or any replica

Amazon Elasticsearch Service
• Fully-managed (but not serverless)
• Scale up or down without downtime
• But this isn’t automatic
• Pay for what you use
• Instance-hours, storage, data transfer
• Network isolation
• AWS integration
• S3 buckets (via Lambda to Kinesis)
• Kinesis Data Streams
• DynamoDB Streams
• CloudWatch / CloudTrail
• Zone awareness

Amazon ES options
• Dedicated master node(s)
• Choice of count and instance types
• “Domains”
• Snapshots to S3
• Zone Awareness

Amazon ES Security
• Resource-based policies
• Identity-based policies
• IP-based policies
• Request signing
• VPC
• Cognito

Securing Kibana
• Cognito
• Getting inside a VPC from outside is hard…
• Nginx reverse proxy on EC2 forwarding to ES domain
• SSH tunnel for port 5601
• VPC Direct Connect
• VPN

[Diagram: an on-premises client with Active Directory reaches a reverse proxy in a subnet of a VPC in the AWS Cloud]

Amazon ES anti-patterns
• OLTP
• No transactions
• RDS or DynamoDB is better
• Ad-hoc data querying
• Athena is better
• Remember Amazon ES is primarily
for search & analytics

Amazon Athena
Serverless interactive queries of S3 data

What is Athena?
• Interactive query service for S3 (SQL)
• No need to load data, it stays in S3
• Presto under the hood
• Serverless!
• Supports many data formats
• CSV (human readable)
• JSON (human readable)
• ORC (columnar, splittable)
• Parquet (columnar, splittable)
• Avro (splittable)
• Unstructured, semi-structured, or structured

Some examples
• Ad-hoc queries of web logs
• Querying staging data before
loading to Redshift
• Analyze CloudTrail / CloudFront /
VPC / ELB etc logs in S3
• Integration with Jupyter, Zeppelin,
RStudio notebooks
• Integration with QuickSight
• Integration via ODBC / JDBC with
other visualization tools

Athena + Glue

Athena cost model
• Pay-as-you-go
• $5 per TB scanned
• Successful or cancelled queries
count, failed queries do not.
• No charge for DDL
(CREATE/ALTER/DROP etc.)
• Save LOTS of money by using
columnar formats
• ORC, Parquet
• Save 30-90%, and get better
performance
• Glue and S3 have their own
charges
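
A back-of-the-envelope sketch of the pay-per-scan model follows. It applies the $5 per TB figure quoted above; treating 1 TB as 2**40 bytes and the 25% scanned fraction for a columnar layout are assumptions made for the example, and the sketch ignores Glue and S3 charges.

```python
BYTES_PER_TB = 1024 ** 4      # assumption: 1 TB counted as 2**40 bytes
PRICE_PER_TB_USD = 5.0        # per-TB rate quoted above

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated charge for one query, based only on the data it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

def columnar_cost(bytes_in_table, fraction_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated charge when a columnar format lets the query read only part of the data."""
    return query_cost(bytes_in_table * fraction_scanned, price_per_tb)

table_bytes = 3 * BYTES_PER_TB
full_scan = query_cost(table_bytes)                             # row format: every byte is read
pruned_scan = columnar_cost(table_bytes, fraction_scanned=0.25)  # hypothetical: 1/4 of the bytes
print(full_scan, pruned_scan)
```

The saving comes entirely from scanning fewer bytes, which is why columnar formats reduce both cost and query time.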

Athena Security
• Access control
• IAM, ACLs, S3 bucket policies
• AmazonAthenaFullAccess /
AWSQuicksightAthenaAccess
• Encrypt results at rest in S3 staging directory
• Server-side encryption with S3-managed key
(SSE-S3)
• Server-side encryption with KMS key (SSE-KMS)
• Client-side encryption with KMS key (CSE-KMS)
• Cross-account access in S3 bucket policy
possible
• Transport Layer Security (TLS) encrypts in-
transit (between Athena and S3)

Athena anti-patterns
• Highly formatted reports /
visualization
• That’s what QuickSight is for
• ETL
• Use Glue instead

Amazon Redshift
Fully-managed, petabyte-scale data warehouse

What is Redshift?
• Fully-managed, petabyte scale data
warehouse service
• 10X better performance than other
DW’s
• Via machine learning, massively parallel
query execution, columnar storage
• Designed for OLAP, not OLTP
• Cost effective
• SQL, ODBC, JDBC interfaces
• Scale up or down on demand
• Built-in replication & backups
• Monitoring via CloudWatch / CloudTrail

Redshift Use-Cases
• Accelerate analytics workloads
• Unified data warehouse & data lake
• Data warehouse modernization
• Analyze global sales data
• Store historical stock trade data
• Analyze ad impressions & clicks
• Aggregate gaming data
• Analyze social trends

Redshift architecture

Client Client Client

JDBC / ODBC

Leader Node

Compute Node Compute Node



Node Slices Node Slices

Redshift Spectrum
• Query exabytes of unstructured data in
S3 without loading
• Limitless concurrency
• Horizontal scaling
• Separate storage & compute
resources
• Wide variety of data formats
• Support of Gzip and Snappy
compression

Redshift Performance
• Massively Parallel Processing (MPP)
• Columnar Data Storage
• Column Compression

Redshift Durability
• Replication within cluster
• Backup to S3
• Asynchronously replicated to
another region
• Automated snapshots
• Failed drives / nodes
automatically replaced
• However – limited to a single
availability zone (AZ)

Scaling Redshift
• Vertical and horizontal scaling
on demand
• During scaling:
• A new cluster is created while your
old one remains available for
reads
• CNAME is flipped to new cluster
(a few minutes of downtime)
• Data moved in parallel to new
compute nodes

Redshift Distribution Styles
• AUTO
• Redshift figures it out based on size of data
• EVEN
• Rows distributed across slices in round-robin
• KEY
• Rows distributed based on one column
• ALL
• Entire table is copied to every node
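
The three explicit styles can be contrasted with a small plain-Python simulation. This is a conceptual sketch only: the CRC32 hash, the row layout, and the function names are assumptions for illustration and do not reflect how Redshift implements distribution internally.

```python
from collections import defaultdict
import zlib

def distribute_even(rows, num_slices):
    """EVEN: deal rows out to slices round-robin."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices

def distribute_key(rows, num_slices, key):
    """KEY: rows with the same key value always land on the same slice."""
    slices = defaultdict(list)
    for row in rows:
        slot = zlib.crc32(str(row[key]).encode("utf-8")) % num_slices
        slices[slot].append(row)
    return slices

def distribute_all(rows, num_nodes):
    """ALL: every node gets a full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}

# Made-up rows for illustration
rows = [{"customer": c, "amount": a}
        for c, a in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]]

even = distribute_even(rows, num_slices=2)
by_key = distribute_key(rows, num_slices=2, key="customer")
everywhere = distribute_all(rows, num_nodes=2)
print({s: len(r) for s, r in even.items()})
print({s: [row["customer"] for row in r] for s, r in by_key.items()})
```

EVEN balances row counts, KEY keeps matching values together (useful when joining on that column), and ALL trades storage for never having to move a small table between nodes.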

EVEN distribution

Compute Node Compute Node

Node Slices Node Slices

KEY distribution

Key 1 Key 2 Key 3 Key 4 Key 5 Key 6

Compute Node Compute Node

Node Slices Node Slices

ALL distribution

Compute Node Compute Node

Node Slices Node Slices

Redshift Sort Keys
• Rows are stored on disk in sorted
order based on the column you
designate as a sort key
• Like an index
• Makes for fast range queries
• Choosing a sort key
• Recency? Filtering? Joins?
• Single vs. Compound vs. Interleaved
sort keys

Sort Keys: Single Column

Date Genre Movie


3/18/2019 Comedy Monty Python and the Holy Grail
3/18/2019 Adventure Indiana Jones and the Temple of Doom
3/18/2019 Drama Interstellar
3/18/2019 Drama The Dark Knight
3/19/2019 Fantasy The Lord of the Rings
3/19/2019 Drama 12 Angry Men
3/19/2019 Adventure Inception

Sort Keys: Compound

Date Genre Movie


3/18/2019 Adventure Indiana Jones and the Temple of Doom
3/18/2019 Comedy Monty Python and the Holy Grail
3/18/2019 Drama Interstellar
3/18/2019 Drama The Dark Knight
3/19/2019 Adventure Inception
3/19/2019 Drama 12 Angry Men
3/19/2019 Fantasy The Lord of the Rings
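The single-column and compound orderings in the two tables can be reproduced with a short Python sketch. It only imitates the row order with in-memory sorting; the `rows_in_date_range` helper is a hypothetical illustration of why sorted storage makes range filters cheap, not how Redshift is implemented.

```python
from bisect import bisect_left, bisect_right
from datetime import date

movies = [
    (date(2019, 3, 18), "Comedy", "Monty Python and the Holy Grail"),
    (date(2019, 3, 18), "Adventure", "Indiana Jones and the Temple of Doom"),
    (date(2019, 3, 18), "Drama", "Interstellar"),
    (date(2019, 3, 18), "Drama", "The Dark Knight"),
    (date(2019, 3, 19), "Fantasy", "The Lord of the Rings"),
    (date(2019, 3, 19), "Drama", "12 Angry Men"),
    (date(2019, 3, 19), "Adventure", "Inception"),
]

# Single-column sort key: order by date only (ties keep their input order).
single = sorted(movies, key=lambda row: row[0])

# Compound sort key: order by date first, then by genre within each date.
compound = sorted(movies, key=lambda row: (row[0], row[1]))


def rows_in_date_range(sorted_rows, start, end):
    """Range filter on the leading sort column: a binary search finds one contiguous run."""
    dates = [row[0] for row in sorted_rows]
    return sorted_rows[bisect_left(dates, start):bisect_right(dates, end)]


march_19 = rows_in_date_range(compound, date(2019, 3, 19), date(2019, 3, 19))
```

A filter on genre alone would not benefit from the compound ordering, because genre is only sorted within each date.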

Sort Keys: Interleaved
Date Genre Movie
3/18/2019 Adventure Indiana Jones and the Temple of Doom
3/18/2019 Comedy Monty Python and the Holy Grail
3/18/2019 Drama Interstellar
3/18/2019 Drama The Dark Knight
3/19/2019 Adventure Inception
3/19/2019 Drama 12 Angry Men
3/19/2019 Fantasy The Lord of the Rings

Date Genre Movie
3/19/2019 Drama 12 Angry Men
3/19/2019 Adventure Inception
3/18/2019 Adventure Indiana Jones and the Temple of Doom
3/18/2019 Drama Interstellar
3/18/2019 Comedy Monty Python and the Holy Grail
3/18/2019 Drama The Dark Knight
3/19/2019 Fantasy The Lord of the Rings

Date Genre Movie


3/18/2019 Adventure Indiana Jones and the Temple of Doom
3/19/2019 Adventure Inception
3/18/2019 Comedy Monty Python and the Holy Grail
3/18/2019 Drama Interstellar
3/18/2019 Drama The Dark Knight
3/19/2019 Drama 12 Angry Men
3/19/2019 Fantasy The Lord of the Rings

Importing / Exporting data
• COPY command
• Parallelized; efficient
• From S3, EMR, DynamoDB, remote
hosts
• From S3: needs authorization (typically an IAM role); a manifest file can list the exact files to load
• UNLOAD command
• Unload from a table into files in S3
• Enhanced VPC routing

COPY command: More depth
• Use COPY to load large amounts of data from outside of
Redshift
• If your data is already in Redshift in another table,
• Use INSERT INTO …SELECT
• Or CREATE TABLE AS
• COPY can decrypt data as it is loaded from S3
• Hardware-accelerated SSL used to keep it fast
• Gzip, lzop, and bzip2 compression supported to speed it
up further
• Automatic compression option
• Analyzes data being loaded and figures out optimal compression
scheme for storing it
• Special case: narrow tables (lots of rows, few columns)
• Load with a single COPY transaction if possible
• Otherwise hidden metadata columns consume too much space
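The options above can be sketched as a small Python helper that assembles a COPY statement for an S3 load. The helper name, table, bucket path, and role ARN are hypothetical placeholders; the keywords used (IAM_ROLE, MANIFEST, GZIP / LZOP / BZIP2, COMPUPDATE) are standard COPY parameters. The values are interpolated directly, so this is only suitable for trusted, hard-coded inputs.

```python
def build_copy_statement(table, s3_path, iam_role_arn,
                         compression=None, manifest=False, auto_compression=True):
    """Assemble a Redshift COPY statement for loading files from S3."""
    if compression not in (None, "GZIP", "LZOP", "BZIP2"):
        raise ValueError(f"unsupported compression: {compression}")
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
    ]
    if manifest:
        # s3_path then points at a manifest file listing the exact files to load.
        parts.append("MANIFEST")
    if compression:
        # Tells COPY that the input files are compressed in this format.
        parts.append(compression)
    # Automatic compression: let COPY analyze the data and pick column encodings.
    parts.append("COMPUPDATE ON" if auto_compression else "COMPUPDATE OFF")
    return "\n".join(parts) + ";"


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_path="s3://example-bucket/sales/2019/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    compression="GZIP",
)
```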

Redshift copy grants for cross-region
snapshot copies
• Let’s say you have a KMS-encrypted Redshift cluster and a
snapshot of it
• You want to copy that snapshot to another region for backup
• In the destination AWS region:
• Create a KMS key if you don’t have one already
• Specify a unique name for your snapshot copy grant
• Specify the KMS key ID for which you’re creating the copy grant
• In the source AWS region:
• Enable copying of snapshots to the copy grant you just created

DBLINK
• Connect Redshift to PostgreSQL (possibly in RDS)
• Good way to copy and sync data between PostgreSQL and Redshift
[Diagram: PostgreSQL instance]

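-- Run on the PostgreSQL instance: enable the foreign data wrapper and dblink
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;

-- Register the Redshift cluster as a foreign server (SSL required)
CREATE SERVER foreign_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '<amazon_redshift_ip>', port '<port>', dbname '<database_name>', sslmode 'require');

-- Map the local PostgreSQL user to the Redshift credentials
CREATE USER MAPPING FOR <rds_postgresql_username>
    SERVER foreign_server
    OPTIONS (user '<amazon_redshift_username>', password '<password>');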

Integration with other services
• S3
• DynamoDB
• EMR / EC2
• Data Pipeline
• Database Migration Service

Redshift Workload Management (WLM)
• Prioritize short, fast queries vs. long, slow queries
• Query queues
• Via console, CLI, or API

Concurrency Scaling
• Automatically adds cluster capacity to
handle increase in concurrent read
queries
• Supports virtually unlimited concurrent
users & queries
• WLM queues manage which queries
are sent to the concurrency scaling
cluster

Automatic Workload Management
• Creates up to 8 queues
• Default 5 queues with even memory
allocation
• Large queries (e.g. big hash joins) ->
concurrency lowered
• Small queries (e.g. inserts, scans,
aggregations) -> concurrency raised
• Configuring query queues
• Priority
• Concurrency scaling mode
• User groups
• Query groups
• Query monitoring rules

Manual Workload Management
• One default queue with concurrency level of 5 (5 queries at
once)
• Superuser queue with concurrency level 1
• Define up to 8 queues, up to concurrency level 50
• Each can have defined concurrency scaling mode, concurrency level,
user groups, query groups, memory, timeout, query monitoring rules
• Can also enable query queue hopping
• Timed out queries “hop” to next queue to try again
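Queue hopping can be pictured with a toy Python model. The `Queue` class, the queue names, and the timeouts are all hypothetical, and the model simply tries queues in list order; in a real cluster the next queue is chosen by the WLM assignment rules.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    timeout_s: float  # WLM timeout for the queue; float("inf") means no timeout


def run_with_hopping(runtime_s, queues):
    """Return (queues visited, queue where the query finished or None)."""
    visited = []
    for queue in queues:
        visited.append(queue.name)
        if runtime_s <= queue.timeout_s:
            # The query completes before this queue's timeout.
            return visited, queue.name
        # Otherwise the query times out here and hops to the next queue.
    return visited, None


# Hypothetical queue layout, for illustration only.
queues = [Queue("short", 10), Queue("medium", 60), Queue("long", float("inf"))]
visited, finished_in = run_with_hopping(45, queues)
```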

Short Query Acceleration (SQA)
• Prioritize short-running queries over longer-running ones
• Short queries run in a dedicated space, won’t wait in queue
behind long queries
• Can be used in place of WLM queues for short queries
• Works with:
• CREATE TABLE AS (CTAS)
• Read-only queries (SELECT statements)
• Uses machine learning to predict a query’s execution time
• Can configure how many seconds is “short”

Resizing Redshift Clusters
• Elastic resize
• Quickly add or remove nodes of same type
• (It *can* change node types, but not without dropping
connections – it creates a whole new cluster)
• Cluster is down for a few minutes
• Tries to keep connections open across the
downtime
• Limited to doubling or halving for some dc2 and
ra3 node types.
• Classic resize
• Change node type and/or number of nodes
• Cluster is read-only for hours to days
• Snapshot, restore, resize
• Used to keep cluster available during a classic
resize
• Copy cluster, resize new cluster

VACUUM command
• Recovers space from deleted
rows
• VACUUM FULL
• VACUUM DELETE ONLY
• VACUUM SORT ONLY
• VACUUM REINDEX

New Redshift features for 2020
• RA3 nodes with managed storage
• Enable independent scaling of compute and storage
• Redshift data lake export
• Unload Redshift query to S3 in Apache Parquet format
• Parquet is 2x faster to unload and consumes up to 6X less storage
• Compatible with Redshift Spectrum, Athena, EMR, SageMaker
• Automatically partitioned

Redshift anti-patterns
• Small data sets
• Use RDS instead
• OLTP
• Use RDS or DynamoDB instead
• Unstructured data
• ETL first with EMR etc.
• BLOB data
• Store references to large binary
files in S3, not the files themselves.

Amazon RDS
Relational Database Service

What is RDS?
• Hosted relational database
• Amazon Aurora
• MySQL
• PostgreSQL
• MariaDB
• Oracle
• SQL Server
• Not for “big data”
• Might appear on exam as an example of what not to use
• Or in the context of migrating from RDS to Redshift etc.

ACID
• RDS databases offer full
ACID compliance
• Atomicity
• Consistency
• Isolation
• Durability

Amazon Aurora
• MySQL- and PostgreSQL-compatible
• Up to 5X faster than MySQL, 3X faster than PostgreSQL
• 1/10 the cost of commercial databases
• Up to 64TB per database instance
• Up to 15 read replicas
• Continuous backup to S3
• Replication across availability zones
• Automatic scaling with Aurora Serverless

Aurora Security
• VPC network isolation
• At-rest with KMS
• Data, backup, snapshots, and replicas
can be encrypted
• In-transit with SSL

Amazon QuickSight
Business analytics and visualizations in the cloud

What is QuickSight?
• Fast, easy, cloud-powered business
analytics service
• Allows all employees in an organization
to:
• Build visualizations
• Perform ad-hoc analysis
• Quickly get business insights from data
• Anytime, on any device (browsers, mobile)
• Serverless

QuickSight Data Sources
• Redshift
• Aurora / RDS
• Athena
• EC2-hosted databases
• Files (S3 or on-premises)
• Excel
• CSV, TSV
• Common or extended log format
• Data preparation allows limited ETL

SPICE
• Data sets are imported into SPICE
• Super-fast, Parallel, In-memory
Calculation Engine
• Uses columnar storage, in-memory,
machine code generation
• Accelerates interactive queries on large
datasets
• Each user gets 10GB of SPICE
• Highly available / durable
• Scales to hundreds of thousands of
users

QuickSight Use Cases
• Interactive ad-hoc exploration / visualization of data
• Dashboards and KPIs
• Stories
• Guided tours through specific views of an analysis
• Convey key points, thought process, evolution of an analysis
• Analyze / visualize data from:
• Logs in S3
• On-premises databases
• AWS (RDS, Redshift, Athena, S3)
• SaaS applications, such as Salesforce
• Any JDBC/ODBC data source

QuickSight Anti-Patterns
• Highly formatted canned reports
• QuickSight is for ad-hoc queries,
analysis, and visualization
• ETL
• Use Glue instead, although
QuickSight can do some
transformations

QuickSight Security
• Multi-factor authentication on your
account
• VPC connectivity
• Add QuickSight’s IP address range to
your database security groups
• Row-level security
• Private VPC access
• Elastic Network Interface, AWS Direct
Connect

QuickSight User Management
• Users defined via IAM, or
email signup
• Active Directory integration
with QuickSight Enterprise
Edition

QuickSight Pricing
• Annual subscription
• Standard: $9 / user /month
• Enterprise: $18 / user / month
• Extra SPICE capacity (beyond 10GB)
• $0.25 (standard) $0.38 (enterprise) / GB / user / month
• Month to month
• Standard: $12 / user / month
• Enterprise: $24 / user / month
• Enterprise edition
• Encryption at rest
• Microsoft Active Directory integration

QuickSight Dashboards

Image: AWS Big Data Blog

QuickSight Machine Learning Insights
• ML-powered anomaly detection
• Uses Random Cut Forest
• Identify top contributors to significant changes in
metrics
• ML-powered forecasting
• Also uses Random Cut Forest
• Detects seasonality and trends
• Excludes outliers and imputes missing values
• Autonarratives
• Adds “story of your data” to your dashboards
• Suggested Insights
• “Insights” tab displays ready-to-use suggested
insights

QuickSight Visual Types
• AutoGraph
• Bar Charts
• For comparison and
distribution (histograms)
• Line graphs
• For changes over time
• Scatter plots, heat maps
• For correlation
• Pie graphs, tree maps
• For aggregation
• Pivot tables
• For tabular data
• Stories

Additional Visual Types
• KPIs
• Geospatial Charts (maps)
• Donut Charts
• Gauge Charts
• Word Clouds

Bar Charts: Comparison, Distribution

Image: DanielPenfield, Wikipedia CC BY-SA 3.0

Line Charts: Changes over Time

Scatter Plots: Correlation

Heat Maps: Correlation

Pie Charts: Aggregation

Image: M.W. Toews, Wikipedia, CC BY-SA 4.0

Donut Charts: Percentage of Total
Amount

Gauge Charts: Compare values in a
measure

Tree Maps: Hierarchical Aggregation

Image: Harvard-MIT Observatory of Economic Complexity, Wikipedia, CC-BY-SA 3.0

Pivot Tables: Tabular Data

KPIs: compare key value to its target value

Geospatial Charts

Word Clouds: word or phrase frequency

Alternative Visualization Tools
• Web-based visualization tools (deployed to the public)
• D3.js
• Chart.js
• Highcharts
• Business Intelligence
Tools
• Tableau
• MicroStrategy

Security

Why encryption?
Encryption in flight (SSL)
• Data is encrypted before sending and decrypted after receiving
• SSL certificates help with encryption (HTTPS)
• Encryption in flight protects against man-in-the-middle (MITM) attacks

[Diagram: you send credentials (U: admin, P: supersecret) to an HTTPS website (AWS); SSL encryption turns them into ciphertext (aGVsbG8gd29ybGQgZWh…) in flight, and SSL decryption restores them on the server]

Why encryption?
Server side encryption at rest
• Data is encrypted after being received by the server
• Data is decrypted before being sent
• It is stored in an encrypted form thanks to a key (usually a data key)
• The encryption / decryption keys must be managed somewhere and the server must have access to them
[Diagram: an object is sent over HTTP/S to the AWS service (ex: EBS), which encrypts it with a data key for storage and decrypts it with the same data key before sending it back over HTTP/S]

Why encryption?
Client side encryption
• Data is encrypted by the client and never decrypted by the server
• Data will be decrypted by a receiving client
• The server should not be able to decrypt the data
• Could leverage Envelope Encryption
[Diagram: the client encrypts the object with a client-side data key and places it in any store (FTP, S3, etc.); a receiving client decrypts it with the client-side data key]

S3 Encryption for Objects (Reminder)

• There are 4 methods of encrypting objects in S3


• SSE-S3: encrypts S3 objects using keys handled & managed by AWS
• SSE-KMS: leverage AWS Key Management Service to manage
encryption keys
• SSE-C: when you want to manage your own encryption keys
• Client Side Encryption

• It’s important to understand which method fits which situation for the exam

SSE-S3
• SSE-S3: encryption using keys handled & managed by AWS S3
• Object is encrypted server side
• AES-256 encryption type
• Must set header: "x-amz-server-side-encryption": "AES256"

[Diagram: the object is sent over HTTP/S with the header to AWS S3, which encrypts it with an S3 managed data key and stores it in the bucket]

SSE-KMS
• SSE-KMS: encryption using keys handled & managed by KMS
• KMS Advantages: user control + audit trail
• Object is encrypted server side
• Must set header: "x-amz-server-side-encryption": "aws:kms"

[Diagram: the object is sent over HTTP/S with the header to AWS S3, which encrypts it using the KMS Customer Master Key (CMK) and stores it in the bucket]

SSE-C
• SSE-C: server-side encryption using data keys fully managed by the customer outside of AWS
• Amazon S3 does not store the encryption key you provide
• HTTPS must be used
• Encryption key must be provided in HTTP headers, for every HTTP request made

[Diagram: the object and the client-side data key are sent to AWS S3 over HTTPS only, with the data key in the header; S3 encrypts the object with the client-provided data key and stores it in the bucket]
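The request headers for the three server-side options can be sketched in Python. The helper names are hypothetical. The `x-amz-server-side-encryption` values come from the slides; the KMS key ID header and the three `customer-*` headers for SSE-C are taken from the S3 REST API rather than from the slides.

```python
import base64
import hashlib


def sse_s3_headers():
    """SSE-S3: S3 handles and manages the keys."""
    return {"x-amz-server-side-encryption": "AES256"}


def sse_kms_headers(kms_key_id=None):
    """SSE-KMS: keys handled by KMS; omit the key ID to use the default key."""
    headers = {"x-amz-server-side-encryption": "aws:kms"}
    if kms_key_id:
        headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
    return headers


def sse_c_headers(key):
    """SSE-C: the caller supplies a 256-bit key on every request (HTTPS only)."""
    if len(key) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key":
            base64.b64encode(key).decode("ascii"),
        # MD5 digest of the key, used by S3 as an integrity check.
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(hashlib.md5(key).digest()).decode("ascii"),
    }
```

Because SSE-C sends the key itself in a header, the HTTPS-only requirement follows directly: over plain HTTP the key would travel unencrypted.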

Client Side Encryption
• Client library such as the Amazon S3 Encryption Client
• Clients must encrypt data themselves before sending to S3
• Clients must decrypt data themselves when retrieving from S3
• Customer fully manages the keys and encryption cycle
[Diagram: the client, using the S3 Encryption SDK, encrypts the object with a client-side data key and sends the encrypted object over HTTP/S to the AWS S3 bucket]

Encryption in transit (SSL)
• AWS S3 exposes:
• HTTP endpoint: non encrypted
• HTTPS endpoint: encryption in flight

• You’re free to use the endpoint you want, but HTTPS is recommended
• HTTPS is mandatory for SSE-C
• Encryption in flight is also called SSL / TLS
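Since both endpoints are exposed, a bucket owner who wants HTTPS only can say so in a bucket policy. The Python sketch below builds the commonly used deny rule on the `aws:SecureTransport` condition key; the function name and bucket name are hypothetical, and this pattern is not part of the slides.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Bucket policy that rejects any request not made over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # aws:SecureTransport is "false" for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
policy_json = json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2)
```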

AWS KMS (Key Management Service)
• Anytime you hear “encryption” for an AWS service, it’s most likely
KMS
• Easy way to control access to your data; AWS manages the keys for us
• Fully integrated with IAM for authorization
• Seamlessly integrated into:
• Amazon EBS: encrypt volumes
• Amazon S3: Server side encryption of objects
• Amazon Redshift: encryption of data
• Amazon RDS: encryption of data
• Amazon SSM: Parameter store
• Etc…
• But you can also use the CLI / SDK

AWS KMS 101
• Anytime you need to share sensitive information… use KMS
• Database passwords
• Credentials to external service
• Private Key of SSL certificates
• The value in KMS is that the CMK used to encrypt data can never be
retrieved by the user, and the CMK can be rotated for extra security
• Never ever store your secrets in plaintext, especially in your code!
• Encrypted secrets can be stored in the code / environment variables
• KMS can only help in encrypting up to 4KB of data per call
• If data > 4 KB, use envelope encryption
• To give access to KMS to someone:
• Make sure the Key Policy allows the user
• Make sure the IAM Policy allows the API calls
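The envelope encryption idea mentioned above (for data over 4 KB) can be sketched in Python. Everything here is an assumption for illustration: `_toy_cipher` is an insecure stand-in for a real cipher such as AES, and the master key is a local byte string standing in for a CMK that would never leave KMS. Only the structure matters: a fresh data key encrypts the payload, and the master key encrypts just the small data key.

```python
import hashlib
import secrets

KMS_DIRECT_LIMIT_BYTES = 4096  # per the slide: KMS encrypts up to 4 KB per call


def _toy_cipher(key, data):
    """Stand-in for a real cipher (NOT secure): XOR with a SHA-256 counter keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def envelope_encrypt(master_key, plaintext):
    """Encrypt the payload with a fresh data key, then wrap that key with the master key."""
    data_key = secrets.token_bytes(32)
    return {
        # Small enough to stay under the 4 KB limit of a direct encrypt call.
        "encrypted_data_key": _toy_cipher(master_key, data_key),
        "ciphertext": _toy_cipher(data_key, plaintext),
    }


def envelope_decrypt(master_key, envelope):
    """Unwrap the data key with the master key, then decrypt the payload."""
    data_key = _toy_cipher(master_key, envelope["encrypted_data_key"])
    return _toy_cipher(data_key, envelope["ciphertext"])


master_key = secrets.token_bytes(32)        # stands in for the CMK
payload = b"sensitive record " * 1000       # well over 4 KB
envelope = envelope_encrypt(master_key, payload)
```

The encrypted data key is stored next to the ciphertext, so only the holder of the master key can recover the payload.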

AWS KMS (Key Management Service)
• Able to fully manage the keys & policies:
• Create
• Rotation policies
• Disable
• Enable
• Able to audit key usage (using CloudTrail)
• Three types of Customer Master Keys (CMK):
• AWS Managed Service Default CMK: free
• User Keys created in KMS: $1 / month
• User Keys imported (must be 256-bit symmetric key): $1 / month
• + pay for API call to KMS ($0.03 / 10000 calls)

How does KMS work?
API – Encrypt and Decrypt
• Encrypt API: the client (CLI / SDK) sends a secret (ex: password, < 4 KB) to KMS
• KMS checks IAM permissions, performs the encryption with the CMK, and sends back the encrypted secret
• Decrypt API: the client (CLI / SDK) sends the encrypted secret to KMS
• KMS checks IAM permissions, performs the decryption with the CMK, and sends back the secret in plaintext

Encryption in AWS Services
• Requires migration (through Snapshot / Backup):
• EBS Volumes
• RDS databases
• ElastiCache
• EFS network file system

• In-place encryption:
• S3

CloudHSM
• KMS => AWS manages the software for encryption
• CloudHSM => AWS provisions encryption hardware
• Dedicated Hardware (HSM = Hardware Security Module)
• You manage your own encryption keys entirely (not AWS)
• HSM device is tamper resistant, FIPS 140-2 Level 3 compliance
• CloudHSM clusters are spread across Multi AZ (HA) – must set up
• Supports both symmetric and asymmetric encryption (SSL/TLS keys)
• No free tier available
• Must use the CloudHSM Client Software
• Redshift supports CloudHSM for database encryption and key management
• Good option to use with SSE-C encryption

CloudHSM Diagram

[Diagram: the CloudHSM Client talks to AWS CloudHSM over an SSL connection; AWS manages the hardware, the user manages the keys]

CloudHSM vs KMS
Feature | AWS KMS | AWS CloudHSM
Tenancy | Uses multi-tenant key storage | Single tenant key storage, dedicated to one customer
Keys | Keys owned and managed by AWS | Customer managed HSM
Encryption | Supports only symmetric key encryption | Supports both symmetric and asymmetric encryption
Cryptographic Acceleration | None | SSL/TLS Acceleration; Oracle TDE Acceleration
Key Storage and Management | Accessible from multiple regions; centralized management from IAM | Deployed and managed from a customer VPC; accessible and can be shared across VPCs using VPC peering
Free Tier Availability | Yes | No

Security - Kinesis
• Kinesis Data Streams
• SSL endpoints using the HTTPS protocol to do encryption in flight
• AWS KMS provides server-side encryption [Encryption at rest]
• For client side-encryption, you must use your own encryption libraries
• Supported Interface VPC Endpoints / Private Link – access privately
• KCL – must get read / write access to DynamoDB table
• Kinesis Data Firehose:
• Attach IAM roles so it can deliver to S3 / ES / Redshift / Splunk
• Can encrypt the delivery stream with KMS [Server side encryption]
• Supported Interface VPC Endpoints / Private Link – access privately
• Kinesis Data Analytics
• Attach IAM role so it can read from Kinesis Data Streams and reference
sources and write to an output destination (example Kinesis Data Firehose)

Security - SQS
• Encryption in flight using the HTTPS endpoint
• Server Side Encryption using KMS
• IAM policy must allow usage of SQS
• SQS queue access policy

• Client-side encryption must be implemented manually


• VPC Endpoint is provided through an Interface

Security – AWS IoT
• AWS IoT policies:
• Attached to X.509 certificates or Cognito Identities
• Able to revoke any device at any time
• IoT Policies are JSON documents
• Can be attached to groups instead of individual Things.

• IAM Policies:
• Attached to users, groups, or roles
• Used for controlling IoT AWS APIs

• Attach roles to Rules Engine so they can perform their actions

Security – Amazon S3
• IAM policies
• S3 bucket policies
• Access Control Lists (ACLs)
• Encryption in flight using HTTPS
• Encryption at rest
• Server-side encryption: SSE-S3, SSE-KMS, SSE-C
• Client-side encryption – such as Amazon S3 Encryption Client
• Versioning + MFA Delete
• CORS for protecting websites
• VPC Endpoint is provided through a Gateway
• Glacier – vault lock policies to prevent deletes (WORM)

Security – DynamoDB
• Data is encrypted in transit using TLS (HTTPS)
• DynamoDB can be encrypted at rest
• KMS encryption for base tables and secondary indexes
• Only for new tables
• To migrate un-encrypted table, create new table and copy the data
• Encryption cannot be disabled once enabled
• Access to tables / API / DAX using IAM
• DynamoDB Streams do not support encryption
• VPC Endpoint is provided through a Gateway

Security - RDS
• VPC provides network isolation
• Security Groups control network access to DB Instances
• KMS provides encryption at rest
• SSL provides encryption in-flight
• IAM policies provide protection for the RDS API
• IAM authentication is supported by PostgreSQL and MySQL
• Must manage user permissions within the database itself
• MSSQL Server and Oracle support TDE (Transparent Data
Encryption)

Security - Aurora
• (very similar to RDS)
• VPC provides network isolation
• Security Groups control network access to DB Instances
• KMS provides encryption at rest
• SSL provides encryption in-flight
• IAM authentication is supported by PostgreSQL and MySQL
• Must manage user permissions within the database itself

Security - Lambda
• IAM roles attached to each Lambda function
• Sources
• Targets
• KMS encryption for secrets
• SSM parameter store for configurations
• CloudWatch Logs
• Deploy in VPC to access private resources

Security - Glue
• IAM policies for the Glue service
• Configure Glue to only access JDBC through SSL
• Data Catalog:
• Encrypted by KMS
• Resource Policies to protect Data Catalog resources (similar to S3 bucket
policy)
• Connection passwords: Encrypted by KMS
• Data written by AWS Glue – Security Configurations:
• S3 encryption mode: SSE-S3 or SSE-KMS
• CloudWatch encryption mode
• Job bookmark encryption mode

Security - EMR
• Using Amazon EC2 key pair for SSH credentials
• Attach IAM roles to EC2 instances for:
• proper S3 access
• for EMRFS requests to S3
• DynamoDB scans through Hive
• EC2 Security Groups
• One for master node
• Another one for cluster node (core node or task node)
• Encrypts data at-rest: EBS encryption, Open Source HDFS Encryption, LUKS + EMRFS for S3
• In-transit encryption: node to node communication, EMRFS, TLS
• Data is encrypted before uploading to S3
• Kerberos authentication (provide authentication from Active Directory)
• Apache Ranger: Centralized Authorization (RBAC – Role Based Access) – setup on external EC2
• https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/

Security – EMR Encryption (security config)
• At-rest data encryption for EMRFS:
• Encryption in Amazon S3 (SSE-S3, SSE-KMS, Client-Side encryption)
• Encryption in Local Disks
• At-rest data encryption for local disks:
• Open-source HDFS encryption
• EC2 Instance Store encryption: NVMe encryption, or LUKS encryption
• EBS volumes:
• EBS encryption (KMS) – works with root volume
• LUKS encryption – does not work with root volume
• In-transit encryption:
• Node to node communication
• For EMRFS traffic between S3 and cluster nodes
• TLS encryption

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html
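
The at-rest and in-transit options above are grouped in a single EMR security configuration document. The sketch below assembles one as a Python dictionary, assuming the JSON layout described in the EMR Management Guide; the key ARN, the certificate location, and the helper name are hypothetical placeholders, and only the SSE-S3 mode for EMRFS is shown.

```python
import json


def build_emr_security_config(kms_key_arn, certs_s3_uri):
    """Assemble an EMR security configuration with at-rest and in-transit encryption."""
    return {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": True,
            "AtRestEncryptionConfiguration": {
                # EMRFS data in S3: server-side encryption with S3-managed keys.
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                # Local disks (instance store and EBS) encrypted with a KMS key.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                    "EnableEbsEncryption": True,
                },
            },
            "InTransitEncryptionConfiguration": {
                # TLS certificates are supplied as a zip archive of PEM files in S3.
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": certs_s3_uri,
                }
            },
        }
    }


emr_config = build_emr_security_config(
    "arn:aws:kms:us-east-1:123456789012:key/example-key-id",  # hypothetical key
    "s3://example-config-bucket/certs/my-certs.zip",  # hypothetical location
)
print(json.dumps(emr_config, indent=2))
```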

Security – ElasticSearch Service
• Amazon VPC provides network isolation
• ElasticSearch policy to manage security further
• Data security by encrypting data at-rest using KMS
• Encryption in-transit using SSL

• IAM or Cognito based authentication


• Amazon Cognito allows end-users to log in to Kibana through
enterprise identity providers such as Microsoft Active Directory
using SAML

Security - Redshift
• VPC provides network isolation
• Cluster security groups
• Encryption in flight using the JDBC driver enabled with SSL
• Encryption at rest using KMS or an HSM device (establish a
connection)
• Supports S3 SSE using default managed key
• Use IAM Roles for Redshift
• To access other AWS Resources (example S3 or KMS)
• Must be referenced in the COPY or UNLOAD command
(alternatively paste access key and secret key creds)
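
To make the "referenced in the COPY or UNLOAD command" point concrete, here is a small sketch that builds a COPY statement using the `IAM_ROLE` clause. The table name, bucket, and role ARN are hypothetical, and the helper only produces the SQL text; it does not connect to a cluster.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that authenticates with an IAM role."""
    # Basic sanity checks; real code should validate identifiers more strictly.
    if not s3_uri.startswith("s3://"):
        raise ValueError("COPY source must be an s3:// URI")
    if not iam_role_arn.startswith("arn:aws:iam::"):
        raise ValueError("expected an IAM role ARN")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV;"
    )


statement = build_copy_statement(
    "sales",  # hypothetical table
    "s3://example-bucket/sales/2019/",  # hypothetical S3 prefix
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # hypothetical role
)
print(statement)
```

The role must also be attached to the cluster; referencing it in the statement avoids embedding long-lived access keys in SQL.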

Security - Athena
• IAM policies to control access to the service
• Data is in S3: IAM policies, bucket policies & ACLs
• Encryption of data according to S3 standards: SSE-S3, SSE-KMS,
CSE-KMS
• Encryption in transit using TLS between Athena and S3 and
JDBC

• Fine grained access using the AWS Glue Catalog

Security - Quicksight
• Standard edition:
• IAM users
• Email based accounts
• Enterprise edition:
• Active Directory
• Federated Login
• Supports MFA (Multi Factor Authentication)
• Encryption at rest and in SPICE
• Row Level Security to control which users can see which rows

AWS STS – Security Token Service
• Lets you grant limited and temporary access to AWS resources.
• Token is valid for up to one hour by default (must be refreshed)
• Cross Account Access
• Allows users from one AWS account to access resources in another
• Federation (Active Directory)
• Provides a non-AWS user with temporary AWS access by linking the user's Active
Directory credentials
• Uses SAML (Security Assertion Markup Language)
• Allows Single Sign On (SSO), which enables users to log in to the AWS console without
being assigned IAM credentials
• Federation with third party providers / Cognito
• Used mainly in web and mobile applications
• Makes use of Facebook/Google/Amazon etc. to federate users

Cross Account Access
• Define an IAM Role for another account to access
• Define which accounts can access this IAM Role
• Use AWS STS (Security Token Service) to retrieve credentials
and impersonate the IAM Role you have access to
(AssumeRole API)
• Temporary credentials can be valid between 15 minutes and 1
hour by default (the role's maximum session duration can be
raised up to 12 hours)

[Diagram: the user calls the AssumeRole API on AWS STS; STS checks permissions with IAM and returns temporary security credentials for the Role (same or other account)]
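
The two halves of cross-account access can be sketched as data: a trust policy on the role that names the account allowed to assume it, and the parameters a caller would send to the STS AssumeRole API. Account IDs, role names, and helper names below are hypothetical; the duration check assumes the default one-hour maximum session.

```python
import json


def build_trust_policy(trusted_account_id):
    """Trust policy allowing principals of another account to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def assume_role_params(role_arn, session_name, duration_seconds=3600):
    """Parameters for an AssumeRole request (900 s to 3600 s by default)."""
    if not 900 <= duration_seconds <= 3600:
        raise ValueError("duration must be between 15 minutes and 1 hour")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": duration_seconds,
    }


trust_policy = build_trust_policy("111122223333")  # hypothetical trusted account
params = assume_role_params(
    "arn:aws:iam::444455556666:role/ExampleCrossAccountRole",  # hypothetical role
    "analytics-session",
    duration_seconds=900,
)
print(json.dumps(trust_policy, indent=2))
print(params)
```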

What’s Identity Federation?
• Federation lets users outside of AWS assume a temporary role
for accessing AWS resources.
• These users assume an identity-provided access role.
• Federation assumes a form of 3rd party authentication
• LDAP
• Microsoft Active Directory (~= SAML)
• Single Sign On
• Open ID
• Cognito
• Using federation, you don’t need to create IAM
users (user management is outside of AWS)

[Diagram: the user logs in to a 3rd party that AWS trusts; the 3rd party gives credentials, which the user uses to access AWS]

SAML Federation
For Enterprises
• To integrate Active Directory / ADFS with AWS (or any SAML 2.0)
• Provides access to AWS Console or CLI (through temporary
creds)
• No need to create an IAM user for each of your employees

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_saml.html
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_enable-console-saml.html

Custom Identity Broker Application
For Enterprises
• Use only if identity provider is not compatible with SAML 2.0
• The identity broker must determine the appropriate IAM
policy

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_common-scenarios_federated-users.html

AWS Cognito - Federated Identity Pools
For Public Applications
• Goal:
• Provide direct access to AWS Resources from the Client Side
• How:
• Log in to federated identity provider – or remain anonymous
• Get temporary AWS credentials back from the Federated Identity Pool
• These credentials come with a pre-defined IAM policy stating their permissions
• Example:
• provide (temporary) access to write to S3 bucket using Facebook Login
• Note:
• Web Identity Federation is an alternative to using Cognito but AWS recommends
against it

[Diagram: the App logs in to an Identity Provider (CUP, Google, Facebook, Twitter, SAML, OpenID, …) and receives a token; it authenticates to the Federated Identity Pool (FIP), which verifies the token, gets credentials, and returns temp AWS credentials; the App then calls the Amazon S3 bucket]

Policies – leveraging AWS variables
• https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html
• ${aws:username} : to restrict users to tables / buckets
• ${aws:principaltype} : account, user, federated, or assumed role
• ${aws:PrincipalTag/department} : to restrict using Tags

• https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_iam-condition-keys.html#condition-keys-wif
• ${aws:FederatedProvider} : which IdP was used for the user (Cognito,
Amazon..)
• ${www.amazon.com:user_id} , ${cognito-identity.amazonaws.com:sub} …
• ${saml:sub}, ${sts:ExternalId}
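
As an illustration of how a policy variable restricts each user to their own data, the sketch below defines a policy whose S3 resource embeds `${aws:username}` and then substitutes a value locally to show what the resource would evaluate to. The bucket name is hypothetical, and `resolve_variables` is an illustrative helper: in AWS the substitution happens inside IAM at request time, not in client code.

```python
import json

# One policy shared by all users; each user only reaches their own prefix.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-bucket/home/${aws:username}/*",
        }
    ],
}


def resolve_variables(resource, context):
    """Substitute ${key} placeholders with request-context values (illustration only)."""
    for key, value in context.items():
        resource = resource.replace("${" + key + "}", value)
    return resource


resolved = resolve_variables(
    POLICY["Statement"][0]["Resource"], {"aws:username": "alice"}
)
print(json.dumps(POLICY, indent=2))
print(resolved)
```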

Policies - Advanced
• For S3 - let’s analyze the policies at:
https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html

• For DynamoDB – let’s analyze the policies at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/specifying-conditions.html

• Note for RDS – IAM policies don’t help with in-database security, as
it’s a proprietary technology and we are responsible for users &
authorization
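
For the DynamoDB case, fine-grained access usually relies on the `dynamodb:LeadingKeys` condition key, which limits a caller to items whose partition key matches their identity. The following is a sketch of such a policy for Cognito-authenticated users; the table ARN and helper name are hypothetical.

```python
import json


def build_leading_keys_policy(table_arn):
    """Policy limiting access to items whose partition key equals the caller's Cognito ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    # Every partition key in the request must match the caller's identity.
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]
                    }
                },
            }
        ],
    }


ddb_policy = build_leading_keys_policy(
    "arn:aws:dynamodb:us-east-1:123456789012:table/ExampleUserData"  # hypothetical table
)
print(json.dumps(ddb_policy, indent=2))
```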

AWS CloudTrail
• Provides governance, compliance and audit for your AWS Account
• CloudTrail is enabled by default!
• Get a history of events / API calls made within your AWS Account
by:
• Console
• SDK
• CLI
• AWS Services
• Can put logs from CloudTrail into CloudWatch Logs
• If a resource is deleted in AWS, look into CloudTrail first!

CloudTrail continued…
• CloudTrail shows the past 90 days of activity
• The default UI only shows “Create”, “Modify” or “Delete” events
• CloudTrail Trail:
• Get a detailed list of all the events you choose
• Ability to store these events in S3 for further analysis
• Can be region specific or global
• CloudTrail Logs have SSE-S3 encryption when placed into S3
• Control access to S3 using IAM, Bucket Policy, etc…
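
Once a trail delivers log files to S3, "who deleted this resource?" becomes a filtering problem over JSON records. Here is a small sketch, assuming the standard CloudTrail log layout with a top-level `Records` array; the sample records and ARNs are invented for illustration, and real log files are gzip-compressed objects that would need to be downloaded and decompressed first.

```python
import json

# Invented sample in the shape of a CloudTrail log file.
SAMPLE_LOG = json.dumps(
    {
        "Records": [
            {
                "eventTime": "2019-06-01T12:00:00Z",
                "eventSource": "s3.amazonaws.com",
                "eventName": "DeleteBucket",
                "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
            },
            {
                "eventTime": "2019-06-01T12:05:00Z",
                "eventSource": "ec2.amazonaws.com",
                "eventName": "DescribeInstances",
                "userIdentity": {"arn": "arn:aws:iam::123456789012:user/bob"},
            },
        ]
    }
)


def find_delete_events(log_text):
    """Return (time, source, name, caller ARN) for every Delete* event in a log file."""
    records = json.loads(log_text).get("Records", [])
    return [
        (
            r["eventTime"],
            r["eventSource"],
            r["eventName"],
            r.get("userIdentity", {}).get("arn", "unknown"),
        )
        for r in records
        if r.get("eventName", "").startswith("Delete")
    ]


deletes = find_delete_events(SAMPLE_LOG)
for event in deletes:
    print(event)
```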

VPC Endpoints
• Endpoints allow you to connect to AWS Services using a private
network instead of the public www network
• They scale horizontally and are redundant
• They remove the need of IGW, NAT, etc… to access AWS Services
• Gateway: provisions a target and must be used in a route table
– ONLY S3 and DynamoDB
• Interface: provisions an ENI (private IP address) as an entry
point (must attach security group) – most AWS services.
Also called VPC PrivateLink

[Diagram: an EC2 Instance in a Private subnet of a VPC reaches Amazon Simple Queue Service over the private network through a VPC Endpoint (PrivateLink)]
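
A common companion to a gateway endpoint is an S3 bucket policy that rejects any request not arriving through that endpoint, using the `aws:SourceVpce` condition key. The sketch below builds such a policy; the bucket name, endpoint ID, and helper name are hypothetical.

```python
import json


def build_vpce_only_bucket_policy(bucket, vpce_id):
    """Bucket policy that denies all S3 actions unless they come via the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AccessViaSpecificVpceOnly",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Deny applies whenever the request's endpoint differs from vpce_id.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


bucket_policy = build_vpce_only_bucket_policy("example-bucket", "vpce-1a2b3c4d")
print(json.dumps(bucket_policy, indent=2))
```

Because the statement is an explicit Deny, it also blocks console access to the bucket from outside the VPC, which is worth keeping in mind before applying a policy like this.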

Everything Else

IoT
[Diagram: IoT Topic → IoT Rules → IoT Rules Actions → Kinesis, DynamoDB, SQS, SNS, S3, AWS Lambda + many others]

Kinesis Data Streams
[Diagram: producers (SDK, Kinesis Producer Library (KPL), Kinesis Agent) → Amazon Kinesis Streams → consumers (SDK, Kinesis Client Library (KCL), Kinesis Connector Library, Firehose, AWS Lambda)]


Kinesis Data Firehose
[Diagram: sources (SDK, Kinesis Producer Library (KPL), Kinesis Agent, Kinesis Data Streams, CloudWatch Logs & Events, IoT rules actions) → Amazon Kinesis Data Firehose, with a Lambda function → destinations (Amazon S3, Redshift, ElasticSearch)]

Kinesis Data Analytics
[Diagram: inputs (Kinesis Data Streams, Kinesis Data Firehose, Reference Data (JSON, CSV) in S3, with AWS Lambda record pre-processing) → Kinesis Data Analytics → outputs (Kinesis Data Streams, Kinesis Data Firehose, AWS Lambda Function)]

SQS
[Diagram: producers (SDK-based application on EC2, ECS, etc…, IoT Core (Rules Engine), S3 events (new files)) → SQS → consumers (SDK-based application on EC2, ECS, etc…, AWS Lambda Function)]

S3
[Diagram: S3 bucket integrations: Snowball, Snowball Edge, Firehose, Redshift, Athena, EMR, Glue, Data Pipeline, IoT Core, AWS DMS; event notifications to AWS Lambda, SQS Queue, SNS Topic]

DynamoDB
[Diagram: DynamoDB integrations: Client SDK, AWS Data Pipeline, EMR (Hive), Database Migration Service (DMS), Glue; DynamoDB Streams consumed by AWS Lambda Function and Kinesis Client Library]

Glue
[Diagram: Glue Crawlers read Sources (Amazon S3, DynamoDB, JDBC (ex: RDS)) and populate the Glue Data Catalog, which is used by Redshift Spectrum, Athena, and EMR + Hive]

EMR
[Diagram: EMR integrations: Glue Data Catalog, Amazon S3 / EMRFS, DynamoDB, Apache Ranger on EC2]

Amazon Machine Learning (ML) (deprecated)
[Diagram: Amazon S3 and Redshift → Amazon ML → Predictions API]

Amazon SageMaker
[Diagram: Amazon SageMaker integrations: Amazon S3, Notebook, Deploy model]

AWS Data Pipeline
[Diagram: Data Pipeline integrations: Amazon S3, JDBC (ex: RDS), EMR / Hive, DynamoDB]

ElasticSearch Service
[Diagram: AWS ElasticSearch Service integrations: Kinesis Data Firehose, IoT Core, CloudWatch Logs; IAM, Cognito]

Athena
[Diagram: Amazon S3 and Glue Data Catalog → Athena → Quicksight, Amazon S3]

Redshift
[Diagram: Redshift integrations: Quicksight, Amazon S3 (COPY / LOAD / UNLOAD), Redshift Spectrum, PostgreSQL (DBLINK)]

Quicksight
[Diagram: data sources (RDS / JDBC, Redshift, Athena, Amazon S3) → Quicksight]

AWS Instance Types
• General Purpose: T2, T3, M4, M5
• Compute Optimized: C4, C5
• Batch processing, Distributed analytics, Machine / Deep Learning Inference
• Memory Optimized: R4, R5, X1, Z1d
• High performance database, In memory database, Real time big data
analytics
• Accelerated Computing: P2, P3, G3, F1
• GPU instances, Machine or Deep Learning, High Performance Computing
• Storage Optimized: H1, I3, D2
• Distributed File System (HDFS), NFS, Map Reduce, Apache Kafka, Redshift

EC2 in Big Data
• On demand, Spot & Reserved instances:
• Spot: can tolerate loss, low cost => checkpointing feature (ML, etc)
• Reserved: long running clusters, databases (over a year)
• On demand: remaining workloads
• Auto Scaling:
• Leverage for EMR, etc
• Automated for DynamoDB, Auto Scaling Groups, etc…
• EC2 is behind EMR
• Master Nodes
• Core Nodes (contain data) + Task Nodes (do not contain data)

Preparing for the exam

Exam Tips
The strategic aspect…

Take your time reading the question
• Look for key words about requirements

A large news website needs to produce personalized
recommendations for articles to its readers, by training a
machine learning model on a daily basis using historical
click data. The influx of this data is fairly constant, except
during major elections when traffic to the site spikes
considerably.

Which system would provide the most cost-effective and
reliable solution?

Pace yourself
• You have 170 minutes and about 65
questions
• That’s about 2 ½ minutes per question!
• Try not to get stressed out… that’s enough
time to read and understand each question.

Flag questions for later review
• If you’re stumped on something, don’t
spend too much time on it
• Select your best guess, and mark it for
review
• Then use any time you have at the end
to go back and reconsider
• Flag questions you’re not totally sure
about, too.

Arrive prepared
• Get a good night’s sleep
• Do whatever you need to do to stay
alert – this test requires stamina
• Go to the bathroom before arriving
• Arrive early – the exam location may
be hard to find

Additional prep resources
• AWS Big Data White Paper (“Big Data Analytics Options on
AWS”)
• AWS’s free online prep course
• White papers on Kinesis, Database Migration Service, Migrating
Applications to AWS
• Exam overview from AWS
• This shouldn’t be your first certification exam
• Take our practice exam

State of learning checkpoint
• Let’s look at how far we’ve come on our learning journey

• https://aws.amazon.com/certification/certified-big-data-specialty/

How will the exam work?
• You’ll have to register online at https://www.aws.training/
• Fee for the exam is 300 USD
• Provide two identity documents (ID, Credit Card, details are in emails sent to you)
• No notes are allowed, no pen is allowed, no speaking
• ~65 questions will be asked in 170 minutes
• At the end you can optionally review all the questions / answers

• You will know right away if you passed / failed the exam
• You will not know which answers were right / wrong
• You will know the overall score a few days later (email notification)
• The pass score is not provided, but some people passed with 60%
• If you fail, you can retake the exam 14 days later
